---
license: apache-2.0
tags:
  - language-model
  - pretraining
  - transformer
  - agillm
library_name: pytorch
---

# AGILLM-3-Large v2

Continuation of [AGILLM-3-Large](https://huggingface.co/OpenTransformer/AGILLM-3-large) training, restarted from a known-good checkpoint after a critical tokenizer bug was discovered.

**Demo Space:** [OpenTransformer/AGILLM-3-large-v2-demo](https://huggingface.co/spaces/OpenTransformer/AGILLM-3-large-v2-demo)

## Current recommended checkpoint (2026-05-22)

**Use [sat_current_turn_v18_20260522/final.pt](sat_current_turn_v18_20260522).** This is the current canonical AGILLM-3 v2 checkpoint: the first one where AR mode and both SAT modes (fixed and variable stride) answer 1+1=2, 2+2=4, and 47+28=75 in the smoke test.

**[sft_chat_1024_220k_20260520/final.pt](sft_chat_1024_220k_20260520/final.pt) was the canonical chat checkpoint until 2026-05-22.** It answers the same arithmetic prompts in AR mode only; SAT mode is collapsed on that checkpoint. It is kept as a record.

**Do not use [sft_sat_repair_v3_20260521/final.pt](sft_sat_repair_v3_20260521/final.pt) except as a failed-experiment archive.** That run regressed AR arithmetic and did not repair SAT.

## What happened to v1?

A `transformers` library update (to v5.3.0) on 2026-03-11 silently broke the DeepSeek-V3.2 tokenizer's encode/decode pipeline:

- **Root cause:** The tokenizer's `Metaspace` pre-tokenizer was configured to use `▁` (U+2581, SentencePiece convention) for space replacement, but the BPE vocabulary uses `Ġ` (U+0120, GPT-2 convention). This mismatch caused:
  - **Encoding:** All spaces were silently dropped. `"Water boils"` encoded to `['Water', 'bo', 'ils']` instead of `['ĠWater', 'Ġboils']`
  - **Decoding:** `tok.decode()` lost all spaces. Round-trip `encode→decode` of `"The meaning of life"` returned `"Themeaningoflife"`
  - **Training data corruption:** ~3 billion tokens of training data were fed to the model without any space information, causing the model weights to degrade

- **Detection:** Model output went from coherent English (step 12,373,125) to space-less garbled text (step 12,528,061+). Tracing the regression back to the tokenizer library update took a separate investigation.

- **Fix:** Pinned `transformers==4.48.0` (+ `tokenizers==0.21.4`), which correctly handles the `Ġ` space prefix. Also added a runtime fix in `n.py` that patches the `▁→Ġ` mismatch if detected.
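The failure described above is a round-trip failure, so it can be caught before any training data is encoded. The following is a minimal sketch of such a check, not the health check actually used in `n.py`. The function name `find_roundtrip_failures` and the toy character-level "tokenizers" are hypothetical and exist only to make the example self-contained.

```python
from typing import Callable, Iterable, List, Tuple


def find_roundtrip_failures(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    samples: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (original, restored) pairs whose encode->decode round trip changed the text."""
    failures = []
    for text in samples:
        restored = decode(encode(text))
        if restored != text:
            failures.append((text, restored))
    return failures


# Toy stand-ins for a tokenizer (hypothetical, for illustration only).
def char_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def spaceless_encode(text: str) -> List[int]:
    # Mimics the failure mode described above: spaces never reach the ids.
    return [ord(ch) for ch in text if ch != " "]


def char_decode(ids: List[int]) -> str:
    return "".join(chr(i) for i in ids)


samples = ["The meaning of life", "Water boils"]
print(find_roundtrip_failures(char_encode, char_decode, samples))       # healthy path
print(find_roundtrip_failures(spaceless_encode, char_decode, samples))  # broken path
```

With a Hugging Face tokenizer object `tok`, the same function could be called with `lambda s: tok.encode(s, add_special_tokens=False)` and `tok.decode`. Exact string equality may be too strict for tokenizers that normalize text, so the sample strings should be ones that are expected to survive unchanged.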

## This repo

Resumes training from **step 12,373,125** (~10.83B tokens, 30.9%) — the last checkpoint with correctly-encoded training data.

### Model

| Parameter | Value |
|-----------|-------|
| Parameters | 698M |
| Hidden dim | 1024 |
| Layers | 24 |
| Heads | 16 |
| Rank | 128 |
| Expansion ratio | 2.0x |
| Vocab | 128,815 (DeepSeek-V3.2 tokenizer) |
| Architecture | Joint AR + SAT (autoregressive + span-aware transformer) |
| Training target | 35B tokens |
| Tokens seen (at restart) | ~10.83B (30.9%) |

### Training setup

- **GPU:** RTX 4090 (Vast.ai, ~$0.27/hr)
- **Speed:** ~20,000 tok/s
- **Block size:** 1122
- **Batch size:** 1
- **Mixed precision:** AMP (BF16)
- **Optimizer:** AdamW (LR core=5e-5, LR head=2e-4)
- **Data:** Streamed from multiple HuggingFace datasets (web crawl, cleaned text)

### Important: Tokenizer compatibility

This model requires `transformers<=4.48.0` for correct tokenizer behavior:

```bash
pip install transformers==4.48.0 tokenizers==0.21.4
```

Or use the runtime fix in `n.py` which auto-patches the `▁/Ġ` mismatch.
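
Because the breakage came from a silent library upgrade, it may be worth failing fast when the installed versions differ from the pins. This is a sketch under that assumption; the helper names `installed_versions` and `pin_mismatches` are hypothetical and are not part of `n.py`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional, Tuple

# Pins taken from the install command above.
PINS = {"transformers": "4.48.0", "tokenizers": "0.21.4"}


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Look up installed distribution versions; None means not installed."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def pin_mismatches(
    installed: Dict[str, Optional[str]],
    pins: Dict[str, str] = PINS,
) -> Dict[str, Tuple[Optional[str], str]]:
    """Map each mismatching package to (installed_version, pinned_version)."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


if __name__ == "__main__":
    problems = pin_mismatches(installed_versions(PINS))
    if problems:
        raise SystemExit(f"Tokenizer stack does not match the pins: {problems}")
    print("Tokenizer stack matches the pinned versions.")
```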

## Sample output (step 12,373,125)

**Prompt:** "Water boils at one hundred degrees"

> Water boils at one hundred degrees in the 1990s, and a year after that. "It's not just that, but it makes you think." He said: "I don't think I'm going to make a deal for the rest of my life."

## Latest smoke-test output (step 30,028,112)

Generated from `pretrain_delta_step30028112.pt` on 2026-05-16 after the tokenizer health check passed with `transformers==4.48.0` and `tokenizers==0.21.4`.

- **Tokens seen at checkpoint:** saved between 33,185,096,055 and 33,186,096,759 / 35,000,000,000 tokens (~94.8%).
- **Live ETA snapshot:** as of 2026-05-16 04:00:55 UTC, training was at 33,217,118,583 / 35,000,000,000 tokens (94.906053%) at 5,221 tok/s, with 1,782,881,417 tokens remaining. Pretraining was projected to finish at 2026-05-20 02:52:17 UTC (03:52:17 UK) if speed holds.
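
The ETA in the snapshot follows from the remaining token count divided by the observed throughput. The sketch below reproduces that arithmetic from the figures quoted above; the function name `project_finish` is hypothetical, and the projection assumes constant speed.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def project_finish(
    tokens_seen: int, target: int, tok_per_s: float, now: datetime
) -> Tuple[int, datetime]:
    """Return (tokens remaining, projected finish time) at constant throughput."""
    remaining = target - tokens_seen
    return remaining, now + timedelta(seconds=remaining / tok_per_s)


snapshot = datetime(2026, 5, 16, 4, 0, 55, tzinfo=timezone.utc)
remaining, finish = project_finish(33_217_118_583, 35_000_000_000, 5_221, snapshot)

print(remaining)                                    # 1782881417
print(finish.replace(microsecond=0).isoformat())    # 2026-05-20T02:52:17+00:00
```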

**Prompt:** "Water boils at one hundred degrees"

> Water boils at one hundred degrees, and there is a challenge to the Doctor to receive the two-week term, before finally waiting for their upcoming term. The case comes as the justices are poised against it for an exemption because there is not what a Mississippi law that bars most

## Post-SFT chat smoke test (sft_chat/final.pt, 2026-05-20)

After pretraining hit 35B tokens and saved `pretrain_final.pt`, a chat SFT pass was run on top, producing `sft_chat/final.pt` (step 32,440,025; "All Training Complete" at 2026-05-20 04:55 UTC). It is now uploaded to this repo under `sft_chat/`. Raw smoke-test log: [`inference_results/sft_chat_final_smoke_20260520T084912Z.txt`](inference_results/sft_chat_final_smoke_20260520T084912Z.txt).

**Result, honestly:** the `User:` / `Assistant:` turn template is obeyed across AR, SAT-fixed, and SAT-variable sampling paths, so the SFT fine-tune is reaching the model. Content quality is not yet usable in any of the three.

Decode speeds on RTX 4090, ~698M params:

| Mode | Flags | tok/s |
|------|-------|-------|
| AR | `--mode ar` | ~51.6 |
| SAT fixed-stride | `--mode sat --no-var` | ~81.7 |
| SAT variable-stride | `--mode sat --var` | ~80.7 |

### AR — chat prompt
```
User: In one short paragraph, say what you are and answer: what is 2+2?
Assistant: ```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def task_func(df):
    if not isinstance(df, columns):
        raise ValueError("The function must be a positive integer.")
    ...
```

The model immediately drops into BigCodeBench-style "code task" boilerplate instead of answering, which suggests the SFT mix was too code/task-heavy.

### SAT fixed-stride — same chat prompt
```
User: In one short paragraph, say what you are and answer: what is 2+2?
Assistant: that -a In to In-:.k a18 the) In in : Ininx The (43 as of:0 In1012 ...
```

Decodes as token-salad.

### SAT variable-stride — same chat prompt
```
User: In one short paragraph, say what you are and answer: what is 2+2?
Assistant: -v" -) the.k8410kl inx have :9 a1 of$0in J [art to?ert200 only V612 ...
```

Also token-salad. The variable-stride path currently needs `--no_inference_mode` to run, because `alibi_plus_mask` reads `_version` on an inference tensor. According to the flag's docstring, the math is unchanged and the flag only trades performance against version tracking.

### AR — arithmetic smoke
```
2+2=0
    return $($($($($($($($($$))).join('')).join('')).join('')).join('')).join('')).join('')).
```

Wrong answer plus PowerShell-flavoured nonsense.

### Read of this

- Inference plumbing, tokenizer health check, turn template, and checkpoint load are all fine across AR / SAT-fixed / SAT-variable.
- The chat capability itself is not there yet at this SFT pass — needs a cleaner, more balanced instruct/chat mix (less code-task data, more direct-answer dialogue).
- Next step is a continuation SFT from `sft_chat/final.pt` into `sft_chat_v2/` on an UltraChat + SlimOrca + OpenHermes + UltraFeedback-chosen mix.

## Long-SFT chat run — `sft_chat_1024_220k_20260520/final.pt` (2026-05-21)

A second SFT pass on top of `pretrain_final.pt` ran for ~220k steps on a cleaner chat/math mix at block 1024. It completed at step **32,820,025** on **2026-05-21 04:09 UTC** and was uploaded the same day.

**This is the first chat checkpoint from the v2 lineage that actually answers arithmetic correctly in AR mode.**

| File | Size | SHA256 |
|------|------|--------|
| `sft_chat_1024_220k_20260520/final.pt` | 8,385,662,500 B | `01bc728b0e03ef0d2f2661162d5b65f6aac75251c305a20cb4fde0de5388a7de` |

Plus 8 intermediate delta checkpoints (`sft_step32643775.pt` → `sft_step32796883.pt`) and 3 raw smoke logs under [`sft_chat_1024_220k_20260520/inference/`](sft_chat_1024_220k_20260520/inference).
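
Since the checkpoint is several gigabytes, it may be worth confirming the download against the size and SHA256 in the table before loading it. The following is a minimal sketch; the helper names `sha256_of_file` and `matches_published` are hypothetical, and the local path in the commented usage is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so it never has to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(
    path: Union[str, Path], expected_size: int, expected_sha256: str
) -> bool:
    """Check the cheap size comparison first, then the full hash."""
    p = Path(path)
    if p.stat().st_size != expected_size:
        return False
    return sha256_of_file(p) == expected_sha256.lower()


# Example with the values from the table (the local path is an assumption):
# ok = matches_published(
#     "sft_chat_1024_220k_20260520/final.pt",
#     8_385_662_500,
#     "01bc728b0e03ef0d2f2661162d5b65f6aac75251c305a20cb4fde0de5388a7de",
# )
```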

### AR-mode smoke results (rerun, 2026-05-21 04:27 UTC)

| Prompt | Completion | Verdict |
|---|---|---|
| `User: What is 1+1?\nAssistant:` | ` 2` | ✓ |
| `User: What is 2+2?\nAssistant:` | ` 4` | ✓ |
| `User: What is 47 + 28?\nAssistant:` | ` 75` | ✓ |
| `User: In one short paragraph, say what you are and answer: what is 2+2?` | `I am a small experimental language model. 1) The distance between the two points (x-3)/6=<<0/5*7=21>>20 kmph...` | self-identifies, then drifts into a math-template |
| `User: Hello, can you chat normally for one sentence?` | `Sure! Here's a simple example of how to use the word "hello" in your poem. The first line is 'I'm not good enough'...` | coherent English, drifts after one sentence |

Decode speed in AR mode: 47–50 tok/s on RTX 4090 at 698M params. Direct numerical answers come back in 3 tokens (`[0.26s | 3 tokens | 11.4 tok/s]`).

### SAT-mode smoke results — still collapsed

Both fixed-stride (`--mode sat --no-var`) and variable-stride (`--mode sat --var`) produce the same stopword-cloud output regardless of prompt:
```
.
 and, value
 the step of to
```
Decoding runs at ~28–30 tok/s and stops after 8–10 tokens. The SAT head wasn't repaired by this SFT pass, so AR is the only usable inference path on this checkpoint.

### Read of this

- **Use:** `sft_chat_1024_220k_20260520/final.pt` in AR mode for direct-answer prompts. Limit `max_new` to keep it from drifting into the math-word-problem template it was trained on.
- **Skip:** `sft_math_v1/final.pt` — the dedicated math SFT actually came out worse (`1+1=5`, `2+2=5`). The long chat SFT subsumed it.
- **Don't bother with SAT mode** on this checkpoint until a future SFT pass repairs the SAT head.

## v3 SAT-repair SFT — **failed experiment, do not use** (2026-05-21)

The v3 SAT-repair pass at [`sft_sat_repair_v3_20260521/`](sft_sat_repair_v3_20260521) **regressed the model** vs its `sft_chat_1024_220k_20260520/final.pt` warm-start and is archived here only as a negative result.

### Smoke verdict (post-training, 2026-05-21 22:52-22:54 UTC)

| Prompt | 220k base (working) | v3 (regressed) |
|---|---|---|
| `User: What is 1+1?\nAssistant:` | ` 2` ✓ | ` The answer to the question "What's the deal with the movie?" is: 2*3.` ✗ |
| `User: What is 2+2?\nAssistant:` | ` 4` ✓ | ` The answer to the question "statement 1": A man standing in front of a square, then one side is placed on his shoulder...` ✗ |
| `User: What is 47 + 28?\nAssistant:` | ` 75` ✓ | ` Let's denote the number of ways to arrange a square in one place... (A + B) = 47 - 28 = 0` ✗ |
| Short-chat prompt | "Sure! Here's a simple example..." (coherent intro, drifts) | "Here's a short story about a person who has a secret gift..." (immediate novel-drift) |
| SAT-fixed 2+2 | stopword salad ✗ | identical stopword salad ✗ |
| SAT-var 2+2 | stopword salad ✗ | **byte-for-byte identical** to sat-fixed (gate not engaging) ✗ |

### What went wrong

1. **AR regressed.** The 220k could answer simple arithmetic in 3 tokens. v3 lost that — the new mix (heavy on OpenMathInstruct-2 + OpenR1-Math-220k + Magpie) overfit the model to "frame any number question as a multi-step word problem," destroying the direct-answer behavior.
2. **SAT did not recover.** Same stopword-collapse pattern. The unilaterally-applied SAT-target patch in `nB300.py` did not fix the head; it trained it to a different broken equilibrium. The fact that `--mode sat --var` produces byte-identical output to `--mode sat --no-var` means the SAT gate isn't engaging — a regression vs the 220k state.
3. **The loss spike was permanent damage, not re-equilibration.** Loss trajectory: `3.014 → 1.231 (12%)` → spike to `8.181 (16%)` → bouncing between 5 and 9 → final **5.557**. The final loss is higher than the starting loss, so the run made the model worse on its own objective rather than settling into a new equilibrium.
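
The verdict in item 3 comes down to comparing the final loss with the starting and best losses. A small sketch of that comparison, using only the four loss values quoted above, is shown here; the function name `summarize_run` is hypothetical and is not part of the trainer.

```python
from typing import Dict, List, Union


def summarize_run(losses: List[float]) -> Dict[str, Union[float, bool]]:
    """Summarize a loss trajectory and flag runs that end worse than they began."""
    start, final = losses[0], losses[-1]
    return {
        "start": start,
        "best": min(losses),
        "peak": max(losses),
        "final": final,
        "regressed": final > start,  # ended above the warm-start loss
    }


# Loss values quoted in the list above: start, pre-spike low, spike, final.
v3_losses = [3.014, 1.231, 8.181, 5.557]
summary = summarize_run(v3_losses)
print(summary["regressed"])  # True
```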

### Recommendation

**Use [`sft_chat_1024_220k_20260520/final.pt`](sft_chat_1024_220k_20260520) as the canonical chat checkpoint.** It still answers `1+1=2`, `2+2=4`, `47+28=75` in AR mode and respects EOS. SAT mode remains broken on both checkpoints and needs a separate inference/objective investigation, not another SFT pass with the same broken patch.

The full v3 final.pt + 8 intermediate deltas + train/upload logs + the trainer code snapshot ([`code/nB300_20260521T192736Z.py`](sft_sat_repair_v3_20260521/code)) are preserved in `sft_sat_repair_v3_20260521/` as a record of the failed experiment.

## RECOMMENDED CHECKPOINT — `sat_current_turn_v18_20260522` (2026-05-22)

**First AGILLM-3 checkpoint where AR mode AND both SAT modes (fixed + variable stride) produce correct arithmetic AND coherent chat in the same model.** Promoted to canonical on 2026-05-22.

- Path: [`sat_current_turn_v18_20260522/final.pt`](sat_current_turn_v18_20260522)
- Size: 4,923,609,336 bytes
- SHA256: `50a4dedb6bd768fbfa99815e2ab2af5ef0b7debcfad08f548236b19099e514c6`
- Lineage: head-only current-turn repair, warm-start from `sat_chat_math_v13_20260522`, core frozen, ~20 min on a single RTX 4090

### Smoke results

| Prompt | AR | SAT-fixed | SAT-var |
|---|---|---|---|
| `What is 1+1?` | ` 2` ✓ | ` 2` ✓ | ` 2` ✓ |
| `What is 2+2?` | ` 4` ✓ | **` 4` ✓** | **` 4` ✓** |
| `What is 47+28?` | ` 75` ✓ | ` 75` ✓ | ` 75` ✓ |
| Short chat | "Yes, I can chat normally in short natural text." | "Yes, I am working." | "Yes, I am working." |
| Prev-turn trap | "...The result is 4+1." (mild AR drift) | "...The answer is 4." ✓ | "...The answer is 4." ✓ |

### Architectural significance

First checkpoint that vindicates the dual-head AR+SAT thesis. Lineage:
- `sft_chat_1024_220k_20260520`: AR worked, SAT collapsed to stopword salad
- v3 → v9 → v17: progressive partial fixes (gate fix, head reset, MLP head, AR distillation, per-position gate)
- **v18: AR and both SAT modes hit the bar simultaneously.**

### Demoted

`sft_chat_1024_220k_20260520/final.pt` is no longer the recommended canonical checkpoint. It is preserved as a record of the AR-only-working era.

## License

Apache 2.0