---
license: apache-2.0
tags:
- language-model
- pretraining
- transformer
- agillm
library_name: pytorch
---

# AGILLM-3-Large v2

Continuation of [AGILLM-3-Large](https://huggingface.co/OpenTransformer/AGILLM-3-large) training, restarted from a known-good checkpoint after a critical tokenizer bug was discovered.

**Demo Space:** [OpenTransformer/AGILLM-3-large-v2-demo](https://huggingface.co/spaces/OpenTransformer/AGILLM-3-large-v2-demo)

## Recommended checkpoint as of 2026-05-21 (superseded)

**Superseded on 2026-05-22** by `sat_current_turn_v18_20260522/final.pt`; see the "RECOMMENDED CHECKPOINT" section near the end of this card. The guidance below is kept as it stood on 2026-05-21.

**Use [sft_chat_1024_220k_20260520/final.pt](sft_chat_1024_220k_20260520/final.pt) in AR mode.** As of 2026-05-21 this was the canonical AGILLM-3 v2 chat checkpoint: it answers 1+1=2, 2+2=4, and 47+28=75 in the smoke rerun.

**Do not use [sft_sat_repair_v3_20260521/final.pt](sft_sat_repair_v3_20260521/final.pt) except as a failed-experiment archive.** That run regressed AR arithmetic and did not repair SAT. On these two checkpoints, SAT mode remains experimental/broken pending a separate objective/inference fix.

## What happened to v1?

A `transformers` library update (to v5.3.0) on 2026-03-11 silently broke the DeepSeek-V3.2 tokenizer's encode/decode pipeline:

- **Root cause:** The tokenizer's `Metaspace` pre-tokenizer was configured to use `▁` (U+2581, SentencePiece convention) for space replacement, but the BPE vocabulary uses `Ġ` (U+0120, GPT-2 convention). This mismatch caused:
  - **Encoding:** All spaces were silently dropped. `"Water boils"` encoded to `['Water', 'bo', 'ils']` instead of `['ĠWater', 'Ġboils']`.
  - **Decoding:** `tok.decode()` lost all spaces. A round-trip `encode→decode` of `"The meaning of life"` returned `"Themeaningoflife"`.
  - **Training data corruption:** ~3 billion tokens of training data were fed to the model without any space information, causing the model weights to degrade.
- **Detection:** Model output went from coherent English (step 12,373,125) to space-less garbled text (step 12,528,061+). It took an investigation to trace the problem back to the tokenizer library update.
- **Fix:** Pinned `transformers==4.48.0` (with `tokenizers==0.21.4`), which correctly handles the `Ġ` space prefix. Also added a runtime fix in `n.py` that patches the `▁→Ġ` mismatch if detected.

## This repo

Resumes training from **step 12,373,125** (~10.83B tokens, 30.9%), the last checkpoint with correctly encoded training data.

### Model

| Parameter | Value |
|-----------|-------|
| Parameters | 698M |
| Hidden dim | 1024 |
| Layers | 24 |
| Heads | 16 |
| Rank | 128 |
| Expansion ratio | 2.0x |
| Vocab | 128,815 (DeepSeek-V3.2 tokenizer) |
| Architecture | Joint AR + SAT (autoregressive + span-aware transformer) |
| Training target | 35B tokens |
| Tokens seen (at restart) | ~10.83B (30.9%) |

### Training setup

- **GPU:** RTX 4090 (Vast.ai, ~$0.27/hr)
- **Speed:** ~20,000 tok/s
- **Block size:** 1122
- **Batch size:** 1
- **Mixed precision:** AMP (BF16)
- **Optimizer:** AdamW (LR core=5e-5, LR head=2e-4)
- **Data:** Streamed from multiple HuggingFace datasets (web crawl, cleaned text)

### Important: Tokenizer compatibility

This model requires `transformers<=4.48.0` for correct tokenizer behavior:

```bash
pip install transformers==4.48.0 tokenizers==0.21.4
```

Or use the runtime fix in `n.py`, which auto-patches the `▁/Ġ` mismatch.

## Sample output (step 12,373,125)

**Prompt:** "Water boils at one hundred degrees"

> Water boils at one hundred degrees in the 1990s, and a year after that. "It's not just that, but it makes you think." He said: "I don't think I'm going to make a deal for the rest of my life."

## Latest smoke-test output (step 30,028,112)

Generated from `pretrain_delta_step30028112.pt` on 2026-05-16 after the tokenizer health check passed with `transformers==4.48.0` and `tokenizers==0.21.4`.

- **Tokens seen at checkpoint:** saved between 33,185,096,055 and 33,186,096,759 / 35,000,000,000 tokens (~94.8%).
- **Live ETA snapshot:** as of 2026-05-16 04:00:55 UTC, training was at 33,217,118,583 / 35,000,000,000 tokens (94.906053%) at 5,221 tok/s, with 1,782,881,417 tokens remaining. Pretraining was projected to finish at 2026-05-20 02:52:17 UTC (03:52:17 UK) if that speed held.

**Prompt:** "Water boils at one hundred degrees"

> Water boils at one hundred degrees, and there is a challenge to the Doctor to receive the two-week term, before finally waiting for their upcoming term. The case comes as the justices are poised against it for an exemption because there is not what a Mississippi law that bars most

## Post-SFT chat smoke test (sft_chat/final.pt, 2026-05-20)

After pretraining hit 35B tokens and saved `pretrain_final.pt`, a chat SFT pass was run on top, producing `sft_chat/final.pt` (step 32,440,025; "All Training Complete" at 2026-05-20 04:55 UTC). It is now uploaded to this repo under `sft_chat/`. Raw smoke-test log: [`inference_results/sft_chat_final_smoke_20260520T084912Z.txt`](inference_results/sft_chat_final_smoke_20260520T084912Z.txt).

**Result, honestly:** the `User:` / `Assistant:` turn template is obeyed across the AR, SAT-fixed, and SAT-variable sampling paths, so the SFT fine-tune is reaching the model. Content quality is not yet usable in any of the three.

Decode speeds on RTX 4090, ~698M params:

| Mode | Flags | tok/s |
|------|-------|-------|
| AR | `--mode ar` | ~51.6 |
| SAT fixed-stride | `--mode sat --no-var` | ~81.7 |
| SAT variable-stride | `--mode sat --var` | ~80.7 |

### AR: chat prompt

````
User: In one short paragraph, say what you are and answer: what is 2+2?
Assistant: ```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def task_func(df):
    if not isinstance(df, columns):
        raise ValueError("The function must be a positive integer.")
    ...
````

The model immediately drops into BigCodeBench-style "code task" boilerplate instead of answering. This suggests the SFT mix was too code/task-heavy.
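The failure above, a direct question answered with a fenced code task, can be screened for mechanically when smoke prompts are rerun. The following is a minimal sketch, not this repo's tooling: `build_chat_prompt` and `looks_like_code_boilerplate` are hypothetical helper names, and the newline between the `User:` and `Assistant:` turns is an assumption about the template.

```python
import re

# Hypothetical helpers for illustration only; not part of the AGILLM scripts.


def build_chat_prompt(user_message: str) -> str:
    """Format one turn in the User:/Assistant: template.

    The newline separator between turns is an assumption.
    """
    return f"User: {user_message.strip()}\nAssistant:"


# A completion that opens with a code fence, an import statement, or a
# function definition is treated as code boilerplate, not a direct answer.
_CODE_START = re.compile(r"^\s*(```|import\s|from\s+[\w.]+\s+import\s|def\s)")


def looks_like_code_boilerplate(completion: str) -> bool:
    """Return True if the completion starts like a code task."""
    return bool(_CODE_START.match(completion))


if __name__ == "__main__":
    print(build_chat_prompt("What is 2+2?"))
    print(looks_like_code_boilerplate("```python\nimport pandas as pd"))
    print(looks_like_code_boilerplate(" 4"))
```

This heuristic only flags the specific code-boilerplate failure; it says nothing about whether a prose completion is actually correct.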
### SAT fixed-stride: same chat prompt

```
User: In one short paragraph, say what you are and answer: what is 2+2?
Assistant: that -a In to In-:.k a18 the) In in : Ininx The (43 as of:0 In1012 ...
```

Decodes as token salad.

### SAT variable-stride: same chat prompt

```
User: In one short paragraph, say what you are and answer: what is 2+2?
Assistant: -v" -) the.k8410kl inx have :9 a1 of$0in J [art to?ert200 only V612 ...
```

Also token salad. The variable-stride path currently needs `--no_inference_mode` to run, because `alibi_plus_mask` reads `_version` on an inference tensor. The flag's docstring says the math is unchanged; it is only a performance/version-tracking tradeoff.

### AR: arithmetic smoke

```
2+2=0 return $($($($($($($($($$))).join('')).join('')).join('')).join('')).join('')).join('')).
```

Wrong answer plus PowerShell-flavoured nonsense.

### Read of this

- Inference plumbing, tokenizer health check, turn template, and checkpoint load are all fine across AR / SAT-fixed / SAT-variable.
- The chat capability itself is not there yet at this SFT pass. It needs a cleaner, more balanced instruct/chat mix (less code-task data, more direct-answer dialogue).
- Next step is a continuation SFT from `sft_chat/final.pt` into `sft_chat_v2/` on an UltraChat + SlimOrca + OpenHermes + UltraFeedback-chosen mix.

## Long-SFT chat run: `sft_chat_1024_220k_20260520/final.pt` (2026-05-21)

A second SFT pass on top of `pretrain_final.pt` ran for ~220k steps on a cleaner chat/math mix at block size 1024. It completed at step **32,820,025** on **2026-05-21 04:09 UTC** and was uploaded the same day.
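Because the checkpoint is several gigabytes, a download can be checked against the size and SHA256 published in the table below before it is loaded. This is a minimal standard-library sketch; `sha256_of_file` and `verify_checkpoint` are hypothetical helper names, not something shipped in this repo.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large checkpoints never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected_sha256, expected_size=None):
    """Return True if the file matches the published size (if given) and hash."""
    path = Path(path)
    # The size check is cheap and catches truncated downloads before hashing.
    if expected_size is not None and path.stat().st_size != expected_size:
        return False
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Called with the local path to `final.pt`, the SHA256 string from the table, and the byte size (8,385,662,500 for this checkpoint), `verify_checkpoint` returns `False` on any mismatch.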
**This is the first chat checkpoint from the v2 lineage that actually answers arithmetic correctly in AR mode.**

| File | Size | SHA256 |
|------|------|--------|
| `sft_chat_1024_220k_20260520/final.pt` | 8,385,662,500 B | `01bc728b0e03ef0d2f2661162d5b65f6aac75251c305a20cb4fde0de5388a7de` |

Plus 8 intermediate delta checkpoints (`sft_step32643775.pt` → `sft_step32796883.pt`) and 3 raw smoke logs under [`sft_chat_1024_220k_20260520/inference/`](sft_chat_1024_220k_20260520/inference).

### AR-mode smoke results (rerun, 2026-05-21 04:27 UTC)

| Prompt | Completion | Verdict |
|---|---|---|
| `User: What is 1+1? Assistant:` | ` 2` | ✓ |
| `User: What is 2+2? Assistant:` | ` 4` | ✓ |
| `User: What is 47 + 28? Assistant:` | ` 75` | ✓ |
| `User: In one short paragraph, say what you are and answer: what is 2+2?` | `I am a small experimental language model. 1) The distance between the two points (x-3)/6=<<0/5*7=21>>20 kmph...` | self-identifies, then drifts into a math-template |
| `User: Hello, can you chat normally for one sentence?` | `Sure! Here's a simple example of how to use the word "hello" in your poem. The first line is 'I'm not good enough'...` | coherent English, drifts after one sentence |

Decode speed in AR mode: 47–50 tok/s on RTX 4090 at 698M params. Direct numerical answers come back in 3 tokens (`[0.26s | 3 tokens | 11.4 tok/s]`).

### SAT-mode smoke results: still collapsed

Both fixed-stride (`--mode sat --no-var`) and variable-stride (`--mode sat --var`) produce the same stopword-cloud output regardless of prompt:

```
. and, value the step of to
```

~28–30 tok/s, 8–10 tokens before stopping. The SAT head wasn't repaired by this SFT pass, so AR is the only usable inference path on this checkpoint.

### Read of this

- **Use:** `sft_chat_1024_220k_20260520/final.pt` in AR mode for direct-answer prompts. Limit `max_new` to keep it from drifting into the math-word-problem template it was trained on.
- **Skip:** `sft_math_v1/final.pt`. The dedicated math SFT actually came out worse (`1+1=5`, `2+2=5`). The long chat SFT subsumed it.
- **Don't bother with SAT mode** on this checkpoint until a future SFT pass repairs the SAT head.

## v3 SAT-repair SFT: **failed experiment, do not use** (2026-05-21)

The v3 SAT-repair pass at [`sft_sat_repair_v3_20260521/`](sft_sat_repair_v3_20260521) **regressed the model** relative to its `sft_chat_1024_220k_20260520/final.pt` warm start and is archived here only as a negative result.

### Smoke verdict (post-training, 2026-05-21 22:52-22:54 UTC)

| Prompt | 220k base (working) | v3 (regressed) |
|---|---|---|
| `User: What is 1+1? Assistant:` | ` 2` ✓ | ` The answer to the question "What's the deal with the movie?" is: 2*3.` ✗ |
| `User: What is 2+2? Assistant:` | ` 4` ✓ | ` The answer to the question "statement 1": A man standing in front of a square, then one side is placed on his shoulder...` ✗ |
| `User: What is 47 + 28? Assistant:` | ` 75` ✓ | ` Let's denote the number of ways to arrange a square in one place... (A + B) = 47 - 28 = 0` ✗ |
| Short-chat prompt | "Sure! Here's a simple example..." (coherent intro, drifts) | "Here's a short story about a person who has a secret gift..." (immediate novel-drift) |
| SAT-fixed 2+2 | stopword salad ✗ | identical stopword salad ✗ |
| SAT-var 2+2 | stopword salad ✗ | **byte-for-byte identical** to sat-fixed (gate not engaging) ✗ |

### What went wrong

1. **AR regressed.** The 220k checkpoint could answer simple arithmetic in 3 tokens. v3 lost that: the new mix (heavy on OpenMathInstruct-2 + OpenR1-Math-220k + Magpie) overfit the model to "frame any number question as a multi-step word problem," destroying the direct-answer behavior.
2. **SAT did not recover.** Same stopword-collapse pattern. The unilaterally-applied SAT-target patch in `nB300.py` did not fix the head; it trained it to a different broken equilibrium.
   The fact that `--mode sat --var` produces byte-identical output to `--mode sat --no-var` means the SAT gate isn't engaging, a regression vs the 220k state.
3. **The loss spike was permanent damage, not re-equilibration.** Loss trajectory: `3.014 → 1.231 (12%)` → spike to `8.181 (16%)` → bouncing 5-9 → final **5.557**. The final loss is higher than the starting loss, so the run made the model worse on its own objective rather than finding a new equilibrium.

### Recommendation

**Use [`sft_chat_1024_220k_20260520/final.pt`](sft_chat_1024_220k_20260520) as the canonical chat checkpoint.** It still answers `1+1=2`, `2+2=4`, `47+28=75` in AR mode and respects EOS. SAT mode remains broken on both checkpoints and needs a separate inference/objective investigation, not another SFT pass with the same broken patch.

The full v3 `final.pt`, 8 intermediate deltas, train/upload logs, and the trainer code snapshot ([`code/nB300_20260521T192736Z.py`](sft_sat_repair_v3_20260521/code)) are preserved in `sft_sat_repair_v3_20260521/` as a record of the failed experiment.

## RECOMMENDED CHECKPOINT: `sat_current_turn_v18_20260522` (2026-05-22)

**First AGILLM-3 checkpoint where AR mode and both SAT modes (fixed and variable stride) produce correct arithmetic and coherent chat in the same model.** Promoted to canonical on 2026-05-22.

- Path: [`sat_current_turn_v18_20260522/final.pt`](sat_current_turn_v18_20260522)
- Size: 4,923,609,336 bytes
- SHA256: `50a4dedb6bd768fbfa99815e2ab2af5ef0b7debcfad08f548236b19099e514c6`
- Lineage: head-only current-turn repair, warm-start from `sat_chat_math_v13_20260522`, core frozen, ~20 min on a single RTX 4090

### Smoke results

| Prompt | AR | SAT-fixed | SAT-var |
|---|---|---|---|
| `What is 1+1?` | ` 2` ✓ | ` 2` ✓ | ` 2` ✓ |
| `What is 2+2?` | ` 4` ✓ | **` 4` ✓** | **` 4` ✓** |
| `What is 47+28?` | ` 75` ✓ | ` 75` ✓ | ` 75` ✓ |
| Short chat | "Yes, I can chat normally in short natural text." | "Yes, I am working." | "Yes, I am working." |
| Prev-turn trap | "...The result is 4+1." (mild AR drift) | "...The answer is 4." ✓ | "...The answer is 4." ✓ |

### Architectural significance

First checkpoint that vindicates the dual-head AR+SAT thesis. Lineage:

- `sft_chat_1024_220k_20260520`: AR worked, SAT collapsed to stopword salad
- v3 → v9 → v17: progressive partial fixes (gate fix, head reset, MLP head, AR distillation, per-position gate)
- **v18: AR and both SAT modes hit the bar simultaneously.**

### Demoted

`sft_chat_1024_220k_20260520/final.pt` is no longer the recommended canonical checkpoint; it is preserved as a record of the AR-only-working era.

## License

Apache 2.0