AGILLM-3-Large v2

Continuation of AGILLM-3-Large training, restarted from a known-good checkpoint after a critical tokenizer bug was discovered.

Demo Space: OpenTransformer/AGILLM-3-large-v2-demo

Current recommended checkpoint (2026-05-21)

Use sft_chat_1024_220k_20260520/final.pt in AR mode. This is the current canonical AGILLM-3 v2 chat checkpoint: it answers 1+1=2, 2+2=4, and 47+28=75 in the smoke rerun.

Do not use sft_sat_repair_v3_20260521/final.pt except as a failed-experiment archive. That run regressed AR arithmetic and did not repair SAT. SAT mode remains experimental/broken pending a separate objective/inference fix.

What happened to v1?

A transformers library update (to v5.3.0) on 2026-03-11 silently broke the DeepSeek-V3.2 tokenizer's encode/decode pipeline:

Root cause: The tokenizer's Metaspace pre-tokenizer was configured to use ▁ (U+2581, SentencePiece convention) for space replacement, but the BPE vocabulary uses Ġ (U+0120, GPT-2 convention). This mismatch caused:
- Encoding: All spaces were silently dropped. "Water boils" encoded to ['Water', 'bo', 'ils'] instead of ['ĠWater', 'Ġboils']
- Decoding: tok.decode() lost all spaces. Round-trip encode→decode of "The meaning of life" returned "Themeaningoflife"
- Training data corruption: ~3 billion tokens of training data were fed to the model without any space information, causing the model weights to degrade
Detection: Model output went from coherent English (step 12,373,125) to space-less garbled text (step 12,528,061+). It took investigation to trace the regression back to the tokenizer library update.
Fix: Pinned transformers==4.48.0 (+ tokenizers==0.21.4), which correctly handles the Ġ space prefix. Also added a runtime fix in n.py that patches the ▁→Ġ mismatch if detected.
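The failure mode above (spaces dropped on encode, lost on decode) is the kind of thing a round-trip probe catches before any training data is tokenized. The following is a minimal sketch, not the actual health check or the patch in n.py: the function names `roundtrip_failures` and `space_marker_mismatch` are hypothetical, and the encode/decode callables are assumed to be wrappers around the real tokenizer that neither add nor emit special tokens.

```python
from typing import Callable, Iterable, List

SENTENCEPIECE_SPACE = "\u2581"  # ▁, SentencePiece convention
GPT2_SPACE = "\u0120"           # Ġ, GPT-2 byte-level convention


def roundtrip_failures(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    probes: Iterable[str],
) -> List[str]:
    """Return the probe strings that do not survive encode -> decode unchanged."""
    return [text for text in probes if decode(encode(text)) != text]


def space_marker_mismatch(vocab: Iterable[str], pretokenizer_marker: str) -> bool:
    """True if the pre-tokenizer's space marker is absent from the vocabulary
    while the other convention's marker is present (the v1 failure pattern)."""
    tokens = list(vocab)
    other = GPT2_SPACE if pretokenizer_marker == SENTENCEPIECE_SPACE else SENTENCEPIECE_SPACE
    has_own = any(pretokenizer_marker in tok for tok in tokens)
    has_other = any(other in tok for tok in tokens)
    return has_other and not has_own
```

With a tokenizer that drops spaces, `roundtrip_failures(encode, decode, ["The meaning of life"])` returns the probe itself, and a training launcher could refuse to start whenever the returned list is non-empty.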

This repo

Resumes training from step 12,373,125 (~10.83B tokens, 30.9%) — the last checkpoint with correctly-encoded training data.

Model

Parameter	Value
Parameters	698M
Hidden dim	1024
Layers	24
Heads	16
Rank	128
Expansion ratio	2.0x
Vocab	128,815 (DeepSeek-V3.2 tokenizer)
Architecture	Joint AR + SAT (autoregressive + span-aware transformer)
Training target	35B tokens
Tokens seen (at restart)	~10.83B (30.9%)

Training setup

GPU: RTX 4090 (Vast.ai, ~$0.27/hr)
Speed: ~20,000 tok/s
Block size: 1122
Batch size: 1
Mixed precision: AMP (BF16)
Optimizer: AdamW (LR core=5e-5, LR head=2e-4)
Data: Streamed from multiple HuggingFace datasets (web crawl, cleaned text)

Important: Tokenizer compatibility

This model requires transformers<=4.48.0 for correct tokenizer behavior:

pip install transformers==4.48.0 tokenizers==0.21.4

Or use the runtime fix in n.py which auto-patches the ▁/Ġ mismatch.
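For scripts that load this model, a startup guard can fail fast when an incompatible transformers version is installed. This is a sketch under stated assumptions: `parse_version`, `transformers_is_supported`, and `check_installed` are hypothetical helpers, not part of n.py, and the version parser only reads the leading numeric X.Y.Z part of the version string.

```python
from importlib.metadata import PackageNotFoundError, version

# Upper bound taken from the compatibility note above.
MAX_TRANSFORMERS = (4, 48, 0)


def parse_version(text: str) -> tuple:
    """Parse the leading numeric 'X.Y.Z' of a version string into a 3-tuple."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def transformers_is_supported(installed: str) -> bool:
    """True if the given version string is at or below the pinned bound."""
    return parse_version(installed) <= MAX_TRANSFORMERS


def check_installed() -> None:
    """Raise RuntimeError if transformers is missing or newer than the bound."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        raise RuntimeError("transformers is not installed") from None
    if not transformers_is_supported(installed):
        raise RuntimeError(
            f"transformers {installed} found; this model expects <= 4.48.0"
        )
```

Calling `check_installed()` before loading the tokenizer turns a silent encoding bug into an immediate error.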

Sample output (step 12,373,125)

Prompt: "Water boils at one hundred degrees"

Water boils at one hundred degrees in the 1990s, and a year after that. "It's not just that, but it makes you think." He said: "I don't think I'm going to make a deal for the rest of my life."

Latest smoke-test output (step 30,028,112)

Generated from pretrain_delta_step30028112.pt on 2026-05-16 after the tokenizer health check passed with transformers==4.48.0 and tokenizers==0.21.4.

Tokens seen at checkpoint: saved between 33,185,096,055 and 33,186,096,759 / 35,000,000,000 tokens (~94.8%).
Live ETA snapshot: as of 2026-05-16 04:00:55 UTC, training was at 33,217,118,583 / 35,000,000,000 tokens (94.906053%) at 5,221 tok/s, with 1,782,881,417 tokens remaining. Pretraining was projected to finish at 2026-05-20 02:52:17 UTC (03:52:17 UK) if speed holds.
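The ETA figures in the snapshot can be reproduced from the token counts and the throughput. The sketch below assumes the remaining time is computed with integer (floor) division of remaining tokens by tok/s; the function name `eta` is illustrative only.

```python
from datetime import datetime, timedelta, timezone

TARGET_TOKENS = 35_000_000_000


def eta(tokens_seen: int, tok_per_s: int, now: datetime) -> datetime:
    """Projected finish time if throughput stays constant."""
    remaining = TARGET_TOKENS - tokens_seen
    return now + timedelta(seconds=remaining // tok_per_s)


snapshot = datetime(2026, 5, 16, 4, 0, 55, tzinfo=timezone.utc)
tokens_seen = 33_217_118_583

finish = eta(tokens_seen, 5_221, snapshot)
print(f"{tokens_seen / TARGET_TOKENS:.6%}")  # 94.906053%
print(finish.isoformat())                    # 2026-05-20T02:52:17+00:00
```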

Prompt: "Water boils at one hundred degrees"

Water boils at one hundred degrees, and there is a challenge to the Doctor to receive the two-week term, before finally waiting for their upcoming term. The case comes as the justices are poised against it for an exemption because there is not what a Mississippi law that bars most

Post-SFT chat smoke test (sft_chat/final.pt, 2026-05-20)

After pretraining hit 35B tokens and saved pretrain_final.pt, a chat SFT pass was run on top, producing sft_chat/final.pt (step 32,440,025; "All Training Complete" at 2026-05-20 04:55 UTC). It is now uploaded to this repo under sft_chat/. Raw smoke-test log: inference_results/sft_chat_final_smoke_20260520T084912Z.txt.

Result, honestly: the User: / Assistant: turn template is obeyed across AR, SAT-fixed, and SAT-variable sampling paths, so the SFT fine-tune is reaching the model. Content quality is not yet usable in any of the three.

Decode speeds on RTX 4090, ~698M params:

Mode	Flags	tok/s
AR	`--mode ar`	~51.6
SAT fixed-stride	`--mode sat --no-var`	~81.7
SAT variable-stride	`--mode sat --var`	~80.7

AR — chat prompt

User: In one short paragraph, say what you are and answer: what is 2+2?
Assistant: ```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def task_func(df):
    if not isinstance(df, columns):
        raise ValueError("The function must be a positive integer.")
    ...

The model immediately drops into BigCodeBench-style "code task" boilerplate instead of answering. This suggests the SFT mix was too code/task-heavy.

SAT fixed-stride — same chat prompt

User: In one short paragraph, say what you are and answer: what is 2+2?
Assistant: that -a In to In-:.k a18 the) In in : Ininx The (43 as of:0 In1012 ...

Decodes as token-salad.

SAT variable-stride — same chat prompt

User: In one short paragraph, say what you are and answer: what is 2+2?
Assistant: -v" -) the.k8410kl inx have :9 a1 of$0in J [art to?ert200 only V612 ...

Also token-salad. The variable-stride path currently needs --no_inference_mode to run, because alibi_plus_mask reads _version on an inference tensor. According to the flag's docstring the math is unchanged; the flag is only a perf/version-tracking tradeoff.

AR — arithmetic smoke

2+2=0
    return $($($($($($($($($$))).join('')).join('')).join('')).join('')).join('')).join('')).

Wrong answer plus PowerShell-flavoured nonsense.

Read of this

Inference plumbing, tokenizer health check, turn template, and checkpoint load are all fine across AR / SAT-fixed / SAT-variable.
The chat capability itself is not there yet at this SFT pass — needs a cleaner, more balanced instruct/chat mix (less code-task data, more direct-answer dialogue).
The next step is a continuation SFT from sft_chat/final.pt into sft_chat_v2/ on an UltraChat + SlimOrca + OpenHermes + UltraFeedback-chosen mix.

Long-SFT chat run — `sft_chat_1024_220k_20260520/final.pt` (2026-05-21)

A second SFT pass on top of pretrain_final.pt ran for ~220k steps on a cleaner chat/math mix at block 1024. It completed at step 32,820,025 on 2026-05-21 04:09 UTC and was uploaded the same day.

This is the first chat checkpoint from the v2 lineage that actually answers arithmetic correctly in AR mode.

File	Size	SHA256
`sft_chat_1024_220k_20260520/final.pt`	8,385,662,500 B	`01bc728b0e03ef0d2f2661162d5b65f6aac75251c305a20cb4fde0de5388a7de`

Plus 8 intermediate delta checkpoints (sft_step32643775.pt → sft_step32796883.pt) and 3 raw smoke logs under sft_chat_1024_220k_20260520/inference/.
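Since the checkpoint is roughly 8.4 GB, it is worth checking a download against the size and SHA256 listed above before loading it. A minimal sketch using only the standard library follows; the helper names `sha256_of` and `verify` are hypothetical, and the file is hashed in chunks so it never has to fit in memory.

```python
import hashlib
from pathlib import Path

# Values copied from the table above.
EXPECTED_SHA256 = "01bc728b0e03ef0d2f2661162d5b65f6aac75251c305a20cb4fde0de5388a7de"
EXPECTED_SIZE = 8_385_662_500


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks and return its hex SHA256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_sha256: str = EXPECTED_SHA256,
           expected_size: int = EXPECTED_SIZE) -> bool:
    """Check the cheap size test first, then the full hash."""
    p = Path(path)
    return p.stat().st_size == expected_size and sha256_of(p) == expected_sha256
```

Usage would look like `verify("sft_chat_1024_220k_20260520/final.pt")`, which returns False on a truncated or corrupted download.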

AR-mode smoke results (rerun, 2026-05-21 04:27 UTC)

Prompt	Completion	Verdict
`User: What is 1+1?\nAssistant:`	`2`	✓
`User: What is 2+2?\nAssistant:`	`4`	✓
`User: What is 47 + 28?\nAssistant:`	`75`	✓
`User: In one short paragraph, say what you are and answer: what is 2+2?`	`I am a small experimental language model. 1) The distance between the two points (x-3)/6=<<0/5*7=21>>20 kmph...`	self-identifies, then drifts into a math-template
`User: Hello, can you chat normally for one sentence?`	`Sure! Here's a simple example of how to use the word "hello" in your poem. The first line is 'I'm not good enough'...`	coherent English, drifts after one sentence
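The ✓ verdicts on the three arithmetic rows amount to a strict direct-answer check: the completion must be exactly the sum and nothing else. The sketch below shows one way to automate that; it is not the smoke script that produced the table, and `grade` and `PROMPT_RE` are hypothetical names.

```python
import re

# Matches prompts of the form "What is A+B?" with optional spaces around "+".
PROMPT_RE = re.compile(r"What is\s*(\d+)\s*\+\s*(\d+)\?")


def grade(prompt: str, completion: str) -> bool:
    """True only if the completion, stripped of whitespace, is exactly the sum."""
    match = PROMPT_RE.search(prompt)
    if match is None:
        raise ValueError("prompt is not a simple addition question")
    expected = int(match.group(1)) + int(match.group(2))
    return completion.strip() == str(expected)
```

The exact-match rule is deliberate: a completion that merely contains the right digit somewhere inside a longer word-problem answer is graded as a failure.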

Decode speed in AR mode: 47–50 tok/s on RTX 4090 at 698M params. Direct numerical answers come back in 3 tokens ([0.26s | 3 tokens | 11.4 tok/s]).

SAT-mode smoke results — still collapsed

Both fixed-stride (--mode sat --no-var) and variable-stride (--mode sat --var) produce the same stopword-cloud output regardless of prompt:

.
 and, value
 the step of to

Decoding runs at ~28–30 tok/s and emits 8–10 tokens before stopping. The SAT head wasn't repaired by this SFT pass, so AR is the only usable inference path on this checkpoint.

Read of this

Use: sft_chat_1024_220k_20260520/final.pt in AR mode for direct-answer prompts. Limit max_new to keep it from drifting into the math-word-problem template it was trained on.
Skip: sft_math_v1/final.pt — the dedicated math SFT actually came out worse (1+1=5, 2+2=5). The long chat SFT subsumed it.
Don't bother with SAT mode on this checkpoint until a future SFT pass repairs the SAT head.

v3 SAT-repair SFT — failed experiment, do not use (2026-05-21)

The v3 SAT-repair pass at sft_sat_repair_v3_20260521/ regressed the model vs its sft_chat_1024_220k_20260520/final.pt warm-start and is archived here only as a negative result.

Smoke verdict (post-training, 2026-05-21 22:52-22:54 UTC)

Prompt	220k base (working)	v3 (regressed)
`User: What is 1+1?\nAssistant:`	`2` ✓	`The answer to the question "What's the deal with the movie?" is: 2*3.` ✗
`User: What is 2+2?\nAssistant:`	`4` ✓	`The answer to the question "statement 1": A man standing in front of a square, then one side is placed on his shoulder...` ✗
`User: What is 47 + 28?\nAssistant:`	`75` ✓	`Let's denote the number of ways to arrange a square in one place... (A + B) = 47 - 28 = 0` ✗
Short-chat prompt	"Sure! Here's a simple example..." (coherent intro, drifts)	"Here's a short story about a person who has a secret gift..." (immediate novel-drift)
SAT-fixed 2+2	stopword salad ✗	identical stopword salad ✗
SAT-var 2+2	stopword salad ✗	byte-for-byte identical to sat-fixed (gate not engaging) ✗

What went wrong

AR regressed. The 220k checkpoint could answer simple arithmetic in 3 tokens. v3 lost that: the new mix (heavy on OpenMathInstruct-2 + OpenR1-Math-220k + Magpie) overfit the model to "frame any number question as a multi-step word problem," destroying the direct-answer behavior.
SAT did not recover. It shows the same stopword-collapse pattern. The unilaterally-applied SAT-target patch in nB300.py did not fix the head; it trained it to a different broken equilibrium. The fact that --mode sat --var produces byte-identical output to --mode sat --no-var means the SAT gate isn't engaging, which is a regression vs the 220k state.
The loss spike was permanent damage, not re-equilibration. Loss trajectory: 3.014 → 1.231 (12%) → spike to 8.181 (16%) → bouncing 5-9 → final 5.557. The final loss is higher than the starting loss, so the run made the model worse on its own objective rather than settling into a new equilibrium.
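The loss argument above reduces to two checks that a trainer could run on its own log: did the loss ever jump well above its running minimum, and did the run end above where it started. The following sketch uses only the four loss values reported above; the function names and the `factor=2.0` spike threshold are assumptions for illustration, not settings from nB300.py.

```python
from typing import List, Optional


def run_regressed(losses: List[float]) -> bool:
    """True if the final loss is higher than the starting loss."""
    return losses[-1] > losses[0]


def first_spike(losses: List[float], factor: float = 2.0) -> Optional[int]:
    """Index of the first loss exceeding factor x the running minimum, else None."""
    best = losses[0]
    for i, loss in enumerate(losses[1:], start=1):
        if loss > factor * best:
            return i
        best = min(best, loss)
    return None


# Start, best point, spike, and final value as reported for the v3 run.
trajectory = [3.014, 1.231, 8.181, 5.557]
print(first_spike(trajectory))    # 2
print(run_regressed(trajectory))  # True
```

A check like `first_spike` could be used to stop a run and roll back to the last delta checkpoint instead of training through the spike.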

Recommendation

Use sft_chat_1024_220k_20260520/final.pt as the canonical chat checkpoint. It still answers 1+1=2, 2+2=4, 47+28=75 in AR mode and respects EOS. SAT mode remains broken on both checkpoints and needs a separate inference/objective investigation, not another SFT pass with the same broken patch.

The full v3 final.pt + 8 intermediate deltas + train/upload logs + the trainer code snapshot (code/nB300_20260521T192736Z.py) are preserved in sft_sat_repair_v3_20260521/ as a record of the failed experiment.

License

Apache 2.0
