AGILLM-3-Large v2

Continuation of AGILLM-3-Large training, restarted from a known-good checkpoint after a critical tokenizer bug was discovered.

Demo Space: OpenTransformer/AGILLM-3-large-v2-demo

Current recommended checkpoint (2026-05-21)

Use sft_chat_1024_220k_20260520/final.pt in AR mode. This is the current canonical AGILLM-3 v2 chat checkpoint: it answers 1+1=2, 2+2=4, and 47+28=75 in the smoke rerun.

Do not use sft_sat_repair_v3_20260521/final.pt except as a failed-experiment archive. That run regressed AR arithmetic and did not repair SAT. SAT mode remains experimental/broken pending a separate objective/inference fix.

What happened to v1?

A transformers library update (to v5.3.0) on 2026-03-11 silently broke the DeepSeek-V3.2 tokenizer's encode/decode pipeline:

  • Root cause: The tokenizer's Metaspace pre-tokenizer was configured to use ▁ (U+2581, SentencePiece convention) for space replacement, but the BPE vocabulary uses Ġ (U+0120, GPT-2 convention). This mismatch caused:

    • Encoding: All spaces were silently dropped. "Water boils" encoded to ['Water', 'bo', 'ils'] instead of ['ĠWater', 'Ġboils']
    • Decoding: tok.decode() lost all spaces. A round-trip encode→decode of "The meaning of life" returned "Themeaningoflife"
    • Training data corruption: ~3 billion tokens of training data were fed to the model without any space information, degrading the model weights
  • Detection: Model output went from coherent English (step 12,373,125) to space-less garbled text (step 12,528,061+). It took some investigation to trace this back to the tokenizer library update.

  • Fix: Pinned transformers==4.48.0 (+ tokenizers==0.21.4), which correctly handles the Ġ space prefix. Also added a runtime fix in n.py that patches the ▁→Ġ mismatch if detected.
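
The failure described above is cheap to detect with a round-trip probe. The following is a minimal sketch, not the check used in n.py: the function name `roundtrip_preserves_spaces` and the toy character-level tokenizers are illustrative assumptions. With a real Hugging Face tokenizer `tok`, one would pass `lambda s: tok.encode(s, add_special_tokens=False)` and `tok.decode` instead of the toys.

```python
from typing import Callable, List


def roundtrip_preserves_spaces(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    probe: str = "The meaning of life",
) -> bool:
    """Return True if encode followed by decode reproduces the probe exactly."""
    return decode(encode(probe)) == probe


# Toy stand-ins (assumptions, not the real tokenizer): one id per character.
def healthy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def broken_encode(text: str) -> List[int]:
    # Mimics the reported bug: spaces are silently dropped at encode time.
    return [ord(ch) for ch in text if ch != " "]


def toy_decode(ids: List[int]) -> str:
    return "".join(chr(i) for i in ids)


if __name__ == "__main__":
    print(roundtrip_preserves_spaces(healthy_encode, toy_decode))  # True
    print(roundtrip_preserves_spaces(broken_encode, toy_decode))   # False
```

Running a probe like this before training starts, and aborting on failure, would stop a run before any space-less data reaches the model.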

This repo

Resumes training from step 12,373,125 (~10.83B tokens, 30.9%), the last checkpoint with correctly-encoded training data.

Model

| Parameter | Value |
| --- | --- |
| Parameters | 698M |
| Hidden dim | 1024 |
| Layers | 24 |
| Heads | 16 |
| Rank | 128 |
| Expansion ratio | 2.0x |
| Vocab | 128,815 (DeepSeek-V3.2 tokenizer) |
| Architecture | Joint AR + SAT (autoregressive + span-aware transformer) |
| Training target | 35B tokens |
| Tokens seen (at restart) | ~10.83B (30.9%) |

Training setup

  • GPU: RTX 4090 (Vast.ai, ~$0.27/hr)
  • Speed: ~20,000 tok/s
  • Block size: 1122
  • Batch size: 1
  • Mixed precision: AMP (BF16)
  • Optimizer: AdamW (LR core=5e-5, LR head=2e-4)
  • Data: Streamed from multiple HuggingFace datasets (web crawl, cleaned text)

Important: Tokenizer compatibility

This model requires transformers<=4.48.0 for correct tokenizer behavior:

pip install transformers==4.48.0 tokenizers==0.21.4

Or use the runtime fix in n.py, which auto-patches the ▁/Ġ mismatch.
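
A startup guard can enforce the pin before any data is tokenized. This is a hedged sketch using only the standard library; `version_tuple`, `transformers_is_compatible`, and `MAX_TRANSFORMERS` are hypothetical names, and the bound simply mirrors the `transformers<=4.48.0` requirement stated above.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Upper bound taken from the compatibility note above.
MAX_TRANSFORMERS: Tuple[int, int, int] = (4, 48, 0)


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading 'major.minor[.patch]' of a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def transformers_is_compatible(installed: str) -> bool:
    """True if the given transformers version is within the pinned bound."""
    return version_tuple(installed) <= MAX_TRANSFORMERS


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_transformers_version()
    if found is None:
        print("transformers is not installed")
    elif not transformers_is_compatible(found):
        raise SystemExit(f"transformers {found} is newer than the pinned 4.48.0")
    else:
        print(f"transformers {found} is within the pinned range")
```

A version check only catches the known-bad upgrade path; pairing it with an actual encode/decode round-trip test guards against other tokenizer regressions as well.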

Sample output (step 12,373,125)

Prompt: "Water boils at one hundred degrees"

Water boils at one hundred degrees in the 1990s, and a year after that. "It's not just that, but it makes you think." He said: "I don't think I'm going to make a deal for the rest of my life."

Latest smoke-test output (step 30,028,112)

Generated from pretrain_delta_step30028112.pt on 2026-05-16 after the tokenizer health check passed with transformers==4.48.0 and tokenizers==0.21.4.

  • Tokens seen at checkpoint: saved between 33,185,096,055 and 33,186,096,759 / 35,000,000,000 tokens (~94.8%).
  • Live ETA snapshot: as of 2026-05-16 04:00:55 UTC, training was at 33,217,118,583 / 35,000,000,000 tokens (94.906053%) at 5,221 tok/s, with 1,782,881,417 tokens remaining. Pretraining was projected to finish at 2026-05-20 02:52:17 UTC (03:52:17 UK) if speed holds.
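
The projected finish time follows directly from the snapshot numbers above: remaining tokens divided by throughput, added to the snapshot time. A small sketch of that arithmetic (the function name `project_finish` is illustrative, and it assumes throughput stays constant):

```python
from datetime import datetime, timedelta, timezone


def project_finish(
    now: datetime, tokens_seen: int, target_tokens: int, tok_per_s: float
) -> datetime:
    """Linear ETA: assumes throughput stays at tok_per_s until the target."""
    remaining = target_tokens - tokens_seen
    return now + timedelta(seconds=remaining / tok_per_s)


# Values from the 2026-05-16 04:00:55 UTC snapshot.
snapshot = datetime(2026, 5, 16, 4, 0, 55, tzinfo=timezone.utc)
eta = project_finish(snapshot, 33_217_118_583, 35_000_000_000, 5_221)

print(35_000_000_000 - 33_217_118_583)          # 1782881417 tokens remaining
print(eta.replace(microsecond=0).isoformat())   # 2026-05-20T02:52:17+00:00
```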

Prompt: "Water boils at one hundred degrees"

Water boils at one hundred degrees, and there is a challenge to the Doctor to receive the two-week term, before finally waiting for their upcoming term. The case comes as the justices are poised against it for an exemption because there is not what a Mississippi law that bars most

Post-SFT chat smoke test (sft_chat/final.pt, 2026-05-20)

After pretraining hit 35B tokens and saved pretrain_final.pt, a chat SFT pass was run on top, producing sft_chat/final.pt (step 32,440,025; "All Training Complete" at 2026-05-20 04:55 UTC). It is now uploaded to this repo under sft_chat/. Raw smoke-test log: inference_results/sft_chat_final_smoke_20260520T084912Z.txt.

Result, honestly: the User: / Assistant: turn template is obeyed across AR, SAT-fixed, and SAT-variable sampling paths, so the SFT fine-tune is reaching the model. Content quality is not yet usable in any of the three.

Decode speeds on RTX 4090, ~698M params:

| Mode | Flags | tok/s |
| --- | --- | --- |
| AR | --mode ar | ~51.6 |
| SAT fixed-stride | --mode sat --no-var | ~81.7 |
| SAT variable-stride | --mode sat --var | ~80.7 |

AR: chat prompt

User: In one short paragraph, say what you are and answer: what is 2+2?
Assistant: ```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def task_func(df):
    if not isinstance(df, columns):
        raise ValueError("The function must be a positive integer.")
    ...
```

The model immediately drops into BigCodeBench-style "code task" boilerplate instead of answering. This suggests the SFT mix was too code/task-heavy.

SAT fixed-stride: same chat prompt

User: In one short paragraph, say what you are and answer: what is 2+2?
Assistant: that -a In to In-:.k a18 the) In in : Ininx The (43 as of:0 In1012 ...

Decodes as token-salad.

SAT variable-stride: same chat prompt

User: In one short paragraph, say what you are and answer: what is 2+2?
Assistant: -v" -) the.k8410kl inx have :9 a1 of$0in J [art to?ert200 only V612 ...

Also token-salad. The variable-stride path currently needs --no_inference_mode to run, because alibi_plus_mask reads _version on an inference tensor. According to the flag's docstring, the math is unchanged; the flag is only a perf/version-tracking tradeoff.

AR: arithmetic smoke

2+2=0
    return $($($($($($($($($$))).join('')).join('')).join('')).join('')).join('')).join('')).

Wrong answer plus PowerShell-flavoured nonsense.

Read of this

  • Inference plumbing, tokenizer health check, turn template, and checkpoint load are all fine across AR / SAT-fixed / SAT-variable.
  • The chat capability itself is not there yet at this SFT pass. It needs a cleaner, more balanced instruct/chat mix (less code-task data, more direct-answer dialogue).
  • Next step is a continuation SFT from sft_chat/final.pt into sft_chat_v2/ on an UltraChat + SlimOrca + OpenHermes + UltraFeedback-chosen mix.

Long-SFT chat run: sft_chat_1024_220k_20260520/final.pt (2026-05-21)

A second SFT pass on top of pretrain_final.pt ran for ~220k steps on a cleaner chat/math mix at block size 1024. It completed at step 32,820,025 on 2026-05-21 04:09 UTC and was uploaded the same day.

This is the first chat checkpoint from the v2 lineage that actually answers arithmetic correctly in AR mode.

| File | Size | SHA256 |
| --- | --- | --- |
| sft_chat_1024_220k_20260520/final.pt | 8,385,662,500 B | 01bc728b0e03ef0d2f2661162d5b65f6aac75251c305a20cb4fde0de5388a7de |
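
The size and SHA256 listed above can be checked after download. A minimal sketch with the standard library follows; `sha256_of` and `verify_checkpoint` are hypothetical helper names, and the local path in the usage comment is an assumption about where the file was saved. The file is hashed in chunks so the ~8.4 GB checkpoint never has to fit in memory.

```python
import hashlib
from pathlib import Path
from typing import Optional, Union

# Values copied from the table above.
EXPECTED_SHA256 = "01bc728b0e03ef0d2f2661162d5b65f6aac75251c305a20cb4fde0de5388a7de"
EXPECTED_SIZE = 8_385_662_500


def sha256_of(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(
    path: Union[str, Path], expected_sha256: str, expected_size: Optional[int] = None
) -> bool:
    """Cheap size check first, then the full hash comparison."""
    file_path = Path(path)
    if expected_size is not None and file_path.stat().st_size != expected_size:
        return False
    return sha256_of(file_path) == expected_sha256.lower()


# Usage (the local path is an assumption):
# ok = verify_checkpoint("sft_chat_1024_220k_20260520/final.pt",
#                        EXPECTED_SHA256, EXPECTED_SIZE)
```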

Plus 8 intermediate delta checkpoints (sft_step32643775.pt → sft_step32796883.pt) and 3 raw smoke logs under sft_chat_1024_220k_20260520/inference/.

AR-mode smoke results (rerun, 2026-05-21 04:27 UTC)

| Prompt | Completion | Verdict |
| --- | --- | --- |
| `User: What is 1+1?`<br>`Assistant:` | 2 | ✓ |
| `User: What is 2+2?`<br>`Assistant:` | 4 | ✓ |
| `User: What is 47 + 28?`<br>`Assistant:` | 75 | ✓ |
| User: In one short paragraph, say what you are and answer: what is 2+2? | I am a small experimental language model. 1) The distance between the two points (x-3)/6=<<0/5*7=21>>20 kmph... | self-identifies, then drifts into a math-template |
| User: Hello, can you chat normally for one sentence? | Sure! Here's a simple example of how to use the word "hello" in your poem. The first line is 'I'm not good enough'... | coherent English, drifts after one sentence |

Decode speed in AR mode: 47–50 tok/s on RTX 4090 at 698M params. Direct numerical answers come back in 3 tokens ([0.26s | 3 tokens | 11.4 tok/s]).
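
The three arithmetic rows above amount to a small regression test. The sketch below shows one way to automate it; it is not the smoke script used for these runs. `generate` is a hypothetical callable (prompt in, completion out) standing in for whatever inference entry point is used, and the pass criterion (first whitespace-delimited token equals the expected answer) is an assumption chosen so that word-problem preambles fail.

```python
from typing import Callable, Dict

# Prompts and expected answers from the AR-mode smoke results above.
ARITHMETIC_CASES: Dict[str, str] = {
    "What is 1+1?": "2",
    "What is 2+2?": "4",
    "What is 47 + 28?": "75",
}


def format_prompt(question: str) -> str:
    """Build the User:/Assistant: turn template used in the smoke prompts."""
    return f"User: {question}\nAssistant:"


def first_token(completion: str) -> str:
    """First whitespace-delimited token, minus trailing punctuation."""
    parts = completion.split()
    return parts[0].rstrip(".,!") if parts else ""


def run_smoke(
    generate: Callable[[str], str], cases: Dict[str, str] = ARITHMETIC_CASES
) -> Dict[str, bool]:
    """Map each question to True if the completion leads with the answer."""
    return {
        question: first_token(generate(format_prompt(question))) == expected
        for question, expected in cases.items()
    }
```

Requiring the answer as the first token is deliberately strict: a completion that opens with "The answer to the question..." fails even if a correct digit appears later.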

SAT-mode smoke results: still collapsed

Both fixed-stride (--mode sat --no-var) and variable-stride (--mode sat --var) produce the same stopword-cloud output regardless of prompt:

.
 and, value
 the step of to

~28–30 tok/s, 8–10 tokens before stopping. The SAT head wasn't repaired by this SFT pass, so AR is the only usable inference path on this checkpoint.

Read of this

  • Use: sft_chat_1024_220k_20260520/final.pt in AR mode for direct-answer prompts. Limit max_new to keep it from drifting into the math-word-problem template it was trained on.
  • Skip: sft_math_v1/final.pt. The dedicated math SFT actually came out worse (1+1=5, 2+2=5). The long chat SFT subsumed it.
  • Don't bother with SAT mode on this checkpoint until a future SFT pass repairs the SAT head.

v3 SAT-repair SFT: failed experiment, do not use (2026-05-21)

The v3 SAT-repair pass at sft_sat_repair_v3_20260521/ regressed the model vs its sft_chat_1024_220k_20260520/final.pt warm-start and is archived here only as a negative result.

Smoke verdict (post-training, 2026-05-21 22:52-22:54 UTC)

| Prompt | 220k base (working) | v3 (regressed) |
| --- | --- | --- |
| `User: What is 1+1?`<br>`Assistant:` | 2 ✓ | The answer to the question "What's the deal with the movie?" is: 2*3. ✗ |
| `User: What is 2+2?`<br>`Assistant:` | 4 ✓ | The answer to the question "statement 1": A man standing in front of a square, then one side is placed on his shoulder... ✗ |
| `User: What is 47 + 28?`<br>`Assistant:` | 75 ✓ | Let's denote the number of ways to arrange a square in one place... (A + B) = 47 - 28 = 0 ✗ |
| Short-chat prompt | "Sure! Here's a simple example..." (coherent intro, drifts) | "Here's a short story about a person who has a secret gift..." (immediate novel-drift) |
| SAT-fixed 2+2 | stopword salad ✗ | identical stopword salad ✗ |
| SAT-var 2+2 | stopword salad ✗ | byte-for-byte identical to sat-fixed (gate not engaging) ✗ |

What went wrong

  1. AR regressed. The 220k checkpoint could answer simple arithmetic in 3 tokens. v3 lost that: the new mix (heavy on OpenMathInstruct-2 + OpenR1-Math-220k + Magpie) overfit the model to "frame any number question as a multi-step word problem," destroying the direct-answer behavior.
  2. SAT did not recover. Same stopword-collapse pattern. The unilaterally-applied SAT-target patch in nB300.py did not fix the head; it trained it to a different broken equilibrium. The fact that --mode sat --var produces byte-identical output to --mode sat --no-var means the SAT gate isn't engaging, a regression vs the 220k state.
  3. The loss spike was permanent damage, not re-equilibration. Loss trajectory: 3.014 → 1.231 (12%) → spike to 8.181 (16%) → bouncing 5-9 → final 5.557. The final loss is higher than the starting loss, so the run made the model worse on its own objective rather than finding a new equilibrium.
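
The reasoning in point 3 can be written as a simple check on a logged loss trajectory. This is an illustrative sketch only: `assess_loss`, `LossVerdict`, and the `spike_factor` threshold are assumptions, not part of the trainer, and the input is just the four loss values quoted above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LossVerdict:
    start: float
    best: float
    final: float
    spiked: bool      # loss rose well above its best value afterwards
    regressed: bool   # run ended worse than it began


def assess_loss(losses: Sequence[float], spike_factor: float = 2.0) -> LossVerdict:
    """Summarise a loss trajectory; spike_factor is an arbitrary threshold."""
    if len(losses) < 2:
        raise ValueError("need at least two loss values")
    best_index = min(range(len(losses)), key=lambda i: losses[i])
    best = losses[best_index]
    peak_after_best = max(losses[best_index:])
    return LossVerdict(
        start=losses[0],
        best=best,
        final=losses[-1],
        spiked=peak_after_best > spike_factor * best,
        regressed=losses[-1] > losses[0],
    )


# Start, best (12%), spike (16%), and final values reported for the v3 run.
verdict = assess_loss([3.014, 1.231, 8.181, 5.557])
print(verdict.spiked, verdict.regressed)  # True True
```

A check like this, run during training, could trigger an early stop and a rollback to the last pre-spike delta checkpoint instead of letting the run finish in a degraded state.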

Recommendation

Use sft_chat_1024_220k_20260520/final.pt as the canonical chat checkpoint. It still answers 1+1=2, 2+2=4, 47+28=75 in AR mode and respects EOS. SAT mode remains broken on both checkpoints and needs a separate inference/objective investigation, not another SFT pass with the same broken patch.

The full v3 final.pt + 8 intermediate deltas + train/upload logs + the trainer code snapshot (code/nB300_20260521T192736Z.py) are preserved in sft_sat_repair_v3_20260521/ as a record of the failed experiment.

License

Apache 2.0
