# Nemotron-Terminal-8B Reproduction (Marin)
A reproduction of NVIDIA's Nemotron-Terminal-8B by the Marin community, supervised fine-tuned (SFT) on the full released Nemotron-Terminal-Corpus (366K examples).
## Results
| Benchmark | Nemotron-Terminal-8B (released) | This model (Marin SFT) | Diff |
|---|---|---|---|
| Terminal-Bench 2.0 | 13.0% ± 2.2 | 15.9% (14/88) | +2.9pp |
| Terminal-Bench Lite | 23.0% (23/100) | 29.0% (29/100) | +6.0pp |
On both benchmarks, this reproduction outperforms the released weights, despite training on 366K examples rather than the 490K reported in the paper.
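As a sanity check on the table, the pass rates and deltas can be recomputed from the raw task counts (only our counts are given as fractions; the released-model rates are taken from the table as published):

```python
# Recompute the Results table from raw counts. The released-model baselines
# (13.0% on TB2, 23.0% on TBLite) are quoted, not recomputed.

def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)

tb2_ours = pass_rate(14, 88)        # Terminal-Bench 2.0: 14/88 tasks solved
tblite_ours = pass_rate(29, 100)    # Terminal-Bench Lite: 29/100 tasks solved

tb2_diff = round(tb2_ours - 13.0, 1)       # +2.9pp vs released weights
tblite_diff = round(tblite_ours - 23.0, 1) # +6.0pp vs released weights

print(tb2_ours, tblite_ours, tb2_diff, tblite_diff)
```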
## Training Details
- Base model: Qwen/Qwen3-8B
- Training data: nvidia/Nemotron-Terminal-Corpus — 366K examples (skill_based_easy, skill_based_medium, skill_based_mixed, diverse_complex subsets)
- Steps: 5,721 (2 epochs)
- Learning rate: 2e-5 (cosine schedule)
- Batch size: 128
- Sequence length: 32,768
- Final loss: ~0.35
- Hardware: TPU v5p-32
- Framework: Marin (Levanter-based)
- W&B run: exp3490b
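The card specifies a cosine schedule peaking at 2e-5 over 5,721 steps but not the warmup length or minimum learning rate; a minimal sketch of the schedule, assuming no warmup and decay to zero, looks like:

```python
import math

PEAK_LR = 2e-5       # peak learning rate from the card
TOTAL_STEPS = 5_721  # 2 epochs over 366K examples at batch size 128

def cosine_lr(step: int, peak: float = PEAK_LR, total: int = TOTAL_STEPS) -> float:
    """Cosine decay from `peak` to 0 over `total` steps (no warmup assumed)."""
    return 0.5 * peak * (1 + math.cos(math.pi * step / total))

# Learning rate starts at the peak, passes ~1e-5 at the midpoint,
# and decays smoothly to 0 by the final step.
```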
## Evaluation Setup
- Inference: vLLM-TPU (native) on v5p-8
- Agent: Terminus-2 (temperature=0.6)
- Environments: Daytona
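The exact serving flags used for evaluation are not listed in the card; a hypothetical minimal launch of an OpenAI-compatible vLLM endpoint for this model, assuming only the 32K context length stated above, would be:

```shell
# Hypothetical sketch: actual eval flags and TPU-specific settings are not
# documented in this card.
vllm serve AlienKevin/nemotron-terminal-8b-repro \
  --max-model-len 32768
```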
## Links
- Tracking issue: marin-community/marin#3490
- Training code: experiments/exp3490b
- Eval code: experiments/exp3490b_eval