---
license: apache-2.0
language:
- en
---

# **K2-V2**

<img src="figures/K2.LOGO.PRIMARY.RGB.png" width="100" alt="K2-V2 model logo"/>

📚 [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) - 📝 [Training Code](https://github.com/llm360/k2v2_train) - 🏢 [Evaluation Code](https://github.com/llm360/eval360) 

🗂️ [Pretraining Data: TxT360](https://huggingface.co/datasets/LLM360/TxT360) - 🗂️ [Midtraining Data: TxT360-Midas](https://huggingface.co/datasets/LLM360/TxT360-Midas) - 🗂️ [SFT Data: TxT360-3efforts](https://huggingface.co/datasets/LLM360/TxT360-3efforts)


K2-V2 is our most capable fully open model to date, and one of the strongest open-weight models in its class. It uses a 70B-parameter dense transformer architecture and represents the latest advancement in the LLM360 model family.

<img src="figures/sft-models.png" width="400" alt="K2-V2 SFT results"/>

Beyond standard competencies such as factual knowledge and conversational ability, K2-V2 demonstrates strong long-context consistency, deep mathematical understanding, and robust reasoning skills. These capabilities serve as building blocks for sophisticated downstream applications, such as solving complex math problems and executing agentic workflows.

<img src="figures/base-models.png" width="400" alt="K2-V2 GPQA results"/>

---

## **Quick Start**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 70B base model; torch_dtype="auto" keeps the checkpoint's native
# precision, and device_map="auto" shards the weights across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "LLM360/K2-V2",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("LLM360/K2-V2")

prompt = "Explain why the derivative of sin(x) is cos(x)."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
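
At full 16-bit precision, 70B parameters occupy roughly 140 GB of weights, so multi-GPU or quantized inference is typically required. Below is a minimal 4-bit loading sketch using `BitsAndBytesConfig` from `transformers` (requires `bitsandbytes`); the quantization settings are illustrative assumptions, not an officially validated configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative 4-bit NF4 quantization config (assumes `bitsandbytes` is installed).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "LLM360/K2-V2",
    quantization_config=bnb_config,
    device_map="auto",  # shard across available GPUs
)
tokenizer = AutoTokenizer.from_pretrained("LLM360/K2-V2")
```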

---

## **Evaluation Summary**

Below we report performance across general, reasoning, mathematical, and coding benchmarks. Scores for the K2-V2 checkpoints (base → mid-4) show the impact of staged mid-training on reasoning quality. In each row, **bold** marks the best score and <u>underline</u> the second best.

| Task / Model | base | mid-1 | mid-2 | mid-3 | mid-4 | Qwen2.5-72B | Llama3.0-70B | Llama3.1-70B | Olmo3-32B |
|--------------|------|-------|-------|-------|-------|--------------|---------------|---------------|------------|
| **General Tasks** | | | | | | | | | |
| **MMLU** | 74.3 | 74.4 | 73.5 | 75.0 | 75.2 | **86.1** | <u>79.5</u> | 79.3 | 75.2 |
| **MMLU-Pro** | 43.7 | 46.8 | 48.1 | **59.8** | 57.0 | <u>58.1</u> | 52.8 | 53.8 | 49.6 |
| **BBH** | 68.4 | 79.8 | 81.1 | 82.2 | <u>83.2</u> | **86.3** | 82.2 | 82.1 | 77.6 |
| **HELLASWAG** | <u>87.8</u> | 86.9 | 86.6 | 86.6 | 86.0 | 87.6 | **88.0** | 85.0 | 84.8 |
| **WINOGRANDE** | 82.6 | 83.7 | 83.7 | 83.7 | 83.0 | 83.9 | <u>85.3</u> | 79.8 | **90.3** |
| **PIQA** | 84.2 | 84.0 | 83.3 | 82.9 | 83.1 | 83.5 | <u>84.6</u> | 84.3 | **85.6** |
| **TRUTHFULQA** | 54.0 | 54.9 | 55.1 | <u>55.8</u> | 53.9 | **60.5** | 45.6 | 49.7 | 54.9 |
| **Math & STEM Tasks** | | | | | | | | | |
| **GPQA-DIAMOND** | 26.3 | 31.3 | 27.8 | <u>43.9</u> | **55.1** | 34.9 | 21.2 | 27.3 | 30.3 |
| **GSM8K** | 68.0 | 76.4 | 82.1 | **93.6** | <u>92.5</u> | 91.2 | 83.2 | 81.1 | 80.5 |
| **MATH** | 27.8 | 38.2 | 41.1 | **94.7** | <u>91.4</u> | 58.5 | 41.9 | 41.6 | 43.4 |
| **AIME 2025** | 0.0 | 17.6 | 25.1 | **53.2** | <u>46.9</u> | 1.7 | 0.1 | 0.2 | 14.7 |
| **ARC-CHALLENGE** | 64.9 | 66.4 | 66.4 | 66.0 | 66.3 | **72.4** | <u>69.2</u> | 64.9 | 65.4 |
| **Coding Tasks** | | | | | | | | | |
| **MBPP** | 57.6 | 57.8 | 58.2 | 59.8 | 61.8 | **75.4** | <u>69.2</u> | 64.4 | 60.2 |
| **HUMANEVAL** | 50.0 | 51.2 | <u>53.7</u> | **54.3** | **54.3** | **54.3** | 42.1 | 50.6 | 36.0 |
| **Logic Puzzles** | | | | | | | | | |
| **COUNTDOWN** | 1.3 | <u>53.3</u> | 53.1 | 35.9 | **75.6** | 6.0 | 1.0 | 0.5 | 23.2 |
| **KK-4 PEOPLE** | 4.8 | 44.9 | <u>68.0</u> | 64.5 | **92.9** | 26.1 | 4.2 | 7.6 | 42.4 |
| **KK-8 PEOPLE** | 0.5 | 23.2 | 41.3 | <u>51.6</u> | **82.8** | 5.7 | 1.1 | 1.3 | 13.0 |
| **ORDER-15 ITEMS** | 4.7 | 30.7 | 47.2 | <u>55.8</u> | **87.6** | 37.0 | 3.5 | 4.5 | 25.0 |
| **ORDER-30 ITEMS** | 0.0 | 0.3 | 3.0 | <u>34.1</u> | **40.3** | 0.7 | 0.2 | 0.1 | 0.6 |
| **Instruction Following** | | | | | | | | | |
| **IFEVAL** | 17.4 | 26.2 | 28.5 | <u>34.5</u> | 26.7 | **40.3** | 15.1 | 17.4 | 13.2 |
| **Arabic** | | | | | | | | | |
| **MMLU-Arabic** | 65.4 | 66.1 | 64.5 | 66.6 | 65.5 | **74.1** | 65.0 | <u>66.8</u> | 47.8 |


Below we report the evaluation results for K2-V2 after supervised fine-tuning (SFT). These variants correspond to three levels of reasoning effort (Low < Medium < High).

| Metric / Model | **K2 Low**<br><sub>Dense · 70B</sub> | **K2 Medium**<br><sub>Dense · 70B</sub> | **K2 High**<br><sub>Dense · 70B</sub> | **Olmo3 Think SFT**<br><sub>Dense · 32B · No RL</sub> | **Olmo3 Think**<br><sub>Dense · 32B · RL</sub> | **GLM-4.5 Air**<br><sub>MoE · 106B A12B</sub> | **MiniMax-M2**<br><sub>MoE · 230B A10B</sub> | **Qwen3 235B**<br><sub>MoE · 235B A22B · Reasoning</sub> | **Qwen 2.5 72B**<br><sub>Dense · 72B</sub> |
|--------|--------------------------------------|------------------------------------------|----------------------------------------|------------------------------------------------------|--------------------------------------------------|----------------------------------------------------|------------------------------------------------------|--------------------------------------------------------------------|-------------------------------------------|
| **LongBench V2** | 40.7 | 41.3 | 42.6 | 42.8 | 47.1 | 49.4 | 55.8 | 60.9 | 47.2 |
| **AIME25** | 27.3 | 62.0 | 80.2 | 68.3 | 73.3 | 81.3 | 75.8 | 88.8 | 15.2 |
| **HMMT25** | 19.0 | 45.6 | 71.4 | 43.3 | 50.8 | 73.3 | 63.5 | 84.2 | 9.8 |
| **GSM8K** | 92.4 | 92.0 | 94.8 | 96.1 | 95.7 | 96.1 | 95.4 | 93.5 | 85.8 |
| **Minerva** | 85.0 | 90.6 | 94.5 | 96.9 | 97.3 | 94.9 | 85.3 | 98.0 | 82.1 |
| **GPQA-D** | 48.5 | 60.6 | 69.3 | 58.0 | 59.8 | 75.3 | 76.2 | 80.7 | 50.5 |
| **MBPP** | 71.0 | 75.8 | 84.8 | 87.6 | 91.6 | 82.8 | 83.8 | 96.2 | 80.0 |
| **HumanEval** | 82.3 | 91.5 | 91.5 | 96.3 | 96.3 | 97.6 | 89.6 | 94.5 | 85.4 |
| **LCBv6** | 39.9 | 51.3 | 67.0 | 67.9 | 67.6 | 67.8 | 79.2 | 72.8 | 36.7 |

Please refer to our [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) for detailed evaluation results.

---

## **Datasets & Mixtures**

K2-V2 training is organized into three stages, each using a transparent, publicly released mixture:

### **Pretraining Mix**

* **TxT360**: large-scale natural-text corpus spanning web content, books, code, and multilingual sources
* Mixture designed for stable scaling and broad general-knowledge coverage
* ~12T tokens
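
As a rough sanity check on training scale, the widely used C ≈ 6·N·D approximation relates parameter count N and token count D to total pretraining compute. The back-of-the-envelope estimate below is ours, not a figure from the tech report:

```python
# Back-of-the-envelope pretraining compute via the common C ≈ 6·N·D rule.
N = 70e9   # model parameters
D = 12e12  # pretraining tokens (~12T)
C = 6 * N * D
print(f"~{C:.1e} FLOPs")  # ~5.0e+24 FLOPs
```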

### **Mid-Training Mix**

* **TxT360-Midas**: reasoning-oriented data plus long-context extensions
* Domain-focused sources: math, programming, scientific literature
* Synthetic expansions where natural data is scarce

### **SFT Mix**

* See [K2-V2-Instruct](https://huggingface.co/LLM360/K2-V2-Instruct) and the [TxT360-3efforts](https://huggingface.co/datasets/LLM360/TxT360-3efforts) dataset

All mixtures, filtering rules, and data sources are fully released for reproducibility.

Please refer to our [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) for detailed datasets and mixtures information.

---

## **Model Description**
- **Model type:** K2-V2 follows a standard decoder-only transformer architecture with grouped-query attention and RMSNorm.
- **Training stage:** Pre-training
- **Language(s) (NLP):** English
- **License:** Apache 2.0


| Model Hyperparameter      | Value |
| ----------- | ----------- |
| Total Parameters      | 70B       |
| Hidden Size   | 8,192        |
| Intermediate Size (FFN)   | 28,672        |
| Number of Attention Heads   | 64        |
| Number of Layers  | 80        |
| RMSNorm ɛ  | 1e-5        |
| Pre-training Seq Length   | 8,192        |
| Max Mid-training Seq Length   | 524,288        |
| Vocab Size | 250,000 |
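
As a sanity check, these hyperparameters roughly reproduce the 70B total. The sketch below is a minimal estimate; the KV-head count, the gated (SwiGLU-style) MLP layout, and untied embeddings are our assumptions, not values stated in this card:

```python
# Rough parameter-count estimate from the table above.
# Assumptions (not confirmed in this card): Llama-style gated MLP,
# 8 KV heads for grouped-query attention, untied embeddings.
hidden, ffn, layers, heads, vocab = 8192, 28672, 80, 64, 250_000
kv_heads = 8                   # assumed GQA setting
head_dim = hidden // heads     # 128

attn = hidden * hidden * 2 + hidden * kv_heads * head_dim * 2  # Q,O + K,V
mlp = 3 * hidden * ffn               # gate, up, down projections
per_layer = attn + mlp + 2 * hidden  # + two RMSNorm weight vectors
embed = 2 * vocab * hidden           # input + output embeddings (untied)

total = layers * per_layer + embed
print(f"~{total / 1e9:.1f}B parameters")  # ~72.5B, i.e. the 70B class
```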


---

## **Intended Use**

K2-V2 is designed for:

* research on large language models and reasoning
* downstream fine-tuning, e.g., instruction following, agents, domain models (see the LoRA sketch below)
* experimentation with long-context architectures
* open, transparent benchmarking of LLM scaling

K2-V2 is **not** instruction-tuned. For aligned conversational use, please see [K2-V2-Instruct](https://huggingface.co/LLM360/K2-V2-Instruct).
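
For the fine-tuning use case above, a parameter-efficient setup is a common starting point. The sketch below assumes the PEFT library and Llama-style attention projection names (`q_proj`, `v_proj`); for the authors' actual recipes, see the [Training Code](https://github.com/llm360/k2v2_train).

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "LLM360/K2-V2", torch_dtype="auto", device_map="auto"
)

# Illustrative LoRA config; ranks and target modules are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed Llama-style attention names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # small fraction of the 70B base
```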

---

## **Limitations**

* May generate incorrect or hallucinated content, especially when asked about facts not seen during training
* Not optimized for safety, moderation, or refusal behavior (base model)
* Long-context performance depends on prompt quality and retrieval structure
* Primarily trained on English; multilingual capabilities are limited
* Inference cost is high due to the 70B parameter size

---

## Citation

If you use K2-V2 in your research, please cite the following:

```bibtex
@misc{k2team2025k2v2360openreasoningenhancedllm,
      title={K2-V2: A 360-Open, Reasoning-Enhanced LLM}, 
      author={K2 Team and Zhengzhong Liu and Liping Tang and Linghao Jin and Haonan Li and Nikhil Ranjan and Desai Fan and Shaurya Rohatgi and Richard Fan and Omkar Pangarkar and Huijuan Wang and Zhoujun Cheng and Suqi Sun and Seungwook Han and Bowen Tan and Gurpreet Gosal and Xudong Han and Varad Pimpalkhute and Shibo Hao and Ming Shan Hee and Joel Hestness and Haolong Jia and Liqun Ma and Aaryamonvikram Singh and Daria Soboleva and Natalia Vassilieva and Renxi Wang and Yingquan Wu and Yuekai Sun and Taylor Killian and Alexander Moreno and John Maggs and Hector Ren and Guowei He and Hongyi Wang and Xuezhe Ma and Yuqi Wang and Mikhail Yurochkin and Eric P. Xing},
      year={2025},
      eprint={2512.06201},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.06201}, 
}
```