LLM360
/

K2-V2

@@ -61,6 +61,14 @@ Below we report performance across general, reasoning, mathematical, and coding
 | **HUMANEVAL** | 50.0 | 51.2 | <u>53.7</u> | **54.3** | **54.3** | **54.3** | 42.1 | 50.6 | 36.0 |
 Please refer to our [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) for detailed evaluation results.
 ---

 | **HUMANEVAL** | 50.0 | 51.2 | <u>53.7</u> | **54.3** | **54.3** | **54.3** | 42.1 | 50.6 | 36.0 |
+Below we report the evaluation results for K2-V2 after supervised fine-tuning (SFT). These variants correspond to three levels of reasoning effort (Low < Medium < High).
+| Model Specifications | LongBench V2 | AIME25 | HMMT25 | GSM8K | Minerva | GPQA-D | MBPP | HumanEval | LCBv6 |
+|----------------------|--------------|--------|--------|-------|---------|--------|-------|------------|--------|
+| **K2 Low**<br><sub>Dense · 70B</sub> | 40.7 | 27.3 | 19.0 | 92.4 | 85.0 | 48.5 | 71.0 | 82.3 | 39.9 |
+| **K2 Medium**<br><sub>Dense · 70B</sub> | 41.3 | 62.0 | 45.6 | 92.0 | 90.6 | 60.6 | 75.8 | 84.2 | 51.3 |
+| **K2 High**<br><sub>Dense · 70B</sub> | 42.6 | 80.2 | 71.4 | 94.8 | 94.5 | 69.3 | 84.8 | 91.5 | 67.0 |
 Please refer to our [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) for detailed evaluation results.
 ---