Update README.md (#11)
- Update README.md (74713718fb8812f89937ede411db6c156c2f3d8d)
README.md
CHANGED
@@ -61,6 +61,14 @@ Below we report performance across general, reasoning, mathematical, and coding
| **HUMANEVAL** | 50.0 | 51.2 | <u>53.7</u> | **54.3** | **54.3** | **54.3** | 42.1 | 50.6 | 36.0 |


+Below we report the evaluation results for K2-V2 after supervised fine-tuning (SFT). These variants correspond to three levels of reasoning effort (Low < Medium < High).
+
+| Model Specifications | LongBench V2 | AIME25 | HMMT25 | GSM8K | Minerva | GPQA-D | MBPP | HumanEval | LCBv6 |
+|----------------------|--------------|--------|--------|-------|---------|--------|-------|------------|--------|
+| **K2 Low**<br><sub>Dense · 70B</sub> | 40.7 | 27.3 | 19.0 | 92.4 | 85.0 | 48.5 | 71.0 | 82.3 | 39.9 |
+| **K2 Medium**<br><sub>Dense · 70B</sub> | 41.3 | 62.0 | 45.6 | 92.0 | 90.6 | 60.6 | 75.8 | 84.2 | 51.3 |
+| **K2 High**<br><sub>Dense · 70B</sub> | 42.6 | 80.2 | 71.4 | 94.8 | 94.5 | 69.3 | 84.8 | 91.5 | 67.0 |
+
Please refer to our [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) for detailed evaluation results.

---