Update README.md
README.md CHANGED

@@ -7,7 +7,7 @@ language:
 - ko
 ---

-*Last update:
+*Last update: 11th June 2025*

 # Introduction

@@ -44,8 +44,8 @@ The benchmarks and corresponding scores listed in the table below are taken dire
 |HumanEval|0-shot|30.5|68.3|+123.93%|
 |MBPP|3-shot|47.5|60.3|+26.95%|
 |MATH|4-shot, maj@4|13.1|40.2*|+206.87%|
-|GSM8K|8-shot, maj@8|52.2|
-||||**Average**|**+
+|GSM8K|8-shot, maj@8|52.2|80.21|+53.66%|
+||||**Average**|**+34.25%**|

 \* : We report the 4-shot, maj@1 score instead of the 4-shot, maj@4.

@@ -117,11 +117,11 @@ The benchmarks and corresponding scores listed in the table below are taken dire
 |IFEval|-|80.4|74.02|-7.94%|
 |HumanEval|0-shot|72.6|68.3|-5.92%|
 |MBPP|0-shot|72.8|57.93|-20.43%|
-|GSM8K|8-shot, CoT|84.5|
+|GSM8K|8-shot, CoT|84.5|80.21|-5.08%|
 |MATH|0-shot, CoT|51.9|49.68|-4.28%|
 |ARC Challenge|0-shot|83.4|74.2|-11.03%|
 |GPQA|0-shot, CoT|32.8|18.53|-43.51%|
-||||**Average**|**-15.
+||||**Average**|**-15.04%**|

 #### Llama 3.2
 The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).

@@ -132,12 +132,12 @@ The benchmarks and corresponding scores listed in the table below are taken dire
 |Open-rewrite eval*|0-shot, rougeL|41.6|40.1|-|-|-|
 |TLDR9+|test, 1-shot, rougeL|16.8|19|-|-|-|
 |IFEval|-|59.5|77.4|74.02|+24.40%|-4.37%|
-|GSM8K|8-shot, CoT|44.4|77.7|
+|GSM8K|8-shot, CoT|44.4|77.7|80.21|+80.65%|+3.23%|
 |MATH|0-shot, CoT|30.6|48|49.68|+62.35%|+3.50%|
 |ARC Challenge|0-shot|59.4|78.6|74.2|+24.92%|-5.6%|
 |GPQA|0-shot|27.2|32.8|25.45|-6.43%|-22.41%|
 |Hellaswag|0-shot|41.2|69.8|61.35|+48.91%|-12.11%|
-|||||**Average**|**+
+|||||**Average**|**+41.82%**|**-2.49%**|

 ### Comparison to the Phi series by Microsoft
 The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).

@@ -147,7 +147,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
 |MMLU|5-shot|68.8|75.7|56.3|57.93|-15.80%|-23.47%|+2.90%|
 |HellaSwag|5-shot|76.7|77|53.6|68.97|-10.08%|-10.43%|+28.68%|
 |ANLI|7-shot|52.8|58.1|42.5|47.99|-9.11%|-17.40%|+12.92%|
-|GSM-8K|8-shot, CoT|82.5|89.6|61.1|
+|GSM-8K|8-shot, CoT|82.5|89.6|61.1|80.21|-2.78%|-10.48%|+31.28%|
 |MATH|0-shot, CoT|41.3|34.6|-|49.68|+20.29%|+43.58%|-|
 |MedQA|2-shot|53.8|65.4|40.9|42.1|-21.75%|-35.63%|+2.93%|
 |AGIEval|0-shot|37.5|45.1|29.8|-|-|-|-|
@@ -166,7 +166,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
 |MBPP|3-shot|70|71.7|60.6|60.3|-13.86%|-15.90%|-0.50%|
 |GPQA|2-shot, CoT|32.8|34.3|-|23.44|-28.54%|-31.66%|-|
 |MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
-||||||**Average**|**-
+||||||**Average**|**-9.87%**|**-13.25%**|**+10.56%**|

 ## Evaluation Appendix

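A reading note on the percentage columns above: every visible delta is consistent with a plain relative difference of one score against the corresponding baseline, and the **Average** rows appear to be the mean of those per-row percentages over each full table (the tables are only partially visible in this diff). The following is a minimal sketch of that computation for illustration; the helper name `relative_diff` is hypothetical and not part of the repository.

```python
# Minimal sketch (assumption): the +/- columns look like relative differences
# of a score against a baseline, reported in percent, and the Average rows
# look like the mean of those per-benchmark percentages.
def relative_diff(baseline: float, score: float) -> float:
    """Relative change of `score` versus `baseline`, in percent."""
    return (score - baseline) / baseline * 100

# Spot checks against rows in the diff above:
print(f"{relative_diff(30.5, 68.3):+.2f}%")   # HumanEval row: +123.93%
print(f"{relative_diff(52.2, 80.21):+.2f}%")  # updated GSM8K row: +53.66%
print(f"{relative_diff(80.4, 74.02):+.2f}%")  # IFEval row: -7.94%
```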
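On the sampling notation used in the first table and its footnote (`maj@4`, `maj@8`, maj@1): `maj@k` is commonly used to mean majority voting over k sampled answers, with `maj@1` reducing to a single sample. Below is a minimal sketch under that common reading; `maj_at_k` is a hypothetical helper for illustration, not code from this repository.

```python
from collections import Counter

def maj_at_k(sampled_answers: list[str]) -> str:
    """maj@k: sample k answers and keep the most frequent one (majority vote).
    With k = 1 (maj@1) this is just the single sampled answer."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Example: 4 sampled answers to a math problem, the majority answer wins.
print(maj_at_k(["42", "41", "42", "42"]))  # -> "42"
```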