JH-Motif committed on
Commit f4b8318 · 1 Parent(s): d9bd2c9

Update README.md

Files changed (1):
1. README.md (+9 -9)
README.md CHANGED
@@ -7,7 +7,7 @@ language:
   - ko
  ---

- *Last update: 9th June 2025*
+ *Last update: 11th June 2025*

  # Introduction

@@ -44,8 +44,8 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |HumanEval|0-shot|30.5|68.3|+123.93%|
  |MBPP|3-shot|47.5|60.3|+26.95%|
  |MATH|4-shot, maj@4|13.1|40.2*|+206.87%|
- |GSM8K|8-shot, maj@8|52.2|77.71|+48.87%|
- ||||**Average**|**+33.77%**|
+ |GSM8K|8-shot, maj@8|52.2|80.21|+53.66%|
+ ||||**Average**|**+34.25%**|

  \* : We report the 4-shot, maj@1 score instead of the 4-shot, maj@4.

@@ -117,11 +117,11 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |IFEval|-|80.4|74.02|-7.94%|
  |HumanEval|0-shot|72.6|68.3|-5.92%|
  |MBPP|0-shot|72.8|57.93|-20.43%|
- |GSM8K|8-shot, CoT|84.5|77.71|-8.04%|
+ |GSM8K|8-shot, CoT|84.5|80.21|-5.08%|
  |MATH|0-shot, CoT|51.9|49.68|-4.28%|
  |ARC Challenge|0-shot|83.4|74.2|-11.03%|
  |GPQA|0-shot, CoT|32.8|18.53|-43.51%|
- ||||**Average**|**-15.36%**|
+ ||||**Average**|**-15.04%**|

  #### Llama 3.2
  The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
@@ -132,12 +132,12 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |Open-rewrite eval*|0-shot, rougeL|41.6|40.1|-|-|-|
  |TLDR9+|test, 1-shot, rougeL|16.8|19|-|-|-|
  |IFEval|-|59.5|77.4|74.02|+24.40%|-4.37%|
- |GSM8K|8-shot, CoT|44.4|77.7|74.9|+68.69%|-3.60%|
+ |GSM8K|8-shot, CoT|44.4|77.7|80.21|+80.65%|+3.23%|
  |MATH|0-shot, CoT|30.6|48|49.68|+62.35%|+3.50%|
  |ARC Challenge|0-shot|59.4|78.6|74.2|+24.92%|-5.6%|
  |GPQA|0-shot|27.2|32.8|25.45|-6.43%|-22.41%|
  |Hellaswag|0-shot|41.2|69.8|61.35|+48.91%|-12.11%|
- |||||**Average**|**+39.42%**|**-3.86%**|
+ |||||**Average**|**+41.82%**|**-2.49%**|

  ### Comparison to the Phi series by Microsoft
  The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
@@ -147,7 +147,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |MMLU|5-shot|68.8|75.7|56.3|57.93|-15.80%|-23.47%|+2.90%|
  |HellaSwag|5-shot|76.7|77|53.6|68.97|-10.08%|-10.43%|+28.68%|
  |ANLI|7-shot|52.8|58.1|42.5|47.99|-9.11%|-17.40%|+12.92%|
- |GSM-8K|8-shot, CoT|82.5|89.6|61.1|76.5|-7.27%|-14.62%|+25.20%|
+ |GSM-8K|8-shot, CoT|82.5|89.6|61.1|80.21|-2.78%|-10.48%|+31.28%|
  |MATH|0-shot, CoT|41.3|34.6|-|49.68|+20.29%|+43.58%|-|
  |MedQA|2-shot|53.8|65.4|40.9|42.1|-21.75%|-35.63%|+2.93%|
  |AGIEval|0-shot|37.5|45.1|29.8|-|-|-|-|
@@ -166,7 +166,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |MBPP|3-shot|70|71.7|60.6|60.3|-13.86%|-15.90%|-0.50%|
  |GPQA|2-shot, CoT|32.8|34.3|-|23.44|-28.54%|-31.66%|-|
  |MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
- ||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|
+ ||||||**Average**|**-9.87%**|**-13.25%**|**+10.56%**|

  ## Evaluation Appendix

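For reference, the delta columns and **Average** rows touched by this commit follow from simple percentage arithmetic. The sketch below is a minimal illustration in plain Python; the helper name is hypothetical and not part of the repository, and the **Average** rows are assumed to be the mean of each table's delta column over the rows where a delta is reported.

```python
# Sketch: recomputing the per-benchmark deltas updated in this commit.
# `relative_diff` is an illustrative helper, not code from the repository.

def relative_diff(baseline: float, score: float) -> float:
    """Relative difference of `score` versus `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0

# GSM8K row of the first table: the two scores are 52.2 and 80.21.
print(f"{relative_diff(52.2, 80.21):+.2f}%")   # +53.66%

# GSM8K row of the Llama 3.1 comparison: 84.5 versus 80.21.
print(f"{relative_diff(84.5, 80.21):+.2f}%")   # -5.08%
```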