JH-Motif committed on
Commit f4b8318 · 1 Parent(s): d9bd2c9

Update README.md

Files changed (1):
1. README.md (+9 -9)
README.md CHANGED
@@ -7,7 +7,7 @@ language:
   - ko
  ---

- *Last update: 9th June 2025*
+ *Last update: 11th June 2025*

  # Introduction

@@ -44,8 +44,8 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |HumanEval|0-shot|30.5|68.3|+123.93%|
  |MBPP|3-shot|47.5|60.3|+26.95%|
  |MATH|4-shot, maj@4|13.1|40.2*|+206.87%|
- |GSM8K|8-shot, maj@8|52.2|77.71|+48.87%|
- ||||**Average**|**+33.77%**|
+ |GSM8K|8-shot, maj@8|52.2|80.21|+53.66%|
+ ||||**Average**|**+34.25%**|

  \* : We report the 4-shot, maj@1 score instead of the 4-shot, maj@4.

@@ -117,11 +117,11 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |IFEval|-|80.4|74.02|-7.94%|
  |HumanEval|0-shot|72.6|68.3|-5.92%|
  |MBPP|0-shot|72.8|57.93|-20.43%|
- |GSM8K|8-shot, CoT|84.5|77.71|-8.04%|
+ |GSM8K|8-shot, CoT|84.5|80.21|-5.08%|
  |MATH|0-shot, CoT|51.9|49.68|-4.28%|
  |ARC Challenge|0-shot|83.4|74.2|-11.03%|
  |GPQA|0-shot, CoT|32.8|18.53|-43.51%|
- ||||**Average**|**-15.36%**|
+ ||||**Average**|**-15.04%**|

  #### Llama 3.2
  The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
@@ -132,12 +132,12 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |Open-rewrite eval*|0-shot, rougeL|41.6|40.1|-|-|-|
  |TLDR9+|test, 1-shot, rougeL|16.8|19|-|-|-|
  |IFEval|-|59.5|77.4|74.02|+24.40%|-4.37%|
- |GSM8K|8-shot, CoT|44.4|77.7|74.9|+68.69%|-3.60%|
+ |GSM8K|8-shot, CoT|44.4|77.7|80.21|+80.65%|+3.23%|
  |MATH|0-shot, CoT|30.6|48|49.68|+62.35%|+3.50%|
  |ARC Challenge|0-shot|59.4|78.6|74.2|+24.92%|-5.6%|
  |GPQA|0-shot|27.2|32.8|25.45|-6.43%|-22.41%|
  |Hellaswag|0-shot|41.2|69.8|61.35|+48.91%|-12.11%|
- |||||**Average**|**+39.42%**|**-3.86%**|
+ |||||**Average**|**+41.82%**|**-2.49%**|

  ### Comparison to the Phi series by Microsoft
  The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
@@ -147,7 +147,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |MMLU|5-shot|68.8|75.7|56.3|57.93|-15.80%|-23.47%|+2.90%|
  |HellaSwag|5-shot|76.7|77|53.6|68.97|-10.08%|-10.43%|+28.68%|
  |ANLI|7-shot|52.8|58.1|42.5|47.99|-9.11%|-17.40%|+12.92%|
- |GSM-8K|8-shot, CoT|82.5|89.6|61.1|76.5|-7.27%|-14.62%|+25.20%|
+ |GSM-8K|8-shot, CoT|82.5|89.6|61.1|80.21|-2.78%|-10.48%|+31.28%|
  |MATH|0-shot, CoT|41.3|34.6|-|49.68|+20.29%|+43.58%|-|
  |MedQA|2-shot|53.8|65.4|40.9|42.1|-21.75%|-35.63%|+2.93%|
  |AGIEval|0-shot|37.5|45.1|29.8|-|-|-|-|
@@ -166,7 +166,7 @@ The benchmarks and corresponding scores listed in the table below are taken dire
  |MBPP|3-shot|70|71.7|60.6|60.3|-13.86%|-15.90%|-0.50%|
  |GPQA|2-shot, CoT|32.8|34.3|-|23.44|-28.54%|-31.66%|-|
  |MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
- ||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|
+ ||||||**Average**|**-9.87%**|**-13.25%**|**+10.56%**|

  ## Evaluation Appendix

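For reference, the delta columns and **Average** rows touched by this commit follow from simple percentage arithmetic. The sketch below is a minimal illustration in plain Python; the helper name is hypothetical and not part of the repository, and the **Average** rows are assumed to be the mean of each table's delta column over the rows where a delta is reported.

```python
# Sketch: recomputing the per-benchmark deltas updated in this commit.
# `relative_diff` is an illustrative helper, not code from the repository.

def relative_diff(baseline: float, score: float) -> float:
    """Relative difference of `score` versus `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0

# GSM8K row of the first table: the two scores are 52.2 and 80.21.
print(f"{relative_diff(52.2, 80.21):+.2f}%")   # +53.66%

# GSM8K row of the Llama 3.1 comparison: 84.5 versus 80.21.
print(f"{relative_diff(84.5, 80.21):+.2f}%")   # -5.08%
```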