JH-Motif committed
Commit d5d4fbf · verified · 1 parent: 2f6f3b8

Update README.md

Files changed (1): README.md (+33, -2)
README.md CHANGED

@@ -45,7 +45,7 @@
 #### Gemma 1 & 2
 The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
 
- _Note: Although referred to as "2B", Gemma 2 2B actually has <U>2.6 billion</U> parameters._
+ *Note: Although referred to as "2B", Gemma 2 2B actually has <U>2.6 billion</U> parameters.*
 
 |Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 1B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
 |---|---|---|---|---|---|---|---|---|---|---|
@@ -163,4 +163,35 @@
 |MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
 ||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|
 
- \*: We were unable to find an evaluation framework for this benchmark.
+ \*: We were unable to find an evaluation framework for this benchmark.
+
+
+ ## Evaluation Appendix
+
+ In the comparisons presented above, Motif 2.6B showed average performance improvements of -15.36% and -14.78% over Llama 3 8B and Gemma 2 9B, respectively, based on the benchmark scores reported in their original technical reports.
+
+ However, when compared against the benchmarks and scores reported in the Qwen 2.5 technical report, Motif 2.6B demonstrated a +18.55% average improvement over Llama 3 8B and a +2.63% improvement over Gemma 2 9B. See the table below for details.
+
+ ### Comparison to Llama 3 8B and Gemma 2 9B based on scores from the [Qwen2.5 technical report](https://arxiv.org/abs/2412.15115)
+
+ |Benchmark|Metric|Llama 3 8B|Gemma 2 9B|Motif 2.6B|Improvement(over Llama 3 8B)|Improvement(over Gemma 2 9B)|
+ |---|---|---|---|---|---|---|
+ |MMLU|5-shot|66.6|71.3|57.93|-13.02%|-18.75%|
+ |MMLU-pro|5-shot|35.4|44.7|28.4|-19.77%|-36.47%|
+ |MMLU-redux|5-shot|61.6|67.9|59.54|-3.34%|-12.31%|
+ |BBH|3-shot|57.7|68.2|39.28|-31.92%|-42.40%|
+ |ARC-C|25-shot|59.3|68.2|75.08|+26.61%|+10.09%|
+ |TruthfulQA|0-shot|44|45.3|41.55|-5.56%|-8.27%|
+ |Winogrande|5-shot|77.4|79.5|67.09|-13.32%|-15.61%|
+ |HellaSwag|10-shot|82.1|81.9|69.88|-14.88%|-14.68%|
+ |GPQA|5-shot|25.8|25.8|29.24|+13.33%|+13.33%|
+ |TheoremQA|5-shot|22.1|28.9|-|-|-|
+ |MATH|4-shot|20.5|37.7|40.2|+96.10%|+6.63%|
+ |MMLU-stem|5-shot|55.3|65.1|52.9|-4.34%|-18.74%|
+ |GSM8K|4-shot|55.3|70.7|68.84|+24.48%|-2.63%|
+ |HumanEval|0-shot|33.5|37.8|68.3|+103.88%|+80.69%|
+ |HumanEval+|0-shot|29.3|30.5|62.2|+112.29%|+103.93%|
+ |MBPP|0-shot|53.9|62.2|60.3|+11.87%|-3.05%|
+ |MBPP+|0-shot|44.4|50.6|50.8|+14.41%|+0.40%|
+ |Multi_L-E|0-shot|22.6|34.9|-|-|-|
+ |||||**Average**|**+18.55%**|**+2.63%**|
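
The "Improvement" columns added in this commit appear to be plain relative percentage differences between Motif 2.6B's score and the baseline's, with the per-row values then averaged (rows marked "-" excluded). A minimal sketch, assuming that reading; the helper `relative_improvement` is ours, not a function from this repository:

```python
# Sketch: reproduce the "Improvement" columns as relative percent change,
# rounded to two decimals. Assumption: improvement = (motif - base) / base * 100.

def relative_improvement(motif_score: float, baseline_score: float) -> float:
    """Percent change of Motif 2.6B's score relative to a baseline score."""
    return round((motif_score - baseline_score) / baseline_score * 100, 2)

# Spot-checks against rows of the Qwen2.5-report comparison table (vs Llama 3 8B):
print(relative_improvement(57.93, 66.6))   # MMLU      -> -13.02
print(relative_improvement(75.08, 59.3))   # ARC-C     -> 26.61
print(relative_improvement(68.3, 33.5))    # HumanEval -> 103.88

# The table's bottom-row average over the 16 scored rows (TheoremQA and
# Multi_L-E report no Motif score and are excluded):
over_llama = [-13.02, -19.77, -3.34, -31.92, 26.61, -5.56, -13.32, -14.88,
              13.33, 96.10, -4.34, 24.48, 103.88, 112.29, 11.87, 14.41]
print(round(sum(over_llama) / len(over_llama), 2))  # -> 18.55
```

Under this reading, the reported +18.55% average over Llama 3 8B follows directly from the per-benchmark column.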