Update README.md
#### Gemma 1 & 2

The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).

*Note: Although referred to as "2B", Gemma 2 2B actually has <u>2.6 billion</u> parameters.*

|Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement (over 1 2B)|Improvement (over 1 7B)|Improvement (over 2 2B)|Improvement (over 2 9B)|
|---|---|---|---|---|---|---|---|---|---|---|
|…|…|…|…|…|…|…|…|…|…|…|
|MT Bench|2R. Avg.|8.38|8.7|-|-|6.77|-19.21%|-22.18%|-|-|
|||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**||

\*: We were unable to find an evaluation framework for this benchmark.
## Evaluation Appendix

In the comparisons presented above, Motif 2.6B showed average score differences of -15.36% and -14.78% relative to Llama 3 8B and Gemma 2 9B, respectively, based on the benchmark scores reported in their original technical reports.

However, when compared against the benchmarks and scores reported in the Qwen 2.5 technical report, Motif 2.6B demonstrated a +18.55% average improvement over Llama 3 8B and a +2.63% average improvement over Gemma 2 9B. See the table below for details.

### Comparison to Llama 3 8B and Gemma 2 9B based on scores from the [Qwen2.5 technical report](https://arxiv.org/abs/2412.15115)

|Benchmark|Metric|Llama 3 8B|Gemma 2 9B|Motif 2.6B|Improvement (over Llama 3 8B)|Improvement (over Gemma 2 9B)|
|---|---|---|---|---|---|---|
|MMLU|5-shot|66.6|71.3|57.93|-13.02%|-18.75%|
|MMLU-pro|5-shot|35.4|44.7|28.4|-19.77%|-36.47%|
|MMLU-redux|5-shot|61.6|67.9|59.54|-3.34%|-12.31%|
|BBH|3-shot|57.7|68.2|39.28|-31.92%|-42.40%|
|ARC-C|25-shot|59.3|68.2|75.08|+26.61%|+10.09%|
|TruthfulQA|0-shot|44|45.3|41.55|-5.56%|-8.27%|
|Winogrande|5-shot|77.4|79.5|67.09|-13.32%|-15.61%|
|HellaSwag|10-shot|82.1|81.9|69.88|-14.88%|-14.68%|
|GPQA|5-shot|25.8|25.8|29.24|+13.33%|+13.33%|
|TheoremQA|5-shot|22.1|28.9|-|-|-|
|MATH|4-shot|20.5|37.7|40.2|+96.10%|+6.63%|
|MMLU-stem|5-shot|55.3|65.1|52.9|-4.34%|-18.74%|
|GSM8K|4-shot|55.3|70.7|68.84|+24.48%|-2.63%|
|HumanEval|0-shot|33.5|37.8|68.3|+103.88%|+80.69%|
|HumanEval+|0-shot|29.3|30.5|62.2|+112.29%|+103.93%|
|MBPP|0-shot|53.9|62.2|60.3|+11.87%|-3.05%|
|MBPP+|0-shot|44.4|50.6|50.8|+14.41%|+0.40%|
|MultiPL-E|0-shot|22.6|34.9|-|-|-|
|||||**Average**|**+18.55%**|**+2.63%**|
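The README does not state how the Improvement columns are computed, but every entry in the table above is consistent with the relative difference (candidate − baseline) / baseline, with the Average row taken over the benchmarks where both models have a score. A minimal sketch (the function name and score list below are ours, with values copied from the Llama 3 8B and Motif 2.6B columns; TheoremQA and MultiPL-E are omitted because Motif 2.6B has no score there):

```python
def pct_improvement(candidate: float, baseline: float) -> float:
    """Relative improvement of `candidate` over `baseline`, in percent."""
    return (candidate - baseline) / baseline * 100

# (llama3_8b, motif_2_6b) score pairs, in table order, skipping "-" rows.
llama_pairs = [
    (66.6, 57.93), (35.4, 28.4), (61.6, 59.54), (57.7, 39.28),
    (59.3, 75.08), (44.0, 41.55), (77.4, 67.09), (82.1, 69.88),
    (25.8, 29.24), (20.5, 40.2), (55.3, 52.9), (55.3, 68.84),
    (33.5, 68.3), (29.3, 62.2), (53.9, 60.3), (44.4, 50.8),
]
improvements = [pct_improvement(motif, base) for base, motif in llama_pairs]

print(f"{improvements[0]:+.2f}%")                        # MMLU row: -13.02%
print(f"{sum(improvements) / len(improvements):+.2f}%")  # Average: +18.55%
```

Under this reading, the headline "+18.55% over Llama 3 8B" is an unweighted mean of per-benchmark relative differences, so a few large code-generation gains (HumanEval, HumanEval+) outweigh many smaller knowledge-benchmark losses.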