JH-Motif committed
Commit d5d4fbf · verified · 1 parent: 2f6f3b8

Update README.md

Files changed (1): README.md (+33, -2)
README.md CHANGED

@@ -45,7 +45,7 @@
 #### Gemma 1 & 2
 The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
 
- _Note: Although referred to as "2B", Gemma 2 2B actually has <U>2.6 billion</U> parameters._
+ *Note: Although referred to as "2B", Gemma 2 2B actually has <U>2.6 billion</U> parameters.*
 
 |Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 1B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
 |---|---|---|---|---|---|---|---|---|---|---|
@@ -163,4 +163,35 @@
 |MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
 ||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|
 
- \*: We were unable to find an evaluation framework for this benchmark.
+ \*: We were unable to find an evaluation framework for this benchmark.
+
+
+ ## Evaluation Appendix
+
+ In the comparisons presented above, Motif 2.6B showed average performance improvements of -15.36% and -14.78% over Llama 3 8B and Gemma 2 9B, respectively, based on the benchmark scores reported in their original technical reports.
+
+ However, when compared against the benchmarks and scores reported in the Qwen 2.5 technical report, Motif 2.6B demonstrated a +18.55% average improvement over Llama 3 8B and a +2.63% improvement over Gemma 2 9B. See the table below for details.
+
+ ### Comparison to Llama 3 8B and Gemma 2 9B based on scores from the [Qwen2.5 technical report](https://arxiv.org/abs/2412.15115)
+
+ |Benchmark|Metric|Llama 3 8B|Gemma 2 9B|Motif 2.6B|Improvement(over Llama 3 8B)|Improvement(over Gemma 2 9B)|
+ |---|---|---|---|---|---|---|
+ |MMLU|5-shot|66.6|71.3|57.93|-13.02%|-18.75%|
+ |MMLU-pro|5-shot|35.4|44.7|28.4|-19.77%|-36.47%|
+ |MMLU-redux|5-shot|61.6|67.9|59.54|-3.34%|-12.31%|
+ |BBH|3-shot|57.7|68.2|39.28|-31.92%|-42.40%|
+ |ARC-C|25-shot|59.3|68.2|75.08|+26.61%|+10.09%|
+ |TruthfulQA|0-shot|44|45.3|41.55|-5.56%|-8.27%|
+ |Winogrande|5-shot|77.4|79.5|67.09|-13.32%|-15.61%|
+ |HellaSwag|10-shot|82.1|81.9|69.88|-14.88%|-14.68%|
+ |GPQA|5-shot|25.8|25.8|29.24|+13.33%|+13.33%|
+ |TheoremQA|5-shot|22.1|28.9|-|-|-|
+ |MATH|4-shot|20.5|37.7|40.2|+96.10%|+6.63%|
+ |MMLU-stem|5-shot|55.3|65.1|52.9|-4.34%|-18.74%|
+ |GSM8K|4-shot|55.3|70.7|68.84|+24.48%|-2.63%|
+ |HumanEval|0-shot|33.5|37.8|68.3|+103.88%|+80.69%|
+ |HumanEval+|0-shot|29.3|30.5|62.2|+112.29%|+103.93%|
+ |MBPP|0-shot|53.9|62.2|60.3|+11.87%|-3.05%|
+ |MBPP+|0-shot|44.4|50.6|50.8|+14.41%|+0.40%|
+ |Multi_L-E|0-shot|22.6|34.9|-|-|-|
+ |||||**Average**|**+18.55%**|**+2.63%**|
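
The "Improvement" columns added in this commit appear to be plain relative percentage differences between Motif 2.6B's score and the baseline's, with the per-row values then averaged (rows marked "-" excluded). A minimal sketch, assuming that reading; the helper `relative_improvement` is ours, not a function from this repository:

```python
# Sketch: reproduce the "Improvement" columns as relative percent change,
# rounded to two decimals. Assumption: improvement = (motif - base) / base * 100.

def relative_improvement(motif_score: float, baseline_score: float) -> float:
    """Percent change of Motif 2.6B's score relative to a baseline score."""
    return round((motif_score - baseline_score) / baseline_score * 100, 2)

# Spot-checks against rows of the Qwen2.5-report comparison table (vs Llama 3 8B):
print(relative_improvement(57.93, 66.6))   # MMLU      -> -13.02
print(relative_improvement(75.08, 59.3))   # ARC-C     -> 26.61
print(relative_improvement(68.3, 33.5))    # HumanEval -> 103.88

# The table's bottom-row average over the 16 scored rows (TheoremQA and
# Multi_L-E report no Motif score and are excluded):
over_llama = [-13.02, -19.77, -3.34, -31.92, 26.61, -5.56, -13.32, -14.88,
              13.33, 96.10, -4.34, 24.48, 103.88, 112.29, 11.87, 14.41]
print(round(sum(over_llama) / len(over_llama), 2))  # -> 18.55
```

Under this reading, the reported +18.55% average over Llama 3 8B follows directly from the per-benchmark column.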