JH-Motif committed
Commit 24517e4 · verified · 1 Parent(s): 1c86ca5

Update README.md

Files changed (1):
  1. README.md +27 -7
README.md CHANGED
@@ -20,7 +20,7 @@ To illustrate how much evaluation scores can vary across reports, we provide con
 
 ### Comparison to Mistral
 
- The benchmarks and metrics used are identical to those in the [Mistral 7B technical report](https://arxiv.org/pdf/2310.06825).
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Mistral 7B technical report](https://arxiv.org/pdf/2310.06825).
 
 |Benchmark|Metric|Mistral 7B|Motif 2.6B|Improvement|
 |---|---|---|---|---|
@@ -43,7 +43,7 @@ The benchmarks and metrics used are identical to those in the [Mistral 7B techni
 ### Comparison to Llama
 
 #### Llama 3
- The benchmarks and metrics used are identical to those in the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).
 
 |Benchmark|Metric|Llama 3 8B|Motif 2.6B|Improvement|
 |---|---|---|---|---|
@@ -60,7 +60,7 @@ The benchmarks and metrics used are identical to those in the [Llama 3 technical
 ||||**Average**|**-15.36%**|
 
 #### Llama 3.2
- The benchmarks and metrics used are identical to those in the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
 
 |Benchmark|Metric|Llama 3.2 1B|Llama 3.2 3B|Motif 2.6B|Improvement(over 1B)|Improvement(over 3B)|
 |---|---|---|---|---|---|---|
@@ -78,7 +78,7 @@ The benchmarks and metrics used are identical to those in the [Llama 3.2 officia
 \*: We were unable to find an evaluation framework for this benchmark.
 
 ### Comparison to Phi
- The benchmarks and metrics used are identical to those in the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
 
 |Benchmark|Metric|Phi-3 3.8B|Phi-3 7B|Phi-2 2.7B|Motif 2.6B|Improvement(over 3.8B)|Improvement(over 7B)|Improvement(over 2.7B)|
 |---|---|---|---|---|---|---|---|---|
@@ -111,7 +111,7 @@ The benchmarks and metrics used are identical to those in the [Phi-3 technical r
 ### Comparison to Gemma
 
 #### Gemma 1 & 2
- The benchmarks and metrics used are identical to those in the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
 
 |Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 2B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
 |---|---|---|---|---|---|---|---|---|---|---|
@@ -137,7 +137,7 @@ The benchmarks and metrics used are identical to those in the [Gemma 2 technical
 \*: We were unable to find an evaluation framework for this benchmark.
 
 #### Gemma 3
- The benchmarks and metrics used are identical to those in the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
 
 |Benchmark|Metric|Gemma 3 1B|Gemma 3 4B|Motif 2.6B|Improvement(over 1B)|Improvement(over 4B)|
 |---|---|---|---|---|---|---|
@@ -152,4 +152,24 @@ The benchmarks and metrics used are identical to those in the [Gemma 3 technical
 |MMLU(val)|5-shot|-|48.8|57.93|-|+18.71%|
 |||||**Average**|**+24.71%**|**-8.28%**|
 
- \*: We were unable to find an evaluation framework for this benchmark.
+ \*: We were unable to find an evaluation framework for this benchmark.
+
+ Given that the benchmark set in the Gemma 3 technical report seemed somewhat non-standard, we also compared our results with the Gemma 3 scores from the Qwen3 technical report, which were evaluated independently by the Qwen3 team.
+
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Qwen3 technical report](https://arxiv.org/abs/2505.09388).
+
+ |Benchmark|Metric|Gemma 3 1B|Gemma 3 4B|Motif 2.6B|Improvement(over 1B)|Improvement(over 4B)|
+ |---|---|---|---|---|---|---|
+ |MMLU|5-shot|51.81|59.51|57.93|+9.70%|-2.66%|
+ |MMLU-Redux|5-shot|51.26|56.91|59.54|+16.15%|+4.62%|
+ |MMLU-Pro|5-shot, CoT|24.74|29.23|-|-|-|
+ |SuperGPQA|5-shot, CoT|15.03|17.68|-|-|-|
+ |BBH|3-shot, CoT|41.47|51.7|48.56|+17.10%|-6.07%|
+ |GPQA|5-shot, CoT|26.77|24.24|26.78|+0.04%|+10.48%|
+ |GSM8K|4-shot, CoT|59.59|43.97|76.49|+28.36%|+73.96%|
+ |MATH|4-shot, CoT|32.44|26.1|40.2|+23.92%|+54.02%|
+ |EvalPlus|0-shot|36.23|43.23|59.57|+64.42%|+37.80%|
+ |MultiPL-E|0-shot|24.58|28.06|-|-|-|
+ |MBPP|3-shot|36.6|46.4|60.3|+64.75%|+29.96%|
+ |CRUX-O|1-shot|27|34|28.1|+4.07%|-17.35%|
+ |||||**Average**|**+25.39%**|**+20.53%**|
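A note on reading the Improvement and Average columns: each Improvement entry matches the relative gain of the Motif 2.6B score over the baseline score, i.e. (motif - baseline) / baseline x 100, and each Average is the arithmetic mean of the available Improvement entries in its column, with benchmarks missing a score ("-") excluded. The following is a minimal Python sketch that reproduces the "Improvement(over 4B)" column of the Qwen3-report table under that assumption; the function and variable names are ours, not part of the README.

```python
# Sketch: how the Improvement and Average columns appear to be derived.
# (baseline, motif) score pairs are taken from the Qwen3-report table above;
# benchmarks where either score is missing ("-") are skipped, as in the README.

def improvement(baseline: float, motif: float) -> float:
    """Relative gain of the Motif score over the baseline, in percent."""
    return (motif - baseline) / baseline * 100.0

# (Gemma 3 4B, Motif 2.6B) pairs for the "Improvement(over 4B)" column.
over_4b = {
    "MMLU":       (59.51, 57.93),
    "MMLU-Redux": (56.91, 59.54),
    "BBH":        (51.70, 48.56),
    "GPQA":       (24.24, 26.78),
    "GSM8K":      (43.97, 76.49),
    "MATH":       (26.10, 40.20),
    "EvalPlus":   (43.23, 59.57),
    "MBPP":       (46.40, 60.30),
    "CRUX-O":     (34.00, 28.10),
}

gains = {name: improvement(b, m) for name, (b, m) in over_4b.items()}
for name, gain in gains.items():
    print(f"{name:<11}{gain:+.2f}%")  # e.g. GSM8K      +73.96%
print(f"Average    {sum(gains.values()) / len(gains):+.2f}%")  # +20.53%
```

Run as-is, this prints +20.53% for the average, matching the table's bottom row, and the per-benchmark values match the Improvement column to two decimal places.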