Update README.md

README.md CHANGED
@@ -20,7 +20,7 @@ To illustrate how much evaluation scores can vary across reports, we provide con
 
 ### Comparison to Mistral
 
-The benchmarks and metrics used are identical to those in the [Mistral 7B technical report](https://arxiv.org/pdf/2310.06825).
+The benchmarks and corresponding scores listed in the table below are taken directly from the [Mistral 7B technical report](https://arxiv.org/pdf/2310.06825).
 
 |Benchmark|Metric|Mistral 7B|Motif 2.6B|Improvement|
 |---|---|---|---|---|
@@ -43,7 +43,7 @@ The benchmarks and metrics used are identical to those in the [Mistral 7B techni
 ### Comparison to Llama
 
 #### Llama 3
-The benchmarks and metrics used are identical to those in the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).
+The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).
 
 |Benchmark|Metric|Llama 3 8B|Motif 2.6B|Improvement|
 |---|---|---|---|---|
@@ -60,7 +60,7 @@ The benchmarks and metrics used are identical to those in the [Llama 3 technical
 ||||**Average**|**-15.36%**|
 
 #### Llama 3.2
-The benchmarks and metrics used are identical to those in the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
+The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
 
 |Benchmark|Metric|Llama 3.2 1B|Llama 3.2 3B|Motif 2.6B|Improvement(over 1B)|Improvement(over 3B)|
 |---|---|---|---|---|---|---|
@@ -78,7 +78,7 @@ The benchmarks and metrics used are identical to those in the [Llama 3.2 officia
 \*: We were unable to find an evaluation framework for this benchmark.
 
 ### Comparison to Phi
-The benchmarks and metrics used are identical to those in the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
+The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
 
 |Benchmark|Metric|Phi-3 3.8B|Phi-3 7B|Phi-2 2.7B|Motif 2.6B|Improvement(over 3.8B)|Improvement(over 7B)|Improvement(over 2.7B)|
 |---|---|---|---|---|---|---|---|---|
@@ -111,7 +111,7 @@ The benchmarks and metrics used are identical to those in the [Phi-3 technical r
 ### Comparison to Gemma
 
 #### Gemma 1 & 2
-The benchmarks and metrics used are identical to those in the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
+The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
 
 |Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 2B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
 |---|---|---|---|---|---|---|---|---|---|---|
@@ -137,7 +137,7 @@ The benchmarks and metrics used are identical to those in the [Gemma 2 technical
 \*: We were unable to find an evaluation framework for this benchmark.
 
 #### Gemma 3
-The benchmarks and metrics used are identical to those in the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
+The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
 
 |Benchmark|Metric|Gemma 3 1B|Gemma 3 4B|Motif 2.6B|Improvement(over 1B)|Improvement(over 4B)|
 |---|---|---|---|---|---|---|
@@ -152,4 +152,24 @@ The benchmarks and metrics used are identical to those in the [Gemma 3 technical
 |MMLU(val)|5-shot|-|48.8|57.93|-|+18.71%|
 |||||**Average**|**+24.71%**|**-8.28%**|
 
-\*: We were unable to find an evaluation framework for this benchmark.
+\*: We were unable to find an evaluation framework for this benchmark.
+
+Given that the benchmark set in the Gemma 3 technical report seemed somewhat non-standard, we also compared our results with the Gemma 3 scores from the Qwen3 technical report, which were independently evaluated by the Qwen3 team.
+
+The benchmarks and corresponding scores listed in the table below are taken directly from the [Qwen3 technical report](https://arxiv.org/abs/2505.09388).
+
+|Benchmark|Metric|Gemma 3 1B|Gemma 3 4B|Motif 2.6B|Improvement(over 1B)|Improvement(over 4B)|
+|---|---|---|---|---|---|---|
+|MMLU|5-shot|51.81|59.51|57.93|+9.70%|-2.66%|
+|MMLU-Redux|5-shot|51.26|56.91|59.54|+16.15%|+4.62%|
+|MMLU-Pro|5-shot, CoT|24.74|29.23|-|-|-|
+|SuperGPQA|5-shot, CoT|15.03|17.68|-|-|-|
+|BBH|3-shot, CoT|41.47|51.7|48.56|+17.10%|-6.07%|
+|GPQA|5-shot, CoT|26.77|24.24|26.78|+0.04%|+10.48%|
+|GSM8K|4-shot, CoT|59.59|43.97|76.49|+28.36%|+73.96%|
+|MATH|4-shot, CoT|32.44|26.1|40.2|+23.92%|+54.02%|
+|EvalPlus|0-shot|36.23|43.23|59.57|+64.42%|+37.80%|
+|MultiPL-E|0-shot|24.58|28.06|-|-|-|
+|MBPP|3-shot|36.6|46.4|60.3|+64.75%|+29.96%|
+|CRUX-O|1-shot|27|34|28.1|+4.07%|-17.35%|
+|||||**Average**|**+25.39%**|**+20.53%**|
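The `Improvement` columns in these tables appear to be plain relative gain, `(Motif - baseline) / baseline * 100`, with each `Average` taken over the rows where both scores are available (e.g. GSM8K above: (76.49 - 43.97) / 43.97 ≈ +73.96%). Below is a minimal Python sketch reproducing the `Improvement(over 4B)` column of the last table; the `improvement` helper is illustrative, not part of this repository:

```python
# Illustrative sketch: reproduces the "Improvement(over 4B)" column of the
# Qwen3-report comparison table from the raw benchmark scores.

def improvement(baseline: float, motif: float) -> float:
    """Relative gain of Motif 2.6B over a baseline score, in percent."""
    return (motif - baseline) / baseline * 100.0

# (Gemma 3 4B score, Motif 2.6B score) per benchmark, from the table above;
# rows where either score is missing ("-") are excluded from the average.
rows = {
    "MMLU":       (59.51, 57.93),
    "MMLU-Redux": (56.91, 59.54),
    "BBH":        (51.7,  48.56),
    "GPQA":       (24.24, 26.78),
    "GSM8K":      (43.97, 76.49),
    "MATH":       (26.1,  40.2),
    "EvalPlus":   (43.23, 59.57),
    "MBPP":       (46.4,  60.3),
    "CRUX-O":     (34.0,  28.1),
}

gains = [improvement(base, motif) for base, motif in rows.values()]
for name, gain in zip(rows, gains):
    print(f"{name}: {gain:+.2f}%")                # e.g. GSM8K: +73.96%
print(f"Average: {sum(gains)/len(gains):+.2f}%")  # +20.53%, matching the table
```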