JH-Motif committed
Commit 24517e4 · verified · 1 Parent(s): 1c86ca5

Update README.md

Files changed (1):
  1. README.md +27 -7
README.md CHANGED
@@ -20,7 +20,7 @@ To illustrate how much evaluation scores can vary across reports, we provide con
 
 ### Comparison to Mistral
 
- The benchmarks and metrics used are identical to those in the [Mistral 7B technical report](https://arxiv.org/pdf/2310.06825).
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Mistral 7B technical report](https://arxiv.org/pdf/2310.06825).
 
 |Benchmark|Metric|Mistral 7B|Motif 2.6B|Improvement|
 |---|---|---|---|---|
@@ -43,7 +43,7 @@ The benchmarks and metrics used are identical to those in the [Mistral 7B techni
 ### Comparison to Llama
 
 #### Llama 3
- The benchmarks and metrics used are identical to those in the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3 technical report](https://arxiv.org/abs/2407.21783).
 
 |Benchmark|Metric|Llama 3 8B|Motif 2.6B|Improvement|
 |---|---|---|---|---|
@@ -60,7 +60,7 @@ The benchmarks and metrics used are identical to those in the [Llama 3 technical
 ||||**Average**|**-15.36%**|
 
 #### Llama 3.2
- The benchmarks and metrics used are identical to those in the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Llama 3.2 official blog](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/).
 
 |Benchmark|Metric|Llama 3.2 1B|Llama 3.2 3B|Motif 2.6B|Improvement(over 1B)|Improvement(over 3B)|
 |---|---|---|---|---|---|---|
@@ -78,7 +78,7 @@ The benchmarks and metrics used are identical to those in the [Llama 3.2 officia
 \*: We were unable to find an evaluation framework for this benchmark.
 
 ### Comparison to Phi
- The benchmarks and metrics used are identical to those in the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
 
 |Benchmark|Metric|Phi-3 3.8B|Phi-3 7B|Phi-2 2.7B|Motif 2.6B|Improvement(over 3.8B)|Improvement(over 7B)|Improvement(over 2.7B)|
 |---|---|---|---|---|---|---|---|---|
@@ -111,7 +111,7 @@ The benchmarks and metrics used are identical to those in the [Phi-3 technical r
 ### Comparison to Gemma
 
 #### Gemma 1 & 2
- The benchmarks and metrics used are identical to those in the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 2 technical report](https://arxiv.org/abs/2408.00118).
 
 |Benchmark|Metric|Gemma 1 2B|Gemma 1 7B|Gemma 2 2B|Gemma 2 9B|Motif 2.6B|Improvement(over 1 2B)|Improvement(over 1 7B)|Improvement(over 2 2B)|Improvement(over 2 9B)|
 |---|---|---|---|---|---|---|---|---|---|---|
@@ -137,7 +137,7 @@ The benchmarks and metrics used are identical to those in the [Gemma 2 technical
 \*: We were unable to find an evaluation framework for this benchmark.
 
 #### Gemma 3
- The benchmarks and metrics used are identical to those in the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Gemma 3 technical report](https://arxiv.org/abs/2503.19786).
 
 |Benchmark|Metric|Gemma 3 1B|Gemma 3 4B|Motif 2.6B|Improvement(over 1B)|Improvement(over 4B)|
 |---|---|---|---|---|---|---|
@@ -152,4 +152,24 @@ The benchmarks and metrics used are identical to those in the [Gemma 3 technical
 |MMLU(val)|5-shot|-|48.8|57.93|-|+18.71%|
 |||||**Average**|**+24.71%**|**-8.28%**|
 
- \*: We were unable to find an evaluation framework for this benchmark.
+ \*: We were unable to find an evaluation framework for this benchmark.
+
+ Given that the benchmark set in the Gemma 3 technical report seemed somewhat non-standard, we also compared our results with the Gemma 3 scores from the Qwen3 technical report, which were evaluated independently by the Qwen3 team.
+
+ The benchmarks and corresponding scores listed in the table below are taken directly from the [Qwen3 technical report](https://arxiv.org/abs/2505.09388).
+
+ |Benchmark|Metric|Gemma 3 1B|Gemma 3 4B|Motif 2.6B|Improvement(over 1B)|Improvement(over 4B)|
+ |---|---|---|---|---|---|---|
+ |MMLU|5-shot|51.81|59.51|57.93|+9.70%|-2.66%|
+ |MMLU-Redux|5-shot|51.26|56.91|59.54|+16.15%|+4.62%|
+ |MMLU-Pro|5-shot, CoT|24.74|29.23|-|-|-|
+ |SuperGPQA|5-shot, CoT|15.03|17.68|-|-|-|
+ |BBH|3-shot, CoT|41.47|51.7|48.56|+17.10%|-6.07%|
+ |GPQA|5-shot, CoT|26.77|24.24|26.78|+0.04%|+10.48%|
+ |GSM8K|4-shot, CoT|59.59|43.97|76.49|+28.36%|+73.96%|
+ |MATH|4-shot, CoT|32.44|26.1|40.2|+23.92%|+54.02%|
+ |EvalPlus|0-shot|36.23|43.23|59.57|+64.42%|+37.80%|
+ |MultiPL-E|0-shot|24.58|28.06|-|-|-|
+ |MBPP|3-shot|36.6|46.4|60.3|+64.75%|+29.96%|
+ |CRUX-O|1-shot|27|34|28.1|+4.07%|-17.35%|
+ |||||**Average**|**+25.39%**|**+20.53%**|
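A note on reading the Improvement and Average columns: each Improvement entry matches the relative gain of the Motif 2.6B score over the baseline score, i.e. (motif - baseline) / baseline x 100, and each Average is the arithmetic mean of the available Improvement entries in its column, with benchmarks missing a score ("-") excluded. The following is a minimal Python sketch that reproduces the "Improvement(over 4B)" column of the Qwen3-report table under that assumption; the function and variable names are ours, not part of the README.

```python
# Sketch: how the Improvement and Average columns appear to be derived.
# (baseline, motif) score pairs are taken from the Qwen3-report table above;
# benchmarks where either score is missing ("-") are skipped, as in the README.

def improvement(baseline: float, motif: float) -> float:
    """Relative gain of the Motif score over the baseline, in percent."""
    return (motif - baseline) / baseline * 100.0

# (Gemma 3 4B, Motif 2.6B) pairs for the "Improvement(over 4B)" column.
over_4b = {
    "MMLU":       (59.51, 57.93),
    "MMLU-Redux": (56.91, 59.54),
    "BBH":        (51.70, 48.56),
    "GPQA":       (24.24, 26.78),
    "GSM8K":      (43.97, 76.49),
    "MATH":       (26.10, 40.20),
    "EvalPlus":   (43.23, 59.57),
    "MBPP":       (46.40, 60.30),
    "CRUX-O":     (34.00, 28.10),
}

gains = {name: improvement(b, m) for name, (b, m) in over_4b.items()}
for name, gain in gains.items():
    print(f"{name:<11}{gain:+.2f}%")  # e.g. GSM8K      +73.96%
print(f"Average    {sum(gains.values()) / len(gains):+.2f}%")  # +20.53%
```

Run as-is, this prints +20.53% for the average, matching the table's bottom row, and the per-benchmark values match the Improvement column to two decimal places.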