Update README.md
README.md CHANGED
@@ -75,4 +75,36 @@ The benchmarks and metrics used are identical to those in the [Llama 3.2 officia
 |Hellaswag|0-shot|41.2|69.8|61.35|+48.91%|-12.11%|
 |||||**Average**|**+39.42%**|**-3.83%**|
 
-
+\*: We were unable to find an evaluation framework for this benchmark.
+
+### Comparison to Phi
+The benchmarks and metrics used are identical to those in the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
+
+|Benchmark|Metric|Phi-3 3.8B|Phi-3 7B|Phi-2 2.7B|Motif 2.6B|Improvement (over 3.8B)|Improvement (over 7B)|Improvement (over 2.7B)|
+|---|---|---|---|---|---|---|---|---|
+|MMLU|5-shot|68.8|75.7|56.3|57.93|-15.80%|-23.47%|+2.90%|
+|HellaSwag|5-shot|76.7|77|53.6|68.97|-10.08%|-10.43%|+28.68%|
+|ANLI|7-shot|52.8|58.1|42.5|47.99|-9.11%|-17.40%|+12.92%|
+|GSM-8K|8-shot, CoT|82.5|89.6|61.1|76.5|-7.27%|-14.62%|+25.20%|
+|MATH|0-shot, CoT|41.3|34.6|-|49.68|+20.29%|+43.58%|-|
+|MedQA|2-shot|53.8|65.4|40.9|42.1|-21.75%|-35.63%|+2.93%|
+|AGIEval*|0-shot|37.5|45.1|29.8|-|-|-|-|
+|TriviaQA|5-shot|64|58.1|45.2|54.97|-14.11%|-5.39%|+21.62%|
+|Arc-C|10-shot|84.9|90.7|75.9|75.17|-11.46%|-17.12%|-0.96%|
+|Arc-E|10-shot|94.6|97|88.5|88.64|-6.30%|-8.62%|+0.16%|
+|PIQA|5-shot|84.2|86.9|60.2|78.29|-7.02%|-9.91%|+30.05%|
+|SociQA|5-shot|76.6|79.2|68.3|66.73|-12.89%|-15.74%|-2.30%|
+|BigBench-Hard|3-shot, CoT|71.7|79.1|59.4|48.56|-32.27%|-38.61%|-18.25%|
+|WinoGrande|5-shot|70.8|81.5|54.7|67.09|-5.24%|-17.68%|+22.65%|
+|OpenBookQA|10-shot|83.2|88|73.6|87.8|+5.53%|-0.23%|+19.29%|
+|BoolQ|2-shot|77.2|84.8|-|70.7|-8.42%|-16.63%|-|
+|CommonSenseQA|10-shot|80.2|80|69.3|71.25|-11.16%|-10.94%|+2.81%|
+|TruthfulQA|10-shot|65|70.2|-|52.07|-19.89%|-25.83%|-|
+|HumanEval|0-shot|58.5|61|59|68.29|+16.74%|+11.95%|+15.75%|
+|MBPP|3-shot|70|71.7|60.6|60.3|-13.86%|-15.90%|-0.50%|
+|GPQA|2-shot, CoT|32.8|34.3|-|23.44|-28.54%|-31.66%|-|
+|MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
+||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|
+
+
+\*: We were unable to find an evaluation framework for this benchmark.
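
For reference, the improvement columns in both tables are consistent with the standard relative-difference formula, (Motif − baseline) / baseline × 100. A minimal sketch that reproduces the MMLU row (the `rel_improvement` helper is illustrative, not part of the repository):

```python
def rel_improvement(motif: float, baseline: float) -> float:
    """Percentage change of the Motif score relative to a baseline score."""
    return (motif - baseline) / baseline * 100

# MMLU row above: Motif 2.6B scores 57.93
print(f"{rel_improvement(57.93, 68.8):+.2f}%")  # vs. Phi-3 3.8B -> -15.80%
print(f"{rel_improvement(57.93, 75.7):+.2f}%")  # vs. Phi-3 7B   -> -23.47%
print(f"{rel_improvement(57.93, 56.3):+.2f}%")  # vs. Phi-2 2.7B -> +2.90%
```

The **Average** row then appears to be the arithmetic mean of each improvement column over the rows where both scores are available (cells marked "-" are skipped).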
|