Commit fbf41fa (verified) committed by JH-Motif · 1 Parent(s): 8750422

Update README.md

Files changed (1): README.md (+33 -1)
README.md CHANGED
@@ -75,4 +75,36 @@ The benchmarks and metrics used are identical to those in the [Llama 3.2 officia
  |Hellaswag|0-shot|41.2|69.8|61.35|+48.91%|-12.11%|
  |||||**Average**|**+39.42%**|**-3.83%**|
 
- \* We were unable to find an evaluation framework for this benchmark.
+ \*: We were unable to find an evaluation framework for this benchmark.
+
+ ### Comparison to Phi
+ The benchmarks and metrics used are identical to those in the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
+
+ |Benchmark|Metric|Phi-3 3.8B|Phi-3 7B|Phi-2 2.7B|Motif 2.6B|Improvement (over 3.8B)|Improvement (over 7B)|Improvement (over 2.7B)|
+ |---|---|---|---|---|---|---|---|---|
+ |MMLU|5-shot|68.8|75.7|56.3|57.93|-15.80%|-23.47%|+2.90%|
+ |HellaSwag|5-shot|76.7|77|53.6|68.97|-10.08%|-10.43%|+28.68%|
+ |ANLI|7-shot|52.8|58.1|42.5|47.99|-9.11%|-17.40%|+12.92%|
+ |GSM-8K|8-shot, CoT|82.5|89.6|61.1|76.5|-7.27%|-14.62%|+25.20%|
+ |MATH|0-shot, CoT|41.3|34.6|-|49.68|+20.29%|+43.58%|-|
+ |MedQA|2-shot|53.8|65.4|40.9|42.1|-21.75%|-35.63%|+2.93%|
+ |AGIEval*|0-shot|37.5|45.1|29.8|-|-|-|-|
+ |TriviaQA|5-shot|64|58.1|45.2|54.97|-14.11%|-5.39%|+21.62%|
+ |Arc-C|10-shot|84.9|90.7|75.9|75.17|-11.46%|-17.12%|-0.96%|
+ |Arc-E|10-shot|94.6|97|88.5|88.64|-6.30%|-8.62%|+0.16%|
+ |PIQA|5-shot|84.2|86.9|60.2|78.29|-7.02%|-9.91%|+30.05%|
+ |SociQA|5-shot|76.6|79.2|68.3|66.73|-12.89%|-15.74%|-2.30%|
+ |BigBench-Hard|3-shot, CoT|71.7|79.1|59.4|48.56|-32.27%|-38.61%|-18.25%|
+ |WinoGrande|5-shot|70.8|81.5|54.7|67.09|-5.24%|-17.68%|+22.65%|
+ |OpenBookQA|10-shot|83.2|88|73.6|87.8|+5.53%|-0.23%|+19.29%|
+ |BoolQ|2-shot|77.2|84.8|-|70.7|-8.42%|-16.63%|-|
+ |CommonSenseQA|10-shot|80.2|80|69.3|71.25|-11.16%|-10.94%|+2.81%|
+ |TruthfulQA|10-shot|65|70.2|-|52.07|-19.89%|-25.83%|-|
+ |HumanEval|0-shot|58.5|61|59|68.29|+16.74%|+11.95%|+15.75%|
+ |MBPP|3-shot|70|71.7|60.6|60.3|-13.86%|-15.90%|-0.50%|
+ |GPQA|2-shot, CoT|32.8|34.3|-|23.44|-28.54%|-31.66%|-|
+ |MT Bench|2-round avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
+ ||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|
+
+
+ \*: We were unable to find an evaluation framework for this benchmark.
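
For reference, the Improvement and Average figures in the added table are consistent with a plain relative difference of the Motif 2.6B score against each baseline. A minimal Python sketch (the `improvement` helper and the hard-coded column of values are ours, copied from the table above; nothing here comes from the repository):

```python
# Sketch of how the "Improvement" columns appear to be computed: the relative
# difference of Motif 2.6B against each baseline, expressed in percent.
def improvement(motif: float, baseline: float) -> float:
    """Relative difference in percent; positive means Motif scores higher."""
    return (motif - baseline) / baseline * 100.0

# Example: MMLU, Motif 2.6B (57.93) vs. Phi-3 3.8B (68.8)
print(f"{improvement(57.93, 68.8):+.2f}%")  # -15.80%, matching the table

# The "Average" row is consistent with a plain mean over the rows that have
# a value; AGIEval is skipped because Motif has no score there ("-").
over_3_8b = [-15.80, -10.08, -9.11, -7.27, 20.29, -21.75, -14.11, -11.46,
             -6.30, -7.02, -12.89, -32.27, -5.24, 5.53, -8.42, -11.16,
             -19.89, 16.74, -13.86, -28.54, -19.21]
print(f"{sum(over_3_8b) / len(over_3_8b):+.2f}%")  # -10.09%
```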