Update README.md
README.md CHANGED
@@ -75,4 +75,36 @@ The benchmarks and metrics used are identical to those in the [Llama 3.2 officia
 |Hellaswag|0-shot|41.2|69.8|61.35|+48.91%|-12.11%|
 |||||**Average**|**+39.42%**|**-3.83%**|
 
-
+\*: We were unable to find an evaluation framework for this benchmark.
+
+### Comparison to Phi
+The benchmarks and metrics used are identical to those in the [Phi-3 technical report](https://arxiv.org/abs/2404.14219).
+
+|Benchmark|Metric|Phi-3 3.8B|Phi-3 7B|Phi-2 2.7B|Motif 2.6B|Improvement (over 3.8B)|Improvement (over 7B)|Improvement (over 2.7B)|
+|---|---|---|---|---|---|---|---|---|
+|MMLU|5-shot|68.8|75.7|56.3|57.93|-15.80%|-23.47%|+2.90%|
+|HellaSwag|5-shot|76.7|77|53.6|68.97|-10.08%|-10.43%|+28.68%|
+|ANLI|7-shot|52.8|58.1|42.5|47.99|-9.11%|-17.40%|+12.92%|
+|GSM-8K|8-shot, CoT|82.5|89.6|61.1|76.5|-7.27%|-14.62%|+25.20%|
+|MATH|0-shot, CoT|41.3|34.6|-|49.68|+20.29%|+43.58%|-|
+|MedQA|2-shot|53.8|65.4|40.9|42.1|-21.75%|-35.63%|+2.93%|
+|AGIEval*|0-shot|37.5|45.1|29.8|-|-|-|-|
+|TriviaQA|5-shot|64|58.1|45.2|54.97|-14.11%|-5.39%|+21.62%|
+|Arc-C|10-shot|84.9|90.7|75.9|75.17|-11.46%|-17.12%|-0.96%|
+|Arc-E|10-shot|94.6|97|88.5|88.64|-6.30%|-8.62%|+0.16%|
+|PIQA|5-shot|84.2|86.9|60.2|78.29|-7.02%|-9.91%|+30.05%|
+|SociQA|5-shot|76.6|79.2|68.3|66.73|-12.89%|-15.74%|-2.30%|
+|BigBench-Hard|3-shot, CoT|71.7|79.1|59.4|48.56|-32.27%|-38.61%|-18.25%|
+|WinoGrande|5-shot|70.8|81.5|54.7|67.09|-5.24%|-17.68%|+22.65%|
+|OpenBookQA|10-shot|83.2|88|73.6|87.8|+5.53%|-0.23%|+19.29%|
+|BoolQ|2-shot|77.2|84.8|-|70.7|-8.42%|-16.63%|-|
+|CommonSenseQA|10-shot|80.2|80|69.3|71.25|-11.16%|-10.94%|+2.81%|
+|TruthfulQA|10-shot|65|70.2|-|52.07|-19.89%|-25.83%|-|
+|HumanEval|0-shot|58.5|61|59|68.29|+16.74%|+11.95%|+15.75%|
+|MBPP|3-shot|70|71.7|60.6|60.3|-13.86%|-15.90%|-0.50%|
+|GPQA|2-shot, CoT|32.8|34.3|-|23.44|-28.54%|-31.66%|-|
+|MT Bench|2R. Avg.|8.38|8.7|-|6.77|-19.21%|-22.18%|-|
+||||||**Average**|**-10.09%**|**-13.45%**|**+10.18%**|
+
+
+\*: We were unable to find an evaluation framework for this benchmark.
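
For reference, the improvement columns in both tables are consistent with the standard relative-difference formula, (Motif − baseline) / baseline × 100. A minimal sketch that reproduces the MMLU row (the `rel_improvement` helper is illustrative, not part of the repository):

```python
def rel_improvement(motif: float, baseline: float) -> float:
    """Percentage change of the Motif score relative to a baseline score."""
    return (motif - baseline) / baseline * 100

# MMLU row above: Motif 2.6B scores 57.93
print(f"{rel_improvement(57.93, 68.8):+.2f}%")  # vs. Phi-3 3.8B -> -15.80%
print(f"{rel_improvement(57.93, 75.7):+.2f}%")  # vs. Phi-3 7B   -> -23.47%
print(f"{rel_improvement(57.93, 56.3):+.2f}%")  # vs. Phi-2 2.7B -> +2.90%
```

The **Average** row then appears to be the arithmetic mean of each improvement column over the rows where both scores are available (cells marked "-" are skipped).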
|