Add model-index with comprehensive benchmark evaluations
#78 · opened by davidlms
Added structured evaluation results, extracted from the README benchmark tables, covering 4 categories (a sketch of the resulting metadata follows the list):
1. Reasoning & Factuality (11 benchmarks):
- HellaSwag: 77.2, BoolQ: 72.3, PIQA: 79.6, SocialIQA: 51.9
- TriviaQA: 65.8, Natural Questions: 20.0
- ARC-c: 56.2, ARC-e: 82.4, WinoGrande: 64.7
- BIG-Bench Hard: 50.9, DROP: 60.1
2. STEM & Code (8 benchmarks):
- MMLU: 59.6, MMLU Pro COT: 29.2, AGIEval: 42.1
- MATH: 24.2, GSM8K: 38.4, GPQA: 15.0
- MBPP: 46.0, HumanEval: 36.0
3. Multilingual (7 benchmarks):
- MGSM: 34.7, Global-MMLU-Lite: 57.0
- WMT24++ (ChrF): 48.4, FloRes: 39.2, XQuAD: 68.0
- ECLeKTic: 11.0, IndicGenBench: 57.2
4. Multimodal (15 benchmarks):
- COCOcap: 102.0, DocVQA: 72.8, InfoVQA: 44.1, MMMU: 39.2
- TextVQA: 58.9, RealWorldQA: 45.5, ReMI: 27.3
- AI2D: 63.2, ChartQA: 63.6, VQAv2: 63.9
- BLINK: 38.0, OKVQA: 51.0, TallyQA: 42.5
- SpatialSense VQA: 50.9, CountBenchQA: 26.1
Total: 41 benchmarks across reasoning, STEM, code, multilingual, and multimodal capabilities.
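For reference, here is a minimal sketch of how one of these results would look in the `model-index` block of the YAML frontmatter. The model name, task type, and dataset `type` identifiers below are illustrative placeholders, not the exact strings used in this PR; only the metric values come from the tables above.

```yaml
model-index:
- name: example-model            # placeholder; the PR uses the actual model name
  results:
  - task:
      type: text-generation      # assumed task type for the text benchmarks
    dataset:
      name: HellaSwag            # benchmark name as shown in the README table
      type: hellaswag            # assumed dataset identifier
    metrics:
    - type: accuracy
      value: 77.2                # score from the Reasoning & Factuality table
  - task:
      type: text-generation
    dataset:
      name: MMLU
      type: mmlu                 # assumed dataset identifier
    metrics:
    - type: accuracy
      value: 59.6                # score from the STEM & Code table
```

Each of the 41 benchmarks gets one such entry under `results`, so the change stays entirely in the frontmatter and leaves the README body untouched.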
This enables the model to appear on leaderboards and makes it easier to compare with other models.
Note: Existing PRs (#57, #49, #34) modify README text content. This PR adds structured metadata to the YAML frontmatter and should not conflict.