Add model-index with comprehensive benchmark evaluations

#78
by davidlms

Added structured evaluation results, extracted from the README benchmark tables, covering four categories:

1. Reasoning & Factuality (11 benchmarks):

  • HellaSwag: 77.2, BoolQ: 72.3, PIQA: 79.6, SocialIQA: 51.9
  • TriviaQA: 65.8, Natural Questions: 20.0
  • ARC-c: 56.2, ARC-e: 82.4, WinoGrande: 64.7
  • BIG-Bench Hard: 50.9, DROP: 60.1

2. STEM & Code (8 benchmarks):

  • MMLU: 59.6, MMLU Pro COT: 29.2, AGIEval: 42.1
  • MATH: 24.2, GSM8K: 38.4, GPQA: 15.0
  • MBPP: 46.0, HumanEval: 36.0

3. Multilingual (7 benchmarks):

  • MGSM: 34.7, Global-MMLU-Lite: 57.0
  • WMT24++ (ChrF): 48.4, FloRes: 39.2, XQuAD: 68.0
  • ECLeKTic: 11.0, IndicGenBench: 57.2

4. Multimodal (15 benchmarks):

  • COCOcap: 102.0, DocVQA: 72.8, InfoVQA: 44.1, MMMU: 39.2
  • TextVQA: 58.9, RealWorldQA: 45.5, ReMI: 27.3
  • AI2D: 63.2, ChartQA: 63.6, VQAv2: 63.9
  • BLINK: 38.0, OKVQA: 51.0, TallyQA: 42.5
  • SpatialSense VQA: 50.9, CountBenchQA: 26.1

Total: 41 benchmarks across reasoning, STEM, code, multilingual, and multimodal capabilities.
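
For illustration, here is a minimal sketch of how one of these results could be encoded as a model-index entry in the card metadata (the model name, task type, and dataset/metric type identifiers below are illustrative placeholders, not copied from the actual diff):

```yaml
model-index:
- name: <model-name>          # placeholder for the repository's model name
  results:
  - task:
      type: text-generation   # assumed task type
    dataset:
      name: HellaSwag         # benchmark name as reported in the README table
      type: hellaswag         # assumed dataset identifier
    metrics:
    - name: Accuracy          # assumed metric name/type for this benchmark
      type: accuracy
      value: 77.2             # score from the Reasoning & Factuality list above
```

The remaining 40 benchmarks follow the same pattern, one results entry per benchmark, with the metric type adjusted where the README reports a different metric (e.g. ChrF for WMT24++).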

This enables the model to appear on leaderboards and makes it easier to compare it with other models.

Note: the existing PRs (#57, #49, #34) modify the README body text. This PR only adds structured metadata to the YAML frontmatter, so it should not conflict with them.
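
As a rough sketch of where the addition lands relative to the body text those PRs touch (the keys shown are placeholders, not the card's actual metadata):

```yaml
---
# ...existing frontmatter keys (license, tags, etc.) stay untouched
model-index:
- name: <model-name>   # placeholder
  results:
  # one entry per benchmark, structured as in the sketch above
---
```

Everything below the closing `---` (the benchmark tables and usage text that #57, #49, and #34 edit) is left as-is by this PR.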

