# Week 1: Evaluate a Hub Model
**Goal:** Add evaluation results to model cards across the Hub. Together, we're building a distributed leaderboard of open source model performance.
> [!NOTE]
> Bonus XP for contributing to the leaderboard application. Open a PR [on the hub](https://huggingface.co/spaces/hf-skills/distributed-leaderboard/discussions) or [on GitHub](https://github.com/huggingface/skills/blob/main/apps/evals-leaderboard/app.py) to get your XP.
## Why This Matters
Model cards without evaluation data are hard to compare. By adding structured eval results to `model-index` metadata, we make models searchable, sortable, and easier to choose between. Your contributions power leaderboards and help the community find the best models for their needs, and because the work is distributed, everyone's evaluation results are shared back with the community.
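Concretely, a `model-index` entry ties a task, a dataset, and a metric value together in the card's YAML front matter. The skill's scripts write these entries for you; purely as a reference for what ends up in the card, here is a minimal sketch using `huggingface_hub`'s metadata helpers (the repo id and the 0.62 score are placeholders, not real results):
```python
from huggingface_hub import metadata_eval_result, metadata_update

# Build a single model-index entry: task + dataset + metric (placeholder values).
results = metadata_eval_result(
    model_pretty_name="My Model",        # display name used in the card
    task_pretty_name="Text Generation",
    task_id="text-generation",
    dataset_pretty_name="MMLU",
    dataset_id="cais/mmlu",
    metrics_pretty_name="Accuracy",
    metrics_id="accuracy",
    metrics_value=0.62,                  # placeholder score
)

# Merge the entry into the card's metadata and open a PR on the Hub.
metadata_update("model-author/model-name", results, create_pr=True)
```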
## The Skill
Use `hf_model_evaluation/` for this quest. Key capabilities:
- Extract evaluation tables from existing README content
- Import benchmark scores from Artificial Analysis
- Run your own evals with inspect-ai on HF Jobs
- Update `model-index` metadata (Papers with Code compatible)
```bash
# Preview what would be extracted
python hf_model_evaluation/scripts/evaluation_manager.py extract-readme \
--repo-id "model-author/model-name" --dry-run
```
## XP Tiers
### 🐢 Starter – 50 XP
**Extract evaluation results from one benchmark and update its model card.**
1. Pick a Hub model without evaluation data from the *trending models* list on the Hub
2. Use the skill to extract or add a benchmark score
3. Create a PR (or push directly if you own the model)
**What counts:** One model, one dataset, and a metric visible in the model card metadata.
### 🐕 Standard – 100 XP
**Import scores from third-party benchmarks like Artificial Analysis.**
1. Find a model with benchmark data on external sites
2. Use `import-aa` to fetch scores from Artificial Analysis API
3. Create a PR with properly attributed evaluation data
**What counts:** Benchmark scores the card didn't already have, properly attributed, and a merged PR.
```bash
AA_API_KEY="your-key" python hf_model_evaluation/scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" --model-name "claude-sonnet-4" \
--repo-id "target/model" --create-pr
```
### 🦁 Advanced – 200 XP
**Run your own evaluation with inspect-ai and publish results.**
1. Choose an eval task (MMLU, GSM8K, HumanEval, etc.)
2. Run the evaluation on HF Jobs infrastructure
3. Update the model card with your results and methodology
**What counts:** Original eval run and merged PR.
```bash
HF_TOKEN=$HF_TOKEN hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
--flavor a10g-small --secret HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" --task "mmlu"
```
## Tips
- Always use `--dry-run` first to preview changes before pushing
- Check for transposed tables where models are rows and benchmarks are columns
- Be careful with PRs for models you don't own; most maintainers appreciate eval contributions, but be respectful.
- Manually validate the extracted scores and close the PR if they don't hold up (see the sketch below).
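A quick way to do that validation is to load the card from the Hub and print the parsed `model-index` entries. A minimal sketch with `huggingface_hub` (the repo id is a placeholder):
```python
from huggingface_hub import ModelCard

# Load the model card straight from the Hub and inspect its parsed eval results.
card = ModelCard.load("model-author/model-name")  # placeholder repo id
for result in card.data.eval_results or []:
    print(f"{result.dataset_name} / {result.metric_type}: {result.metric_value}")
```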
## Resources
- [SKILL.md](../hf_model_evaluation/SKILL.md) – Full skill documentation
- [Example Usage](../hf_model_evaluation/examples/USAGE_EXAMPLES.md) – Worked examples
- [Metric Mapping](../hf_model_evaluation/examples/metric_mapping.json) – Standard metric types