# Week 1: Evaluate a Hub Model
**Goal:** Add evaluation results to model cards across the Hub. Together, we're building a distributed leaderboard of open source model performance.
> [!NOTE]
> Bonus XP for contributing to the leaderboard application. Open a PR [on the hub](https://huggingface.co/spaces/hf-skills/distributed-leaderboard/discussions) or [on GitHub](https://github.com/huggingface/skills/blob/main/apps/evals-leaderboard/app.py) to get your XP.
## Why This Matters
Model cards without evaluation data are hard to compare. By adding structured eval results to `model-index` metadata, we make models searchable, sortable, and easier to choose between. Your contributions power leaderboards and help the community find the best models for their needs, and because the work is distributed across many contributors, every evaluation result is shared back with the community.
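For orientation, the structured results live in the model card's YAML front matter under the `model-index` key. A minimal sketch of the shape this takes (the model name, benchmark, and score below are illustrative placeholders, not real results):
```yaml
model-index:
- name: model-name
  results:
  - task:
      type: text-generation
    dataset:
      name: MMLU                 # human-readable benchmark name
      type: cais/mmlu            # Hub dataset id
    metrics:
    - type: accuracy
      value: 68.2                # placeholder score
      name: MMLU (5-shot)
    source:
      name: Self-reported
      url: https://huggingface.co/model-author/model-name
```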
## The Skill
Use `hf_model_evaluation/` for this quest. Key capabilities:
- Extract evaluation tables from existing README content
- Import benchmark scores from Artificial Analysis
- Run your own evals with inspect-ai on HF Jobs
- Update model-index metadata (Papers with Code compatible)
```bash
# Preview what would be extracted
python hf_model_evaluation/scripts/evaluation_manager.py extract-readme \
  --repo-id "model-author/model-name" --dry-run
```
## XP Tiers
### Starter – 50 XP
**Extract evaluation results from one benchmark and update its model card.**
1. Pick a Hub model without evaluation data from *trending models* on the Hub
2. Use the skill to extract or add a benchmark score
3. Create a PR (or push directly if you own the model) – see the example commands below
**What counts:** One model, one dataset, metric visible in model card metadata.
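A possible end-to-end flow for this tier. The repo id is a placeholder, and the `--create-pr` flag on `extract-readme` is an assumption mirroring the `import-aa` example further down – check the script's `--help` to confirm the exact option:
```bash
# Preview first: nothing is written to the Hub with --dry-run
python hf_model_evaluation/scripts/evaluation_manager.py extract-readme \
  --repo-id "model-author/model-name" --dry-run

# If the preview looks right, rerun without --dry-run and open a PR
# (--create-pr is assumed here, mirroring the import-aa example)
python hf_model_evaluation/scripts/evaluation_manager.py extract-readme \
  --repo-id "model-author/model-name" --create-pr
```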
### Standard – 100 XP
**Import scores from third-party benchmarks like Artificial Analysis.**
1. Find a model with benchmark data on external sites
2. Use `import-aa` to fetch scores from the Artificial Analysis API
3. Create a PR with properly attributed evaluation data
**What counts:** Imported, properly attributed benchmark scores and a merged PR.
```bash
AA_API_KEY="your-key" python hf_model_evaluation/scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" --model-name "claude-sonnet-4" \
  --repo-id "target/model" --create-pr
```
### Advanced – 200 XP
**Run your own evaluation with inspect-ai and publish results.**
1. Choose an eval task (MMLU, GSM8K, HumanEval, etc.)
2. Run the evaluation on HF Jobs infrastructure
3. Update the model card with your results and methodology
**What counts:** Original eval run and merged PR.
```bash
HF_TOKEN=$HF_TOKEN hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
  --flavor a10g-small --secret HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" --task "mmlu"
```
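Submitting the job returns a job id. You can follow progress with the same `hf` CLI – subcommand names as documented for HF Jobs, so confirm with `hf jobs --help` on your version:
```bash
# List your jobs and their ids
hf jobs ps
# Stream the logs of a specific run (replace <job-id> with the id returned above)
hf jobs logs <job-id>
```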
## Tips
- Always use `--dry-run` first to preview changes before pushing
- Check for transposed tables where models are rows and benchmarks are columns
- Be careful with PRs for models you don't own – most maintainers appreciate eval contributions, but stay respectful.
- Manually validate the extracted scores and close your PR if the numbers don't hold up (a quick check is sketched below).
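One way to spot-check the metadata after a change lands (or after your own push) is to pull the README and print its YAML front matter – a minimal sketch using the `hf` CLI, with the repo id as a placeholder:
```bash
# Fetch only the model card and show its YAML front matter
hf download "model-author/model-name" README.md --local-dir /tmp/card-check
sed -n '/^---$/,/^---$/p' /tmp/card-check/README.md
```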
## Resources
- [SKILL.md](../hf_model_evaluation/SKILL.md) – Full skill documentation
- [Example Usage](../hf_model_evaluation/examples/USAGE_EXAMPLES.md) – Worked examples
- [Metric Mapping](../hf_model_evaluation/examples/metric_mapping.json) – Standard metric types