Update src/components/tasks.py

src/components/tasks.py  +60 -17  CHANGED

@@ -14,43 +14,86 @@ def render_task_descriptions():
# Display the MLRC-BENCH information
st.markdown("""
- ## MLRC-BENCH: Can Language Agents Solve ML Research Challenges?
- …
- **Absolute Improvement to Baseline**: How much better the agent performs compared to the baseline, expressed as a percentage gain.
- …

# Can Language Agents Solve Machine Learning Research Challenges?

🚀 Introducing [MLRC-BENCH](https://huggingface.co/spaces/launch/MLRC_Bench), a new benchmark suite designed to test the scientific chops of LLM-based agents on real-world machine learning (ML) research problems.

---

## 🤖 What's the Problem?

While recent large language model (LLM) agents have made impressive strides in reasoning, coding, and even paper writing, current benchmarks fall short in evaluating their ability to generate **novel and effective research ideas**.

Most existing efforts either:

- Ask agents to write entire research papers, but rely on **subjective evaluation** (e.g., LLMs or humans judging ideas).
- Or evaluate agents on **Kaggle-style tasks**, which rarely require real innovation.

Both setups miss the mark when it comes to assessing whether LLM agents can truly **advance the ML research frontier**.

---

## 🧪 Enter MLRC-BENCH

**MLRC-BENCH** fills this gap by evaluating agents on **real ML research competitions** hosted at NeurIPS, ECCV, and other top venues. These tasks represent cutting-edge challenges in:

- LLM safety
- Multimodal perception
- Few-shot learning
- Machine unlearning
- Meta-learning
- And more!

Each task demands novel method design, not just re-implementing existing solutions.

### ✅ What Makes MLRC-BENCH Unique?

- **Objective Evaluation**: Agents are scored on real metrics (accuracy, ROUGE, MRR, etc.), with no LLM-as-a-judge handwaving.
- **Compute-Constrained**: Tasks come with GPU and runtime limits, simulating real-world resource constraints.
- **Tamper-Proof Setup**: Agents can only modify specific parts of the starter code; test data remains hidden (a sketch of such constraints follows this list).
- **Continually Updated**: New competition tasks will be added as ML research progresses.

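
The bullets above describe the constraint model only informally. Below is a minimal, purely illustrative Python sketch of how a per-task compute budget and code-edit allow-list could be represented; every field name, path, and limit here is hypothetical rather than MLRC-BENCH's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConstraints:
    """Illustrative only: one way a task's compute budget and code-edit rules
    could be encoded. Names and values are hypothetical, not MLRC-BENCH's schema."""
    gpu: str = "1x A100"                    # assumed hardware budget
    max_runtime_hours: float = 24.0         # assumed per-run time limit
    editable_paths: tuple = ("methods/",)   # agent may only modify these paths
    hidden_paths: tuple = ("data/test/",)   # test split stays invisible to the agent

    def edit_allowed(self, path: str) -> bool:
        """Reject edits outside the allow-list, mimicking a tamper-proof setup."""
        inside_editable = any(path.startswith(p) for p in self.editable_paths)
        touches_hidden = any(path.startswith(p) for p in self.hidden_paths)
        return inside_editable and not touches_hidden


constraints = TaskConstraints()
assert constraints.edit_allowed("methods/my_new_method.py")    # allowed
assert not constraints.edit_allowed("data/test/labels.json")   # blocked
```
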
---

## 📉 What Did We Find?

Despite access to top-tier LLMs like GPT-4o, Claude 3.5, and Gemini, **agents struggle**:

- The best-performing agent (Gemini under MLAB scaffolding) closes only **9.3% of the performance gap** between a baseline and the top human solution.
- Providing additional ideas from humans or other agents doesn't consistently help.
- LLMs often rate their own ideas as “innovative,” but objective metrics show they underperform.

📊 **Key Insight**: There’s a clear **misalignment between subjective novelty and actual effectiveness**.

---

## 🔬 Under the Hood

MLRC-BENCH comes with:

- **7 fully prepared tasks** with a unified code structure.
- **Development & test splits** for fair comparison.
- **Metrics for effectiveness, efficiency (runtime), and simplicity (lines of code)**.
- A leaderboard showcasing normalized improvements over baselines.

> Normalized scores range from 0 (baseline) to 100 (top human performance). Scores below 0 mean an agent underperforms the baseline.
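
The formula behind these normalized scores isn't spelled out here; one natural reading, consistent with the “closes only 9.3% of the performance gap” finding above, is a linear rescaling between the baseline and the top human score. A minimal sketch under that assumption (the metric values in the example are made up):

```python
def normalized_improvement(agent_score: float, baseline_score: float, top_human_score: float) -> float:
    """Assumed form of the 0-100 normalization described above (not a verbatim
    definition from the benchmark): 0 = baseline, 100 = top human performance,
    and negative values mean the agent falls below the baseline."""
    return 100.0 * (agent_score - baseline_score) / (top_human_score - baseline_score)


# Hypothetical raw metric values: closing 9.3% of the baseline-to-human gap
# corresponds to a normalized score of about 9.3 under this reading.
print(normalized_improvement(agent_score=0.4233, baseline_score=0.40, top_human_score=0.65))  # ≈ 9.3
```
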
---

## 🧠 Why This Matters

MLRC-BENCH is a **stress test for research agents**. It doesn’t just ask “Can LLMs code?”; it asks:

> Can LLMs **propose and implement** solutions that outperform known baselines on hard problems?

If we want to build autonomous research agents that assist or even collaborate with human scientists, **benchmarks like MLRC-BENCH are essential**.

---

## 📍 Try It Yourself

Check out the tasks and get ready to submit your own agent:

👉 We will open the submission link in the near future. Stay tuned!

Let’s see if your agent can beat the benchmark!

""")
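
For orientation, a minimal sketch of how a component like `render_task_descriptions()` is typically wired into a Streamlit app; the entry-point filename, import path, and page settings below are assumptions, not taken from this repository:

```python
# Hypothetical app entry point (e.g. a top-level app.py); not this Space's actual code.
import streamlit as st

from src.components.tasks import render_task_descriptions  # assumed import path

st.set_page_config(page_title="MLRC-BENCH", layout="wide")
render_task_descriptions()  # renders the st.markdown block shown in the diff above
```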