| """ | |
| Task description components for the leaderboard application. | |
| """ | |
| import streamlit as st | |
| from src.utils.config import tasks_info | |
| from src.utils.task_mapping import get_display_name, get_original_name | |


def render_task_descriptions():
    """
    Render the benchmark details section.
    """
    # Display the MLRC-BENCH image
    st.image("Assests/MLRC_Bench_overview.png", use_column_width=True)

    # Display the MLRC-BENCH information
    st.markdown("""
# Can Language Agents Solve Machine Learning Research Challenges?

🚀 Introducing [MLRC-BENCH](https://huggingface.co/spaces/launch/MLRC_Bench), a new benchmark suite designed to test the scientific chops of LLM-based agents on real-world machine learning (ML) research problems.

---

## 🤖 What's the Problem?

While recent large language model (LLM) agents have made impressive strides in reasoning, coding, and even paper writing, current benchmarks fall short in evaluating their ability to generate **novel and effective research ideas**.

Most existing efforts either:

- ask agents to write entire research papers, relying on **subjective evaluation** (e.g., LLMs or humans judging the ideas), or
- evaluate agents on **Kaggle-style tasks**, which rarely require real innovation.

Both setups miss the mark when it comes to assessing whether LLM agents can truly **advance the ML research frontier**.

---

## 🧪 Enter MLRC-BENCH

**MLRC-BENCH** fills this gap by evaluating agents on **real ML research competitions** hosted at NeurIPS, ECCV, and other top venues. These tasks represent cutting-edge challenges in:

- LLM safety
- Multimodal perception
- Few-shot learning
- Machine unlearning
- Meta learning
- And more!

Each task demands novel method design—not just re-implementing existing solutions.

### ✅ What Makes MLRC-BENCH Unique?

- **Objective Evaluation**: Agents are scored on real metrics (accuracy, ROUGE, MRR, etc.)—no LLM-as-a-judge handwaving. A sketch of one such metric follows this list.
- **Compute-Constrained**: Tasks come with GPU and runtime limits, simulating real-world resource constraints.
- **Tamper-Proof Setup**: Agents can only modify specific parts of the starter code; test data remains hidden.
- **Continually Updated**: New competition tasks will be added as ML research progresses.
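
To make "real metrics" concrete, here is a minimal sketch of mean reciprocal rank (MRR), one of the metrics listed above. The function name and the toy data are illustrative only and do not come from the MLRC-BENCH codebase.

```python
def mean_reciprocal_rank(ranked_lists, relevant_items):
    # MRR = average over queries of 1 / (rank of the first relevant item),
    # counting 0 for queries where the relevant item never appears.
    reciprocal_ranks = []
    for ranking, relevant in zip(ranked_lists, relevant_items):
        rr = 0.0
        for rank, item in enumerate(ranking, start=1):
            if item == relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Toy data: the relevant item appears at rank 1, rank 3, and not at all.
print(mean_reciprocal_rank([["a", "b"], ["x", "y", "z"], ["p", "q"]],
                           ["a", "z", "r"]))  # (1 + 1/3 + 0) / 3 ≈ 0.44
```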

---

## 📉 What Did We Find?

Despite access to top-tier LLMs like GPT-4o, Claude 3.5, and Gemini, **agents struggle**:

- The best-performing agent (Gemini under MLAB scaffolding) closes only **9.3% of the performance gap** between a baseline and the top human solution.
- Providing additional ideas from humans or other agents doesn't consistently help.
- LLMs often rate their own ideas as “innovative,” but objective metrics show they underperform.

📊 **Key Insight**: There’s a clear **misalignment between subjective novelty and actual effectiveness**.

---

## 🔬 Under the Hood

MLRC-BENCH comes with:

- **7 fully prepared tasks** with unified code structure.
- **Development & test splits** for fair comparison.
- **Metrics for effectiveness, efficiency (runtime), and simplicity (lines of code)**.
- A leaderboard showcasing normalized improvements over baselines.

> Normalized scores range from 0 (baseline) to 100 (top human performance). Scores < 0 mean agents underperform the baseline!
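
As a rough sketch of how such a normalized score can be computed, the snippet below assumes a simple linear rescaling between the baseline score and the top human score on a task's raw metric; the function name and the numbers are hypothetical, not taken from the benchmark code. Read this way, the 9.3% figure reported above corresponds to a normalized score of 9.3.

```python
def normalized_improvement(agent_score, baseline_score, human_score):
    # Maps a raw task metric onto the leaderboard scale:
    # 0 = baseline performance, 100 = top human performance.
    # Values below 0 mean the agent did worse than the baseline.
    return 100.0 * (agent_score - baseline_score) / (human_score - baseline_score)

# Hypothetical raw scores (higher is better): baseline 60, top human 80, agent 62.
# The agent closes about 10% of the baseline-to-human gap.
print(round(normalized_improvement(62, 60, 80), 1))  # 10.0
```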

---

## 🧠 Why This Matters

MLRC-BENCH is a **stress test for research agents**. It doesn’t just ask “Can LLMs code?”—it asks:

> Can LLMs **propose and implement** solutions that outperform known baselines on hard problems?

If we want to build autonomous research agents that assist or even collaborate with human scientists, **benchmarks like MLRC-BENCH are essential**.

---

## 📍 Try It Yourself

Check out the tasks and submit your own agent:

👉 We will open the submission link in the near future. Stay tuned!

Let’s see if your agent can beat the benchmark!
    """)
| st.markdown(""" | |
| <div class="card"> | |
| <div class="card-title"><span class="card-title-icon">🔍</span> Tasks in the Benchmark</div> | |
| <p style="margin-bottom: 20px;"> | |
| Click on any task to learn more. | |
| </p> | |
| </div> | |
| """, unsafe_allow_html=True) | |

    # Task links mapping - using original task names
    original_task_links = {
        "Backdoor Trigger Recovery": "https://www.llmagentsafetycomp24.com/tracks/#backdoor_model",
        "Machine Unlearning": "https://unlearning-challenge.github.io/",
        "Perception Temporal Action Loc": "https://ptchallenge-workshop.github.io",
        "Product Recommendation": "https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge",
        "Meta Learning": "https://metalearning.chalearn.org/",
        "Llm Merging": "https://llm-merging.github.io",
        "Rainfall Prediction": "https://weather4cast.net/neurips-2023/",
    }

    # Update links mapping to use display names as keys
    task_links = {get_display_name(task): link for task, link in original_task_links.items()}

    # Create two columns
    col1, col2 = st.columns(2)

    # Split tasks between the two columns with better styling
    task_items = list(tasks_info.items())
    mid_point = len(task_items) // 2

    def _task_card(task, link):
        # Clickable task-card HTML shared by both columns.
        return f"""
<a href="{link}" target="_blank" style="text-decoration: none; color: inherit;">
<div class="task-card" style="cursor: pointer; transition: transform 0.2s, box-shadow 0.2s; padding: 12px; margin-bottom: 15px; height: auto;" onmouseover="this.style.transform='translateY(-5px)'; this.style.boxShadow='0 8px 15px rgba(0, 0, 0, 0.2)';" onmouseout="this.style.transform='translateY(0)'; this.style.boxShadow='0 4px 6px rgba(0, 0, 0, 0.15)';">
<div class="task-title" style="text-align: center;">{task} <span style="font-size: 14px; opacity: 0.7;">🔗</span></div>
</div>
</a>
"""

    with col1:
        for task, description in task_items[:mid_point]:
            link = task_links.get(task, "#")
            st.markdown(_task_card(task, link), unsafe_allow_html=True)

    with col2:
        for task, description in task_items[mid_point:]:
            link = task_links.get(task, "#")
            st.markdown(_task_card(task, link), unsafe_allow_html=True)