more work
- data/history.csv +29 -1
- data/leaderboard.csv +7 -7
- data/models.jsonl +2 -2
- models.jsonl +2 -3
- src/app.py +137 -10
- src/data_manager.py +1 -1
- src/judge.py +29 -2
- src/ui.py +2 -2
data/history.csv
CHANGED
@@ -1 +1,29 @@
-timestamp,input,output,judge1_id,judge1_name,judge1_evaluation,judge2_id,judge2_name,judge2_evaluation,winner_id,user_ip
+timestamp,input,output,judge1_id,judge1_name,judge1_evaluation,judge2_id,judge2_name,judge2_evaluation,winner_id,user_ip,judge1_time,judge2_time
+2025-04-24T17:05:59.863028,"Does Neptune And Uranus Have Solid Ground? Credit: Universe Today Neptune and Uranus are gas giants and do not have solid ground. They are mostly made up of hydrogen, helium, and methane. It has been discovered that a mysterious dark storm has formed on Neptune. Hubble has photographed four other mysterious dark vortexes since 1993, the most recent of which was captured on Wednesday.","""solid ground"" can be found on Earth, Jupiter and Mars but not on Neptune and the Sun.",deepseek-v3,DeepSeek V3,"### Evaluation:
+
+1. **Factual Support**: 7/10
+- The text explicitly states that Neptune and Uranus are gas giants and do not have solid ground, which directly supports the claim regarding Neptune.
+- However, the text does not mention Earth, Jupiter, Mars, or the Sun, so there is no direct support for those parts of the claim.
+
+2. **Logical Connection**: 6/10
+- The logical connection is strong for Neptune, as the text clearly contradicts the idea of solid ground there.
+- For the other celestial bodies (Earth, Jupiter, Mars, Sun), the absence of information in the text means the claim cannot be logically validated or refuted based on this text alone.
+
+3. **Claim Accuracy**: 6/10
+- The claim is accurate regarding Neptune (no solid ground), which aligns with the text.
+- The claim about Earth, Jupiter, Mars, and the Sun may or may not be accurate, but the text does not provide evidence either way.
+
+### Summary:
+The claim is **partially supported** by the text. The assertion that Neptune lacks solid ground is correct and backed by the text. However, the text does not address the other celestial bodies mentioned (Earth, Jupiter, Mars, Sun), so those parts of the claim remain unverified in this context. A more complete evaluation would require additional sources discussing those planets and the Sun.",meta-llama-3.1-8b-instruct-turbo,Meta Llama 3.1 8B Instruct,"**Evaluation:**
+
+1. **Factual Support:** 8/10
+The text provides factual support for the claim that Neptune and Uranus do not have solid ground. It states that they are gas giants and are mostly made up of hydrogen, helium, and methane. However, the text does not explicitly mention the Sun, so it does not provide factual support for the claim regarding the Sun.
+
+2. **Logical Connection:** 9/10
+The text provides a logical connection to the claim by stating that Neptune and Uranus are gas giants, which implies that they do not have solid ground. The mention of other planets like Earth, Jupiter, and Mars having solid ground also provides a logical connection to the claim.
+
+3. **Claim Accuracy:** 7/10
+The claim is partially accurate. It is correct that Neptune and Uranus do not have solid ground. However, the claim is not accurate regarding the Sun, as the text does not provide any information about the Sun's composition or the presence of solid ground.
+
+**Summary:**
+The claim is partially supported by the text, but it is not entirely accurate. The text provides factual support for the claim regarding Neptune and Uranus, but it does not provide factual support for the claim regarding the Sun. The logical connection between the text and the claim is strong, but the claim's accuracy is limited by the omission of information about the Sun.",deepseek-v3,,23.25821089744568,2.207097053527832
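The history file now carries two extra columns, judge1_time and judge2_time, holding per-judge latency in seconds. A minimal sketch of reading the extended schema back for analysis, assuming only pandas and the data/history.csv path shown above:

```python
import pandas as pd

# Read the arena history written by save_to_history in src/app.py.
history = pd.read_csv("data/history.csv")

# The new columns record how long each judge took, in seconds.
print(history[["judge1_name", "judge1_time", "judge2_name", "judge2_time", "winner_id"]])

# Example: average latency per judge slot across all recorded matchups.
print(history[["judge1_time", "judge2_time"]].mean())
```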
data/leaderboard.csv
CHANGED
@@ -3,29 +3,29 @@ gemma-2-27b-it,Gemma 2 27B,1749.8091372785384,25.0,0.0,25.0,Google,Open Source
 claude-3-opus-latest,Claude 3 Opus,1531.9661669788793,2.0,0.0,2.0,Anthropic,Proprietary
 mistral-7b-instruct-v0.1,Mistral (7B) Instruct v0.1,1516.736306793522,1.0,0.0,1.0,Mistral AI,Open Source
 qwen-2.5-7b-instruct-turbo,Qwen 2.5 7B Instruct,1516.0,1.0,0.0,1.0,Alibaba,Open Source
-meta-llama-3.
+meta-llama-3.3-70B-instruct-turbo,Meta Llama 4 Scout 32K Instruct,1516.0,1.0,0.0,1.0,Meta,Open Source
+claude-3-sonnet-20240229,Claude 3 Sonnet,1515.263693206478,1.0,0.0,1.0,Anthropic,Proprietary
+gpt-4.1,GPT-4.1,1502.1692789932397,1.0,1.0,2.0,OpenAI,Proprietary
 claude-3-haiku-20240307,Claude 3 Haiku,1501.6053648908744,3.0,3.0,6.0,Anthropic,Proprietary
 gpt-4-turbo,GPT-4 Turbo,1500.736306793522,1.0,1.0,2.0,OpenAI,Proprietary
-deepseek-v3,DeepSeek V3,1500.
+deepseek-v3,DeepSeek V3,1500.6365920113653,1.0,1.0,2.0,DeepSeek,Open Source
 judge1,EvalGPT,1500.0,0.0,0.0,0.0,OpenAI,Commercial
 o3-mini, o3-mini,1500.0,0.0,0.0,0.0,OpenAI,Proprietary
-meta-llama-3.3-70B-instruct-turbo,Meta Llama 4 Scout 32K Instruct,1500.0,0.0,0.0,0.0,Meta,Open Source
 meta-llama-4-scout-17B-16E-instruct,Meta Llama 4 Scout 17B 16E Instruct,1500.0,0.0,0.0,0.0,Meta,Open Source
 judge5,Mixtral,1500.0,0.0,0.0,0.0,Mistral AI,Commercial
 judge4,PrecisionJudge,1500.0,0.0,0.0,0.0,Anthropic,Commercial
-judge3,GradeAssist,1500.0,0.0,0.0,0.0,Anthropic,Commercial
 deepseek-r1,DeepSeek R1,1500.0,0.0,0.0,0.0,DeepSeek,Open Source
-claude-3-sonnet-20240229,Claude 3 Sonnet,1500.0,0.0,0.0,0.0,Anthropic,Proprietary
 claude-3-5-haiku-latest,Claude 3.5 Haiku,1500.0,0.0,0.0,0.0,Anthropic,Proprietary
 atla-selene,Atla Selene,1500.0,0.0,0.0,0.0,Atla,Proprietary
 qwen-2-72b-instruct,Qwen 2 Instruct (72B),1500.0,0.0,0.0,0.0,Alibaba,Open Source
 gemma-2-9b-it,Gemma 2 9B,1500.0,0.0,0.0,0.0,Google,Open Source
 judge2,CritiqueBot,1500.0,0.0,0.0,0.0,OpenAI,Commercial
+judge3,GradeAssist,1500.0,0.0,0.0,0.0,Anthropic,Commercial
 meta-llama-3.1-405b-instruct-turbo,Meta Llama 3.1 405B Instruct,1499.263693206478,1.0,1.0,2.0,Meta,Open Source
 meta-llama-3.1-70b-instruct-turbo,Meta Llama 3.1 70B Instruct,1499.2598341210926,2.0,2.0,4.0,Meta,Open Source
 mistral-7b-instruct-v0.3,Mistral (7B) Instruct v0.3,1487.3632548757455,0.0,2.0,2.0,Mistral AI,Open Source
-gpt-4.1,GPT-4.1,1484.7701398146428,0.0,1.0,1.0,OpenAI,Proprietary
 claude-3-5-sonnet-latest,Claude 3.5 Sonnet,1484.0,0.0,1.0,1.0,Anthropic,Proprietary
-
+meta-llama-3.1-8b-instruct-turbo,Meta Llama 3.1 8B Instruct,1481.194128995395,1.0,2.0,3.0,Meta,Open Source
+gpt-4o,GPT-4o,1468.736306793522,0.0,2.0,2.0,OpenAI,Proprietary
 qwen-2.5-72b-instruct-turbo,Qwen 2.5 72B Instruct,1412.6552679185854,21.0,25.0,46.0,Alibaba,Open Source
 gpt-3.5-turbo,GPT-3.5 Turbo,1318.2061729482512,0.0,21.0,21.0,OpenAI,Proprietary
data/models.jsonl
CHANGED
@@ -26,5 +26,5 @@
 {"id": "qwen-2.5-7b-instruct-turbo", "name": "Qwen 2.5 7B Instruct", "organization": "Alibaba", "license": "Open Source", "api_model": "Qwen/Qwen2.5-7B-Instruct-Turbo", "provider": "together", "parameters": "7B"}
 
 {"id": "atla-selene", "name": "Atla Selene", "organization": "Atla", "license": "Proprietary", "api_model": "atla-selene", "provider": "together", "parameters": "N/A"}
-{"id": "deepseek-v3", "name": "DeepSeek V3", "organization": "DeepSeek", "license": "Open Source", "api_model": "deepseek-
-{"id": "deepseek-r1", "name": "DeepSeek R1", "organization": "DeepSeek", "license": "Open Source", "api_model": "deepseek-
+{"id": "deepseek-v3", "name": "DeepSeek V3", "organization": "DeepSeek", "license": "Open Source", "api_model": "deepseek-ai/DeepSeek-V3", "provider": "together", "parameters": "671B"}
+{"id": "deepseek-r1", "name": "DeepSeek R1", "organization": "DeepSeek", "license": "Open Source", "api_model": "deepseek-ai/DeepSeek-R1", "provider": "together", "parameters": "671B"}
models.jsonl
CHANGED
@@ -25,6 +25,5 @@
 {"id": "qwen-2.5-72b-instruct-turbo", "name": "Qwen 2.5 72B Instruct", "organization": "Alibaba", "license": "Open Source", "api_model": "Qwen/Qwen2.5-72B-Instruct-Turbo", "provider": "together", "parameters": "72B"}
 {"id": "qwen-2.5-7b-instruct-turbo", "name": "Qwen 2.5 7B Instruct", "organization": "Alibaba", "license": "Open Source", "api_model": "Qwen/Qwen2.5-7B-Instruct-Turbo", "provider": "together", "parameters": "7B"}
 
-{"id": "
-{"id": "deepseek-
-{"id": "deepseek-r1", "name": "DeepSeek R1", "organization": "DeepSeek", "license": "Open Source", "api_model": "deepseek-r1", "provider": "together", "parameters": "671B"}
+{"id": "deepseek-v3", "name": "DeepSeek V3", "organization": "DeepSeek", "license": "Open Source", "api_model": "deepseek-ai/DeepSeek-V3", "provider": "together", "parameters": "671B"}
+{"id": "deepseek-r1", "name": "DeepSeek R1", "organization": "DeepSeek", "license": "Open Source", "api_model": "deepseek-ai/DeepSeek-R1", "provider": "together", "parameters": "671B"}
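Both copies of the model registry (data/models.jsonl and the root models.jsonl) now point the DeepSeek entries at fully qualified Together model names. A minimal sketch of reading the JSONL registry, using plain json parsing rather than the app's own loader in src/data_manager.py:

```python
import json

# Parse one JSON object per non-empty line (JSONL).
with open("data/models.jsonl") as f:
    models = [json.loads(line) for line in f if line.strip()]

# After this change the DeepSeek rows resolve to e.g. "deepseek-ai/DeepSeek-V3".
for model in models:
    if model["id"].startswith("deepseek"):
        print(model["id"], "->", model["api_model"], model["parameters"])
```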
src/app.py
CHANGED
@@ -1,8 +1,11 @@
+import datetime
 from typing import Any, Dict, Optional, Tuple
 
 import gradio as gr
+import pandas as pd
 from loguru import logger
 
+from src.config import HISTORY_PATH
 from src.data_manager import get_random_example, load_models
 from src.judge import JudgeManager
 from src.ui import UI
@@ -221,8 +224,11 @@ def get_evaluation1(
         )
         logger.info("Completed evaluation 1")
 
+        # Format evaluation with time
+        display_eval = f"Evaluation time: {eval1['elapsed_time']:.2f} seconds\n\n{eval1['display_evaluation']}\n\n"
+
         # Make the selection button visible once the evaluation is ready
-        return
+        return display_eval, gr.update(visible=True)
     except Exception as e:
         logger.error(f"Error getting evaluation 1: {e}")
         return f"Error: {str(e)}", gr.update(visible=False)
@@ -261,9 +267,12 @@ def get_evaluation2(
         )
         logger.info("Completed evaluation 2")
 
+        # Format evaluation with time
+        display_eval = f"Evaluation time: {eval2['elapsed_time']:.2f} seconds\n\n{eval2['display_evaluation']}\n"
+
         # Make the selection button visible once the evaluation is ready and show additional buttons
         return (
-
+            display_eval,
             gr.update(visible=True),
             gr.update(visible=True),
         )
@@ -299,6 +308,54 @@ def format_inputs_for_evaluation(
     return input_text, output_text
 
 
+def save_to_history(
+    input_text: str,
+    output_text: str,
+    judge1_id: str,
+    judge1_name: str,
+    judge1_evaluation: str,
+    judge1_time: float,
+    judge2_id: str,
+    judge2_name: str,
+    judge2_evaluation: str,
+    judge2_time: float,
+    winner_id: str,
+) -> None:
+    """Save the evaluation results to history CSV file."""
+    try:
+        # Create a new row for the history
+        history_row = {
+            "timestamp": datetime.datetime.now().isoformat(),
+            "input": input_text,
+            "output": output_text,
+            "judge1_id": judge1_id,
+            "judge1_name": judge1_name,
+            "judge1_evaluation": judge1_evaluation,
+            "judge1_time": judge1_time,
+            "judge2_id": judge2_id,
+            "judge2_name": judge2_name,
+            "judge2_evaluation": judge2_evaluation,
+            "judge2_time": judge2_time,
+            "winner_id": winner_id,
+        }
+
+        # Try to load existing history
+        try:
+            history_df = pd.read_csv(HISTORY_PATH)
+        except (FileNotFoundError, pd.errors.EmptyDataError):
+            # Create a new history dataframe if file doesn't exist or is empty
+            history_df = pd.DataFrame(columns=list(history_row.keys()))
+
+        # Append the new row
+        history_df = pd.concat([history_df, pd.DataFrame([history_row])], ignore_index=True)
+
+        # Save to CSV
+        history_df.to_csv(HISTORY_PATH, index=False)
+        logger.info("Saved evaluation to history")
+    except Exception as e:
+        logger.error(f"Error saving to history: {e}")
+
+
 def select_winner(choice: str, judge_manager: JudgeManager) -> str:
     """Select a winner from the evaluations."""
     global eval1, eval2, current_test_type
@@ -307,12 +364,20 @@ def select_winner(choice: str, judge_manager: JudgeManager) -> str:
         if not eval1 or not eval2:
             return "Error: No evaluations available"
 
+        # Get the input and output text that was evaluated
+        input_text, output_text = "", ""
+        if "input_text" in eval1 and "output_text" in eval1:
+            input_text = eval1.get("input_text", "")
+            output_text = eval1.get("output_text", "")
+
         if choice == "Evaluation 1":
             winner_eval = eval1
             loser_eval = eval2
+            winner_id = eval1["judge"]["id"]
         else:
            winner_eval = eval2
            loser_eval = eval1
+            winner_id = eval2["judge"]["id"]
 
         # Update leaderboard
         updated_board = judge_manager.update_leaderboard(
@@ -321,13 +386,29 @@ def select_winner(choice: str, judge_manager: JudgeManager) -> str:
             result_type="win",
         )
 
+        # Save to history
+        save_to_history(
+            input_text=input_text,
+            output_text=output_text,
+            judge1_id=eval1["judge"]["id"],
+            judge1_name=eval1["judge"]["name"],
+            judge1_evaluation=eval1["anonymous_evaluation"],
+            judge1_time=eval1["elapsed_time"],
+            judge2_id=eval2["judge"]["id"],
+            judge2_name=eval2["judge"]["name"],
+            judge2_evaluation=eval2["anonymous_evaluation"],
+            judge2_time=eval2["elapsed_time"],
+            winner_id=winner_id,
+        )
+
         # Construct result message with revealed judges' names
         result_message = f"You selected: {choice}\n\n"
-        result_message += f"Evaluation 1 was by: {eval1['judge']['name']}\n"
-        result_message +=
+        result_message += f"Evaluation 1 was by: {eval1['judge']['name']} (took {eval1['elapsed_time']:.2f} seconds)\n"
+        result_message += (
+            f"Evaluation 2 was by: {eval2['judge']['name']} (took {eval2['elapsed_time']:.2f} seconds)\n\n"
+        )
 
         # Get the winner's new ELO score
-        winner_id = winner_eval["judge"]["id"]
         winner_mask = updated_board["judge_id"] == winner_id
         winner_elo = updated_board[winner_mask]["elo_score"].values[0]
 
@@ -349,6 +430,12 @@ def handle_both_correct(judge_manager: JudgeManager) -> str:
         if not eval1 or not eval2:
             return "Error: No evaluations available"
 
+        # Get the input and output text that was evaluated
+        input_text, output_text = "", ""
+        if "input_text" in eval1 and "output_text" in eval1:
+            input_text = eval1.get("input_text", "")
+            output_text = eval1.get("output_text", "")
+
         # Update leaderboard for both judges
         updated_board = judge_manager.update_leaderboard(
             eval1["judge"]["id"],
@@ -356,10 +443,27 @@ def handle_both_correct(judge_manager: JudgeManager) -> str:
             result_type="both_correct",
         )
 
+        # Save to history with both as winners
+        save_to_history(
+            input_text=input_text,
+            output_text=output_text,
+            judge1_id=eval1["judge"]["id"],
+            judge1_name=eval1["judge"]["name"],
+            judge1_evaluation=eval1["anonymous_evaluation"],
+            judge1_time=eval1["elapsed_time"],
+            judge2_id=eval2["judge"]["id"],
+            judge2_name=eval2["judge"]["name"],
+            judge2_evaluation=eval2["anonymous_evaluation"],
+            judge2_time=eval2["elapsed_time"],
+            winner_id="both",
+        )
+
         # Construct result message with revealed judges' names
         result_message = "You selected: Both Correct\n\n"
-        result_message += f"Evaluation 1 was by: {eval1['judge']['name']}\n"
-        result_message +=
+        result_message += f"Evaluation 1 was by: {eval1['judge']['name']} (took {eval1['elapsed_time']:.2f} seconds)\n"
+        result_message += (
+            f"Evaluation 2 was by: {eval2['judge']['name']} (took {eval2['elapsed_time']:.2f} seconds)\n\n"
+        )
 
         # Get the new ELO scores
         judge1_mask = updated_board["judge_id"] == eval1["judge"]["id"]
@@ -371,7 +475,7 @@ def handle_both_correct(judge_manager: JudgeManager) -> str:
         result_message += "\nBoth judges performed well!\n"
         result_message += f"{eval1['judge']['name']} new ELO: {judge1_elo:.2f}\n"
         result_message += f"{eval2['judge']['name']} new ELO: {judge2_elo:.2f}\n"
-        result_message += "Test Type: {current_test_type}\n"
+        result_message += f"Test Type: {current_test_type}\n"
 
         return result_message
     except Exception as e:
@@ -387,6 +491,12 @@ def handle_both_incorrect(judge_manager: JudgeManager) -> str:
         if not eval1 or not eval2:
             return "Error: No evaluations available"
 
+        # Get the input and output text that was evaluated
+        input_text, output_text = "", ""
+        if "input_text" in eval1 and "output_text" in eval1:
+            input_text = eval1.get("input_text", "")
+            output_text = eval1.get("output_text", "")
+
         # Update leaderboard for both judges
         updated_board = judge_manager.update_leaderboard(
             eval1["judge"]["id"],
@@ -394,10 +504,27 @@ def handle_both_incorrect(judge_manager: JudgeManager) -> str:
             result_type="both_incorrect",
         )
 
+        # Save to history with neither as winner
+        save_to_history(
+            input_text=input_text,
+            output_text=output_text,
+            judge1_id=eval1["judge"]["id"],
+            judge1_name=eval1["judge"]["name"],
+            judge1_evaluation=eval1["anonymous_evaluation"],
+            judge1_time=eval1["elapsed_time"],
+            judge2_id=eval2["judge"]["id"],
+            judge2_name=eval2["judge"]["name"],
+            judge2_evaluation=eval2["anonymous_evaluation"],
+            judge2_time=eval2["elapsed_time"],
+            winner_id="none",
+        )
+
         # Construct result message with revealed judges' names
         result_message = "You selected: Both Incorrect\n\n"
-        result_message += f"Evaluation 1 was by: {eval1['judge']['name']}\n"
-        result_message +=
+        result_message += f"Evaluation 1 was by: {eval1['judge']['name']} (took {eval1['elapsed_time']:.2f} seconds)\n"
+        result_message += (
+            f"Evaluation 2 was by: {eval2['judge']['name']} (took {eval2['elapsed_time']:.2f} seconds)\n\n"
+        )
 
         # Get the new ELO scores
         judge1_mask = updated_board["judge_id"] == eval1["judge"]["id"]
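The new save_to_history helper appends one row per vote using a read-concat-rewrite cycle on the CSV. A standalone sketch of that pattern, writing to a throwaway history_demo.csv instead of src.config.HISTORY_PATH and using made-up row values:

```python
import datetime

import pandas as pd

HISTORY_PATH = "history_demo.csv"  # placeholder; the app uses src.config.HISTORY_PATH

row = {
    "timestamp": datetime.datetime.now().isoformat(),
    "input": "example input",
    "output": "example output",
    "judge1_id": "judge1",
    "judge1_name": "EvalGPT",
    "judge1_evaluation": "Quality: 7/10",
    "judge1_time": 1.23,
    "judge2_id": "judge2",
    "judge2_name": "CritiqueBot",
    "judge2_evaluation": "Quality: 8/10",
    "judge2_time": 2.34,
    "winner_id": "judge2",
}

# Load the existing history, or start an empty frame with the same columns.
try:
    history = pd.read_csv(HISTORY_PATH)
except (FileNotFoundError, pd.errors.EmptyDataError):
    history = pd.DataFrame(columns=list(row.keys()))

# Append the new row and rewrite the whole file, as save_to_history does.
history = pd.concat([history, pd.DataFrame([row])], ignore_index=True)
history.to_csv(HISTORY_PATH, index=False)
```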
src/data_manager.py
CHANGED
@@ -127,7 +127,7 @@ def get_random_example(test_type: str) -> Dict[str, str]:
 
     # Map fields based on test type
     if test_type == "grounding":
-        result["text"] = example.get("
+        result["text"] = example.get("doc", "")
         result["claim"] = example.get("claim", "")
     elif test_type in ["prompt injections", "safety"]:
         result["text"] = example.get("text", "")
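For the grounding test type the example's source document now comes from its "doc" field. A tiny illustration of the mapping, with a made-up example record:

```python
# Hypothetical grounding record; only the "doc" and "claim" keys come from the diff above.
example = {
    "doc": "Neptune and Uranus are gas giants and do not have solid ground.",
    "claim": "Neptune has solid ground.",
}

result = {}
# Same field mapping as get_random_example for test_type == "grounding".
result["text"] = example.get("doc", "")
result["claim"] = example.get("claim", "")
print(result)
```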
src/judge.py
CHANGED
@@ -1,6 +1,9 @@
 import random
+import time
 from typing import Any, Dict, List, Tuple
 
+# Add litellm configuration to handle unsupported parameters
+import litellm
 import pandas as pd
 from litellm import completion
 from loguru import logger
@@ -8,6 +11,8 @@ from together import Together
 
 from src.config import K_FACTOR, LEADERBOARD_PATH
 
+litellm.drop_params = True
+
 
 class JudgeManager:
     """Manages judge evaluations and judge data"""
@@ -79,18 +84,28 @@ class JudgeManager:
     ) -> Dict[str, Any]:
         """Get an evaluation from a judge"""
         try:
+            # Start timing
+            start_time = time.time()
+
             # Create appropriate system prompt based on test type
             system_prompt = self._get_system_prompt(test_type)
 
             # Format user message with input and output
             user_message = self._create_user_message(input_text, output_text, test_type)
 
+            # Set temperature based on model
+            temperature = 0.2
+            # O-series models only support temperature=1
+            if judge["provider"].lower() == "openai" and "o3" in judge["api_model"]:
+                temperature = 1.0
+                logger.info(f"Using temperature=1.0 for O-series model {judge['api_model']}")
+
             # Get evaluation from the API
             if judge["provider"].lower() in ["openai", "anthropic"]:
                 api_response = completion(
                     model=judge["api_model"],
                     messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": user_message}],
-                    temperature=
+                    temperature=temperature,
                     max_tokens=500,
                 )
                 evaluation = api_response.choices[0].message.content
@@ -98,7 +113,7 @@ class JudgeManager:
                 api_response = self.together_client.chat.completions.create(
                     model=judge["api_model"],
                     messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": user_message}],
-                    temperature=
+                    temperature=temperature,
                     max_tokens=500,
                 )
                 evaluation = api_response.choices[0].message.content
@@ -112,18 +127,27 @@ class JudgeManager:
             # Store the full evaluation with judge name for revealing later
             full_eval = f"Evaluation by {judge['name']} (ID: {judge['id']}):\n\n" f"{evaluation}"
 
+            # Calculate elapsed time
+            elapsed_time = time.time() - start_time
+
             return {
                 "judge": judge,
                 "evaluation": full_eval,
                 "display_evaluation": anonymous_eval,
                 "anonymous_evaluation": anonymous_eval,
                 "revealed_evaluation": full_eval,
+                "elapsed_time": elapsed_time,
+                "input_text": input_text,
+                "output_text": output_text,
             }
 
         except Exception as e:
             # Handle API errors gracefully
             logger.error(f"Error getting evaluation from {judge['name']}: {str(e)}")
 
+            # End timing for error case
+            elapsed_time = time.time() - start_time
+
             # Create a fallback evaluation
             metrics = ["Quality: 7/10", "Relevance: 8/10", "Precision: 7/10"]
             comment = f"[Fallback evaluation due to API error: {str(e)}]"
@@ -139,6 +163,9 @@ class JudgeManager:
                 "anonymous_evaluation": anonymous_eval,
                 "revealed_evaluation": full_eval,
                 "error": str(e),
+                "elapsed_time": elapsed_time,
+                "input_text": input_text,
+                "output_text": output_text,
             }
 
     def _create_user_message(self, input_text: str, output_text: str, test_type: str) -> str:
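The judge change wraps each API call in a simple start/stop timer and returns the elapsed seconds alongside the evaluation (litellm.drop_params = True additionally tells litellm to drop request parameters a provider does not support rather than raising). A standalone sketch of the timing pattern, with a stand-in function in place of the real API request:

```python
import time


def fake_judge_call() -> str:
    """Stand-in for the real judge API request; sleeps briefly to simulate latency."""
    time.sleep(0.1)
    return "Quality: 7/10"


# Record a start time, run the call, and attach the elapsed seconds to the
# result, mirroring the elapsed_time field added in src/judge.py.
start_time = time.time()
evaluation = fake_judge_call()
elapsed_time = time.time() - start_time

result = {"evaluation": evaluation, "elapsed_time": elapsed_time}
print(f"Evaluation time: {result['elapsed_time']:.2f} seconds")
```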
src/ui.py
CHANGED
@@ -99,7 +99,7 @@ class UI:
                 label="Test Type",
                 info="Select the type of test to evaluate",
             )
-            refresh_button = gr.Button("
+            refresh_button = gr.Button("Load from a dataset")
 
             # Create different input layouts based on test type
             with gr.Row():
@@ -120,7 +120,7 @@ class UI:
             input_text = gr.Textbox(label="Input", lines=4, visible=False)
             output_text = gr.Textbox(label="Output", lines=6, visible=False)
 
-            submit_button = gr.Button("
+            submit_button = gr.Button("Evaluate")
             status_message = gr.Markdown(visible=False)
 
             with gr.Row():