dror44 commited on
Commit
a1bff60
·
1 Parent(s): 5bda5f1
.cursor/rules/clean-code.mdc ADDED
@@ -0,0 +1,60 @@
1
+ ---
2
+ description:
3
+ globs:
4
+ alwaysApply: true
5
+ ---
6
+ ---
7
+ description: Guidelines for writing clean, maintainable, and human-readable code. Apply these rules when writing or reviewing code to ensure consistency and quality.
8
+ globs:
9
+ ---
10
+ # Clean Code Guidelines
11
+
12
+ ## Constants Over Magic Numbers
13
+ - Replace hard-coded values with named constants
14
+ - Use descriptive constant names that explain the value's purpose
15
+ - Keep constants at the top of the file or in a dedicated constants file
16
+
17
+ ## Meaningful Names
18
+ - Variables, functions, and classes should reveal their purpose
19
+ - Names should explain why something exists and how it's used
20
+ - Avoid abbreviations unless they're universally understood
21
+
22
+ ## Smart Comments
23
+ - Don't comment on what the code does - make the code self-documenting
24
+ - Use comments to explain why something is done a certain way
25
+ - Document APIs, complex algorithms, and non-obvious side effects
26
+
27
+ ## Single Responsibility
28
+ - Each function should do exactly one thing
29
+ - Functions should be small and focused
30
+ - If a function needs a comment to explain what it does, it should be split
31
+
32
+ ## DRY (Don't Repeat Yourself)
33
+ - Extract repeated code into reusable functions
34
+ - Share common logic through proper abstraction
35
+ - Maintain single sources of truth
36
+
37
+ ## Clean Structure
38
+ - Keep related code together
39
+ - Organize code in a logical hierarchy
40
+ - Use consistent file and folder naming conventions
41
+
42
+ ## Encapsulation
43
+ - Hide implementation details
44
+ - Expose clear interfaces
45
+ - Move nested conditionals into well-named functions
46
+
47
+ ## Code Quality Maintenance
48
+ - Refactor continuously
49
+ - Fix technical debt early
50
+ - Leave code cleaner than you found it
51
+
52
+ ## Testing
53
+ - Write tests before fixing bugs
54
+ - Keep tests readable and maintainable
55
+ - Test edge cases and error conditions
56
+
57
+ ## Version Control
58
+ - Write clear commit messages
59
+ - Make small, focused commits
60
+ - Use meaningful branch names
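
A minimal Python sketch of two of the rules above, "Constants Over Magic Numbers" and "Single Responsibility" (hypothetical names, not code from this repository):

```python
# Hypothetical sketch illustrating the clean-code rules above.

# Constants over magic numbers: the value's purpose is in its name.
MAX_RETRIES = 3
TIMEOUT_SECONDS = 30


def is_retryable(status_code: int) -> bool:
    """Single responsibility: only decides whether a status code is transient."""
    return status_code in (429, 500, 502, 503, 504)


def fetch_with_retries(fetch_once):
    """Single responsibility: only handles the retry loop, not parsing or logging."""
    last_error = None
    for _ in range(MAX_RETRIES):
        try:
            return fetch_once(timeout=TIMEOUT_SECONDS)
        except TimeoutError as error:
            last_error = error
    raise last_error
```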
.cursor/rules/cloudflare-worker-typescript.mdc ADDED
@@ -0,0 +1,63 @@
1
+ ---
2
+ description:
3
+ globs:
4
+ alwaysApply: true
5
+ ---
6
+
7
+ ---
8
+ description: TypeScript coding standards and best practices for modern web development
9
+ globs: **/*.ts, **/*.tsx, **/*.d.ts
10
+ ---
11
+
12
+ # TypeScript Best Practices
13
+
14
+ ## Type System
15
+ - Prefer interfaces over types for object definitions
16
+ - Use type for unions, intersections, and mapped types
17
+ - Avoid using `any`, prefer `unknown` for unknown types
18
+ - Use strict TypeScript configuration
19
+ - Leverage TypeScript's built-in utility types
20
+ - Use generics for reusable type patterns
21
+
22
+ ## Naming Conventions
23
+ - Use PascalCase for type names and interfaces
24
+ - Use camelCase for variables and functions
25
+ - Use UPPER_CASE for constants
26
+ - Use descriptive names with auxiliary verbs (e.g., isLoading, hasError)
27
+ - Prefix interfaces for React props with 'Props' (e.g., ButtonProps)
28
+
29
+ ## Code Organization
30
+ - Keep type definitions close to where they're used
31
+ - Export types and interfaces from dedicated type files when shared
32
+ - Use barrel exports (index.ts) for organizing exports
33
+ - Place shared types in a `types` directory
34
+ - Co-locate component props with their components
35
+
36
+ ## Functions
37
+ - Use explicit return types for public functions
38
+ - Use arrow functions for callbacks and methods
39
+ - Implement proper error handling with custom error types
40
+ - Use function overloads for complex type scenarios
41
+ - Prefer async/await over Promises
42
+
43
+ ## Best Practices
44
+ - Enable strict mode in tsconfig.json
45
+ - Use readonly for immutable properties
46
+ - Leverage discriminated unions for type safety
47
+ - Use type guards for runtime type checking
48
+ - Implement proper null checking
49
+ - Avoid type assertions unless necessary
50
+
51
+ ## Error Handling
52
+ - Create custom error types for domain-specific errors
53
+ - Use Result types for operations that can fail
54
+ - Implement proper error boundaries
55
+ - Use try-catch blocks with typed catch clauses
56
+ - Handle Promise rejections properly
57
+
58
+ ## Patterns
59
+ - Use the Builder pattern for complex object creation
60
+ - Implement the Repository pattern for data access
61
+ - Use the Factory pattern for object creation
62
+ - Leverage dependency injection
63
+ - Use the Module pattern for encapsulation
.cursor/rules/nextjs.mdc ADDED
@@ -0,0 +1,57 @@
1
+ ---
2
+ description:
3
+ globs:
4
+ alwaysApply: true
5
+ ---
6
+ ---
7
+ description: Next.js with TypeScript and Tailwind UI best practices
8
+ globs: **/*.tsx, **/*.ts, src/**/*.ts, src/**/*.tsx
9
+ ---
10
+
11
+ # Next.js Best Practices
12
+
13
+ ## Project Structure
14
+ - Use the App Router directory structure
15
+ - Place components in `app` directory for route-specific components
16
+ - Place shared components in `components` directory
17
+ - Place utilities and helpers in `lib` directory
18
+ - Use lowercase with dashes for directories (e.g., `components/auth-wizard`)
19
+
20
+ ## Components
21
+ - Use Server Components by default
22
+ - Mark client components explicitly with 'use client'
23
+ - Wrap client components in Suspense with fallback
24
+ - Use dynamic loading for non-critical components
25
+ - Implement proper error boundaries
26
+ - Place static content and interfaces at file end
27
+
28
+ ## Performance
29
+ - Optimize images: Use WebP format, size data, lazy loading
30
+ - Minimize use of 'useEffect' and 'setState'
31
+ - Favor Server Components (RSC) where possible
32
+ - Use dynamic loading for non-critical components
33
+ - Implement proper caching strategies
34
+
35
+ ## Data Fetching
36
+ - Use Server Components for data fetching when possible
37
+ - Implement proper error handling for data fetching
38
+ - Use appropriate caching strategies
39
+ - Handle loading and error states appropriately
40
+
41
+ ## Routing
42
+ - Use the App Router conventions
43
+ - Implement proper loading and error states for routes
44
+ - Use dynamic routes appropriately
45
+ - Handle parallel routes when needed
46
+
47
+ ## Forms and Validation
48
+ - Use Zod for form validation
49
+ - Implement proper server-side validation
50
+ - Handle form errors appropriately
51
+ - Show loading states during form submission
52
+
53
+ ## State Management
54
+ - Minimize client-side state
55
+ - Use React Context sparingly
56
+ - Prefer server state when possible
57
+ - Implement proper loading states
.cursor/rules/tailwind.mdc ADDED
@@ -0,0 +1,83 @@
1
+ ---
2
+ description:
3
+ globs:
4
+ alwaysApply: false
5
+ ---
6
+ ---
7
+ description: Tailwind CSS and UI component best practices for modern web applications
8
+ globs: **/*.css, **/*.tsx, **/*.jsx, tailwind.config.js, tailwind.config.ts
9
+ ---
10
+
11
+ # Tailwind CSS Best Practices
12
+
13
+ ## Project Setup
14
+ - Use proper Tailwind configuration
15
+ - Configure theme extension properly
16
+ - Set up proper purge configuration
17
+ - Use proper plugin integration
18
+ - Configure custom spacing and breakpoints
19
+ - Set up proper color palette
20
+
21
+ ## Component Styling
22
+ - Use utility classes over custom CSS
23
+ - Group related utilities with @apply when needed
24
+ - Use proper responsive design utilities
25
+ - Implement dark mode properly
26
+ - Use proper state variants
27
+ - Keep component styles consistent
28
+
29
+ ## Layout
30
+ - Use Flexbox and Grid utilities effectively
31
+ - Implement proper spacing system
32
+ - Use container queries when needed
33
+ - Implement proper responsive breakpoints
34
+ - Use proper padding and margin utilities
35
+ - Implement proper alignment utilities
36
+
37
+ ## Typography
38
+ - Use proper font size utilities
39
+ - Implement proper line height
40
+ - Use proper font weight utilities
41
+ - Configure custom fonts properly
42
+ - Use proper text alignment
43
+ - Implement proper text decoration
44
+
45
+ ## Colors
46
+ - Use semantic color naming
47
+ - Implement proper color contrast
48
+ - Use opacity utilities effectively
49
+ - Configure custom colors properly
50
+ - Use proper gradient utilities
51
+ - Implement proper hover states
52
+
53
+ ## Components
54
+ - Use shadcn/ui components when available
55
+ - Extend components properly
56
+ - Keep component variants consistent
57
+ - Implement proper animations
58
+ - Use proper transition utilities
59
+ - Keep accessibility in mind
60
+
61
+ ## Responsive Design
62
+ - Use mobile-first approach
63
+ - Implement proper breakpoints
64
+ - Use container queries effectively
65
+ - Handle different screen sizes properly
66
+ - Implement proper responsive typography
67
+ - Use proper responsive spacing
68
+
69
+ ## Performance
70
+ - Use proper purge configuration
71
+ - Minimize custom CSS
72
+ - Use proper caching strategies
73
+ - Implement proper code splitting
74
+ - Optimize for production
75
+ - Monitor bundle size
76
+
77
+ ## Best Practices
78
+ - Follow naming conventions
79
+ - Keep styles organized
80
+ - Use proper documentation
81
+ - Implement proper testing
82
+ - Follow accessibility guidelines
83
+ - Use proper version control
.cursor/rules/tech-stack.mdc ADDED
@@ -0,0 +1,22 @@
1
+ ---
2
+ description:
3
+ globs:
4
+ alwaysApply: true
5
+ ---
6
+
7
+ # Your rule content
8
+
9
+ - use python version 3.12 only
10
+ - Use FastAPI exclusively
11
+ - always add tests using pytest
12
+ - always add imports at the top of the file, never in a function
13
+ - if code is already defined somewhere, reuse or refactor it
14
+ - files can't exceed 300 lines of code; at that point, refactor
15
+ - don't document obvious lines of code
16
+ - use pydantic for types and objects
17
+ - if a database is needed use SQLModel
18
+ - if a LLM call is needed use LiteLLM
19
+ - if an email is needed use resend
20
+ - use httpx for requests
21
+ - for logging use loguru
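
A minimal sketch of what the stack rules above look like in practice: a FastAPI endpoint with Pydantic models and a pytest test alongside it. The endpoint and model names are hypothetical, not part of this Space:

```python
# Hypothetical sketch of the tech-stack rules: FastAPI + Pydantic + pytest.
from fastapi import FastAPI
from fastapi.testclient import TestClient
from pydantic import BaseModel

app = FastAPI()


class EchoRequest(BaseModel):
    message: str


class EchoResponse(BaseModel):
    message: str
    length: int


@app.post("/echo", response_model=EchoResponse)
def echo(payload: EchoRequest) -> EchoResponse:
    return EchoResponse(message=payload.message, length=len(payload.message))


# "always add tests using pytest": a test colocated with the sketch.
def test_echo() -> None:
    client = TestClient(app)
    response = client.post("/echo", json={"message": "hi"})
    assert response.status_code == 200
    assert response.json() == {"message": "hi", "length": 2}
```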
README.md CHANGED
@@ -11,11 +11,117 @@ short_description: Duplicate this leaderboard to initialize your own!
11
  sdk_version: 5.19.0
12
  ---
13
 
14
  # Start the configuration
15
 
16
  Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).
17
 
18
  Results files should have the following format and be stored as json files:
 
19
  ```json
20
  {
21
  "config": {
@@ -40,7 +146,8 @@ If you encounter problem on the space, don't hesitate to restart it to remove th
40
 
41
  # Code logic for more complex edits
42
 
43
- You'll find
 
44
 - the main table's column names and properties in `src/display/utils.py`
45
 - the logic to read all results and request files, then convert them into dataframe lines, in `src/leaderboard/read_evals.py` and `src/populate.py`
46
- - the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
 
11
  sdk_version: 5.19.0
12
  ---
13
 
14
+ # AI Evaluation Judge Arena
15
+
16
+ An interactive platform for comparing and ranking AI evaluation models (judges) based on human preferences.
17
+
18
+ ## Overview
19
+
20
+ This application allows users to:
21
+
22
+ 1. View AI-generated outputs based on input prompts
23
+ 2. Compare evaluations from two different AI judges
24
+ 3. Select the better evaluation
25
+ 4. Build a leaderboard of judges ranked by ELO score
26
+
27
+ ## Features
28
+
29
+ - **Blind Comparison**: Judge identities are hidden until after selection
30
+ - **ELO Rating System**: Calculates judge rankings based on user preferences
31
+ - **Leaderboard**: Track performance of different evaluation models
32
+ - **Sample Examples**: Includes pre-loaded examples for immediate testing
33
+
34
+ ## Setup
35
+
36
+ ### Prerequisites
37
+
38
+ - Python 3.6+
39
+ - Required packages: gradio, pandas, numpy
40
+
41
+ ### Installation
42
+
43
+ 1. Clone this repository:
44
+
45
+ ```
46
+ git clone https://github.com/yourusername/eval-arena.git
47
+ cd eval-arena
48
+ ```
49
+
50
+ 2. Install dependencies:
51
+
52
+ ```
53
+ pip install -r requirements.txt
54
+ ```
55
+
56
+ 3. Run the application:
57
+
58
+ ```
59
+ python app.py
60
+ ```
61
+
62
+ 4. Open your browser and navigate to the URL displayed in the terminal (typically http://127.0.0.1:7860)
63
+
64
+ ## Usage
65
+
66
+ 1. **Get Random Example**: Click to load a random input/output pair
67
+ 2. **Get Judge Evaluations**: View two anonymous evaluations of the output
68
+ 3. **Select Better Evaluation**: Choose which evaluation you prefer
69
+ 4. **See Results**: Learn which judges you compared and update the leaderboard
70
+ 5. **Leaderboard Tab**: View current rankings of all judges
71
+
72
+ ## Extending the Application
73
+
74
+ ### Adding New Examples
75
+
76
+ Add new examples in JSON format to the `data/examples` directory:
77
+
78
+ ```json
79
+ {
80
+ "id": "example_id",
81
+ "input": "Your input prompt",
82
+ "output": "AI-generated output to evaluate"
83
+ }
84
+ ```
85
+
86
+ ### Adding New Judges
87
+
88
+ Add new judges in JSON format to the `data/judges` directory:
89
+
90
+ ```json
91
+ {
92
+ "id": "judge_id",
93
+ "name": "Judge Name",
94
+ "description": "Description of judge's evaluation approach"
95
+ }
96
+ ```
97
+
98
+ ### Integrating Real Models
99
+
100
+ For production use, modify the `get_random_judges_evaluations` function to call actual AI evaluation models instead of using the simulated evaluations.
101
+
102
+ ## License
103
+
104
+ MIT
105
+
106
+ ## Citation
107
+
108
+ If you use this platform in your research, please cite:
109
+
110
+ ```
111
+ @software{ai_eval_arena,
112
+ author = {Your Name},
113
+ title = {AI Evaluation Judge Arena},
114
+ year = {2023},
115
+ url = {https://github.com/yourusername/eval-arena}
116
+ }
117
+ ```
118
+
119
  # Start the configuration
120
 
121
  Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).
122
 
123
  Results files should have the following format and be stored as json files:
124
+
125
  ```json
126
  {
127
  "config": {
 
146
 
147
  # Code logic for more complex edits
148
 
149
+ You'll find
150
+
151
 - the main table's column names and properties in `src/display/utils.py`
152
 - the logic to read all results and request files, then convert them into dataframe lines, in `src/leaderboard/read_evals.py` and `src/populate.py`
153
+ - the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
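
Regarding the "Integrating Real Models" note in the README above, a rough sketch of how the simulated branch inside `get_random_judges_evaluations` could be replaced with a real model call. The `call_judge_model` helper is hypothetical (it could wrap LiteLLM, an HTTP API via httpx, or any other client); only the surrounding loop and the returned dictionary keys mirror the existing function:

```python
# Sketch only: swapping the simulated evaluations for a real model call.
def call_judge_model(judge: dict, example_input: str, example_output: str) -> str:
    """Hypothetical placeholder for a real client call (e.g. LiteLLM or httpx)."""
    raise NotImplementedError("wire this up to your evaluation model of choice")


def evaluate_with_judges(judges: list[dict], example_input: str, example_output: str) -> list[dict]:
    evaluations = []
    for judge in judges:
        evaluation = call_judge_model(judge, example_input, example_output)
        evaluations.append(
            {
                "judge": judge,
                "evaluation": evaluation,
                # Keep the displayed text free of identifying details for blind comparison.
                "display_evaluation": evaluation.replace(judge["name"], "Anonymous Judge"),
            }
        )
    return evaluations
```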
app.py CHANGED
@@ -1,204 +1,375 @@
1
  import gradio as gr
2
- from gradio_leaderboard import Leaderboard, ColumnFilter, SelectColumns
3
  import pandas as pd
4
- from apscheduler.schedulers.background import BackgroundScheduler
5
- from huggingface_hub import snapshot_download
6
-
7
- from src.about import (
8
- CITATION_BUTTON_LABEL,
9
- CITATION_BUTTON_TEXT,
10
- EVALUATION_QUEUE_TEXT,
11
- INTRODUCTION_TEXT,
12
- LLM_BENCHMARKS_TEXT,
13
- TITLE,
14
- )
15
- from src.display.css_html_js import custom_css
16
- from src.display.utils import (
17
- BENCHMARK_COLS,
18
- COLS,
19
- EVAL_COLS,
20
- EVAL_TYPES,
21
- AutoEvalColumn,
22
- ModelType,
23
- fields,
24
- WeightType,
25
- Precision
26
- )
27
- from src.envs import API, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, QUEUE_REPO, REPO_ID, RESULTS_REPO, TOKEN
28
- from src.populate import get_evaluation_queue_df, get_leaderboard_df
29
- from src.submission.submit import add_new_eval
30
-
31
-
32
- def restart_space():
33
- API.restart_space(repo_id=REPO_ID)
34
-
35
- ### Space initialisation
36
- try:
37
- print(EVAL_REQUESTS_PATH)
38
- snapshot_download(
39
- repo_id=QUEUE_REPO, local_dir=EVAL_REQUESTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
40
- )
41
- except Exception:
42
- restart_space()
43
- try:
44
- print(EVAL_RESULTS_PATH)
45
- snapshot_download(
46
- repo_id=RESULTS_REPO, local_dir=EVAL_RESULTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
47
  )
48
- except Exception:
49
- restart_space()
50
-
51
-
52
- LEADERBOARD_DF = get_leaderboard_df(EVAL_RESULTS_PATH, EVAL_REQUESTS_PATH, COLS, BENCHMARK_COLS)
53
-
54
- (
55
- finished_eval_queue_df,
56
- running_eval_queue_df,
57
- pending_eval_queue_df,
58
- ) = get_evaluation_queue_df(EVAL_REQUESTS_PATH, EVAL_COLS)
59
-
60
- def init_leaderboard(dataframe):
61
- if dataframe is None or dataframe.empty:
62
- raise ValueError("Leaderboard DataFrame is empty or None.")
63
- return Leaderboard(
64
- value=dataframe,
65
- datatype=[c.type for c in fields(AutoEvalColumn)],
66
- select_columns=SelectColumns(
67
- default_selection=[c.name for c in fields(AutoEvalColumn) if c.displayed_by_default],
68
- cant_deselect=[c.name for c in fields(AutoEvalColumn) if c.never_hidden],
69
- label="Select Columns to Display:",
70
- ),
71
- search_columns=[AutoEvalColumn.model.name, AutoEvalColumn.license.name],
72
- hide_columns=[c.name for c in fields(AutoEvalColumn) if c.hidden],
73
- filter_columns=[
74
- ColumnFilter(AutoEvalColumn.model_type.name, type="checkboxgroup", label="Model types"),
75
- ColumnFilter(AutoEvalColumn.precision.name, type="checkboxgroup", label="Precision"),
76
- ColumnFilter(
77
- AutoEvalColumn.params.name,
78
- type="slider",
79
- min=0.01,
80
- max=150,
81
- label="Select the number of parameters (B)",
82
- ),
83
- ColumnFilter(
84
- AutoEvalColumn.still_on_hub.name, type="boolean", label="Deleted/incomplete", default=True
85
- ),
86
- ],
87
- bool_checkboxgroup_label="Hide models",
88
- interactive=False,
89
  )
90
91
 
92
- demo = gr.Blocks(css=custom_css)
93
- with demo:
94
- gr.HTML(TITLE)
95
- gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")
96
 
97
- with gr.Tabs(elem_classes="tab-buttons") as tabs:
98
- with gr.TabItem("🏅 LLM Benchmark", elem_id="llm-benchmark-tab-table", id=0):
99
- leaderboard = init_leaderboard(LEADERBOARD_DF)
100
 
101
- with gr.TabItem("📝 About", elem_id="llm-benchmark-tab-table", id=2):
102
- gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
 
103
 
104
- with gr.TabItem("🚀 Submit here! ", elem_id="llm-benchmark-tab-table", id=3):
105
- with gr.Column():
106
- with gr.Row():
107
- gr.Markdown(EVALUATION_QUEUE_TEXT, elem_classes="markdown-text")
108
-
109
- with gr.Column():
110
- with gr.Accordion(
111
- f"✅ Finished Evaluations ({len(finished_eval_queue_df)})",
112
- open=False,
113
- ):
114
- with gr.Row():
115
- finished_eval_table = gr.components.Dataframe(
116
- value=finished_eval_queue_df,
117
- headers=EVAL_COLS,
118
- datatype=EVAL_TYPES,
119
- row_count=5,
120
- )
121
- with gr.Accordion(
122
- f"🔄 Running Evaluation Queue ({len(running_eval_queue_df)})",
123
- open=False,
124
- ):
125
- with gr.Row():
126
- running_eval_table = gr.components.Dataframe(
127
- value=running_eval_queue_df,
128
- headers=EVAL_COLS,
129
- datatype=EVAL_TYPES,
130
- row_count=5,
131
- )
132
-
133
- with gr.Accordion(
134
- f"⏳ Pending Evaluation Queue ({len(pending_eval_queue_df)})",
135
- open=False,
136
- ):
137
- with gr.Row():
138
- pending_eval_table = gr.components.Dataframe(
139
- value=pending_eval_queue_df,
140
- headers=EVAL_COLS,
141
- datatype=EVAL_TYPES,
142
- row_count=5,
143
- )
144
- with gr.Row():
145
- gr.Markdown("# ✉️✨ Submit your model here!", elem_classes="markdown-text")
146
-
147
- with gr.Row():
148
- with gr.Column():
149
- model_name_textbox = gr.Textbox(label="Model name")
150
- revision_name_textbox = gr.Textbox(label="Revision commit", placeholder="main")
151
- model_type = gr.Dropdown(
152
- choices=[t.to_str(" : ") for t in ModelType if t != ModelType.Unknown],
153
- label="Model type",
154
- multiselect=False,
155
- value=None,
156
- interactive=True,
157
- )
158
-
159
- with gr.Column():
160
- precision = gr.Dropdown(
161
- choices=[i.value.name for i in Precision if i != Precision.Unknown],
162
- label="Precision",
163
- multiselect=False,
164
- value="float16",
165
- interactive=True,
166
- )
167
- weight_type = gr.Dropdown(
168
- choices=[i.value.name for i in WeightType],
169
- label="Weights type",
170
- multiselect=False,
171
- value="Original",
172
- interactive=True,
173
- )
174
- base_model_name_textbox = gr.Textbox(label="Base model (for delta or adapter weights)")
175
-
176
- submit_button = gr.Button("Submit Eval")
177
- submission_result = gr.Markdown()
178
- submit_button.click(
179
- add_new_eval,
180
  [
181
- model_name_textbox,
182
- base_model_name_textbox,
183
- revision_name_textbox,
184
- precision,
185
- weight_type,
186
- model_type,
187
  ],
188
- submission_result,
189
  )
190
 
191
- with gr.Row():
192
- with gr.Accordion("📙 Citation", open=False):
193
- citation_button = gr.Textbox(
194
- value=CITATION_BUTTON_TEXT,
195
- label=CITATION_BUTTON_LABEL,
196
- lines=20,
197
- elem_id="citation-button",
198
- show_copy_button=True,
199
  )
200
 
201
- scheduler = BackgroundScheduler()
202
- scheduler.add_job(restart_space, "interval", seconds=1800)
203
- scheduler.start()
204
- demo.queue(default_concurrency_limit=40).launch()
 
1
+ import json
2
+ import random
3
+ from pathlib import Path
4
+
5
  import gradio as gr
 
6
  import pandas as pd
7
+
8
+ # Constants
9
+ DATA_DIR = Path("data")
10
+ JUDGES_DIR = DATA_DIR / "judges"
11
+ EXAMPLES_DIR = DATA_DIR / "examples"
12
+ LEADERBOARD_PATH = DATA_DIR / "leaderboard.csv"
13
+ HISTORY_PATH = DATA_DIR / "history.csv"
14
+
15
+ # Initialize data directories
16
+ DATA_DIR.mkdir(exist_ok=True)
17
+ JUDGES_DIR.mkdir(exist_ok=True)
18
+ EXAMPLES_DIR.mkdir(exist_ok=True)
19
+
20
+ # ELO calculation parameters
21
+ K_FACTOR = 32 # Standard chess K-factor
22
+
23
+ # Initialize leaderboard if it doesn't exist
24
+ if not LEADERBOARD_PATH.exists():
25
+ leaderboard_df = pd.DataFrame(
26
+ {"judge_id": [], "judge_name": [], "elo_score": [], "wins": [], "losses": [], "total_evaluations": []}
27
  )
28
+ leaderboard_df.to_csv(LEADERBOARD_PATH, index=False)
29
+ else:
30
+ leaderboard_df = pd.read_csv(LEADERBOARD_PATH)
31
+
32
+ # Initialize history if it doesn't exist
33
+ if not HISTORY_PATH.exists():
34
+ history_df = pd.DataFrame(
35
+ {
36
+ "timestamp": [],
37
+ "input": [],
38
+ "output": [],
39
+ "judge1_id": [],
40
+ "judge1_name": [],
41
+ "judge1_evaluation": [],
42
+ "judge2_id": [],
43
+ "judge2_name": [],
44
+ "judge2_evaluation": [],
45
+ "winner_id": [],
46
+ "user_ip": [],
47
+ }
48
  )
49
+ history_df.to_csv(HISTORY_PATH, index=False)
50
+ else:
51
+ history_df = pd.read_csv(HISTORY_PATH)
52
 
53
+ # Sample data - For demonstration purposes
54
+ # In production, these would be loaded from a proper dataset
55
+ if not list(EXAMPLES_DIR.glob("*.json")) and not list(JUDGES_DIR.glob("*.json")):
56
+ # Create sample examples
57
+ sample_examples = [
58
+ {
59
+ "id": "example1",
60
+ "input": "Write a poem about the ocean.",
61
+ "output": "The waves crash and foam,\nSalt spray fills the air like mist,\nOcean breathes deeply.",
62
+ },
63
+ {
64
+ "id": "example2",
65
+ "input": "Explain how photosynthesis works.",
66
+ "output": "Photosynthesis is the process where plants convert sunlight, water, "
67
+ "and carbon dioxide into glucose and oxygen. The chlorophyll in plant "
68
+ "cells captures light energy, which is then used to convert CO2 and "
69
+ "water into glucose, releasing oxygen as a byproduct.",
70
+ },
71
+ {
72
+ "id": "example3",
73
+ "input": "Solve this math problem: If x + y = 10 and x - y = 4, what are x and y?",
74
+ "output": "To solve this system of equations:\nx + y = 10\nx - y = 4\n\n"
75
+ "Add these equations:\n2x = 14\nx = 7\n\nSubstitute back:\n"
76
+ "7 + y = 10\ny = 3\n\nTherefore, x = 7 and y = 3.",
77
+ },
78
+ ]
79
 
80
+ for example in sample_examples:
81
+ with open(EXAMPLES_DIR / f"{example['id']}.json", "w") as f:
82
+ json.dump(example, f)
 
83
 
84
+ # Create sample judges
85
+ sample_judges = [
86
+ {
87
+ "id": "judge1",
88
+ "name": "EvalGPT",
89
+ "description": "A comprehensive evaluation model focused on accuracy and completeness",
90
+ },
91
+ {
92
+ "id": "judge2",
93
+ "name": "CritiqueBot",
94
+ "description": "An evaluation model specializing in identifying factual errors",
95
+ },
96
+ {
97
+ "id": "judge3",
98
+ "name": "GradeAssist",
99
+ "description": "A holistic evaluation model that balances substance and style",
100
+ },
101
+ {
102
+ "id": "judge4",
103
+ "name": "PrecisionJudge",
104
+ "description": "A technical evaluator that emphasizes precision and correctness",
105
+ },
106
+ ]
107
 
108
+ for judge in sample_judges:
109
+ with open(JUDGES_DIR / f"{judge['id']}.json", "w") as f:
110
+ json.dump(judge, f)
111
 
112
+ # Initialize leaderboard with sample judges
113
+ for judge in sample_judges:
114
+ if judge["id"] not in leaderboard_df["judge_id"].values:
115
+ leaderboard_df = pd.concat(
116
  [
117
+ leaderboard_df,
118
+ pd.DataFrame(
119
+ {
120
+ "judge_id": [judge["id"]],
121
+ "judge_name": [judge["name"]],
122
+ "elo_score": [1500], # Starting ELO
123
+ "wins": [0],
124
+ "losses": [0],
125
+ "total_evaluations": [0],
126
+ }
127
+ ),
128
  ],
129
+ ignore_index=True,
130
  )
131
 
132
+ leaderboard_df.to_csv(LEADERBOARD_PATH, index=False)
133
+
134
+
135
+ # Function to get a random example
136
+ def get_random_example():
137
+ example_files = list(EXAMPLES_DIR.glob("*.json"))
138
+ if not example_files:
139
+ return {"input": "No examples available", "output": ""}
140
+
141
+ example_file = random.choice(example_files)
142
+ with open(example_file, "r") as f:
143
+ example = json.load(f)
144
+
145
+ return example
146
+
147
+
148
+ # Function to get random judges' evaluations
149
+ def get_random_judges_evaluations(example_input, example_output):
150
+ judge_files = list(JUDGES_DIR.glob("*.json"))
151
+ if len(judge_files) < 2:
152
+ return None, None
153
+
154
+ # Choose two different judges
155
+ selected_judge_files = random.sample(judge_files, 2)
156
+
157
+ judges = []
158
+ for judge_file in selected_judge_files:
159
+ with open(judge_file, "r") as f:
160
+ judge = json.load(f)
161
+ judges.append(judge)
162
+
163
+ # In a real application, we'd call the judge models here
164
+ # For demonstration, we'll create sample evaluations
165
+ evaluations = []
166
+ for judge in judges:
167
+ # Simulate different evaluation styles
168
+ if "factual" in judge["description"].lower():
169
+ evaluation = (
170
+ f"Evaluation by {judge['name']} (ID: {judge['id']}):\n\n"
171
+ f"Factual Accuracy: 8/10\nCompleteness: 7/10\n"
172
+ f"Conciseness: 9/10\n\nThe response addresses the question "
173
+ f"but could include more specific details."
174
+ )
175
+ elif "holistic" in judge["description"].lower():
176
+ evaluation = (
177
+ f"Evaluation by {judge['name']} (ID: {judge['id']}):\n\n"
178
+ f"Content Quality: 7/10\nStructure: 8/10\nClarity: 9/10\n\n"
179
+ f"The response is clear and well-structured, though it could "
180
+ f"be more comprehensive."
181
  )
182
+ elif "technical" in judge["description"].lower():
183
+ evaluation = (
184
+ f"Evaluation by {judge['name']} (ID: {judge['id']}):\n\n"
185
+ f"Technical Accuracy: 9/10\nCompleteness: 7/10\n"
186
+ f"Logical Flow: 8/10\n\nThe technical aspects are accurate, "
187
+ f"but some key details are missing."
188
+ )
189
+ else:
190
+ evaluation = (
191
+ f"Evaluation by {judge['name']} (ID: {judge['id']}):\n\n"
192
+ f"Quality: 8/10\nRelevance: 9/10\nPrecision: 7/10\n\n"
193
+ f"The response is relevant and of good quality, but could "
194
+ f"be more precise."
195
+ )
196
+
197
+ # Remove the judge ID from the displayed evaluation for blindness
198
+ display_evaluation = evaluation.replace(f" (ID: {judge['id']})", "")
199
+
200
+ evaluations.append({"judge": judge, "evaluation": evaluation, "display_evaluation": display_evaluation})
201
+
202
+ return evaluations[0], evaluations[1]
203
+
204
+
205
+ # Calculate new ELO scores
206
+ def calculate_elo(winner_rating, loser_rating):
207
+ expected_winner = 1 / (1 + 10 ** ((loser_rating - winner_rating) / 400))
208
+ expected_loser = 1 / (1 + 10 ** ((winner_rating - loser_rating) / 400))
209
+
210
+ new_winner_rating = winner_rating + K_FACTOR * (1 - expected_winner)
211
+ new_loser_rating = loser_rating + K_FACTOR * (0 - expected_loser)
212
+
213
+ return new_winner_rating, new_loser_rating
214
+
215
+
216
+ # Update leaderboard after a comparison
217
+ def update_leaderboard(winner_id, loser_id):
218
+ global leaderboard_df
219
+
220
+ # Get current ratings
221
+ winner_row = leaderboard_df[leaderboard_df["judge_id"] == winner_id].iloc[0]
222
+ loser_row = leaderboard_df[leaderboard_df["judge_id"] == loser_id].iloc[0]
223
+
224
+ winner_rating = winner_row["elo_score"]
225
+ loser_rating = loser_row["elo_score"]
226
+
227
+ # Calculate new ratings
228
+ new_winner_rating, new_loser_rating = calculate_elo(winner_rating, loser_rating)
229
+
230
+ # Update dataframe
231
+ leaderboard_df.loc[leaderboard_df["judge_id"] == winner_id, "elo_score"] = new_winner_rating
232
+ leaderboard_df.loc[leaderboard_df["judge_id"] == loser_id, "elo_score"] = new_loser_rating
233
+
234
+ # Update win/loss counts
235
+ leaderboard_df.loc[leaderboard_df["judge_id"] == winner_id, "wins"] += 1
236
+ leaderboard_df.loc[leaderboard_df["judge_id"] == loser_id, "losses"] += 1
237
+
238
+ # Update total evaluations
239
+ leaderboard_df.loc[leaderboard_df["judge_id"] == winner_id, "total_evaluations"] += 1
240
+ leaderboard_df.loc[leaderboard_df["judge_id"] == loser_id, "total_evaluations"] += 1
241
+
242
+ # Sort by ELO score and save
243
+ leaderboard_df = leaderboard_df.sort_values(by="elo_score", ascending=False).reset_index(drop=True)
244
+ leaderboard_df.to_csv(LEADERBOARD_PATH, index=False)
245
+
246
+ return leaderboard_df
247
+
248
+
249
+ # Gradio interface functions
250
+ def refresh_example():
251
+ example = get_random_example()
252
+ return example["input"], example["output"]
253
+
254
+
255
+ def submit_example(input_text, output_text):
256
+ # Global state to store evaluations
257
+ global eval1, eval2
258
+
259
+ eval1, eval2 = get_random_judges_evaluations(input_text, output_text)
260
+
261
+ if not eval1 or not eval2:
262
+ return ("Error: Not enough judges available", "Error: Not enough judges available", None, None)
263
+
264
+ return (eval1["display_evaluation"], eval2["display_evaluation"], gr.update(visible=True), gr.update(visible=True))
265
+
266
+
267
+ def select_winner(choice):
268
+ if not eval1 or not eval2:
269
+ return "Error: No evaluations available"
270
+
271
+ if choice == "Evaluation 1":
272
+ winner_eval = eval1
273
+ loser_eval = eval2
274
+ else:
275
+ winner_eval = eval2
276
+ loser_eval = eval1
277
+
278
+ # Update leaderboard
279
+ updated_leaderboard = update_leaderboard(winner_eval["judge"]["id"], loser_eval["judge"]["id"])
280
+
281
+ # Construct result message
282
+ result_message = f"You selected: {choice}\n\n"
283
+ result_message += f"Evaluation 1 was by: {eval1['judge']['name']} "
284
+ result_message += f"({eval1['judge']['description']})\n"
285
+ result_message += f"Evaluation 2 was by: {eval2['judge']['name']} "
286
+ result_message += f"({eval2['judge']['description']})\n\n"
287
+
288
+ winner_elo = updated_leaderboard[updated_leaderboard["judge_id"] == winner_eval["judge"]["id"]][
289
+ "elo_score"
290
+ ].values[0]
291
+
292
+ result_message += f"Winner: {winner_eval['judge']['name']} (New ELO: {winner_elo:.2f})\n"
293
+
294
+ return result_message
295
+
296
+
297
+ # Create Gradio interface
298
+ with gr.Blocks(title="AI Evaluation Judge Arena") as demo:
299
+ gr.Markdown("# AI Evaluation Judge Arena")
300
+ gr.Markdown(
301
+ "Choose which AI judge provides better evaluation of the output. "
302
+ "The judges' identities are hidden until you make your choice."
303
+ )
304
+
305
+ with gr.Tab("Evaluate Judges"):
306
+ with gr.Row():
307
+ with gr.Column(scale=1):
308
+ refresh_button = gr.Button("Get Random Example")
309
+
310
+ with gr.Column(scale=2):
311
+ input_text = gr.Textbox(label="Input", lines=4)
312
+ output_text = gr.Textbox(label="Output", lines=6)
313
+ submit_button = gr.Button("Get Judge Evaluations")
314
+
315
+ with gr.Row():
316
+ with gr.Column():
317
+ evaluation1 = gr.Textbox(label="Evaluation 1", lines=10)
318
+ select_eval1 = gr.Button("Select Evaluation 1", visible=False)
319
+
320
+ with gr.Column():
321
+ evaluation2 = gr.Textbox(label="Evaluation 2", lines=10)
322
+ select_eval2 = gr.Button("Select Evaluation 2", visible=False)
323
+
324
+ result_text = gr.Textbox(label="Result", lines=6)
325
+
326
+ with gr.Tab("Leaderboard"):
327
+ leaderboard_dataframe = gr.DataFrame(
328
+ value=leaderboard_df,
329
+ headers=["Judge Name", "ELO Score", "Wins", "Losses", "Total Evaluations"],
330
+ datatype=["str", "number", "number", "number", "number"],
331
+ col_count=(5, "fixed"),
332
+ interactive=False,
333
+ )
334
+ refresh_leaderboard = gr.Button("Refresh Leaderboard")
335
+
336
+ with gr.Tab("About"):
337
+ gr.Markdown(
338
+ """
339
+ ## About AI Evaluation Judge Arena
340
+
341
+ This platform allows users to compare and rate different AI evaluation models (judges).
342
+
343
+ ### How it works:
344
+ 1. You are presented with an input prompt and AI-generated output
345
+ 2. Two AI judges provide evaluations of the output
346
+ 3. You select which evaluation you think is better
347
+ 4. The judges' identities are revealed, and their ELO ratings are updated
348
+
349
+ ### ELO Rating System
350
+ The platform uses the ELO rating system (like in chess) to rank the judges. When you choose a winner:
351
+ - The winning judge gains ELO points
352
+ - The losing judge loses ELO points
353
+ - The amount of points transferred depends on the difference in current ratings
354
+
355
+ ### Purpose
356
+ This platform helps determine which AI evaluation methods are most aligned with human preferences.
357
+ """
358
+ )
359
+
360
+ # Set up event handlers
361
+ refresh_button.click(refresh_example, [], [input_text, output_text])
362
+ submit_button.click(
363
+ submit_example, [input_text, output_text], [evaluation1, evaluation2, select_eval1, select_eval2]
364
+ )
365
+ select_eval1.click(lambda: select_winner("Evaluation 1"), [], result_text)
366
+ select_eval2.click(lambda: select_winner("Evaluation 2"), [], result_text)
367
+ refresh_leaderboard.click(lambda: leaderboard_df, [], leaderboard_dataframe)
368
+
369
+ # Initialize global variables for evaluation state
370
+ eval1 = None
371
+ eval2 = None
372
 
373
+ # Launch the app
374
+ if __name__ == "__main__":
375
+ demo.launch()
 
data/examples/example1.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"id": "example1", "input": "Write a poem about the ocean.", "output": "The waves crash and foam,\nSalt spray fills the air like mist,\nOcean breathes deeply."}
data/examples/example2.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"id": "example2", "input": "Explain how photosynthesis works.", "output": "Photosynthesis is the process where plants convert sunlight, water, and carbon dioxide into glucose and oxygen. The chlorophyll in plant cells captures light energy, which is then used to convert CO2 and water into glucose, releasing oxygen as a byproduct."}
data/examples/example3.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"id": "example3", "input": "Solve this math problem: If x + y = 10 and x - y = 4, what are x and y?", "output": "To solve this system of equations:\nx + y = 10\nx - y = 4\n\nAdd these equations:\n2x = 14\nx = 7\n\nSubstitute back:\n7 + y = 10\ny = 3\n\nTherefore, x = 7 and y = 3."}
data/history.csv ADDED
@@ -0,0 +1 @@
 
 
1
+ timestamp,input,output,judge1_id,judge1_name,judge1_evaluation,judge2_id,judge2_name,judge2_evaluation,winner_id,user_ip
data/judges/judge1.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"id": "judge1", "name": "EvalGPT", "description": "A comprehensive evaluation model focused on accuracy and completeness"}
data/judges/judge2.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"id": "judge2", "name": "CritiqueBot", "description": "An evaluation model specializing in identifying factual errors"}
data/judges/judge3.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"id": "judge3", "name": "GradeAssist", "description": "A holistic evaluation model that balances substance and style"}
data/judges/judge4.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"id": "judge4", "name": "PrecisionJudge", "description": "A technical evaluator that emphasizes precision and correctness"}
data/leaderboard.csv ADDED
@@ -0,0 +1,5 @@
1
+ judge_id,judge_name,elo_score,wins,losses,total_evaluations
2
+ judge2,CritiqueBot,1516.0,1.0,0.0,1.0
3
+ judge4,PrecisionJudge,1500.736306793522,1.0,1.0,2.0
4
+ judge3,GradeAssist,1500.0,0.0,0.0,0.0
5
+ judge1,EvalGPT,1483.263693206478,0.0,1.0,1.0
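
For reference, the seeded ELO scores above are consistent with the `calculate_elo` function and `K_FACTOR = 32` in `app.py`, with every judge starting at 1500. The empty `history.csv` does not record the match order, but the following reconstruction (an assumption) reproduces the stored values exactly:

```python
# Reproduces the numbers in data/leaderboard.csv with the formula from app.py.
K_FACTOR = 32


def calculate_elo(winner_rating, loser_rating):
    expected_winner = 1 / (1 + 10 ** ((loser_rating - winner_rating) / 400))
    expected_loser = 1 / (1 + 10 ** ((winner_rating - loser_rating) / 400))
    return (
        winner_rating + K_FACTOR * (1 - expected_winner),
        loser_rating + K_FACTOR * (0 - expected_loser),
    )


ratings = {"CritiqueBot": 1500.0, "PrecisionJudge": 1500.0, "GradeAssist": 1500.0, "EvalGPT": 1500.0}

# Match 1: CritiqueBot beats PrecisionJudge (equal ratings, so 16 points change hands).
ratings["CritiqueBot"], ratings["PrecisionJudge"] = calculate_elo(ratings["CritiqueBot"], ratings["PrecisionJudge"])

# Match 2: PrecisionJudge (now 1484) beats EvalGPT (still 1500).
ratings["PrecisionJudge"], ratings["EvalGPT"] = calculate_elo(ratings["PrecisionJudge"], ratings["EvalGPT"])

print(ratings)
# {'CritiqueBot': 1516.0, 'PrecisionJudge': 1500.73..., 'GradeAssist': 1500.0, 'EvalGPT': 1483.26...}
```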
requirements.txt CHANGED
@@ -1,16 +1,3 @@
1
- APScheduler
2
- black
3
- datasets
4
  gradio
5
- gradio[oauth]
6
- gradio_leaderboard==0.0.13
7
- gradio_client
8
- huggingface-hub>=0.18.0
9
- matplotlib
10
  numpy
11
  pandas
12
- python-dateutil
13
- tqdm
14
- transformers
15
- tokenizers>=0.15.0
16
- sentencepiece
1
  gradio
2
  numpy
3
  pandas