wip
- .cursor/rules/clean-code.mdc +60 -0
- .cursor/rules/cloudflare-worker-typescript.mdc +63 -0
- .cursor/rules/nextjs.mdc +57 -0
- .cursor/rules/tailwind.mdc +83 -0
- .cursor/rules/tech-stack.mdc +22 -0
- README.md +109 -2
- app.py +360 -189
- data/examples/example1.json +1 -0
- data/examples/example2.json +1 -0
- data/examples/example3.json +1 -0
- data/history.csv +1 -0
- data/judges/judge1.json +1 -0
- data/judges/judge2.json +1 -0
- data/judges/judge3.json +1 -0
- data/judges/judge4.json +1 -0
- data/leaderboard.csv +5 -0
- requirements.txt +0 -13
.cursor/rules/clean-code.mdc
ADDED
@@ -0,0 +1,60 @@
+---
+description:
+globs:
+alwaysApply: true
+---
+---
+description: Guidelines for writing clean, maintainable, and human-readable code. Apply these rules when writing or reviewing code to ensure consistency and quality.
+globs:
+---
+# Clean Code Guidelines
+
+## Constants Over Magic Numbers
+- Replace hard-coded values with named constants
+- Use descriptive constant names that explain the value's purpose
+- Keep constants at the top of the file or in a dedicated constants file
+
+## Meaningful Names
+- Variables, functions, and classes should reveal their purpose
+- Names should explain why something exists and how it's used
+- Avoid abbreviations unless they're universally understood
+
+## Smart Comments
+- Don't comment on what the code does - make the code self-documenting
+- Use comments to explain why something is done a certain way
+- Document APIs, complex algorithms, and non-obvious side effects
+
+## Single Responsibility
+- Each function should do exactly one thing
+- Functions should be small and focused
+- If a function needs a comment to explain what it does, it should be split
+
+## DRY (Don't Repeat Yourself)
+- Extract repeated code into reusable functions
+- Share common logic through proper abstraction
+- Maintain single sources of truth
+
+## Clean Structure
+- Keep related code together
+- Organize code in a logical hierarchy
+- Use consistent file and folder naming conventions
+
+## Encapsulation
+- Hide implementation details
+- Expose clear interfaces
+- Move nested conditionals into well-named functions
+
+## Code Quality Maintenance
+- Refactor continuously
+- Fix technical debt early
+- Leave code cleaner than you found it
+
+## Testing
+- Write tests before fixing bugs
+- Keep tests readable and maintainable
+- Test edge cases and error conditions
+
+## Version Control
+- Write clear commit messages
+- Make small, focused commits
+- Use meaningful branch names
.cursor/rules/cloudflare-worker-typescript.mdc
ADDED
@@ -0,0 +1,63 @@
+---
+description:
+globs:
+alwaysApply: true
+---
+
+---
+description: TypeScript coding standards and best practices for modern web development
+globs: **/*.ts, **/*.tsx, **/*.d.ts
+---
+
+# TypeScript Best Practices
+
+## Type System
+- Prefer interfaces over types for object definitions
+- Use type for unions, intersections, and mapped types
+- Avoid using `any`, prefer `unknown` for unknown types
+- Use strict TypeScript configuration
+- Leverage TypeScript's built-in utility types
+- Use generics for reusable type patterns
+
+## Naming Conventions
+- Use PascalCase for type names and interfaces
+- Use camelCase for variables and functions
+- Use UPPER_CASE for constants
+- Use descriptive names with auxiliary verbs (e.g., isLoading, hasError)
+- Prefix interfaces for React props with 'Props' (e.g., ButtonProps)
+
+## Code Organization
+- Keep type definitions close to where they're used
+- Export types and interfaces from dedicated type files when shared
+- Use barrel exports (index.ts) for organizing exports
+- Place shared types in a `types` directory
+- Co-locate component props with their components
+
+## Functions
+- Use explicit return types for public functions
+- Use arrow functions for callbacks and methods
+- Implement proper error handling with custom error types
+- Use function overloads for complex type scenarios
+- Prefer async/await over Promises
+
+## Best Practices
+- Enable strict mode in tsconfig.json
+- Use readonly for immutable properties
+- Leverage discriminated unions for type safety
+- Use type guards for runtime type checking
+- Implement proper null checking
+- Avoid type assertions unless necessary
+
+## Error Handling
+- Create custom error types for domain-specific errors
+- Use Result types for operations that can fail
+- Implement proper error boundaries
+- Use try-catch blocks with typed catch clauses
+- Handle Promise rejections properly
+
+## Patterns
+- Use the Builder pattern for complex object creation
+- Implement the Repository pattern for data access
+- Use the Factory pattern for object creation
+- Leverage dependency injection
+- Use the Module pattern for encapsulation
.cursor/rules/nextjs.mdc
ADDED
@@ -0,0 +1,57 @@
+---
+description:
+globs:
+alwaysApply: true
+---
+---
+description: Next.js with TypeScript and Tailwind UI best practices
+globs: **/*.tsx, **/*.ts, src/**/*.ts, src/**/*.tsx
+---
+
+# Next.js Best Practices
+
+## Project Structure
+- Use the App Router directory structure
+- Place components in `app` directory for route-specific components
+- Place shared components in `components` directory
+- Place utilities and helpers in `lib` directory
+- Use lowercase with dashes for directories (e.g., `components/auth-wizard`)
+
+## Components
+- Use Server Components by default
+- Mark client components explicitly with 'use client'
+- Wrap client components in Suspense with fallback
+- Use dynamic loading for non-critical components
+- Implement proper error boundaries
+- Place static content and interfaces at file end
+
+## Performance
+- Optimize images: Use WebP format, size data, lazy loading
+- Minimize use of 'useEffect' and 'setState'
+- Favor Server Components (RSC) where possible
+- Use dynamic loading for non-critical components
+- Implement proper caching strategies
+
+## Data Fetching
+- Use Server Components for data fetching when possible
+- Implement proper error handling for data fetching
+- Use appropriate caching strategies
+- Handle loading and error states appropriately
+
+## Routing
+- Use the App Router conventions
+- Implement proper loading and error states for routes
+- Use dynamic routes appropriately
+- Handle parallel routes when needed
+
+## Forms and Validation
+- Use Zod for form validation
+- Implement proper server-side validation
+- Handle form errors appropriately
+- Show loading states during form submission
+
+## State Management
+- Minimize client-side state
+- Use React Context sparingly
+- Prefer server state when possible
+- Implement proper loading states
.cursor/rules/tailwind.mdc
ADDED
@@ -0,0 +1,83 @@
+---
+description:
+globs:
+alwaysApply: false
+---
+---
+description: Tailwind CSS and UI component best practices for modern web applications
+globs: **/*.css, **/*.tsx, **/*.jsx, tailwind.config.js, tailwind.config.ts
+---
+
+# Tailwind CSS Best Practices
+
+## Project Setup
+- Use proper Tailwind configuration
+- Configure theme extension properly
+- Set up proper purge configuration
+- Use proper plugin integration
+- Configure custom spacing and breakpoints
+- Set up proper color palette
+
+## Component Styling
+- Use utility classes over custom CSS
+- Group related utilities with @apply when needed
+- Use proper responsive design utilities
+- Implement dark mode properly
+- Use proper state variants
+- Keep component styles consistent
+
+## Layout
+- Use Flexbox and Grid utilities effectively
+- Implement proper spacing system
+- Use container queries when needed
+- Implement proper responsive breakpoints
+- Use proper padding and margin utilities
+- Implement proper alignment utilities
+
+## Typography
+- Use proper font size utilities
+- Implement proper line height
+- Use proper font weight utilities
+- Configure custom fonts properly
+- Use proper text alignment
+- Implement proper text decoration
+
+## Colors
+- Use semantic color naming
+- Implement proper color contrast
+- Use opacity utilities effectively
+- Configure custom colors properly
+- Use proper gradient utilities
+- Implement proper hover states
+
+## Components
+- Use shadcn/ui components when available
+- Extend components properly
+- Keep component variants consistent
+- Implement proper animations
+- Use proper transition utilities
+- Keep accessibility in mind
+
+## Responsive Design
+- Use mobile-first approach
+- Implement proper breakpoints
+- Use container queries effectively
+- Handle different screen sizes properly
+- Implement proper responsive typography
+- Use proper responsive spacing
+
+## Performance
+- Use proper purge configuration
+- Minimize custom CSS
+- Use proper caching strategies
+- Implement proper code splitting
+- Optimize for production
+- Monitor bundle size
+
+## Best Practices
+- Follow naming conventions
+- Keep styles organized
+- Use proper documentation
+- Implement proper testing
+- Follow accessibility guidelines
+- Use proper version control
.cursor/rules/tech-stack.mdc
ADDED
@@ -0,0 +1,22 @@
+---
+description:
+globs:
+alwaysApply: true
+---
+
+# Your rule content
+
+- use python version 3.12 only
+- Use FastAPI exclusively
+- always add tests using pytest
+- always add imports at the top of the file, never in a function
+- if code is defined somewhere, reuse it or refactor it
+- files can't exceed 300 lines of code; at that point, refactor
+- don't document obvious lines of code
+- use pydantic for types and objects
+- if a database is needed, use SQLModel
+- if an LLM call is needed, use LiteLLM
+- if an email is needed, use resend
+- use httpx for requests
+- for logging use loguru
+-
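
For concreteness, a minimal sketch of what code following these rules might look like (hypothetical, not part of this commit; the endpoint and names are illustrative): FastAPI with a pydantic model and loguru logging, imports at the top.

```python
# Illustrative sketch of the tech-stack rules: FastAPI, pydantic, loguru.
# The Judge model mirrors the data/judges JSON files; the route is made up.
from fastapi import FastAPI
from loguru import logger
from pydantic import BaseModel

app = FastAPI()


class Judge(BaseModel):
    id: str
    name: str
    description: str


@app.post("/judges")
def register_judge(judge: Judge) -> Judge:
    logger.info(f"Registering judge {judge.id}")
    return judge
```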
README.md
CHANGED
@@ -11,11 +11,117 @@ short_description: Duplicate this leaderboard to initialize your own!
 sdk_version: 5.19.0
 ---
 
+# AI Evaluation Judge Arena
+
+An interactive platform for comparing and ranking AI evaluation models (judges) based on human preferences.
+
+## Overview
+
+This application allows users to:
+
+1. View AI-generated outputs based on input prompts
+2. Compare evaluations from two different AI judges
+3. Select the better evaluation
+4. Build a leaderboard of judges ranked by ELO score
+
+## Features
+
+- **Blind Comparison**: Judge identities are hidden until after selection
+- **ELO Rating System**: Calculates judge rankings based on user preferences
+- **Leaderboard**: Track performance of different evaluation models
+- **Sample Examples**: Includes pre-loaded examples for immediate testing
+
+## Setup
+
+### Prerequisites
+
+- Python 3.6+
+- Required packages: gradio, pandas, numpy
+
+### Installation
+
+1. Clone this repository:
+
+```
+git clone https://github.com/yourusername/eval-arena.git
+cd eval-arena
+```
+
+2. Install dependencies:
+
+```
+pip install -r requirements.txt
+```
+
+3. Run the application:
+
+```
+python app.py
+```
+
+4. Open your browser and navigate to the URL displayed in the terminal (typically http://127.0.0.1:7860)
+
+## Usage
+
+1. **Get Random Example**: Click to load a random input/output pair
+2. **Get Judge Evaluations**: View two anonymous evaluations of the output
+3. **Select Better Evaluation**: Choose which evaluation you prefer
+4. **See Results**: Learn which judges you compared and update the leaderboard
+5. **Leaderboard Tab**: View current rankings of all judges
+
+## Extending the Application
+
+### Adding New Examples
+
+Add new examples in JSON format to the `data/examples` directory:
+
+```json
+{
+  "id": "example_id",
+  "input": "Your input prompt",
+  "output": "AI-generated output to evaluate"
+}
+```
+
+### Adding New Judges
+
+Add new judges in JSON format to the `data/judges` directory:
+
+```json
+{
+  "id": "judge_id",
+  "name": "Judge Name",
+  "description": "Description of judge's evaluation approach"
+}
+```
+
+### Integrating Real Models
+
+For production use, modify the `get_random_judges_evaluations` function to call actual AI evaluation models instead of using the simulated evaluations.
+
+## License
+
+MIT
+
+## Citation
+
+If you use this platform in your research, please cite:
+
+```
+@software{ai_eval_arena,
+  author = {Your Name},
+  title = {AI Evaluation Judge Arena},
+  year = {2023},
+  url = {https://github.com/yourusername/eval-arena}
+}
+```
+
 # Start the configuration
 
 Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).
 
 Results files should have the following format and be stored as json files:
+
 ```json
 {
     "config": {
@@ -40,7 +146,8 @@ If you encounter problem on the space, don't hesitate to restart it to remove th
 
 # Code logic for more complex edits
 
-You'll find
+You'll find
+
 - the main table' columns names and properties in `src/display/utils.py`
 - the logic to read all results and request files, then convert them in dataframe lines, in `src/leaderboard/read_evals.py`, and `src/populate.py`
-- the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
+- the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
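
The README's "Integrating Real Models" note leaves the model call open. A minimal sketch of what a real judge call could look like, assuming LiteLLM (per the tech-stack rule) with provider API keys set in the environment; the model name, prompt wording, and helper name are illustrative, not part of this commit:

```python
# Hypothetical replacement for the simulated evaluations in
# get_random_judges_evaluations. Assumes LiteLLM and a configured API key;
# "gpt-4o-mini" and the prompt are placeholders, not the committed code.
from litellm import completion


def evaluate_with_model(judge: dict, example_input: str, example_output: str) -> str:
    prompt = (
        f"You are {judge['name']}: {judge['description']}.\n\n"
        f"Input prompt:\n{example_input}\n\n"
        f"Model output:\n{example_output}\n\n"
        "Score the output on two or three criteria (x/10) and give a short justification."
    )
    response = completion(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```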
app.py
CHANGED
@@ -1,204 +1,375 @@
-from gradio_leaderboard import Leaderboard, ColumnFilter, SelectColumns
-)
-    WeightType,
-    Precision
-)
-from src.envs import API, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, QUEUE_REPO, REPO_ID, RESULTS_REPO, TOKEN
-from src.populate import get_evaluation_queue_df, get_leaderboard_df
-from src.submission.submit import add_new_eval
-
-
-def restart_space():
-    API.restart_space(repo_id=REPO_ID)
-
-### Space initialisation
-try:
-    print(EVAL_REQUESTS_PATH)
-    snapshot_download(
-        repo_id=QUEUE_REPO, local_dir=EVAL_REQUESTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
-    )
-except Exception:
-    restart_space()
-try:
-    print(EVAL_RESULTS_PATH)
-    snapshot_download(
-        repo_id=RESULTS_REPO, local_dir=EVAL_RESULTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
-    )
-
-(
-        cant_deselect=[c.name for c in fields(AutoEvalColumn) if c.never_hidden],
-        label="Select Columns to Display:",
-    ),
-    search_columns=[AutoEvalColumn.model.name, AutoEvalColumn.license.name],
-    hide_columns=[c.name for c in fields(AutoEvalColumn) if c.hidden],
-    filter_columns=[
-        ColumnFilter(AutoEvalColumn.model_type.name, type="checkboxgroup", label="Model types"),
-        ColumnFilter(AutoEvalColumn.precision.name, type="checkboxgroup", label="Precision"),
-        ColumnFilter(
-            AutoEvalColumn.params.name,
-            type="slider",
-            min=0.01,
-            max=150,
-            label="Select the number of parameters (B)",
-        ),
-        ColumnFilter(
-            AutoEvalColumn.still_on_hub.name, type="boolean", label="Deleted/incomplete", default=True
-        ),
-    ],
-    bool_checkboxgroup_label="Hide models",
-    interactive=False,
-)
-
-with
-    gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")
-
-            with gr.Column():
-                with gr.Accordion(
-                    f"✅ Finished Evaluations ({len(finished_eval_queue_df)})",
-                    open=False,
-                ):
-                    with gr.Row():
-                        finished_eval_table = gr.components.Dataframe(
-                            value=finished_eval_queue_df,
-                            headers=EVAL_COLS,
-                            datatype=EVAL_TYPES,
-                            row_count=5,
-                        )
-                with gr.Accordion(
-                    f"🔄 Running Evaluation Queue ({len(running_eval_queue_df)})",
-                    open=False,
-                ):
-                    with gr.Row():
-                        running_eval_table = gr.components.Dataframe(
-                            value=running_eval_queue_df,
-                            headers=EVAL_COLS,
-                            datatype=EVAL_TYPES,
-                            row_count=5,
-                        )
-
-                with gr.Accordion(
-                    f"⏳ Pending Evaluation Queue ({len(pending_eval_queue_df)})",
-                    open=False,
-                ):
-                    with gr.Row():
-                        pending_eval_table = gr.components.Dataframe(
-                            value=pending_eval_queue_df,
-                            headers=EVAL_COLS,
-                            datatype=EVAL_TYPES,
-                            row_count=5,
-                        )
-            with gr.Row():
-                gr.Markdown("# ✉️✨ Submit your model here!", elem_classes="markdown-text")
-
-            with gr.Row():
-                with gr.Column():
-                    model_name_textbox = gr.Textbox(label="Model name")
-                    revision_name_textbox = gr.Textbox(label="Revision commit", placeholder="main")
-                    model_type = gr.Dropdown(
-                        choices=[t.to_str(" : ") for t in ModelType if t != ModelType.Unknown],
-                        label="Model type",
-                        multiselect=False,
-                        value=None,
-                        interactive=True,
-                    )
-
-                with gr.Column():
-                    precision = gr.Dropdown(
-                        choices=[i.value.name for i in Precision if i != Precision.Unknown],
-                        label="Precision",
-                        multiselect=False,
-                        value="float16",
-                        interactive=True,
-                    )
-                    weight_type = gr.Dropdown(
-                        choices=[i.value.name for i in WeightType],
-                        label="Weights type",
-                        multiselect=False,
-                        value="Original",
-                        interactive=True,
-                    )
-                    base_model_name_textbox = gr.Textbox(label="Base model (for delta or adapter weights)")
-
-            submit_button = gr.Button("Submit Eval")
-            submission_result = gr.Markdown()
-            submit_button.click(
-                add_new_eval,
-                [
-                ],
-            )
-
-demo.queue(default_concurrency_limit=40).launch()
+import json
+import random
+from pathlib import Path
+
+import gradio as gr
+import pandas as pd
+
+# Constants
+DATA_DIR = Path("data")
+JUDGES_DIR = DATA_DIR / "judges"
+EXAMPLES_DIR = DATA_DIR / "examples"
+LEADERBOARD_PATH = DATA_DIR / "leaderboard.csv"
+HISTORY_PATH = DATA_DIR / "history.csv"
+
+# Initialize data directories
+DATA_DIR.mkdir(exist_ok=True)
+JUDGES_DIR.mkdir(exist_ok=True)
+EXAMPLES_DIR.mkdir(exist_ok=True)
+
+# ELO calculation parameters
+K_FACTOR = 32  # Standard chess K-factor
+
+# Initialize leaderboard if it doesn't exist
+if not LEADERBOARD_PATH.exists():
+    leaderboard_df = pd.DataFrame(
+        {"judge_id": [], "judge_name": [], "elo_score": [], "wins": [], "losses": [], "total_evaluations": []}
+    )
+    leaderboard_df.to_csv(LEADERBOARD_PATH, index=False)
+else:
+    leaderboard_df = pd.read_csv(LEADERBOARD_PATH)
+
+# Initialize history if it doesn't exist
+if not HISTORY_PATH.exists():
+    history_df = pd.DataFrame(
+        {
+            "timestamp": [],
+            "input": [],
+            "output": [],
+            "judge1_id": [],
+            "judge1_name": [],
+            "judge1_evaluation": [],
+            "judge2_id": [],
+            "judge2_name": [],
+            "judge2_evaluation": [],
+            "winner_id": [],
+            "user_ip": [],
+        }
+    )
+    history_df.to_csv(HISTORY_PATH, index=False)
+else:
+    history_df = pd.read_csv(HISTORY_PATH)
+
+# Sample data - For demonstration purposes
+# In production, these would be loaded from a proper dataset
+if not list(EXAMPLES_DIR.glob("*.json")) and not list(JUDGES_DIR.glob("*.json")):
+    # Create sample examples
+    sample_examples = [
+        {
+            "id": "example1",
+            "input": "Write a poem about the ocean.",
+            "output": "The waves crash and foam,\nSalt spray fills the air like mist,\nOcean breathes deeply.",
+        },
+        {
+            "id": "example2",
+            "input": "Explain how photosynthesis works.",
+            "output": "Photosynthesis is the process where plants convert sunlight, water, "
+            "and carbon dioxide into glucose and oxygen. The chlorophyll in plant "
+            "cells captures light energy, which is then used to convert CO2 and "
+            "water into glucose, releasing oxygen as a byproduct.",
+        },
+        {
+            "id": "example3",
+            "input": "Solve this math problem: If x + y = 10 and x - y = 4, what are x and y?",
+            "output": "To solve this system of equations:\nx + y = 10\nx - y = 4\n\n"
+            "Add these equations:\n2x = 14\nx = 7\n\nSubstitute back:\n"
+            "7 + y = 10\ny = 3\n\nTherefore, x = 7 and y = 3.",
+        },
+    ]
+
+    for example in sample_examples:
+        with open(EXAMPLES_DIR / f"{example['id']}.json", "w") as f:
+            json.dump(example, f)
+
+    # Create sample judges
+    sample_judges = [
+        {
+            "id": "judge1",
+            "name": "EvalGPT",
+            "description": "A comprehensive evaluation model focused on accuracy and completeness",
+        },
+        {
+            "id": "judge2",
+            "name": "CritiqueBot",
+            "description": "An evaluation model specializing in identifying factual errors",
+        },
+        {
+            "id": "judge3",
+            "name": "GradeAssist",
+            "description": "A holistic evaluation model that balances substance and style",
+        },
+        {
+            "id": "judge4",
+            "name": "PrecisionJudge",
+            "description": "A technical evaluator that emphasizes precision and correctness",
+        },
+    ]
+
+    for judge in sample_judges:
+        with open(JUDGES_DIR / f"{judge['id']}.json", "w") as f:
+            json.dump(judge, f)
+
+    # Initialize leaderboard with sample judges
+    for judge in sample_judges:
+        if judge["id"] not in leaderboard_df["judge_id"].values:
+            leaderboard_df = pd.concat(
+                [
+                    leaderboard_df,
+                    pd.DataFrame(
+                        {
+                            "judge_id": [judge["id"]],
+                            "judge_name": [judge["name"]],
+                            "elo_score": [1500],  # Starting ELO
+                            "wins": [0],
+                            "losses": [0],
+                            "total_evaluations": [0],
+                        }
+                    ),
+                ],
+                ignore_index=True,
+            )
+
+    leaderboard_df.to_csv(LEADERBOARD_PATH, index=False)
+
+
+# Function to get a random example
+def get_random_example():
+    example_files = list(EXAMPLES_DIR.glob("*.json"))
+    if not example_files:
+        return {"input": "No examples available", "output": ""}
+
+    example_file = random.choice(example_files)
+    with open(example_file, "r") as f:
+        example = json.load(f)
+
+    return example
+
+
+# Function to get random judges' evaluations
+def get_random_judges_evaluations(example_input, example_output):
+    judge_files = list(JUDGES_DIR.glob("*.json"))
+    if len(judge_files) < 2:
+        return None, None
+
+    # Choose two different judges
+    selected_judge_files = random.sample(judge_files, 2)
+
+    judges = []
+    for judge_file in selected_judge_files:
+        with open(judge_file, "r") as f:
+            judge = json.load(f)
+            judges.append(judge)
+
+    # In a real application, we'd call the judge models here
+    # For demonstration, we'll create sample evaluations
+    evaluations = []
+    for judge in judges:
+        # Simulate different evaluation styles
+        if "factual" in judge["description"].lower():
+            evaluation = (
+                f"Evaluation by {judge['name']} (ID: {judge['id']}):\n\n"
+                f"Factual Accuracy: 8/10\nCompleteness: 7/10\n"
+                f"Conciseness: 9/10\n\nThe response addresses the question "
+                f"but could include more specific details."
+            )
+        elif "holistic" in judge["description"].lower():
+            evaluation = (
+                f"Evaluation by {judge['name']} (ID: {judge['id']}):\n\n"
+                f"Content Quality: 7/10\nStructure: 8/10\nClarity: 9/10\n\n"
+                f"The response is clear and well-structured, though it could "
+                f"be more comprehensive."
+            )
+        elif "technical" in judge["description"].lower():
+            evaluation = (
+                f"Evaluation by {judge['name']} (ID: {judge['id']}):\n\n"
+                f"Technical Accuracy: 9/10\nCompleteness: 7/10\n"
+                f"Logical Flow: 8/10\n\nThe technical aspects are accurate, "
+                f"but some key details are missing."
+            )
+        else:
+            evaluation = (
+                f"Evaluation by {judge['name']} (ID: {judge['id']}):\n\n"
+                f"Quality: 8/10\nRelevance: 9/10\nPrecision: 7/10\n\n"
+                f"The response is relevant and of good quality, but could "
+                f"be more precise."
+            )
+
+        # Remove the judge ID from the displayed evaluation for blindness
+        display_evaluation = evaluation.replace(f" (ID: {judge['id']})", "")
+
+        evaluations.append({"judge": judge, "evaluation": evaluation, "display_evaluation": display_evaluation})
+
+    return evaluations[0], evaluations[1]
+
+
+# Calculate new ELO scores
+def calculate_elo(winner_rating, loser_rating):
+    expected_winner = 1 / (1 + 10 ** ((loser_rating - winner_rating) / 400))
+    expected_loser = 1 / (1 + 10 ** ((winner_rating - loser_rating) / 400))
+
+    new_winner_rating = winner_rating + K_FACTOR * (1 - expected_winner)
+    new_loser_rating = loser_rating + K_FACTOR * (0 - expected_loser)
+
+    return new_winner_rating, new_loser_rating
+
+
+# Update leaderboard after a comparison
+def update_leaderboard(winner_id, loser_id):
+    global leaderboard_df
+
+    # Get current ratings
+    winner_row = leaderboard_df[leaderboard_df["judge_id"] == winner_id].iloc[0]
+    loser_row = leaderboard_df[leaderboard_df["judge_id"] == loser_id].iloc[0]
+
+    winner_rating = winner_row["elo_score"]
+    loser_rating = loser_row["elo_score"]
+
+    # Calculate new ratings
+    new_winner_rating, new_loser_rating = calculate_elo(winner_rating, loser_rating)
+
+    # Update dataframe
+    leaderboard_df.loc[leaderboard_df["judge_id"] == winner_id, "elo_score"] = new_winner_rating
+    leaderboard_df.loc[leaderboard_df["judge_id"] == loser_id, "elo_score"] = new_loser_rating
+
+    # Update win/loss counts
+    leaderboard_df.loc[leaderboard_df["judge_id"] == winner_id, "wins"] += 1
+    leaderboard_df.loc[leaderboard_df["judge_id"] == loser_id, "losses"] += 1
+
+    # Update total evaluations
+    leaderboard_df.loc[leaderboard_df["judge_id"] == winner_id, "total_evaluations"] += 1
+    leaderboard_df.loc[leaderboard_df["judge_id"] == loser_id, "total_evaluations"] += 1
+
+    # Sort by ELO score and save
+    leaderboard_df = leaderboard_df.sort_values(by="elo_score", ascending=False).reset_index(drop=True)
+    leaderboard_df.to_csv(LEADERBOARD_PATH, index=False)
+
+    return leaderboard_df
+
+
+# Gradio interface functions
+def refresh_example():
+    example = get_random_example()
+    return example["input"], example["output"]
+
+
+def submit_example(input_text, output_text):
+    # Global state to store evaluations
+    global eval1, eval2
+
+    eval1, eval2 = get_random_judges_evaluations(input_text, output_text)
+
+    if not eval1 or not eval2:
+        return ("Error: Not enough judges available", "Error: Not enough judges available", None, None)
+
+    return (eval1["display_evaluation"], eval2["display_evaluation"], gr.update(visible=True), gr.update(visible=True))
+
+
+def select_winner(choice):
+    if not eval1 or not eval2:
+        return "Error: No evaluations available"
+
+    if choice == "Evaluation 1":
+        winner_eval = eval1
+        loser_eval = eval2
+    else:
+        winner_eval = eval2
+        loser_eval = eval1
+
+    # Update leaderboard
+    updated_leaderboard = update_leaderboard(winner_eval["judge"]["id"], loser_eval["judge"]["id"])
+
+    # Construct result message
+    result_message = f"You selected: {choice}\n\n"
+    result_message += f"Evaluation 1 was by: {eval1['judge']['name']} "
+    result_message += f"({eval1['judge']['description']})\n"
+    result_message += f"Evaluation 2 was by: {eval2['judge']['name']} "
+    result_message += f"({eval2['judge']['description']})\n\n"
+
+    winner_elo = updated_leaderboard[updated_leaderboard["judge_id"] == winner_eval["judge"]["id"]][
+        "elo_score"
+    ].values[0]
+
+    result_message += f"Winner: {winner_eval['judge']['name']} (New ELO: {winner_elo:.2f})\n"
+
+    return result_message
+
+
+# Create Gradio interface
+with gr.Blocks(title="AI Evaluation Judge Arena") as demo:
+    gr.Markdown("# AI Evaluation Judge Arena")
+    gr.Markdown(
+        "Choose which AI judge provides better evaluation of the output. "
+        "The judges' identities are hidden until you make your choice."
+    )
+
+    with gr.Tab("Evaluate Judges"):
+        with gr.Row():
+            with gr.Column(scale=1):
+                refresh_button = gr.Button("Get Random Example")
+
+            with gr.Column(scale=2):
+                input_text = gr.Textbox(label="Input", lines=4)
+                output_text = gr.Textbox(label="Output", lines=6)
+                submit_button = gr.Button("Get Judge Evaluations")
+
+        with gr.Row():
+            with gr.Column():
+                evaluation1 = gr.Textbox(label="Evaluation 1", lines=10)
+                select_eval1 = gr.Button("Select Evaluation 1", visible=False)
+
+            with gr.Column():
+                evaluation2 = gr.Textbox(label="Evaluation 2", lines=10)
+                select_eval2 = gr.Button("Select Evaluation 2", visible=False)
+
+        result_text = gr.Textbox(label="Result", lines=6)
+
+    with gr.Tab("Leaderboard"):
+        leaderboard_dataframe = gr.DataFrame(
+            value=leaderboard_df,
+            headers=["Judge Name", "ELO Score", "Wins", "Losses", "Total Evaluations"],
+            datatype=["str", "number", "number", "number", "number"],
+            col_count=(5, "fixed"),
+            interactive=False,
+        )
+        refresh_leaderboard = gr.Button("Refresh Leaderboard")
+
+    with gr.Tab("About"):
+        gr.Markdown(
+            """
+        ## About AI Evaluation Judge Arena
+
+        This platform allows users to compare and rate different AI evaluation models (judges).
+
+        ### How it works:
+        1. You are presented with an input prompt and AI-generated output
+        2. Two AI judges provide evaluations of the output
+        3. You select which evaluation you think is better
+        4. The judges' identities are revealed, and their ELO ratings are updated
+
+        ### ELO Rating System
+        The platform uses the ELO rating system (like in chess) to rank the judges. When you choose a winner:
+        - The winning judge gains ELO points
+        - The losing judge loses ELO points
+        - The amount of points transferred depends on the difference in current ratings
+
+        ### Purpose
+        This platform helps determine which AI evaluation methods are most aligned with human preferences.
+        """
+        )
+
+    # Set up event handlers
+    refresh_button.click(refresh_example, [], [input_text, output_text])
+    submit_button.click(
+        submit_example, [input_text, output_text], [evaluation1, evaluation2, select_eval1, select_eval2]
+    )
+    select_eval1.click(lambda: select_winner("Evaluation 1"), [], result_text)
+    select_eval2.click(lambda: select_winner("Evaluation 2"), [], result_text)
+    refresh_leaderboard.click(lambda: leaderboard_df, [], leaderboard_dataframe)
+
+# Initialize global variables for evaluation state
+eval1 = None
+eval2 = None
+
+# Launch the app
+if __name__ == "__main__":
+    demo.launch()
data/examples/example1.json
ADDED
@@ -0,0 +1 @@
+{"id": "example1", "input": "Write a poem about the ocean.", "output": "The waves crash and foam,\nSalt spray fills the air like mist,\nOcean breathes deeply."}
data/examples/example2.json
ADDED
@@ -0,0 +1 @@
+{"id": "example2", "input": "Explain how photosynthesis works.", "output": "Photosynthesis is the process where plants convert sunlight, water, and carbon dioxide into glucose and oxygen. The chlorophyll in plant cells captures light energy, which is then used to convert CO2 and water into glucose, releasing oxygen as a byproduct."}
data/examples/example3.json
ADDED
@@ -0,0 +1 @@
+{"id": "example3", "input": "Solve this math problem: If x + y = 10 and x - y = 4, what are x and y?", "output": "To solve this system of equations:\nx + y = 10\nx - y = 4\n\nAdd these equations:\n2x = 14\nx = 7\n\nSubstitute back:\n7 + y = 10\ny = 3\n\nTherefore, x = 7 and y = 3."}
data/history.csv
ADDED
@@ -0,0 +1 @@
+timestamp,input,output,judge1_id,judge1_name,judge1_evaluation,judge2_id,judge2_name,judge2_evaluation,winner_id,user_ip
data/judges/judge1.json
ADDED
@@ -0,0 +1 @@
+{"id": "judge1", "name": "EvalGPT", "description": "A comprehensive evaluation model focused on accuracy and completeness"}
data/judges/judge2.json
ADDED
@@ -0,0 +1 @@
+{"id": "judge2", "name": "CritiqueBot", "description": "An evaluation model specializing in identifying factual errors"}
data/judges/judge3.json
ADDED
@@ -0,0 +1 @@
+{"id": "judge3", "name": "GradeAssist", "description": "A holistic evaluation model that balances substance and style"}
data/judges/judge4.json
ADDED
@@ -0,0 +1 @@
+{"id": "judge4", "name": "PrecisionJudge", "description": "A technical evaluator that emphasizes precision and correctness"}
data/leaderboard.csv
ADDED
@@ -0,0 +1,5 @@
+judge_id,judge_name,elo_score,wins,losses,total_evaluations
+judge2,CritiqueBot,1516.0,1.0,0.0,1.0
+judge4,PrecisionJudge,1500.736306793522,1.0,1.0,2.0
+judge3,GradeAssist,1500.0,0.0,0.0,0.0
+judge1,EvalGPT,1483.263693206478,0.0,1.0,1.0
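
One ordering that reproduces the seeded scores above, as a quick check on `calculate_elo` in app.py: judge2 beat judge4 when both sat at 1500, then judge4 beat judge1. The match order is inferred from the win/loss columns, not recorded in the commit; a small verification sketch:

```python
# Verify that data/leaderboard.csv follows from two K=32 ELO updates,
# using the same calculate_elo as app.py.
K_FACTOR = 32


def calculate_elo(winner_rating, loser_rating):
    expected_winner = 1 / (1 + 10 ** ((loser_rating - winner_rating) / 400))
    expected_loser = 1 / (1 + 10 ** ((winner_rating - loser_rating) / 400))
    return (
        winner_rating + K_FACTOR * (1 - expected_winner),
        loser_rating + K_FACTOR * (0 - expected_loser),
    )


judge2, judge4 = calculate_elo(1500, 1500)    # judge2 wins -> 1516.0, 1484.0
judge4, judge1 = calculate_elo(judge4, 1500)  # judge4 wins -> 1500.736..., 1483.263...
print(judge2, judge4, judge1)  # matches the CSV values above
```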
requirements.txt
CHANGED
@@ -1,16 +1,3 @@
-APScheduler
-black
-datasets
 gradio
-gradio[oauth]
-gradio_leaderboard==0.0.13
-gradio_client
-huggingface-hub>=0.18.0
-matplotlib
 numpy
 pandas
-python-dateutil
-tqdm
-transformers
-tokenizers>=0.15.0
-sentencepiece