update results

- app.py +12 -12
- src/display/about.py +24 -9
app.py
CHANGED

@@ -45,7 +45,7 @@ show_columns_overall = ['R', 'Model', 'type', 'open?','#P(B)', 'SeaExam-pub', 'S
 TYPES_overall = ['number', 'markdown', 'str', 'str', 'number', 'number', 'number', 'number', 'number']

 # Load the data from the csv file
-csv_path = f'{EVAL_RESULTS_PATH}/
+csv_path = f'{EVAL_RESULTS_PATH}/SeaExam_results_20241122.csv'
 # csv_path = f'eval-results/SeaExam_results_20241030.csv'
 df = pd.read_csv(csv_path, skiprows=1, header=0)
 # df_m3exam, df_mmlu, df_avg = load_data(csv_path)
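For readers following the data flow in this hunk, here is a minimal, hypothetical sketch of what the `load_data` helper called in `app.py` could look like. The real implementation is not part of this diff: the `skiprows=1` read mirrors the line above, while the column-selection logic and every column name other than `Model` and the `SeaExam-*`/`SeaBench-*` prefixes are assumptions.

```python
# Hypothetical sketch of a load_data helper like the one app.py calls.
# The skiprows=1 read mirrors app.py; the column selection below is an assumption.
import pandas as pd

def load_data(csv_path: str):
    # The skipped first row presumably carries an extra grouping/banner header.
    df = pd.read_csv(csv_path, skiprows=1, header=0)
    seaexam_cols = [c for c in df.columns if c.startswith("SeaExam")]
    seabench_cols = [c for c in df.columns if c.startswith("SeaBench")]
    df_seaexam = df[["Model"] + seaexam_cols].copy()
    df_seabench = df[["Model"] + seabench_cols].copy()
    df_overall = df.copy()  # the overall tab keeps every column
    return df_seaexam, df_seabench, df_overall

# Usage, as in app.py:
# df_seaexam, df_seabench, df_overall = load_data(csv_path)
```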
@@ -54,7 +54,7 @@ df_seaexam, df_seabench, df_overall = load_data(csv_path)
 demo = gr.Blocks(css=custom_css)
 with demo:
     gr.HTML(TITLE)
-    gr.HTML(SUB_TITLE)
+    # gr.HTML(SUB_TITLE)
     gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")

     with gr.Tabs(elem_classes="tab-buttons") as tabs:
@@ -125,18 +125,18 @@ with demo:

         with gr.TabItem("π About", elem_id="llm-benchmark-tab-table", id=3):
             gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
-
-
-
-
-
-
-
-
-
+            with gr.Row():
+                with gr.Accordion("π Citation", open=False):
+                    citation_button = gr.Textbox(
+                        value=CITATION_BUTTON_TEXT,
+                        label=CITATION_BUTTON_LABEL,
+                        lines=20,
+                        elem_id="citation-button",
+                        show_copy_button=True,
+                    )
             gr.Markdown(CONTACT_TEXT, elem_classes="markdown-text")

-demo.launch()
+demo.launch(share=True)

 scheduler = BackgroundScheduler()
 scheduler.add_job(restart_space, "interval", seconds=1800)
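The citation block added in this hunk follows a common Gradio pattern: a collapsed accordion holding a read-only textbox with a copy button. A self-contained sketch is below; the label and BibTeX string are placeholders rather than the Space's actual constants.

```python
# Minimal, self-contained sketch of the citation accordion pattern added above.
# The label and BibTeX text are placeholders, not the Space's real constants.
import gradio as gr

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@article{placeholder2024, title={...}, year={2024}}"""

with gr.Blocks() as demo:
    with gr.Row():
        with gr.Accordion("Citation", open=False):  # collapsed by default
            gr.Textbox(
                value=CITATION_BUTTON_TEXT,
                label=CITATION_BUTTON_LABEL,
                lines=20,
                elem_id="citation-button",
                show_copy_button=True,  # renders a copy-to-clipboard icon
            )

demo.launch()
```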
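`restart_space` is referenced at the bottom of `app.py` but not defined in this diff. On Hugging Face Spaces it is typically a small wrapper around `HfApi.restart_space`, scheduled with APScheduler as sketched below; the repo id and token handling here are assumptions, not values from this Space.

```python
# Hedged sketch of the restart_space job scheduled at the end of app.py.
# The repo id and the token lookup are placeholders.
import os
from apscheduler.schedulers.background import BackgroundScheduler
from huggingface_hub import HfApi

REPO_ID = "your-org/your-space"  # placeholder Space id
API = HfApi(token=os.environ.get("HF_TOKEN"))

def restart_space():
    # Restarting the Space makes it reload freshly uploaded result files.
    API.restart_space(repo_id=REPO_ID)

scheduler = BackgroundScheduler()
scheduler.add_job(restart_space, "interval", seconds=1800)  # every 30 minutes
scheduler.start()
```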
src/display/about.py
CHANGED

@@ -16,10 +16,11 @@ class Tasks(Enum):


 # Your leaderboard name
-TITLE = """<h1 align="center" id="space-title">π SeaExam and SeaBench Leaderboard</h1>"""
+# TITLE = """<h1 align="center" id="space-title">π SeaExam and SeaBench Leaderboard</h1>"""
+TITLE = """<h1 align="left" id="space-title">π LLM Leaderboard for SEA</h1>"""

 # subtitle
-SUB_TITLE = """<h2 align="
+SUB_TITLE = """<h2 align="left" id="space-title">What is the best LLM for Southeast Asian Languagesβ</h1>"""

 # What does your leaderboard evaluate?
 # INTRODUCTION_TEXT = """
@@ -34,6 +35,14 @@ INTRODUCTION_TEXT = """
 This leaderboard evaluates Large Language Models (LLMs) on Southeast Asian (SEA) languages through two comprehensive benchmarks: SeaExam and SeaBench. SeaExam assesses world knowledge and reasoning capabilities through exam-style questions, while SeaBench evaluates instruction-following abilities and multi-turn conversational skills. For detailed methodology and results, please refer to the "π About" tab.
 """

+INTRODUCTION_TEXT = """
+This leaderboard evaluates Large Language Models (LLMs) on Southeast Asian (SEA) languages through two comprehensive benchmarks: SeaExam and SeaBench:
+* SeaExam assesses world knowledge and reasoning capabilities through exam-style questions [[data (public)](https://huggingface.co/datasets/SeaLLMs/SeaExam)] [[code](https://github.com/DAMO-NLP-SG/SeaExam)]
+* SeaBench evaluates instruction-following abilities and multi-turn conversational skills. [[data (public)](https://huggingface.co/datasets/SeaLLMs/SeaBench)] [[code](https://github.com/DAMO-NLP-SG/SeaBench?tab=readme-ov-file)]
+
+Note: "pub" denotes public dataset, and "prv" denotes private dataset.
+For more details, please refer to the "π About" tab.
+"""
 # For additional details such as datasets, evaluation criteria, and reproducibility, please refer to the "π About" tab.

 # Stay tuned for the *SeaBench leaderboard* - focusing on evaluating the model's ability to respond to general human instructions in real-world multi-turn settings.
@@ -46,7 +55,7 @@ Even though large language models (LLMs) have shown impressive performance on va


 ## Datasets
-The benchmark data can be found in the [SeaExam dataset](https://huggingface.co/datasets/SeaLLMs/SeaExam) and SeaBench dataset
+The benchmark data can be found in the [SeaExam dataset](https://huggingface.co/datasets/SeaLLMs/SeaExam) and [SeaBench dataset](https://huggingface.co/datasets/SeaLLMs/SeaBench).
 - **SeaExam**: a benchmark sourced from real and official human exam questions in multiple-choice format.
 - **SeaBench**: a manually created benchmark for evaluating the model's ability to follow instructions and engage in multi-turn conversations. The questions are in open-ended format.

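Both benchmark datasets linked in this hunk are hosted on the Hugging Face Hub, so the public portions can be pulled with the `datasets` library. A hedged loading sketch follows; split and config names are not specified in this diff, so check the dataset cards.

```python
# Hedged sketch: loading the public benchmark data referenced above.
# Split/config names are not given in this diff; consult the dataset cards.
from datasets import load_dataset

seaexam = load_dataset("SeaLLMs/SeaExam")    # public SeaExam questions
seabench = load_dataset("SeaLLMs/SeaBench")  # public SeaBench questions
print(seaexam)   # inspect the available splits and columns
print(seabench)
```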
@@ -59,7 +68,7 @@ The benchmark data can be found in the [SeaExam dataset](https://huggingface.co/
 - **SeaBench**:
 We evaluate the responses of the models with GPT-4o-2024-08-06. Each response is scored on a scale of 1-10.

-## 
+## Results
 How to interpret the leaderboard?
 * Each numerical value represents the accuracy (%) for SeaExam and the score for SeaBench.
 * The "π Overall" shows the average results across the three languages for the SeaExam public dataset (SeaExam-pub), SeaExam private dataset (SeaExam-prv), SeaBench public dataset (SeaBench-pub), and SeaBench private dataset (SeaBench-prv). This leaderboard is ranked by SeaExam-prv.
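The hunk above states that each "Overall" number is the average over the three languages and that the table is ranked by SeaExam-prv. A toy pandas sketch of that aggregation is below; the per-language column names and the sample scores are placeholders, not real leaderboard data.

```python
# Toy sketch of the aggregation described above: average per-language scores,
# then rank by SeaExam-prv. Column names and numbers are placeholders only.
import pandas as pd

per_lang = pd.DataFrame(
    {
        "Model": ["model-a", "model-b"],
        "SeaExam-prv-lang1": [61.2, 55.0],
        "SeaExam-prv-lang2": [58.4, 52.1],
        "SeaExam-prv-lang3": [63.0, 57.9],
    }
)

lang_cols = ["SeaExam-prv-lang1", "SeaExam-prv-lang2", "SeaExam-prv-lang3"]
per_lang["SeaExam-prv"] = per_lang[lang_cols].mean(axis=1).round(2)

# Rank the leaderboard by the SeaExam-prv average, best first.
overall = per_lang[["Model", "SeaExam-prv"]].sort_values("SeaExam-prv", ascending=False)
print(overall)
```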
@@ -69,13 +78,13 @@ How to interpret the leaderboard?
 * "open?" column indicates whether the model is open-source or proprietary.

 ## Reproducibility
-To reproduce our results, use the script in [
-```python
-python scripts/main.py --model $model_name_or_path
-```
-
+To reproduce our results, use the script in [SeaExam](https://github.com/DAMO-NLP-SG/SeaExam/tree/main) and [SeaBench](https://github.com/DAMO-NLP-SG/SeaBench). The script will download the model and tokenizer, and evaluate the model on the benchmark data.
 """

+# ```python
+# python scripts/main.py --model $model_name_or_path
+# ```
+
 # You can find the detailed numerical results in the results Hugging Face dataset: https://huggingface.co/datasets/SeaLLMs/SeaExam-results

 EVALUATION_QUEUE_TEXT = """
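The reproducibility note added in this hunk says the repo script downloads the model and tokenizer and then runs the evaluation. The snippet below sketches only that first step with `transformers`; it is not the SeaExam/SeaBench script itself, and the model name and question text are placeholders.

```python
# Hedged sketch of the "download the model and tokenizer" step mentioned above.
# This is not the SeaExam/SeaBench evaluation script; see the linked repositories.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "your-org/your-model"  # placeholder, the repo script takes this via --model
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto")

# Generate an answer for one illustrative benchmark-style question.
question = "Example multiple-choice question goes here."
inputs = tokenizer(question, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```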
@@ -110,6 +119,12 @@ If everything is done, check you can launch the EleutherAIHarness on your model

 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""
+@article{damonlp2024sealeaderboard,
+  author = {Chaoqun Liu, Wenxuan Zhang, Jiahao Ying, Mahani Aljunied, Anh Tuan Luu, Lidong Bing},
+  title = {SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia},
+  year = {2024},
+  url = {},
+}
 """

 CONTACT_TEXT = f"""