diff --git "a/README.md" "b/README.md"
--- "a/README.md"
+++ "b/README.md"
@@ -23,12 +23,30 @@ tags:
- meralion-2
---
-# MERaLiON-2-10B
+
+# 🔥 MERaLiON-2 🔥
+
+[MERaLiON-2-10B](https://huggingface.co/MERaLiON/MERaLiON-2-10B) | [MERaLiON-2-10B-ASR](https://huggingface.co/MERaLiON/MERaLiON-2-10B-ASR) | [MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B)
+
-## What's New in V2
+## Introduction
-- **Extended Audio Length**: Improved support for audio inputs up to 300 seconds (5 minutes), compared to the 30-second limit in V1. We suggest a maximum audio length of 30 seconds for ASR.
+We are excited to announce the release of MERaLiON-2, the latest addition to the MERaLiON family of Speech-Text Large Language Models. Our flagship model, [MERaLiON-2-10B](https://huggingface.co/MERaLiON/MERaLiON-2-10B), achieves competitive results in benchmark evaluations of multilingual speech recognition (ASR), speech translation (ST), audio scene understanding, emotion recognition, and general speech understanding, compared to other state-of-the-art AudioLLMs such as Qwen2.5-Omni-7B and Phi-4-multimodal-instruct. It is tailored to follow **complex instructions** with a deep understanding of **Singapore's multilingual and multicultural landscape**.
+
+
+
+Additionally, we provide an ASR-optimized model, [MERaLiON-2-10B-ASR](https://huggingface.co/MERaLiON/MERaLiON-2-10B-ASR), which demonstrates a **5-30%** performance improvement over `whisper-large-v3` on speech recognition tasks across Singapore's 4 official languages (**English**, **Mandarin**, **Malay**, and **Tamil**), 3 Southeast Asian languages (**Indonesian**, **Thai**, and **Vietnamese**), **code-switching scenarios**, and various local phrases.
+The following visualisation shows `1 - Word Error Rate` for these 7 languages, comparing MERaLiON-2 with various other models.
+
+
+
+We also provide [MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B), which balances performance with reduced computational requirements, enabling broader accessibility and lightweight deployment.
+
+
+- **Extended Audio Length**: Supports audio inputs of up to 300 seconds (5 minutes) for audio and speech question answering tasks; **for speech transcription (ASR) and speech translation (ST), keep audio inputs within 30 seconds for satisfactory performance** (see the chunking sketch after this list).
- **Expanded Language Coverage**: In addition to English, Chinese, and Singlish, V2 introduces support for Malay, Tamil, and other regional languages including Indonesian, Thai, and Vietnamese.
@@ -38,8 +56,6 @@ tags:
- **Three Model Variants**: Available in general-purpose ([MERaLiON-2-10B](https://huggingface.co/MERaLiON/MERaLiON-2-10B)), ASR-optimized ([MERaLiON-2-10B-ASR](https://huggingface.co/MERaLiON/MERaLiON-2-10B-ASR)) and light-weight ([MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B)) configurations to balance latency, compute efficiency, and task performance across different deployment needs.
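+
+For recordings longer than 30 seconds that still need transcription, a minimal chunking sketch is shown below (an illustration, not from the model card; assumes `librosa` is installed and `example.wav` is a hypothetical file):
+
+```python
+# Split long audio into <=30 s windows before transcription, since ASR/ST
+# quality is best within 30 seconds per request.
+import librosa
+
+# Resampling to 16 kHz is assumed here, as is standard for Whisper-based
+# speech encoders.
+audio, sr = librosa.load("example.wav", sr=16000)
+chunk_samples = 30 * sr  # 30-second windows
+
+chunks = [audio[i:i + chunk_samples] for i in range(0, len(audio), chunk_samples)]
+# Transcribe each chunk separately, then join the partial transcripts.
+```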
----
-
## Model Description:
MERaLiON stands for **M**ultimodal **E**mpathetic **R**easoning **a**nd **L**earning **i**n **O**ne **N**etwork.
@@ -58,1432 +74,431 @@ The model supports long-form audio inputs of up to 300 seconds (5 minutes) and i
- **License:** [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf)
- **Demo:** [MERaLiON-AudioLLM Web Demo](https://meralion.org/demo/)
-
-
**MERaLiON-2** is an upgraded version of [MERaLiON-AudioLLM](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION).
----
-
-
-## Evaluations:
-
-
-
-
+## Performance:
+We benchmark the MERaLiON-2 series models with the extended [AudioBench benchmark](https://github.com/AudioLLMs/AudioBench) | [LeaderBoard](https://huggingface.co/spaces/MERaLiON/AudioBench-Leaderboard) against several recently released open-source multimodal models (SALMONN-7B, the Qwen2.5-Omni series, and Phi-4-Multimodal) as well as two cascade models. The MERaLiON-2 series models show stronger performance on a wide range of audio and speech understanding tasks.
-**Automatic Speech Recognition (ASR) results**
-
-| type | dataset | MERaLiON-1 | MERaLiON-2-3B | MERaLiON-2-10B | MERaLiON-2-10B-ASR | MERaLiON-2-Whisper | whisper_large_v3 | Phi-4-multimodal-instruct | Qwen2.5-Omni-3B | Qwen2.5-Omni-7B | SALMONN-7B | cascade-whisper_v2+sealion | cascade-whisper_v3+llama |
-|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-| English | common_voice_15_en | 0.078 | 0.093 | 0.087 | 0.076 | 0.102 | 0.100 | 0.081 | 0.094 | 0.080 | 0.316 | 0.106 | 0.099 |
-| | earnings21 | 0.138 | 0.219 | 0.108 | 0.092 | 0.130 | 0.132 | 0.131 | 0.147 | 0.189 | 0.277 | 0.141 | 0.109 |
-| | earnings22 | 0.166 | 0.239 | 0.151 | 0.128 | 0.168 | 0.165 | 0.226 | 0.197 | 0.241 | 0.380 | 0.172 | 0.146 |
-| | gigaspeech | 0.145 | 0.092 | 0.090 | 0.088 | 0.089 | 0.098 | 0.099 | 0.114 | 0.140 | 0.110 | 0.100 | 0.095 |
-| | librispeech_clean | 0.024 | 0.027 | 0.025 | 0.021 | 0.020 | 0.022 | 0.017 | 0.021 | 0.044 | 0.096 | 0.033 | 0.018 |
-| | librispeech_other | 0.042 | 0.051 | 0.047 | 0.040 | 0.044 | 0.039 | 0.039 | 0.045 | 0.069 | 0.118 | 0.054 | 0.036 |
-| | peoples_speech | 0.216 | 0.206 | 0.205 | 0.196 | 0.197 | 0.150 | 0.215 | 0.262 | 0.312 | 0.242 | 0.203 | 0.145 |
-| | tedlium3 | 0.082 | 0.035 | 0.035 | 0.031 | 0.036 | 0.041 | 0.029 | 0.048 | 0.049 | 0.039 | 0.049 | 0.038 |
-| | tedlium3_long_form | 0.105 | 0.138 | 0.044 | 0.035 | 0.048 | 0.045 | 0.051 | 0.071 | 0.084 | 0.141 | 0.086 | 0.049 |
-| | average | 0.111 | 0.122 | 0.088 | 0.079 | 0.093 | 0.088 | 0.098 | 0.111 | 0.134 | 0.191 | 0.105 | 0.082 |
-| Inhouse | cna | 0.145 | 0.135 | 0.133 | 0.127 | 0.128 | 0.138 | 0.191 | 0.174 | 0.183 | 0.149 | 0.152 | 0.138 |
-| | idpc | 0.204 | 0.177 | 0.160 | 0.166 | 0.169 | 0.179 | 0.261 | 0.199 | 0.220 | 0.541 | 0.170 | 0.162 |
-| | idpc_short | 0.165 | 0.151 | 0.157 | 0.140 | 0.152 | 0.220 | 0.539 | 0.211 | 0.414 | 0.240 | 0.197 | 0.153 |
-| | mediacorp | 0.123 | 0.123 | 0.105 | 0.104 | 0.116 | 0.129 | 0.198 | 0.152 | 0.235 | 0.364 | 0.158 | 0.151 |
-| | mediacorp_short | 0.128 | 0.121 | 0.117 | 0.118 | 0.122 | 0.127 | 0.122 | 0.148 | 0.141 | 0.199 | 0.154 | 0.114 |
-| | parliament | 0.059 | 0.185 | 0.060 | 0.053 | 0.078 | 0.090 | 0.278 | 0.100 | 0.110 | 0.204 | 0.090 | 0.065 |
-| | ste | 0.159 | 0.263 | 0.147 | 0.125 | 0.151 | 0.298 | 0.297 | 0.287 | 0.288 | 0.422 | 0.132 | 0.144 |
-| | ukusnews | 0.113 | 0.174 | 0.070 | 0.056 | 0.083 | 0.123 | 0.075 | 0.091 | 0.176 | 0.192 | 0.123 | 0.089 |
-| | ytb_asr_batch1 | 0.107 | 0.099 | 0.098 | 0.092 | 0.112 | 0.133 | 0.169 | 0.162 | 0.174 | 0.221 | 0.125 | 0.108 |
-| | ytb_asr_batch2 | 0.133 | 0.160 | 0.111 | 0.099 | 0.118 | 0.129 | 0.232 | 0.245 | 0.351 | 0.350 | 0.126 | 0.084 |
-| | ytb_asr_batch3_chinese | 0.418 | 0.256 | 0.191 | 0.149 | 0.177 | 0.266 | 0.440 | 0.250 | 0.206 | 0.886 | 0.347 | 0.270 |
-| | ytb_asr_batch3_malay | 0.290 | 0.280 | 0.209 | 0.195 | 0.290 | 0.260 | 3.763 | 2.944 | 1.461 | 1.086 | 0.314 | 0.312 |
-| | ytb_asr_batch3_tamil | 0.693 | 0.750 | 0.664 | 0.547 | 0.927 | 0.841 | 2.750 | 1.461 | 1.362 | 0.985 | 0.967 | 0.898 |
-| | average | 0.210 | 0.221 | 0.171 | 0.152 | 0.202 | 0.226 | 0.717 | 0.494 | 0.409 | 0.449 | 0.235 | 0.207 |
-| Mandarin | aishell_asr_zh | 0.128 | 0.050 | 0.058 | 0.043 | 0.056 | 0.123 | 0.122 | 0.028 | 0.024 | 0.931 | 0.209 | 0.125 |
-| | commonvoice_zh | 0.327 | 0.131 | 0.147 | 0.118 | 0.141 | 0.198 | 0.154 | 0.113 | 0.076 | 1.001 | 0.319 | 0.196 |
-| | average | 0.228 | 0.091 | 0.102 | 0.081 | 0.098 | 0.161 | 0.138 | 0.071 | 0.050 | 0.966 | 0.264 | 0.160 |
-| SEA languages | commonvoice_id | 0.260 | 0.085 | 0.113 | 0.079 | 0.069 | 0.075 | 1.327 | 0.136 | 0.110 | 1.189 | 0.100 | 0.078 |
-| | commonvoice_ta | 0.528 | 0.139 | 0.156 | 0.129 | 0.195 | 0.271 | 1.178 | 0.831 | 0.847 | 1.427 | 0.238 | 0.244 |
-| | commonvoice_th | 0.847 | 0.307 | 0.466 | 0.635 | 0.051 | 0.069 | 1.054 | 0.113 | 0.104 | 1.044 | 0.093 | 0.064 |
-| | commonvoice_vi | 0.922 | 0.142 | 0.156 | 0.142 | 0.118 | 0.129 | 1.107 | 0.196 | 0.184 | 1.496 | 0.157 | 0.117 |
-| | fleurs_tamil_ta | 0.462 | 0.143 | 0.161 | 0.138 | 0.224 | 0.276 | 1.702 | 1.654 | 0.867 | 1.508 | 0.272 | 0.284 |
-| | gigaspeech2_id | 0.337 | 0.178 | 0.172 | 0.163 | 0.185 | 0.196 | 5.804 | 0.275 | 0.227 | 2.118 | 0.219 | 0.193 |
-| | gigaspeech2_th | 0.987 | 0.200 | 0.200 | 0.182 | 0.171 | 0.222 | 1.734 | 0.300 | 0.232 | 1.247 | 0.276 | 0.209 |
-| | gigaspeech2_vi | 0.982 | 0.168 | 0.113 | 0.095 | 0.127 | 0.177 | 2.504 | 0.177 | 0.227 | 1.546 | 0.171 | 0.155 |
-| | lotus_thai_th | 0.852 | 0.015 | 0.019 | 0.011 | 0.026 | 0.039 | 1.286 | 0.026 | 0.021 | 1.135 | 0.068 | 0.032 |
-| | average | 0.686 | 0.153 | 0.173 | 0.175 | 0.129 | 0.162 | 1.966 | 0.412 | 0.313 | 1.412 | 0.177 | 0.153 |
-| Singlish | imda_part1_asr | 0.043 | 0.049 | 0.052 | 0.044 | 0.052 | 0.069 | 0.058 | 0.053 | 0.053 | 0.093 | 0.071 | 0.069 |
-| | imda_part2_asr | 0.047 | 0.058 | 0.145 | 0.054 | 0.080 | 0.318 | 0.345 | 0.095 | 0.094 | 0.458 | 0.330 | 0.319 |
-| | imda_part3_30s_asr | 0.213 | 0.264 | 0.227 | 0.196 | 0.211 | 0.320 | 0.438 | 0.475 | 0.535 | 0.681 | 0.281 | 0.277 |
-| | imda_part4_30s_asr | 0.297 | 0.360 | 0.295 | 0.246 | 0.271 | 0.503 | 1.470 | 1.250 | 1.303 | 0.787 | 0.459 | 0.458 |
-| | imda_part5_30s_asr | 0.154 | 0.202 | 0.168 | 0.140 | 0.149 | 0.237 | 0.239 | 0.280 | 0.374 | 0.375 | 0.218 | 0.214 |
-| | imda_part6_30s_asr | 0.109 | 0.149 | 0.127 | 0.099 | 0.110 | 0.198 | 0.144 | 0.183 | 0.275 | 0.255 | 0.175 | 0.172 |
-| | average | 0.144 | 0.180 | 0.169 | 0.130 | 0.145 | 0.274 | 0.449 | 0.389 | 0.439 | 0.441 | 0.256 | 0.252 |
-
-**Spoken Question Answering (SQA) results**
-
-| dataset | MERaLiON-1 | MERaLiON-2-3B | MERaLiON-2-10B | Phi-4-multimodal-instruct | Qwen2.5-Omni-3B | Qwen2.5-Omni-7B | SALMONN-7B | cascade-whisper_v2+sealion | cascade-whisper_v3+llama |
-|---|---|---|---|---|---|---|---|---|---|
-| cn_college_listen_mcq | 57.111 | 66.006 | 84.588 | 75.649 | 81.418 | 81.726 | 50.815 | 89.520 | 84.985 |
-| dream_tts_mcq | 51.542 | 61.160 | 83.325 | 77.522 | 69.995 | 70.779 | 56.560 | 85.154 | 86.200 |
-| imda_part3_30s_sqa | 55.200 | 52.600 | 59.400 | 55.000 | 52.400 | 54.200 | 42.000 | 51.400 | 51.600 |
-| imda_part4_30s_sqa | 50.000 | 54.600 | 63.000 | 56.400 | 54.400 | 52.000 | 35.400 | 46.400 | 55.600 |
-| imda_part5_30s_sqa | 63.000 | 61.400 | 72.000 | 64.600 | 66.000 | 62.800 | 45.800 | 54.600 | 62.000 |
-| imda_part6_30s_sqa | 67.400 | 70.200 | 71.800 | 71.800 | 69.200 | 64.600 | 49.600 | 62.600 | 68.200 |
-| mmau_mini | 53.100 | 51.000 | 56.700 | 58.800 | 60.700 | 56.100 | 50.600 | 52.600 | 55.900 |
-| muchomusic | 51.348 | 55.602 | 63.943 | 55.265 | 59.309 | 47.599 | 49.705 | 50.463 | 56.698 |
-| public_sg_speech_qa | 59.593 | 69.477 | 75.029 | 74.186 | 61.076 | 61.715 | 59.390 | 70.930 | 69.680 |
-| slue_p2_sqa5 | 86.716 | 83.186 | 89.559 | 83.725 | 73.873 | 77.304 | 80.882 | 51.520 | 86.961 |
-| spoken_squad | 74.207 | 81.461 | 89.209 | 83.196 | 59.850 | 62.867 | 65.648 | 57.163 | 87.434 |
-| average | 60.838 | 64.245 | 73.505 | 68.740 | 64.384 | 62.881 | 53.309 | 61.123 | 69.569 |
-
-**Speech Translation (ST) results**
-
-| dataset | MERaLiON-1 | MERaLiON-2-3B | MERaLiON-2-10B | MERaLiON-2-Whisper | whisper_large_v3 | Phi-4-multimodal-instruct | Qwen2.5-Omni-3B | Qwen2.5-Omni-7B | SALMONN-7B | cascade-whisper_v2+sealion | cascade-whisper_v3+llama |
-|---|---|---|---|---|---|---|---|---|---|---|---|
-| covost2_en_id | 37.058 | 30.658 | 36.242 | - | - | 14.554 | 22.677 | 22.381 | 14.193 | 27.592 | 10.753 |
-| covost2_en_ta | 13.809 | 5.602 | 10.886 | - | - | 0.148 | 0.114 | 0.724 | 0.001 | 7.475 | 1.003 |
-| covost2_en_zh | 43.963 | 40.028 | 43.747 | - | - | 45.480 | 41.390 | 40.436 | 33.256 | 28.714 | 6.090 |
-| covost2_id_en | 43.374 | 37.773 | 47.859 | 21.269 | 44.667 | 0.377 | 44.702 | 43.845 | 27.885 | 46.805 | 46.797 |
-| covost2_ta_en | 4.758 | 1.942 | 3.479 | 0.022 | 2.494 | 0.073 | 0.212 | 0.057 | 0.406 | 2.833 | 2.418 |
-| covost2_zh_en | 19.556 | 16.778 | 22.134 | 12.225 | 14.865 | 22.330 | 21.564 | 16.686 | 5.176 | 15.210 | 14.156 |
-| average | 27.086 | 22.130 | 27.391 | 11.172 | 20.675 | 13.827 | 21.777 | 20.688 | 13.486 | 21.438 | 13.536 |
-
-**Spoken Dialogue Summarization (SDS) results**
-
-| dataset | MERaLiON-1 | MERaLiON-2-3B | MERaLiON-2-10B | Phi-4-multimodal-instruct | Qwen2.5-Omni-3B | Qwen2.5-Omni-7B | SALMONN-7B | cascade-whisper_v2+sealion | cascade-whisper_v3+llama |
-|---|---|---|---|---|---|---|---|---|---|
-| imda_part3_30s_ds | 47.800 | 42.200 | 49.800 | 43.600 | 42.800 | 39.800 | 9.000 | 48.400 | 38.000 |
-| imda_part4_30s_ds | 46.400 | 40.200 | 46.600 | 42.800 | 33.200 | 31.600 | 7.400 | 45.600 | 38.200 |
-| imda_part5_30s_ds | 54.600 | 51.800 | 55.400 | 55.600 | 52.200 | 42.800 | 16.000 | 53.400 | 46.200 |
-| imda_part6_30s_ds | 65.600 | 60.000 | 60.600 | 61.000 | 58.800 | 58.400 | 25.200 | 56.600 | 61.000 |
-| average | 53.600 | 48.550 | 53.100 | 50.750 | 46.750 | 43.150 | 14.400 | 51.000 | 45.850 |
-
-**Speech Instruction (SI) results**
-
-| dataset | MERaLiON-1 | MERaLiON-2-3B | MERaLiON-2-10B | Phi-4-multimodal-instruct | Qwen2.5-Omni-3B | Qwen2.5-Omni-7B | SALMONN-7B | cascade-whisper_v2+sealion | cascade-whisper_v3+llama |
-|---|---|---|---|---|---|---|---|---|---|
-| alpaca_audio | 75.200 | 25.600 | 74.200 | 33.400 | 64.000 | 59.200 | 10.400 | 67.000 | 69.400 |
-| openhermes_audio | 66.400 | 12.600 | 66.200 | 39.000 | 66.000 | 57.400 | 15.400 | 78.800 | 62.800 |
-| average | 70.800 | 19.100 | 70.200 | 36.200 | 65.000 | 58.300 | 12.900 | 72.900 | 66.100 |
-
-**Audio Captioning (AC) results**
-
-| dataset | MERaLiON-1 | MERaLiON-2-3B | MERaLiON-2-10B | Phi-4-multimodal-instruct | Qwen2.5-Omni-3B | Qwen2.5-Omni-7B | SALMONN-7B | cascade-whisper_v2+sealion | cascade-whisper_v3+llama |
-|---|---|---|---|---|---|---|---|---|---|
-| audiocaps | 39.386 | 35.077 | 36.041 | 33.595 | 43.695 | 37.700 | 35.241 | 2.455 | 2.514 |
-| wavcaps | 34.566 | 31.410 | 35.168 | 28.069 | 34.705 | 26.092 | 22.520 | 3.827 | 3.318 |
-| average | 36.976 | 33.244 | 35.604 | 30.832 | 39.200 | 31.896 | 28.881 | 3.141 | 2.916 |
-
-**Accent Recognition (AR) results**
-
-| dataset | MERaLiON-1 | MERaLiON-2-3B | MERaLiON-2-10B | Phi-4-multimodal-instruct | Qwen2.5-Omni-3B | Qwen2.5-Omni-7B | SALMONN-7B | cascade-whisper_v2+sealion | cascade-whisper_v3+llama |
-|---|---|---|---|---|---|---|---|---|---|
-| voxceleb_accent | 47.066 | 66.598 | 40.788 | 2.626 | 0.903 | 1.662 | 31.699 | 28.006 | 40.295 |
-
-**Audio-Scene Question Answering (ASQA) results**
-
-| dataset | MERaLiON-1 | MERaLiON-2-3B | MERaLiON-2-10B | Phi-4-multimodal-instruct | Qwen2.5-Omni-3B | Qwen2.5-Omni-7B | SALMONN-7B | cascade-whisper_v2+sealion | cascade-whisper_v3+llama |
-|---|---|---|---|---|---|---|---|---|---|
-| audiocaps_qa | 48.818 | 44.792 | 50.351 | 40.319 | 48.562 | 50.415 | 50.351 | 17.444 | 17.061 |
-| clotho_aqa | 62.674 | 50.540 | 58.201 | 48.371 | 52.649 | 46.592 | 58.192 | 22.674 | 29.820 |
-| wavcaps_qa | 45.132 | 43.092 | 44.868 | 37.961 | 43.158 | 40.000 | 46.908 | 14.013 | 18.750 |
-| average | 52.208 | 46.141 | 51.140 | 42.217 | 48.123 | 45.669 | 51.817 | 18.044 | 21.877 |
-
-**Emotion Recognition (ER) results**
-
-| dataset | MERaLiON-1 | MERaLiON-2-3B | MERaLiON-2-10B | Phi-4-multimodal-instruct | Qwen2.5-Omni-3B | Qwen2.5-Omni-7B | SALMONN-7B | cascade-whisper_v2+sealion | cascade-whisper_v3+llama |
-|---|---|---|---|---|---|---|---|---|---|
-| iemocap_emotion | 49.104 | 51.394 | 62.550 | 32.072 | 34.363 | 36.554 | 26.195 | 41.982 | 46.912 |
-| meld_emotion | 44.176 | 52.146 | 59.808 | 40.843 | 34.330 | 30.077 | 32.299 | 44.272 | 49.425 |
-| meld_sentiment | 52.452 | 58.582 | 68.851 | 49.119 | 30.421 | 27.778 | 42.261 | 58.391 | 56.475 |
-| average | 48.577 | 54.041 | 63.736 | 40.678 | 33.038 | 31.469 | 33.585 | 48.215 | 50.938 |
-
-**Gender Recognition (GR) results**
-
-| dataset | MERaLiON-1 | MERaLiON-2-3B | MERaLiON-2-10B | Phi-4-multimodal-instruct | Qwen2.5-Omni-3B | Qwen2.5-Omni-7B | SALMONN-7B | cascade-whisper_v2+sealion | cascade-whisper_v3+llama |
-|---|---|---|---|---|---|---|---|---|---|
-| iemocap_gender | 94.622 | 87.928 | 92.968 | 46.853 | 62.948 | 43.367 | 80.199 | 12.211 | 44.382 |
-| voxceleb_gender | 99.733 | 99.692 | 97.251 | 94.584 | 32.786 | 54.083 | 88.531 | 26.631 | 69.696 |
-| average | 97.177 | 93.810 | 95.109 | 70.718 | 47.867 | 48.725 | 84.365 | 19.421 | 57.039 |
-
+**Better Automatic Speech Recognition (ASR) Accuracy**
+MERaLiON-2-10B-ASR and MERaLiON-2-10B demonstrate leading performance in Singlish, Mandarin, Malay, Tamil, and other Southeast Asian languages, while maintaining competitive results in English compared to `Whisper-large-v3`. The table below reports word error rate (lower is better).
+
+| language | MERaLiON-2-10B-ASR | MERaLiON-2-10B | MERaLiON-2-3B | whisper_large_v3 | cascade_whisper_large_v3_llama_3_8b_instruct | cascade_whisper_large_v2_gemma2_9b_cpt_sea_lionv3_instruct | MERaLiON-AudioLLM-Whisper-SEA-LION | Qwen2.5-Omni-7B | SeaLLMs-Audio-7B | Qwen2.5-Omni-3B | SALMONN_7B | phi_4_multimodal_instruct |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| thai | 0.096526 | 0.109365 | 0.107279 | 0.121073 | 0.120257 | 0.172105 | 0.919330 | 0.126497 | 0.117152 | 0.163150 | 1.191099 | 1.510068 |
+| tamil | 0.271279 | 0.327081 | 0.344081 | 0.441483 | 0.475225 | 0.492336 | 0.561315 | 1.024916 | 2.325402 | 1.315143 | 1.306694 | 1.876722 |
+| singlish | 0.129830 | 0.168813 | 0.180395 | 0.248945 | 0.251608 | 0.255717 | 0.143800 | 0.439071 | 0.795990 | 0.389393 | 0.441490 | 0.448863 |
+| malay | 0.194638 | 0.209074 | 0.279891 | 0.219692 | 0.311921 | 0.314378 | 0.289895 | 1.460664 | 0.765565 | 2.943750 | 1.085867 | 3.762933 |
+| english | 0.078544 | 0.088259 | 0.122295 | 0.080841 | 0.081568 | 0.104830 | 0.110567 | 0.134216 | 0.197824 | 0.110353 | 0.191492 | 0.098225 |
+| indonesian | 0.121020 | 0.142813 | 0.131950 | 0.137102 | 0.135390 | 0.159476 | 0.298365 | 0.168659 | 0.220227 | 0.205216 | 1.653502 | 3.565510 |
+| mandarin | 0.103694 | 0.132025 | 0.145878 | 0.170980 | 0.196867 | 0.291733 | 0.291183 | 0.102419 | 0.309782 | 0.130429 | 0.939545 | 0.238879 |
+| vietnamese | 0.118693 | 0.134808 | 0.155110 | 0.148474 | 0.136075 | 0.164078 | 0.952040 | 0.205491 | 0.222001 | 0.186786 | 1.521174 | 1.805643 |
+| private | 0.106150 | 0.112360 | 0.147258 | 0.116630 | 0.118434 | 0.143812 | 0.130667 | 0.222770 | 0.496540 | 0.164556 | 0.273304 | 0.229450 |
+
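+As a reference for how such a score is computed, here is a minimal sketch using the open-source `jiwer` package (illustrative only; the AudioBench pipeline's exact text normalization may differ):
+
+```python
+# Word error rate (WER) = (substitutions + insertions + deletions)
+# divided by the number of reference words. Lower is better.
+import jiwer
+
+reference = "the quick brown fox jumps over the lazy dog"
+hypothesis = "the quick brown fox jump over a lazy dog"
+
+print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")  # 2 errors / 9 words = 0.222
+```
+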
+**Better Instruction Following and Audio Understanding**
+
+
+MERaLiON-2-10B demonstrates significant improvements across speech understanding, audio understanding, and paralinguistic tasks. Specifically, MERaLiON-2-10B can handle more complicated instructions and answer with more flexibility, minimizing the loss of Gemma's pre-trained knowledge during the audio finetuning process. This allows MERaLiON-2-10B to provide more detailed explanations for queries about the speech content or the speaker's emotional state. With further adjustment of the text prompt, it can play different roles, such as a voice assistant or virtual caregiver, or become part of a sophisticated multi-agent AI system or software solution.
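+
+As an example of role steering through the text prompt alone, the sketch below assumes the `<SpeechHere>` prompt template from the earlier MERaLiON-AudioLLM card; see the How to Use section below for the authoritative MERaLiON-2 usage:
+
+```python
+# Illustrative only: the same audio clip can be steered into different roles
+# by changing just the text instruction.
+prompt_template = "Given the following audio context: <SpeechHere>\n\nText instruction: {query}"
+
+transcriber_prompt = prompt_template.format(query="Please transcribe this speech.")
+caregiver_prompt = prompt_template.format(
+    query="Act as a virtual caregiver: describe the speaker's emotional state "
+          "and respond with one empathetic suggestion."
+)
+```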
+
+| task | MERaLiON-2-10B | MERaLiON-AudioLLM-Whisper-SEA-LION | MERaLiON-2-10B-ASR | MERaLiON-2-3B | SeaLLMs-Audio-7B | Qwen2-Audio-7B-Instruct | Qwen2.5-Omni-3B | phi_4_multimodal_instruct | cascade_whisper_large_v3_llama_3_8b_instruct | Qwen2.5-Omni-7B | cascade_whisper_large_v2_gemma2_9b_cpt_sea_lionv3_instruct | Qwen-Audio-Chat | SALMONN_7B | WavLLM_fairseq |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| speech_instruction | 70.200000 | 70.800000 | 13.400000 | 19.100000 | 66.900000 | 48.700000 | 65.000000 | 36.200000 | 66.100000 | 58.300000 | 72.900000 | 10.200000 | 12.900000 | 20.400000 |
+| emotion_recognition | 63.736268 | 48.577313 | 53.693298 | 54.040797 | 52.007576 | 49.846540 | 33.037836 | 40.677800 | 50.937578 | 31.469397 | 48.214969 | 41.671551 | 33.584869 | 50.801545 |
+| audio_scene_question_answering | 51.140374 | 52.207756 | 49.511886 | 46.141353 | 50.193739 | 47.048025 | 48.123228 | 42.217143 | 21.876943 | 45.669153 | 18.043681 | 51.618622 | 51.816958 | 33.034083 |
+| gender_recognition | 95.109423 | 97.177396 | 97.220335 | 93.810266 | 75.449392 | 95.963266 | 47.867210 | 70.718047 | 57.039409 | 48.724711 | 19.421130 | 60.349349 | 84.365092 | 60.773275 |
+| sqa_singlish | 66.550000 | 58.900000 | 61.850000 | 59.700000 | 51.350000 | 46.700000 | 60.500000 | 61.950000 | 59.350000 | 58.400000 | 53.750000 | 42.300000 | 43.200000 | 51.200000 |
+| audio_captioning | 35.604270 | 36.976419 | 34.466710 | 33.243839 | 45.089372 | 37.278810 | 39.200328 | 30.832409 | 2.915778 | 31.896243 | 3.140568 | 39.988663 | 28.880570 | 6.200867 |
+| sds_singlish | 53.100000 | 53.600000 | 55.800000 | 48.550000 | 45.450000 | 36.300000 | 46.750000 | 50.750000 | 45.850000 | 43.150000 | 51.000000 | 25.250000 | 14.400000 | 39.450000 |
+| sqa_english | 79.735049 | 63.711481 | 73.975834 | 68.715179 | 70.920519 | 68.888565 | 67.818546 | 75.513152 | 78.526569 | 68.415131 | 67.814538 | 66.069047 | 60.649071 | 70.595242 |
+| music_understanding | 63.942713 | 51.347936 | 60.657119 | 55.602359 | 63.689975 | 71.609099 | 59.309183 | 55.265375 | 56.697557 | 47.598989 | 50.463353 | 59.056445 | 49.705139 | 44.313395 |
+| accent_recognition | 41.815396 | 43.799799 | 47.788864 | 60.054981 | 10.143836 | 10.901397 | 0.478694 | 3.097615 | 21.398482 | 0.587293 | 25.929693 | 17.550294 | 11.577381 | 14.294613 |
+| st | 27.391115 | 27.086366 | 28.540359 | 22.130258 | 21.143215 | 10.826666 | 21.776628 | 13.827110 | 13.536272 | 20.688241 | 21.437997 | 4.973184 | 13.486003 | 9.046791 |
+
## 🔧 How to Use