V2 Multilingual Models: Offline Use and Comparison with V1

#44
by skypanda64

I've been using Chatterbox offline via TTS-webui for some time now and wanted to share a few thoughts and observations from a non-technical user's perspective.

First off: Chatterbox is amazing. It blows the comparable competition out of the water in terms of cost/performance and is far and away the most human-sounding TTS model I've tried offline.

Multilingual support is of particular interest to me, as I've been experimenting with pairing Chatterbox with a text-generation LLM front end as an assistant and multilingual conversation/education partner. In short, multilingual is a pretty amazing feature, and I think the potential for practical applications is off the charts.
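For anyone curious, here's roughly the shape of that pipeline. This is just a minimal sketch: the class and module names (`ChatterboxMultilingualTTS` in `chatterbox.mtl_tts`, the `language_id` parameter) are from the chatterbox repo as I understand it, and `ask_llm()` is a hypothetical stand-in for whatever text-generation front end you run.

```python
import torchaudio as ta
from chatterbox.mtl_tts import ChatterboxMultilingualTTS

def ask_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in a call to your own
    # text-generation front end / API client here.
    return "Bonjour ! Aujourd'hui, on va pratiquer le français."

model = ChatterboxMultilingualTTS.from_pretrained(device="cuda")

reply = ask_llm("Let's practice some French greetings.")
# language_id picks the target language for the multilingual model.
wav = model.generate(reply, language_id="fr")
ta.save("reply.wav", wav, model.sr)
```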

A few initial observations after trying both the V1 and V2 multilingual models: English feels like it took a big leap forward from old to new, and I can't see myself using another model given how strong Chatterbox is right now.

Other languages are more interesting; of those, I've tested Chinese, Japanese, and Korean most extensively. These all seem to have taken a step back in their "native" language in exchange for being able to speak really good English now too. I think it's especially noticeable with these non-Latin-script languages, which in V1 would either speak extremely accented English or start to output gibberish. Now they can speak English with the barest hint of an accent (like literally 10-15% on my subjective scale of obviousness), all in the same generation instance alongside text in the other language. The cost in single-language accuracy pops up when the model starts to mispronounce words that V1 could read well, especially less common vocabulary in Chinese and Japanese. I'm guessing this was probably an intentional trade-off and a stepping stone towards truly seamless multilingual TTS.

In terms of cadence, Latin-script languages seem to speak with super natural intonation across the board (I tried English, Italian, and French), whereas Chinese, Japanese, and Korean tend to speak super fast with the same expressiveness and weight settings (0.7 & 0.3 or 0.5 & 0.5, as recommended) and as a result sound less natural compared to V1.
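For reference, I believe those two sliders map to the `exaggeration` and `cfg_weight` parameters of `generate` in the Python API (in that order), so the same comparison scripted directly would look roughly like this sketch:

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
text = "The quick brown fox jumps over the lazy dog."

# The two recommended setting pairs mentioned above:
# (exaggeration, cfg_weight) = (0.7, 0.3) or (0.5, 0.5)
for exaggeration, cfg_weight in [(0.7, 0.3), (0.5, 0.5)]:
    wav = model.generate(text, exaggeration=exaggeration, cfg_weight=cfg_weight)
    ta.save(f"sample_e{exaggeration}_w{cfg_weight}.wav", wav, model.sr)
```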

I've also found it interesting to play around with different voice samples: clips of similar quality can yield drastically different results, so the outcome doesn't seem strongly tied to sample quality. Some cloned voices sound uncannily real, while others come out more robotic. I've tried cutting sections from the same audiobook and the results are variable, especially with accents. It's a fun game of roulette sometimes to see whether a sample of the same voice will come out with the right accent or morph into another. It's hilarious because the tone and timbre of the voice will still be identical. Either way, this is super cool tech.
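If you want to play the same roulette, the easiest way is to batch a few clips of the same speaker through the `audio_prompt_path` argument and compare the outputs. A sketch along the same lines as above; the clip filenames here are just hypothetical examples:

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
text = "Testing how consistently the cloned accent comes out."

# Several cuts from the same audiobook narrator (hypothetical filenames);
# in my experience these can clone with noticeably different accents.
clips = ["narrator_cut1.wav", "narrator_cut2.wav", "narrator_cut3.wav"]
for i, clip in enumerate(clips):
    wav = model.generate(text, audio_prompt_path=clip)
    ta.save(f"clone_{i}.wav", wav, model.sr)
```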

Overall I'm super excited to see the future development and growth of TTS, and Chatterbox is right up there with the best of 'em. Thanks for all the great work you guys do; I'll be following further updates and improvements closely!
