A significant step towards removing language barriers through expressive, fast and high-quality AI translation
SeamlessM4T
SeamlessM4T is our foundational all-in-one Massively Multilingual and Multimodal Machine Translation model delivering high-quality translation for speech and text in nearly 100 languages.
SeamlessM4T models support the tasks of:
Speech-to-speech translation (S2ST)
Speech-to-text translation (S2TT)
Text-to-speech translation (T2ST)
Text-to-text translation (T2TT)
Automatic speech recognition (ASR)
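For readers skimming the acronyms, the five tasks above differ only in input modality, output modality, and whether the language changes. A minimal sketch of that mapping follows; the `Task` class and `tasks_for` helper are made up for illustration and are not part of the SeamlessM4T codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Illustrative record for one supported task (not a SeamlessM4T type)."""
    code: str
    name: str
    source: str       # input modality: "speech" or "text"
    target: str       # output modality: "speech" or "text"
    translates: bool  # False when the output stays in the input language


# The five tasks listed above, written out as data.
TASKS = [
    Task("S2ST", "Speech-to-speech translation", "speech", "speech", True),
    Task("S2TT", "Speech-to-text translation", "speech", "text", True),
    Task("T2ST", "Text-to-speech translation", "text", "speech", True),
    Task("T2TT", "Text-to-text translation", "text", "text", True),
    Task("ASR", "Automatic speech recognition", "speech", "text", False),
]


def tasks_for(source: str, target: str) -> list[str]:
    """Return the task codes that take `source` input and produce `target` output."""
    return [t.code for t in TASKS if t.source == source and t.target == target]
```

Note that speech-in, text-out covers two tasks: S2TT changes the language, while ASR transcribes in the source language.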
🌟 We are releasing SeamlessM4T v2, an updated version with our novel UnitY2 architecture. This new model improves over SeamlessM4T v1 in quality as well as inference latency in speech generation tasks.
To learn more about the collection of SeamlessM4T models, the approach used in each, their language coverage and their performance, visit the SeamlessM4T README or 🤗 Model Card.
I can't wait to see how well the Expressive model does on anime and foreign films. I wouldn't be surprised if this was the end of terrible dubs.
This is gonna be great for language learning as well. Finally being able to pick any media and watch it in any language. It might even be possible to rig it up to an LLM to tune the vocab to your exact level...
Yeah, I was over-enthusiastic based on their cherry-picked examples. SeamlessExpressive still leaves a lot to be desired.
It has a limited range of emotions and can't change emotion in the middle of the clip. It can't produce the pitch shifts of someone talking excitedly, making the output sound monotonous. Background noise in the input causes a raspy, distorted output voice. Sighs, inter-sentence breaths, etc. aren't reproduced. Sometimes the sentence pacing is just completely unnatural, with missing pauses or pauses in bad places (e.g. before the sentence-final verb in German).
IMO their manual dataset creation is holding them back. If I were in this field, I would try to follow the LLM route: start with a next-token predictor trained indiscriminately on large-scale speech+text data (e.g. TV shows, movies, news radio, all with subtitles, even if the subs need to be AI-generated), fine-tune it for specific tasks, mainly learning to predict and generate based on "style tokens" (speaker, emotion, accent, pacing), then generate a massive "textbook" synthetic dataset. The translation aspect could be almost completely outsourced to LLMs or multilingual subtitles.
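To make the "style tokens" idea a bit more concrete, here is a rough sketch of how one training sequence for such a next-token predictor might be laid out. Everything in it is an assumption for illustration: the `<key:value>` token format, the `<text>` and `<audio>` separators, and the integer audio units are invented here and are not how SeamlessExpressive or any existing model works.

```python
def build_training_sequence(style, text_tokens, audio_units):
    """Lay out one example as: style tokens, then text, then discrete audio units.

    A next-token predictor trained on sequences like this could learn to
    generate audio units conditioned on the style prefix. The token format
    is hypothetical.
    """
    # Sort keys so the same style always yields the same prefix order.
    style_prefix = [f"<{key}:{value}>" for key, value in sorted(style.items())]
    # Map discrete audio unit IDs into the shared vocabulary.
    audio_tokens = [f"<unit_{u}>" for u in audio_units]
    return style_prefix + ["<text>"] + list(text_tokens) + ["<audio>"] + audio_tokens


example = build_training_sequence(
    {"speaker": "spk42", "emotion": "excited", "pacing": "fast"},
    ["guten", "morgen"],
    [17, 203, 5],
)
print(example)
```

At generation time you would feed the style prefix and text, then let the model continue after `<audio>`; swapping `emotion` or `pacing` in the prefix is what would steer the output.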