Breaking the language barriers with Meta's SeamlessM4T
🏝 TheTechOasis 🏝
Welcome to TTO, the future of AI and the world made simple.
🤯 AI Research of the week 🤯
Because it’s not what you say, but how you say it.
For a long time, developing high-quality speech AI models has been an obsession for research teams all around the world.
The potential for AI to bridge the ‘language gap’ and create universal machine translators is huge, as there are more than 7,200 known languages in the world.
Sadly, as AI is overly reliant on data, most AI solutions today focus primarily on the world's 10-12 most spoken languages, leaving around half of the world’s population behind.
Thus, what if we could create an AI model that allows us to communicate with anyone, anytime, and in real-time?
We aren’t quite there yet, but we have just made substantial progress thanks to Meta and its SeamlessM4T models, which support speech-to-speech translation (S2ST), speech-to-text translation (S2TT), text-to-speech translation (T2ST), text-to-text translation (T2TT), and automatic speech recognition (ASR) for up to 100 languages.
All in one.
The Holy Grail
Building a high-performing speech model has long been considered a utopia in the world of AI, especially one that translates text or speech between languages.
For reference, English–Italian labeled text pairs can easily be sourced in volumes above 100 million, but you will sweat quite a bit to gather even 2 million pairs for speech training.
And we’re talking about two fairly popular languages, imagine the minority ones.
Speech data is, put simply, very scarce.
However, new types of AI models trained with unsupervised or self-supervised approaches are yielding great results. This matters because it lets researchers train strong models on audio data that hasn’t been curated or labeled.
Consequently, we’ve seen a proliferation of cascaded models: pipelines of specialized models (speech-to-text or text-to-speech, for instance) chained together to perform a complete speech-to-speech translation, the holy grail of AI speech modeling.
But now Meta has pushed the boundaries of AI one step further, creating the first single model that can perform all text- and speech-related translation tasks for up to 100 different languages.
The everything model.
But how does it work?
1 + 1 = 100 - UnitY
As I always say, true multimodality is achieved when a unique set of weights is optimized for several modalities, which is precisely what Meta has done.
And, in the process, they also allow for these weights to not only handle multiple modalities, but also multiple tasks.
To achieve this, they have built the UnitY architecture, as per the image below:
This architecture is comprised of two parts:
The X2T model for text generation
Complete UnitY solution (X2T + T2U + HiFi-GAN) to synthesize speech
The X2T model takes in speech or text and returns text. On its own, it covers speech-to-text translation (S2TT), text-to-text translation (T2TT), and automatic speech recognition (ASR).
Although this may look daunting, if you look carefully it’s always the same two basic elements found in almost every other multimodal solution:
A text-to-text Transformer, of the kind usually described as a Large Language Model (LLM), that takes in a text sequence and returns another one (in this case the SeamlessM4T-NLLB model)
An encoder for the other modality, in this case speech (the w2v-BERT 2.0 model).
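To make the pattern concrete, here is a minimal, purely illustrative sketch of the X2T idea: one shared text decoder fed by either a speech encoder or a text encoder. Every function and number below is a hypothetical stand-in, not Meta's actual implementation.

```python
# Illustrative sketch of the X2T pattern: two interchangeable encoders
# feeding one shared text decoder. All names/values here are made up.

def speech_encoder(waveform):
    """Stand-in for w2v-BERT 2.0: chops raw audio into fixed-size
    frames, as a proxy for 'one feature vector per audio frame'."""
    frame = 320  # samples per frame (hypothetical)
    return [waveform[i:i + frame] for i in range(0, len(waveform), frame)]

def text_encoder(tokens):
    """Stand-in for the text encoder: token IDs in, features out."""
    return [[float(t)] for t in tokens]

def text_decoder(features, tgt_lang):
    """Stand-in for the shared decoder (the LLM-like part): whatever
    the input modality was, the output is always text."""
    return f"<{tgt_lang}> " + " ".join(str(len(f)) for f in features[:3])

def x2t(inp, modality, tgt_lang):
    # The key idea: one decoder, two interchangeable encoders.
    features = speech_encoder(inp) if modality == "speech" else text_encoder(inp)
    return text_decoder(features, tgt_lang)
```

Because the decoder is shared, the same set of weights serves S2TT, T2TT, and ASR, which is what makes the model truly multimodal and multitask.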
But what if you want to generate speech from the text/speech inputs?
For that, you need the second part of the model (X2T + T2U + HiFi-GAN):
As the architecture above implies, the text decoded by the X2T model becomes the input used to generate the desired speech.
In other words, for text-to-speech or speech-to-speech cases, the initial text and speech inputs are first turned into text with the X2T model, and then that text is decoded into speech through the following two models:
A text-to-unit model that takes the text and transforms it into discrete acoustic units
The discrete acoustic units are then decoded into speech using a HiFi-GAN model
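The two steps above compose into a simple pipeline. Here is a toy sketch of that composition under made-up assumptions (the codebook size, the mapping from characters to units, and the samples-per-unit count are all hypothetical, not the actual T2U or HiFi-GAN models):

```python
# Hypothetical sketch of the UnitY synthesis stage: text from X2T is
# mapped to discrete acoustic units (T2U), then a vocoder (HiFi-GAN
# in the paper) expands the units into a waveform.

CODEBOOK_SIZE = 1000  # size of the discrete unit vocabulary (made up)

def t2u(text):
    """Stand-in text-to-unit model: text in, a sequence of discrete
    unit IDs (integers indexing the codebook) out."""
    return [ord(c) % CODEBOOK_SIZE for c in text]

def vocoder(units, samples_per_unit=256):
    """Stand-in for HiFi-GAN: each discrete unit expands into a short
    stretch of waveform samples."""
    wave = []
    for u in units:
        wave.extend([u / CODEBOOK_SIZE] * samples_per_unit)
    return wave

def text_to_speech(text):
    units = t2u(text)      # text -> discrete acoustic units
    return vocoder(units)  # units -> synthesized waveform
```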
But what are these ‘discrete acoustic units’?
Instead of synthesizing speech directly from continuous data, which is a very complex task, the standard approach today is to first decompose the content into discrete elements, and then use those elements to generate audio.
In other words, for a given text to be turned into speech, that text is first mapped to discrete acoustic units that the vocoder can easily understand and generate speech from.
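For intuition, such discrete units are typically obtained by clustering continuous speech features (with k-means, for example) and replacing each feature frame with the ID of its nearest cluster centroid. A minimal sketch with a toy three-entry codebook (the centroid values are invented for illustration; real codebooks are learned from large speech corpora):

```python
# Quantizing continuous feature frames into discrete unit IDs by
# nearest-centroid lookup. Centroids here are toy values.

def nearest_unit(frame, centroids):
    """Return the index of the centroid closest to this feature frame
    (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(frame, c))
    return min(range(len(centroids)), key=lambda i: dist(centroids[i]))

def quantize(frames, centroids):
    """Continuous frames in, a sequence of discrete unit IDs out."""
    return [nearest_unit(f, centroids) for f in frames]

centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # toy 3-unit codebook
frames = [[0.1, 0.2], [0.9, 1.1], [0.05, 0.95]]
units = quantize(frames, centroids)  # → [0, 1, 2]
```

Once speech is reduced to such integer sequences, generating audio becomes a sequence-to-sequence problem over a small vocabulary, which is far more tractable than regressing raw waveforms.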
Meta continues its strategy
In summary, SeamlessM4T is a unified model for machine translation that supports up to 100 languages in various formats: speech-to-speech, speech-to-text, text-to-speech, and text-to-text.
But more importantly, it showcases, once again, Meta’s approach to AI.
By making all their models open source (though not for commercial use), they seem to want to make sure that no company in the world develops a moat based on AI.
By continuously presenting state-of-the-art solutions and distilling the knowledge on how to build them to the masses, Meta guarantees that companies need to find other ways besides AI itself to grow.
I tend to support this vision, as I feel that AI will never (and should never) be a moat for any company.
All companies will be AI native, but none will have a competitive advantage based on it.
That’s my bet, but what’s yours?
Will open-source eventually win the AI race?
Key concepts to retain from the paper
🔮 Practical applications 🔮
👾 Other news of the week 👾
🥳 Fine-tuning ChatGPT is now possible
🤨 AI-generated career portraits are a thing now