
Breaking the language barriers with Meta's SeamlessM4T

šŸ TheTechOasis šŸ

Welcome to TTO, the future of AI and the world made simple.

🤯 AI Research of the week 🤯

Because it's not what you say, but how you say it.

For a long time, developing high-quality speech AI models has been an obsession for research teams all around the world.

The potential for AI to bridge the 'language gap' and create universal machine translators is huge, as there are more than 7,200 known languages in the world.

Sadly, as AI is overly reliant on data, most AI solutions today focus primarily on the world's 10-12 most spoken languages, leaving around half of the world's population behind.

So, what if we could create an AI model that lets us communicate with anyone, anytime, and in real time?

We aren't quite there yet, but we have just made substantial progress thanks to Meta and its SeamlessM4T models, which support speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages.

All in one.

The Holy Grail

Building a high-performing speech model has long been considered a utopian goal in the world of AI. Especially if we're talking about translating text or speech between languages.

For reference, for a text-only solution, English-Italian labeled data can easily be sourced in volumes above 100 million text pairs, but you will sweat quite a bit to get past 2 million pairs for speech training.

And we're talking about two fairly popular languages; imagine the situation for minority ones.

Speech data is, put simply, very scarce.

However, new types of AI models trained with unsupervised or self-supervised approaches are yielding great results, because they allow researchers to train strong models on audio data that hasn't been curated.

Unsupervised and self-supervised solutions do not require extensive human labeling of the data, allowing models to be trained on huge corpora that would be impossible to assemble if every single example had to be labeled manually.

Consequently, we've seen a proliferation of cascaded models: systems where models specialized in specific use cases (speech-to-text or text-to-speech, for instance) are chained together to perform a complete speech-to-speech translation, the holy grail of AI speech modeling.
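
To make the contrast concrete, here is a purely illustrative sketch of what such a cascade looks like in code. The model objects and their methods are hypothetical placeholders rather than a real library API; the point is simply that three separately trained systems are glued together, so errors in each stage propagate to the next.

    # Purely illustrative cascade: three separately trained models chained together.
    # `asr_model`, `mt_model` and `tts_model` are hypothetical placeholders, not a real API.
    def cascaded_speech_to_speech(audio, src_lang, tgt_lang, asr_model, mt_model, tts_model):
        transcript = asr_model.transcribe(audio, lang=src_lang)                   # speech -> text
        translation = mt_model.translate(transcript, src=src_lang, tgt=tgt_lang)  # text   -> text
        return tts_model.synthesize(translation, lang=tgt_lang)                   # text   -> speech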

But now Meta has pushed the boundaries of AI one step further, creating the first model that can perform all text- and speech-related translation tasks across up to 100 different languages.

The everything model.

But how does it work?

1 + 1 = 100 - UnitY

As I always say, true multimodality is achieved when a unique set of weights is optimized for several modalities, which is precisely what Meta has done.

And, in the process, they also allow these weights to handle not only multiple modalities but also multiple tasks.

To achieve this, they have built the UnitY architecture.

This architecture consists of two parts:

  1. The X2T model for text generation

  2. The complete UnitY solution (X2T + T2U + HiFi-GAN) to synthesize speech

The X2T model takes in speech or text and returns text. On its own, it is enough for speech-to-text translation (S2TT), text-to-text translation (T2TT), and automatic speech recognition (ASR).

Although this may look daunting, if you look carefully it's always the same two basic elements found in almost every other multimodal solution:

  • A text-to-text Transformer, of the kind usually described as a Large Language Model (LLM), which takes in a text sequence and returns another one (in this case the SeamlessM4T-NLLB model)

  • An encoder for the other modality, in this case speech (the w2v-BERT 2.0 model).

The text decoder of the LLM is used to generate text regardless of the input received, whether speech (for speech-to-text or ASR) or text (for the usual text-to-text generation).
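
To ground the idea of one set of weights serving several modalities, here is a minimal, hypothetical PyTorch sketch of the X2T structure: two modality-specific encoders feeding a single shared text decoder. The layer sizes, module choices, and names are illustrative assumptions and do not reflect the actual SeamlessM4T configuration.

    import torch.nn as nn

    # Minimal, hypothetical sketch of the X2T idea: a speech encoder and a text encoder
    # both feed one shared text decoder. Sizes and names are illustrative assumptions,
    # not the real SeamlessM4T configuration.
    class X2TSketch(nn.Module):
        def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=4):
            super().__init__()
            enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            # Stand-in for the w2v-BERT 2.0 speech encoder: 80-dim audio frames -> d_model vectors
            self.speech_proj = nn.Linear(80, d_model)
            self.speech_encoder = nn.TransformerEncoder(enc_layer, n_layers)
            # Stand-in for the NLLB-style text-to-text Transformer (encoder + decoder)
            self.embed = nn.Embedding(vocab_size, d_model)
            self.text_encoder = nn.TransformerEncoder(enc_layer, n_layers)
            self.text_decoder = nn.TransformerDecoder(dec_layer, n_layers)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, target_tokens, speech_frames=None, source_tokens=None):
            # Whichever modality comes in, it is encoded into the same kind of memory...
            if speech_frames is not None:
                memory = self.speech_encoder(self.speech_proj(speech_frames))  # (B, T_audio, d_model)
            else:
                memory = self.text_encoder(self.embed(source_tokens))          # (B, T_text, d_model)
            # ...and one shared decoder turns that memory into output text logits,
            # covering S2TT, T2TT and ASR with the same decoder weights.
            hidden = self.text_decoder(self.embed(target_tokens), memory)
            return self.lm_head(hidden)                                        # (B, T_out, vocab_size)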

But what if you want to generate speech from the text/speech inputs?

For that, you need the second part of the model (X2T + T2U + HiFi-GAN):

As the architecture implies, we take the decoded text from the X2T stage and use it to generate the desired speech.

In other words, for text-to-speech or speech-to-speech cases, the initial text or speech input is first turned into text with the X2T model, and that text is then decoded into speech by the following two models (a minimal sketch follows this list):

  • A text-to-unit (T2U) model that takes the text and transforms it into discrete acoustic units

  • The discrete acoustic units are then decoded into speech using a HiFi-GAN model
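
Here is the minimal sketch of that second stage promised above. The object names are illustrative placeholders for the T2U model and the vocoder, not the actual SeamlessM4T interfaces.

    # Hypothetical sketch of the second stage: text -> discrete units -> waveform.
    # `t2u_model` and `unit_vocoder` are illustrative placeholders, not a real API.
    def synthesize_from_text(text_tokens, t2u_model, unit_vocoder):
        unit_ids = t2u_model.generate(text_tokens)  # text  -> sequence of discrete acoustic unit IDs
        waveform = unit_vocoder(unit_ids)           # units -> raw audio samples (HiFi-GAN-style vocoder)
        return waveform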

But what are these 'discrete acoustic units'?

Instead of synthesizing speech directly from continuous data, which is a very complex task, the standard approach today is to first decompose the data into discrete elements and then use those elements to generate audio.

In other words, for a given text to be turned into speech, that text is first broken down into discrete acoustic units that the vocoder can easily understand and generate speech from.

As it turns out, it's much easier to generate speech when the input is discrete rather than continuous, as this makes the learning process much more tractable.
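
As a toy illustration of what 'discrete' means here, the snippet below snaps each frame of a continuous feature sequence to its nearest entry in a codebook, producing the kind of integer unit sequence a vocoder consumes. The codebook is random noise purely for illustration; real systems learn it (with k-means, for example) on large speech corpora.

    import numpy as np

    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(1000, 80))  # stand-in for 1,000 learned acoustic units (80-dim each)
    features = rng.normal(size=(200, 80))   # 200 frames of continuous speech features

    # Nearest-codebook-entry lookup: continuous frames -> discrete unit IDs
    distances = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    unit_ids = distances.argmin(axis=1)     # shape (200,), integers in [0, 1000)

    print(unit_ids[:10])                    # the discrete sequence a unit vocoder would consume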

Meta continues its strategy

In summary, SeamlessM4T is a unified model for machine translation that supports up to 100 languages in various formats: speech-to-speech, speech-to-text, text-to-speech, and text-to-text.
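
If you want to try it yourself, SeamlessM4T checkpoints are also available through the Hugging Face transformers library. The rough sketch below follows the public documentation; the checkpoint name, language codes, and argument names may differ across library versions, so treat it as a starting point rather than a guaranteed recipe.

    # Hedged sketch of running SeamlessM4T via Hugging Face `transformers`.
    # Checkpoint name, language codes and arguments may differ across versions.
    from transformers import AutoProcessor, SeamlessM4TModel

    processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
    model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

    # English text -> Spanish speech (T2ST): generate() returns a waveform by default
    text_inputs = processor(text="Hello, how are you?", src_lang="eng", return_tensors="pt")
    audio = model.generate(**text_inputs, tgt_lang="spa")[0].cpu().numpy().squeeze()

    # English text -> Spanish text (T2TT): skip the speech head and decode tokens instead
    output_tokens = model.generate(**text_inputs, tgt_lang="spa", generate_speech=False)
    translated = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
    print(translated)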

But more importantly, it showcases, once again, Meta's approach to AI.

By making all of their models open-source (though not for commercial use), they seem to want to make sure that no company in the world develops a moat based on AI.

By continuously presenting state-of-the-art solutions and distilling the knowledge of how to build them to the masses, Meta guarantees that companies need to find other ways besides AI itself to grow.

I tend to support this vision, as I feel that AI will never (and should never) be a moat for any company.

All companies will be AI native, but none will have a competitive advantage based on it.

That's my bet, but what's yours?

Will open-source eventually win the AI race?


Key concepts to retain from the paper

- The gold standard for AI speech-to-speech generation

- The unstoppable surge of solutions where one set of weights is optimized for several modalities and tasks. True multimodality is here.

- Meta's AI strategy, democratizing knowledge so no company leads based on AI alone

🔮 Practical applications 🔮

- Universal translators: communicate with any human on this planet using real-time machine translation solutions

- Accessibility: Allow speech-impaired people to speak any language fluently

- Automation: In 1-2 years, most call centers will be almost entirely based on solutions like SeamlessM4T

👾 Other news of the week 👾

🥳 Fine-tuning ChatGPT is now possible

🤨 AI-generated career portraits are a thing now
