
SALMONN, the First Model that Hears like Humans do

🏝 TheTechOasis 🏝

Breaking down the most advanced AI systems in the world to prepare you for your future.

5-minute weekly reads.

TLDR:

  • AI Research of the Week: SALMONN, the First-Ever Audio-Language Multimodal LLM

  • Leaders: Why Biden’s AI Executive Order May Change the World Forever

🤯 AI Research of the week 🤯

People often underestimate the importance of hearing to function correctly in our world and, more importantly, as an essential tool for learning.

As the famed Helen Keller once said, “Blindness cuts us off from things, but deafness cuts us off from people.” And let’s not forget that this woman was both blind and deaf.

Therefore, it's only natural to see hearing as an indispensable requirement for AI to become the sought-after superior ‘being’ that some people predict it will become.

Sadly, current AI systems suck at hearing.

Yes, OpenAI’s Whisper understands speech pretty well, and other models capture audio events very efficiently.

But hearing is much more than that. Humans understand speech, encode random noises, and enjoy music, making ‘generic hearing’ one of the last human abilities that AI couldn’t replicate.

Now, a new model created by the company behind TikTok, ByteDance, challenges this vision.

SALMONN is the first-ever multimodal audio-language AI system for generic hearing, a model that can process random audio signals from the three main sound types: speech, audio events, and music.

What’s more, as we’ll see shortly, it showcases truly unique, never-seen-before capabilities like audio storytelling or audio-speech co-reasoning.

And today, we are breaking down how it works.

Building True Multimodality

To build models that process data in modalities beyond text, we usually connect an encoder and a Large Language Model (LLM) through some sort of adaptor or cross-attention mechanism, as in the recently covered LLaVA.

Image-language multimodality is very common; audio-language multimodality, on the other hand, is far less explored today.

But in both cases, a latent problem persists.

Most of these models are great at standalone vision, audio, or language tasks, but fail miserably when facing cross-modal tasks, especially when talking about audio models.

For instance, if we send the model an audio file containing the sentence “Why am I shouting?”, it should give completely different answers depending on the speaker’s situation.

In one scenario, the audio might include gunshots or explosions; in another, loud music, singing, and so on.

Thus, both situations require different answers, but current audio models can’t handle such complex tasks.

Amazingly, SALMONN can.

A Tale of Three Steps

Although at first glance SALMONN looks similar to how other Multimodal LLMs are trained, it has some important distinctions.

Bridging data types

LLMs like ChatGPT take a text sequence and complete it by predicting the ‘next word’.

To do so, they have been trained by seeing millions upon millions of text passages.

But if we send them an image or an audio clip, it won’t work, because the model won’t understand it, just like you won’t get an answer back if you ask a Japanese person “Where’s the Shibuya crossing?” while sightseeing in Tokyo, unless that person speaks English.

One thing you can do is show that person an image of the crossing, and he/she will naturally understand you want directions without having to speak English.

This works because a question about the Shibuya crossing and an image of the crossing itself refer to the same salient concept: the world-famous landmark.

We humans are multimodal, so the concept remains the same no matter the way it is presented to us, via speech or images.

But to achieve this in AI, we have to create a ‘joint-embedding space’.

The embedding space is a compressed representation of our world that the model builds during training, where similar things have similar vectors, and dissimilar things have widely different vectors.

Machines only understand numbers. Hence, real life concepts like ‘cat’ or ‘dog’ are represented in the form of vectors so that machines can process them. This is called an embedding.

LLMs work only with text, so for them to be able to comprehend that an image and a text describing the ‘Shibuya crossing’ refer to the same thing, we need the output vectors of the image and the text to be very similar.
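As a toy illustration, “very similar” is usually measured with cosine similarity between vectors. The embeddings below are made up for the example; real models use hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: ~1.0 means same direction (same concept),
    # values near 0 mean the vectors are unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional embeddings, for illustration only.
text_shibuya = [0.9, 0.1, 0.3, 0.7]   # "the Shibuya crossing" (text)
image_shibuya = [0.8, 0.2, 0.3, 0.6]  # a photo of the crossing
text_cat = [0.1, 0.9, 0.8, 0.1]       # "a sleeping cat" (text)

print(cosine_similarity(text_shibuya, image_shibuya))  # high (~0.99)
print(cosine_similarity(text_shibuya, text_cat))       # much lower (~0.34)
```

In a well-trained joint embedding space, the text and the image of the same landmark end up close together, while unrelated concepts land far apart.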

In other words, as the sound encoders process the music, audio, or speech events and transform them into compressed representations of those audio signals (represent them in vector form), we apply a transformation that ‘maps’ those outputs into the LLM’s vector space.

This way, much like you can use a language translator to translate your English question into Japanese to ask for directions, the LLM will see the outputs of the audio and speech encoders in a form it understands.

This is what the Q-Former does, allowing the LLM to process both modalities and answer back with information about both.

Let’s see this with an example:

  1. The audio input is sent to the sound encoders, which process it into a sequence of audio embeddings.

  2. The Q-Former then takes those outputs and projects them into a vector space that the LLM can process.

Using the previous example, you can think of this as the Q-Former taking an audio clip of a person describing what the Shibuya crossing looks like and transforming it into a series of text tokens describing the Shibuya crossing.

  3. In the meantime, the text prompt is processed through the LLM’s embedding look-up matrix, transforming text data into text embeddings.

LLMs like ChatGPT don’t have a text encoder but a “look-up” matrix.

For instance, if the input sequence has the word ‘table’, the model searches for the equivalent embedding for ‘table’ in the matrix instead of calculating it with an encoder.

Consequently, this matrix has as many rows as there are words in the LLM’s vocabulary.

  4. The LLM’s decoder then processes the projected audio embeddings and text embeddings as a single sequence of data, performing the ‘next-word prediction’ exercise that any LLM performs.

As the Q-Former transforms audio tokens into text-like tokens, this exercise becomes just like any other sequence-to-sequence word prediction.

  5. Because SALMONN can process generic audio, it can separate a speaker’s voice from the background noises to process what is being said, while successfully grasping the context those background noises provide.

  6. Finally, the decoder uses this unified sequence of embeddings to predict the next word.
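The steps above can be sketched end-to-end. Everything here is a made-up simplification (the dimensions, the random weights, and the crude pooling standing in for the Q-Former’s learned queries), not the real SALMONN architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical dimensions, for illustration only ---
vocab_size, d_llm, d_audio, n_query = 100, 8, 6, 4

# 1) The LLM's embedding look-up matrix: one row per vocabulary word.
lookup_matrix = rng.normal(size=(vocab_size, d_llm))

def embed_text(token_ids):
    # "Look-up": select rows instead of computing embeddings with an encoder.
    return lookup_matrix[token_ids]

# 2) A stand-in for the sound encoders: raw audio -> audio embeddings.
def encode_audio(n_frames):
    return rng.normal(size=(n_frames, d_audio))

# 3) A stand-in for the Q-Former: project a variable-length sequence of
#    audio embeddings into a fixed number of tokens in the LLM's space.
W_proj = rng.normal(size=(d_audio, d_llm))
def q_former(audio_embeddings):
    pooled = audio_embeddings[:n_query]  # toy pooling: keep n_query frames
    return pooled @ W_proj               # linear map into the LLM's space

# 4) Concatenate: the LLM decoder sees one unified sequence.
audio_tokens = q_former(encode_audio(n_frames=50))
text_tokens = embed_text([5, 17, 42])    # e.g. "Why am I shouting?"
llm_input = np.concatenate([audio_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (n_query + 3, d_llm) = (7, 8)
```

The key point is the last line: once the audio is projected into the LLM’s vector space, the decoder treats the whole thing as one ordinary token sequence.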

This seemingly simple activity is what we call audio-language co-reasoning, a remarkable new feature for AI that is far from trivial because:

  • The model has to detach different types of sounds in a blended audio frame.

Audio events are usually processed in spectrogram form because they don’t require modeling temporal correlations; speech, on the other hand, is sequential and does. Thus, they require different types of encoders.

  • The model needs to showcase cross-modal capabilities, understanding the context around the speech to provide an accurate answer

  • And the model also needs to embed audio and language into a common embedding space to be able to answer the question
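The spectrogram form mentioned above is just a short-time Fourier transform of the signal. A minimal NumPy sketch, where the window and hop sizes are arbitrary illustrative choices:

```python
import numpy as np

def magnitude_spectrogram(signal, window=256, hop=128):
    """Toy short-time Fourier transform: slice the signal into
    overlapping windows and take the FFT magnitude of each slice."""
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        chunk = signal[start:start + window] * np.hanning(window)
        frames.append(np.abs(np.fft.rfft(chunk)))
    return np.array(frames)  # shape: (n_frames, window // 2 + 1)

# A 440 Hz tone sampled at 16 kHz, 0.5 s long.
sr = 16_000
t = np.arange(int(0.5 * sr)) / sr
tone = np.sin(2 * np.pi * 440 * t)

spec = magnitude_spectrogram(tone)
print(spec.shape)  # (61, 129): 61 time frames x 129 frequency bins
```

Each row is a snapshot of which frequencies are active at that moment, which is why spectrograms work well for audio events: the “what” matters more than the precise order.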

Additionally, you can ask the model to create stories based on audio files, an absolute first in AI too.

ByteDance bites first

SALMONN is, to the best of our knowledge, the first MLLM for generic audio and language, pulling us closer to creating models that “hear” as we do.

And just like that, think about the potential applications:

  • better hearing-aid products,

  • more sensitive and alert smart home assistants that detect when someone falls to call emergency systems, or when your baby starts crying,

  • or educational products that help you improve your music playing by giving you cues on how to improve, and many more.

Anyone can agree that AI can get ‘fishy’ in terms of the negative impact of its misuse, but with models like SALMONN, it’s fishy in a good way.

Click here to access the GitHub repository and real examples.

🫡 Key contributions 🫡

  • ByteDance presents SALMONN, the first-ever MLLM that processes generic sounds and text

  • It’s capable of performing new, unprecedented tasks like audio storytelling or audio-speech co-reasoning

🔮 Practical implications 🔮

  • Improving hearing-aid products

  • Next generation of Smart Home systems

  • Healthcare monitoring to track speech from patients and identify emotional states, or even health conditions

👾 Best news of the week 👾

🎥 Runway’s new video model is literally mind-blowing

😰 Instagram might be building a customizable AI friend

🫢 Microsoft presents its ground-breaking Phi 1.5 model

🥇 Leaders 🥇

Why Biden’s AI Executive Order May Change The World Forever

Things just got real.

In case you were still wondering if countries were taking AI risks seriously, Joe Biden has presented the AI Executive Order, the first real effort to regulate AI globally.

But wait, wasn’t Europe much closer to passing an AI Act?

Yes, by a landslide; indeed, the US Congress was nowhere near reaching an agreement.

But then, how?

Simple: a Cold War law from the 1950s has been invoked.

The Defense Production Act, a piece of legislation that grants the POTUS the capacity to bypass Congress when there is a real risk to the national security of the United States, has been used to enforce this Executive Order.

In other words, in the eyes of the White House, AI now represents a real risk to national security at the level of nuclear, biological, or chemical weapons.

Yes, the seemingly harmless chatbot you talk to every day to speed up your essays could very well be the precursor to weapons of mass destruction.

As expected, this executive order has been met with a good amount of praise…and also its fair share of heavy criticism.

Today we are diving deep into what this AI Executive Order really means to you, what could be the intentions behind it, and why it can change the course of history.

The War on Foundation Models

The Executive Order’s entire argument regarding AI risks rests on the fact that everyone, including malicious actors, now has access to “dual-use foundation models”.

But what is a dual-use foundation model?

When generalization became a bad thing

The critical discovery underpinning the success of AI during the last year is that we have finally figured out how to break what I call the “1:1 model to task ratio”.

Historically, AI models have always been designed to do one task, and that’s all.

But when the seminal AI architecture known as the Transformer was released in 2017, researchers realized they could train models in a self-supervised manner at scale.

This meant that the supervisory signal, the way we tell a model whether it is predicting the task adequately, comes from the data itself; since we don’t have to label it manually, huge amounts of text could be used to train the models.
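This self-supervised setup is easy to see in code: the training targets come from the text itself, so no human labeling is needed. A minimal sketch:

```python
def next_word_pairs(text):
    """Build (context, target) training pairs from raw text.
    The labels are just the next word, so the data supervises itself."""
    words = text.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]

for context, target in next_word_pairs("the cat sat on the mat"):
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ['the', 'cat', 'sat'] -> on
# ...
```

Any text corpus, no matter how large, can be turned into training examples this way, which is what made training at scale feasible.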

Exposing such quantities of data to models allowed for the creation of the first general-purpose models: models that could generalize to multiple seen and unseen tasks.

A Foundation Model according to DALL-E3

This is what we define as “foundation models”, models that can be used as the ‘foundation’ to leverage AI for a myriad of tasks, ranging from asking it to give you ideas for an essay… to building domestic bombs.

This concept, known as generalization (the model learns to predict well on data not seen during training), was once AI’s holy grail; now it is also considered a huge problem.

FLOPs and large models, enemies of the state

The biggest surprise of all has been the decision that models trained above a certain number of floating-point operations (FLOPs) will be heavily scrutinized, meaning that companies training them must consistently report progress and, more importantly, invest heavily in red-teaming efforts.

Red teaming is an adversarial testing process where people try to induce the model into giving harmful responses like building bombs.

The number?

10^26 operations, or 100 trillion trillion operations.

That may seem like an insurmountable number, but it is far from it.

It is rumored that GPT-4 required around 20 trillion trillion operations, so it’s fair to say we are approaching that threshold.
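To put the threshold in perspective, a quick back-of-the-envelope check using the figures above (the GPT-4 number is the rumored estimate, not an official one):

```python
import math

THRESHOLD = 1e26      # 10**26 operations: "100 trillion trillion"
GPT4_RUMORED = 2e25   # the rumored GPT-4 training figure cited above

ratio = THRESHOLD / GPT4_RUMORED
print(f"threshold / GPT-4: {ratio:.0f}x")                            # 5x
print(f"doublings of compute to reach it: {math.log2(ratio):.1f}")   # 2.3
```

In other words, a model trained with roughly five times GPT-4’s rumored compute, just over two doublings away, would already cross the reporting line.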

The rationale behind this idea is purely based on scaling laws, the belief that as models get larger they will become more and more capable and, thus, represent an increasing risk to humanity.

But, are they really?

Firstly, there is no proof that this is the case; the only thing we know is that perplexity, the main metric for measuring an LLM’s capacity to model language (predict the next word in a sequence), has shown no signs of saturation, so the natural inclination of researchers is to make models bigger.

Perplexity measures how ‘sure’ a model is regarding the next word. In mathematical terms, it’s about maximizing the probability that the chosen word is the expected one and optimizing against that function.
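Concretely, perplexity is the exponential of the average negative log-probability the model assigned to each correct next word. A toy sketch with made-up probabilities:

```python
import math

def perplexity(probs):
    """probs: the probability the model gave to the true next word
    at each step. Lower perplexity = a more confident model."""
    nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(nll)

# Hypothetical per-step probabilities, for illustration only.
confident = [0.9, 0.8, 0.95, 0.85]
unsure = [0.2, 0.1, 0.3, 0.25]

print(perplexity(confident))  # ~1.15: nearly certain at every step
print(perplexity(unsure))     # ~5.08: as if choosing among ~5 words
```

A useful intuition: a perplexity of N means the model is, on average, as uncertain as if it were picking uniformly among N candidate words.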

And to make things worse, improving models demands exponential increases in computing: according to MIT, a simple error reduction from 5% to 3% on ImageNet requires two orders of magnitude more computation and the equivalent of New York City’s CO2 emissions in a month.

But the greatest fear that keeps AI doomsayers like Joe Biden up at night when thinking about making models bigger is what we describe as emergent capabilities.

One of the most fascinating mysteries of current frontier AI models, emergence has fueled the creation of the Frontier Model Forum between Google, Microsoft, OpenAI, and Anthropic, and represents one of the main reasons some of the brightest minds of our time, like Ilya Sutskever or Yoshua Bengio, have heavily shifted their research focus to AI as an existential threat to humanity.

Subscribe to Leaders to read the rest.
