AudioSep, separating audio on command
🏝 TheTechOasis 🏝
When the nuclear bombs were dropped in Japan, everybody knew the world had changed forever.
And after the Soviet Union rejected a proposal to halt the development of nuclear weapons, the atomic race officially began.
Eventually, after the disaster at Bikini Atoll, all leading participants signed a document stating that they would cease nuclear tests in all forms except one:
Subterranean nuclear detonations.
The reason was simple: at the time, a seismograph could not distinguish a subterranean nuclear detonation from an earthquake, which meant that every country continued nuclear testing anyway.
To solve this, one of the most important algorithms in history, if not the most important, was created: the Fast Fourier Transform (FFT).
This algorithm breaks the vibrations gathered by a seismograph down into their component frequencies, making it possible to tell earthquakes apart from nuclear detonations. Had this discovery come a few years earlier, it might have changed the course of history.
If you’re curious, this video by Veritasium explains it beautifully.
But why am I telling you all this?
Simple, because Fourier transforms are an essential component of AudioSep, the first foundation model for language-queried audio separation, or the model with the power to separate almost any sound by simply asking it.
It’s all frequencies
Have you ever wondered why the same note played on a piano and on a trumpet sounds different?
Even though you can clearly hear both playing the same note, a piano and a trumpet sound completely different to the ear.
The reason lies in what we call frequencies: both instruments produce the same fundamental frequency, but each adds its own blend of overtones at different strengths, and that blend is what gives each instrument its characteristic timbre.
Most sounds we hear in the world are complex. When you talk with a friend on a busy street, the sound of your friend's voice reaches you bundled with a car honking behind you or a random baby crying nearby.
Therefore, if we represent that audio clip as a waveform, it could look something like this:
However, this seemingly random wave can, in fact, be broken down into simple sine and cosine waves of different frequencies, amplitudes, and phases that, when summed, reproduce the original wave.
To obtain these component frequencies, you apply the Fourier transform. This transformation gives us huge insight into the audio, and we can even reject unwanted frequencies if needed, which is critical in fields like music production.
Consequently, for any noisy mixture of audio like the one we were discussing before, we can break the mixture into different audios:
The voice of your friend
The honking car
The crying baby
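To make this concrete, here is a minimal NumPy sketch (a toy example, not code from AudioSep): we mix two pure tones and let the FFT recover exactly which frequencies are hiding in the "mixture":

```python
import numpy as np

# Toy "mixture": two pure tones at 440 Hz and 880 Hz, sampled at 8 kHz.
sr = 8000                      # sample rate in Hz
t = np.arange(sr) / sr         # one second of samples
mixture = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

# The FFT turns the waveform into per-frequency amplitudes.
spectrum = np.fft.rfft(mixture)
freqs = np.fft.rfftfreq(len(mixture), d=1 / sr)

# The two strongest frequency bins sit exactly at the tones we mixed in.
top_bins = np.argsort(np.abs(spectrum))[-2:]
print(sorted(float(f) for f in freqs[top_bins]))  # → [440.0, 880.0]
```

The same idea scales from two tones to a street full of sounds: every source leaves its fingerprint in a different part of the spectrum.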
But why is this so interesting?
A plethora of use cases
Noise separation has incredibly useful applications, such as:
Hearing Aids and Assistive Devices: letting users focus on a single speaker amidst background noise
Forensic Audio Analysis: extracting relevant dialogue or sounds from noisy or overlapping recordings
Voice Assistants and Smart Home Devices: isolating a user's voice from ambient household noise
Call Centers and Telecommunications: cleaning up speech on noisy lines
And many more.
Now, you have a tool that will do this on command.
How does AudioSep work?
The AudioSep architecture, depicted below, comprises two components:
A text encoder
A separation model
The former takes the natural-language query (e.g., “separate the voice of people speaking from the barking dog”) and encodes it using CLIP’s text encoder.
The encoder transforms the text into a vector that captures the meaning of the underlying text.
However, the choice of OpenAI’s CLIP encoder was far from arbitrary, as CLIP is a model that aligns text and image embeddings in the same vector space.
In other words, it allows you to compare text and images that represent the same thing. For instance, encoding the text “a husky in Siberia” yields a vector very similar to the one obtained from encoding an image of a husky in Siberia.
In layman’s terms, to train the model to separate a dog barking from people speaking, they simply had to use a YouTube clip of people speaking alongside a barking dog, with the video’s actual frames serving as guidance thanks to CLIP’s joint image-text embedding space.
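At its core, comparing things in a joint embedding space boils down to cosine similarity between vectors. The toy sketch below uses made-up 4-dimensional vectors (real CLIP embeddings have hundreds of dimensions and come from the trained model) purely to show the mechanic:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: in a CLIP-like joint space, a text and an
# image describing the same scene land close together.
text_husky  = np.array([0.9, 0.1, 0.0, 0.4])   # "a husky in Siberia"
image_husky = np.array([0.8, 0.2, 0.1, 0.5])   # photo of a husky
image_beach = np.array([0.1, 0.9, 0.8, 0.0])   # photo of a beach

print(cosine(text_husky, image_husky))  # high: same concept
print(cosine(text_husky, image_beach))  # low: different concepts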
As for the latter, the separation model takes two inputs: the vector embedding of the text instruction and the spectrogram obtained by applying the Short-Time Fourier Transform (STFT) to the mixed audio we want to separate. It outputs two things:
A magnitude mask
A phase residual
The magnitude mask tells the model which frequencies should be emphasized and which suppressed, while the phase residual helps better align and separate overlapping audio sources.
In layman’s terms, the separation model tells AudioSep which frequencies it should keep, and at what amplitude and phase.
Finally, in the inverse of the process we discussed at the beginning, summing those frequencies back together with the inverse STFT yields the newly separated audio you asked for.
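The whole STFT → mask → inverse-STFT pipeline can be sketched with SciPy. One big simplification: here the mask is hand-made (a crude low-pass cut at 1 kHz), whereas AudioSep predicts its mask from the text query; the signals and cutoff are invented for illustration:

```python
import numpy as np
from scipy.signal import stft, istft

sr = 8000
t = np.arange(2 * sr) / sr
# Mixture of a low "voice-like" tone and a high "noise-like" tone.
low = np.sin(2 * np.pi * 200 * t)
high = np.sin(2 * np.pi * 2000 * t)
mixture = low + high

# 1) STFT: waveform -> time-frequency spectrogram.
f, frames, Z = stft(mixture, fs=sr, nperseg=256)

# 2) Magnitude mask: keep only bins below 1 kHz (a hand-made stand-in
#    for the mask AudioSep would predict from a text query).
mask = (f < 1000)[:, None].astype(float)
Z_masked = Z * mask

# 3) Inverse STFT: masked spectrogram -> separated waveform.
_, separated = istft(Z_masked, fs=sr, nperseg=256)

# The result is close to the low tone alone: small mean-squared error.
err = np.mean((separated[: len(low)] - low) ** 2)
print(err)
```

Conceptually, step 2 is where all of AudioSep's intelligence lives; steps 1 and 3 are the same fixed Fourier machinery in both directions.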
It sounds too good to be true, but the audio samples in this link will prove to you how real AudioSep is.
The Great Separator
In short, AudioSep is a tool that lets you do things like:
Musical instrument separation
Audio Event separation
And all this on command.
Quite probably, within a year many hearing-aid devices will include some technology similar to AudioSep, to the point that hearing-impaired users will be able to tell their hearing aids which sounds to focus on.
Now, how exciting is that?!
Key AI concepts to retain from today’s newsletter:
👾Top AI news for the week👾
🥸 The machine unlearning problem, teaching AI to forget
🧐 82% of Americans would slow down AI development, according to study