Moving Robots with Your Mind and the AI Data Opportunity

🏝 TheTechOasis 🏝

Breaking down the most advanced AI systems in the world to prepare you for your future.

5-minute weekly reads.

TLDR:

  • AI Research of the Week: NOIR, Moving Robots with Brain Signals

  • Leaders: Envisioning a Future of AI Empowerment and Economic Opportunity

🤯 AI Research of the week 🤯

At this point, my mom thinks some AI applications have become nothing short of magic.

And today we are going to talk about one of those cases, NOIR.

Developed by Stanford University, this is a model that allows humans to move robots with their minds, which is nothing short of extraordinary.

NOIR is a general-purpose model with the capacity to decode imagined motions through brain signals into interpretable robotic motions that execute the action imagined by the person.

Thanks to advancements like NOIR, one can envision a world where physically-impaired humans see their quality of life massively improved.

But first, how does it work?

A Tale of Two Components

To decode the brain signals, the researchers decided to use electroencephalography (EEG).

But EEG is hard to decode, especially considering complex behaviors like:

  1. Choosing an object visually

  2. Deciding how you’re going to pick it up

  3. Deciding what to do with it

  4. Performing the action

To fix this, NOIR consists of two components:

  • A modular goal decoder

  • A robot to perform actions

The modular decoder will break down the human goal into three parts: What object, How to interact with it, and Where to interact.

Focusing on the what, NOIR takes in SSVEP signals, a type of brain signal that matches the flickering frequency of certain stimuli.

Frequency describes the number of times something happens per second and is measured in hertz (Hz). If I can jump three times per second, my jumping frequency is 3 Hz.

In layman’s terms, if you watch an object flickering at a certain frequency, your brain will emit SSVEP signals at that exact frequency.

Consequently, if you have objects emitting light at different frequencies, the one with the highest correlation to the frequency of the SSVEP signal will be the object being observed by the human.
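To make the matching idea concrete, here is a minimal sketch on synthetic data. It is not NOIR's decoder: real SSVEP pipelines usually work on several EEG channels with methods such as canonical correlation analysis, while this sketch swaps in a simpler single-channel least-squares fit against sine and cosine references. The function name, the 250 Hz sample rate, and the candidate frequencies are all assumptions for illustration.

```python
import numpy as np

def detect_flicker_frequency(eeg, sample_rate, candidate_freqs):
    """Return the candidate frequency whose sine/cosine pair best explains the signal."""
    t = np.arange(len(eeg)) / sample_rate
    eeg = eeg - eeg.mean()
    scores = {}
    for freq in candidate_freqs:
        # Reference signals at the candidate flicker frequency.
        refs = np.column_stack([np.sin(2 * np.pi * freq * t),
                                np.cos(2 * np.pi * freq * t)])
        # Least-squares fit of the recording onto the references.
        coeffs, *_ = np.linalg.lstsq(refs, eeg, rcond=None)
        fitted = refs @ coeffs
        # Score: correlation between the fitted sinusoid and the recording.
        scores[freq] = np.corrcoef(fitted, eeg)[0, 1]
    return max(scores, key=scores.get), scores

# Synthetic single-channel "EEG": a 10 Hz response buried in noise (assumed values).
rng = np.random.default_rng(0)
sample_rate = 250                                  # Hz
t = np.arange(2 * sample_rate) / sample_rate       # 2 seconds of samples
eeg = np.sin(2 * np.pi * 10 * t + 0.3) + rng.normal(0, 1.0, t.size)

# Each object on screen would flicker at one of these frequencies.
best, scores = detect_flicker_frequency(eeg, sample_rate, [6.0, 8.0, 10.0, 12.0])
print("Most likely flicker frequency:", best)
```

The object whose flicker frequency gets the highest score is taken as the one the person is looking at.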

Traditionally, researchers incorporated LED lights into objects to make them flicker, but here the Stanford researchers leverage Meta’s DINO computer vision foundation model to simulate this.

When observing a certain scene, DINO segments all objects appearing in the image.

Then, the researchers simply add a virtual flickering mask (every mask at a slightly different flicker frequency to be able to discriminate) to every segmented object.

This way, we know what object the humans are looking at.

Object segmentation with SAM. Source: Meta

Next, to understand what the human wants to do with that object, the how and the where… things get a little bit trickier.

From Imagination to Motion

When we humans think about performing a movement, this thought generates brain signals too. This is what we describe as Motor Imagery (MI).

The challenge here is that we want to be able to see a certain signal and decode (interpret) what precise motion that signal is describing.

However, this is far from easy.

Even for something as simple as a human thinking about “moving the left hand” versus the right hand, telling the two apart from the raw MI signals is not trivial.

In statistical terms, to make the signal interpretable, we want to find components of the signal whose variance is as large as possible for one imagined motion and as small as possible for the other possible motions.

In layman’s terms, for NOIR to work properly, we need to be able to discriminate between signals so that each signal is assigned to its correct motion.

For this, researchers applied Common Spatial Pattern (CSP), a technique that identifies the components of a signal that express the most variance toward a certain movement.

I know that made no sense, so let’s see an example: Let’s say we analyze the brain signals of a person thinking about two movements, moving the left hand (red), and then moving the right one (blue).

You are registering these signals along two brain channels at distinct frequencies, each one represented by an axis of the graph.

Brains emit signals at different channels and frequencies. Here, we band-pass between 8-30 Hz, where MI signals tend to happen.

As you can infer from the above image, the channel plotted on the y-axis shows great variance regarding left-hand movement (red points), with values ranging from 10 to -7 or so.

However, it shows less variance regarding blue points (imagined right-hand movement) with almost all values ranging between -2 and +2.

In statistical analysis, when a feature shows variance toward a certain outcome, we deduce that that feature affects that outcome.

In this case, if a channel shows great variance toward a specific movement, we can infer that when such signals appear, the human might be thinking about that movement.

Long story short, we can infer some information from this graph, but would you be comfortable enough to say that the channel on the y-axis would always refer to left-hand imagined movements (red points)?

Probably not. Thus, you need a clear way to discriminate.

Now, watch the image below with CSP filtering applied:

Now it is a hell of a lot easier, as left-hand movement (red) is clearly influenced by the signals from the x-axis, and the opposite happens with the right-hand motion.
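For the curious, the before/after effect can be reproduced with a small two-class CSP sketch in NumPy. This is the textbook formulation (whiten the combined covariance, then take the eigenvectors of one class), not necessarily the exact implementation in the paper. The synthetic "left" and "right" trials, the mixing matrix, and all function names are assumptions made up for the example.

```python
import numpy as np

def csp_filters(trials_a, trials_b):
    """Compute CSP spatial filters from two lists of (channels x samples) trials."""
    def mean_cov(trials):
        # Average the trace-normalised covariance over all trials of a class.
        covs = [x @ x.T / np.trace(x @ x.T) for x in trials]
        return np.mean(covs, axis=0)

    cov_a, cov_b = mean_cov(trials_a), mean_cov(trials_b)

    # Whiten the composite covariance so that cov_a + cov_b becomes the identity.
    evals, evecs = np.linalg.eigh(cov_a + cov_b)
    whitener = np.diag(evals ** -0.5) @ evecs.T

    # In the whitened space both classes share eigenvectors,
    # with eigenvalues lam (class A) and 1 - lam (class B).
    lam, rot = np.linalg.eigh(whitener @ cov_a @ whitener.T)
    order = np.argsort(lam)[::-1]            # largest class-A variance first
    filters = rot[:, order].T @ whitener     # each row is one spatial filter
    return filters, lam[order]

# Synthetic two-channel data: two hidden sources mixed into both channels.
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.8],
                   [0.6, 1.0]])

def make_trials(source_std, n_trials=20, n_samples=200):
    scale = np.array(source_std)[:, None]
    return [mixing @ (rng.normal(size=(2, n_samples)) * scale) for _ in range(n_trials)]

left = make_trials([3.0, 1.0])    # "left hand": first source is strong
right = make_trials([1.0, 3.0])   # "right hand": second source is strong

filters, lam = csp_filters(left, right)

def mean_variance(trials, w):
    return float(np.mean([np.var(w @ x) for x in trials]))

for name, w in (("first filter", filters[0]), ("last filter", filters[-1])):
    print(name, "left:", mean_variance(left, w), "right:", mean_variance(right, w))
```

After filtering, the first component varies a lot for "left" trials and little for "right" trials, and the last component does the opposite, which is the clean separation the second image shows.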

Next, using other statistical techniques I won’t get into, researchers extract the most important features from each signal that explain every movement.

If you care to know more about CSP, check this.

So, then, what’s the end-to-end process?

  1. The user views a certain environment and decides what they want to do, like “pick up a bowl and pour the water into the pan”

  2. While the human focuses on the bowl, NOIR uses DINO to segment the video frame and match the SSVEP signal to the frequency of the bowl’s mask

  3. Next, the user imagines the movement: move the right arm and use the hand to pick a bowl

  4. NOIR processes the MI signals generated from the user imagining that movement and determines what movement the user is thinking about.

  5. Then, it calls the required primitive, such as the Pick(x,y) or Move(x,y) function, which tells the robot what movement it needs to perform.
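The last step of the pipeline above can be sketched as a simple lookup from a decoded goal to a primitive. Everything here is hypothetical: the DecodedGoal class, the pick and move stubs, and the coordinates are invented for illustration, and the real primitives would drive a robot arm rather than return strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass
class DecodedGoal:
    obj: str                         # "what": object chosen via SSVEP
    skill: str                       # "how": skill chosen via motor imagery
    location: Tuple[float, float]    # "where": target coordinates

# Stub primitives that only describe the command they would send.
def pick(x: float, y: float) -> str:
    return f"Pick({x}, {y})"

def move(x: float, y: float) -> str:
    return f"Move({x}, {y})"

PRIMITIVES: Dict[str, Callable[[float, float], str]] = {"pick": pick, "move": move}

def to_robot_command(goal: DecodedGoal) -> str:
    """Look up the primitive for the decoded skill and parameterise it."""
    primitive = PRIMITIVES[goal.skill]
    return primitive(*goal.location)

goal = DecodedGoal(obj="bowl", skill="pick", location=(0.4, 0.2))
command = to_robot_command(goal)
print(command)
```

Keeping the primitives in a dictionary means a new skill only needs one new entry, which fits the "modular" framing of the decoder.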

Following this pipeline, NOIR is capable of performing various tasks, up to 20, becoming the first Brain-Robot Interface that is general-purpose and, who knows, the future of robotics.

But NOIR had one extra trick up its sleeve: adaptive behavior.

Leading by example

Even though NOIR’s capabilities are amazing, it takes time to make decisions, especially on the MI part.

Here, the researchers propose a method that allows the model to take previous experiences and actions to better interpret the current one.

Using image encoders, the model processes the scene into an image embedding, a vector representation of that image. The embedding is placed in a ‘feature space’ where similar images have similar vectors, alongside the embeddings of past experiences.

These past experiences also include the label of the object and movement previously executed.

For instance, if the image the model sees is a table with salt and a steak, the model processes the image, retrieves similar ones, and suggests that, based on previous similar situations, the outcome was picking up the salt and pouring it on the steak.
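A minimal sketch of that retrieval step, assuming the scene has already been turned into an embedding: compare it with stored embeddings using cosine similarity and suggest the label of the closest one. The three-dimensional vectors and the labels are toy values, and retrieve_suggestion is a hypothetical helper, not a function from the paper.

```python
import numpy as np

def retrieve_suggestion(query, memory_embeddings, memory_labels):
    """Return the label of the stored experience most similar to the query."""
    q = query / np.linalg.norm(query)
    m = memory_embeddings / np.linalg.norm(memory_embeddings, axis=1, keepdims=True)
    similarities = m @ q                      # cosine similarity to every memory
    best = int(np.argmax(similarities))
    return memory_labels[best], float(similarities[best])

# Toy memory: one embedding per past scene, labelled with (object, action).
memory_embeddings = np.array([[0.9, 0.1, 0.0],
                              [0.0, 1.0, 0.2],
                              [0.1, 0.0, 1.0]])
memory_labels = [("salt", "pick and pour"), ("bowl", "pick"), ("drawer", "pull")]

# Embedding of the current scene (made up so that it resembles the first memory).
query = np.array([0.8, 0.2, 0.1])
suggestion, similarity = retrieve_suggestion(query, memory_embeddings, memory_labels)
print(suggestion, round(similarity, 3))
```

The suggestion only narrows down the options; in the workflow described above, the user's brain signals still have to confirm or override it.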

The results? Over 60% decrease in decoding time and a more adaptive behavior towards the user.

Undeniably, NOIR pulls us much closer to improving the lives of people, like quadriplegics, who desperately need to gain back control over their lives.

Read the full paper here, and video examples here

🫡 Key contributions 🫡

  • NOIR is the first general-purpose brain-robot interface that takes in brain signals and decodes them into robot movement

  • It also includes an adaptive behavior module that allows NOIR to learn from its user

🔮 Practical implications 🔮

  • 50 out of 100k humans are completely paralyzed. Solutions like NOIR could make their lives much better by allowing them to move objects around… or even their own body

  • Improved BRI decoders could help humans perform dangerous tasks without actually putting their bodies on the line

👾 Best news of the week 👾

🤨 As Sam Altman was fired, a coup by OpenAI could have him back

🫥 This is how far we are from AGI, according to Google

🥇 Leaders 🥇

Envisioning a Future of AI Empowerment and Economic Opportunity

Imagine a world where your digital footprint transcends its traditional spectator role, morphing into a valuable asset in a flourishing data economy.

At the forefront of this evolution are Foundation Models (FMs), AI systems that are becoming as ubiquitous as they are influential.

But what does this mean for you, the individual at the heart of this digital transformation?

Today, we are exploring how AI will create exciting new opportunities and economies centered on its most precious need, data, and how this imminent AI revolution could redefine your role in the digital world.

It’s All About Data

If we reflect on the state of today’s world, a lot has changed in just a year.

The birth of ChatGPT finally kick-started the age of AI, and nothing has stopped since.

But what would ChatGPT have been without data? Yeah, nothing.

All Models Learn the Same Way

Out of 10 models that are released today, nine are deep learning models. With neural networks, humans relinquish control over how the algorithm learns, as we simply do three things:

  • Define the architecture, although it’s fairly standardized these days with the Transformer

  • Define the task the model needs to learn to perform

  • Define the objective loss function so that the model learns how to perform that task

In fact, humans at this point are no longer even curating the data and telling the model what is wrong or right, as Transformer-based architectures like ChatGPT learn in a self-supervised manner.

In self-supervised learning, the right or wrong answer is automatically provided by the data.
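Here is a tiny sketch of what "provided by the data" means for a next-word prediction task. The helper name and the example sentence are made up, and real models split text into tokens rather than whole words.

```python
def next_word_examples(text):
    """Turn raw text into (context, next word) training pairs, with no human labels."""
    words = text.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]

examples = next_word_examples("the robot picks up the bowl")
for context, target in examples:
    print(context, "->", target)
```

Every sentence produces its own questions and answers, which is why nobody needs to label anything by hand.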

Taking ChatGPT as an example, the task is to predict the next word in a sequence, and the objective loss function is cross-entropy loss.

In cross-entropy, essentially, if the model assigns a high probability to the actual next word, the loss is low. Conversely, if the model assigns a low probability to the actual next word, the loss is high.
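As a quick worked example, for a single prediction the cross-entropy loss is the negative logarithm of the probability the model gave to the actual next word. The two probability tables below are invented for illustration.

```python
import math

def cross_entropy(predicted_probs, actual_next_word):
    """Loss for one prediction: -log of the probability given to the true word."""
    return -math.log(predicted_probs[actual_next_word])

confident = {"bowl": 0.9, "pan": 0.05, "table": 0.05}
unsure = {"bowl": 0.1, "pan": 0.6, "table": 0.3}

low_loss = cross_entropy(confident, "bowl")    # about 0.105
high_loss = cross_entropy(unsure, "bowl")      # about 2.303
print(round(low_loss, 3), round(high_loss, 3))
```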

As we can mathematically measure this loss, we simply tune the parameters—called weights—of the model so that it gets progressively better at predicting.

As you may realize, it’s a trial-and-error learning process, and this is the principle governing the learning of all frontier models today.
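The tuning loop can be sketched with a toy "model" that has just one weight per word in a three-word vocabulary. This is plain gradient descent on the cross-entropy loss under assumed values (learning rate, vocabulary, number of steps); real models have billions of weights and use more elaborate optimizers.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

vocab = ["bowl", "pan", "table"]
target = vocab.index("bowl")          # the actual next word
weights = np.zeros(len(vocab))        # toy model: one score per word
learning_rate = 0.5
losses = []

for step in range(20):
    probs = softmax(weights)
    losses.append(float(-np.log(probs[target])))   # cross-entropy for this guess
    # Gradient of the loss with respect to the scores: probs minus the one-hot target.
    grad = probs.copy()
    grad[target] -= 1.0
    weights -= learning_rate * grad                # nudge weights to lower the loss

print("first loss:", round(losses[0], 3), "last loss:", round(losses[-1], 3))
```

Each pass is one "trial", the loss measures the "error", and the update is the correction.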

But for all of this to work, the essential component is data.

There’s no building without a ground to stand on

You may wonder, is this trial and error procedure efficient?

Far from it. Neural networks, despite how good models like ChatGPT have become, are terrible learners.

To create ChatGPT, clusters of thousands of GPUs were kept running for months, teaching a model with billions of parameters to confidently predict the next word in a sequence.

As Yann LeCun, Chief Scientist @ Meta, always explains, “While a 17-year-old will take around 20 hours to learn to drive, autonomous driving vehicles, also governed by neural networks, take months to learn to do the same stuff.”

This means that the amount of data required to build better models is huge… and growing.

The Great Data Problem

In a famous paper from last year, a group of researchers claimed we would run out of data to train on by 2026.

But seeing current developments, where some models are already managing datasets in the trillions of words, this problem might be arriving sooner than we thought.

As a matter of fact, and as hinted by OpenAI, AI research labs are desperately seeking new ways to create data.

And this issue is exacerbated by the quality of the data.

Now, a great chunk of newly created data is AI-generated, meaning that to train new models a great part of their datasets would be generated by older versions of these models, which of course generates huge concerns in terms of data quality.

In fact, catastrophic forgetting, where new updates to a model cause it to forget critical data previously learned, is already becoming a problem. One of the main causes is training AI models on AI-generated content from the Internet, which tends to be widely available but not very trustworthy.

The problem is so serious that some researchers claim that data generated from 2024 onward will be of ‘no use’ due to its very poor quality.

Therefore, data will become to technology development what white truffles are to cuisine or diamonds to jewelry: a very, very scarce but valuable resource.

But why is this so important?

Foundation Models, the New Standard

If we think we’ve seen it all with FMs, I’m here to tell you we are just getting started.

In a way, we humans are just seeing the tip of the iceberg, the dawn of general-purpose AI, with the end game being described as Artificial General Intelligence, or AGI.

An unstoppable force

Up until the birth of foundation models, what we called AI was in reality Narrow AI, models that excelled at one or very few tasks in a convincingly-superior manner compared to humans.

But whereas humans can both perform a derivative calculation and make their beds, AI could do one thing or the other, not both, when the tasks were very different.

With FMs, we are already seeing models that can perform hundreds, if not thousands, of different tasks.

In other words, AI is now just starting to become a general-purpose technology.

And why is this so important?

Well, because soon most of our interactions with machines will be declarative, or “I simply tell you what I want but I don’t have to tell you how to do it.”

And many more.

Long story short, we are moving to a declarative world where humans will get what they want from computers by simple communication.

The appeal of this vision is too tempting to not become a reality, and for this, we will need humongous amounts of data we currently don’t have.

But the importance of data will not only be viewed through the lens of quality to build better language or vision models, as governments and unions are already hinting at the need to make it privacy-preserving.

In fact, the precedent has already been set.

I Decide the Fate of my Data

One of the most important lessons of the recent Hollywood actors and actresses’ strike was how important it will be to ensure that personal data is used with the consent of the creator/owner.

Among the many agreements, we can highlight:

  • Studios must compensate an actor if performances are used to train a model. 

  • Studios must secure an actor’s consent before using a synthetic likeness or performance, regardless of whether the replica was made by scanning the actor or extracting information from existing footage.

  • The actor has the right to refuse. If the actor consents, studios must compensate the actor for the days they would have worked, if they had performed in person. 

For the first time, people are going to get paid for the use of their faces and performances when training AI models.

Put simply, all pieces in the puzzle are finally set for you to start getting paid for your data.

And all this will be thanks to new AI jobs that will soon present a huge opportunity for all of us, and the technological enabler will be the fascinating concept of zero-knowledge proofs, one of the most elegant yet uber-powerful technologies ever created.
