Bringing Images Back to Life, by Google

🏝 TheTechOasis 🏝

Breaking down the most advanced AI systems in the world to prepare you for your future.

5-minute weekly reads.

🤯 AI Research of the week 🤯

There are times when AI feels like magic.

And this is one of those times.

The gif you’re seeing below was actually an image that was brought “back to life” by Google Research.

And that dragging movement you see? That’s me dragging it, since you can try the model for free and induce whatever motion you desire.

In short, Google has presented a new method that predicts the movement of oscillating objects from still images, turning them into videos of that object moving.

Welcome to 2023, when AI meets magic.

The Universal Approximator

When humans try to predict movements around us, we leverage complex physics laws, mathematical equations, and so on.

But do we really need to provide those laws to AI to help it learn to predict them?

For one thing, that would be unfeasible: natural oscillations involve too many variables to model explicitly, such as wind, water currents, respiration, and other natural rhythms.

Luckily, as neural networks are universal function approximators, they do not need to be taught those laws and principles directly; they can induce them indirectly through observation, just like a human would.

In layman’s terms, you can teach a model to predict oscillations by simply looking at them.

Therefore, this model wasn’t trained with the laws of physics and mathematics, but by simply observing videos.

That’s what makes Deep Learning so unique and powerful: given the right data, it can learn almost anything.

The importance of frequency

Next, another insightful tweak the researchers made to train the model was to do so in the frequency domain.

Frequency measures the number of times an event occurs per second. Why train there? Because working in the frequency domain uncovers “hidden” structure in the data.

As a fun fact, the Fast Fourier Transform used in this study was published back in 1965, partly to detect subterranean nuclear bomb detonations, which were impossible to separate from earthquakes in the time domain.

In other words, instead of the model predicting the movement of the object in the time domain (in second 3 the object will be here, in second 6 here) the model did so in the frequency domain (the movement of the object at 0.2Hz is this, at 0.6Hz this).
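To make the two representations concrete, here is a minimal NumPy sketch (my own toy example, not the paper’s code): the same oscillation described sample-by-sample in the time domain, and as a handful of dominant coefficients in the frequency domain.

```python
import numpy as np

# A toy 1-D oscillation: 4 seconds sampled at 50 Hz,
# mixing a 0.5 Hz and a 2.0 Hz component.
t = np.linspace(0, 4, 200, endpoint=False)
signal = np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.sin(2 * np.pi * 2.0 * t)

# Time domain: 200 samples, one value per time step.
# Frequency domain: one complex coefficient per frequency bin.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=t[1] - t[0])

# The two strongest bins recover exactly the frequencies we mixed in.
top = freqs[np.argsort(np.abs(spectrum))[-2:]]
print(sorted(top.tolist()))  # the 0.5 Hz and 2.0 Hz components dominate
```

The 200 time-domain samples collapse into just two meaningful frequency bins, which is the “hidden data” the frequency domain exposes.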

But why?

If we want to bring an image back to life with a moving object, we need to predict the position of that object at consecutive time frames, generate a new image for every time step, and concatenate them, thus creating a video where the object moves.

That means that, in the time domain, for every pixel in the first image, we need to predict 2T values, ‘T’ being the number of new frames we are generating (a lot if we want the video to be smooth) and the factor of 2 coming from needing both ‘x’ and ‘y’ coordinates for every pixel.

For instance, for a common 1920 x 1080 Full HD resolution (not even close to high-resolution images), that’s 2,073,600 pixels.

If we generate 100 frames and multiply by 2, we need to predict 414,720,000 new positions, or almost 415 million.

Unfeasible.
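The arithmetic behind that number can be checked in a few lines (using the resolution and frame count from above):

```python
# Back-of-envelope count from the text: per-pixel (x, y) displacements
# for every generated frame of a Full HD image.
width, height = 1920, 1080
frames = 100
coords_per_pixel = 2          # x and y

pixels = width * height                          # 2,073,600 pixels
predictions = pixels * frames * coords_per_pixel
print(f"{predictions:,} positions to predict")   # 414,720,000
```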

However, as this study proved 15 years ago, oscillations seen in nature are primarily composed of low-frequency components.

In other words, we just need to predict movements at low frequencies, meaning that the number of variables to predict falls dramatically because we can ignore high-frequency motion components.

Source: Google

As you can see above, the primary frequencies for the motion in terms of amplitude are those between 0 and 3 Hz, meaning that the overall motion is heavily dependent on those.

You can think of this as a Pareto principle, where a small subset of variables has the highest impact.
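This Pareto-like dominance of low frequencies can be sketched with a toy trajectory (my own illustrative example; the number of coefficients kept, K, is an assumption, not the paper’s exact value): keeping only a few low-frequency Fourier coefficients reconstructs the motion almost perfectly.

```python
import numpy as np

T = 150   # frames we want to animate
K = 16    # low-frequency coefficients we keep (illustrative choice)

# A toy per-pixel motion trajectory: x-displacement over time,
# built from slow oscillations, like most natural motion.
t = np.arange(T)
traj = 2.0 * np.sin(2 * np.pi * t / 75) + 0.5 * np.sin(2 * np.pi * t / 30)

# Keep only the K lowest-frequency coefficients and reconstruct.
spec = np.fft.rfft(traj)       # 76 complex coefficients in total
spec[K:] = 0                   # discard all high-frequency components
approx = np.fft.irfft(spec, n=T)

# For a low-frequency motion, K coefficients are essentially lossless.
err = np.abs(traj - approx).max()
print(f"{K} coefficients instead of {T} samples, max error {err:.2e}")
```

Instead of T values per coordinate per pixel, the model only has to predict K frequency coefficients, which is where the dramatic reduction in variables comes from.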

So what does the model look like?

From Image to Video

As with any AI video model, the basis is always an image.

The model, for a given image, predicts the future positions of all pixels in the image across time (although modeled in the frequency domain, remember).

Once we have the positions of all pixels in each timestep, we feed this information to an image generation model that synthesizes each frame, creating the video.

Motion-prediction module

It looks very complicated, but the process is actually very similar to how diffusion models for image generation work, with a Variational AutoEncoder (VAE) and a denoising module (in fact, they use Stable Diffusion here).

But instead of generating an image that portrays what the person requested in the text, here the model predicts a series of motions across several frequencies.

Put simply, instead of generating a new image, the diffusion model predicts the positions of pixels in an image in future time steps, thus predicting the oscillation of the object.

For a more technical explanation of the architecture, the process is similar to what MAEs do, explained in last week’s newsletter. Put simply, with VAEs the model learns by reconstructing the original image.

Then, these ‘S’ motion textures expressed in the frequency domain are ‘reversed’ back into the time domain (pixel ‘p’ is at position (x, y) at time ‘t’).
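That ‘reversal’ is just an inverse Fourier transform applied per pixel. A minimal sketch with hypothetical shapes (H, W, K, and T below are placeholders of my choosing, not the paper’s values):

```python
import numpy as np

# Hypothetical setup: K complex motion coefficients per pixel of an
# H x W image, for both x and y displacements, expanded into T frames.
H, W, K, T = 4, 4, 16, 150
rng = np.random.default_rng(0)
spectrum = (rng.standard_normal((H, W, 2, K))
            + 1j * rng.standard_normal((H, W, 2, K)))

# Zero-pad up to the full rfft length for T frames, then invert.
full = np.zeros((H, W, 2, T // 2 + 1), dtype=complex)
full[..., :K] = spectrum
displacements = np.fft.irfft(full, n=T, axis=-1)   # shape (H, W, 2, T)

# displacements[i, j, :, t] is the (x, y) offset of pixel (i, j) at frame t.
print(displacements.shape)
```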

Finally, we use these motions to create ‘future’ frames based on them:

Image-Rendering module

There’s a catch here. Stable Diffusion is a latent diffusion model, meaning that all transformations occur by diffusing the latent vectors, not the original image.

In other words, the new positions of pixels are not applied to the actual image, but to the latent representation of that image. The image is first encoded into its latent vector, the motions are added to this vector, and the newly warped vector is then decoded into the new, animated frame.
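A toy sketch of that encode-warp-decode loop, with trivial stand-ins for the encoder and decoder (the real model uses Stable Diffusion’s VAE and learned, per-pixel warping; everything below is illustrative):

```python
import numpy as np

# Trivial stand-ins for a latent diffusion model's encoder/decoder
# (the real VAE is a deep network; these just downsample/upsample 8x).
def encode(image):
    return image[::8, ::8]

def decode(latent):
    return np.repeat(np.repeat(latent, 8, axis=0), 8, axis=1)

def warp(latent, flow):
    # Shift the latent grid by an integer (dy, dx) offset, as a crude
    # stand-in for warping with the predicted per-pixel motion.
    dy, dx = flow
    return np.roll(latent, shift=(dy, dx), axis=(0, 1))

image = np.random.default_rng(0).random((64, 64))

# Motion is applied in latent space, never to the pixels directly.
latent = encode(image)                    # 64x64 image -> 8x8 latent
new_frame = decode(warp(latent, (1, 0)))  # warped latent -> 64x64 frame
print(new_frame.shape)
```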

Some Diffusion models like DeepFloyd do work in the pixel space, but they are less common.

Taking giant leaps

Google once again shows us that the potential of AI is boundless.

Bringing an image back to life was nothing short of ‘impossible’ a few years ago, so how exciting is it to think about how many new things are going to be made available to us in the next few years?

🫡 Key contributions 🫡

  • Google takes a natural image and turns it into a video where the most salient objects move in AI-predicted patterns

  • The model works in the frequency domain, reducing the number of variables to model

  • It proves neural networks’ unique capacity to approximate any motion without being taught the laws of physics, inducing them indirectly instead

🔮 Practical implications 🔮

  • Interactive dynamics. Create interactive videos that allow the user to move the image on command

  • Looping videos. Take natural images from the Internet and turn them into looping videos, ideal for marketing and personalized content

👾 Best news of the week 👾

😍 Introducing Copilot for Windows 11, a magical experience

🥇 Leaders 🥇

This week’s issue:

Copilots are Here: The Biggest Transformation in History for White Collar Jobs

When talking about AI today, the first thing that comes to mind is ChatGPT and the myriad of chatbots that are becoming the norm in our lives these days.

But even as impressive as chatbots like ChatGPT are, their impact on our lives becomes a mere anecdote if we compare it with what Copilots will cause.

And Copilots are here. Finally.

This week we are embarking on a journey to decipher what Copilots are, how they look on the inside, the canonical principles of how they work, and why you should be going crazy for them.

Additionally, we will envision the future of these already-futuristic solutions as they take the form of AI companions, what AI was meant to be all along.

If you understand Copilots, you will be much better prepared to understand how the lives of millions of people are about to change, including yours, in mere months.

ChatGPT is just the tip of the iceberg

With Generative AI (GenAI), the world was presented with the first truly functional series of foundation models, the technological achievement that underpins last year’s Cambrian explosion of AI solutions.

Supported by the universal adoption of Transformers as the ‘de facto’ architecture, humans figured out a way to train AI using billions upon trillions of data points (text in most cases).

This led to the creation of models that had ‘seen it all’ and, thus, were capable of performing multiple tasks, even those not previously trained for.

Consequently, a new breed of AI systems emerged: general-purpose models that, much like humans, perform brilliantly across a multitude of tasks.

This innovation took its ultimate form with ChatGPT, the first time experts and laypersons alike were baffled by an AI system.

But it was only the beginning.

The Great Job Displacement

Suddenly, AI, until then used mainly for analytics, became a general-purpose technology that could support human users in multiple text-based tasks: a veritable productivity machine.

Of course, this has scared many, including some you would least expect.

What do you think the richest man on the planet, Elon Musk, and the CEO of OpenAI, Sam Altman, have in common?

Besides co-founding OpenAI, they also share a common view regarding one of the most communist-looking proposals ever: a universal basic income to combat the huge job displacement that AI will cause.

Wait, what? Getting paid no matter what for doing basically nothing?

Exactly.

But why?

If we analyze history, they both should be wrong. As the graph below and Harvard research suggest, technology doesn’t destroy jobs; it transforms them.

Net jobs aren’t destroyed; people move into newly-created jobs adjusted to the changing times.

Source: US Bureau of Economic Analysis

But many fear that this time could be different.

The reason is simple: velocity.

The velocity at which AI could displace people from their jobs could outpace the speed at which new jobs are created.

And, as predicted by McKinsey and OpenAI, the outlook isn’t good.

The former estimates that 13 million people, in the US alone, will have to switch jobs by 2030.

The latter shows that 80% of currently existing jobs are somehow (directly, indirectly, or totally) exposed to general-purpose technologies (GPTs), aka foundation models. It also portrays a rude awakening for white-collar workers, who are, for the first time in history, more exposed to this technological shift than blue-collar workers.

And that statement was probably directed at you, my friend.

Source: OpenAI

They asked both humans and GPT-4 about the exposure of jobs to GPTs (general-purpose technologies) and displayed their estimations by annual wage.

Elon and Sam’s proposal doesn’t seem that far-fetched now, huh?

However, the current state of AI is completely unprepared to fulfill these predictions.

Put simply, this huge job displacement won’t be driven by ChatGPT.

Because ChatGPT lacks two capabilities, and without them it cannot go beyond being a mere knowledge-gathering/assistant tool:

  • It lacks the capacity to take action

  • More importantly, it lacks the capacity to plan and iterate

These two issues alone have a considerable impact on the real utility of GenAI.

In fact, despite the huge hype AI is receiving this year, with ChatGPT being the fastest-growing software product in history in terms of users, the usage metrics for AI products are actually horrible.

According to Sequoia Capital, one of the leading venture capital firms in the world, the one-month retention of GenAI products isn’t great.

Actually, far from it.

Only ChatGPT manages to surpass 50%, and even then it falls drastically behind social media products.

But things get worse if we look at the DAU/MAU ratio, the ratio of monthly active users that use the product daily.

Yes, we must take this with a pinch of salt, as this only shows mobile app usage (I personally think ChatGPT is mainly used at the desktop level), but the ratio is still dismal for ChatGPT, at a rickety 14%.

You may have realized that I am using the words GenAI, foundation model, and general-purpose technologies interchangeably. You will see this happen pretty much everywhere in AI literature.

When people refer to ‘foundation models’, they are almost always referring to LLMs, which in turn implies Generative AI technology as LLMs are sequence-to-sequence models, meaning they are meant to ‘generate’ something by design.

We do have foundation models in other fields like Computer Vision (models based on images), such as SAM or DINO, but those are mostly seen in academia and in some very specific enterprise cases. At least today.

Amazingly, this utility gap that ChatGPT has is easily bridged with Copilots.

And these solutions aren’t a thing of the future, they are here, as Microsoft is launching its Office 365 Copilot as soon as next week.

We will delve into much detail now, but at a glimpse…

Copilots have the power to change the way knowledge workers have worked during the last decades.

They are going to change our daily routines. They are going to influence your decision-making.

Ultimately, they are going to change you.

So, what really is an AI Copilot then?

Subscribe to Leaders to read the rest.

Become a paying subscriber of Leaders to get access to this post and other subscriber-only content.

Already a paying subscriber? Sign In

A subscription gets you:
High-signal deep-dives into the most advanced AI in the world in easy-to-understand language
Additional insights to other cutting-edge research you should be paying attention to
Curiosity-inducing facts and reflections to make you the most interesting person in the room