
Genie, converting images into videogames & the State of Hardware


🏝 TheTechOasis 🏝

Breaking down the most advanced AI systems in the world to prepare you for the future.

5-minute weekly reads.

TLDR:

  • AI Research of the Week: Genie, Turning Images into Videogames

  • Leaders: Understanding NVIDIA’s Boom & The State of Hardware

🤩 AI Research of the week 🤩

Amid the ongoing controversy around Gemini, its misaligned LLM, Google continues to innovate in other areas of AI.

Now, it has created an entirely new modality by announcing Genie, the first-ever model to create action-controllable environments.

Put simply, this could be the precursor to bespoke AI-generated videogames created from a single provided image.

But it can be much more, as it has been framed as a potentially viable path to generalist agents, the milestone all AI labs in the world seem to be working toward.

So, what is Genie?

Genie in an AI Bottle

Genie is a spatiotemporal transformer that takes in a single image or sketch plus an action, and generates a new frame where that action is executed.

As a video is nothing but a stacked set of images (known as frames) shown to you at a given frequency (the number of frames shown per second), the key element here is that Genie lets you ‘play the video’ by giving you the power, through actions, to decide what the next frame will be.

In other words, the world is created dependent on the actions you take.

For example, looking at the image below, if you provide an image and the ‘jump’ action, the model generates a new frame where the now-playable character is jumping while the entire background adapts to that action.

Consequently, as every requested movement is met with a new frame, you are essentially building a videogame in real time.

But with examples like Genie, I think it’s best if you click on this link, see it for yourself, and come back to understand how it works.

First and foremost, the key element is the concept of the spatiotemporal (ST) transformer, but what on Earth is that?

An ST transformer is a model that performs the attention mechanism, the technique that underpins products like ChatGPT, but applied on two dimensions: space and time.

Attention allows different parts of a sequence to share information with each other. In the case of ChatGPT, it helps the different words in a sequence ‘talk’ to each other to figure out the meaning of the overall sentence.

Here, we apply it to video.

To comprehend how this works, let’s say we have a video of a panda eating bamboo.

  • Space. By breaking each image into different patches, attention allows these patches to share information and figure out their role in the image. In the case of the panda, spatial attention will tell Genie what the panda is doing in that specific frame, like holding a bamboo stick with its mouth open.

  • Time. By performing attention over several frames, the time variable of the attention mechanism will tell Genie that the panda is, indeed, eating, as the concatenation of frames depicts that action.

This concept of spatiotemporal attention is then used to assemble the components that jointly create Genie.
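
To make this more concrete, here is a minimal sketch of a spatiotemporal attention block in PyTorch. It is purely illustrative and not Genie's actual architecture: the class name, shapes, and the use of standard multi-head attention are assumptions. Patches within a frame attend to each other (space), and each patch position attends to itself across frames (time).

```python
# Illustrative spatiotemporal (ST) attention block, NOT Genie's real code.
# A video is a tensor of shape (batch, time, patches, dim): each frame is
# split into patches, and each patch is an embedding vector.
import torch
import torch.nn as nn

class STBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, p, d = x.shape
        # Space: patches within the SAME frame attend to each other.
        s = x.reshape(b * t, p, d)
        s, _ = self.spatial_attn(s, s, s)
        x = x + s.reshape(b, t, p, d)
        # Time: the SAME patch position attends across the frames.
        m = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        m, _ = self.temporal_attn(m, m, m)
        return x + m.reshape(b, p, t, d).permute(0, 2, 1, 3)

# Example: 2 clips, 16 frames each, 64 patches per frame, 128-dim embeddings.
video = torch.randn(2, 16, 64, 128)
out = STBlock(128)(video)   # same shape as the input: (2, 16, 64, 128)
```

In a real generative model the temporal attention would typically be causal, so a frame can only look at earlier frames; that detail is omitted here for brevity.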

A Tale of Three Models

Under the hood, Genie is made up of three components:

  • A video tokenizer

  • A latent action model

  • A dynamics model

Before diving into each component, let’s review the overall process (sketched in code right after this list). For a given image or sequence of images:

  1. A video tokenizer breaks down each frame into its latent representation (we will get there in a minute).

  2. Next, the latent action model observes each frame and predicts the action that was performed in each frame to reach the next one.

  3. Finally, a dynamics model takes in both inputs, the tokenized video and the set of per-frame actions, to predict the next frame.
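
As promised, here is a toy, end-to-end illustration of that three-step flow in Python. Every name, shape, and “model” below is a placeholder invented for clarity (a random projection, a frame difference, a simple addition); the point is only how the three pieces plug together, not how Genie actually implements them.

```python
# Toy stand-ins for Genie's three components, for illustration only.
import torch

def video_tokenizer(frames):
    # frames: (T, H, W, C) -> one 32-dim latent per frame.
    # A fixed random projection stands in for a learned encoder.
    return frames.flatten(1).float() @ torch.randn(frames[0].numel(), 32)

def latent_action_model(tokens):
    # Stand-in for inferring "what changed" between consecutive frames,
    # playing the role of the unlabeled latent actions Genie learns.
    return tokens[1:] - tokens[:-1]                 # (T-1, 32)

def dynamics_model(tokens, action):
    # Predict the next frame's latent from the last frame plus the action.
    return tokens[-1] + action                      # (32,)

frames = torch.randint(0, 255, (8, 64, 64, 3))      # 8 past frames of a video
tokens = video_tokenizer(frames)                    # step 1: compress
actions = latent_action_model(tokens)               # step 2: infer actions
next_latent = dynamics_model(tokens, actions[-1])   # step 3: predict the future
```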

But how does each component work?

Tokenizing video

The video tokenizer takes the video in and turns it into a set of vectors, one per frame. These vectors capture the semantics of the frame (they describe what the frame represents) in a simpler and compressed way.

You may wonder why we need to do such a thing. The reasons are twofold:

  • Efficiency. If you work with a smaller, compressed version of the video, the cost of processing it is much lower.

  • Focusing on what matters. There’s an enduring belief among AI experts that compressing data retains only its key concepts. That way, the model focuses on what matters instead of analyzing every small detail, relevant or not, in the original video.

Think of embeddings as the equivalent of taking a PDF file and turning it into a zip file. This zip file is much smaller, but still includes all the necessary information.

Embeddings do the same thing for the input data. However, they incur a certain loss of information, as some data is indeed discarded during this compression.
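
As a rough sketch of what tokenizing a frame can look like, here is a VQ-style nearest-codebook lookup in Python: each embedding is snapped to the index of its closest entry in a codebook, which is compact but lossy. The codebook size, dimensions, and function names are assumptions for illustration, not Genie’s actual tokenizer.

```python
# Minimal sketch of discrete tokenization via a codebook (VQ-style lookup).
import torch

codebook = torch.randn(512, 32)        # 512 possible "visual words", 32-dim each

def tokenize(frame_embeddings):        # (N, 32) patch/frame embeddings
    # Distance from every embedding to every codebook entry...
    dists = torch.cdist(frame_embeddings, codebook)
    # ...and the index of the closest entry becomes the token.
    return dists.argmin(dim=1)         # (N,) small integers

def detokenize(tokens):
    return codebook[tokens]            # lossy reconstruction of the embeddings

emb = torch.randn(64, 32)              # e.g. 64 patches of one frame
tokens = tokenize(emb)                 # 64 integers instead of 64 vectors
approx = detokenize(tokens)            # close to, but not exactly, the input
```

The lossiness is exactly the trade-off described above: you keep the gist of the frame, not every pixel.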

Moving on to the latent action model, things start to get interesting.

A World Observer and a Dynamics Generator

This component takes the history of frames up to a certain point and infers the latent actions that took each frame to the next one, the per-frame actions described in the overview above.

That way, by conditioning the next frame on the previous frames and actions, you are indirectly creating a world model: a model that, based on previous observations (frames) and actions, predicts what will happen next.

This takes us to the third component, the dynamics model.

This component, an autoregressive transformer (similar to ChatGPT but predicting images instead of text), takes the information provided by the past (the tokenized video and the inferred actions) and uses it to predict the future.

For instance, if the previous images describe a dungeon and a player, and the player moves to the right, both inputs signal the dynamics model that the next frame should be the dungeon slightly moved to the right, creating the world as you explore it.
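
Written as a loop, the ‘play’ experience described above might look like the sketch below: at each step you pick an action, the dynamics model predicts the next frame’s tokens from the history so far, and the new frame joins that history. The controller and the dummy dynamics model are invented stand-ins so the loop runs; they are not Genie’s real components.

```python
# Hedged sketch of the interactive "play" loop around a dynamics model.
import torch

def play(first_frame_tokens, dynamics_model, controller, n_steps=10):
    frames = [first_frame_tokens]                 # history starts from one image
    for _ in range(n_steps):
        action = controller()                     # e.g. "jump", as a latent action id
        history = torch.stack(frames)             # (T, tokens) generated so far
        next_tokens = dynamics_model(history, action)
        frames.append(next_tokens)                # the world grows as you act
    return frames

# Toy stand-ins so the loop runs: a random controller and a dynamics model
# that just nudges the last frame by the action (purely illustrative).
controller = lambda: torch.randint(0, 8, ())      # 8 discrete latent actions
dummy_dynamics = lambda h, a: h[-1] + 0.01 * a
rollout = play(torch.randn(64), dummy_dynamics, controller)
```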

But Genie could mean much more than that… especially to the field of robotic agents.

A Precursor to Robot Agents?

Probably the biggest trait humans have that far exceeds today’s robot capabilities is common sense, which in AI is most often discussed in terms of world models.

We humans have built a representation of the world in our brains, based on past experiences and sensory inputs, that tells us, for example, to walk on the side of the road to avoid getting run over, to say hi back when people greet us, or to know that jumping off a cliff is probably not a great idea.

But common sense goes way beyond these high-level reasoned actions; we also understand the physics of the world: things like gravity, speed and acceleration, or movement.

These concepts are extremely complex for AIs to model, making world models one of the hardest problems to solve.

But with Genie, we might have a precursor to such a model, as Genie can observe its world and figure out what comes next, a necessary feature for the robots that are meant to inhabit the world alongside humans, with notable examples like 1X or Figure AI.

In fact, Google is so optimistic about this possibility that it trained a Genie for robots using videos of its RT-1 robot, to the point that Genie became capable of predicting what the robot was going to do next.

And the most fascinating thing?

That Genie is trained in a fully unsupervised manner, meaning that it learns to derive actions by simply looking at videos, videos that have absolutely no reference to the actions they are showing.

Despite this, Genie still manages to learn them by itself, a strong hint of what a future evolution of today’s fairly limited Genie might be able to do.

It also opens AI to the possibility of using Internet-scale video data, like YouTube videos, to learn about our world and unlock the kind of emergent capabilities LLMs developed, but in a new dimension, as video is far more informative about the world than words could ever be.

A leap in embodied intelligence, the capacity of AI to learn to move and inhabit our world, is approaching fast.

Very fast.

AI brews beer and your big ideas

What’s your biggest business challenge? Don’t worry about wording it perfectly or describing it just right. Brain dump your description into AE Studio’s new tool and AI will help you solve that work puzzle.

Describe your challenge in three quick questions. Then AI churns out solutions customized to you.

AE Studio exists to solve business problems. They build great products and create custom software, AI and BCI solutions. And they once brewed beer by training AI to instruct a brewmeister and then to market the result. The beer sold out – true story.

Beyond beer, AE Studio’s data scientists, designers and developers have done even more impressive things working 1:1 with founders and executives. They’re a great match for leaders wanting to incorporate AI and just generally deliver outstanding products built with the latest tools and tech.

If you’re done guessing how to solve work problems or have a crazy idea in your back pocket to test out, ask AI Ideas by AE Studio for free solutions, right now.

👾 Best news of the week 👾

🥹 Wikipedia downgrades CNET for its AI-generated content

🧐 Microsoft invests in Mistral, the new fairy queen of AI

😍 Copilot for Finance, an AI Copilot meant for Finance professionals

🥇 Leaders 🥇

Understanding NVIDIA’s Boom & The State of Hardware

The markets are clear: in the age of AI, few things are as valuable as AI hardware.

Just take a look at NVIDIA, the main global supplier of AI chips:

  • It’s now the third most valuable company in the world, worth roughly the same as the entire German stock market: $2 trillion

  • In 2024 alone, it has added the equivalent of Tesla’s entire market value to its own valuation

  • And it recorded the single largest one-day gain in market value, $277 billion, in the history of capitalism.

But the question is, why?

Today, we are deep-diving into the crown jewel of AI investing, its hardware, to make sense of how it works and the significance of its role in AI.

However, it’s not all sunshine and rainbows for the incumbents, as several possible disruptors, from Sam Altman himself to the latest superstar start-up Groq, could paint a very different picture of the future of AI hardware and the industry as a whole.

Subscribe to Leaders to read the rest.

Become a paying subscriber of Leaders to get access to this post and other subscriber-only content.

