
NVIDIA shocks the world with insane AI-generated videos

The fathers of Stable Diffusion bring to you VideoLDM, the new paradigm for AI

🏝 TheTechOasis 🏝

🤖 This Week’s AI Insight 🤖

NVIDIA’s VideoLDM is the most impressive AI in 2023

If an image is worth 1000 words, what’s a video worth?

I’ll let you decide by checking this fully AI-generated video below:

Yes, this is 100% AI.

NVIDIA and the creators of Stable Diffusion have presented VideoLDM, a new state-of-the-art video synthesis AI model that proves, once again, that the world will never be the same.

But how does this amazing new AI marvel work and what other amazing stuff can it do?

Hard problems require smart answers

When synthesizing (generating) video with AI, you run into two huge issues:

  • Cost: Generating videos in standard ways is computationally intensive.

  • Lack of training data: AI video needs video data. And while text and image data are widely accessible, video data is not.

But how on Earth did they pull it off?

Work smart, not hard

Videos are simply frames (images) sequenced in a specific timeframe.

  • The more frames, the longer the video

  • The more pixels each frame has (the higher the resolution), the better the quality of the video

  • And the more frames you can fit per second, the smoother the video.

In other words, it’s all images.
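To make the three bullets concrete, here is a minimal sketch of the arithmetic behind them. The clip size, frame rate, and resolution are made-up numbers for illustration only, not figures from NVIDIA.

```python
def video_stats(num_frames, fps, width, height):
    """Return (duration in seconds, total pixels across all frames)."""
    duration_s = num_frames / fps              # more frames -> longer video
    total_pixels = num_frames * width * height  # more pixels per frame -> more data
    return duration_s, total_pixels

# Hypothetical clip: 120 frames played at 24 frames per second, 1280x720 pixels each
duration_s, total_pixels = video_stats(120, 24, 1280, 720)
print(duration_s)    # 5.0
print(total_pixels)  # 110592000
```

Even this short hypothetical clip means generating over a hundred million pixels, which hints at why video synthesis is so costly.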

So why not use our powerful image generators like Stable Diffusion (SD) as the basis of our video synthesis model?

This way, we can use our widely accessible image data sets to train our model, instead of relying on costly and inaccessible video data.

Sounds awesome, but there’s an obvious issue.

If you ask SD to generate a panda two times in a row, both pandas will be observably different.

It doesn’t give a damn about, or even remember, how the previous panda looked. It’s just trained to give you what you ask for at that point in time.

Zero consistency.
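The lack of consistency comes from the fact that every generation starts from fresh random noise. The toy sketch below only mimics that behavior: `toy_generate` is a hypothetical stand-in, not Stable Diffusion's API, and its four numbers merely play the role of pixels.

```python
import random

def toy_generate(prompt, seed):
    """Toy stand-in for an image generator: 4 'pixel' values drawn from noise."""
    rng = random.Random(f"{prompt}-{seed}")  # each call gets its own noise source
    return [round(rng.random(), 3) for _ in range(4)]

panda_a = toy_generate("a panda", seed=1)
panda_b = toy_generate("a panda", seed=2)

# Same prompt, different starting noise: the two "images" do not match,
# and nothing in the second call knows about the first.
print(panda_a != panda_b)
```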

Models like ControlNet allow you to apply consistency, but you’re conditioning (forcing) the model to output data in a certain way; here, we want fully-creative video synthesis with no human interference.

So what can we do to solve this?

Simple, give it temporal awareness.

And so, the NVIDIA researchers propose here an innovative architecture that interleaves temporal layers with the SD spatial ones.

But hold on, I know that sounds way too overcomplicated.

However, it’s quite simple to understand with the following intuition:

What learns what and where

A neural network is simply a set of parameters that learn “something”, usually defined as a training objective.

For instance:

  • ChatGPT’s parameters learn to predict the next word in a sequence.

  • SD’s parameters learn to generate an image from a text prompt.

But if we look inside, these models are organized into blocks of neural networks, which are made of layers, which are in turn made of neurons defined by certain parameters (the elements that eventually learn that “something”).

This means that, while some blocks and layers learn to do some stuff, others will learn to do other stuff.

For instance, in our case:

  • The spatial layers are the original Stable Diffusion layers, already capable of impressive text-to-image generation.

  • And the key innovation by NVIDIA, the temporal layers trained with video data, learn to apply consistency across several images.

And these high-quality (thanks to the spatial layers) and consistent (thanks to the temporal layers) sets of images are what generate the videos.

For better intuition, here’s a visual representation of how these layers (formed by neurons) are interleaved in VideoLDM:
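The interleaving idea can also be sketched in a few lines of code. This is a toy under loud assumptions: in the real model both kinds of layers are learned neural network layers, whereas here `spatial_layer` is a fixed per-frame operation and `temporal_layer` is a plain average over neighboring frames. All function names are hypothetical.

```python
def spatial_layer(video):
    """Process each frame on its own (stand-in for an SD image layer)."""
    return [[2.0 * px for px in frame] for frame in video]

def temporal_layer(video):
    """Mix each pixel with the same pixel in the previous and next frame."""
    last = len(video) - 1
    out = []
    for t, frame in enumerate(video):
        prev_f = video[max(t - 1, 0)]
        next_f = video[min(t + 1, last)]
        out.append([(p + c + n) / 3.0 for p, c, n in zip(prev_f, frame, next_f)])
    return out

def interleaved_block(video):
    """One spatial layer followed by one temporal layer."""
    return temporal_layer(spatial_layer(video))

def flicker(video):
    """Total absolute change between consecutive frames."""
    return sum(abs(a - b)
               for f0, f1 in zip(video, video[1:])
               for a, b in zip(f0, f1))

# 4 frames of 2 "pixels" each that flip back and forth (maximally inconsistent)
video = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
spatial_only = spatial_layer(video)
mixed = interleaved_block(video)
print(flicker(mixed) < flicker(spatial_only))  # True: frames agree more with each other
```

The spatial layer never looks at neighboring frames, so it cannot reduce the flicker on its own. Only the layer that mixes information along the time axis can.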

So, now that we understand the key innovation behind VideoLDM, how does it really work?

LDM + temporality = VideoLDM

VideoLDM has two objectives:

  • Generate high-quality and high-resolution driving scenarios (the video at the beginning)

  • Predict videos from text, creating the first-ever high-quality and high-resolution text-to-video generator

To do so, it comprises a stack of three different “models”:

  • Stable Diffusion: The original text-to-image generator that generates the key frames in the video

  • A video interpolation model: A model that, between two key frames, generates a sequence of new, consistent images to increase the frame rate (frames per second)

  • DM upsampler: To generate high-resolution video, the model includes a final step where the output is upsampled (the number of pixels is increased) to obtain higher resolution

But to make things even clearer for you, let’s go through an example of how this works when generating a video, using the below image as guidance:

  1. The SD-based ImageLDM generates a set of key images/frames. These images are consistent because this ImageLDM is temporally aware thanks to the temporal layers.

  2. These key frames are then used to generate new, interpolated images between them in two different interpolation procedures. This is done to increase the number of images in the sequence.

  3. The key and interpolated frames are then decoded into images (generated) from their latent representations

  4. Finally, to generate high-resolution videos, the sequence of frames is upsampled to a higher resolution of 1280 × 2048 pixels (the more pixels you have the better the quality of the video)
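The four steps above can be sketched as a toy pipeline. Everything here is a simplified stand-in: the real interpolation and upsampling stages are learned models, while this sketch uses linear blending and nearest-neighbor repetition. The function names, the frame counts, and the choice of one new frame per gap are assumptions made for illustration.

```python
def interpolate(frames, n_between):
    """Insert n_between linearly blended frames between each neighboring pair."""
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        for k in range(1, n_between + 1):
            w = k / (n_between + 1)
            out.append([(1 - w) * x + w * y for x, y in zip(a, b)])
    out.append(frames[-1])
    return out

def decode(latent):
    """Toy decoder: scale latent values in [0, 1] to 0..255 pixel values."""
    return [round(255 * v) for v in latent]

def upsample(frame, factor):
    """Nearest-neighbor upsampling of a 1-D row of pixels."""
    return [px for px in frame for _ in range(factor)]

key_latents = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]     # step 1: 3 key frames
latents = interpolate(interpolate(key_latents, 1), 1)  # step 2: two interpolation passes
frames = [decode(z) for z in latents]                  # step 3: latents -> pixels
video = [upsample(f, 2) for f in frames]               # step 4: more pixels per frame

print(len(key_latents), len(latents), len(video[0]))   # 3 9 4
```

Note how cheap the first step is relative to the result: only three key frames are generated from scratch, and the later stages fill in the remaining frames and pixels.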

Et voilà:

“A sks frog writing a scientific research paper.” Source: NVIDIA

Breaking new boundaries at hyper speed

This innovation by NVIDIA represents yet another paradigm shift for AI, a technology breaking so many milestones every day that it’s hard to keep up with.

And just like that, AI-generated Hollywood blockbusters are all of a sudden not that far off.

And what’s next?

Well, I guess we’ll have to wait for next week to see what other amazing stuff AI brings us.

For more examples of how this amazing tech works, click here

👾Top AI news for the week👾

🤲🏼 Google Brain and DeepMind become one, the new behemoth in the AI industry

🤩 MegaMedical takes medical image segmentation to the next level

😟 Bard is generating a lot of internal debate at Google

🚃 Adobe jumps on the GenAI train to generate effects and edit footage with text

🧠 Creators of Stable Diffusion launch family of LLM models, StableLM

🧐 Elon announces AI project, ‘maximum truth-seeking AI’

🤔 How intelligent is ChatGPT? Fascinating debate between Lex Fridman and Manolis Kellis