NVIDIA shocks the world with insane AI-generated videos
The creators of Stable Diffusion bring you VideoLDM, a new paradigm for AI video generation
TheTechOasis
This Week's AI Insight
NVIDIA's VideoLDM is the most impressive AI of 2023
If an image is worth 1000 words, what's a video worth?
I'll let you decide by checking out the fully AI-generated video below:
Yes, this is 100% AI.
NVIDIA and the creators of Stable Diffusion have presented VideoLDM, a new state-of-the-art video synthesis AI model that proves, once again, that the world will never be the same.
But how does this new AI marvel work, and what else can it do?
Hard problems require smart answers
When synthesizing (generating) video with AI, you run into two huge issues:
Cost: Generating videos in standard ways is computationally intensive.
Lack of training data: AI video generation needs video data, and while text and images are widely accessible, video data is not.
But how on Earth did they pull it off?
Work smart, not hard
Videos are simply frames (images) played in sequence over time.
The more frames, the longer the video.
The higher the resolution (more pixels per frame), the better the quality.
And the more frames you can fit per second, the smoother the video.
In other words, it's all images.
So why not use our powerful image generators like Stable Diffusion (SD) as the basis of our video synthesis model?
This way, we can use our widely accessible image data sets to train our model, instead of relying on costly and inaccessible video data.
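To make the "it's all images" idea concrete, here's a minimal sketch (my own illustration, not NVIDIA's code) of a video as nothing more than a stack of image frames plus a frame rate:

```python
import numpy as np

# One RGB image: height x width x channels.
height, width, channels = 512, 512, 3
frame = np.zeros((height, width, channels), dtype=np.uint8)

# A video is just T frames stacked along a new time axis: (T, H, W, C).
fps = 24                      # more frames per second -> smoother video
seconds = 4                   # more frames overall -> longer video
num_frames = fps * seconds

video = np.stack([frame] * num_frames, axis=0)
print(video.shape)            # (96, 512, 512, 3)
print(video.shape[0] / fps)   # 4.0 seconds of footage
```

Everything that follows boils down to filling that (T, H, W, C) stack with high-quality, mutually consistent frames.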
Sounds awesome, but there's an obvious issue.
If you ask SD to generate a panda two times in a row, both pandas will be observably different.
It doesn't give a damn about, or remember, how the previous panda looked; it's just trained to give you what you ask for at that point in time.
Zero consistency.
Models like ControlNet allow you to apply consistency, but you're conditioning (forcing) the model to output data in a certain way; here, we want fully creative video synthesis with no human interference.
So what can we do to solve this?
Simple: give it temporal awareness.
And so, the NVIDIA researchers propose an innovative architecture that interleaves temporal layers with SD's spatial ones.
But hold on, I know that sounds overly complicated.
However, it's quite simple to understand with the following intuition:
What learns what and where
A neural network is simply a set of parameters that learn "something", usually defined as a training objective.
For instance:
ChatGPT's parameters learn to predict the next word in a sequence.
SD's parameters learn to generate an image from a text prompt.
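As a toy illustration (not ChatGPT's or SD's actual code; the model and data here are dummies), this is what "parameters learning a training objective" looks like for next-word prediction:

```python
import torch
import torch.nn as nn

# A toy "language model": its parameters exist only to minimize one
# training objective, predicting the next token (cross-entropy loss).
vocab_size, embed_dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                      nn.Linear(embed_dim, vocab_size))

tokens = torch.randint(0, vocab_size, (1, 16))   # a dummy token sequence
logits = model(tokens[:, :-1])                   # predict from the context
loss = nn.functional.cross_entropy(              # the training objective
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()                                  # gradients flow back to the parameters
```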
But if we look inside, these models are made up of blocks of neural networks, which are in turn made up of layers formed by neurons, which are themselves defined by certain parameters (the elements that eventually learn that "something").
This means that, while some blocks and layers learn to do some stuff, others will learn to do other stuff.
For instance, in our case:
The spatial layers are the original Stable Diffusion layers, already capable of impressive text-to-image generation.
And the temporal layers, NVIDIA's key innovation, are trained with video data and learn to apply consistency across several images.
And these sets of images, high-quality thanks to the spatial layers and consistent thanks to the temporal layers, are what form the videos.
For better intuition, here's a visual representation of how these layers (formed by neurons) are interleaved in VideoLDM:
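And for readers who prefer code to diagrams, here's a conceptual PyTorch sketch of one such interleaved block. This is my own simplification, not NVIDIA's code; the layer types and the mixing rule are assumptions (the paper's temporal layers are attention and 3D-convolution blocks inside the diffusion U-Net):

```python
import torch
import torch.nn as nn

class InterleavedBlock(nn.Module):
    """Conceptual sketch: a frozen spatial (image) layer followed by a
    trainable temporal layer that mixes information across frames."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Stand-in for a pretrained Stable Diffusion spatial layer (kept frozen).
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        for p in self.spatial.parameters():
            p.requires_grad = False
        # Temporal layer: attention across the time axis (trainable).
        self.temporal = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Learnable mix, zero-initialized so the block starts out behaving
        # exactly like the image-only layer.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        # The spatial layer sees every frame independently.
        spatial_out = self.spatial(x.view(b * t, c, h, w)).view(b, t, c, h, w)
        # The temporal layer attends across frames at each spatial position.
        tokens = spatial_out.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        temporal_out, _ = self.temporal(tokens, tokens, tokens)
        temporal_out = temporal_out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        # Blend: alpha controls how much temporal mixing is applied.
        return spatial_out + torch.tanh(self.alpha) * temporal_out

block = InterleavedBlock(channels=8)
frames = torch.randn(1, 16, 8, 32, 32)   # 16 latent frames
print(block(frames).shape)               # torch.Size([1, 16, 8, 32, 32])
```

The design choice to keep in mind from the paper is that the spatial layers stay frozen and only the new temporal layers are trained; the zero-initialized mix above is my own assumption for how to start from pure image behavior.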
So, now that we understand the key innovation behind VideoLDM, how does it really work?
LDM + temporality = VideoLDM
VideoLDM has two objectives:
Generate high-quality and high-resolution driving scenarios (the video at the beginning)
Predict videos from text, creating the first-ever high-quality and high-resolution text-to-video generator
To do so, it comprises a stack of three different "models":
Stable Diffusion: The original text-to-image generator that generates the key frames in the video
A video interpolation model: A model that, between two key frames, generates a sequence of new, consistent images to increase the frame rate (frames per second)
DM upsampler: To generate high-resolution video, the model includes a final step where the output is upsampled (increasing the number of pixels) to obtain higher resolution
But to make things even clearer for you, let's go through an example of how this works when generating a video, using the image below as guidance (and the pseudo-code sketch after the steps):
The SD-based ImageLDM generates a set of key images/frames. These images are consistent because the ImageLDM is temporally aware thanks to the temporal layers.
These key frames are then used to generate new, interpolated images between them in two different interpolation procedures. This is done to increase the number of frames in the sequence.
The key and interpolated frames are then decoded from their latent representations into actual images.
Finally, to generate high-resolution videos, the sequence of frames is upsampled to a higher resolution of 1280 × 2048 pixels (the more pixels you have, the better the quality of the video).
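To tie the four steps together, here's a hedged end-to-end sketch; the function names, shapes, and dummy stand-ins are illustrative assumptions, not NVIDIA's actual API:

```python
import torch
import torch.nn.functional as F

def generate_video(prompt, keyframe_ldm, interpolation_ldm, decoder, upsampler_dm):
    """Illustrative pipeline only; every model is assumed to be a callable."""
    # 1. Generate a short sequence of consistent key frames in latent space.
    key_latents = keyframe_ldm(prompt)                # (T_key, C, h, w)
    # 2. Interpolate new latents between key frames (two rounds, as described
    #    above) to raise the frame rate.
    dense_latents = interpolation_ldm(interpolation_ldm(key_latents))
    # 3. Decode every latent frame into an actual image.
    frames = decoder(dense_latents)                   # (T, 3, H, W)
    # 4. Upsample each frame with a diffusion upsampler to the final resolution.
    return upsampler_dm(frames)

# Dummy stand-ins so the sketch runs end to end.
dummy_keyframes = lambda prompt: torch.randn(8, 4, 40, 64)        # 8 latent key frames
dummy_interp    = lambda z: torch.repeat_interleave(z, 3, dim=0)  # 3x more frames
dummy_decoder   = lambda z: torch.randn(z.shape[0], 3, 320, 512)  # latents -> images
dummy_upsampler = lambda f: F.interpolate(f, size=(1280, 2048))   # to 1280 x 2048

video = generate_video("a frog writing a scientific paper",
                       dummy_keyframes, dummy_interp, dummy_decoder, dummy_upsampler)
print(video.shape)   # torch.Size([72, 3, 1280, 2048])
```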
Et voilà:
"A sks frog writing a scientific research paper." Source: NVIDIA
Breaking new boundaries at hyper speed
This innovation by NVIDIA represents yet another paradigm shift for AI, a technology breaking so many milestones every day that it's hard to keep up.
And just like that, AI-generated Hollywood blockbusters are all of a sudden not that far off.
And what's next?
Well, I guess we'll have to wait for next week to see what other amazing stuff AI brings us.
For more examples of how this amazing tech works, click here.
Top AI news for the week
Google Brain and DeepMind become one, the new behemoth in the AI industry
MegaMedical takes medical image segmentation to the next level
Bard is generating a lot of internal debate at Google
Adobe jumps on the GenAI train to generate effects and edit footage with text
Creators of Stable Diffusion launch family of LLM models, StableLM
Elon announces AI project, "maximum truth-seeking AI"
How intelligent is ChatGPT? Fascinating debate between Lex Fridman and Manolis Kellis