Lumiere's Text-to-Video & Rabbit vs Apple

šŸ TheTechOasis šŸ

Breaking down the most advanced AI systems in the world to prepare you for your future.

5-minute weekly reads.

TLDR:

  • AI Research of the Week: Google's Lumiere Model Looks Amazing

  • Leaders: Will The Rabbit Eat The Apple?

🤩 AI Research of the week 🤩

Scott Galloway, the famous business influencer, bet that 2024 would be Google's AI year.

And itā€™s starting to look like it.

Now, they have launched Lumiere, a huge breakthrough in text-to-video.

Their approach to AI video synthesis is not only novel; it also delivers impressive video quality and a wide range of downstream skills, like video inpainting/outpainting, image animation, and video stylization, making it the new reference in the field.

But how does it work? Let's get into the weeds.

The Everlasting Problem

Out of all data modalities, video is, without a doubt, the hardest.

Since a video is simply a sequence of images, called frames, displayed at a certain rate (the higher the fps, the smoother the video), the logical path to building text-to-video (T2V) is to start from a text-to-image (T2I) model like DALL-E or Stable Diffusion.

However, T2V adds an extra layer of complexity: Time.

In other words, it's not enough to generate multiple frames (you can generate as many as you need with a T2I model); they must also be consistent over time.
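
To put a number on it, here's a quick back-of-the-envelope calculation in Python (the figures are illustrative, not any particular model's spec):

```python
# A video is just a stack of frames shown at a fixed rate.
duration_s = 5     # length of the clip in seconds (illustrative)
fps = 16           # frames per second; higher fps means smoother motion
num_frames = duration_s * fps
print(num_frames)  # 80 frames, every one of which must stay consistent with the rest
```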

And this has proven to be a huge problem: it forces AI videos to be very short, and even then they tend to show artifacts, visual imperfections like the orange blob that suddenly appears in the AI-generated video below.

Source: NVIDIA

The reasons for these inconsistencies lie in how these models are built, and that is exactly what Lumiere changes.

Source: Google

Traditionally, the video synthesis process involves four steps:

  1. A T2I prior generates a set of keyframes spread across the complete duration of the video.

  2. Next, several TSR (Temporal Super Resolution) models "fill in" the gaps between the keyframes with a set of new frames.

If two keyframes show a serious person and that same person smiling, the TSR models generate the intermediate frames that make up the smiling gesture.

  3. Then, a set of SSR (Spatial Super Resolution) models takes the low-resolution frames and upscales them to enhance the quality of the video, as most T2V models work in a low-resolution pixel space for efficiency.

  4. Finally, the SSR outputs are "stitched" together, and that gives you the video.

Bottom line: AI video models simply take an image generator, train it to generate batches of images that are consistent through time, and piece them together.
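
To make that cascade concrete, here is a toy, runnable sketch in Python where random arrays stand in for the actual models (every function below is a stub invented for illustration; none of this is Lumiere's or any real library's API):

```python
import numpy as np

def t2i_prior(prompt, n, size=32):
    """Stub 'text-to-image prior': returns n random low-resolution keyframes."""
    rng = np.random.default_rng(0)
    return [rng.random((size, size, 3)) for _ in range(n)]

def tsr_model(start, end, n_between=8):
    """Stub 'Temporal Super Resolution': linearly interpolates between two keyframes."""
    return [start * (1 - t) + end * t for t in np.linspace(0, 1, n_between, endpoint=False)]

def ssr_model(frames, scale=4):
    """Stub 'Spatial Super Resolution': nearest-neighbour upscaling of each frame."""
    return [f.repeat(scale, axis=0).repeat(scale, axis=1) for f in frames]

def cascaded_text_to_video(prompt, num_keyframes=4, chunk_size=8):
    # 1. Sparse, low-resolution keyframes spread across the whole clip
    keyframes = t2i_prior(prompt, n=num_keyframes)
    # 2. TSR fills the gaps between consecutive keyframes
    low_res = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        low_res += tsr_model(a, b)
    # 3. SSR upscales the frames chunk by chunk (due to memory constraints)
    chunks = [low_res[i:i + chunk_size] for i in range(0, len(low_res), chunk_size)]
    high_res = [f for chunk in chunks for f in ssr_model(chunk)]
    # 4. "Stitch" everything back together into one video tensor
    return np.stack(high_res)

video = cascaded_text_to_video("a panda eating bamboo")
print(video.shape)  # (24, 128, 128, 3): frames, height, width, channels
```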

And this works… kind of.

It's like filming an actor who breaks character midway through a sketch: you force him back into that exact position so you can finish the scene without losing the first part, but no matter how well you edit it, the cut is going to be visible.

Considering this limitation, video generation didn't seem quite ready yet. With Google Lumiere, however, we might be seeing the start of something big.

Space, Time, and MultiDiffusion

Like their image counterparts, T2V models are mostly diffusion models.

A diffusion model is a type of AI system that learns to map a noisy distribution of data into the target distribution through a denoising process.

In layman's terms, they take a noisy image and a text condition (what you want the end result to be) and gradually remove the noise from the image until you get the result.

"Portrait of a cat"

Think of the diffusion process as taking a marble block and, just like Michelangelo would, carving out the excess marble to 'unearth' the statue.
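
If the marble analogy still feels abstract, here is a tiny runnable loop that shows the shape of that denoising process (a heavy simplification: the "model prediction" is faked with the target image itself, so this only illustrates the control flow, not a real diffusion model):

```python
import numpy as np

def denoising_step(x, predicted_clean, t, total_steps):
    """One toy reverse-diffusion step: nudge the current image toward the model's
    prediction of the clean image. Real noise schedules are more elaborate."""
    alpha = 1.0 / (total_steps - t)
    return x + alpha * (predicted_clean - x)

rng = np.random.default_rng(0)
target = rng.random((8, 8, 3))      # stands in for "what the text prompt asks for"
x = rng.standard_normal((8, 8, 3))  # start from pure noise

total_steps = 50
for t in range(total_steps - 1):
    # A real model would predict the clean image (or the noise) from x and the prompt;
    # here we cheat and plug in the target directly to keep the sketch self-contained.
    x = denoising_step(x, target, t, total_steps)

print(np.abs(x - target).mean())    # small: most of the noise has been "carved away"
```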

However, instead of following the standard procedure we described earlier, Google has found an alternative using a Space-Time UNet (STUNet).

But what is that?

A UNet is an architecture that downsamples, processes, and upsamples images.

It's standard in image and video generation because it allows you to do the hard processing work in a compressed version of the images and thus save money.

At every step, the noisy input images are downsampled (made smaller) using convolutions that identify spatial patterns in the image.

This way, not only are we making the images more tractable, but we are also capturing the complex structures in them.

Once the images have been compressed and their meaning captured, they are upscaled back into pixel space, recovering the original size but now with the newly generated content.
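
As a minimal sketch of that pattern, assuming PyTorch (this is not Lumiere's actual architecture, just the generic downsample-process-upsample shape of a UNet, without the skip connections a real one would have):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Deliberately tiny UNet-style block: downsample -> process -> upsample."""
    def __init__(self, channels=3, hidden=16):
        super().__init__()
        # Downsampling convolutions: halve height/width twice, learning spatial patterns
        self.down = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden * 2, 3, stride=2, padding=1), nn.ReLU(),
        )
        # The "hard processing work" happens here, on the compressed representation
        self.mid = nn.Conv2d(hidden * 2, hidden * 2, 3, padding=1)
        # Upsampling back to the original resolution in pixel space
        self.up = nn.Sequential(
            nn.ConvTranspose2d(hidden * 2, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, channels, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.up(self.mid(self.down(x)))

frames = torch.randn(8, 3, 64, 64)  # 8 noisy frames, each processed independently (no time axis)
print(TinyUNet()(frames).shape)     # torch.Size([8, 3, 64, 64])
```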

However, unlike other models, Lumiere's process is time-aware, meaning that we not only compress the images… we also compress time.

This may feel abstract, but what the STUNet does is understand not only what each frame represents, but also how different frames relate to each other.

In other words, it's not only about capturing that the frames portray a panda, but also what movements the panda should be making over time.

This allows Lumiere to create all the frames in the video at once (instead of the usual keyframe + cascaded frame-filling we discussed earlier), so the STUNet simply needs to focus on capturing the semantics of the frames and upscaling them into the actual video.
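
Continuing the sketch above (and again assuming PyTorch, not Google's actual code), the space-time twist boils down to using 3D convolutions so that the time axis gets downsampled and upsampled alongside height and width:

```python
import torch
import torch.nn as nn

class TinySpaceTimeBlock(nn.Module):
    """Toy space-time block: 3D convolutions compress time, height, and width together,
    process the clip as a whole, then upsample all three axes back."""
    def __init__(self, channels=3, hidden=16):
        super().__init__()
        self.down = nn.Conv3d(channels, hidden, 3, stride=2, padding=1)  # halves T, H, and W
        self.mid = nn.Conv3d(hidden, hidden, 3, padding=1)               # reasons over the whole clip
        self.up = nn.ConvTranspose3d(hidden, channels, 4, stride=2, padding=1)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.up(torch.relu(self.mid(torch.relu(self.down(x)))))

clip = torch.randn(1, 3, 16, 64, 64)     # 16 frames handled jointly, not one by one
print(TinySpaceTimeBlock()(clip).shape)  # torch.Size([1, 3, 16, 64, 64])
```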

However, you still need several SSR models to upscale the frames due to memory constraints, meaning there's still some 'stitching' going on at the end.

Thus, to avoid inconsistencies across the upscaled outputs of each SSR, they apply MultiDiffusion (Bar-Tal et al., 2023).

What this does is ensure consistency across the different generated frame batches by using a MultiDiffuser.

In very simple terms, the MultiDiffuser allows several image generation processes to take place at once over a frame.

For instance, you can generate a "blurred image" and, at the same time, run parallel generations on patches of that image, like drawing "a mouse" in one region and "a pile of books" in another.

The key intuition is that the MultiDiffuser ensures that, no matter what you generate in those fragments of the image through separate diffusion processes, they are consistent with the overall piece.

Consequently, when several frame batches of the video need to be pieced together, this component reconciles the boundaries between the outputs of the different SSR models so that they are consistent, ensuring a smooth transition between segments.
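
In spirit, the effect is something like the toy blending below, where overlapping frame batches are averaged in their overlap regions instead of being hard-cut and glued together (a big simplification: MultiDiffusion reconciles the diffusion updates themselves, not just the finished frames):

```python
import numpy as np

def blend_overlapping_segments(segments, overlap):
    """Toy 'smooth stitching': overlapping frame batches are averaged where they meet."""
    seg_len = segments[0].shape[0]
    step = seg_len - overlap
    total = step * (len(segments) - 1) + seg_len
    out = np.zeros((total,) + segments[0].shape[1:])
    weight = np.zeros((total,) + (1,) * (segments[0].ndim - 1))
    for i, seg in enumerate(segments):
        start = i * step
        out[start:start + seg_len] += seg
        weight[start:start + seg_len] += 1
    return out / weight  # frames covered by two segments become their average -> no hard seam

# Three 16-frame batches that overlap by 4 frames, e.g. outputs of different SSR models
segments = [np.random.rand(16, 32, 32, 3) for _ in range(3)]
video = blend_overlapping_segments(segments, overlap=4)
print(video.shape)  # (40, 32, 32, 3)
```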

And the results are, to put it mildly, very impressive (video and paper).

A New Age for Video

With Lumiere, the world gets a clear view of what the future holds for video generation, editing, and animation, among others.

Soon, anyone will have the power to create impressive videos from scratch in no time and generate a new world of possibilities.

And despite the impressive results, it feels like we are just seeing the tip of the iceberg.

2024 is going to be wild.

🫡 Key contributions 🫡

  • Google Lumiere is a new type of text-to-video model that achieves state-of-the-art quality and longer videos by generating the entire video at once and upscaling its quality.

  • It also applies MultiDiffusion to ensure that the different segments of the upscaled video transition well.

👾 Best news of the week 👾

🎥 Google shows off MobileDiffusion

🥇 Leaders 🥇

The Fight to Be the New Dominant Computing Platform

In probably the biggest technology news of the year, Apple has released the Vision Pro, its revolutionary AR/VR headset.

But as Jason Calacanis described it, today the Apple Vision Pro can be summarized as a "bye, oh my, goodbye".

An incredible piece of hardware that seems to be looking for a problem to solve.

The issue?

Apple should get its act together pretty soon, because its vision of the future of computing is being challenged by a product, and a company, that intend to kill the very app ecosystem Apple is trying to redefine, meaning Apple's success could be very much 'short-lived'.

Unveiled with one of the most impressive presentations in technology history, this piece of hardware could change the world, turn the Vision Pro into Apple's 'Google Glass' moment, and hand it a grand defeat.

Will the Rabbit eat the Apple? Let's see.

The Dawn of Spatial Computing

Unless you've been living under a rock, you probably already know what the Apple Vision Pro is, so I'm not going to bore you with an extensive review.

But letā€™s cover the essentials.

What is this thing?

In short, it's Apple's bet on redefining what computers are.

In more explicit terms, it's basically a headset computer, where all controls and interactions happen inside what is undoubtedly the most advanced piece of consumer hardware in the world.

In other words, instead of using MacBooks or iPads, Apple is trying to make computer interaction a more immersive and all-around futuristic experience.

It includes some of the most advanced tech in the world, like unparalleled video passthrough: you see an almost photo-realistic representation of your surroundings, recreated and displayed to you in real time (it's not the actual background, but a live reconstruction of it).

The background looks identical to the real one, but it's not real.

Consequently, even though Apple is heavily marketing this product as an Augmented Reality headset… it's not one, as everything you see is virtual anyway.

The reason is none other than to differentiate itself from Meta's Oculus Quest, or, to be more precise, to avoid being lumped into the Metaverse category.

If you ask Apple, this is not the Metaverse, but the creation of a new type of computing, Spatial Computing.

A new way of doing stuff

So, what is spatial computing?

It is a new type of computing where the screens and interactions are handled through a headset and the use of your hands.

Particularly:

  • It introduces an ultra-high-resolution display system using micro-OLED technology to pack 23 million pixels into two displays, offering more than 4K resolution for each eye.

  • A high-performance eye-tracking system that utilizes high-speed cameras and a ring of LEDs to project invisible light patterns onto the user's eyes.

  • It's powered by Apple silicon in a unique dual-chip design: the M2 chip provides powerful standalone performance, while the new R1 chip processes input from 12 cameras, five sensors, and six microphones to ensure content feels like it is appearing right in front of the user's eyes, in real time.

And many more features that turn your daily computer activities into a more visual and immersive experience.

However, we need to point out an important thing: today, Apple is positioning this not as another gaming headset, but as a product to get things done.

But, for $3,500, what makes this product a must-buy?

Put bluntly, it doesn't unlock anything we can't already do with our standard computers today; it simply offers a new experience.

It could offer a wild experience in some cases, like this F1 concept (not actually real), but it looks more like an entertainment-focused product than the productivity tool Apple intended.

In other words, instead of finding a solution to a current problem, they are simply generating the need for something nobody asked for.

Classic Apple.

However, this next piece of hardware is another story, because its makers are not simply enhancing the status quo like Apple's headset; they are trying to kill it.
