
LEGO, the Future of TikTok & The Inevitable Death of the App


Breaking down the most advanced AI systems in the world to prepare you for your future.

5-minute weekly reads.

TLDR:

  • AI Research of the Week: LEGO, the Future of TikTok & Your Kids

  • Leaders: The Inevitable Death of Apps

AI Research of the Week

ByteDance has announced a new model named LEGO that not only has impressive capabilities, but also gives us a clear idea of the future of its products, including the app your kids are glued to right now.

The world-famous TikTok. And you might not like what this future looks like.

At all.

Pay close attention, AI

Even though Multimodal Large Language Models like ChatGPT are all the rage right now, they are surprisingly limited.

Therefore, since the beginning of 2024, researchers have put a lot of focus on making these models better.

The promise is so great that some of the largest corporations in the world are working on it, like Apple with Ferret, a model recently covered in this newsletter.

But why are MLLMs so limited?

The attention problem

Most MLLMs today, with some notable exceptions like Fuyu, are based on the combination of two AI components:

  • A Large Language Model (LLM), aka the brain of the model. It processes all inputs and generates the response the MLLM will give back to you.

  • One or more modality encoders. An encoder is an AI model that takes an input, like text, audio, images, or video and outputs a representation of that input.

But what is a representation?

You will see this term everywhere in AI literature. It's what an AI model 'sees and understands' from a certain input. Tangibly speaking, it's the portrayal of the input, like the word 'table' or a picture of one, as a vector of numbers, as that's the only way machines can process data.

This transformation is done with models like encoders. The key is that they capture the semantics of the input, so the numbers in the vector aren't random. Thus, if you calculate the representations of two different tables and a cat, the vectors of the tables will be very similar to each other, but very different from the vector representation of 'cat', because they are semantically different concepts.
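To make the idea concrete, here is a minimal sketch of representations and semantic similarity. It assumes the sentence-transformers package and a small example model; this is purely illustrative, not the encoder CLIP or LEGO actually uses.

```python
# Toy illustration of 'representations': encode three inputs and compare them.
# The package and model name are example choices, not the article's actual encoder.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["a wooden dining table", "a round kitchen table", "a sleeping cat"])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # two tables: high similarity
print(cosine(vectors[0], vectors[2]))  # table vs. cat: noticeably lower
```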

Next, to save money, researchers do what we call 'grafting'.

They take a pre-trained encoder, like OpenAI's CLIP model for images, and an LLM, and connect them using an 'adapter'.

This adapter is required because, as the encoder and the LLM have been trained separately, their respective representations of the same thing, like 'table', may be different.

As the generative part of the MLLM always comes through the LLM, the usual practice is to use the adapter to transform the encoder's representation into the 'embedding space' (the vector space of representations) of the LLM.

Sounds complicated, but the intuition is simple.

Imagine a Japanese woman and a blind American man facing the sunset.

Even though the Japanese woman can describe what she's seeing perfectly in Japanese, the blind American man can't understand her.

Thus, she uses Google Translate to turn her Japanese description into English.

In other words, the encoder and the LLM basically 'speak different languages', and the adapter translates the image representations from the encoder into a language the LLM 'understands'.

Hence, using a multimodal dataset, we train this adapter to perform this translation effectively, and this is how LLMs gain the power of sight.
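For the curious, here is a minimal sketch of what such an adapter can look like, assuming a frozen CLIP-style image encoder and a frozen LLM. The dimensions and the single linear projection are illustrative; real adapters (MLPs, Q-Former-style modules, etc.) vary.

```python
# Minimal sketch of a 'grafting' adapter: project frozen image-encoder features
# into the LLM's embedding space. Dimensions and the single linear layer are
# illustrative placeholders, not LEGO's actual adapter.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, encoder_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, image_features):          # (batch, num_patches, encoder_dim)
        return self.proj(image_features)        # (batch, num_patches, llm_dim)

adapter = Adapter()
fake_image_features = torch.randn(1, 256, 768)  # stand-in for frozen encoder output
visual_tokens = adapter(fake_image_features)    # ready to prepend to the LLM's text tokens
print(visual_tokens.shape)                      # torch.Size([1, 256, 4096])
```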

If you have enough capital *cough* OpenAI *cough*, you can train both the LLM and encoder together from scratch, thus not requiring an adapter.

But hereā€™s the thing.

The encoders, be it an image or an audio encoder, have not been trained to pay attention to fine-grained details, but to capture a global flavor of the input.

And that leads to grounding problems.

Missing key context

MLLM grounding refers to the capability of an MLLM to ground its response to a text prompt in the specific context provided by the other inputs, like an image or a video.

If the image portrays a husky and the question is "What is this animal good at?", the model should not answer merely from the fact that it's a dog; the fact that it's a husky indicates the answer should be "sled-pulling", which is certainly not the answer for a chihuahua.

That is grounding.

But for more complex tasks like "What is the animal on the top-left of the image, sitting right next to a tree, doing with its paws?", most AI models, even frontier models like ChatGPT, struggle greatly.

And here's where LEGO comes onto the stage.

One-Stop Shop

LEGO is the first MLLM that achieves fine-grained perception across multiple modalities.

And it's capable of doing amazing stuff like image grounding, as well as previously unseen capabilities like fine-grained video grounding with spatial and temporal awareness (it can pinpoint the exact moment an element in the video does something).

In other words, it's capable of paying close attention to detail across multiple modalities, not just images as Apple's Ferret does.

Binding data all in one

Put simply, LEGO applies the adapter principle to images, video, and audio... all in one.

Instead of relying on fancy localized attention modules like Ferret's, it achieved these capabilities the 'hard' way: by building a dataset exclusively tailored for grounding purposes.

Then, they followed a three-step training process (a rough code sketch of the freezing scheme in stages 2 and 3 follows the list):

  1. Multi-modal Pretraining (grafting): Aligns each pre-trained multi-modal encoder with the LLM embedding space using multiple adapters. This stage focuses on enabling the model to comprehend multi-modal inputs.

  2. Fine-grained Alignment Tuning: Aims to enhance the model's understanding of spatial coordinates and timestamps. Here, the model learns to draw accurate bounding boxes over objects and signal exact timestamps on a video where the identified event occurs. It involves training the LLM and adapters, while the encoders for each modality remain frozen.

  3. Cross-modal Instruction Tuning: Further refines the model using generated data to improve alignment with human preferences and enhance multi-modal interactions. This stage also freezes the modality encoders while training the LLM and adapters.
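Here is a rough sketch, in PyTorch, of the freezing scheme described in stages 2 and 3: the modality encoders stay frozen while the adapters and the LLM are trained. The module names and sizes are toy placeholders, not LEGO's actual code.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the pre-trained pieces (tiny dims; not LEGO's real modules).
image_encoder, video_encoder, audio_encoder = (nn.Linear(32, 32) for _ in range(3))
image_adapter, video_adapter, audio_adapter = (nn.Linear(32, 64) for _ in range(3))
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

# Stages 2 and 3: freeze every modality encoder, train the adapters and the LLM.
for enc in (image_encoder, video_encoder, audio_encoder):
    set_trainable(enc, False)
for mod in (image_adapter, video_adapter, audio_adapter, llm):
    set_trainable(mod, True)

trainable = [p for m in (image_adapter, video_adapter, audio_adapter, llm)
             for p in m.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)
print(sum(p.numel() for p in trainable), "trainable parameters")
```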

In summary, in a very cost-effective way, LEGO achieves highly competitive results, despite not being completely revolutionary.

But if the model isnā€™t a game-changer technologically, why are we talking about it?

Easy: because it says a lot about how TikTok is going to evolve under the hood soon.

A detective machine

TikTok became the app it is today thanks to its super-smart AI-based algorithm that glues you to the screen of your smartphone for hours on end.

The algorithm studies your viewing patterns and decides what content is going to make you want to doomscroll until you pass out, as the more content you view, the higher its ad revenues.

Considering one of every four Internet users uses ByteDance's video-sharing service, it's fair to say it's the most successful AI algorithm ever, even surpassing ChatGPT.

Now take these capabilities and add LEGO on top. With LEGO, the algorithm can:

  • Clearly understand the content of the video

  • Identify what points of the video resonated with you (you paused, you clicked like, etc.)

Et voilà: the most successful and addictive AI algorithm in history can now interpret your viewing patterns during the actual video to become even more successful and addictive.

You see, AI has a lot of good to bring to our world, but some of the most addictive products will soon be flooded with AI features to better predict our buying patterns or our content interests, making our already-hooked kids the perfect target for ads and a never-ending dopamine hit as the algorithms get to know them better than we do.

Sounds great, huh?

Key contributions

LEGO is the first MLLM that successfully grounds multiple modalities in one model while paying attention to key details.

  • It has amazing localized temporal and spatial awareness, being on par with many recent optimized MLLMs and reaching new highs in tasks like video grounding.


Best news of the week

Mark Zuckerberg declares his goal of creating AGI

First-ever OpenAI deal with a university

BMW is getting a fresh set of new robots that make coffee (and other stuff)

Leaders

The Inevitable Death of the App

It's inevitable.

A new set of digital consumer products is here and will significantly change how we interact with machines.

Products like Rabbit will be to the App Store what ChatGPT was to search, and will dramatically change how companies around the world build or use digital products, and how investors make money.

Hereā€™s why.

The Key That Opens the Lock

With Large Language Models (LLMs), the endgame has always been the same: democratizing human-machine interaction.

The raison d'être

Most foundation models today, general-purpose models that perform a plethora of tasks, are language models.

What this means is that they are sequence-to-sequence models that take a sequence of text as input and generate a continuation, thereby receiving the name Generative AI (GenAI).

But the most powerful feature of these models is not how well they generate text, but how well they process text, as they can understand written instructions.

This can't be overstated: it means that anyone who can write or speak a supported language can communicate with these models, no matter their tech savviness.
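As a tiny illustration of the 'text in, continuation out' idea, here is a sketch using Hugging Face's transformers library with the small GPT-2 model. Both are just convenient examples: GPT-2 is a plain continuation model, not an instruction-tuned one like ChatGPT.

```python
# Tiny illustration of sequence-to-sequence text continuation.
# The library and model are example choices, not any model discussed here.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "In one sentence, an AI adapter is"
print(generator(prompt, max_new_tokens=40)[0]["generated_text"])
```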

But even though these models have indeed democratized our communications with machines, they are still very limited on some fronts, especially when it comes to actions.

Agents, the future of work and robotics

While ChatGPT can indeed make very valuable suggestions, recommendations, and so on, rarely can it take action.

Indeed, OpenAI offers ChatGPT plugins, with examples like Expedia's, that allow you to search for flights in a conversational style.

However, when it comes to actually booking the flight, it has to provide you with a link to Expedia's site so that you can finish the job yourself.

At that point, why not simply use the Expedia site from the beginning?

Despite its obvious limitations, this combination of LLM + plugin is what we describe as AI agents, or language models that take action.

There are two types of agents:

  • those that take action digitally,

  • and those that take action physically, with important design differences.

As for the former, the clearest example is Microsoft's family of Copilot models, LLMs that are natively integrated with Microsoft products.

For instance, Copilot 365 allows you to interact with the Office suite, building entire Word documents or PowerPoint presentations simply by asking for them.

This works by optimizing an LLM to, upon receiving an instruction, identify the product API required, like the PowerPoint API, and write and send the API request to the product.
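Here is a hedged sketch of that routing pattern using tool calling with the OpenAI API (an example choice, not Microsoft's actual implementation). The create_slide tool and its schema are hypothetical stand-ins for a real PowerPoint-style API.

```python
# Sketch of 'identify the right product API and call it' via tool calling.
# Requires an OPENAI_API_KEY; the `create_slide` tool is a hypothetical stand-in.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "create_slide",
        "description": "Create a presentation slide with a title and bullet points.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "bullets": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["title", "bullets"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Make a slide summarizing Q3 sales."}],
    tools=tools,
)

call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
# The application would now execute the real API request the model selected.
```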

As for the latter, models like RT-X by Google DeepMind leverage language models to create Vision-Language-Action models (VLAs), models that can perform physical actions based on written instructions.

In this case, these models are trained to learn to output not just text tokens, but also the actions the robot connected to the model has to perform.

VLAs have a very similar architecture to the one we saw with LEGO, with the key difference being the fine-tuning data, which induces the model to output actions instead of bounding boxes and timestamps, as LEGO does.
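Here is a minimal sketch of the 'actions as tokens' idea: each continuous action dimension is discretized into bins and emitted as extra tokens the model can generate. The bin count and action ranges are illustrative, not taken from any specific model.

```python
# Toy sketch of discretizing robot actions into tokens (and back).
# NUM_BINS and the normalized action range are assumptions for illustration.
import numpy as np

NUM_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed normalized action range

def action_to_tokens(action):
    """Map a continuous action vector to discrete token ids."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    bins = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (NUM_BINS - 1)
    return bins.round().astype(int)

def tokens_to_action(tokens):
    """Invert the mapping so a robot controller could execute the action."""
    return tokens / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

gripper_action = np.array([0.12, -0.45, 0.9])   # e.g., dx, dy, gripper
tokens = action_to_tokens(gripper_action)
print(tokens, tokens_to_action(tokens))          # round-trips with small quantization error
```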

Now, this is all fine and great, but how does it translate into tangible products?

In the case of digital agents, that's in the form of declarative operating systems.

What ChatGPT Started, Rabbit Could Finish

So, what is a declarative operating system?

The easiest way to grasp this concept is by checking the video from OpenAI's first developer conference, where the presenter interacts with an LLM disguised as a travel expert.

The point here is that the app itself is just the LLM, which is in charge of connecting to the different search APIs, creating the requests, and displaying the information.

In this paradigm, app creators don't need to worry about all the back-end logic behind the app; they can simply let the LLM agent perform the desired reasoning and actions when prompted by the user, leaving developers to focus on unique UI experiences and effective memory usage.
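As a toy sketch of that division of labor, imagine the developer only registers tools and the UI, while an LLM (stubbed out below) decides which tool to call. Everything here, from fake_llm_plan to search_flights, is hypothetical.

```python
# Toy 'declarative app': the developer supplies the tools; the (stubbed) LLM picks one.
def search_flights(origin, destination):
    return [{"flight": "XX123", "origin": origin, "destination": destination, "price": 199}]

TOOLS = {"search_flights": search_flights}

def fake_llm_plan(user_request):
    # A real system would ask the LLM to choose the tool and arguments from the request.
    return {"tool": "search_flights", "args": {"origin": "MAD", "destination": "SFO"}}

def run_app(user_request):
    plan = fake_llm_plan(user_request)
    result = TOOLS[plan["tool"]](**plan["args"])
    return f"Found {len(result)} option(s), from ${result[0]['price']}."

print(run_app("Find me a cheap flight from Madrid to San Francisco"))
```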

Truly amazing, but one company is trying to take it a step further into a paradigm called neuro-symbolic AI that learns in a very human way, and has presented a unique product you can already buy.
