
Deep Fusion and the New Age of Software Development

🏝 TheTechOasis 🏝

Breaking down the most advanced AI systems in the world to prepare you for your future.

5-minute weekly reads.

TLDR:

  • AI Research of the Week: CogVLM, a Revolutionary Multimodal Solution To the MLLM Problem

  • Leaders: OpenAI Shows Us the Future of Software Applications… and Much More

🤯 AI Research of the week 🤯

A group of researchers has presented a new model that revolutionizes current multimodal AI design standards while blowing almost all the competition out of the water.

The paper introduces an innovative concept, Deep Fusion, a new design that mitigates the biggest problem faced by Multimodal Large Language Models (MLLMs) today: the “shallow alignment problem”.

If it delivers on its potential, CogVLM could become a seminal research paper that draws the attention of researchers around the world and spawns a new family of MLLMs: Deep Fusion models.

The actual results? Impressive capabilities like coding math problems from images, and many more we’ll see shortly.

But first and foremost, what is the shallow alignment problem?

It couldn’t be that easy

Building an LLM is a cumbersome task.

You need a huge dataset of text documents, a team of world-class researchers, and a powerful GPU cluster. In other words, you need a ‘whole lotta money’ and talent.

And if you want to make your model commercially available, you need to make it aware of what to say or not say.

For that alone, you will also need an army of human annotators to perform Reinforcement Learning from Human Feedback (RLHF), which in turn needs even more capital.

The result?

GPT-4’s initial model took $100 million to build end-to-end, and the first version of LLaMA’s 65-billion-parameter model burned $5 million in just 21 days.

But building an MLLM… that is a totally different beast.

Not only do you have to do the above process, but you also have to train an image encoder, the piece in the architecture that processes images, and make them work together.

Hence, unless you are part of the magnificent seven, or one of those companies is fiercely funding your efforts, you simply don’t have the means to train these models.

Alphabet, Amazon, Apple, Meta, Microsoft, Nvidia and Tesla are considered the ‘magnificent seven’.

Consequently, open-source researchers, driving their efforts mainly through University funding, have to be smart about their designs.

Grafting, an elegant answer

To avoid building the complete model from scratch, the most common method of creating MLLMs is grafting: connecting an already-trained image encoder and an LLM through an adapter, usually a Q-Former (like the one we saw last week in SALMONN) or an MLP layer.

The rationale is clear. If you want to use components that are already trained, you need to align them, i.e. make them talk the same language.

To do so, once the image encoder processes the image, the resulting representation is projected into the embedding space of the LLM.
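That projection step can be sketched in a few lines of Python. The two-layer MLP and the dimensions below are illustrative assumptions for the sake of the sketch, not CogVLM’s actual sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny illustrative sizes, not CogVLM's real ones: the encoder emits
# 9 patch tokens of width 64; the LLM expects embeddings of width 128.
ENC_DIM, LLM_DIM, N_PATCHES = 64, 128, 9

# A two-layer MLP adapter mapping encoder space -> LLM embedding space.
W1, b1 = rng.normal(0, 0.02, (ENC_DIM, LLM_DIM)), np.zeros(LLM_DIM)
W2, b2 = rng.normal(0, 0.02, (LLM_DIM, LLM_DIM)), np.zeros(LLM_DIM)

def mlp_adapter(image_features: np.ndarray) -> np.ndarray:
    """Project image-encoder outputs into the LLM's embedding space."""
    hidden = np.maximum(image_features @ W1 + b1, 0.0)  # ReLU-style nonlinearity
    return hidden @ W2 + b2

image_features = rng.normal(size=(N_PATCHES, ENC_DIM))
image_tokens = mlp_adapter(image_features)
print(image_tokens.shape)  # (9, 128): patches now look like word embeddings
```

Once projected, the image tokens can simply be concatenated with the text token embeddings and fed to the LLM.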

Frontier AI models represent the world (images, text, audio,…) in the form of vectors called embeddings, as:

1. machines only work with numbers, and

2. that way they can measure similarity between concepts. The more similar two concepts are, the more similar their vectors should be, which is something a machine can calculate. This turns understanding concepts into a mathematical exercise OpenAI defines as ‘relatedness’.
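In practice, ‘relatedness’ reduces to simple vector math, typically cosine similarity. A toy sketch with made-up 3-dimensional embeddings (real models use hundreds or thousands of dimensions):

```python
import numpy as np

def relatedness(a, b) -> float:
    """Cosine similarity: close to 1.0 for related concepts, lower otherwise."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy embeddings purely for illustration.
cat    = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
bridge = [0.1, 0.2, 0.95]

print(relatedness(cat, kitten))  # high: similar concepts
print(relatedness(cat, bridge))  # low: unrelated concepts
```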

But there’s a problem.

The inescapable truth is that the weights of the LLM are not meant to process images. They are not meant to capture the salient features of images like color, the absolute and relative position of objects in the space, and so on.

To comprehend this, let’s say we ask the model: “What object is placed on the right-hand side of the image?”.

Firstly, the model will process the image through the encoder and the text through the embedding lookup matrix (LLMs don’t have text encoders; they simply learn an embedding for each word in their vocabulary during training).

The image encoder will identify all the objects pretty well: oranges, frying pans, ceiling lights, and so on. These objects and their spatial relationships will be infused into the image embedding.

Next, this embedding is projected into the LLM’s ‘world’ with the adapter and processed by the LLM’s weights.

But here’s the thing: the LLM’s weights have not been trained to process features inherent to images, only text-related features, meaning that the image-specific features are ‘lost in translation’, leading to hallucinations.

You can try to fine-tune the LLM with the image data, but that leads to catastrophic forgetting on NLP tasks, as suggested by PaLM-E (Driess et al., 2023).

But with Deep Fusion, that is not a problem anymore.

Same cost, much smarter model

Put simply, CogVLM solves the “shallow alignment problem” by adding a visual expert module.

On the left-hand side, the process is the same as before: an MLP adapter transforms the outputs of the image encoder into the same dimension as the word embeddings.

But here is where things get interesting.

If we look at the right-hand side, we see that the researchers have essentially duplicated the attention weights of the LLM.

The attention mechanism sits at the core of Transformers. For lack of a better term, it makes words in the sequence ‘talk’ to each other.

This allows the model to capture the relationships between words and, thus, capture the meaning of the text.

The key intuition here is that, since we want to avoid updating the weights of the LLM, we duplicate these weights and train only the new copies on the image features.

Therefore, these new weights will focus on the unique features of images, while the original weights that were trained on millions of text passages continue focusing purely on text.

And, more importantly, the results of both attentions are summed up, creating a new, combined representation defined as Deep Fusion by the researchers.

This new representation is like taking the image and the text and merging them into 'one thing', one unique multimodal representation that encapsulates the information from both modalities into a unique set of features.

Using the previous example, the LLM inside CogVLM can not only process the text request, but it now has specialized weights that have captured image features otherwise lost, like color and the absolute and relative positions of objects in space.

This all makes questions like “What is the object on the right-hand side of the image?” a very simple task.

In a way, Deep Fusion is analogous to how the human brain integrates visual and textual information. Just as we naturally combine what we see with relevant textual or spoken context, this model fuses visual and textual data to create a more comprehensive understanding of the entire input sequence.

This new architecture is also very computationally efficient: even though we are duplicating parameters, the number of FLOPs stays the same, because each set of parameters is used only for its own modality, image or text.
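This routing idea can be sketched with a minimal single-head attention in numpy. The sizes and random weights are illustrative (the real model uses full multi-head attention inside a trained LLM): each token is multiplied by exactly one set of QKV weights depending on its modality, which is why the FLOP count matches the text-only model, yet image and text tokens still attend to each other jointly.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # illustrative hidden size

# Original (frozen) text attention weights...
Wq_txt, Wk_txt, Wv_txt = (rng.normal(0, 0.05, (D, D)) for _ in range(3))
# ...and the visual expert: trainable copies initialized from them.
Wq_img, Wk_img, Wv_img = Wq_txt.copy(), Wk_txt.copy(), Wv_txt.copy()

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def deep_fusion_attention(tokens, is_image):
    """Route each token through its modality's QKV weights, then attend jointly."""
    q = np.where(is_image[:, None], tokens @ Wq_img, tokens @ Wq_txt)
    k = np.where(is_image[:, None], tokens @ Wk_img, tokens @ Wk_txt)
    v = np.where(is_image[:, None], tokens @ Wv_img, tokens @ Wv_txt)
    scores = softmax(q @ k.T / np.sqrt(D))  # every token attends to every token
    return scores @ v                       # one fused multimodal representation

tokens = rng.normal(size=(10, D))           # 4 image patches then 6 text tokens
is_image = np.array([True] * 4 + [False] * 6)
fused = deep_fusion_attention(tokens, is_image)
print(fused.shape)  # (10, 64)
```

During training, only the `*_img` copies would receive gradients, leaving the original text weights, and hence the model’s NLP abilities, untouched.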

Amazing results

Unsurprisingly, CogVLM fares well when compared to other similar-sized models, beating all previous state-of-the-art models except PaLI-X-55B, a model three times larger.

It still beats even that model in most cases, meaning that few models, if any besides GPT-4V, are superior to CogVLM.

Naturally, the model is capable of performing highly impressive tasks like complex counting or advanced image understanding.

Innovation, innovation, innovation

In short, CogVLM is not simply another state-of-the-art solution that marginally improves previous models.

It represents a new way of creating multimodality, one that is not only computationally efficient but also allows a fairly small model to excel at complex cross-modality tasks where traditional MLLMs struggle greatly.

Seeing CogVLM’s capabilities, I’m sure the entire world will be paying attention to this paper, and the Deep Fusion architecture could soon become synonymous with ‘multimodality’.

In the meantime, check their research paper for more info: Link to paper

🫡 Key contributions 🫡

  • CogVLM introduces Deep Fusion, a new architecture design that allows AI models to build hybrid representations that incorporate features from multiple modalities

  • It not only excels at common MLLM tasks but also greatly surpasses most MLLMs in cross-modality tasks

🔮 Practical implications 🔮

  • MLLMs have eyes and a mouth, and will soon have ears (not to be confused with ChatGPT, which isn’t audio multimodal)

  • Truly multimodal MLLMs are a step forward toward embodied intelligence, where robots will have similar sensorial capacities to humans

👾 Best news of the week 👾

👀 Neuralink’s new chip wants to cure blindness

🧐 The Financial Times discusses if GenAI will transform business

🤨 Bill Gates’ view on the future of computing with LLM agents

🦾 Controlling robots with thoughts, a reality today

🥇 Leaders 🥇

OpenAI Just Showed Us the Future of Software Applications… and Much More

On the 10th of July 2008, the world saw the birth of the iPhone’s App Store.

This event completely changed the software landscape for 15 years, and its effects are being felt to this day.

Billion-dollar companies were born from that App Store.

Snapchat, Uber, Instagram, TikTok… companies with a combined value in the trillions are a thing today thanks to the App Store.

But this week, something new has been born.

A new type of application, the declarative app, will cause an equal or bigger shift in how humans build software and, more importantly, how we interact with it.

But some of the great minds of our time are taking it a step further. Andrej Karpathy, ex-director of AI at Tesla, co-founder of OpenAI, and currently back at that very same company, summed it up best:

Seeing LLMs as chatbots is the same as seeing the early computers and saying they are calculators.

As we will see for ourselves today, we are witnessing the democratization of the human-machine interface, the democratization of software development, and the dawn of an era where all of us, independently of our technical backgrounds, can become creators and builders in the digital world.

The Tool Stack that Will Reign the World

Back in March, with the launch of ChatGPT’s plugins, I told some of my friends that this was a golden opportunity and that billion-dollar companies would be born from them.

I claimed that ChatGPT was no longer a chatbot but a platform, or at least that’s what they were trying to do.

And although recent events proved me right, I was wrong about the timing. ChatGPT plugins were, to put it mildly, ‘not it’.

They really didn’t work that well, and people had to think through excuses to use them.

Even Sam Altman acknowledged that product market fit just wasn’t there.

But OpenAI had shown us their vision, and that vision finally came to fruition this week with the announcement of GPTs, the precursors of agents.

The birth of a new type of software

The Neanderthals to Homo sapiens.

The dinosaurs to birds or crocodiles.

A new type of software. That’s what I felt when I saw Sam Altman presenting their new feature, GPTs.

GPTs are a new capability in the ChatGPT web app and API that lets you build your own custom ChatGPTs connected to third-party services; in other words, agents.

Agents aren’t new, but the key thing here is how you build them. To build your custom new ChatGPT, no code is needed.

In what can be considered the first of its kind, the ChatGPT interface and playground allow you to create your GPTs by talking to them... and I mean literally.

As you can see in the image above, I’ve created an ‘Article Illustrator’, a GPT that takes in my article and suggests possible visuals that could be added to it.

It even suggests the icon we can give it.

After that, I had a discussion with it about how it should behave and what the expectations are.

Next, it was time to give it a try!

I simply fed it the article you previously read about CogVLM, and it suggested several images, including:

Sound familiar? It’s the image you saw previously!

Of course, this is a very, very simple GPT, but building what I did requires absolutely no technical knowledge.

But if you want to build more advanced GPTs, it’s quite easy too.

You can add a knowledge base, and you have native connectivity to DALL·E 3 (used in my case), Web Browsing in case you need your GPT to be up to date, and even Code Interpreter, if you need your GPT to write and execute Python code for data analysis, machine learning, or even help with your Excel spreadsheets.

But if you still need to get more advanced, you can add other third-party services.

For the latter, you will require some technical expertise to make API requests, but unless you’re dealing with a poorly maintained API, that’s not hard at all.

Tip: Leveraging ChatGPT’s new JSON mode, which guarantees syntactically valid JSON output, I suggest you provide the API documentation to ChatGPT to help you build those API requests.

Chances are you will still need some minor technical knowledge, but ChatGPT will do the heavy lifting.

This UI is great for consumer use cases, as the number of small agents you can embed into your life is now limited only by your imagination.

But for enterprise use cases, you are going to need the Assistant API that OpenAI announced too.

It’s basically the same concept but tailored for more advanced, probably enterprise-grade, use cases.

In my opinion, the Assistant API will become the standard best practice for deploying Generative AI models into companies, and if you’re looking to try it, you can do so here.
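As a rough sketch of what configuring an assistant involves, here is the kind of request body you would pass to `client.beta.assistants.create` in the openai Python SDK (v1.x). The assistant name, instructions, and model string are illustrative assumptions, and no API call is made here since that requires an API key:

```python
# Illustrative Assistants API configuration; with the openai SDK you would
# pass it as client.beta.assistants.create(**assistant_config). Here we only
# build and inspect the payload.
assistant_config = {
    "name": "Spreadsheet Helper",        # hypothetical assistant
    "instructions": "Analyze uploaded spreadsheets and answer questions about them.",
    "model": "gpt-4-1106-preview",       # model name current at the time of writing
    "tools": [{"type": "code_interpreter"}],  # lets the assistant run Python
}

print(sorted(assistant_config))  # ['instructions', 'model', 'name', 'tools']
```

From there, the API handles threads, file uploads, and tool execution for you, which is what makes it attractive for enterprise deployments.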

As you may have realized by now, this is a radically different way of building applications. Put simply, this is a new type of application: the declarative application.

But what are they really?

Well, for starters, they are the apps of the future and, crucially, the real democratization of the human-machine interface.
