
OpenAI's GPT-4o, Google's Project Astra, & More

🏝 TheTechOasis 🏝

Welcome to the newsletter that keeps you updated on the latest developments at the cutting edge of AI by breaking down the most advanced systems in the world & the hottest news in the industry.

10-minute weekly reads.

💎 Big Announcement 💎

I know you’re in full AI overload and tired of low-impact daily newsletters with loads of hype and little substance to learn from.

But worry not: TheWhiteBox is a community for high-quality, highly curated AI content without unnecessary hype or ads, spanning research, models, markets, and AI products.

All in digestible language. But why would you want that?

  • Cut the Clutter: Value-oriented, focusing on high-impact news and clear intuitions to extract from them (why you should care).

  • Connect with Peers: Engage in valuable discussions with like-minded AI enthusiasts from top global companies.

  • Exclusive Insights: As the community grows, gain access to special content like expert interviews and guest posts with unique perspectives.

With TheWhiteBox, we guarantee you won’t need anything else.

No credit card information required to join the waitlist.

🚨 Week’s Update 🚨

Hello once again!

It has been a tremendous week, with the star presentations of OpenAI and Google. Although I focus on the ‘AI side’ in much more detail below, here are some overall highlights.

Google finally announced the model to compete with OpenAI’s Sora, named Veo. Not much is known, but the videos look very impressive.

They did not mention anything regarding video length (Sora reaches one minute), which might suggest that they aren’t quite there yet in terms of performance.

Google is also finally showing signs of changing the game for search. In a short video, they clarify where they stand: They envision a world where humans do the asking and AIs do the searching.

They call these ‘AI Overviews,’ with Gemini using ‘multi-step reasoning’ to break down your request, do all the searching for you, and give you relevant answers back.

This is literally a copy of Perplexity’s approach, which is a clear win for all of us… except websites. I feel this will dramatically reduce ad revenue for most websites in the world, as humans aren’t doing the clicking, AIs are. Google disagrees with this statement, as their head of Search believes the contrary will happen.

Hard to see how, though. Anyway, the Internet business model is about to change fast, and I fear it will become ‘pay-to-win’: you pay so that your content is picked up by the AI.

This condensed video by The Verge summarizes all the announcements.

Moving on to OpenAI, I explain in full detail below why its new model, ChatGPT-4o, is special. But you can see it with your own eyes, too, with this video, where it generates an entire game based on a screenshot.

But they also shared other interesting news, like a new desktop app that lets you use the model as your copilot, and the fact that the model now shows character consistency in image generations by default.

Long story short, the takeaways are that they have reduced latency dramatically, that the model can reason across multiple modalities at once (see below why this is huge), and that it will be free to use (capped to a limit). If you want a full review of the announcements, read my Medium article.

Moving on to another of the great AI companies, Anthropic is finally landing in Europe. And it’s raising money in the process (again). And they aren’t the only ones, as Mistral is allegedly looking for a $600 million investment, too.

While Anthropic is reportedly having trouble increasing revenue, Mistral is extremely ‘GPU-poor’ (only 1.5k H100s), as acknowledged by its CEO. With OpenAI and Google’s announcements, the pressure to deliver is huge, so this isn’t surprising.

My thoughts?

I like Mistral’s odds better. Anthropic has to compete with OpenAI and Google at the frontier; Mistral doesn’t, as it is playing the efficiency game (the best model at a given size), is beloved by European regulators, and is heavily championed by the French government.

Moving on, another day, another robot company. Kyle Vogt, founder of autonomous driving company Cruise, has raised $150 million to launch the Bot Company.

Unlike more generalist approaches, this company’s sole focus will be home chores.

Finally, on the AI-and-healthcare front: CRISPR is a gene-editing technique in which a specific protein cuts into the DNA, eliminating the ‘bad’ gene and inserting the correct sequence. This technique can, among other things, cure diseases.

With EVO, AlphaFold 3, GNOME, Med-Gemini, or Profluent, it’s really hard to not be extremely bullish on the intersection between AI and healthcare.

💎 Sponsored content 💎

Learn How AI Impacts Strategy with MIT

As AI technology continues to advance, businesses are facing new challenges and opportunities across the board. Stay ahead of the curve by understanding how AI can impact your business strategy.

In the MIT Artificial Intelligence: Implications for Business Strategy online short course you’ll gain:

  • Practical knowledge and a foundational understanding of AI's current state

  • The ability to identify and leverage AI opportunities for organizational growth

  • A focus on the managerial rather than technical aspects of AI to prepare you for strategic decision making

🧐 You Should Pay Attention to 🧐

  • What Makes ChatGPT-4o Special?

  • Google’s Project Astra & Gemini Surprises

🤖 What Makes ChatGPT-4o Special 🤖

Finally, OpenAI launched a new model over a year after GPT-4. This time, it’s also a GPT-4 variant, but with true multimodality and super efficient inference.

Despite being made generally available for free, it’s notably superior to the previous state of the art across multiple benchmarks and includes very powerful features, like real-time video processing, that are thought to be very expensive to run.

So, how does all this make sense, and what does it mean for you going forward?

Multimodal In, Multimodal Out

For starters, we need to understand what makes ChatGPT-4o special. And the answer is that this is the first truly “multimodal in/multimodal out” model.

In layman’s terms, you can send the model audio, text, images, or video, and the model will respond with text, images, or audio (not video), depending on the requirement.

But haven’t previous versions of ChatGPT or Gemini already generated images or audio? Yes, but through standalone, exogenous components. And that changes everything.

But how?

Previously, whenever you sent audio to a model, this was the process:

In this procedure, the tone, rhythm, prosody, conveyed emotions, and the key pauses derived from natural speech were lost, as the speech-to-text component, the Whisper model, transcribed the audio to text the LLM could then process.

Then, the LLM would generate a text response and send it to another component, a text-to-speech model, that would generate the speech response eventually delivered.

Naturally, as human expression through language is much more than the actual words, a lot of vital information was lost, and the latency was far from ideal, as information had to be passed between separate components.
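To make that information loss concrete, here is a minimal sketch of the cascaded pipeline in Python. The function names and return values are hypothetical placeholders for illustration, not any actual API:

```python
# A toy model of the OLD cascaded pipeline: three separate models,
# two lossy hand-offs, and extra latency between components.
# All functions below are hypothetical stand-ins.

def transcribe(audio: bytes) -> str:
    # Speech-to-text (a Whisper-style model) returns ONLY the words.
    # Tone, rhythm, prosody, and pauses are discarded at this step.
    return "I want to kill you!"

def llm_respond(text: str) -> str:
    # The LLM sees plain text, with no acoustic context at all.
    return f"Response to: {text}"

def synthesize(text: str) -> bytes:
    # Text-to-speech generates audio from the response text alone.
    return text.encode()

def cascaded_pipeline(audio: bytes) -> bytes:
    # Whatever nuance the speaker conveyed never reaches the LLM.
    return synthesize(llm_respond(transcribe(audio)))

print(cascaded_pipeline(b"...raw audio...").decode())
```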

But with ChatGPT-4o, although the components are highly similar, everything occurs in the same place:

At first, it may seem like not much has changed. Although the components are technically very similar, how they share information has changed. Now, the LLM sees a semantic representation of the speech instead of just the text.

In layman’s terms, instead of seeing just the text “I want to kill you!” the model now also receives information such as:

```
speech: "I want to kill you!"
emotion: "happy"
tone: "joyful"
```

This way, the model captures the nuances of the message, not just the plain text.

Although I used a JSON to explain, what the speech encoder generates for the LLM is a set of vector embeddings that capture the emotion, tone, rhythm, and other cues of speech besides the actual text. For a deep-dive on embeddings, read my blog post.

Therefore, the LLM generates a response that is much more grounded in the actual situation thanks to the fact the overall model also captured key characteristics in the message besides words.

This response is then sent to the audio decoder, which most probably generates a Mel spectrogram, which is then sent to the vocoder for audio generation.

You can think of spectrograms as a way to “see” sound. This short video by the Science Center of Iowa explains it pretty well.

But what is a mel spectrogram? Mel spectrograms are common in speech processing because the mel scale mimics the human ear’s response to sound, spacing frequencies by perceived pitch rather than raw Hz.
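The mel scale itself is just a formula. A minimal sketch of the standard (HTK-style) Hz-to-mel conversion, showing how it compresses high frequencies the way the human ear does:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Standard (HTK-style) Hz-to-mel conversion."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Equal steps in Hz are NOT equal steps in mel: the scale compresses
# high frequencies, mirroring human pitch perception.
print(hz_to_mel(1000) - hz_to_mel(0))     # ~1000 mel
print(hz_to_mel(9000) - hz_to_mel(8000))  # far less than 1000 mel
```

A mel spectrogram is simply a spectrogram whose frequency axis has been re-binned onto this scale.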

Besides audio, the principle still applies for image processing and generation or video processing, as they have packed all the components into one single model.

In summary, ChatGPT-4o now captures information from other modalities besides text, including key cues in audio, images, or video, to generate much more relevant responses. Simply put, it no longer cares how the data comes in.

However, I probably still haven’t managed to convince you how crucial this change has been. So let me do this right now.

Semantic Space Theory

One of the most beautiful concepts in current AI is the latent space, where the model’s understanding of the world resides. Simply put, when we say that our model is multimodal, we go to the latent space to see if that’s really the case.

Whenever ChatGPT-4o sees an input, no matter its original form, it ends up as a compressed representation. In other words, the model transforms the input in a way that still captures the key attributes of the data but can be processed by machines, which, remember, can at their core only interpret numbers.

In this latent space, one principle rules: relatedness. Just as gravity rules everything in our physical world, semantic similarity governs everything in the world of MLLMs.

For the average Joe, this means that, in the latent space, semantically similar things are closer, and dissimilar concepts are pushed apart. ‘Dog’ and ‘cat’ share several attributes (animals, mammals, domestic, etc.); thus, their representations will be similar.

As we mentioned earlier, these representations take the form of vectors. This way, we represent any world concept in the form of numbers (required for machines to process) and by having them in vector form we can measure how mathematically similar they are.

Simply put, we transform the problem of understanding our world into a mathematical calculation (i.e., if the ‘dog’ and ‘cat’ vectors are similar, they must be similar concepts in real life).
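That calculation is typically cosine similarity between embedding vectors. A minimal sketch with made-up three-dimensional embeddings (real models use hundreds or thousands of dimensions; the values below are purely illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1 = same direction,
    # 0 = unrelated, -1 = opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; dimensions loosely read as (animal, domestic, vehicle).
dog = [0.90, 0.80, 0.10]
cat = [0.85, 0.90, 0.05]
car = [0.05, 0.10, 0.95]

print(cosine_similarity(dog, cat))  # close to 1: similar concepts
print(cosine_similarity(dog, car))  # much lower: dissimilar concepts
```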

That’s basically all you need to know regarding how AI models see our world.

from an image to a vector

However, the concept of ‘dog’ can be represented in various ways: through text, an image of a husky, or a bark. And this is the fundamental reason why we want true multimodality.

Beforehand, to ChatGPT, a dog was literally the word ‘dog.’ Now, audio, images, text, and video are natively part of the model. Thus:

  • The model now knows that an image of a golden retriever is a ‘dog’,

  • Knows an audio of a malinois barking is also a ‘dog’,

  • A video of a labrador running also represents a ‘dog’,

And so on. With multimodality, the model’s understanding of the world becomes similar to how humans interpret it. Consequently, it’s unsurprising that the model is now ‘smarter’, as it can now reason across all modalities equally.

But what do I mean by ‘reasoning across multiple modalities’?

If we use Meta’s ImageBind as an example, one of the first research papers to aim for truly multimodal latent spaces, the model develops a complex understanding of world concepts.

Using the previous example, if we provide the model with an image of a dog in a pool and the sound of a dog barking, the model correctly identifies the source of such sound:

You can also add an image of a clock and the sound of church bells, and the model is capable of identifying images of church clocks:

But how is ImageBind doing this? As you can probably guess, it simply computes the representation of each data type and measures the distance between vectors.
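A minimal sketch of that idea: once images and audio live in one shared latent space, cross-modal retrieval is just a nearest-neighbour search. All embedding values below are made up for illustration:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical embeddings, already projected into ONE shared latent space.
image_embeddings = {
    "dog_in_pool.jpg":  [0.90, 0.10, 0.20],
    "church_clock.jpg": [0.10, 0.90, 0.30],
}
audio_query = [0.85, 0.15, 0.25]  # embedding of a barking sound

# Cross-modal retrieval = nearest neighbour by cosine similarity.
best = max(image_embeddings,
           key=lambda name: cosine(audio_query, image_embeddings[name]))
print(best)  # dog_in_pool.jpg
```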

Long story short, ChatGPT-4o isn’t about giving models more power, but about making them reason better. This adds another layer, cross-modal intelligence, which indirectly makes them much smarter, and more ‘human’, than before.

What We Think

The achievement of true multimodality from OpenAI has sent a stark message to the world:

Without making the model’s backbone (the LLM) more intelligent per se, a model that can reason across multiple modalities is inevitably more intelligent, as the model not only has more capabilities but is also capable of transferring knowledge across the different data types.

Humans' capacity to use all their senses is considered a key part of intelligence, and AI aims to claim that capability, too.

As a great perk, it also allows models to become much more efficient in inference (setting aside particular efficiencies they could have applied). The communication overhead of combining multiple exogenous components is eliminated, making the model much faster.

🔍 Google’s Project Astra & Gemini Surprises 🔎

As mentioned, Google also had a major event, Google I/O.

Importantly, it was Google’s opportunity to clap back at OpenAI’s event the day before. But did they deliver?

Besides the announcement of Veo we mentioned earlier, the CEO of Google DeepMind, Demis Hassabis, announced a series of updates across the Gemini family of AI models.

  • He introduced the new Gemini 1.5 Flash, a lightweight model designed for speed and efficiency,

  • increased the context window for Gemini 1.5 Pro,

  • and unveiled Project Astra, representing their vision for AI assistants' future.

Fighting Latency

As for the former, the Gemini 1.5 Flash model is optimized for high-volume, high-frequency tasks while featuring a long context window of 1 million tokens.

It allegedly excels at summarization, chat applications, image and video captioning, and data extraction from long documents and tables. And despite being lighter than the 1.5 Pro model, it maintains impressive multimodal reasoning capabilities.

But you may ask, how have they trained a model so small yet so similar in capabilities to larger models?

Demis’ blog mentions that this efficiency is achieved through "distillation," where essential knowledge from a larger model is transferred to a smaller one.

Although he did not provide more detail, we know exactly how that works. They use two levers:

  • The training dataset is prepared based on responses generated by the larger model (the teacher)

  • They add an extra term to the loss function to measure distribution divergence

The first point is obvious. If we want a smaller model (the student, Flash in this case) to imitate a larger one, a great start is forcing the student to output responses similar to the teacher’s.

However, the second one is less intuitive but still crucial.

If we limit ourselves to rote imitation, the chances the student will memorize the output style are high. Importantly, when faced with more complicated data, the limitations of the smaller model really shine through.

Thus, forcing the student to learn the same probability distribution is akin to saying the model learns to interpret data and handle uncertainty the same way the teacher does, ensuring the reasoning processes are also similar.

Using the first point only would be like teaching a kid maths by telling them to simply memorize whatever the teacher writes. The second point forces the smaller model to understand what it’s generating.
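Putting the two levers together, a distillation loss is typically a weighted sum of a cross-entropy term on the teacher’s output and a divergence term between the two probability distributions. A toy sketch (the logit values and `alpha` weighting are made-up assumptions; Google has not published Flash’s training details):

```python
import math

def softmax(logits: list[float]) -> list[float]:
    # Numerically stable softmax: logits -> probability distribution.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p: list[float], q: list[float]) -> float:
    # KL(teacher || student): how far the student's distribution
    # diverges from the teacher's (0 when they match exactly).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token logits over a 4-token vocabulary (illustrative values).
teacher_probs = softmax([2.0, 1.0, 0.1, -1.0])
student_probs = softmax([1.8, 1.1, 0.0, -0.9])

# Lever 1: cross-entropy against the token the teacher actually picked.
target = teacher_probs.index(max(teacher_probs))
ce_loss = -math.log(student_probs[target])

# Lever 2: the extra term matching the FULL distributions, not just the pick.
kl_loss = kl_divergence(teacher_probs, student_probs)

alpha = 0.5  # tunable weighting between the two terms
total_loss = (1 - alpha) * ce_loss + alpha * kl_loss
print(total_loss)
```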

Think of Gemini 1.5 Flash as a smaller version of Gemini 1.5: surely a little less capable, but blazing fast, which is extremely important in cases where reasoning capabilities aren’t everything and real-time responses are crucial.

Ginormous Context Window

But if there’s a metric Google is proud of, it is the context window of their LLMs, which was already at least five times bigger than everyone else’s.

The context window represents how much data a model can take up at once.

Now, they decided to double its size to 2 million tokens, or between 1.4 and 1.5 million words (depending on the tokenizer used). The differences, when compared to the competition, are substantial:

Although context windows aren’t everything, they are extremely important for two reasons:

  1. Some data types demand extremely long sequences, like DNA sequencing or long video processing

  2. LLMs’ performance increases with longer context

While the first point is just a fact, as some use cases aren’t available unless you have a large context window, the second point might be less obvious.

One thing in common across all frontier AI models is that they are sequence-to-sequence models. In other words, they take a sequence in, and they give you a sequence back.

In text terms, to generate the latter, the model uses the words in the input sequence as the only context to generate the next word. Simply put, ChatGPT or Gemini are functions that take in a set of words and predict the next.

Consequently, the ‘quality’ of the response is proportional to the quality of the input. In other words, the more context the model is given, the better the outcome.

Additionally, longer context windows mean we can feed the model more context-relevant data into the prompt to improve accuracy.

For instance, if we want a model that tells us whether we’re compliant with the company’s business practices, a long context window lets us simply feed the entire business-practices documentation into the prompt.
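The sequence-in, sequence-out loop described above can be sketched with a toy next-word table (real LLMs learn these probabilities from data; this hard-coded version only illustrates the loop):

```python
# A toy "next-word predictor": the table below stands in for the learned
# function that real LLMs apply at every step.
next_word = {
    "the": "dog",
    "dog": "runs",
    "runs": "fast",
}

def generate(prompt: str, max_new_words: int = 3) -> str:
    words = prompt.split()
    for _ in range(max_new_words):
        last = words[-1]
        if last not in next_word:
            break  # no continuation known for this context
        # Each predicted word is appended and becomes context
        # for the next prediction: sequence in, sequence out.
        words.append(next_word[last])
    return " ".join(words)

print(generate("the"))  # the dog runs fast
```

A longer context window simply means the model can condition each of these predictions on more preceding tokens, which is why response quality scales with the context provided.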

However, as always, Google saved the best for last, Project Astra.

A Next-Generation Assistant

After ChatGPT-4o’s impressive release as a heavily anthropomorphized AI, there were huge expectations for Google.

And they kind of delivered. Project Astra is a Gemini-based multimodal product that can be used on smartphones, or even through glasses, to provide real-time support.

As shown in the presentation video, the model effectively resolves any questions the user sends it in close-to-instant time, some of them as varied as code understanding, graph support, or even giving a funny name to a golden retriever and his toy.

You can consider Project Astra the equivalent of GPT-4o: extremely low latency for extremely high utility. And in case you’re wondering: yes, you will require an Internet connection, as both are cloud-based solutions.

What We Think

Overall, Google's presentation was strong. Besides the ‘AI-only’ improvements we covered, they introduced relevant AI enhancements to their search and workspace products, as discussed above.

Regarding search, they are clearly following Perplexity’s footsteps, accepting a future where humans do the asking and AIs do the searching. In my view, this opens serious questions on the economic viability of many companies that rely on their websites and in-site ads for survival.

As a caveat, especially when compared with OpenAI, Gemini is still multimodal in but not multimodal out. In other words, in terms of generation, it’s still text-only, as to generate images or audio it requires additional models, like Imagen 3. Going back to latent spaces; yes, the complete system is multimodal in outputs, but the latent space isn’t.

For that reason, I feel that OpenAI might have a slight edge right now.

🧐 Closing Thoughts 🧐

If we need to highlight something specifically this week, ChatGPT-4o and Project Astra take the spotlight as proof that we have finally overcome the biggest pain points in GenAI products: cross-modal reasoning and latency. Finally, AI virtual assistants are ready to become a reality.

As a final reflection, in terms of intelligence, we have been pretty much stuck in the same state of the art for over a year (and we still are). Based on this and the messages sent by executives at both labs, I bet they are closer than ever to unveiling the next frontier: long inference models.

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]