Unraveling Google's super AI week. Gemini and the super coding model AlphaCode2

🏝 TheTechOasis 🏝

Breaking down the most advanced AI systems in the world to prepare you for your future.

5-minute weekly reads.


  • AI Research of the Week: The Day ChatGPT was Finally Defeated

  • Leaders: AlphaCode2, Google’s proof that Coding will Never be the Same

🤯 AI Research of the week 🤯

The day has finally come.

GPT-4, at least in terms of benchmarks, is no longer the best-in-class model.

Gemini Ultra, the biggest model in Google’s Gemini family of Large Language Models (LLMs), or dare I say, MLLMs (for Multimodal), has claimed the throne.

Gemini Ultra completely obliterates ChatGPT’s best model, GPT-4, by beating it in 30 out of 32 popular benchmarks.

Sadly, not all that glitters is gold: several issues have overshadowed some of Google’s claims, to the point that the company has been accused of faking some of the announcements.

A perfectly crafted marketing campaign, or is Google truly in the lead now?

Let’s find out.

Multimodal at the Core

From your pocket to the world

Developed by Google, Gemini comes in three main sizes: Ultra, Pro, and Nano.

  1. Gemini Ultra: This is the most capable model in the family, designed for highly complex tasks. It delivers state-of-the-art performance across a range of benchmarks, including reasoning and multimodal tasks.

Gemini Ultra, to be released in Q1 2024, is the model that competes with, and improves on, GPT-4.

  2. Gemini Pro: Optimized for performance in terms of cost and latency, this model demonstrates strong reasoning performance and broad multimodal capabilities across various tasks.

In an apples-to-apples benchmark comparison, Gemini Pro shows results similar to GPT-3.5 Turbo, beating it most of the time.

  3. Gemini Nano: This model targets on-device deployment and comes in two versions, Nano-1 and Nano-2, designed for low and high-memory devices respectively. It is notable for its efficiency and is trained by distilling from larger Gemini models.

Nano-1 has 1.8 billion parameters and Nano-2 has 3.25 billion, both quantized to 4-bit precision. At half a byte per parameter, the weights alone take roughly 0.9 GB and 1.6 GB respectively, making them usable on many smartphones.
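The memory figures follow from simple arithmetic: parameters times bits per parameter, divided by 8 to get bytes. Here is a minimal sketch of that calculation. The function name is made up for illustration, and it only counts the weights, so activations and any runtime overhead are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


# Parameter counts and 4-bit precision are the figures quoted in the text.
nano_1_gb = weight_memory_gb(1.8e9, 4)
nano_2_gb = weight_memory_gb(3.25e9, 4)

print(f"Nano-1: {nano_1_gb:.3f} GB")  # Nano-1: 0.900 GB
print(f"Nano-2: {nano_2_gb:.3f} GB")  # Nano-2: 1.625 GB
```

For comparison, the same function with 16 bits per parameter gives 3.6 GB for Nano-1, which is why the 4-bit precision matters so much for on-device use.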

But what makes Gemini special is its multimodal approach.

From the ground up

Gemini represents the first frontier model—a best-in-class model—that is truly multimodal from the ground up.

As explained by Google, Gemini has been trained not only with loads of text, but text interleaved with images, audio, and even video, and is also capable of generating both images and text.

Most common multimodal solutions take a different route: an LLM is frozen and the representations (vector embeddings) of other modalities are projected into the LLM’s input space, a process called grafting. In Gemini’s case, all components are trained simultaneously, so the model learns a global representation of features no matter their modality.

In layperson’s terms, a description of coffee, the sound of coffee being brewed, and an image of a cup of coffee will share very similar embeddings, signaling to the model that they all refer to the same concept of ‘coffee’.

Also, unlike grafting, where weights that have been initially trained to work only with text have to deal with tokens from other modalities projected into their space, here the weights will deal with what they saw during training, allowing the capture of the important features from each modality.
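To make grafting concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the dimensions, the random values, and the names are hypothetical, and real systems use learned encoders with far larger vectors. The point is only the shape of the idea: image embeddings pass through a small projection so they can sit next to text-token embeddings that the frozen LLM already understands.

```python
import random

random.seed(0)

IMAGE_DIM = 4  # hypothetical output size of a frozen image encoder
LLM_DIM = 6    # hypothetical embedding size of the frozen LLM


def random_matrix(rows, cols):
    """Stand-in for real embeddings or weights: a rows x cols list of floats."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def project(vector, matrix):
    """Linear projection: a vector of len(matrix) values times a rows x cols matrix."""
    cols = len(matrix[0])
    return [sum(v * matrix[i][j] for i, v in enumerate(vector)) for j in range(cols)]


# In grafting, this projection is the part that gets trained;
# the LLM weights that consume its output stay frozen.
projection = random_matrix(IMAGE_DIM, LLM_DIM)

image_patches = random_matrix(3, IMAGE_DIM)  # 3 patch embeddings from the image encoder
text_tokens = random_matrix(5, LLM_DIM)      # 5 text-token embeddings, already in LLM space

# Project the image patches into the LLM input space and prepend them to the text.
image_as_tokens = [project(patch, projection) for patch in image_patches]
llm_input = image_as_tokens + text_tokens

print(len(llm_input), len(llm_input[0]))  # 8 6
```

In a model trained multimodally from the ground up, there is no frozen text-only component on the receiving end: the weights that read `llm_input` would have seen every modality during training.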

Grafting is the easiest and cheapest way of creating multimodal solutions, but gives ‘suboptimal’ results, as mentioned by Oriol Vinyals, VP of Research at Google DeepMind, in the Gemini presentation video.

Of all, video has generated the most hype, but there’s a catch.

The faked video

Gemini models process video by encoding it as a sequence of frames within a large context window.

It can also handle video data with variable input resolutions, enabling more computational resources to be allocated to tasks that require fine-grained understanding.

For instance, as you can see below, the model can process the image sequence and conclude that the person is imitating the famous Matrix scene.

So far, so good.

But the image you are seeing above and what was presented in Google’s ‘Hands-on with Gemini’ video are totally different things, igniting a huge controversy.

In the video, Gemini seamlessly interacts with the user in real time, with no latency, and with very simple prompts, which would be nothing short of extraordinary.

But as pointed out by many, the video was faked.

As Oriol himself explained in a tweet, the prompts were shortened and latency was eliminated for a more impactful result to, in their own words, “inspire developers”.

Laughable. Honestly, why is it so hard to just tell the truth? 

However, they still did show some other pretty amazing stuff.

The First PRM

As we discussed in last week’s Leaders segment, back in May OpenAI released a paper showing that Process-Supervised Reward Models (PRMs) greatly improve performance on reasoning tasks compared with the outcome-supervised reward models (ORMs) that are standard in current RLHF pipelines.

Reinforcement Learning from Human Feedback (RLHF) helps instruction-tuned generative pre-trained transformers like ChatGPT or Gemini improve factuality, among other things. Within that pipeline, PRMs review every step in the model’s reasoning process instead of simply reviewing the final outcome, which is what ORMs do.

This is key in cases like mathematics, because if you make an intermediate mistake in the calculation, it is improbable that you will get the correct final answer.

Surprisingly, it seems that Gemini is the first GenAI model to implement this reward modeling paradigm, as suggested by this video, although Google has not acknowledged it.
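The difference between the two reward styles is easy to see on a toy example. The sketch below is an illustration under loud assumptions: real PRMs and ORMs are learned models, while here a hard-coded arithmetic checker stands in for them, and all function names are made up. An outcome-style reward only looks at the final number; a process-style reward scores every step, so it can also say where things went wrong.

```python
import operator
from typing import List, Optional, Tuple

# A step is (left operand, operator symbol, right operand, claimed result).
Step = Tuple[int, str, int, int]

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def step_is_correct(step: Step) -> bool:
    """Toy stand-in for a learned step verifier: recompute the arithmetic."""
    left, symbol, right, claimed = step
    return OPS[symbol](left, right) == claimed


def outcome_reward(steps: List[Step], expected_final: int) -> float:
    """ORM-style signal: one score, based only on the final claimed result."""
    return 1.0 if steps[-1][3] == expected_final else 0.0


def process_rewards(steps: List[Step]) -> List[float]:
    """PRM-style signal: one score per reasoning step."""
    return [1.0 if step_is_correct(step) else 0.0 for step in steps]


def first_wrong_step(steps: List[Step]) -> Optional[int]:
    """Index of the first incorrect step, or None if every step checks out."""
    for index, reward in enumerate(process_rewards(steps)):
        if reward == 0.0:
            return index
    return None


# A worked solution of 16 x 21 with a slip in the very first step.
solution = [
    (16, "*", 20, 310),   # wrong: 16 * 20 is 320
    (16, "*", 1, 16),     # correct
    (310, "+", 16, 326),  # correct given the earlier (wrong) 310
]

print(outcome_reward(solution, expected_final=336))  # 0.0
print(process_rewards(solution))                     # [0.0, 1.0, 1.0]
print(first_wrong_step(solution))                    # 0
```

The outcome score says only that the answer is wrong. The per-step scores show that the last two steps were fine and the damage was done at step 0, which is the kind of feedback a student (or a model being trained) can actually use.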

For instance, the model is capable of pointing out the specific step in the process where the student went wrong.

More impressively, it’s capable of generating test examples based on the issue identified to help the student improve:

Very impressive stuff to say the least.

Google is ready to compete

Despite all this, Google's Gemini has received mixed reviews.

For example, critics note that it only now matches OpenAI's GPT-4 level, even though that model has been available since March.

However, the real takeaway is Google's demonstrated ability to compete in this space, a previously uncertain fact.

Crucially, the key release here was never Gemini but AlphaCode2, and its simultaneous release alongside Gemini isn't coincidental.

It signals a shift in focus from solely Multimodal Large Language Models (MLLMs) to a combination of MLLMs and sampling and search algorithms, fostering System 2 thinking.

With AlphaCode2, Google isn't just catching up; it's redefining the AI narrative towards a synergy of MLLMs and search, a narrative where only Google is officially playing right now, as OpenAI’s Q* model is only a rumor at this point.

But what is AlphaCode 2 and why is it so important?

More on that below…

🫡 Key contributions 🫡

  • Google’s Gemini model family proves to the world that Google is finally ready to compete for the AI throne

  • Through three different models, Google offers the first frontier multimodal model in four different modalities and the successful implementation of PRM models at scale

🔮 Practical implications 🔮

  • With models like Gemini, GenAI solutions stand to disrupt every industry you can think of, as they delve not only into the world of text, but also images, audio, and even video

  • Gemini can be used to generate synthetic data at scale, and has great use cases for industries like education, helping teachers create and supervise student development

👾 Best news of the week 👾

😍 Meta and Stanford present CHOIS, their text-to-interaction model

🤔 ChatGPT vs BARD. Who wins?

🥇 Leaders 🥇

AlphaCode2, Google’s weapon to claim the narrative in AI

While the entire world is talking about Gemini, Google swiftly released another model, AlphaCode 2, the latest version of their competitive programming model.

It’s a different type of beast. A new type of model.

Combining Gemini Pro and a fascinating sampling mechanism we will get into shortly, it is trained to solve competitive programming challenges that require advanced coding skills and super-strong reasoning capabilities.

Google estimates that it already performs at the 85th percentile of participants, meaning that it outperforms 85% of what can be considered the best coders in the world.

Despite its amazing capabilities, Google seemingly lost an opportunity to give it the spotlight it deserves by releasing it alongside the model that “surpasses GPT-4”.

However, they surely knew this.

Then, why did they do that?

Because they are implicitly shifting the narrative of this whole story to their benefit.

Google is changing the rules of the game, and you will soon understand why.

The Creation of Artificial System 2 Thinking

Last week, the entire AI space couldn’t get enough of the Sam Altman drama and, more importantly, the secret model OpenAI was creating, named Q*.

This model, not released but acknowledged by Sam Altman himself, proved that OpenAI was following the same thought process as Yoshua Bengio, Andrej Karpathy, or Google DeepMind regarding the “next big thing” in AI, the creation of the first AI model capable of performing “System 2 thinking”.

But what is that?

Thinking fast and slow 

Daniel Kahneman, in his book Thinking, Fast and Slow, presents a model of the human mind that describes two distinct ways in which we form thoughts and make decisions, referred to as System 1 and System 2.

  1. System 1: This is the fast, automatic, and often subconscious way of thinking. It's responsible for quick judgments and immediate decisions, often based on intuition and gut reactions. This system is also more prone to biases and errors in judgment, especially in complex or unfamiliar situations.

  2. System 2: It’s the slow, deliberate, and more conscious way of thinking. It's involved in effortful mental activities that require attention, such as complex calculations, logical reasoning, and critical thinking. System 2 is more reliable for making thoughtful and rational decisions, but it is also more resource-intensive and slower.

But why does this matter?

Put simply, it elegantly explains why LLMs are so error-prone when facing complex reasoning tasks, even the simplest of mathematics.

To further understand this concept, if I ask you “What’s 2+2?”, you will immediately answer “4”, with no hesitation.

But if I ask you “16×21”, unless you know that by heart, you are going to take some time to perform the different steps in the calculation in your head before answering.
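Those "different steps in your head" can be written down explicitly. The sketch below is just one way a person might break the multiplication apart (splitting the second number into tens and ones); the function name is made up for illustration.

```python
def multiply_in_steps(a: int, b: int):
    """Multiply a by b by splitting b into its tens part and its ones part."""
    tens, ones = divmod(b, 10)
    partial_tens = a * tens * 10
    partial_ones = a * ones
    total = partial_tens + partial_ones
    steps = [
        f"{a} x {tens * 10} = {partial_tens}",
        f"{a} x {ones} = {partial_ones}",
        f"{partial_tens} + {partial_ones} = {total}",
    ]
    return total, steps


total, steps = multiply_in_steps(16, 21)
for line in steps:
    print(line)
# 16 x 20 = 320
# 16 x 1 = 16
# 320 + 16 = 336
```

Three small, easy steps replace one hard one, which is exactly what slow, deliberate thinking buys you.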

LLMs, by contrast, are System 1 thinkers by design: they respond to every request immediately, with no hesitation.

This is the reason we tend to improve results by using techniques such as Chain-of-Thought (CoT), i.e. simply asking the model to answer step by step.

By inducing the model into deliberate thinking, forcing it to “think slowly” and dedicate more compute to reasoning its way to the answer, CoT dramatically improves model performance.
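In practice, the simplest form of CoT is nothing more than a change in the prompt. Here is a minimal sketch with two hypothetical helper functions; the "Let's think step by step" cue is the well-known zero-shot CoT phrasing, and no particular model or API is assumed, so the code only builds the strings you would send.

```python
def direct_prompt(question: str) -> str:
    """Ask for the answer straight away (System 1 style)."""
    return f"Q: {question}\nA:"


def chain_of_thought_prompt(question: str) -> str:
    """Nudge the model to write out its reasoning before the final answer."""
    return f"Q: {question}\nA: Let's think step by step."


question = "What is 16 x 21?"
print(direct_prompt(question))
print()
print(chain_of_thought_prompt(question))
```

The second prompt costs more tokens, because the model now generates its intermediate steps, and that extra generation is the additional compute being spent on the question.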

Consequently, if we are capable of creating a model that, by design, performs “System 2 thinking”, we are looking into the eyes of the next frontier in AI development, a new type of model that can perform complex reasoning tasks at unparalleled scale, accuracy, and speed.

A model that would, literally, change the world. And here’s why AlphaCode 2 could be that model.
