Unraveling Google's super AI week. Gemini and the super coding model AlphaCode2
TheTechOasis
Breaking down the most advanced AI systems in the world to prepare you for your future.
5-minute weekly reads.
TLDR:
AI Research of the Week: The Day ChatGPT was Finally Defeated
Leaders: AlphaCode2, Google's proof that Coding will Never be the Same
AI Research of the week
The day has finally come.
GPT-4, at least in terms of benchmarks, is no longer the best-in-class model.
Gemini Ultra, the biggest model in Google's Gemini family of Large Language Models (LLMs), or dare I say, MLLMs (for Multimodal), has claimed the throne.
Gemini Ultra completely obliterates ChatGPT's best model, GPT-4, by beating it in 30 out of 32 popular benchmarks.
Sadly, all that glitters is not gold: several issues have overshadowed some of Google's claims, to the point that the company has been accused of faking some of the announcements.
A perfectly crafted marketing campaign, or is Google truly in the lead now?
Let's find out.
Multimodal at the Core
With Gemini, Google has launched a family of Generative AI models.
From your pocket to the world
The family comes in three main sizes: Ultra, Pro, and Nano.
Gemini Ultra: This is the most capable model in the family, designed for highly complex tasks. It delivers state-of-the-art performance across a range of benchmarks, including reasoning and multimodal tasks.
Gemini Ultra, to be released in Q1 2024, is the model that competes with, and improves on, GPT-4.
Gemini Pro: Optimized for performance in terms of cost and latency, this model demonstrates strong reasoning performance and broad multimodal capabilities across various tasks.
In an apples-to-apples benchmark comparison, Gemini Pro shows results similar to GPT-3.5 Turbo, the model behind the free version of ChatGPT, while beating it on most benchmarks.
Gemini Nano: This model targets on-device deployment and comes in two versions, Nano-1 and Nano-2, designed for low and high-memory devices respectively. It is notable for its efficiency and is trained by distilling from larger Gemini models.
Nano-1 has 1.8 billion parameters and Nano-2 has 3.25 billion, and both are quantized to 4-bit precision. At half a byte per parameter, the weights alone take roughly 0.9 GB for the former and 1.6 GB for the latter, making them usable on many smartphones.
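The memory figures follow from simple arithmetic on the parameter counts. Here is a minimal sketch of that calculation; the helper name `weight_memory_gb` is made up for illustration, it counts the weights only (no activations or runtime overhead), and it assumes 1 GB = 10^9 bytes.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_param / 8  # 8 bits per byte
    return total_bytes / 1e9


# Parameter counts and 4-bit precision as quoted in the text.
nano_1_gb = weight_memory_gb(1.8e9, 4)
nano_2_gb = weight_memory_gb(3.25e9, 4)

print(f"Nano-1: {nano_1_gb:.3f} GB")
print(f"Nano-2: {nano_2_gb:.3f} GB")

# For comparison, the same weights stored in 16-bit precision would be 4x larger.
print(f"Nano-2 at 16-bit: {weight_memory_gb(3.25e9, 16):.1f} GB")
```

The comparison with 16-bit storage is why quantization matters so much for on-device models: it is the difference between fitting in a phone's memory or not.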
But what makes Gemini special is its multimodal approach.
From the ground up
Gemini represents the first frontier model (a best-in-class model) that is truly multimodal from the ground up.
As explained by Google, Gemini has been trained not only with loads of text, but also with text interleaved with images, audio, and even video, and is capable of generating both images and text.
This sets it apart from the most common multimodal solutions, where an LLM is frozen and the representations (vector embeddings) of other modalities are projected into the LLM's input space, a process called grafting. In Gemini's case, all components are trained simultaneously, so the model learns a global representation of features no matter their modality.
In layperson's terms, a description of coffee, the sound of coffee being brewed, and an image of a cup of coffee will share very similar embeddings, signaling to the model that they all refer to the same concept of "coffee".
Also, with grafting, weights that were initially trained to work only with text have to deal with tokens from other modalities projected into their space. Here, the weights deal with what they saw during training, which lets the model capture the important features of each modality.
Grafting is the easiest and cheapest way of creating multimodal solutions, but it gives "suboptimal" results, as mentioned by Oriol Vinyals, VP of Research at Google DeepMind, in the Gemini presentation video.
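To make the two ideas concrete, here is a toy sketch in plain Python. It is not how Gemini or any real model is implemented: the dimensions, the random "encoder output", and the hand-picked embeddings are all assumptions chosen for illustration. The first half shows the grafting step (a linear projection from an image encoder's space into an LLM's input space), and the second half shows what "similar embeddings" means by using cosine similarity.

```python
import math
import random


def project(vector, matrix):
    """Linear projection: multiply a (d_out x d_in) matrix by a d_in vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


random.seed(0)
D_IMAGE, D_LLM = 6, 4  # toy sizes; real models use thousands of dimensions

# Grafting: the image encoder and the LLM stay frozen. Only this projection
# matrix would be trained, so that image features land in the LLM's input space.
image_features = [random.gauss(0, 1) for _ in range(D_IMAGE)]  # stand-in encoder output
projection = [[random.gauss(0, 1) for _ in range(D_IMAGE)] for _ in range(D_LLM)]
image_token = project(image_features, projection)
print("projected pseudo-token size:", len(image_token))

# Shared representation: hand-picked toy embeddings for three inputs.
text_coffee = [0.9, 0.1, 0.3]      # a description of coffee
audio_coffee = [0.8, 0.2, 0.35]    # the sound of coffee brewing
image_bicycle = [-0.2, 0.9, -0.4]  # an unrelated image

print("coffee text vs coffee audio:", cosine_similarity(text_coffee, audio_coffee))
print("coffee text vs bicycle image:", cosine_similarity(text_coffee, image_bicycle))
```

In a jointly trained model, nothing like the separate `projection` step is bolted on afterwards; the closeness of the two coffee vectors is something the model learns directly during training.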
Of all the modalities, video has generated the most hype, but there's a catch.
The faked video
Gemini models process video by encoding it as a sequence of frames within a large context window.
They can also handle video at variable input resolutions, so that more computational resources can be allocated to tasks that require fine-grained understanding.
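A rough way to picture this trade-off: every frame costs tokens, and the context window is a fixed budget. The sketch below is only an illustration. The context size, the tokens-per-frame values, and the function name `frames_that_fit` are assumptions, not figures published in this article.

```python
def frames_that_fit(context_tokens: int, tokens_per_frame: int,
                    reserved_for_text: int = 0) -> int:
    """How many video frames fit in the context once text tokens are set aside."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    available = max(context_tokens - reserved_for_text, 0)
    return available // tokens_per_frame


CONTEXT_TOKENS = 32_000        # assumed context window size
LOW_RES_TOKENS_PER_FRAME = 64  # hypothetical cost of a coarse frame
HIGH_RES_TOKENS_PER_FRAME = 256  # hypothetical cost of a detailed frame

coarse = frames_that_fit(CONTEXT_TOKENS, LOW_RES_TOKENS_PER_FRAME, reserved_for_text=1_000)
fine = frames_that_fit(CONTEXT_TOKENS, HIGH_RES_TOKENS_PER_FRAME, reserved_for_text=1_000)

print("frames at low resolution:", coarse)
print("frames at high resolution:", fine)
```

Under these made-up numbers, a higher resolution buys more detail per frame at the cost of fewer frames, which is the kind of allocation decision variable resolution makes possible.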
For instance, as you can see below, the model can process the image sequence and conclude that the person is imitating the famous Matrix scene.
So far, so good.
But the image you are seeing above and what was presented in Google's "Hands-on with Gemini" video are totally different things, which ignited a huge controversy.
In the video, Gemini seamlessly interacts with the user in real time, with no latency, and with very simple prompts, which is nothing short of extraordinary.
But as pointed out by many, the video was faked.
As Oriol himself explained in a tweet, the prompts were shortened and latency was eliminated for a more impactful result to, in their own words, "inspire developers".
Laughable. Honestly, why is it so hard to just tell the truth?
However, they still did show some other pretty amazing stuff.
The First PRM
As we discussed in last week's Leaders segment, back in May OpenAI released a paper showing that Process-Supervised Reward Models, or PRMs, greatly improve results on reasoning tasks compared to the standard reward models used in current RLHF pipelines, Outcome-Supervised Reward Models (ORMs).
Reinforcement Learning from Human Feedback, or RLHF, helps instruction-tuned generative pre-trained transformers like ChatGPT or Gemini improve factuality (among other things). Within that pipeline, a PRM reviews every step in the model's reasoning process, instead of simply reviewing the final outcome, as an ORM does.
This is key in cases like mathematics, because if you make an intermediate mistake in the calculation, it is unlikely that you will reach the correct final answer.
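The difference between the two reward styles can be sketched with a toy example. Real PRMs and ORMs are learned neural networks; here a simple rule-based arithmetic checker stands in for them, and the step format and function names are invented for illustration.

```python
import operator
from typing import List, Optional, Tuple

# A reasoning step: (left operand, operator symbol, right operand, claimed result)
Step = Tuple[int, str, int, int]

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def outcome_reward(steps: List[Step], expected_final: int) -> float:
    """ORM-style: a single score that looks only at the final answer."""
    return 1.0 if steps[-1][3] == expected_final else 0.0


def process_rewards(steps: List[Step]) -> List[float]:
    """PRM-style: one score per step, checking each step on its own terms."""
    return [1.0 if OPS[op](a, b) == claimed else 0.0 for a, op, b, claimed in steps]


def first_wrong_step(steps: List[Step]) -> Optional[int]:
    """Index of the first step that received a zero reward, if any."""
    for index, reward in enumerate(process_rewards(steps)):
        if reward == 0.0:
            return index
    return None


# A student computes 16 * 21 as 16 * 20 + 16 * 1 but slips on the first step.
student_solution: List[Step] = [
    (16, "*", 20, 310),   # wrong: should be 320
    (16, "*", 1, 16),     # correct
    (310, "+", 16, 326),  # correct given the earlier (wrong) intermediate value
]

print("outcome reward:", outcome_reward(student_solution, expected_final=336))
print("process rewards:", process_rewards(student_solution))
print("first wrong step index:", first_wrong_step(student_solution))
```

The outcome-style score only says the answer is wrong. The process-style scores say where it went wrong, which is a far more useful training signal for multi-step reasoning.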
Surprisingly, it seems (although Google has not acknowledged it) that Gemini is the first GenAI model to implement this reward modeling paradigm, as suggested by this video.
For instance, the model is capable of pointing out the specific step in the process where the student went wrong.
More impressively, it's capable of generating test examples based on the issue identified to help the student improve:
Very impressive stuff to say the least.
Google is ready to compete
Despite all this, Google's Gemini has received mixed reviews.
For example, critics note that it only now matches OpenAI's GPT-4 level, even though that model has been available since March.
However, the real takeaway is Google's demonstrated ability to compete in this space, a previously uncertain fact.
Crucially, the key release here was never Gemini but AlphaCode2, and its simultaneous release alongside Gemini isn't coincidental.
It signals a shift in focus from solely Multimodal Large Language Models (MLLMs) to a combination of MLLMs and sampling and search algorithms, fostering System 2 thinking.
With AlphaCode2, Google isn't just catching up; it's redefining the AI narrative towards a synergy of MLLMs and search, a narrative where only Google is officially playing right now, as OpenAI's Q* model is only a rumor at this point.
But what is AlphaCode 2, and why is it so important?
More on that below…
Key contributions
Google's Gemini model family proves to the world that Google is finally ready to compete for the AI throne
Through three different models, Google offers the first frontier multimodal model in four different modalities and the successful implementation of PRM models at scale
Practical implications
With models like Gemini, GenAI solutions stand to disrupt every industry you can think of, as they delve not only into the world of text, but also images, audio, and even video
Gemini can be used to generate synthetic data at scale, and has great use cases for industries like education, helping teachers create and supervise student development
Best news of the week
The truth behind Q*
Meta and Stanford present CHOIS, their text-to-interaction model
ChatGPT vs BARD. Who wins?
Leaders
AlphaCode2, Google's weapon to claim the narrative in AI
While the entire world is talking about Gemini, Google swiftly released another model, AlphaCode 2, the latest version of their competitive programming model.
It's a different type of beast. A new type of model.
Combining Gemini Pro and a fascinating sampling mechanism we will get into shortly, it is trained to solve competitive programming challenges that require advanced coding skills and super-strong reasoning capabilities.
Google estimates that it already performs at the 85th percentile of participants, meaning that it's superior to 85% of competitors who can be considered among the best coders in the world.
Despite AlphaCode 2's amazing capabilities, Google seemingly missed an opportunity to give it the spotlight it deserves by releasing it at the same time as the model that "surpasses GPT-4".
However, they surely knew this.
Then, why did they do that?
Because they are implicitly shifting the narrative of this whole story to their benefit.
Google is changing the rules of the game, and you will soon understand why.
The Creation of Artificial System 2 Thinking
Last week, the entire AI space couldn't get enough of the Sam Altman drama and, more importantly, of the secret model OpenAI was creating, named Q*.
This model, not released but acknowledged by Sam Altman himself, showed that OpenAI was following the same thought process as Yoshua Bengio, Andrej Karpathy, or Google DeepMind regarding the "next big thing" in AI: the creation of the first AI model capable of performing "System 2 thinking".
But what is that?
Thinking fast and slow
Daniel Kahneman, in his book Thinking, Fast and Slow, presents a model of the human mind that describes two distinct ways in which we form thoughts and make decisions, referred to as System 1 and System 2.
System 1: This is the fast, automatic, and often subconscious way of thinking. It's responsible for quick judgments and immediate decisions, often based on intuition and gut reactions. This system is also more prone to biases and errors in judgment, especially in complex or unfamiliar situations.
System 2: It's the slow, deliberate, and more conscious way of thinking. It's involved in effortful mental activities that require attention, such as complex calculations, logical reasoning, and critical thinking. System 2 is more reliable for making thoughtful and rational decisions, but it is also more resource-intensive and slower.
But why does this matter?
Put simply, it elegantly explains why LLMs are so error-prone when facing complex reasoning tasks, including even the simplest mathematics.
To understand this concept, if I ask you "What's 2+2?", you will immediately answer "4", with no hesitation.
But if I ask you "16×21", unless you know that by heart, you are going to take some time to perform the different steps of the calculation in your head before answering.
LLMs struggle with the second kind of question mainly because they are, by design, System 1 thinkers. They immediately respond to your request, with no hesitation.
This is the reason we tend to improve results by using techniques such as Chain-of-Thought (CoT), i.e., simply asking the model to answer step by step.
By forcing the model to "think slowly" and reason its way to the answer, CoT dedicates more compute to the question and dramatically improves model performance.
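As a small illustration, the sketch below contrasts a direct prompt with a step-by-step one, and spells out the kind of intermediate steps a "slow" answer to 16×21 would contain. The function names are made up for this example, and no model is actually called; the prompts are just strings you could send to any LLM.

```python
from typing import List, Tuple


def direct_prompt(question: str) -> str:
    """Ask for the answer straight away (System 1 style)."""
    return f"Q: {question}\nA:"


def cot_prompt(question: str) -> str:
    """Nudge the model to write out its reasoning before answering."""
    return f"Q: {question}\nA: Let's think step by step."


def multiply_in_steps(a: int, b: int) -> Tuple[List[str], int]:
    """Break a * b into tens and units, the way a person would do it mentally."""
    tens, units = divmod(b, 10)
    partial_tens = a * tens * 10
    partial_units = a * units
    total = partial_tens + partial_units
    steps = [
        f"{a} x {tens * 10} = {partial_tens}",
        f"{a} x {units} = {partial_units}",
        f"{partial_tens} + {partial_units} = {total}",
    ]
    return steps, total


print(direct_prompt("What is 16 x 21?"))
print(cot_prompt("What is 16 x 21?"))

steps, answer = multiply_in_steps(16, 21)
for step in steps:
    print(step)
print("answer:", answer)
```

Each intermediate line is extra text the model has to generate, which is exactly where the additional compute goes when it is asked to reason step by step.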
Consequently, if we are capable of creating a model that, by design, performs "System 2 thinking", we are looking into the eyes of the next frontier in AI development, a new type of model that can perform complex reasoning tasks at unparalleled scale, accuracy, and speed.
A model that would, literally, change the world. And here's why AlphaCode 2 could be that model.