OpenAI Crushes, Mouth-Computer Interfaces, & Apple's New AI
THEWHITEBOX
TLDR;
MouthPad, A Mouth-Computer Interface
Kling Reveals Image-Editing Tool Kolors
The Challengers to NVIDIA's Crown
OpenAI Destroys Competition
Google's NotebookLM Will Leave You Speechless
Rysana's "Instant" Inversion Changes the Inference Game
Snap Reveals its AI Spectacles
[TREND OF THE WEEK] UI-JEPA, The Reason Apple Intelligence Got Delayed
NEWSREEL
MouthPad, a Mouth-Computer Interface
Yes, you read that right. Augmental has announced a new device that allows you to control computer or smartphone screens with… your tongue. As shown in the video, the user puts the MouthPad in their mouth and can then control their iPhone.
TheWhiteBox's take:
This very cool technology could be key for disabled people who have difficulty expressing themselves freely. But if it feels like magic, it isn't.
As I've mentioned countless times, all AIs are maps between inputs and outputs, be that a sequence of words into the next one (ChatGPT), a sequence of noisy pixels into a new image (image/video generators), brain activity patterns into actions (Neuralink), and now a map from tongue pressure points on a mouth device to actions on a screen.
Therefore, we can reduce AI to the following: a machine that detects that "when x happens, y happens," and our goal is simply to make this prediction useful to us. And I can't think of a more helpful technology than one that fights disabilities, so well done, Augmental.
IMAGE EDITING
Kling Reveals Image Editing Feature "Kolors"
Kling has released a new feature that allows you to paint new clothes on a target human model in a very realistic fashion (strangely, they used NVIDIA's CEO, Jensen Huang, as the model). Seeing things like this, how are we going to differentiate what's true from what's not?
Another area gaining traction is video-to-video, with very impressive results already, from enhancing older video games to modifying entire videos (in both cases, it's Runway's video-to-video Gen-3 model).
HARDWARE
The Challengers to NVIDIA's Crown
Speaking of Jensen and his company NVIDIA, Spectrum's article provides a nice rundown of the main challengers, describing each one and weighing its pros and cons, from AMD and Cerebras to Groq and SambaNova.
TheWhiteBoxâs take:
While competition is welcome and will eventually force NVIDIA to drop margins, it's hard to deny they have won the hardware lottery.
The main architecture used today, the Transformer, was purpose-built for GPUs
Most AI engineers learn to program on CUDA, NVIDIA's GPU software, before any other platform, ensuring that all libraries, projects, etc., guarantee CUDA support (and not necessarily support for other platforms)
Consequently, in terms of market growth, it's not even close:
Source: SEC
However, not everything is perfect. The main issue with NVIDIA is latency; GPUs simply can't compete with hardware like Groq's LPUs, SambaNova's RDUs, or Etched.ai's upcoming Sohu chip, the first Transformer ASIC (the chip has the Transformer architecture imprinted on it, guaranteeing a performance that is out of this world).
While latency is not an issue in training (we maximize batch size to ensure maximum parallelization instead of channeling fewer sequences to get lower latency), inference is where these companies, especially the non-GPU ones, will have an opportunity to eat NVIDIA's share.
BENCHMARKS
OpenAI Destroys Competition
The LMSYS team, behind what is considered the most respected LLM leaderboard, has published impressive results from its analysis of o1 models.
As shown above, these models are simply in a league of their own in areas like math. They are also the best overall for multi-turn, hard prompts, and more. Long story short, they are the best available general-purpose models, and it's not close.
The key thing to note about the LMSYS leaderboards is that they also include "votes" cast by people giving their opinion (to prevent biases, there is a battleground where you are shown two responses to a prompt and, without knowing which models produced them, you choose the best one), so they are an accurate reflection of crowd sentiment besides the standard benchmark evaluations.
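For intuition on how those pairwise votes turn into a ranking, here is a minimal sketch of an Elo-style update, the kind of rating scheme the Arena popularized; the battle log, starting ratings, and K-factor below are made up for illustration and are not LMSYS's actual pipeline.

```python
# Minimal Elo-style update from pairwise battle votes.
# Illustrative only: ratings, votes, and the K-factor are invented.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift ratings after one human vote: the winner gains, the loser loses."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - exp_win)
    ratings[loser] -= k * (1.0 - exp_win)

# Toy battle log: (winner, loser) pairs from anonymous side-by-side votes.
votes = [("o1-preview", "gpt-4o"), ("o1-preview", "claude-3.5-sonnet"),
         ("claude-3.5-sonnet", "gpt-4o"), ("o1-preview", "gpt-4o")]

ratings = {"o1-preview": 1000.0, "gpt-4o": 1000.0, "claude-3.5-sonnet": 1000.0}
for winner, loser in votes:
    update(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # highest-rated model first
```

Run over millions of votes, updates like these converge into the crowd-sourced ranking you see on the leaderboard.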
In parallel, today Alibaba announced a new set of Qwen 2.5 models, up to twelve different models ranging in size and performance that, at the top line, are fairly competitive with Llama 3.1 405B, the crown jewel of open-source, despite being more than five times smaller (72 billion parameters).
TheWhiteBoxâs take:
There was little doubt this was going to be the case. However, it's just further proof of how much we need new benchmarks that truly challenge frontier models.
As we discussed on Sunday, if these models are put up against questions that require novelty (questions where they can't use their all-around knowledge to fake intelligence), like the ARC-AGI challenge, they are still "as bad" as models like Claude 3.5 Sonnet.
And the list of examples of "simple stuff" they can't do is vast.
A group of researchers tested o1 models on simple multiplication. While they perform well up to 9×9 digits, they struggle severely afterward (image below), burning through an absurd number of reasoning ("thinking") tokens, far more than a human would require.
Interestingly, the same researchers proved that a GPT-2-level model (just 117 million parameters), orders of magnitude smaller than o1 models, gets 99.5% when using clever training techniques, far exceeding the results of the frontier models. Again, depth beats breadth!
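If you want to run this kind of stress test yourself, a minimal sketch looks like the following; `ask_model` is a hypothetical stand-in for whatever API client you use, and the digit ranges and sample sizes are arbitrary.

```python
# Sketch of a multi-digit multiplication stress test.
# `ask_model` is hypothetical: plug in your own API call. Exact-match scoring only.
import random

def make_problems(digits: int, n: int = 50) -> list[tuple[str, int]]:
    """Build n problems multiplying two numbers with `digits` digits each."""
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    problems = []
    for _ in range(n):
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        problems.append((f"What is {a} * {b}? Answer with the number only.", a * b))
    return problems

def ask_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real client."""
    raise NotImplementedError

def accuracy(digits: int) -> float:
    probs = make_problems(digits)
    correct = 0
    for prompt, answer in probs:
        reply = ask_model(prompt)
        digits_only = "".join(ch for ch in reply if ch.isdigit())
        correct += digits_only == str(answer)
    return correct / len(probs)

# for d in range(1, 15): print(d, accuracy(d))  # accuracy tends to fall off as d grows
```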
To further understand how o1 works, read my Sunday piece if you haven't already.
The overarching lesson? o1 models are a progression toward better reasoning, but not necessarily toward AGI.
GOOGLE
NotebookLM Might Leave You Speechless
Google has given us one of those "wow" moments that could really be remembered in the future. They have released a new feature, running on Gemini 1.5 Pro, on NotebookLM, a place to store your notes and data. The feature takes these files and creates an entire podcast with two people actively describing the contents with insane accuracy and extremely high-quality voice generation.
Here are a couple of examples that will blow your mind (here and here). And the best thing is that you can try it, too.
TheWhiteBox's take:
This is seriously remarkable.
While the idea that you can turn a random text into a scripted conversation isn't that impressive, considering that's precisely where LLMs shine, the fact that you can actually create the podcast with clearly different voices that overlap perfectly in conversation is mind-boggling.
Among the Premium-only weekly content, I uploaded an article discussing how the process most likely works.
PRODUCT
Rysana's Inversion Changes the Inference Frontier
While we are still digesting the idea that test-time compute is now possible thanks to the o1 models, Rysana has announced the Inversion model family, a set of models that guarantees 100% structured outputs with 100x throughput, 10x lower latency, and 10,000x lower overhead.
The demos are, quite frankly, surreal, as the model's answers are instant and always structure-valid.
TheWhiteBox's take:
The idea is that they perform a runtime adaptation of the model through pruning, only keeping weights that matter and speeding up the process. They offer some indications of how they did it, which I delve into in a link at the end of this newsletter.
AI WEARABLES
Snap Launches Its AI Spectacles
Snap Inc. has finally released its new wearable, the Spectacles, which let you see things in augmented reality.
Thanks to OpenAI's LLMs, the glasses can take requests and offer features such as Spectator Mode, which allows you to share your experience with friends so they can see it from their own perspective; Phone Mirroring, which allows you to mirror your favorite apps directly onto your Spectacles; and Controller, which allows you to use your phone as a six-degrees-of-freedom controller and play games.
TheWhiteBox's take:
I'm a firm believer in Y Combinator's mantra, "build something people want" (they also happen to have announced a YC company that turns images into 3D settings).
In the case of Augmented Reality products, I feel these CEOs are blinded by their own delusions of the future. I don't think anyone, ever, requested this product (the same applies to the Apple Vision Pro, although that one packs super-charged technology that may justify the product in the end).
Today, AI remains a bunch of empty promises with little to show for them, and the problem is that instead of building what people want, we are trying to create new necessities that, while cool in a demo, don't make my life better and end up in my closet.
Cool tech, though.
TREND OF THE WEEK
UI-JEPA, The Reason for Apple Intelligence's Delay
I think I know why Apple delayed the release of Apple Intelligence, its suite of AI features (including a Siri revamp) arriving on the iPhone 15 Pro and later.
They have found a better way to deploy it: a new architecture type, UI-JEPA, which is more efficient, vastly cheaper, and produces results as good as state-of-the-art models while relying on a model that runs entirely on your iPhone.
This article encapsulates Apple's AI strategy, insights into Apple Intelligence's future features, and an easy-to-understand view of how AIs understand our world, all in one.
Enjoy!
The Same Old Story
Having the most powerful technology humans have ever created is pointless if you don't have the means to use it. And that is a real risk for Generative AI.
On paper, it looks easy; in practice, it isn't.
These models are notoriously expensive to train and run (inference) because they need:
Huge amounts of data to learn (they are extremely sample-inefficient)
Vast amounts of parameters to capture the patterns in data
Consequently, most frontier models are terabytes in size, hundreds of times larger than what an iPhone can hold in memory (just 8 GB, even for the new iPhone 16 family).
In case you're wondering, AI models can't be served from flash storage; they have to sit in RAM due to the sheer number of memory reads and writes during inference. Serving them from flash would lead to unbearable latency.
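To see why the numbers don't work, a quick back-of-the-envelope calculation is enough; the parameter counts below are assumptions for illustration (the frontier figure is a widely circulated rumor, not an official number).

```python
# Back-of-the-envelope model memory footprint vs. iPhone RAM.
# Parameter counts are illustrative assumptions, not official figures.

def footprint_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate RAM needed just to hold the weights (fp16 = 2 bytes/param)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

IPHONE_RAM_GB = 8  # roughly what current iPhones ship with

for name, size_b in [("3B on-device model", 3), ("Llama 3.1 405B", 405),
                     ("~1.8T-param frontier model (rumored)", 1800)]:
    gb = footprint_gb(size_b)
    print(f"{name}: ~{gb:,.0f} GB of weights vs {IPHONE_RAM_GB} GB of iPhone RAM")
```

A 3-billion-parameter model roughly fits; a frontier model is off by orders of magnitude.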
Consequently, if you want to use top AI models on your smartphone, you need to connect to the Internet, as the AI will run on the cloud (server-side compute).
This is not only a latency issue (the model is serving you and potentially hundreds of other users simultaneously, and the information has to travel back and forth, so the time it takes to respond may be extended), but also a privacy issue.
As you share your personal data with these models, that data must travel through the Internet, creating a clear cybersecurity risk (even OpenAI isn't free from these attacks).
For Apple, a hardware company proud to be considered "privacy-centric," the combination of poor user experience due to high latency and privacy concerns makes using frontier AI models more of a problem than a solution. In other words, for Apple, running models on-device is a priority.
Naturally, Apple tried to solve this by training small models in the 3-billion-parameter range that could fit on an iPhone. However, due to the data-hungry and size-hungry nature of Generative AI models, while these models could run on the smartphone, the experience was poor because the AI itself was not very powerful.
This has led to delays, and Apple has even had to strike deals with OpenAI (and soon Google or Anthropic) to include their models in the Apple Intelligence offering, creating an unmistakable sense of desperation and lukewarm feedback from its customer base.
However, with Apple's new model, things might take an unexpected turn, and they might not need OpenAI after all. To understand why that's the case, we need to understand the key intuition as to how machines understand our world: similarity.
Predicting Only the Necessary
To understand the dramatic change Apple has made with this research, we need to understand how models see our world.
AIs are still classical computers at the end of the day, so they can only see numbers. Therefore, when we want them to process text, images, or video, we must transform these into numbers, a process we call "encoding."
ChatGPT and other LLMs are "decoder-only," meaning they use a look-up table to switch a word into an embedding (instead of using an encoder).
Either way, we are transforming each token (let's assume the word "dog") into a list of numbers (a vector).
But these numbers aren't arbitrary; they tell a story about the underlying concept. In a way, you can look at these numbers as attributes, which draws similarities to my favorite analogy, sports video games.
In basketball, if I tell you a player has 99% shooting accuracy, 9/10 dribbling skills, 8/10 passing skills, and 4 rings... you know it's Steph Curry. Thus, Curry can be represented by the vector [99, 9, 8, 4].
And [90, 9.5, 9.5, 3]? With so few numbers it's hard to say, but that sounds about right for LeBron. That is a simple example of encoding: we represent a real concept like "Steph Curry" as a set of numbers that tell us who Steph is in compressed form.
But why do we need them in vector form? Well, because it enables the machine to measure similarity between concepts.
As dogs and cats share similar attributes (mammals, four-legged, domestic, etc.), they will have similar vectors, indicating to the AI that both concepts are semantically similar. Therefore, AIs build an encoding space, usually called "latent space" or "embedding space," where similar things have similar vectors (clustered together) and dissimilar concepts are pushed apart.
For instance, the image below represents UI-JEPA's latent space, where data is clustered by topic. Data related to banking is grouped together, as is data about education or finance, with some less obvious data points being more dispersed.
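To make the idea concrete, here is a tiny sketch of what "similar concepts get similar vectors" looks like in code; the vectors and attributes are invented for illustration, since real embeddings are learned rather than hand-written and have hundreds or thousands of dimensions.

```python
# Toy embeddings: each concept is a vector of made-up attributes, and cosine
# similarity measures how "close" two concepts are in the latent space.
import numpy as np

# Attribute order: [mammal, four-legged, domestic, has wheels]
embeddings = {
    "dog": np.array([1.0, 1.0, 1.0, 0.0]),
    "cat": np.array([1.0, 1.0, 0.8, 0.0]),
    "car": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means 'pointing the same way' (very similar); 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["dog"], embeddings["cat"]))  # ~0.99: clustered together
print(cosine(embeddings["dog"], embeddings["car"]))  # 0.0: pushed far apart
```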
Therefore, whenever you send models like ChatGPT text, remember that the AI is seeing the vectors of each word, figuring out how they relate to each other (the attention mechanism), and from that deriving the sentence's meaning. Consequently, the more accurate these representations (known as embeddings) are, the better the AI understands our world.
Sadly, while LLMs capture this meaning well, they require vast amounts of training data and parameters to do so. But what if there's a more cost-effective way to find these representations?
And that's precisely what JEPAs offer and why some think Generative AI enthusiasts are, excuse my words, full of shit.
Learning Whatâs Necessary
Generative AI models, despite what markets and marketing will tell you, have a considerable number of detractors.
Do We Need to Generate to See?
Among the most notable of these is Yann LeCun, Meta's Chief AI Scientist, who argues that autoregressive transformers, known as Large Language Models (LLMs), are extremely overhyped and that the representations they learn (the vectors we discussed) do not adequately describe the underlying concept.
The biggest argument is that GenAI models like ChatGPT or Sora (AI-generated videos) need to generate the future to predict it. In other words, they have to generate the future to see it (this sounds weird but will make sense in a minute).
For example, if they want to predict that a tree will appear in the next video frame, they have to generate the entire tree, with all the minor details that, while indeed part of the tree, aren't necessary to identify the object as a tree.
Using the Steph Curry example, a generative model has to generate every single hair on the player's body to identify him as Steph Curry. Sure, there's a very high chance he's the only player in the NBA with that exact amount of body hair, but is counting body hair an optimal way of identifying an NBA player?
Didn't we see earlier that we just need four attributes to distinguish Steph Curry from any other player in the NBA?
In a nutshell, what I'm trying to tell you is that Generative AI models predict everything, forcing them to learn both necessary and unnecessary characteristics of a concept. In turn, that means that the representations they learn about the world are, most often, overblown in detail, which explains why these models take so long to learn.
This leads us back to Yann, who has proposed JEPA (Joint-Embedding Predictive Architecture) as an alternative to LLMs for learning about the world, an architecture that Apple has now used to create UI-JEPA, a state-of-the-art model in a key area of Apple Intelligence: UI understanding.
The UI-JEPA Architecture
The UI-JEPA framework by Apple combines two models and excels at UI understanding, on par with state-of-the-art models like Claude 3.5 Sonnet or GPT-4 Turbo while having just 4.4 billion parameters, hundreds of times smaller, and requiring at least 50 times less training.
But what do we mean by UI understanding? In layman's terms, the model can watch videos of users interacting with a smartphone screen and identify their "intent." For instance, if the user clicks the Clock app and sets a timer, a good model will output, "The user is setting a timer using the Clock app."
This model could be essential to understanding the user's intentions and, in the future, taking action on their behalf (if it can characterize the intent of a particular interaction, it can perform the same interaction when that intent is requested of it).
Particularly, UI-JEPA is composed of two elements:
A ViT encoder
A small LLM decoder (Microsoft's Phi in this case), used to generate the text describing the intent
The objective is clear: train an encoder using JEPA-style training to help it learn better representations more efficiently (faster and cheaper). Then, we feed these improved representations to a small LLM (much worse than a GPT-4 level model) so that the LLM can describe the intent.
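In code, the two-stage idea described above might look roughly like the sketch below (PyTorch-style Python); the class names, dimensions, and pooling are placeholders I made up to illustrate the wiring, not Apple's implementation.

```python
# Rough sketch of a UI-JEPA-style pipeline: a video encoder produces embeddings,
# and a small language model decodes them into an intent description.
# Class names, sizes, and wiring are placeholders, not Apple's implementation.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Stand-in for a ViT-style encoder over screen-recording frames."""
    def __init__(self, frame_dim: int = 768, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(frame_dim, embed_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, frame_dim) -> (batch, num_frames, embed_dim)
        return self.proj(frames)

class TinyIntentDecoder(nn.Module):
    """Stand-in for a small LLM (a Phi-class model) conditioned on video embeddings."""
    def __init__(self, embed_dim: int = 512, vocab_size: int = 32000):
        super().__init__()
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, video_embeddings: torch.Tensor) -> torch.Tensor:
        pooled = video_embeddings.mean(dim=1)  # summarize the clip
        return self.lm_head(pooled)            # logits over the next intent token

encoder, decoder = VideoEncoder(), TinyIntentDecoder()
fake_clip = torch.randn(1, 16, 768)            # 16 pre-processed frames
logits = decoder(encoder(fake_clip))           # would be decoded into text like
print(logits.shape)                            # "The user is setting a timer..."
```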
But what do we mean by JEPA-style training?
The key difference between JEPAs and Generative models is that JEPAs make predictions in the latent space, avoiding the need to predict unnecessary things. In other words, a JEPA doesn't predict Steph Curry by generating every single detail about him (something LLMs would do); it predicts the vector that represents Steph Curry, ensuring it only needs to predict the essential components of what makes "Steph Curry" different from the rest.
But how do JEPAs learn?
I won't go into too much detail for length's sake, but the idea is to show the AI a video of a user interacting with an iPhone, hide certain parts of a given frame or entire future frames, and make the AI predict the representations (vectors) of the missing parts.
For instance, if a dog appears in a frame and we mask the area where the tail is, the model should be able to predict that the mask is hiding a dog's tail.
By masking parts of a given frame (above image, middle), the model learns spatial features: what things look like and how they relate at any given time. And by masking parts of future frames (right), the model learns temporal features: how things evolve.
In case you're wondering, this is radically different from how LLMs learn. They learn by "generating the next thing" and checking whether that generation is accurate, which is a much harder task (and thus requires larger models and datasets).
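Below is a heavily simplified sketch of that training idea: mask part of the input, predict the representations of the hidden regions, and compare them against a slowly updated target encoder. The linear stand-in encoders, shapes, and EMA coefficient follow the general JEPA recipe, not Apple's exact configuration.

```python
# Simplified JEPA-style training step: predict the *representations* of masked
# patches rather than their pixels. Follows the general JEPA recipe (context
# encoder + EMA target encoder + predictor), not Apple's exact setup.
import torch
import torch.nn as nn

embed_dim, num_patches = 256, 64
context_encoder = nn.Linear(embed_dim, embed_dim)  # stands in for a transformer encoder
target_encoder = nn.Linear(embed_dim, embed_dim)   # slowly updated copy of the context encoder
predictor = nn.Linear(embed_dim, embed_dim)        # predicts masked representations
target_encoder.load_state_dict(context_encoder.state_dict())
optimizer = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

def train_step(patches: torch.Tensor, mask: torch.Tensor) -> float:
    """patches: (batch, num_patches, embed_dim); mask: (num_patches,) bool, True = hidden."""
    with torch.no_grad():
        targets = target_encoder(patches[:, mask])   # what the hidden regions "mean"
    context = context_encoder(patches[:, ~mask])     # encode only the visible regions
    preds = predictor(context.mean(dim=1, keepdim=True)).expand_as(targets)
    loss = nn.functional.mse_loss(preds, targets)    # match in latent space, not pixel space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                            # EMA update so targets evolve slowly
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(0.99).add_(0.01 * p_c)
    return loss.item()

patches = torch.randn(8, num_patches, embed_dim)     # pre-embedded video-frame patches
mask = torch.rand(num_patches) < 0.5                 # hide roughly half the patches
print(train_step(patches, mask))
```

The key line is the loss: it compares predicted embeddings against target embeddings, never against raw pixels, which is exactly the "predict only what's necessary" idea.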
And the results speak for themselves: these models match (and sometimes exceed) the performance of state-of-the-art models while being much smaller, and therefore much faster to run and easier to host (fitting inside an iPhone); a best-of-both-worlds result.
With Appleâs endorsement, JEPAs might finally manage to steal the spotlight from Generative AI models, especially in situations of constrained compute or size.
TheWhiteBox's take:
Technology & Product:
I have no doubt Apple Intelligence will continue with this approach, using smaller models that learn faster and better using a new emergent architecture class, JEPAs.
Apple isn't worried about "pushing the veil of ignorance back" but about making AI products that improve its customers' lives while respecting one of Apple's dogmas: privacy first.
And if JEPAs offer that sweet spot, they are taking it.
Markets:
To this day, I feel that markets aren't really paying attention to technological breakthroughs. The feeling coming from Silicon Valley (and thus, markets) is that we don't need more algorithmic breakthroughs and simply need to scale these beasts as much as possible.
I think this approach is not only hard to attain due to energy constraints; it's fundamentally flawed, arrogant, and doomed to fail. It seems that Apple and Meta (and, partly, Microsoft) have acknowledged this and are trying to make this technology more sustainable before we hit the wall.
THEWHITEBOX
Join Premium To Access More Unique Content
If you like this content, by joining Premium you will receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.
This week includes…
Understanding The Last Standing Benchmark, ARC-AGI (All Premium members)
NotebookLM, A "Wow" Moment for AI (cited above) (All Premium members)
Inverted Inference, A New Approach to LLM Inference with 10,000x results (Only Full Premium members)
Native and automatic Model Routing, a New AI Engineering Paradigm (Only Full Premium members)
Until next time!
Give a Rating to Today's Newsletter
For business inquiries, reach out to me at [email protected]