
OpenAI Crushes, Mouth-Computer Interfaces, & Apple's New AI


THEWHITEBOX
TLDR;

  • 👅 MouthPad, A Mouth-Computer Interface

  • 🖌️ Kling Reveals Image-Editing Tool Kolors

  • 👑 The Challengers to NVIDIA’s Crown

  • 🥵 OpenAI Destroys Competition

  • 🔣 Google’s NotebookLM Will Leave You Speechless

  • 💨 Rysana’s ‘Instant’ Inversion Changes the Inference Game

  • 👓 Snap Reveals its AI Spectacles

  • [TREND OF THE WEEK] UI-JEPA, The Reason Apple Intelligence Got Delayed


NEWSREEL
MouthPad, a Mouth-Computer Interface

Yes, you read that right. Augmental has announced a new device that lets you control computer or smartphone screens with… your tongue. As shown in the video, the user places the MouthPad in their mouth and can control their iPhone with it.

TheWhiteBox’s take:

This very cool technology could be key for disabled people who have difficulty expressing themselves freely. But if it feels like magic, it isn’t.

As I’ve mentioned countless times, all AIs are maps between inputs and outputs: a sequence of words to the next word (ChatGPT), a field of noisy pixels to a new image (image/video generators), brain activity patterns to actions (Neuralink), and now tongue pressure points on a mouth device to actions on a screen.

Therefore, we can reduce AI to the following: a machine that detects that ‘when x happens, y happens,’ and our goal is simply to make this prediction useful to us. And I can’t think of a more helpful technology than one that fights disabilities, so well done Augmental.
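To make the ‘when x happens, y happens’ framing concrete, here is a minimal sketch of a classifier that maps sensor readings to screen actions. Everything in it (the pad layout, the labels, the tiny dataset) is made up for illustration; it is not Augmental’s actual pipeline.

```python
# Minimal sketch: learning a map from inputs (x) to actions (y).
# The pad layout, labels, and data are invented; this is not Augmental's system.
from sklearn.neighbors import KNeighborsClassifier

# Each sample: pressure readings from four hypothetical pads on the mouthpiece.
X = [
    [0.9, 0.1, 0.0, 0.0],  # strong press on the front pad
    [0.8, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.9, 0.1],  # strong press on the back pad
    [0.1, 0.0, 0.8, 0.2],
]
y = ["swipe_left", "swipe_left", "tap", "tap"]

model = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(model.predict([[0.85, 0.15, 0.05, 0.0]]))  # -> ['swipe_left']
```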

IMAGE EDITING
Kling Reveals Image Editing Feature ‘Kolors’

Kling has released a new feature that allows you to paint new clothes on a target human model in a very realistic fashion (strangely, they used NVIDIA’s CEO, Jensen Huang, as the model). Seeing things like this, how are we going to differentiate what’s true from what’s not?

Another area gaining traction is video-to-video, with very impressive results already, from enhancing older video games to modifying entire videos completely (in both cases, it's Runway’s Video-to-video Gen3 model).

HARDWARE
The Challengers to NVIDIA’s Crown

Speaking of Jensen and his company NVIDIA, Spectrum’s article provides a nice rundown of the main challengers, describing each one and addressing its pros and cons, from AMD and Cerebras to Groq and SambaNova.

TheWhiteBox’s take:

While competition is welcome and will eventually force NVIDIA to drop margins, it’s hard to deny they have won the hardware lottery.

  • The main architecture used today, the Transformer, was purpose-built for GPUs

  • Most AI engineers learn to program with CUDA, NVIDIA’s GPU software, before any other platform, ensuring that all libraries, projects, etc., guarantee CUDA support (and not necessarily support for other platforms)

Consequently, in terms of market growth, it’s not even close:

Source: SEC

However, not everything is perfect. The main issue with NVIDIA is latency; GPUs simply can’t compete with hardware like Groq’s LPUs, SambaNova’s RDUs, or Etched.ai's upcoming Sohu chip, the first Transformer ASIC (the chip has the Transformer architecture imprinted on it, guaranteeing out-of-this-world performance).

While latency is not an issue in training (we maximize batch size to ensure maximum parallelization instead of processing fewer sequences to lower latency), inference is where these companies, especially the non-GPU ones, will have an opportunity to eat into NVIDIA’s share.
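To see the tradeoff concretely, here is a toy back-of-the-envelope model. All the numbers and the cost function are invented for illustration; they are not measurements of any real chip.

```python
# Toy throughput-vs-latency model (invented numbers, not real hardware).
# Bigger batches amortize fixed overhead (great for training throughput),
# but each step takes longer, which hurts a single user waiting for a reply.
def step_time_ms(batch_size, overhead_ms=5.0, per_seq_ms=2.0, parallel_eff=0.9):
    return overhead_ms + per_seq_ms + per_seq_ms * batch_size * (1 - parallel_eff)

for batch in (1, 8, 64):
    t = step_time_ms(batch)
    print(f"batch={batch:3d}  step latency={t:5.1f} ms  "
          f"throughput={batch / t * 1000:7.1f} seqs/s")
```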

BENCHMARKS
OpenAI Destroys Competition

The LMSYS team, behind what is considered the most respected LLM leaderboard, has published impressive results from its analysis of o1 models.

As shown above, these models are simply in a league of their own in areas like math. They are also the best overall for multi-turn, hard prompts, and more. Long story short, they are the best available general-purpose models, and it’s not close.

The key thing to note about the LMSYS leaderboards is that they also include ‘votes’ cast by real people. To prevent bias, there is a battleground where you are shown two responses to a prompt and, without knowing which models produced them, you pick the better one. So, besides the standard benchmark evaluations, the leaderboard is an accurate reflection of crowd sentiment.
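For reference, this is roughly how anonymous pairwise votes turn into a ranking. LMSYS fits Bradley-Terry-style statistical models these days, so treat the Elo-style update below as an illustration of the idea, not their exact method.

```python
# Elo-style rating update from anonymous pairwise votes (an illustration of
# the idea behind arena-style leaderboards, not LMSYS's actual code).
def expected_score(r_a: float, r_b: float) -> float:
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32) -> None:
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)   # winner gains more if it was the underdog
    ratings[loser] -= k * (1 - e_w)

ratings = {"model_A": 1000.0, "model_B": 1000.0}
for winner in ["model_A", "model_A", "model_B", "model_A"]:   # four battles
    loser = "model_B" if winner == "model_A" else "model_A"
    update(ratings, winner, loser)
print(ratings)
```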

In parallel, Alibaba today announced a new set of Qwen 2.5 models, up to twelve different models ranging in size and performance that, at the top end, are fairly competitive with Llama 3.1 405B, the crown jewel of open source, despite being more than five times smaller (72 billion parameters).

TheWhiteBox’s take:

There was little doubt this was going to be the case. However, it’s just more proof of how much we need new benchmarks that truly challenge frontier models.

As we discussed on Sunday, when these models are put up against questions that require novelty (questions where they can’t use their all-around knowledge to fake intelligence), like the ARC-AGI challenge, they are still ‘as bad’ as models like Claude 3.5 Sonnet.

And the list of examples of ‘simple stuff’ they can’t do is vast.

A group of researchers tested o1 models on simple multiplications. While they perform well up to 9×9, they severely struggle afterward (image below), using an absurd number of reasoning (‘thinking’) tokens, far more than a human would require.
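As a sketch of how such a probe might be set up: the harness below generates random n-digit multiplications and scores exact matches. The `ask_model` function is a hypothetical placeholder for whatever API call you would use; here it is stubbed to answer correctly so the script runs end to end.

```python
# Sketch of an n-digit multiplication probe. `ask_model` is a hypothetical
# stand-in for a real model call; the stub answers correctly so this runs.
import random

def ask_model(prompt: str) -> str:
    a, b = map(int, prompt.removeprefix("What is ").removesuffix("?").split(" x "))
    return str(a * b)

def accuracy(digits: int, trials: int = 100) -> float:
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    correct = 0
    for _ in range(trials):
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        correct += ask_model(f"What is {a} x {b}?").strip() == str(a * b)
    return correct / trials

for d in (2, 5, 9, 12):
    print(f"{d}-digit x {d}-digit accuracy: {accuracy(d):.1%}")
```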

Interestingly, the same researchers showed that a GPT-2-level model (just 117 million parameters), orders of magnitude smaller than the o1 models, reaches 99.5% accuracy when trained with clever techniques, far exceeding the frontier models’ results. Again, depth beats breadth!

To further understand how o1 works, read my Sunday piece if you haven’t already.

The overarching lesson? o1 models are a progression toward better reasoning, but not necessarily toward AGI.

GOOGLE
NotebookLM Might Leave You Speechless

Google has given us one of those ‘wow’ moments that could really be remembered in the future. They have released a new feature, running on Gemini 1.5 Pro, on NotebookLM, a place to store your notes and data. The feature takes these files and creates an entire podcast with two people actively describing the contents with insane accuracy and extremely high-quality voice generation.

Here are a couple of examples that will blow your mind (here and here). And the best thing is that you can try it, too.

TheWhiteBox’s take:

This is seriously remarkable.

While the idea that you can turn a random text into a scripted conversation isn’t that impressive, considering that’s precisely where LLMs shine, the fact that you can actually create the podcast with clearly distinct voices that overlap naturally in conversation is mind-boggling.

Among this week’s Premium-only content, I uploaded an article discussing what the process most likely looks like.

PRODUCT
Rysana’s Inversion Changes the Inference Frontier

While we are still digesting the idea that test-time compute is now possible thanks to the o1 models, Rysana has announced the Inversion model family, a set of models that guarantees 100% structured outputs with 100x throughput, 10x lower latency, and 10,000x lower overhead.

The demos are, quite frankly, surreal, as the model’s answers are instant and always structurally valid.

TheWhiteBox’s take:

The idea is that they perform a runtime adaptation of the model through pruning, keeping only the weights that matter, which speeds up the process. They offer some indications of how they did it, which I delve into in a link at the end of this newsletter.
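Since the full details aren’t public, here is only a generic magnitude-pruning sketch of the broad idea (keep the largest weights, zero out the rest); it is not Rysana’s actual method.

```python
import numpy as np

# Generic magnitude pruning: keep only the largest-magnitude weights.
# A sketch of the broad idea; Rysana's runtime adaptation is not public.
def prune(weights: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    flat = np.abs(weights).ravel()
    k = max(1, int(flat.size * keep_ratio))
    threshold = np.partition(flat, -k)[-k]          # k-th largest magnitude
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

W = np.random.randn(4, 4)
W_pruned = prune(W)
print(f"nonzero weights: {np.count_nonzero(W)} -> {np.count_nonzero(W_pruned)}")
```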

AI WEARABLES
Snap Launches Its AI Spectacles

Snap Inc. has finally released its new wearable, the Spectacles, which let you see the world in augmented reality.

Thanks to OpenAI’s LLMs, the glasses can take requests and support features such as Spectator Mode, which lets you share your experience with friends so they can see it from their own perspective; Phone Mirroring, which lets you mirror your favorite apps directly onto your Spectacles; and Controller, which lets you use your phone as a six-degrees-of-freedom controller to play games.

TheWhiteBox’s take:

I’m a firm believer in Y Combinator’s mantra, “build something people want” (they also happen to have announced a YC company that turns images into 3D settings).

In the case of Augmented Reality products, I feel these CEOs are blinded by their own delusions about the future. I don’t think anyone, ever, asked for this product (the same applies to the Apple Vision Pro, although that one packs supercharged technology that may eventually justify the product).

Today, AI remains a bunch of empty promises with little to show for them, and the problem is that instead of building what people want, we are trying to create new necessities that, while cool in a demo, don’t make my life better and end up in my closet.

Cool tech, though.

TREND OF THE WEEK
UI-JEPA, The Reason for Apple Intelligence’s Delay

I think I know why Apple delayed the release of Apple Intelligence, its suite of AI features (including a Siri revamp) arriving on the iPhone 15 Pro and later.

They have found a better way to deploy: a new architecture type, UI-JEPA, which is more efficient, vastly cheaper, and produces results as good as state-of-the-art models while running entirely on your iPhone.

This article encapsulates Apple’s AI strategy, insights into Apple Intelligence's future features, and an easy-to-understand view of how AIs understand our world, all in one.

Enjoy!

The Same Old Story

Having the most powerful technology humans have ever created is pointless if you don’t have the means to use it. And that is a real risk for Generative AI.

Using these models is easier on paper than in practice.

These models are notoriously expensive to train and run (inference) because they need:

  1. Huge amounts of data to learn (they are extremely sample-inefficient)

  2. Vast amounts of parameters to capture the patterns in data

Consequently, most frontier models are terabytes in size, hundreds of times larger than what an iPhone can hold in memory (just 8 GB of RAM, even in the new iPhone 16 family).

In case you’re wondering, AI models can’t be served from flash storage; they have to sit in RAM due to the excessive number of memory reads and writes. Running them from flash would lead to unbearable latency.
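The arithmetic behind those numbers is simple. A quick sketch, where the parameter counts and the 16-bit precision are illustrative assumptions rather than figures for any specific model:

```python
# Back-of-the-envelope: parameters x bytes-per-parameter vs. available RAM.
GIB = 1024 ** 3
IPHONE_RAM_GIB = 8

def size_gib(params: float, bytes_per_param: int = 2) -> float:
    # 2 bytes/param assumes 16-bit weights; quantization shrinks this further.
    return params * bytes_per_param / GIB

for name, params in [("3B on-device model", 3e9),
                     ("70B open-weights model", 70e9),
                     ("hypothetical 1T frontier model", 1e12)]:
    s = size_gib(params)
    verdict = "fits (barely)" if s < IPHONE_RAM_GIB else "does not fit"
    print(f"{name}: ~{s:,.0f} GiB at 16-bit -> {verdict} in {IPHONE_RAM_GIB} GiB of RAM")
```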

Consequently, if you want to use top AI models on your smartphone, you need to connect to the Internet, as the AI will run on the cloud (server-side compute).

This is not only a latency issue (the model is serving you and potentially hundreds of other users simultaneously, and the information has to travel back and forth, so response times can stretch), but also a privacy issue.

As you share your personal data with these models, that data must travel through the Internet, creating a clear cybersecurity risk (even OpenAI isn’t free from these attacks).

For Apple, a hardware company proud to be considered ‘privacy-centric,’ the combination of poor user experience due to high latency and privacy concerns makes using frontier AI models more of a problem than a solution. In other words, for Apple, running models on-device is a priority.

Naturally, Apple tried to solve this by training small models in the 3-billion-parameter range that could fit inside an iPhone. However, due to the data-hungry and size-hungry nature of Generative AI models, while these models could run on the smartphone, the experience was poor because the underlying AI was not very powerful.

This has led to delays, and Apple has even had to strike deals with OpenAI (and soon, perhaps, Google or Anthropic) to include their models in the Apple Intelligence offering, creating an unmistakable sense of desperation and drawing less-than-great feedback from its customer base.

However, with Apple’s new model, things might take an unexpected turn, and they might not need OpenAI after all. To understand why that’s the case, we need to understand the key intuition as to how machines understand our world: similarity.

Predicting Only the Necessary

To understand the dramatic change Apple has made with this research, we need to understand how models see our world.

AIs are still classical computers at the end of the day, so they can only see numbers. Therefore, when we want them to process text, images, or video, we must transform these into numbers, a process we call ‘encoding.’

ChatGPT and other LLMs are ‘decoder-only,’ meaning they use a look-up table to turn each word into an embedding (instead of using a separate encoder).

Either way, we are transforming each token (let’s assume the word ‘dog’) into a list of numbers (a vector).

But these numbers aren’t arbitrary; they tell a story about the underlying concept. In a way, you can look at these numbers as attributes, which draws similarities to my favorite analogy, sports video games.

In basketball, if I tell you a player has a 99 shooting rating, 9/10 dribbling skills, 8/10 passing skills, and 4 rings... you know it’s Steph Curry. Thus, Curry can be represented by the vector [99, 9, 8, 4].

And [90, 9.5, 9.5, 3]? With so few numbers it’s hard to tell, but that sounds about right for LeBron. That is a simple example of encoding: we represent a real concept like ‘Steph Curry’ as a set of numbers that tell us who Steph is in compressed form.

But why do we need them in vector form? Well, because it enables the machine to measure similarity between concepts.

As dogs and cats share similar attributes (mammals, four-legged, domestic, etc.), they will have similar vectors, indicating to the AI that both concepts are semantically similar. Therefore, AIs build an encoding space, usually called ‘latent space’ or ‘embedding space,’ where similar things have similar vectors (clustered together) and dissimilar concepts are pushed apart.
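A minimal sketch of what ‘similar vectors’ means numerically, reusing the made-up player attributes from above (real embeddings are learned, have hundreds or thousands of dimensions, and are usually compared with cosine similarity rather than raw distance):

```python
import numpy as np

# Toy similarity in a hand-written 4-dimensional 'attribute space'.
curry  = np.array([99, 9.0, 8.0, 4.0])   # shooting, dribbling, passing, rings
lebron = np.array([90, 9.5, 9.5, 3.0])
rookie = np.array([40, 3.0, 5.0, 0.0])   # a hypothetical low-rated rookie

def distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))   # smaller = more similar

print(f"Curry vs. LeBron: {distance(curry, lebron):.1f}")   # ~9  -> close together
print(f"Curry vs. rookie: {distance(curry, rookie):.1f}")   # ~60 -> far apart
```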

For instance, the image below represents UI-JEPA’s latent space, where data is clustered by topic. Data related to banking is all grouped together, as is data related to education or finance, while some less obvious data points are more dispersed.

Therefore, whenever you send text to models like ChatGPT, remember that the AI is seeing the vectors of each word, figuring out how they relate to each other (the attention mechanism), and working out the sentence’s meaning. Consequently, the more accurate these representations (known as embeddings) are, the better the AI understands our world.
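For the curious, that ‘figuring out how they relate’ step is, at its core, scaled dot-product attention. A bare-bones version, without the learned projection matrices a real Transformer would have:

```python
import numpy as np

# Bare-bones scaled dot-product attention over a sequence of word vectors.
def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # blend value vectors accordingly

X = np.random.randn(5, 8)          # 5 tokens, 8-dimensional embeddings (toy sizes)
print(attention(X, X, X).shape)    # (5, 8): one context-aware vector per token
```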

Sadly, while LLMs capture this meaning well, they require vast training and parameters to do so. But what if there’s a more cost-effective way to find these representations?

And that’s precisely what JEPAs offer and why some think Generative AI enthusiasts are, excuse my words, full of shit.

Learning What’s Necessary

Generative AI models, despite what markets and marketing will tell you, have a considerable amount of detractors.

Do We Need to Generate to See?

Among the most notable of these is Yann LeCun, Meta’s Chief AI Scientist, who argues that autoregressive transformers, known as Large Language Models (LLMs), are extremely overhyped and that the representations they learn (the vectors we discussed) do not adequately describe the underlying concept.

The biggest argument is that GenAI models like ChatGPT or Sora (AI-generated video) need to generate the future in order to predict it; they have to generate it in full to ‘see’ it (this sounds weird, but it will make sense in a minute).

In other words, if they want to predict that a tree will appear next in a video frame, they have to generate the entire tree, with all the minor details that, while indeed part of the tree, aren’t necessary to identify the object as a tree.

Using the Steph Curry example, a generative model has to generate every single hair on the player’s body to identify it as Steph Curry. Sure, there’s a very high chance he’s the only player in the NBA with that exact amount of body hair, but is counting body hair an optimal way of predicting an NBA player?

Didn’t we earlier see how we just need four attributes to identify Steph Curry from any other player in the NBA?

In a nutshell, what I’m trying to tell you is that Generative AI models predict everything, forcing them to learn necessary and unnecessary characteristics of a concept. In turn, that means that the representations they learn about the world are, most often, overblown in detail, which explains why these models take so long to learn stuff.

This leads us back to Yann, who has proposed JEPA (Joint-Embedding Predictive Architectures) as an alternative to LLMs to learn about the world, and an architecture that Apple has now used to create UI-JEPAs, a state-of-the-art model in a key area of Apple Intelligence: UI understanding.

The UI-JEPA Architecture

The UI-JEPA framework by Apple combines two models and excels at UI understanding, performing on par with state-of-the-art models like Claude 3.5 Sonnet or GPT-4 Turbo while having just 4.4 billion parameters (hundreds of times fewer) and requiring at least 50 times less training.

But what do we mean by UI understanding? In layman’s terms, it can see videos of users interacting with the screen of a smartphone and identify their ‘intent.’ For instance, if the user clicks the Clock app and sets a timer, a good model will output, “The user is setting a timer using the Clock app.”

This model could be essential to understanding the user’s intentions and, in the future, taking action on their behalf (if it can characterize the intent behind a particular interaction, it can perform the same interaction when that intent is requested of it).

Particularly, UI-JEPA is composed of two elements:

  1. A ViT encoder

  2. A small LLM decoder (Microsoft’s Phi in this case), used to generate the text describing the intent

The objective is clear: train an encoder using JEPA-style training to help it learn better representations more efficiently (faster and cheaper). Then, feed these improved representations to a small LLM (much weaker than a GPT-4-level model) so that it can describe the intent.
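Schematically, the two-stage pipeline looks something like the sketch below. Both components are stand-in stubs I made up to show the data flow; they are not Apple’s encoder or decoder.

```python
import numpy as np

# Hypothetical sketch of the UI-JEPA data flow: video frames -> embeddings ->
# a small language model that describes the intent. Both stages are stubs.
rng = np.random.default_rng(0)

def video_encoder(frames: np.ndarray) -> np.ndarray:
    # Stand-in for the JEPA-trained ViT: one 256-dim embedding per frame.
    flat = frames.reshape(frames.shape[0], -1)
    return flat @ rng.standard_normal((flat.shape[1], 256))

def small_llm_decoder(frame_embeddings: np.ndarray) -> str:
    # Stand-in for the Phi-style decoder conditioned on those embeddings.
    return "The user is setting a timer using the Clock app."

frames = rng.random((16, 64, 64))   # 16 toy grayscale frames
print(small_llm_decoder(video_encoder(frames)))
```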

But what do we mean by JEPA-style training?

The key difference between JEPAs and generative models is that JEPAs make predictions in the latent space, avoiding the need to predict unnecessary things. In other words, a JEPA doesn’t predict Steph Curry by generating every single detail about him (which is what LLMs would do); it predicts the vector that represents Steph Curry, so it only needs to capture the essential components of what makes ‘Steph Curry’ different from everyone else.

But how do JEPAs learn?

I won’t go into too much detail for length, but the idea is to show the AI a video of a user interacting with an iPhone, hide certain parts of a given frame or entire future frames, and make the AI predict the representations (vectors) of the missing parts.

For instance, if we mask the area of a frame where a dog’s tail is, the model should be able to predict that the mask is hiding a dog’s tail.

By masking parts of a given frame (above image, middle), the model learns spatial features: how things look and how they relate to each other at any given time. And by masking parts of future frames (right), the model learns temporal features: how things evolve.
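In code, the crucial difference is where the loss is computed. Below is a minimal, toy sketch of the idea (every component is a stand-in with made-up sizes): the model predicts the embedding of a hidden patch and is penalized in latent space, not over raw pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy JEPA-style objective: predict the *embedding* of a masked patch from the
# visible patches, and measure the error in latent space, not pixel space.
def encode(patch: np.ndarray, W: np.ndarray) -> np.ndarray:
    return np.tanh(patch.ravel() @ W)              # stand-in encoder

patches = rng.random((9, 8, 8))                    # 9 patches from one frame
W_enc  = rng.standard_normal((64, 32))             # shared encoder weights
W_pred = rng.standard_normal((8 * 32, 32))         # toy predictor weights

visible, masked = patches[:8], patches[8]          # hide the last patch
context = np.concatenate([encode(p, W_enc) for p in visible])
predicted = np.tanh(context @ W_pred)              # guess the hidden patch's embedding
target = encode(masked, W_enc)                     # its actual embedding

loss = np.mean((predicted - target) ** 2)          # latent-space (L2) loss
print(f"latent-space loss: {loss:.4f}")
```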

In case you’re wondering, this is radically different from how LLMs learn. They learn by ‘generating the next thing’ and checking whether that generation is accurate or not, which is a much harder task (and thus requires larger models and data sets).

And the results speak for themselves: these models match (or sometimes exceed) the performance of state-of-the-art models while being much smaller, which makes them faster to run and easier to host (they fit inside an iPhone); a best-of-both-worlds result.

With Apple’s endorsement, JEPAs might finally manage to steal the spotlight from Generative AI models, especially in situations of constrained compute or size.

TheWhiteBox’s take

Technology & Product.

I have no doubt Apple Intelligence will continue with this approach, using smaller models that learn faster and better using a new emergent architecture class, JEPAs.

Apple isn’t worried about ‘pushing the veil of ignorance back’ but about making AI products that improve its customers' lives while respecting one of Apple’s dogmas: privacy first.

And if JEPAs offer that sweet spot, they are taking it.

Markets:

To this day, I feel that markets aren’t really paying attention to technological breakthroughs. The feeling coming from Silicon Valley—and thus, markets—is that we don’t need more algorithmic breakthroughs, and we simply need to scale these beasts as much as possible.

I think this approach is not only hard to attain due to energy constraints; it’s also fundamentally flawed, arrogant, and doomed to fail. It seems like Apple and Meta (and partly Microsoft) have acknowledged this and are trying to make this technology more sustainable before we hit the wall.

THEWHITEBOX
Join Premium To Access More Unique Content

If you like this content, join Premium to receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.

This week includes…

Until next time!

For business inquiries, reach out to me at [email protected]