Apple Against its Own Demons

🏝 TheTechOasis 🏝

Breaking down the most advanced AI systems in the world to prepare you for the future.

5-minute weekly reads.


  • AI Research of the Week: Apple Might Have a Real AI Problem

  • Leaders: Is The Future of AI Becoming a Deadlier Ad Weapon?

🤩 AI Research of the week 🤩

For months, everyone in the AI industry has been asking the same question: ‘What is Apple doing?’

We now know, and it’s not looking good. Or, put bluntly, it’s outright embarrassing.

But while some are, in pure sensationalist fashion, proclaiming this is the beginning of the end for Apple, I argue there’s room for hope. The company of the bitten apple has just released what’s probably the biggest contribution to open-source AI in months: their MM1 model family.

Let’s start with the positive stuff before moving on to the very ugly part.

The MM1 Family, Apple’s Commitment to Open-Source

In its usual low-key way of publishing Generative AI research, Apple has silently presented the MM1 family, a set of very capable Multimodal Large Language Models (MLLMs) that achieve state-of-the-art results in many tasks.

But more importantly, they are the greatest contribution to open-source we have ever seen from a big-tech company.

Committed to openness

MLLM training pipelines are among the most closely guarded secrets in the world today.

Indeed, the broad guidelines are pretty much common knowledge:

  • the use of the Transformer and the attention mechanism as the backbone for data distribution learning,

  • the global training steps (pre-training and fine-tuning),

  • the alignment methods (both Reinforcement Learning-based like RLHF, or non-RL like DPO),

  • sparsity techniques like Mixture-of-Experts,

And so on.

Not only are all the previous methods known but, more importantly, they are common across all LLMs.

But with such large models, the problem is much less a theoretical one than an engineering and cost one.

In those situations, the devil lies in the details, and simple mistakes can mean losses in the millions of dollars.

Therefore, the small tweaks that some of the brightest minds of our time figure out behind closed doors in Silicon Valley are what set apart some LLMs from others.

Yet the moat is not only based on human talent, but also on compute: some tweaks can only be figured out through rote experimentation, something only the richest companies in the world can afford.

Consequently, some claim these companies have a moral duty to share their discoveries.

But, you guessed it, most don’t, which is why Apple’s recent research is so important.

The essence of MLLMs

In this case, Apple has taken the opposite approach. They have released a paper where not only do they explain all the ablations they performed (testing different alternatives to see what performs best), but they are also totally open about the data used — both in pre-training and fine-tuning — as well as several conclusions they reached during testing that could help open-source researchers improve their own work.

But before moving into the discoveries made by the team, a quick reminder of what a simple MLLM looks like.

In simple terms, they are models consisting of one or more encoders stitched to a decoder-only LLM. They take both image and text tokens and generate a text sequence, usually describing or reasoning about the information contained in the provided inputs.

But first, what is a decoder-only LLM?

ChatGPT, Gemini, and so on are what we define as ‘decoder-only’ models. In standard Transformers, you first transform the input data, be that text or images, into an embedding: a numerical representation of the data that machines can process.

This process is usually performed by an encoder, a neural network specifically meant for that task. Thus, for every input, the encoder embeds the data in real time.

The original components of the Transformer used on images.

Instead, ChatGPT and others learn those embeddings as part of the training. Consequently, whenever data is received by the model, this data is first tokenized, or ‘chunked’ into parts known as tokens, that the model recognizes and transforms into their respective embeddings.

Sounds much more complicated than it is.

In decoder-only LLMs, those tokens are looked up in an embedding matrix that stores the embedding of each token in a dedicated row, instead of having to compute it with an encoder, making the process much less computationally demanding overall.

These embeddings are then inserted into the decoder (the actual LLM, hence the name decoder-only) that uses them to predict the next word in the sequence.
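To make the lookup-plus-decoder flow concrete, here is a toy sketch in Python/NumPy. Everything here is invented for illustration (the tiny vocabulary, the dimensions, the random weights); a real LLM learns these matrices during training and runs the sequence through a deep stack of attention blocks instead of the single projection below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary: token string -> integer id
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
vocab_size, d_model = len(vocab), 8

# The 'look-up' matrix: one embedding row per token id.
# In a real LLM these rows are learned; here they are random.
embedding_matrix = rng.normal(size=(vocab_size, d_model))

def embed(tokens):
    # Tokenization already done: map ids to rows. This is a cheap
    # table lookup, no encoder forward pass required.
    ids = [vocab[t] for t in tokens]
    return embedding_matrix[ids]            # shape: (seq_len, d_model)

def decoder_next_token(token_embeddings):
    # Stand-in for the decoder stack (really many attention blocks).
    # Project the last position back onto the vocabulary to get
    # next-token logits, then softmax into probabilities.
    unembed = embedding_matrix.T            # (d_model, vocab_size)
    logits = token_embeddings[-1] @ unembed
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs                            # distribution over next token

seq = embed(["the", "cat", "sat"])
next_probs = decoder_next_token(seq)
print(next_probs.shape)  # (5,) -- one probability per vocabulary entry
```

Note that the lookup is just row indexing into a matrix, which is exactly why decoder-only models avoid the cost of running an encoder over every input.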

And how do we then allow these text-only models to work with images too, to transform an LLM into an MLLM?

Simple, you take a pre-trained open-source LLM like Vicuna and you stitch an image encoder like CLIP to it so that it can process images.

The image below depicts a complete representation of an MLLM, where the ‘Text Embedder’ is what we have just described as the embedding lookup matrix.

Source: SmileGate AI

The importance of grafting

Apple, like most open-source efforts, hasn’t trained the entire MLLM from scratch, but has used a pre-trained image encoder alongside their LLM to create the MM1 models, a process called ‘grafting’.

However, this stitching process won’t work as-is, because you are starting from an LLM and an image encoder that were trained separately.

Thus, you need to add an adapter that will take the output of the image encoder and project it — transform it — into the vector space of the LLM.

You can think of this adapter as another small model that turns image tokens into word tokens so that the LLM, a text-only model, can process them.
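As a rough NumPy sketch of that grafting step (the dimensions are invented stand-ins for a CLIP-like encoder and an LLM, the weights are random, and real adapters are trained and may be an MLP or a more elaborate resampler rather than a single linear layer):

```python
import numpy as np

rng = np.random.default_rng(0)

d_vision, d_llm = 1024, 4096            # invented encoder / LLM widths
n_image_tokens, n_text_tokens = 256, 16

# Pretend outputs of a frozen image encoder and of the LLM's own
# text-embedding lookup -- random stand-ins here.
image_features = rng.normal(size=(n_image_tokens, d_vision))
text_embeddings = rng.normal(size=(n_text_tokens, d_llm))

# The adapter: a small trainable projection that maps image features
# into the LLM's embedding space so they look like 'word' tokens.
W_adapter = rng.normal(size=(d_vision, d_llm)) * 0.02
b_adapter = np.zeros(d_llm)

image_tokens = image_features @ W_adapter + b_adapter  # (256, 4096)

# Concatenate with the text tokens and feed the whole sequence to
# the decoder-only LLM as if it were all text.
llm_input = np.concatenate([image_tokens, text_embeddings], axis=0)
print(llm_input.shape)  # (272, 4096)
```

The key point is that only the adapter needs to bridge the two vector spaces; the encoder and the LLM can stay frozen or be lightly fine-tuned.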

So, now that we understand how MLLMs are trained, let’s move on to the notable contributions of the paper.

Thanks, Apple

Apple has really dropped several jewels in this paper, including:

  1. Design best practices: Through detailed ablation studies, the research identifies several crucial design lessons. Importantly, referring to data, they found that a mix of image-caption, interleaved image-text, and text-only data is essential for achieving state-of-the-art few-shot results across multiple benchmarks.

  2. Impact of Image Encoder and Resolution: The study emphasizes the significant impact of the image encoder, image resolution, and the count of image tokens on the model’s performance. Conversely, the design of the vision-language adapter was found to be of comparatively negligible importance. You still need it, but its design isn’t that relevant.

  3. Scaling the Model: By scaling up the model size and exploring mixture-of-experts (MoE) models, a family of performant models was achieved. These models not only achieved state-of-the-art pre-training metrics but also competitive performance after fine-tuning on established multimodal benchmarks.

  4. Superior Few-Shot Performance: The configuration-optimized MM1 models demonstrated superior few-shot performance across a range of benchmarks compared to other pre-training approaches.

This list is not exhaustive. I deeply encourage you to check the paper for a complete set of insights.

But besides the benefits to society for being so open about how to create a state-of-the-art model, the MM1 family looks really good overall, with some very interesting capabilities as shown below:

Not revolutionary, but on par with what’s expected at the SOTA level in the sizes of the models they trained.

But enough of the good things regarding Apple and AI, as this paper alone doesn’t even come close to salvaging the narrative drift we are seeing around the company.

The Pressure is Mounting

The pressure on ‘AI’ companies to justify the hype and the value they accrued over 2023 is growing fast.

2024, confirmation or debacle

Throughout 2023, the flame ignited by ChatGPT at the end of 2022 caused any public tech company with the slightest scent of AI to soar in value.

Generative AI presented itself not only as a really powerful technology, but as a foundation on which companies could base their entire AI strategy.

Through the use of foundation models (like the one Apple has just presented), models that are good at multiple different downstream tasks, companies finally saw that this ‘AI’ buzzword was, in fact, legit.

Or so we thought.

Therefore, if tech companies want to retain the confidence of investors, 2024 should confirm that AI is indeed what it has been claimed to be, and not yet another Silicon Valley-led flop.

It’s no surprise then that most of the value accrued by AI companies during 2024 has flowed into NVIDIA, the only company that is really turning this ‘AI thing’ into tangible value, with soaring revenues that consistently beat analyst predictions.

But among the Magnificent Seven (Microsoft, Apple, Nvidia, Meta, Tesla, Alphabet, and Amazon), there’s no denying that one has much more to prove than the rest.


Against the curve

In pure Apple style, the Cupertino-based company has let others take the risks for them in terms of leading the charge in Generative AI.

It’s understandable, considering that Apple has fared pretty damn well by coming in as a second player and stealing the whole pie.

  • They did it with the smartphone,

  • they did it with the headphone market,

  • and they might do it with virtual headsets.

But is this the right strategy this time around? Too soon to tell, but the pressure is nevertheless mounting.

Out of the aforementioned companies, with the honorable exception of Tesla and the problems it’s having in the EV market, Apple is the only one with a negative stock performance this year, down 4% year-to-date at the time of writing.

It surely isn’t catastrophic, but while they are losing value, Nvidia has almost doubled since January 1st, Meta is up 40%, and Microsoft, Google, and Netflix are all posting double-digit gains too.

Even Amazon, which isn’t exactly leading the charge in GenAI either, is rocking a solid 17% increase.

Also, the US economy is pretty strong, with an astonishingly low unemployment rate of around 4%, and consumer spending, a critical metric for a consumer-facing company like Apple, is coming off the all-time highs it set last year.

Looking at the numbers and the macro environment, we should expect Apple HQ to be in a never-ending ‘champagne and cocaine’ fest, but it’s far from that.

What is wrong with Apple? To me, it’s as clear as a sunny day that the issue is their lagging AI strategy. 

It’s actually quite simple

Probably the easiest, yet naive, way of justifying Apple’s poor results is the poor performance of its flagship product, the iPhone.

But let’s be real, that can’t explain the short-term negativity around the stock, because iPhone sales have been stagnant for years now.

iPhone Sales until 2022. Source: Statista

Global revenues almost follow an identical trend:

Source: Statista

And AirPods unit sales peaked in 2020. Thus, none of these metrics comes even close to explaining what is going on with Apple.

Yes, Apple’s car flop doesn’t help, but to me it sounds more like a PR ‘L’ for Apple than something actually making investors nervous.

They also recently got fined $2 billion by the European Union for breaking antitrust laws against Spotify.

“What a tragedy” might have been Tim Cook’s thoughts after losing the equivalent of 2 days of revenue or 10 days of free cash flow.

I can answer for you: neither Tim nor investors could care less about this fine.

But then, what is going on? Simple, they are slowly but steadily losing the AI narrative. 

While everyone agrees Microsoft, Meta, and Google are leading the AI revolution, Apple is seen as the laggard.

Everyone seemed to be OK with this, as Apple was hinting it was working on — and investing in — Generative AI “quite a bit”, as confirmed by Tim Cook in a November 2023 statement.

But more recent announcements paint a picture that is, bluntly speaking, embarrassing.

Straight to the point: Apple has reportedly been in talks with both OpenAI and Google to license their LLMs, ChatGPT and Gemini respectively, for use in Apple’s products.

Translated to investor terms, “We are quite-a-lot-well-actually-incredibly behind the other big companies in terms of our AI initiatives”.

What many people feared is now confirmed: Apple isn’t getting things straight with Generative AI, and that is a huge, huge problem.

They Better Get Their Act Together. Fast.

There’s no way around it, Apple has a real problem with AI.

Their overly zealous commitment to the Apple Vision Pro, an immensely risky bet, has probably left the company with little margin to focus on what was really moving the needle, AI.

They sure as hell know this by now, and have consequently shifted their strategy dramatically; but, in pure Warren Buffett style, the tide is going out and they seem to be the only ones swimming naked.

In my humble opinion, their decision to embed OpenAI’s and Google’s models into the iPhone is surely amazing news for their consumers, including yours truly, as tools like Siri feel almost prehistoric at this point.

But it’s really not great news for Apple shareholders, who are watching their golden egg rot by the day.

Luckily for Apple, they are a huge, immensely cash-rich company, so maybe it’s about time they reconsidered their allergy to acquisitions and turned it into some M&A goodness. That would lift the spirits of the many investors who, today, are much less faithful in Apple’s capacity to stay afloat in the AI age than they were yesterday.

This is not financial advice. I do not directly own Apple stock (nor am I long or short on it). All my public investing is done through index funds, which should signal to you that I am no financial advisor. Please perform the necessary due diligence on your side when making financial decisions of this caliber.

💎 This Week’s Sponsor 💎

If you’re looking to level up your AI game, the Growth School might be the place for you. Top Start-up in India 2023, and with high praise across TrustPilot and Google Reviews.

Learn from AI experts from Google and others.

Become an AI & ChatGPT Genius in just 3 hours for FREE!  (Early Easter Sale)

Join ChatGPT & AI Workshop (worth $199) at no cost (Offer valid for first 100 people only) 🎁

👾 Best news of the week 👾

👾 First video of a Neuralink chip being used by a real human. Must watch!

👾 NVIDIA announces its next moonshot, Gr00t

👾 New possible leak of OpenAI’s Q*, the introduction of energy-based models (not confirmed)

🥇 This week on Leaders… 🥇

This week in the Leaders segment we will take a turn to the dark side of AI. Don’t worry, this is not one of those ‘AI is going to kill us’ speeches that will make you side-eye.

In fact, sometimes I feel that this ‘AI as an existential threat’ is just a tactic by AI companies to distract society from the real threats of AI, like the AI-based hyper-specialized ad-targeting weapons that many of the industry’s incumbents will soon have at their disposal, thanks to video models like Gemini 1.5 or Sora.

Thus, my objective is to put into perspective how real this threat is for you and your family, and how we are going to get gaslit into oblivion to allow these systems to exist in the first place.

Click upgrade if you care about the future that is upon us.

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]