TheTechOasis
Posts
Why is AI Failing?

Why is AI Failing?

Ignacio de Gregorio Noblejas
May 26, 2024

🏝 TheTechOasis 🏝

part of the:

In AI, learning is winning.

While Thursday’s newsletter discusses the present, the Leaders segment on Sundays will inform you of future trends in AI and provide actionable insights to set you apart from the rest.

10-minute weekly reads.

😟 Why Is AI Failing? 😟

Here’s a number that will shock you.

Since the explosion in popularity of AI with the release of ChatGPT, Microsoft, Alphabet, and Amazon have added around $2.5 in market cap value.

Yet, to date, these same companies have generated around $20 billion directly from AI applications, 120 times less. Unsurprisingly, this failure pattern extrapolates to almost every corner of the industry (except hardware), especially when deploying AI.

Now, with Google’s latest - yet absolutely hilarious - AI product failure in their new AI-based search we talked about on Thursday, it got me thinking:

Why are GenAI products failing so badly?

Today, we are addressing just that. But besides the current issues, the truth is that most people use them wrong, and with just a few simple yet high-signal tips, you can go from an unimpressive experience into a remarkable one.

And we are covering those, too.

💎 Big Announcement 💎

Next week, I am officially launching my community, TheWhiteBox, a place for high-quality, highly-curated AI content without unnecessary hype or ads, across research, models, markets, and AI products.

With TheWhiteBox, we guarantee you won’t need anything else.

No credit card information required to join the waitlist.

A Glorified Shithole

Two sides of a coin; even if AI is undoubtedly legit, most current applications suck big time. And we have a very long list of examples, both at software and hardware levels.

Thus, we must analyze the most notable examples to understand what is happening. And you will be surprised by some of the names.

Your Inferior Copilot

When Microsoft Copilot was presented to the world, everyone, including me, was completely amazed. Over a year later, underwhelming is the word.

Slow, not the smartest chatbot you’ve ever used for sure; it fails a lot, and with multiple Reddit threads crying out their pains.

All in all, very far from the high expectations they set. While the idea is brilliant, it was way ahead of its time. Also, as long as the model is cloud-based, the speed will never be amazing.

Furthermore, when not used in a strict Microsoft Office workspace, it’s basically unusable, and the reasoning and planning capabilities are lacking, as the models simply aren’t good enough.

The biggest proof of this is that during Microsoft’s last earnings call, Satya Nadella did not mention direct revenues from Copilot, which they definitely would have if they had the opportunity to brag.

He did mention they already had 1.3 million paid GitHub Copilot users, a product that is well-considered among developers. However, this product is a net loss for Microsoft, as infrastructure costs go well beyond what users pay (at times up to four times).

They did’t mention revenues either so that could very well still be the case.

On a positive note, the recent releases of ChatGPT-4o, already available to Copilot, and Phi-3 Silica, an on-device LLM for Microsoft’s new AI PCs, might change the wind’s direction.

These models strike a nice balance between performance and size, with minimal latency, improving the experience. Either way, to this day, Copilot is far from being considered a great product.

But things are about to get much, much worse.

Come ChatGPT, do something.

While OpenAI is undoubtedly pushing the industry forward, many of its products have been underwhelming, too. For instance, GPTs are certainly not a life-changing experience.

Even I, an avid user of ChatGPT, never use GPTs, and I mean never. They might give in to some interesting conversations, but that’s not ‘the new Apps store’ as some claimed (yours truly included); that’s a toy you play with for 20 minutes and then never use again.

And although OpenAI might still be growing its revenues, the multiples some of these companies measured against current run rates are simply absurd, as illustrated by The Information:

And these multiples aren’t measured against real sales. They are projected revenues, meaning they are based on what these companies expect to generate, making them look even more hysterical.

But why aren’t GenAI products exploding in revenues? Well, there seem to be two main reasons: retention and utility.

For starters, the products just don’t stick. As shown by Sequoia Capital, one of the biggest venture capital firms in the world, the monthly retention for the average GenAI product is terrible.

These numbers are months old, but the products haven’t changed that much to justify any significant change.

Also, daily use, a key metric to evaluate the health of subscription-based products (an overwhelming majority business model in the space), shows that users rarely use the product daily, indicating that real value just isn’t there.

In fact, the painted picture is even worse if we look at these products' DAU/MAU ratio. For instance, only 12% of monthly active users use ChatGPT daily:

Now, this is a huge problem. We mustn’t forget these products are the result of billions, with a ‘b’, of dollars poured into these research labs… only to build products that users really don’t need that much.

Frankly, for AI companies to ever hope of justifying their valuations and for investors to make outsized returns, these products must become an extension of their users, ubiquitous in their lives, just like social media - sadly - is.

But today, that need, that ubiquitousness, is simply not there. And if we look at the other side of the coin, costs, things get worse.

As we discussed a few weeks ago, LLM providers’ actual profits aren’t particularly enticing either, with the now-famous article about Anthorpic’s unimpressive 50-55% gross margins, well below the expected margins of a software company.

Of course, you can claim that costs will decrease over time. Indeed, over the last few years, the prices of products like ChatGPT have decreased by orders of magnitude.

But so did prices, as the intensive competition and the, quite frankly, little to no differentiation features and the very small switching costs of a purely API-based model, give all the signals that all these companies are in a clear race to the bottom in prices unless one of them comes with an overwhelmingly superior model, which is far from being the case today.

However, basing our whole analysis on text-based models only would be very unfair. Thus, how do things look in other modalities?

Multimodality Opens Doors… and Bad Stuff Too

With the advent of more powerful multimodal models that can process data in various ways besides text, the number of possible use cases increases considerably.

Having models who can see what you’re seeing, talk with you, and provide you with helpful support throughout the day is clearly a future some people look forward to.

However, we can’t truly make a case for or against multimodal models today, as their participation in actual products is nonexistent. That said, some products will soon have these capabilities available, which could be a game changer.

Meta and RayBan collaborated to create the Meta glasses. Today, these glasses run on text-only models, but with more powerful yet efficient multimodal models like Meta’s Chamaleon already presented, it’s a matter of time before you start seeing people around you rocking these products.
Brilliant Labs presented the Frame glasses, standard-looking glasses connected to OpenAI’s models to provide a real-life experience that could be highly reminiscent of what Google presented in their I/O conference with Project Astra.

However, multimodality’s potential value creation is based on pure speculation today. Also, the fact we are giving models permission to see everything we do, as Microsoft’s newest AI feature for their laptops, “Recall”, has some people claiming we are on the path to massive AI surveillance.

Continuing on non-text modalities, we have to talk about video generation models like OpenAI’s Sora or Google’s Veo, as these models hope to create a new range of powerful AI products with unique value propositions.

Sadly, the negative trend not only continues… it worsens.

Great Promises, Still Underwhelming

With Sora, we again end at the same place: the product promises way more than it actually delivers.

For instance, models like Sora or Pika are available through Adobe Premiere, as we mentioned a few weeks back. But have these models resonated with its users?

The answer seems to be unequivocally no. For starters, you can actually see it for yourself in Washed Out’s music video for ‘The Hardest Part.’ The scenes are chaotic, almost like you are full-on drugs.

For a more hands-on perspective, I found this blog on Sora particularly eye-opening, clarifying what many feared with the initial release: the thing is basically unusable for anything resembling a professional environment.

Another Sora user, Mike Seymour, wrote that Sora was, again, totally unusable for film-making.

The reasons for such statements were that, besides being painfully slow in its generation, the model was completely inconsistent with objects and frames. The former writer summed it up clearly: “It seems Sora knows how things look but doesn’t know what they are.”

But, without a doubt, the worst reviews in the entire industry have come from GenAI hardware products, even reaching scam claims.

Frankly embarrassing

Since the release of ChatGPT, transitioning from screens to voice interfaces has become one of the hottest conversations in Silicon Valley.

The idea was simple: akin to smart glasses, create a piece of hardware you could communicate with through voice interaction with an underlying AI model.

One example is the Humane AI Pin, a magnetic product you can attach to your clothes. And for the rather elevated price of $749 plus a monthly subscription, you can also have conversations with OpenAI models.

But after the release, the results have been catastrophic, as illustrated by our friend Marques Brownlee, who had a really painful-to-watch review that, I’ll save you the hustle, concluded that it was the worst product he had ever reviewed.

Humane has not recovered from the blow, to the point that it’s allegedly looking to be acquired.

However, things got much worse with our next flop, the Rabbit R1, a product I was very optimistic about due to its unorthodox approach to AI models.

Rabbit claimed the R1 was neurosymbolic, meaning they have a neural prior to interpret voices and screens, and a human-crafted reasoning engine to interact with the different applications through a ‘super host’ cloud service you can’t directly see.

Truthfully, Marques’ review on this wasn’t as bad, but he also concluded that the product was still very far from worth buying.

But that’s not all: As YouTuber Coffeezilla highlighted on Friday, this product could very well be a scam.

Hysterically, the proprietary AI behind the R1 doesn’t seem to exist. In reality, it’s all off-the-shelf OpenAI models and Chrome web extensions like Playwright.

In layman’s terms, it’s ChatGPT-3.5 and a handful of ‘if/else’ statements, meaning that the model isn’t really understanding the screen, but performing human-crafted actions over it.

Thus, the action fails if the screen suffers any minor change (aka it’s dumb as a rock), so the model is extremely buggy today.

Painful to Watch

Long story short, if we have to summarize the GenAI experience as a headline, it would certainly be:

“A range of overpromised yet underwhelming products way ahead of their time by companies that are desperate to justify absurd valuations, even scamming you if necessary.”

However, even though the surface might look like coal, you will find hidden diamonds beneath it if you look carefully. In fact, simply by applying a few tips most people won’t do, you can have a great experience with AI today despite the unnerving results.

The Do's

Although these companies still have a way to go in making their products outstanding, the truth is that people use them incorrectly most of the time.

I'm not trying to sound condescending; nobody has really explained to society how to go beyond the hype and use them as designed.

Thus, I am going to give you a set of key tips that you can use immediately to increase your results.

Chatting with Your Data

As the name implies, Generative AI models are ‘generative.’ But what does that mean? Well, simply put, they are trained to ‘regenerate’ the training data.

In other words, their success depends on how well they have learned the training data and can regenerate it. In layman’s terms, they are trained to imitate the training data.

This is why frontier AI models aren’t that intelligent today. Just like memorizing math theorems doesn’t make you a mathematician, imitating human language doesn’t make you a human-level ‘being.’

Therefore, the most important principle when using GenAI products is to use them only in situations requiring knowledge elicitation.

But what do I mean by that?

One way you are guaranteed to have a great experience is in ‘Chat with your data’ use cases. In other words, you hand the model data, via PDF or prompt, and you question the model about it.

Today’s models excel at in-context learning, meaning they can efficiently work with data they haven’t seen before. However, don’t get too carried away with this capability; use it wisely.

In layman’s terms, ask questions about the data; don’t use that data to ask questions beyond it.

In other words, the fact that the model can process the data in your PDF file doesn’t mean that it can use that data to generate new thoughts or insights.

Remember, it’s all imitation, and nothing is new or ‘emergent,’ as some in Silicon Valley will want you to believe.

The bottom line is that any use cases focusing on knowledge management and elicitation are a great suitor for GenAI today.

Leverage the Training

As models are trained on imitation, research labs have modified their training datasets so that the model imitates certain behaviors. This is absolutely key.

Nevertheless, even though they are considered general-purpose models and thus can generalize into multiple tasks… they can as long as they have been well-trained in them.

Luckily, just taking a quick look at some of the open datasets used to train these models will give you great insights into how to use them:

The example above, used to train Microsoft’s Orca models, shows many things:

The instruction provided by the user is clear and detailed.
The system prompt, an additional layer of behavior control, specifies what the model should expect.
Finally, the response is what the model should respond to in that case.

This gives us plenty of intuition on how to use these models:

These models aren’t trained to work well with vague questions. They are the epitome of GiGa: Garbage inputs will give garbage outputs. Hence, be clear, structured, and detailed in your requests. In fact, if you can define a conversational script the model has to follow, LLMs excel at it, which explains how valuable they are for customer support already.
LLMs love impersonation. Tell the model it’s a boat expert if your question is about boats. As proven by research, “an LLM prompted to be a bird expert describes birds better than one prompted to be a car expert.“

Why does this happen? Well, it’s the name of the game: Imitation!

As we saw in the previous example, when the question was about birds, the trainer added the ‘bird expert’ part to the training example. Thus, by asking it to act as a bird expert, you maximize the chances the model imitates that precise question and behavior, eliciting its knowledge better.

Stick to meta tasks. Don’t ask your LLM to solve world poverty; it has most definitely not seen that request in training as nobody knows the answer. The model was trained for summarization, listing, or language translation tasks. Leverage the training.

Embrace expertise

Another important aspect of these models is that people get carried away by their names. Brands like ‘ChatGPT’ or ‘Claude’ carry a lot of weight, but they are rarely, if ever, better than models fine-tuned specifically for a task.

Therefore, instead of making decisions based on branding, take an expertise-based approach.

For example, Predibase proved that, on average, by spending just $8, they could fine-tune a Mistral 7B model to outperform GPT-4 in up to 25 tasks.

If you aren’t willing to go through the process of learning how to use open-source models, frontier models are still a great choice but will require extensive prompt engineering (carefully crafting the prompt to maximize efficiency).

As we have seen, several small tips can really make a difference.

However, knowing where not to use these models is just as important. Thus, avoiding the following mistakes is crucial too, especially “the Kahneman test”.

Subscribe to Full Premium package to read the rest.

Become a paying subscriber of Full Premium package to get access to this post and other subscriber-only content.

Upgrade

Already a paying subscriber? Sign In.

A subscription gets you:

• NO ADS
• An additional insights email on Tuesdays
• Gain access to TheWhiteBox's knowledge base to access four times more content than the free version on markets, cutting-edge research, company deep dives, AI engineering tips, & more