Biggest News of the Week and Transfusion

THEWHITEBOX
TLDR;

  • 🫢 OpenAI’s Japan CEO Leak

  • 🔣 Paradigm, Reinventing spreadsheets with AI

  • 😎 Gemini Comes for Free to your Chrome browser

  • 🤔 OpenAI’s SearchGPT Gets Mixed Results

  • 💰 Ilya Sutskever Raises $1 Billion to Create Safe Superintelligence

  • 🥵 NVIDIA Subpoenaed by DOJ, or not?

  • 🤨 Oprah Winfrey’s AI Special Sparks Huge Controversy

  • [TREND OF THE WEEK] Transfusion Architectures, a deep dive into Meta’s new architecture that combines the two key models in AI today to create the perfect multimodal solution.

Timely, budget-aligned software delivery meeting your business goals

  • Quick start and timely project delivery

  • Industry-leading expertise covering your entire SDLC

  • ELEKS-owned responsibility for your project's success

NEWSREEL
OpenAI’s Japan CEO Leak

OpenAI Japan’s CEO, Tadao Nagasaki, may have leaked the biggest news of the year, potentially confirming the release of a new OpenAI model this fall. The Information leaked the same story, which I covered in much more detail in TheWhiteBox Notion knowledge base.

In the image above, Tadao illustrates a potential 100x increase in ‘model intelligence’ (y-axis) relative to their current flagship model, GPT-4o, indicating that we should see a step-function improvement in model quality before the end of the year.

TheWhiteBox’s take:

From what we uncovered in our Notion article linked above, OpenAI seems to be following a multiple-model strategy.

It will first release a model codenamed ‘Strawberry,’ most likely the GPT-Next shown by Tadao, which would already be a considerable improvement. Furthermore, they are allegedly using this model to generate trillion-token-scale synthetic data to train an even better one, codenamed Orion, which could become GPT-5 but will not arrive anytime soon (end of 2025, probably 2026).

Now, how legit is all this?

Well, OpenAI are master storytellers. They control the narrative; they are the iconic brand in AI. However, for the past year, most of these stories have been more hype than truth, with nothing to show for their grandiose promises.

And with a deeply unprofitable business losing billions every year, a potentially huge investment round around the corner, and an industry (LLMs) that open-source is rapidly commoditizing, the need for hype is greater than ever. Take this news with a pinch of salt.

However, it’s no secret that OpenAI needs a banger. And they need it soon.

PRODUCT
Paradigm AI Wants To Reinvent Spreadsheets

In a remarkable demo, Paradigm AI, a Y-Combinator start-up, has emerged from stealth with a product that turns every cell in your spreadsheet into a ‘GenAI environment.’

The link includes a small video demonstration showing the product filling the spreadsheet in seconds based on a user request.

TheWhiteBox’s take:

As with any GenAI product demo, you can ‘feel the magic.’ But, sadly, most of these demos rarely translate into reality as expected, as Generative AI’s worst enemy, unreliability, kicks in when actually deploying them.

Having a ‘spreadsheet AI’ that can turn your analytics game into a declarative one, where you do the questioning and the AIs do the work, is extremely compelling. But nobody will implement a product you can’t trust, especially when spreadsheets are highly deterministic and used in error-sensitive tasks.

Therefore, while the demo is inspiring and worth sharing, and despite the company already counting customers like McKinsey and Bain, don’t expect me to openly recommend this product anytime soon, as I did with Cursor.

GOOGLE
Talk with Gemini in Chrome for Free

Google has made Gemini queryable from your Chrome search bar. Simply type ‘@’ into the search bar, click ‘Chat with Gemini,’ et voilà.

Of course, you must pay to access Gemini Advanced, which hosts the best models, but it’s a neat new feature nonetheless.

TheWhitebox’s take:

With this, Google has gone an extra step in introducing its Multimodal Large Language Model (MLLM) into our lives (whether you like it or not), probably at a considerable loss (nobody is making a profit, ask OpenAI).

But for a company that has spent $13 billion on AI CAPEX in the last quarter alone (page 6), a humongous $52 billion run rate for the year that is larger than car manufacturer Volvo’s entire market cap, they are desperate for MLLMs to become everyday tools.

Will they succeed? Trillions of market dollars are on the line awaiting the answer.

AI SEARCH
Shocking: SearchGPT has Mixed Reviews

Yes, that was sarcasm. In July, OpenAI announced SearchGPT, their product aimed at democratizing AI-based search: unlike with traditional engines like Google or Bing, the user asks a question and the AI does the searching, providing an immediate response plus follow-up links.

This product has been released to some power testers, and, unsurprisingly, the results were not great, especially regarding hallucinations, a misnomer (confabulations or bullshitting are much-preferred terms) for situations where the AI generates wrong answers.

Undoubtedly, as long as these models can’t be trusted, it’s tough to fathom a paradigm shift in search that many have proclaimed (including myself). And while I do believe this paradigm shift will inevitably occur, current models are simply not ready.

For real examples of SearchGPT, Matt Berman uploaded a video trying the product.

In the meantime, You.com has raised $100 million in a Series B round to build ‘productivity engines’: search engines that leverage AI for individual and team tasks, making the already crowded search industry as competitive as ever.

SUPERINTELLIGENCE
Ilya Sutskever Raises $1 Billion to Build Superintelligence

Ilya Sutskever, OpenAI co-founder, ex-Chief Scientist, and the person who fired Sam Altman back in November 2023, departed long ago to build ‘safe superintelligence,’ which is the literal name of his new company.

For months, many speculated about what ‘Ilya had seen’ that made him so nervous about progress that he took the controversial decision of firing the star CEO, whom he assessed was going ‘too fast.’

Many assume that the model that spooked Ilya was Strawberry, the model that could be released this fall.

Now, he and his co-founders have raised $1 billion at an alleged $5 billion valuation (undisclosed) from top VC funds like Sequoia Capital and a16z, and will use the money to buy compute power and hire top talent.

TheWhiteBox’s take:

SSI is, by far, one of the biggest ‘what if they are right’ stories in history. While famous researchers like Yann LeCun and François Chollet believe we are decades away from achieving Artificial General Intelligence (AGI), how far away is the Artificial Superintelligence (ASI) that Ilya aims to build?

Well, Ilya claims it’s pretty close, and he’s distraught that we might get there before we know how to control such systems. However, by now, you will have realized I do not buy into all this Silicon Valley hype.

Specifically, many of the researchers claiming that ‘AGI is near’ are strong proponents of Rich Sutton’s The Bitter Lesson, which argues that compute is all that matters and that architectural or data breakthroughs are secondary.

Under this view, AGI is evidently near, as the unit costs of compute drop by orders of magnitude every year. For instance, GPT-4 is hundreds of times more expensive than GPT-4o mini despite performing worse, with their releases separated by just a year.

From a market perspective, it’s hard not to be amazed by the valuation. A year ago, I wrote a viral piece on Medium about Mistral, three guys who raised $100 million with no product, just a vision and their CVs. I was amazed (they still delivered, though).

Now, we have three other guys raising $1 billion, ten times more, while having no product and a website that looks literally like this.

It seems outrageous, but don’t forget that the Hyperscalers (Google, Amazon, Microsoft, and Meta) will pour $210 billion into AI data centers this year, making SSI’s round look like breadcrumbs.

And to put into perspective how little $1 billion is today, here’s a fact: Microsoft generates $1 billion in free cash flow in just three days of operations.

REGULATION
NVIDIA Subpoenaed By DOJ?

NVIDIA’s stock fell sharply (10%, or $280 billion) after reports were released about a potential subpoena by the US Department of Justice for possibly breaking antitrust laws.

However, NVIDIA denied the accusation a few hours later, claiming they had not been subpoenaed.

TheWhiteBox’s take:

Although this seems to be a nothing burger, it gives a striking idea of how reactive and nervous markets are about the valuations of some of these companies.

Sure, from multiple perspectives, NVIDIA doesn’t look too expensive. Its forward P/E (the price of the stock over the expected profits for the year) is 37, nothing too extraordinary considering that during the Dot-com bubble, the average P/E of the 100 companies in the Nasdaq was 60.

But, in my humble opinion, investors are worried about the nature of its profits.

In a nutshell, NVIDIA is seen as the thermometer of the AI industry. Everyone is in because they want to be exposed, but they also want to be the first out if things go south. Markets are very nervous.

CONTROVERSIES
Oprah’s ‘AI Special’ Sparks Criticism

Last Thursday, ABC announced an Oprah TV special, ‘AI and the Future of Us: An Oprah Winfrey Special,’ which will air on September 12th to cover AI and the future that awaits us all.

However, the announcement has been criticized, as people claim that the TV special will simply become an “AI sales pitch” for Bill Gates, Sam Altman, and others.

TheWhiteBox’s take:

Looking at the complete guest list (in the second link), the criticism is completely understandable.

We are about to see an hour-long show-off of ‘AI visionaries’ with billions on the line telling us how Generative AI, a technology that has generated negligible value compared to its costs, will ‘change the world’ or ‘solve all physics,’ the latter of which is laughable, to say the least.

But what worries me the most is FBI Director Christopher Wray’s participation, as he will discuss “the terrifying ways criminals and foreign adversaries are using AI.”

While I don’t want to get ahead of myself, prepare to be gaslit into thinking that AI is the most dangerous tool humanity has ever created and that we face tremendous existential risks because of ‘its creation’ unless it’s closed-source… with absolutely no proof to show for it.

Why? Well, because that threat does not exist today.

Current frontier AI models are compressions of the Internet; they can’t generate anything that isn’t already publicly available there. While they will refuse to answer questions about how to build cheap bombs, that information is 3-5 clicks away on Google.

While I’m not into conspiracy theories, this special comes at a particularly uncanny time, as Governor Gavin Newsom could sign SB-1047 into law this very month, a bill openly rejected by the vast majority of the industry, and even by politicians in both parties, that could kill open-source completely (in California).

Thus, it takes no genius to realize that scaring the public about AI lays the perfect ground for the bill to pass.

If that happens, AI creators could be liable for the misuse of their products by third parties. This would inevitably deter anyone but extremely rich corporations from creating AI, consolidating an unequivocally powerful technology in the hands of a culturally undiversified and small group of people in Silicon Valley.

PREMIUM CONTENT
Premium Content You Have Missed…

TREND OF THE WEEK
Transfusion: One Architecture To Rule Them All

Meta has done it again.

They have presented Transfusion, a new architecture that fulfills the dream of many: uniting the worlds of the two dominant architectures, autoregressive models and diffusion transformers, while reaching state-of-the-art performance in both, something that neither OpenAI, Anthropic, nor Google can claim.

The Bleeding State-of-the-Art

Today, there are two prominent types of models in AI:

  • Autoregressive LLM Transformers. Models like GPT-4o (ChatGPT) generate the output to a user input one token (word/subword) at a time.

  • Diffusion Transformers. Models like Flux (used by Grok-2), DALL-E by OpenAI, or Imagen (used by Gemini) generate images (or video, like OpenAI’s Sora) using a process known as diffusion, where the entire output is generated at once.

Do they have something in common? Yes, they both rely on the Transformer architecture, meaning they use the omnipresent attention mechanism.

The attention mechanism allows tokens in a sequence (words in text or pixel patches in images) to talk to each other. By doing so, each token can update its meaning regarding the rest of the sequence.

For instance, a patch of pixels may represent the ear of a Siberian Husky, a Malamute, or a Wolf. Thus, the patch by itself only knows that it’s an ear, but not the animal. Hence, it talks to the rest of the patches in the image to figure out which animal’s ear it’s representing. This idea underpins all frontier AI research today.

For a closer review of this mechanism, read here.
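To make the idea concrete, here is a minimal NumPy sketch of the attention mechanism described above, where every token’s representation becomes a weighted mix of all tokens in the sequence. This is an illustrative toy, not Meta’s or anyone’s production code; the matrix shapes and projection names (`Wq`, `Wk`, `Wv`) are standard but chosen by me for the example.

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Each row of X is a token embedding; attention lets every
    token update its representation using the whole sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # token-to-token similarity scores
    # Row-wise softmax: how much each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each token becomes a weighted mix of all tokens

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): same sequence length, updated embeddings
```

The key property for what follows is the `scores` matrix: masking entries of it is exactly how autoregressive and diffusion transformers differ.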

However, at a first-principles level, these models differ in how they apply this attention mechanism.

  • In autoregressive transformers, words can only look back in the sequence; they can only pay attention to words that have come before them (masked attention). This ensures that the next prediction is only conditioned on previous words, not future ones (because the adequateness of a word in a sequence solely depends on the words that came earlier, at least from a purely grammatical standpoint).

  • On the other hand, diffusion transformers apply bidirectional attention, meaning pixel patches in an image can look at any other patch. Unlike text, images require this because all patches are generated together and influence each other.
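The difference between the two attention patterns boils down to the boolean mask applied to the attention scores. A quick sketch (my own illustration, using a sequence length of 4):

```python
import numpy as np

n = 4  # sequence length (text tokens or image patches)

# Causal mask (autoregressive): position i may only attend to positions <= i,
# so the next-token prediction never sees the future.
causal = np.tril(np.ones((n, n), dtype=bool))

# Bidirectional mask (diffusion): every patch attends to every other patch,
# since the whole image is denoised jointly.
bidirectional = np.ones((n, n), dtype=bool)

print(causal.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

In practice, positions where the mask is `False` have their attention scores set to negative infinity before the softmax, zeroing out their weight.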

Another critical difference is that they generate the output differently.

  • As mentioned, autoregressive LLMs like ChatGPT stream the response; users can see the different tokens generated one step at a time instead of seeing the entire sequence generated in one go.

ChatGPT is rumored to use speculative decoding to generate multiple tokens at once for faster generation, so it isn’t really generating one token at a time, but many. It’s still autoregressive nonetheless.

  • In diffusion, the model takes an entire canvas in a noisy state and generates the output by denoising it conditioned by the user’s request. As the original noisy canvas is random noise, the actual outcome in each generation varies slightly.
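The two generation loops described above can be caricatured in a few lines. This is a deliberately toy sketch of my own; the lambdas stand in for real models, which predict tokens and denoising updates with neural networks.

```python
import random

def autoregressive_generate(prompt_tokens, next_token, max_new=5):
    """Autoregressive: extend the sequence one token at a time,
    each step conditioned only on what came before (streamable)."""
    seq = list(prompt_tokens)
    for _ in range(max_new):
        seq.append(next_token(seq))  # one new token per step
    return seq

def diffusion_generate(canvas_size, denoise_step, steps=10):
    """Diffusion: start from pure random noise and refine the WHOLE
    canvas jointly over a fixed number of denoising steps."""
    rng = random.Random(0)
    canvas = [rng.gauss(0, 1) for _ in range(canvas_size)]  # random noise
    for t in range(steps):
        canvas = denoise_step(canvas, t)  # every position updated together
    return canvas  # only now is the full output revealed

# Toy stand-ins for the models (assumptions, not real predictors):
tokens = autoregressive_generate([1, 2], next_token=lambda seq: seq[-1] + 1)
image = diffusion_generate(4, denoise_step=lambda c, t: [x * 0.5 for x in c])
print(tokens)      # [1, 2, 3, 4, 5, 6, 7]
print(len(image))  # 4
```

Note how the random starting canvas is why each diffusion generation varies slightly, as mentioned above.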

Despite their differences, each has become the standard in its field:

  • Autoregressive LLMs for text and audio processing and generation

  • Diffusion for image, video, and, lately, audio (autoregressive techniques are still dominant, but diffusion audio models are gaining traction).

Therefore, for years, researchers have tried to unite both worlds and create a standard architecture that works seamlessly with both techniques. In the meantime, all frontier AI labs have had to resort to patching their products with various models.

  • GPT-4o still uses a DALL-E model to generate images

  • The same applies to Gemini with Imagen 3

  • And the same applies to Grok-2 with Flux.

But now, we might finally have found the solution, proving once again that open-source is unstoppable if not crippled by lobbied regulation.

A Combination That Creates Wonders

As you may have realized, despite their differences, they share a common structure: the Transformer.

Thus, Meta has created Transfusion, a model that combines both worlds into one by quickly identifying whenever it needs to generate text or images and swiftly alternating between techniques.

But how?

As we discussed, while Autoregressive LLMs and Diffusion Transformers are unequivocally state-of-the-art in their modalities, their differences are palpable, mainly with the attention mechanism and generation format.

To solve this issue, they crafted a method where a Transformer backbone autonomously switches between modes.

As seen above, while it might be generating text tokens autoregressively (one at a time with tokens only looking back, not forward), if the model identifies the need to create an image, it generates a special token, <BOI> (Beginning of Image), that signals it has to switch modes.

Next, the model enters the ‘diffusion mode,’ lays out the entire set of tokens representing the image (the number will depend on the size of the image being generated), and denoises them all combined by applying bidirectional attention.

Once the image patches have no noise left, they are output simultaneously, producing the image in one go. After creating the image, the model generates the special token <EOI> (End of Image), which signals it has to switch back to ‘autoregressive mode’ to generate further text.
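The decoding loop just described can be sketched as follows. This is my own schematic reading of the mechanism, not Meta’s code: the scripted ‘model’ and the placeholder image string are stand-ins for the real backbone and diffusion routine.

```python
BOI, EOI = "<BOI>", "<EOI>"  # special tokens marking the image region

def transfusion_style_decode(next_token, denoise_image, max_steps=20):
    """One backbone, two modes: generate text autoregressively until the
    model emits <BOI>, then hand the image region to the diffusion routine,
    append <EOI>, and resume autoregressive text generation."""
    output = []
    for _ in range(max_steps):
        tok = next_token(output)
        if tok == "<END>":
            break
        output.append(tok)
        if tok == BOI:                      # switch to diffusion mode
            output.append(denoise_image())  # whole image denoised in one go
            output.append(EOI)              # switch back to autoregressive mode
    return output

# Toy stand-ins (assumptions): a scripted 'model' that requests one image.
script = iter(["A", "dog:", BOI, "photo", "<END>"])
out = transfusion_style_decode(
    next_token=lambda seq: next(script),
    denoise_image=lambda: "[denoised-image-patches]",
)
print(out)
# ['A', 'dog:', '<BOI>', '[denoised-image-patches]', '<EOI>', 'photo']
```

The design choice to flag the mode switch with special tokens means a single sequence can interleave text and images arbitrarily, with the attention mask flipping between causal and bidirectional over the image span.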

Everything seems perfect so far, but does combining both worlds incur any performance penalty? Well, no.

As seen below, the Transfusion model is compared to other LLMs of similar size and training budget and still comes out on top.

Looking at the table below you might think they are cherry-picking the comparison by comparing Llama 1 and 2 and not Llama 3.1.

However, that would be an unfair and, quite frankly, erroneous comparison from a research perspective, because Llama 3.1 8B was trained on 15 trillion tokens, 15 times more data than Transfusion (1 trillion), which obviously correlates with better performance.

To the surprise of many, the model competes with the best of the best in text while also being on par with (or mostly better than) image models.

In other words, when trained on similar amounts of tokens and with a similar-sized model, Transfusion matches the performance of both text and image models while using the same number of parameters for both.

Thus, Transfusion is the first multimodal model that, instead of combining text parameters and image parameters in a patched architecture, speaks both modalities at once with no performance degradation and unmatched efficiency.

TheWhiteBox’s take

Technology:

If proven at scale (i.e., if performance holds up with larger models), there’s absolutely no reason not to adopt Transfusion as the standard architecture. It is much more efficient parameter-wise and much easier to put into production, as you don’t have to deal with calls to external models.

Products:

Meta previously hinted at going fully multimodal with Chameleon. Now, they have an even better architecture (look at the table above) that matches the performance of close-to-state-of-the-art image and text models.

Next stop? Llama 4, which could be starting training soon.

Markets:

With Magic Dev’s 100-million-token context window model, which is almost certainly a hybrid architecture (not Transformer-only), and now Transfusion, which is computationally less demanding than a patched combination of modality-only models, investors might expect AI labs to embrace these architectures once and for all, especially knowing how unprofitable OpenAI remains despite almost $4 billion in revenue.

Markets will soon no longer tolerate inefficient architectures.

THEWHITEBOX
Premium

If you like this content, join Premium to receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.

Until next time!

For business inquiries, reach out to me at [email protected]