Biggest News of the Week and Transfusion
THEWHITEBOX
TLDR;
OpenAI's Japan CEO Leak
Paradigm, Reinventing Spreadsheets with AI
Gemini Comes for Free to Your Chrome Browser
OpenAI's SearchGPT Gets Mixed Results
Ilya Sutskever Raises $1 Billion to Create Safe Superintelligence
NVIDIA Subpoenaed by DOJ, or Not?
Oprah Winfrey's AI Special Sparks Huge Controversy
[TREND OF THE WEEK] Transfusion Architectures, a deep dive into Meta's new architecture that combines the two key models in AI today to create the perfect multimodal solution.
Timely, budget-aligned software delivery meeting your business goals
Quick start and timely project delivery
Industry-leading expertise covering your entire SDLC
ELEKS-owned responsibility for your project's success
NEWSREEL
OpenAI's Japan CEO Leak
OpenAI Japan's CEO, Tadao Nagasaki, might have leaked the biggest news of the year, potentially confirming the release of a new OpenAI model this fall, something The Information has also reported and which I covered in much more detail in the TheWhiteBox Notion knowledge base.
In the image above, Tadao illustrates a potential 100x increase in "model intelligence" (y-axis) relative to their current flagship model, GPT-4o, indicating that we should see a step-function improvement in model quality before the end of the year.
TheWhiteBox's take:
From what we uncovered in our Notion article linked above, OpenAI seems to be following a multiple-model strategy.
It will first release a model codenamed "Strawberry," most likely the GPT-Next shown by Tadao, which would already be a considerable improvement. Furthermore, they are allegedly using this model to train an even better one, codenamed Orion, which will require trillions of tokens of Strawberry-generated synthetic data and could become GPT-5, a model that will not arrive anytime soon (end of 2025, probably 2026).
Now, how legit is all this?
Well, OpenAI are master storytellers. They control the narrative; they are the iconic brand in AI. However, for the past year, most of these stories have been more hype than truth, with nothing to show for their grandiose promises.
And with a deeply unprofitable business losing billions every year and a potentially huge investment round around the corner in an industry (LLMs) that open source is rapidly commoditizing, the need for hype is greater than ever, so take this news with a pinch of salt.
However, it's no secret that OpenAI needs a banger. And they need it soon.
PRODUCT
Paradigm AI Wants To Reinvent Spreadsheets
In a remarkable demo, Paradigm AI, a Y Combinator start-up, has emerged from stealth with a product that turns every cell in your spreadsheet into a "GenAI environment."
The link includes a small video demonstration showing the product filling the spreadsheet in seconds based on a user request.
TheWhiteBox's take:
As with any GenAI product demo, you can "feel the magic." But, sadly, most of these demos rarely translate into reality as expected, as Generative AI's worst enemy, unreliability, kicks in once they are actually deployed.
Having a "spreadsheet AI" that can turn your analytics game into a declarative one, where you do the questioning and the AIs do the work, is extremely compelling. But nobody will implement a product you can't trust, especially when spreadsheets are highly deterministic and used in error-sensitive tasks.
Therefore, while the demo is inspiring and worth sharing, and despite the company already counting customers like McKinsey and Bain, don't expect me to openly recommend this product to you anytime soon, like I did with Cursor.
GOOGLE
Talk with Gemini in Chrome for Free
Google has released Gemini so it can be queried from your Chrome search bar. Simply type "@" into the search bar, click "Chat with Gemini," et voilà.
Of course, you must pay to access Gemini Advanced, which hosts the best models, but it's a neat new feature nonetheless.
TheWhiteBox's take:
With this, Google has gone a step further in introducing its Multimodal Large Language Model (MLLM) into our lives (whether you like it or not), probably at a considerable loss (nobody is making a profit; ask OpenAI).
But for a company that has just spent $13 billion on AI CAPEX in the last quarter alone (page 6), a humongous $52 billion run rate for the year and more than carmaker Volvo's entire market cap, they are desperate for MLLMs to become everyday tools.
Will they succeed? Trillions of market dollars are on the line awaiting the answer.
AI SEARCH
Shocking: SearchGPT has Mixed Reviews
Yes, that was sarcasm. In July, OpenAI announced SearchGPT, their product aimed at democratizing AI-based search: search where, unlike in traditional engines like Google or Bing, the user asks a question and the AI does the searching, providing an immediate response along with follow-up links.
This product has been released to some power testers, and, unsurprisingly, the results were not great, especially regarding hallucinations, the misnomer (confabulations or bullshitting are much-preferred terms) for situations where the AI generates wrong answers.
Undoubtedly, as long as these models can't be trusted, it's tough to fathom the paradigm shift in search that many have proclaimed (including myself). And while I do believe this paradigm shift will inevitably occur, current models are simply not ready.
For real examples of SearchGPT, Matt Berman uploaded a video trying the product.
In the meantime, You.com has raised $100 million in a Series B round to create "productivity engines," search engines that leverage AI for individual and team tasks, making the already crowded search industry as competitive as ever.
SUPERINTELLIGENCE
Ilya Sutskever Raises $1 Billion to Build Superintelligence
Ilya Sutskever, OpenAI co-founder, ex-Chief Scientist, and the person who fired Sam Altman back in November 2023, has since departed to build "safe superintelligence," which is the literal name of his new company.
For months, many speculated about what "Ilya had seen" to become so nervous about progress that he made the controversial decision of firing the star CEO, whom he assessed was going "too fast."
Many assume that the model that spooked Ilya was Strawberry, the model that could be released this fall.
Now, he and his co-founders have raised $1 billion at an alleged $5 billion valuation (undisclosed) from top VC funds like Sequoia Capital and a16z, and will use the money to buy compute power and hire top talent.
TheWhiteBox's take:
SSI is, by far, one of the biggest "what if they are right" stories in history. While some famous researchers like Yann LeCun or François Chollet believe we are decades away from achieving Artificial General Intelligence (AGI), how far off is the Artificial Superintelligence (ASI) that Ilya aims to build?
Well, Ilya claims it's pretty close, and he's deeply worried that we will get there before we know how to control such systems. However, by now, you will have realized I do not buy into all this Silicon Valley hype.
Specifically, many of the researchers claiming that "AGI is near" are staunch proponents of Rich Sutton's The Bitter Lesson, which argues that compute is all that matters and that architectural or data breakthroughs are secondary.
Under this view, AGI is evidently near, as the unit costs of compute drop orders of magnitude every year. For instance, GPT-4 is hundreds of times more expensive than GPT-4o mini despite being worse, and their releases are separated by just one year.
From a market perspective, it's hard not to be amazed by the valuation. A year ago, I wrote a viral piece on Medium about Mistral, three guys who raised $100 million without a product, just a vision and their CVs. I was amazed (they still delivered, though).
Now, we have three other guys raising $1 billion, ten times more, while having no product and a website that literally looks like this.
It seems outrageous, but don't forget that Hyperscalers (Google, Amazon, Microsoft, and Meta) will pour $210 billion this year into AI data centers, making SSI's round look almost like breadcrumbs.
And to put into perspective how little $1 billion is today, here's a fact: Microsoft generates $1 billion in operational free cash flow every three days.
REGULATION
NVIDIA Subpoenaed By DOJ?
NVIDIAâs stock fell sharply (10%, or $280 billion) after reports were released about a potential subpoena by the US Department of Justice for possibly breaking antitrust laws.
However, NVIDIA denied the accusation a few hours later, claiming they had not been subpoenaed.
TheWhiteBox's take:
Although this seems to be a nothingburger, it gives a striking idea of how reactive and nervous markets are about the valuations of some of these companies.
Sure, from multiple perspectives, NVIDIA doesn't look too expensive. Its forward P/E (the stock's price as a multiple of its expected earnings for the year) is 37, nothing too extraordinary considering that during the Dot-com bubble, the average P/E across the 100 companies in the Nasdaq was 60.
But, in my humble opinion, investors are worried about the nature of its profits:
NVIDIA has a very concentrated set of customers (the Hyperscalers and a few more companies).
It's being weaponized by the US Government against China, concentrating its customer base even more.
The gap between AI hardware profits (NVIDIA's) and AI-generated profits (its customers creating products people actually use) is only increasing, exceeding $600 billion according to Sequoia Capital.
Soon enough, any of its top customers could curb spending, as investors will urge them to do if AI revenues fail to grow.
In a nutshell, NVIDIA is seen as the thermometer of the AI industry. Everyone is in because they want to be exposed, but they also want to be the first out if things go south. Markets are very nervous.
CONTROVERSIES
Oprah's "AI Special" Sparks Criticism
Last Thursday, ABC announced an Oprah TV special, "AI and the Future of Us: An Oprah Winfrey Special," which will air on September 12th to cover AI and the future that awaits us all.
However, the announcement has been criticized, as people claim that the TV special will simply become an "AI sales pitch" for Bill Gates, Sam Altman, and others.
TheWhiteBox's take:
Looking at the complete list (listed in the second link), the criticism is completely understandable.
We are about to see an hour-long show-off of "AI visionaries" with billions on the line telling us how Generative AI, a technology that has generated negligible value compared to its costs, will "change the world" or "solve all physics," the latter of which is laughable, to say the least.
But what worries me the most is FBI Director Christopher Wray's participation, as he will discuss "the terrifying ways criminals and foreign adversaries are using AI."
While I don't want to get ahead of myself, prepare to be gaslit into thinking that AI is the most dangerous tool humanity has ever created, and that humanity will face tremendous existential risks because of "its creation" unless it's closed source… with absolutely no proof to show for it.
Why? Well, because that threat does not exist today.
Current frontier AI models are compressions of the Internet; they can't generate anything that isn't already publicly available on it. While they will refuse to answer questions on how to build cheap bombs, that information is 3-5 clicks away on Google.
While I'm not into conspiracy theories, this special comes at a particularly uncanny time, as Governor Gavin Newsom could sign SB-1047 into law this very month, a bill openly rejected by the vast majority of the industry, and even by politicians in both parties, that could kill open source completely (in California).
Thus, it takes no genius to realize that scaring the public about AI lays the ground perfectly for the bill to pass.
If that happens, AI creators could be liable for the misuse of their products by third parties. This would inevitably deter anyone but extremely rich corporations from creating AI, consolidating an unequivocally powerful technology in the hands of a small, culturally homogeneous group of people in Silicon Valley.
PREMIUM CONTENT
Premium Content You Have Missed…
Putting into Perspective Elon Musk's game-changing $6 billion Colossus data center. How large will Grok-3 be? (Only Full Premium Subscribers)
Understanding Cursor's unique features (All Premium)
Personhood Credentials (All Premium)
TREND OF THE WEEK
Transfusion: One Architecture To Rule Them All
Meta has done it again.
They have presented Transfusion, a new architecture that has fulfilled the dream of many: uniting the worlds of the two dominant architectures, autoregressive models and diffusion transformers, while reaching state-of-the-art performance in both, something that neither OpenAI, Anthropic, nor Google can claim.
The Bleeding State-of-the-Art
Today, there are two prominent types of models in AI:
Autoregressive LLM Transformers. Models like GPT-4o (ChatGPT) generate the output to a user input one token (word/subword) at a time.
Diffusion Transformers. Models like Flux (used by Grok-2), DALL-E by OpenAI, or Imagen (used by Gemini) generate images (or video, like OpenAI's Sora) using a process known as diffusion, where the entire output is generated at once.
Do they have something in common? Yes, they both rely on the Transformer architecture, meaning they use the omnipresent attention mechanism.
The attention mechanism allows tokens in a sequence (words in text or pixel patches in images) to talk to each other. By doing so, each token can update its meaning in light of the rest of the sequence.
For instance, a patch of pixels may represent the ear of a Siberian Husky, a Malamute, or a Wolf. By itself, the patch only knows that it's an ear, not which animal it belongs to. Hence, it talks to the rest of the patches in the image to figure out which animal's ear it represents. This idea underpins all frontier AI research today.
For a closer review of this mechanism, read here.
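To make this more concrete, here is a minimal NumPy sketch of the attention computation. This is a toy illustration under simplifying assumptions (no learned query/key/value projections, a single head, a single sequence), not any lab's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention: every token mixes in information
    from the tokens the mask allows it to see."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # pairwise token similarities
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # hide disallowed positions
    return softmax(scores) @ V                 # weighted sum of value vectors

# A toy "sequence" of 4 tokens, each an 8-dimensional embedding.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = attention(x, x, x)   # self-attention: Q = K = V = the sequence itself
print(out.shape)           # (4, 8): each token, now updated with context
```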
However, on a first-principles basis, these models work differently because they apply this attention mechanism differently.
In autoregressive transformers, words can only look back in the sequence; they can only pay attention to the words that came before them (masked, or causal, attention). This ensures that the next prediction is conditioned only on previous words, not future ones (because the adequacy of a word in a sequence depends solely on the words that came before it, at least from a purely grammatical standpoint).
On the other hand, diffusion transformers apply bidirectional attention, meaning pixel patches in an image can look at any other patch. Unlike text, this is required for images because all patches come out together and influence each other.
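In code, the entire difference between the two regimes boils down to the mask you hand to an attention function like the sketch above. Again, a hedged toy example (production code usually adds the mask to the attention logits rather than building boolean matrices, but the effect is identical):

```python
import numpy as np

seq_len = 4

# Autoregressive (causal) mask: token i may only attend to tokens 0..i.
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Bidirectional mask: every image patch may attend to every other patch.
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

print(causal.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```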
Another critical difference is that they generate the output differently.
As mentioned, autoregressive LLMs like ChatGPT stream the response; users can see the different tokens generated one step at a time instead of seeing the entire sequence generated in one go.
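The generation loop itself is simple. Below is a minimal sketch of autoregressive sampling, where `toy_lm` is a hypothetical stand-in for a real model's forward pass (a real LLM would compute the distribution from the tokens generated so far instead of returning random probabilities):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["the", "cat", "sat", "on", "mat", "<EOS>"]

def toy_lm(tokens):
    """Stand-in for a real forward pass: returns a probability
    distribution over the next token (here, just random)."""
    logits = rng.normal(size=len(VOCAB))
    return np.exp(logits) / np.exp(logits).sum()

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_lm(tokens)                        # condition on everything so far
        nxt = VOCAB[rng.choice(len(VOCAB), p=probs)]  # sample one token
        if nxt == "<EOS>":
            break
        print(nxt, end=" ", flush=True)               # this is the "streaming"
        tokens.append(nxt)                            # feed it back in and repeat
    return tokens

generate(["the"])
```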
ChatGPT is rumored to use speculative decoding to generate multiple tokens at once for faster generation, so it isn't really generating one token at a time, but many. It's still autoregressive nonetheless.
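For the curious, here is the intuition behind speculative decoding as a greedy toy sketch. This is my own simplified illustration, not OpenAI's implementation: in real systems, the large model verifies all the draft's guesses in a single batched forward pass, which is where the speedup comes from.

```python
# Toy "models" that predict the next token from the last one.
DRAFT  = {"a": "b", "b": "c", "c": "x"}            # small, fast, sometimes wrong
TARGET = {"a": "b", "b": "c", "c": "d", "d": "e"}  # large, slow, authoritative

def draft_model(tokens):
    return DRAFT.get(tokens[-1], "?")

def target_model(tokens):
    return TARGET.get(tokens[-1], "?")

def speculative_step(tokens, k=3):
    """The draft model cheaply proposes k tokens; the target model
    verifies them, keeping the agreeing prefix and fixing the first error."""
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft_model(proposal))
    accepted = list(tokens)
    for guess in proposal[len(tokens):]:
        expected = target_model(accepted)   # what the big model would emit
        accepted.append(expected)
        if guess != expected:               # first disagreement: stop here
            break
    return accepted

print(speculative_step(["a"]))  # ['a', 'b', 'c', 'd']: two guesses verified, one fixed
```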
In diffusion, the model takes an entire canvas in a noisy state and generates the output by denoising it, conditioned on the user's request. As the original canvas is random noise, the actual outcome varies slightly with each generation.
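Here is a sketch of that denoising loop, with a `toy_denoiser` standing in for the trained network. Real samplers such as DDPM or DDIM follow a precise noise schedule rather than this naive subtraction; this only shows the control flow:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(canvas, step, prompt):
    """Stand-in for the trained network: estimates the noise left in
    the canvas at this step, conditioned on the user's prompt."""
    return 0.1 * canvas   # a real model returns a learned noise estimate

def diffuse(prompt, steps=50, shape=(64, 64)):
    canvas = rng.normal(size=shape)       # start from pure random noise
    for step in reversed(range(steps)):   # denoise, step by step
        canvas = canvas - toy_denoiser(canvas, step, prompt)
    return canvas                         # the whole image appears at once

image = diffuse("a husky's ear")
print(image.shape)   # (64, 64)
```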
Despite their differences, they have become the standard in their respective fields:
Autoregressive LLMs for text and audio processing and generation
Diffusion for image, video, and, lately, audio (autoregressive techniques are still dominant, but diffusion audio models are gaining traction).
Therefore, for years, researchers have tried to unite both worlds and create a standard architecture that works seamlessly with both techniques. In the meantime, all frontier AI labs have had to resort to patching their products with various models.
GPT-4o still uses a DALL-E model to generate images
The same applies to Gemini with Imagen 3
And the same applies to Grok-2 with Flux.
But now, we might finally have found the solution, proving once again that open-source is unstoppable if not crippled by lobbied regulation.
A Combination That Creates Wonders
As you may have realized, despite their differences, they share a common structure: the Transformer.
Thus, Meta has created Transfusion, a model that combines both worlds into one by identifying on the fly whether it needs to generate text or images and swiftly alternating between the two techniques.
But how?
As we discussed, while Autoregressive LLMs and Diffusion Transformers are unequivocally state-of-the-art in their modalities, their differences are palpable, mainly in the attention mechanism and the generation format.
To solve this issue, they crafted a method where a Transformer backbone autonomously switches between modes.
As seen above, while it might be generating text tokens autoregressively (one at a time with tokens only looking back, not forward), if the model identifies the need to create an image, it generates a special token, <BOI> (Beginning of Image), that signals it has to switch modes.
Next, the model enters "diffusion mode," lays out the entire set of tokens representing the image (the number depends on the size of the image being generated), and denoises them all together using bidirectional attention.
Once the image patches have no noise left, they are output simultaneously, producing the image in one go. After creating the image, the model generates the special token <EOI> (End of Image), which signals it must switch back to "autoregressive mode" to generate further text.
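Put together, the control flow looks roughly like the sketch below. Everything here (`DummyBackbone`, `next_token`, `denoise_image`) is a hypothetical stand-in of mine for the single shared Transformer, not the paper's actual API; it only illustrates the <BOI>/<EOI> mode switching:

```python
BOI, EOI, EOS = "<BOI>", "<EOI>", "<EOS>"

class DummyBackbone:
    """Hypothetical stand-in for the shared Transformer backbone."""
    def __init__(self):
        self.script = ["Here", "is", "a", "dog:", BOI, EOS]  # canned demo output
    def next_token(self, sequence):
        return self.script.pop(0)   # autoregressive mode (causal attention)
    def denoise_image(self, sequence):
        return ["patch"] * 4        # diffusion mode (bidirectional attention)

def generate(model, prompt, max_steps=64):
    sequence, outputs = list(prompt), []
    for _ in range(max_steps):
        token = model.next_token(sequence)   # one token at a time, looking back only
        sequence.append(token)
        if token == BOI:
            # Switch to diffusion mode: lay out the image's noisy patch tokens
            # and denoise them jointly, every patch attending to every other.
            patches = model.denoise_image(sequence)
            sequence.extend(patches)
            sequence.append(EOI)             # back to autoregressive mode
            outputs.append(("image", patches))
        elif token == EOS:
            break
        else:
            outputs.append(("text", token))
    return outputs

print(generate(DummyBackbone(), []))
```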
So far, everything seems perfect, but does combining both worlds degrade performance? Well, no.
As seen below, the Transfusion model is compared to other LLMs of similar size and training budget and still comes out on top.
Looking at the table below, you might think they are cherry-picking the comparison by comparing Llama 1 and 2 and not Llama 3.1.
However, that would be an unfair and, quite frankly, erroneous comparison from a research perspective because Llama 3.1 8B was trained on 15 trillion tokens, 15 times more data than Transfusion (1 trillion), which obviously correlates with better performance.
To the surprise of many, the model competes with the best of the best in text while also being on par with (or mostly better than) comparable image models.
In other words, when trained on similar amounts of tokens and with a similar-sized model, Transfusion matches the performance of both text and image models while using the same number of parameters for both.
Thus, Transfusion is the first multimodal model that, instead of combining text parameters and image parameters in a patched architecture, speaks both modalities with a single set of parameters, with no performance degradation and unmatched efficiency.
TheWhiteBox's take
Technology:
If proven at scale (i.e., if performance continues to hold with larger models), there's absolutely no reason not to use a Transfusion architecture as the standard. It is much more efficient parameter-wise and much easier to put into production, as you don't have to deal with calls to external models.
Products:
Meta previously hinted at going fully multimodal with Chameleon. Now, they have an even better architecture (look at the table above) that matches the performance of close-to-state-of-the-art image and text models.
Next stop? Llama 4, which could be starting training soon.
Markets:
With Magic Dev's 100-million-token context window model, which is almost certainly a hybrid architecture (not Transformer-only), and now Transfusion, which is computationally less demanding than a patched-together combination of single-modality models, investors might expect AI labs to embrace these architectures once and for all, especially knowing how unprofitable OpenAI remains despite almost $4 billion in revenue.
Markets will soon no longer tolerate inefficient architectures.
THEWHITEBOX
Premium
If you like this content, join Premium to receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.
Until next time!
Give a Rating to Today's Newsletter
For business inquiries, reach out to me at [email protected]