OpenAI has its biggest week in 6 months

🏝 TheTechOasis 🏝

Breaking down the most advanced AI systems in the world to prepare you for your future.

5-minute weekly reads.

TLDR:

  • AI Research of the Week: OpenAI Releases DALL-E 3 and Changes Image Generation Forever

  • Leaders: Understanding ChatGPT’s Biggest Upgrade Ever, Building the Sensorial Computer

🤯 AI Research of the week 🤯

It’s pretty clear by now that OpenAI makes announcements in a big way, as this week has come packed with revelations from the star-studded AI company.

Among those, the most interesting by far is the release of DALL-E 3, the new version of their image-generating model.

However, the reason it’s interesting is not really the fact it generates better images… but how it does it.

This time, DALL-E 3 is built on top of ChatGPT, and that changes everything.

In fact, they have explicitly claimed that DALL-E 3 intends to be the nail in the coffin of what could become the shortest-lived hot job in history, prompt engineering, raising the question:

Will AI eradicate AI jobs just as quickly as normal ones?

Source: OpenAI using DALL-E 3

Refining what state-of-the-art means

When working with current image generation models like MidJourney or Stable Diffusion, one has to deal with many problems.

Are you listening to me?

First and foremost, current image generators, models that take a text description and generate the image that portrays that description, are terrible at following instructions.

In most cases, they won’t respect structure, meaning that if you require a certain object in a certain part of the image, simply forget about it.

Yes, models like ControlNet allow us to condition a model to generate an image that resembles a certain structure or, for instance, a given scribble:

Unlike plain text-to-image generation, where the denoising process is guided by the text prompt alone, ControlNet adds a conditioning image such as a scribble (top-left), so the image generation model has a much better idea of the structure it needs to generate.

ControlNet is great if you’re an artist. But those of us who aren’t, like me, have to deal with the tendency of these models to omit critical parts of the description, and most often need to tailor the prompt extensively, a practice known as prompt engineering, to end up with the ideal image.

But lack of instruction following isn’t the greatest problem.

The Great Impersonator

Just like Ferdinand Waldo Demara became a legendary impostor portrayed by Tony Curtis in The Great Impostor, image generation models have had their fair share of notoriety in the press for imitating several artists who, to put it mildly, were not happy about it.

But imitated artists aren’t the only ones unhappy about these models, because unless you’ve been living under a rock, you have seen AI-generated photos of Donald Trump’s arrest, such as the one below:

These images are funny but can scar the people portrayed heavily.

And considering how voice synthesizers are also becoming scarily good at imitating any given voice, one can easily create fake content at a scale that could seriously damage the reputation of the people portrayed.

And you don’t have to be famous to get portrayed with these models. The targeted person could be you.

In fact, I tried to generate images of famous people myself and was quite successful.

Luckily for those people, I don’t do it in bad faith, but you know how the world works.

But if there’s one thing that current image generators suck at, it’s generating text.

What the… never mind

This is what MidJourney, my image generator of choice, gave me when I asked it for a logo displaying “I’m a 90s kid!”:

If you look carefully, it has represented 90s imagery, giving the image a nostalgic aura, but the text… never mind.

Long story short, current models are really not that useful. But now, DALL-E 3 has erased all these problems in one go.

Leveraging synergies

OpenAI has decided that it wants to create the first truly useful image generator, and for that, they have used none other than the crown jewel of the AI industry, ChatGPT.

Put simply, DALL-E 3 has been built natively on top of ChatGPT.

This means that, from early October, you will be able to directly ask ChatGPT to generate images (if you’re a paying subscriber).

Besides the UI appeal of this feature (you can now chat and generate images on the same screen), the real disruption comes when we read between the lines of OpenAI’s announcement.

Really, they are not only seamlessly adding DALL-E 3 to the user interface, but they are actively engaging ChatGPT in the process, a monumental change.

But how?

The intelligent middleware

In layman’s terms, the new image generation procedure will have ChatGPT as an active participant that suggests ideas and enhances your prompts behind closed doors… so that the end result fits your needs.

In other words, ChatGPT will act as your prompt engineer.

Add to this the fact that DALL-E 3 will already be great at following instructions, and you get this:

Source: OpenAI

Unlike previous image generators, DALL-E 3 will actually obey your commands and portray objects where they should be.

Sounds too good to be true, but it’s here guys and gals.

It’s here.

Adding to this, again leveraging ChatGPT’s world representations, DALL-E 3 has gotten a reasoning upgrade in knowing which prompts it should accept and which it should not.

Specifically, DALL-E 3 is now trained to decline prompts that ask for public figures or for images that imitate artists, thanks to OpenAI’s great efforts with Reinforcement Learning from Human Feedback.

You can think about this complete solution as one that leverages ChatGPT for two things:

  • Prepare the best prompts possible

  • Actively acknowledge when something shouldn’t be generated to protect artists or public figures
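
To make those two roles concrete, here is a minimal toy sketch of such a "middleware" in Python. It is not OpenAI's implementation: every name in it (`BLOCKED_TERMS`, `is_allowed`, `expand_prompt`, `handle_request`) is hypothetical, the keyword blocklist is a crude stand-in for a learned policy, and the rewriting step only appends fixed text where a real system would presumably ask an LLM to do the rewriting.

```python
from dataclasses import dataclass

# Hypothetical blocklist: a crude stand-in for a learned safety policy.
BLOCKED_TERMS = ("donald trump", "in the style of")


@dataclass
class Result:
    accepted: bool
    prompt: str  # the prompt that would be handed to the image model
    reason: str


def is_allowed(user_prompt: str) -> bool:
    """Toy policy check: refuse prompts that contain a blocked term."""
    lowered = user_prompt.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def expand_prompt(user_prompt: str) -> str:
    """Toy 'prompt engineer': a real system would ask an LLM to add detail."""
    return f"{user_prompt.strip()}, detailed composition, coherent lighting, legible text"


def handle_request(user_prompt: str) -> Result:
    """Check the policy first, then rewrite; image generation is out of scope."""
    if not is_allowed(user_prompt):
        return Result(False, "", "request declined by policy")
    return Result(True, expand_prompt(user_prompt), "ok")


if __name__ == "__main__":
    print(handle_request("a logo that says I'm a 90s kid!"))
    print(handle_request("Donald Trump being arrested"))
```

The point of the sketch is only the order of operations: the user never writes the final prompt, and a refusal happens before anything reaches the image model.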

Overall, a home run for ChatGPT users.

But we still have to address the elephant in the room… is this the beginning of the end for prompt engineering?

A quick ride to death

For months, prompt engineering was hailed as the next big job for humans, AI whisperers that maximized the results when interacting with these models.

But as several research papers suggest, LLMs can be better than humans at crafting prompts, so DALL-E 3 could mark the first time that humans are taken out of the picture completely.

This makes a real case for AI being capable not only of eradicating traditional human jobs, but also of devouring jobs that were born thanks to AI… just as quickly.

Predicting the future of the human labor market has just gotten quite a bit more complex and scary, don’t you think?

🫡 Key contributions 🫡

  • DALL-E 3 represents the first image generator that doesn’t require prompt engineering to generate images that adhere to the user’s description

  • It leverages ChatGPT to be more ‘self-aware’ of what it can generate or not, becoming much safer to use

🔮 Practical implications 🔮

  • Marketing campaigns have become something only limited by imagination, and logos, brands, and Internet imagery are now much easier to create, democratizing real art

  • We should see an explosion of self-service solutions from B2B and DTC companies, and human customer support is seeing its last days

  • Prompt engineering won’t become a job in itself, as LLMs will close the expertise gap for us

👾 Best news of the week 👾

😍 Legendary creative Jony Ive & OpenAI discussing creating the “iPhone of AI”

👓 Meta’s new Ray-Ban smart glasses are out of this world

🤩 Mistral releases its first model and makes it totally free

🥇 Leaders 🥇

This week’s issue:

ChatGPT gets its biggest enhancement ever and becomes a different beast

OpenAI has decided it really wants to take over the world.

Built upon the foundation of superstar talent, OpenAI continues to mark the direction of the AI industry, the most advanced technology humans have ever created.

That is no small feat, and the speed at which they have managed to transform the world is hard to describe with words, but slightly easier with graphs.

Source: Statista

Five days to reach one million users. Nothing more to add, your honor.

But this week OpenAI has really shaken the foundations of the industry with an announcement that takes ChatGPT miles closer to Artificial General Intelligence, or AGI.

Standing on the shoulders of giants like Yoshua Bengio, Yann LeCun, Ashish Vaswani, and many others, OpenAI has just given us a glimpse of the world we’re heading into.

Today we’re going to go through all the recently added features, and how they have built this solution based on the hints given by the GPT-4V System Card.

By the end of this read, you will not only understand ChatGPT’s new impressive form… but also how to leverage it.

Finally, we will reflect on the views of some of the brightest minds of our time like Andrej Karpathy to comprehend that nothing is what it seems, and that this is the precursor of a new computing paradigm.

Understanding markets

Despite what many people may think, what really sets OpenAI apart from other competitors in the Generative AI market isn’t the technology.

Yes, GPT-4 still remains the most advanced model in the industry, but this isn’t what allows them to lead.

The democratization of this technology into tangible value for customers does.

Don’t underestimate the importance of UIs

What OpenAI understood from the get-go, thanks to the leadership of what might be one of the most influential venture capitalists in the last two decades, Sam Altman, was the importance of creating value.

In fact, the technology behind ChatGPT, the Transformer architecture with an added layer of Reinforcement Learning from Human Feedback (RLHF), was far from new.

For reference, the standard Transformer architecture was released in 2017, and so was RLHF.

But it wasn’t until November 2022 that the world finally realized what Generative AI meant for our society.

However, was it really the technology that caused this disruption, or a clear, simple-to-use UI that put this technology in the hands of the masses?

The answer is pretty clear.

They gave everyone, for free, an interface to converse with an advanced AI… and that’s it.

No ads. All word-of-mouth organic referrals.

And the world exploded… despite it being a “simple” chat interface!

But was it really only that?

Simplicity and time machines

One of the most influential people in social media in terms of understanding companies and marketing is Scott Galloway.

He has a great phrase that reads, “The easiest way to build billion-dollar companies is to build time machines”.

For instance:

  • Amazon gets anything to your door in just one day

  • Klaviyo, one of the protagonists of the recent IPO frenzy, automates marketing

  • Uber gets you a car ride home in minutes

  • E-commerce companies get you your desired new clothes without leaving the house

See the pattern? These companies simply make our lives easier and more comfortable.

With ChatGPT it was the same thing:

  • Students wrote beautiful essays that fooled professors worldwide

  • Journalists created boilerplates for their articles in seconds

  • Lawyers summarized long texts into a few sentences

  • Good programmers became 10x programmers

But not everything was going to be sunshine and flowers.

User retention soon started to fall, and important metrics like DAU/MAU, the share of monthly active users who are also daily users, were not that great for products like ChatGPT or Claude.
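
For readers unfamiliar with the metric, here is a minimal sketch of how DAU/MAU is computed. The function name `stickiness` and the user counts are made up for illustration and are not figures for any real product; in practice the daily number is usually an average over the month.

```python
def stickiness(daily_active_users: int, monthly_active_users: int) -> float:
    """Return DAU/MAU: the fraction of monthly users who also show up daily."""
    if monthly_active_users <= 0:
        raise ValueError("monthly_active_users must be positive")
    return daily_active_users / monthly_active_users


if __name__ == "__main__":
    # Hypothetical product: 2M daily users out of 10M monthly users.
    print(f"{stickiness(2_000_000, 10_000_000):.0%}")
```

A low ratio means most people who try the product in a given month do not come back to it every day.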

At the end of the day, ChatGPT was only text and nothing more, right?

Well, that’s no longer true.

ChatGPT is now multimodal. Or, to put it more explicitly, ChatGPT is not a chatbot anymore.

Now it is becoming a sensorial computer, and to understand why, first we need to understand how it works.

Subscribe to Leaders to read the rest.
