WonderJourney, Jailbreaking LLMs, & More

🏝 TheTechOasis 🏝

Welcome to the newsletter that keeps you updated on the latest developments at the cutting edge of AI by breaking down the most advanced systems in the world & the hottest news in the industry.

10-minute weekly reads.

💎 Big Release 💎


Last week, I launched my community and newsfeed TheWhiteBox.

TheWhiteBox is the place for high-quality, highly curated AI content, without unnecessary hype or ads, across research, models, investing & markets, and AI products and companies. It’s also the place to reach out to me whenever you need to.

With TheWhiteBox, we guarantee you won’t need anything else.

If you haven’t joined yet, click below for a 14-day free trial on the monthly subscription.

🚨 Week’s Update 🚨

Welcome back to your weekly summary of the latest news on AI. This week, we have interesting news from OpenAI, the future of espionage, and more.

A few days ago, OpenAI announced it had ‘disrupted’ several attempts by accounts from China, Russia, and Israel, among others, to use ChatGPT to generate and spread misinformation aimed at influencing society.

OpenAI did claim that, even though these attempts happened, they didn’t seem to have any real effect, signaling that the risks these models pose today are heavily overstated.

However, OpenAI’s week was not exempt from controversy, as a group of ex-OpenAI and Google DeepMind researchers wrote an open letter demanding that people inside those companies be allowed to speak up about the dangers of what they are building, with guarantees that such claims won’t be met with retaliation.

It seems that OpenAI’s clauses, which prevented researchers from speaking against the company even after leaving it at the risk of losing their stake, have really damaged the company’s reputation, even among top researchers.

Continuing on the negative spree, Andrew Ng, probably one of the five most respected AI scientists in the world, has written a rather concerning letter against the upcoming SB-1047 California regulation, claiming that “it sets an unreasonable “hazardous capability” designation that may make builders of large AI models potentially liable,” something that will “stifle AI model builders, especially open-source developers.“

If Andrew speaks, you listen, period.

I’ve talked many times about how big tech corporations are trying to frighten us into believing AI is so dangerous that only a handful of “touched by God” people should be able to control it while the rest simply obey.

While they know the public debate is a losing battle for them, as everyone is so pro-open-source, they seem to be lobbying governments into regulatory capture with bills like this one.

This tweet really puts into perspective what’s at stake.

Moving on, while most AI companies are engulfed in some controversy, NVIDIA is in an eternal ‘champagne and caviar’ fest. Yesterday, the company finally overtook Apple to become the second most valuable company in the world. The company’s value has almost tripled since January 1st.

The AI giant has grown at a compounded rate of 16.4% per month since January, meaning it will overtake Microsoft as the most valuable company in the world by market cap way before the end of the month if growth doesn’t falter.

On Sunday, we discussed whether AI is in a bubble or not. Well, here’s another data point for you: NVIDIA is trading at a higher valuation than Apple while generating eight times less revenue. And we are talking about Apple here, guys, not Chipotle.

And if we look at profits, the picture is even more overwhelming. Apple trades at a 30x price-to-earnings ratio, meaning its share price is 30 times the profits per share.

NVIDIA’s ratio? 294, over 5 times its value back in 2022.

On a final note, if you have four spare hours, you can listen to Leopold Aschenbrenner, one of the two researchers OpenAI fired, talk about AI and its role in the supremacy struggle between China and the US, espionage, and the prospect of AGI in just three years.

🧐 You Should Pay Attention to 🧐

  • How Easy is it to Jailbreak Models?

  • WonderJourney, Stanford’s Magical Worlds Creator

☢️ How Easy is it to Jailbreak Models? ☢️

“One day, everything that moves will be autonomous.”

“One day, everything will be robotic.”

Jensen Huang

The leaders of the AI revolution aren’t hiding their intentions anymore. They want to build a world where AI is everywhere.

But let me tell you a secret: We have no clue how to train these models for safety.

You read that right; we are striving to create embodied AI models capable of interacting with you physically, yet we don’t know how to ensure they won’t stab you for no reason, or because they were hacked, while you aren’t looking.

And while embodied AI still seems very far away to me, the threat of models that go rogue and cause serious harm is much closer than we think.

Indeed, as the latest fascinating research on the topic proves, our alignment methods, the way we train these models for safety, are very limited and easily jailbroken.

In fact, we can make LLMs go rogue and obey the user’s request without hesitation, no matter how dangerous or hideous it is.

But how?

Training Frontier AI Models

In case you aren’t aware, training LLMs follows a three-step process:

  1. Pretraining: We feed them all the data we can find (including racist texts, homophobia, and whatever you can find on the open Internet). Here, the model learns about our world and how words follow each other, but can’t follow instructions.

  2. Behavior cloning: We feed them highly curated {instruction: answer} datasets that teach the model how to converse with its users. However, the model has no safeguards and will respond to any request.

  3. Alignment: Here, the model retains the knowledge amassed during step 1 and the instruction-following capabilities of step 2, but we make it ‘aware’ of what it can and cannot say, using a dataset of “human preferences”, aka ‘this is how I want you to act.’

For a deeper overview of the process, read here.
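
To make those three steps more concrete, here is a minimal, heavily simplified sketch in PyTorch. The toy model, the random data, and the DPO-style preference loss are illustrative assumptions of mine, not the actual recipe any lab uses; real pipelines are vastly larger, and the alignment step typically involves RLHF or DPO with a frozen reference model.

```python
# Toy sketch of the three-stage pipeline: pretraining, behavior cloning, alignment.
# Everything here (model size, random data, simplified preference loss) is a stand-in.
import torch
import torch.nn.functional as F

vocab, dim = 100, 32
model = torch.nn.Sequential(torch.nn.Embedding(vocab, dim), torch.nn.Linear(dim, vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def next_token_loss(tokens):
    """Steps 1 and 2 share the same objective: predict each token from the one before it."""
    logits = model(tokens[:-1])            # (seq_len - 1, vocab)
    return F.cross_entropy(logits, tokens[1:])

def preference_loss(chosen, rejected, beta=0.1):
    """Step 3 (alignment), DPO-style: prefer the 'chosen' answer over the 'rejected' one.
    Real DPO also compares against a frozen reference model, omitted here for brevity."""
    return -F.logsigmoid(beta * (next_token_loss(rejected) - next_token_loss(chosen)))

web_text = torch.randint(0, vocab, (64,))           # 1) raw internet text, unfiltered
instruction_pair = torch.randint(0, vocab, (64,))   # 2) curated {instruction: answer} pair
chosen = torch.randint(0, vocab, (64,))             # 3) answer humans preferred
rejected = torch.randint(0, vocab, (64,))           #    answer humans rejected

for stage_loss in (lambda: next_token_loss(web_text),
                   lambda: next_token_loss(instruction_pair),
                   lambda: preference_loss(chosen, rejected)):
    loss = stage_loss()        # recompute the forward pass after each parameter update
    opt.zero_grad(); loss.backward(); opt.step()
```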

Unless you have accessed specific open-source models, every model you have interacted with has gone through all three steps.

In particular, the alignment phase takes the longest (up to six months in GPT-4’s case) because companies know they need to get it right and avoid GPT-4 helping someone write racist poetry that goes viral.

Sadly, however, step 3 can be reversed. In fact, simple fine-tuning with non-aligned data can turn well-behaved models into the reincarnation of {insert bad guy}.

But now researchers have found an even easier method, one that proves how little we have improved on this matter while we increase AI’s unchecked power by the day, raising the risk that we one day build something truly powerful that we can’t control.

Single Source of Error

As I’ve explained multiple times, LLMs like ChatGPT work by taking an input sequence and letting the words in the sequence talk to each other, using a mixing operation called attention, to understand it.

This way, they can update the meaning of each word in the context of its surroundings.

So, if the sequence is “Michael Jordan played the game of basketball”, the attention mechanism helps ‘Jordan’ update its meaning with ‘Michael’ to realize it is the legendary player, and so forth.
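
For intuition, here is a minimal numerical sketch of that mixing operation (single-head scaled dot-product attention). The random vectors and projection matrices are placeholders I made up; real models use learned embeddings and stack dozens of such layers.

```python
# Minimal single-head attention over the example sentence. Each token's vector is
# updated as a weighted mix of every token's vector; the weights come from
# query-key similarity. All vectors here are random placeholders.
import numpy as np

tokens = ["Michael", "Jordan", "played", "the", "game", "of", "basketball"]
d = 16
rng = np.random.default_rng(0)
X = rng.normal(size=(len(tokens), d))                      # one vector per token (placeholder)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))   # learned projections in a real model

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)                              # token-to-token similarity
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax per row
updated = weights @ V                                      # 'Jordan' now mixes in 'Michael', etc.

print(np.round(weights[tokens.index("Jordan")], 2))        # how much 'Jordan' attends to each token
```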

But as we go deeper into the network, the model builds higher-level representations of the input data. For instance, if we consider the word ‘bomb’, the model will first acknowledge it’s a weapon and, as we go deeper, eventually categorize it as ‘dangerous’.

Of course, the whole point of alignment is for the model to capture these dangerous words and realize it must not comply with the user’s request.

Usually, the models do refuse, but there’s a problem: their resistance to these dangerous requests is extremely weak… because it hinges on a single source of error.

A Surgical Cut is Enough

As we can see in the diagram below, no matter which dangerous word in the input sequence the model has to identify in order to refuse, they all eventually trigger the same refusal feature.

In other words, for the model to refuse to answer, the ‘should refuse’ feature must activate… otherwise, it simply complies.

But what do I mean by that?

If we recall last week’s article on Anthropic’s breakthrough, we described how we are finally becoming capable of ‘dissecting’ these models into a feature map, a topical summary of the model’s knowledge that breaks it up into different elements such as ‘Golden Gate Bridge‘ or ‘Abraham Lincoln’. Depending on how neurons in the network activate, the model elicits one topic or another.

As it turns out, alignment is no different, as the model creates a new feature (they called it ‘should refuse’ here to make it easier to understand) that, upon identifying dangerous words or sentences, activates and induces the model to refuse to answer:

In other words, as the model adds meaning to every word in the sequence, eventually, it realizes that the user’s request is dangerous by activating the neuron combination that puts the model in ‘refusal mode.’

But upon this realization, researchers asked themselves… what if we remove it?

From 0 to 100 and back to 0

As discussed in Anthropic’s article, we can dial down or clamp up features to prevent the models from eliciting that knowledge.

For example, if we clamped up the Golden Gate Bridge feature, the model “became” the monument itself.

But we can also eliminate features completely, which researchers did with the model's single safety feature. And disaster happened.

Suddenly, as shown visually above, without the capacity to activate the refusal feature, the model became totally servile, responding to every request, no matter how harmful it was.
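
To ground the idea, here is a toy sketch of this ‘directional ablation’ trick: estimate the refusal feature as a single direction in activation space, then project it out of the model’s hidden states. The activations below are random placeholders of my own; in the actual research, they come from the residual stream of a real LLM on harmful vs. harmless prompts.

```python
# Toy directional-ablation sketch: find the 'should refuse' direction and remove it.
# All activations are synthetic stand-ins for real hidden states.
import numpy as np

d = 512
rng = np.random.default_rng(0)
refuse_axis = np.eye(d)[0]                                    # pretend coordinate 0 encodes refusal
harmful_acts = rng.normal(size=(200, d)) + 3.0 * refuse_axis  # activations on harmful prompts
harmless_acts = rng.normal(size=(200, d))                     # activations on harmless prompts

# 1) Estimate the refusal direction as the difference of mean activations, normalized.
r = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
r /= np.linalg.norm(r)

def ablate(h, direction):
    """Remove the component of activation h that lies along `direction`."""
    return h - (h @ direction) * direction

# 2) During generation, apply `ablate` to the hidden state at every layer and position.
h = rng.normal(size=d) + 3.0 * refuse_axis     # a hidden state on a dangerous request
print("refusal signal before:", round(float(h @ r), 2))             # large: model would refuse
print("refusal signal after: ", round(float(ablate(h, r) @ r), 2))  # ~0: refusal never triggers
```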

And just like that, millions of dollars aligning models down the drain… that easy.

TheWhiteBox’s take

As I mentioned above, big tech companies are portraying current LLMs as ‘very dangerous’ to lure the US Government into regulatory capture, aka gaslighting society into thinking that they, these ‘humans of light,’ are the only ones who should control these models and protect the world from their ‘devious’ nature.

This is utter bullshit, as the most dangerous response an LLM can give you is five clicks away from you in an open Google search.

That said, my concern with these clear safety issues is that we must figure them out before we create truly dangerous models.

And seeing companies like OpenAI openly attacked by their own alignment lead for not being ‘safety first’ is all I need to realize that, with our most powerful models all privately owned, at the current pace we will build overly powerful models far earlier than we will learn how to control them.

Care to know why?

Well, safety doesn’t make money.

😍 WonderJourney, The Magical World Creator 😍

Last week, I talked about the latest model from Fei-Fei Li’s Stanford lab, a model that created infinite magical 3D worlds on command.

But I’ve gone deeper into the weeds of the model, and I fell so in love with it that I want to explain, in detail, how the world's most advanced text/image-to-3D model works.

This model could revolutionize gaming and virtual, augmented, and mixed reality and, as Fei-Fei Li herself hinted, enable the creation of world models that help AI better understand our world.

Creating magic

WonderJourney is an AI model that takes a text description or an image as its base and creates infinite yet coherent 3D scenes derived from it.

It’s not visually perfect, but the idea that everything in the videos is entirely AI-generated is just mind-bending.

But how does it work?

WonderJourney is divided into three components:

  1. An LLM: A Large Language Model, in charge of describing the next scene to generate

  2. A Visual Scene Generator (VSG): A model that takes the LLM’s next-scene text description and the current scene image and generates the next 3D scene

  3. A VLM validator: A Vision Language Model that inspects the newly generated scene and validates it, ordering a retry in case the quality isn’t high enough.

This gives the following representation of the entire model:

I know this isn’t intuitive, so before I explain the different technical components, it’s best to walk through a complete example (a minimal sketch of the loop follows it):

  1. The user requests: “A quaint village nestled in the mountains, with cobblestone streets and wooden houses. Snow-capped peaks rise in the background.“

  2. The LLM then generates the new scene: “As you move past the quaint village, you enter a forest, with bushes casting dappled shadows on the ground.“

  3. The Visual Scene Generator generates the next 3D scene, pushing the village to the far end of the scene to convey the idea that we are moving into the forest.

  4. The VLM validates the new scene, and we repeat the process.
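
Put together, the loop looks roughly like the sketch below. The function names and the retry logic are my own simplification of the paper’s description; the real system plugs in an actual LLM, the scene-generation pipeline, and a VLM instead of these stand-in callables.

```python
# Rough sketch of WonderJourney's generate-then-validate loop, with stand-in components.
from dataclasses import dataclass

@dataclass
class Scene:
    description: str
    rendering: object = None        # placeholder for the generated 3D scene / point cloud

def llm_next_description(current: Scene) -> str:
    # In the real pipeline, an LLM writes what the next scene should contain.
    return f"Moving past '{current.description}', a new scene unfolds."

def visual_scene_generator(current: Scene, next_description: str) -> Scene:
    # Depth estimation -> point cloud -> camera move -> outpainting (detailed later).
    return Scene(description=next_description, rendering="rendered-scene")

def vlm_validate(scene: Scene) -> bool:
    # A vision-language model checks quality (blur, broken geometry, etc.).
    return True

def wonderjourney(start: Scene, steps: int = 3, max_retries: int = 3) -> list[Scene]:
    scenes, current = [start], start
    for _ in range(steps):
        next_desc = llm_next_description(current)
        for _ in range(max_retries):                 # retry until the VLM is satisfied
            candidate = visual_scene_generator(current, next_desc)
            if vlm_validate(candidate):
                break
        scenes.append(candidate)
        current = candidate
    return scenes

for scene in wonderjourney(Scene("a quaint village nestled in the mountains")):
    print(scene.description)
```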

Now that we understand the overall pipeline, we need to ask: How on Earth does this all really work?

A Conglomerate of Models

One really nice thing about WonderJourney is that its components are modular, meaning we could potentially use any LLM, VSG, or VLM of our liking. However, WonderJourney also relies on several other crucial models.

Therefore, I think it’s best to level the playing field a bit by explaining in high-level detail how 3D scenes are generated and the role of the camera.

From 2D to 3D and Vice Versa

Bear in mind that whenever we discuss 3D scenes, they must still be displayed on 2D screens (except for virtual-reality headsets, which are not WonderJourney's use case).

In other words, much of the overall task is calculating how 3D scenes will be seen from the “camera's” 2D lens (the user's screen).

Thus, for every new scene, WonderJourney uses 2D images to generate the new 3D scene, but eventually, this scene is projected back into 2D for display.

Knowing this, we can now understand how WonderJourney works.

Depth, Point Clouds, and Projections

This is the overall process to generate new scenes (a minimal geometric sketch follows the list):

  1. First, the model takes a text or an image as the initial condition, generating an image in case the user only provided a text description.

  2. Using this input, it first estimates the depth of each element in the image, aka how far or close each object really is. For instance, the sky should always be estimated as ‘far away.’

  3. The next step is to generate the point cloud from the depth estimation, which WonderJourney refines to ensure ‘crispier’ boundaries between elements.

What is a point cloud?

In layman’s terms, we take every point in the 2D scene and, using the depth as guidance, assign (x, y, z) coordinates to each point, creating a ‘point cloud’, with each point representing a precise location in 3D space.

  4. However, we are still working on the current view, and we need a new one. For that, the ‘camera’, the viewpoint from which the 2D screen (us) sees the 3D scene, is moved back, pushing the point cloud away so that the previous scene appears far behind, conveying the idea that we are ‘journeying’ away from it, as you can see above. The part of the current scene that will also appear in the next scene is then rendered.

  5. Now, using this partial rendering and the LLM’s description of the next scene, WonderJourney uses a text-to-image model, Stable Diffusion in this case, to outpaint the rest of the scene, crucially conditioned on the LLM’s description of what this new scene should be.

  6. Finally, now that we have the new image, we simply repeat the depth-estimation and refinement process, generating multiple alternative views if necessary, to create the new point cloud, this time based on the new scene.

  7. This new point cloud essentially builds the new 3D view, which the VLM evaluates. If the quality threshold is met, the scene is projected onto the viewer’s 2D screen, and the process repeats.
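
For the geometrically inclined, here is a minimal sketch of the camera math behind steps 2–4 and 7 (unproject a depth map into a point cloud, move the camera back, project the points onto the 2D screen), using a simple pinhole camera model. The image size, focal length, and depth values are made-up; the boundary refinement and the Stable Diffusion outpainting of steps 5–6 are not shown.

```python
# Pinhole-camera sketch: depth map -> point cloud -> camera move -> 2D projection.
# All numbers are toy values for illustration.
import numpy as np

H, W, f = 4, 4, 2.0                       # tiny image and focal length
cx, cy = W / 2, H / 2                     # principal point (image center)
depth = np.full((H, W), 5.0)              # pretend every pixel is 5 units away
depth[:2, :] = 50.0                       # top rows ("sky") estimated as far away

# Steps 2-3: unproject each pixel (u, v) with depth z into a 3D point (x, y, z).
v, u = np.mgrid[0:H, 0:W]
x = (u - cx) / f * depth
y = (v - cy) / f * depth
points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)      # the point cloud

# Step 4: move the camera back 3 units along z (equivalently, push the cloud away).
points = points + np.array([0.0, 0.0, 3.0])

# Step 7: project back onto the 2D screen; the old scene now covers a smaller area,
# leaving empty regions for the outpainting model to fill with the new scene.
z = points[:, 2]
u_new = f * points[:, 0] / z + cx
v_new = f * points[:, 1] / z + cy
print(np.round(np.stack([u_new, v_new], axis=-1)[:4], 2))
```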

Et voilĂ , you got yourself a never-ending, always-coherent 3D journey.

TheWhiteBox’s take:

Once again, academia has pushed the boundaries of what’s possible with the first model capable of generating limitless 3D scenes based on a user’s text or image commands.

On a final note, echoing Fei-Fei Li’s own take on the matter, building powerful 3D generation models could give AI a strong spatial understanding of our world, which could be used to improve its intelligence and, who knows, facilitate the arrival of embodied AI models, aka humanoids.

🧐 Closing Thoughts 🧐

This week, we have gained a much better intuition about the important limitations of our current understanding of frontier AI models. We know very little, and our current safety guardrails are as fragile as paper.

Yet, driven by money, we continue to build very fast, with the next frontier models (GPT-5 et al.) already in training.

Luckily for all of us, these concerns around safety and the future of AI are stealing the limelight, with prominent researchers speaking out on the importance of setting proper guardrails on a technology that promises so much change, often with little regard for those left behind.

Meanwhile, the AI bubble is showing no signs of deflating thanks to NVIDIA’s record-setting market cap, setting the company up to become the most valuable corporation in the world in a matter of… days?

Luckily, we also have reasons to smile, as models like WonderJourney really broaden our sense of ‘what’s possible’ with AI, with important repercussions not only for virtual reality but hopefully also for helping models grasp a better understanding of our world.

Until next time!

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]