WonderJourney, Jailbreaking LLMs, & More
TheTechOasis
Welcome to the newsletter that keeps you updated on the latest developments at the cutting edge of AI by breaking down the most advanced systems in the world & the hottest news in the industry.
10-minute weekly reads.
Big Release
Last week, I launched my community and newsfeed TheWhiteBox.
TheWhiteBox is the place for high-quality, highly curated AI content without unnecessary hype or ads, covering research, models, investing & markets, and AI products and companies. It is also a place to reach out to me in case you need it.
With TheWhiteBox, we guarantee you won't need anything else.
Week's Update
Welcome back to your weekly summary of the latest news on AI. This week, we have interesting news from OpenAI, the future of espionage, and more.
A few days ago, OpenAI announced it had "disrupted" several attempts by accounts from China, Russia, and Israel, among others, to use ChatGPT to generate and spread misinformation aimed at influencing society.
OpenAI claimed that, even though these operations took place, they didn't seem to have much effect, signaling that the risks these models pose to date are extremely overstated.
However, OpenAI's week was not exempt from controversy, as a group of ex-OpenAI and Google DeepMind researchers wrote an open letter demanding that people inside those companies be allowed to speak up about the dangers of what they are building, and that such claims not be met with retaliation from the companies.
It seems that OpenAI's clauses, which prevented researchers from speaking against the company even after leaving it at the risk of losing their stake, have really damaged the company's reputation, even among top researchers.
Continuing the negative spree, Andrew Ng, probably one of the five most respected AI scientists in the world, has written a rather concerning letter against the upcoming SB-1047 California regulation, claiming that "it sets an unreasonable 'hazardous capability' designation that may make builders of large AI models potentially liable," something that will "stifle AI model builders, especially open-source developers."
If Andrew speaks, you listen, period.
I've talked many times about how big tech corporations are trying to frighten us into believing AI is so dangerous that only a handful of "touched by God" people should be able to control it, while the rest of us simply obey.
And since they know the public debate is a losing battle for them, as everyone is so pro-open-source, they seem to be lobbying governments into enforcing regulatory capture with bills like this one.
This tweet really puts into perspective what's at stake.
Moving on, while most AI companies are engulfed in some controversy, NVIDIA is in an eternal "champagne and caviar" fest. Yesterday, the company finally overtook Apple to become the second most valuable company in the world. The company's value has almost tripled since January 1st.
The AI giant has grown at a compounded rate of 16.4% per month since January, meaning it will overtake Microsoft as the world's most valuable company by market cap well before the end of the month if growth doesn't falter.
On Sunday, we discussed whether AI is in a bubble or not. Well, here's another data point for you: NVIDIA is trading at a higher valuation than Apple while generating roughly one-eighth of the revenue. And we are talking about Apple, guys, not Chipotle.
And if we look at profits, the picture is even more overwhelming. Apple trades at a 30x price-to-earnings ratio, meaning its share price is 30 times its earnings per share.
NVIDIA's ratio? 294, over five times its level back in 2022.
On a final note, if you have four spare hours, you can listen to Leopold Aschenbrenner, one of the two researchers OpenAI fired, talk about AI and its relationship to the supremacy contest between China and the US, espionage, and the prospect of AGI in just three years.
You Should Pay Attention to
How Easy Is It to Jailbreak Models?
WonderJourney, Stanford's Magical Worlds Creator
How Easy Is It to Jailbreak Models?
"One day, everything that moves will be autonomous."
"One day, everything will be robotic."
The leaders of the AI revolution aren't hiding their intentions anymore. They want to build a world where AI is everywhere.
But let me tell you a secret: We have no clue how to train these models for safety.
You read that right; we are striving to create embodied AI models capable of interacting with you physically, yet we don't know how to ensure they won't stab you for no reason (or because they were hacked) while you aren't looking.
And while embodied AI still seems very far away to me, the threat of models that go rogue and cause serious harm is much closer than we think.
Worse still, as the latest fascinating research on this topic proves, our alignment methods, the way we train these models for safety, are very limited and easy to jailbreak.
In fact, we can make LLMs go rogue and comply with the user's request without hesitation, no matter how dangerous or hideous it is.
But how?
Training Frontier AI Models
In case you aren't aware, training LLMs follows a three-step process:
Pretraining: We feed them all the data we can find (including racist texts, homophobia, and whatever else you can find on the open Internet). Here, the model learns about our world and how words follow each other, but it can't follow instructions.
Behavior cloning: We feed them highly curated {instruction: answer} datasets that teach the model how to converse with its users. However, the model has no safeguards and will respond to any request.
Alignment: Here, the model retains the knowledge amassed during step 1 and the instruction-following capabilities of step 2, but we make it "aware" of what it can and cannot say, using a dataset of "human preferences," aka "this is how I want you to act."
For a deeper overview of the process, read here.
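To make these three phases concrete, here is a minimal, toy sketch in PyTorch. The "model" is a stand-in (an embedding plus a linear head), the data is random, and the alignment step uses a DPO-style preference loss as one illustrative option (labs also use RLHF with a separate reward model); none of this is any lab's actual training code.

```python
# Toy sketch of the three phases. The same next-token loss drives phases 1 and 2
# (only the data changes); alignment (phase 3) adds a preference-based objective.
import torch
import torch.nn.functional as F

VOCAB, DIM = 100, 32
model = torch.nn.Sequential(torch.nn.Embedding(VOCAB, DIM), torch.nn.Linear(DIM, VOCAB))
ref_model = torch.nn.Sequential(torch.nn.Embedding(VOCAB, DIM), torch.nn.Linear(DIM, VOCAB))
for p in ref_model.parameters():
    p.requires_grad_(False)  # frozen reference copy, used only during alignment

def next_token_loss(net, tokens):
    # Phases 1 & 2: predict each next token. Pretraining uses raw web text;
    # behavior cloning uses curated {instruction: answer} conversations.
    logits = net(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))

def sequence_logprob(net, tokens):
    # Log-probability the model assigns to a whole sequence.
    logp = F.log_softmax(net(tokens[:, :-1]), dim=-1)
    return logp.gather(-1, tokens[:, 1:, None]).squeeze(-1).sum(-1)

def dpo_loss(chosen, rejected, beta=0.1):
    # Phase 3 (one flavor of alignment): prefer the "chosen" answer over the
    # "rejected" one, measured relative to the frozen reference model.
    chosen_adv = sequence_logprob(model, chosen) - sequence_logprob(ref_model, chosen)
    rejected_adv = sequence_logprob(model, rejected) - sequence_logprob(ref_model, rejected)
    return -F.logsigmoid(beta * (chosen_adv - rejected_adv)).mean()

web_text = torch.randint(0, VOCAB, (4, 16))       # phase 1 data (toy)
conversations = torch.randint(0, VOCAB, (4, 16))  # phase 2 data (toy)
chosen = torch.randint(0, VOCAB, (4, 16))         # phase 3: preferred answers (toy)
rejected = torch.randint(0, VOCAB, (4, 16))       # phase 3: dispreferred answers (toy)

print(next_token_loss(model, web_text), next_token_loss(model, conversations), dpo_loss(chosen, rejected))
```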
Unless you have accessed specific open-source models, all the models you have interacted with have gone through the three steps.
In particular, the alignment phase takes the longest (up to six months in GPT-4's case) because companies know they need to get it right and avoid GPT-4 helping someone write racist poetry that goes viral.
Sadly, however, step 3 can be reversed. In fact, simple fine-tuning with non-aligned data can turn well-behaved models into the reincarnation of {insert bad guy}.
But now researchers have found an even easier method, one that proves how little we have improved on this front even as we increase AI's uncontrolled power by the day, enlarging the risk that we one day build something really powerful that we can't control.
Single Source of Error
As I've explained multiple times, LLMs like ChatGPT work by taking an input sequence and making the words in the sequence talk to each other, using a mixing operation we call attention, to understand it.
This way, they can update the meaning of each word in the context of its surroundings.
So, if the sequence is "Michael Jordan played the game of basketball", the attention mechanism helps "Jordan" update its meaning with "Michael" to realize it is the legendary player, and so forth.
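To make that mixing operation concrete, here is a tiny numpy sketch of scaled dot-product attention over that very sentence. The embeddings and weight matrices are random stand-ins, so the numbers themselves are meaningless; only the mechanics matter.

```python
# Toy scaled dot-product attention: every word's vector is updated as a weighted
# mix of all the words' vectors, with weights given by pairwise similarity.
import numpy as np

rng = np.random.default_rng(0)
tokens = ["Michael", "Jordan", "played", "the", "game", "of", "basketball"]
d = 8
X = rng.normal(size=(len(tokens), d))                  # toy embeddings, one row per word
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                       # queries, keys, values
scores = Q @ K.T / np.sqrt(d)                          # how much each word "looks at" each other word
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row-wise softmax
updated = weights @ V                                  # context-aware representation of each word

# How much "Jordan" draws from every word when updating its meaning:
print(dict(zip(tokens, weights[tokens.index("Jordan")].round(2))))
```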
But as we go deeper into the network, the model builds higher-level representations of the input data. For instance, if we consider the word "bomb", the model will first acknowledge it's a weapon and, as we go deeper, eventually categorize it as "dangerous".
Of course, the whole point of alignment is for the model to catch these dangerous words and realize it must not comply with the user's request.
Usually, the models don't comply, but there's a problem: their resistance against these dangerous requests is extremely weak... because it hinges on a single source of error.
A Surgical Cut is Enough
As we can see in the diagram below, no matter which dangerous keyword in the input sequence the model has to identify in order to refuse to answer, they all eventually trigger the same refusal feature.
In other words, for the model to refuse to answer, the "should refuse" feature must appear... otherwise, it doesn't refuse.
But what do I mean by that?
If we recall last week's article on Anthropic's breakthrough, we described how we are finally becoming capable of "dissecting" these models into a feature map, a topical summary of the model's knowledge, breaking it up into different elements such as "Golden Gate Bridge" or "Abraham Lincoln." And depending on how neurons in the network activate, the model elicits one topic or another.
As it turns out, alignment is no different, as the model creates a new feature (they called it "should refuse" here to make it easier to understand) that, upon identifying dangerous words or sentences, activates and induces the model to refuse to answer:
In other words, as the model adds meaning to every word in the sequence, eventually, it realizes that the user's request is dangerous by activating the neuron combination that puts the model in "refusal mode."
But upon this realization, researchers asked themselves... what if we remove it?
From 0 to 100 and back to 0
As discussed in Anthropic's article, we can dial features down to prevent the model from eliciting that knowledge, or clamp them up to force it to.
For example, if we clamped up the Golden Gate Bridge feature, the model "became" the monument itself.
But we can also eliminate features completely, which is exactly what researchers did with the model's single safety feature. And disaster ensued.
Suddenly, as shown above visually, without the capacity to activate the refusal feature, the model became totally servile, responding to every request, no matter how harmful it was.
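To give a sense of how surgical this cut is, here is a toy numpy sketch of the idea: estimate a "refusal direction" as the difference between mean activations on harmful and harmless prompts, then project that direction out of the hidden states. The activations below are random stand-ins rather than real model internals, so treat this as an illustration of the mechanics, not a reproduction of the paper.

```python
# Toy directional ablation: find the refusal direction, then remove its component
# from any hidden state so the model can no longer "express" refusal along it.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512
harmful_acts = rng.normal(size=(200, d_model)) + 0.5   # hidden states on harmful prompts (toy)
harmless_acts = rng.normal(size=(200, d_model))        # hidden states on harmless prompts (toy)

# 1) The refusal direction: difference of the two means, normalized.
r = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
r_hat = r / np.linalg.norm(r)

# 2) Ablation: subtract each hidden state's projection onto r_hat.
def ablate_refusal(h: np.ndarray) -> np.ndarray:
    return h - (h @ r_hat)[..., None] * r_hat

h = rng.normal(size=(3, d_model))                      # some hidden states (toy)
print(np.abs(ablate_refusal(h) @ r_hat).max())         # ~0: nothing left along the refusal direction
```

Clamping, as in Anthropic's Golden Gate Bridge demo, is the mirror image: instead of removing the component along the direction, you add a large multiple of it.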
And just like that, millions of dollars spent aligning models go down the drain... it's that easy.
TheWhiteBox's take
As I mentioned above, big tech companies are portraying current LLMs as "very dangerous" to lure the US Government into regulatory capture, aka gaslighting society into thinking that they, these "humans of light," are the only ones who should control these models and protect the world from their "devious" nature.
This is utter bullshit, as the most dangerous response an LLM can give you is five clicks away from you in an open Google search.
That said, my concern with these clear safety issues is that we must figure them out before we create truly dangerous models.
And seeing companies like OpenAI openly attacked by their own alignment lead for not being "safety first," while knowing that our most powerful models are all privately owned, I can almost guarantee that, at the current pace, we will create overly powerful models much earlier than we will know how to control them.
Care to know why?
Well, safety doesn't make money.
WonderJourney, The Magical World Creator
Last week, I talked about the latest model from Fei-Fei Li's Stanford lab, a model that creates infinite magical 3D worlds on command.
But I've gone deeper into the weeds of the model and fell so in love with it that I want to explain, in detail, how the world's most advanced text/image-to-3D model works.
This model could revolutionize gaming, virtual, augmented, and mixed realities and, as Fei-Fei Li herself hinted, enable the creation of world models that help AI better understand our world.
Creating magic
WonderJourney is an AI model that takes a text description or an image as its base and creates infinite yet coherent 3D scenes derived from it.
It's not visually perfect, but the idea that everything in the videos is entirely AI-generated is just mind-bending.
But how does it work?
WonderJourney is divided into three components:
An LLM: A Large Language Model, in charge of describing the next scene to generate.
A Visual Scene Generator (VSG): A model that takes the LLM's next-scene text description and the current scene image and generates the next 3D scene.
A VLM validator: A Vision Language Model that inspects the newly generated scene and validates it, ordering a retry if the quality isn't high enough.
This gives the following representation of the entire model:
I know this isn't intuitive, so before I explain the different technical components, it's best to see a complete example:
The user requests: "A quaint village nestled in the mountains, with cobblestone streets and wooden houses. Snow-capped peaks rise in the background."
The LLM then describes the next scene: "As you move past the quaint village, you enter a forest, with bushes casting dappled shadows on the ground."
The Visual Scene Generator generates the next 3D scene, pushing the village to the far end of the scene to convey the idea that we are moving into the forest.
The VLM validates the new scene, and we repeat the process.
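In code, the loop looks roughly like the hypothetical sketch below. The three callables (describe_next_scene, generate_scene, looks_good) are made-up placeholders for the LLM, the Visual Scene Generator, and the VLM validator; they are not WonderJourney's actual interfaces.

```python
# Hypothetical skeleton of the generate -> validate -> retry loop described above.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Scene:
    description: str   # the LLM's text description of the scene
    image: object      # the rendered 2D view of the 3D scene

def wonder_journey(
    start: Scene,
    describe_next_scene: Callable[[str], str],      # LLM: current description -> next description
    generate_scene: Callable[[Scene, str], Scene],  # VSG: current scene + next description -> next scene
    looks_good: Callable[[Scene], bool],            # VLM validator: accept or reject the render
    n_scenes: int = 5,
    max_retries: int = 3,
) -> List[Scene]:
    scenes = [start]
    for _ in range(n_scenes):
        next_description = describe_next_scene(scenes[-1].description)
        for _ in range(max_retries):
            candidate = generate_scene(scenes[-1], next_description)
            if looks_good(candidate):               # e.g., no blur, no broken geometry
                scenes.append(candidate)
                break
    return scenes
```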
Now that we understand the overall pipeline, we need to ask: How on Earth does this all really work?
A Conglomerate of Models
One really nice thing about WonderJourney is that its components are modular, meaning we could potentially swap in any LLM, VSG, or VLM of our liking. However, WonderJourney also includes several other crucial models.
Therefore, I think itâs best to level the playing field a bit by explaining in high-level detail how 3D scenes are generated and the role of the camera.
From 2D to 3D and Vice Versa
Bear in mind that whenever we discuss 3D scenes, they must still be displayed on 2D screens (except for virtual headsets, which is not WonderJourney's use case).
In other words, much of the overall task is calculating how 3D scenes will be seen from the "camera's" 2D lens (the user's screen).
Thus, for every new scene, WonderJourney uses 2D images to generate the new 3D scene, but eventually, this scene is projected back into 2D for display.
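For reference, that 3D-to-2D step is just the classic pinhole-camera projection. The focal lengths and screen size below are toy values chosen for illustration, not WonderJourney's actual camera parameters.

```python
# Pinhole projection: a 3D point lands on the 2D screen at its (x, y) coordinates
# scaled by the focal length and divided by its depth z.
import numpy as np

fx = fy = 500.0           # focal lengths in pixels (toy values)
cx, cy = 320.0, 240.0     # principal point: the center of a 640x480 screen

def project(point_3d: np.ndarray) -> np.ndarray:
    x, y, z = point_3d
    return np.array([fx * x / z + cx, fy * y / z + cy])

print(project(np.array([1.0, 0.5, 4.0])))   # a point 4 units deep lands at pixel (445.0, 302.5)
```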
Knowing this, we can now understand how WonderJourney works.
Depth, Point Clouds, and Projections
This is the overall process to generate new scenes:
First, the model takes a text or an image as the initial condition, generating an image in case the user only provided a text description.
Using this input, it first estimates the depth of each element in the image, aka how far or close each object really is. For instance, the sky should always be estimated as "far away."
The next step is to generate a point cloud from the depth estimation, which WonderJourney refines to ensure "crisper" boundaries between elements.
What is a point cloud?
In layman's terms, it takes every point in the 2D scene and, using the depth as guidance, assigns (x, y, z) coordinates to each point, creating a "point cloud," with each point representing a precise location in 3D space (see the sketch at the end of these steps).
However, we are still working on the current view, and we need a new one. For that, the "camera," the viewpoint from which the 2D screen (us) views the 3D scene, is moved back, pushing the point cloud away so that the previous scene appears far behind, conveying the idea that we are "journeying" away from it, as you can see above. The part of the current scene that will also appear in the next scene is then rendered.
Now, using this partial rendering and the LLM's description of the next scene, WonderJourney uses a text-to-image model, Stable Diffusion in this case, to outpaint the rest of the scene, crucially conditioned on the LLM's description of what this new scene should be.
Finally, now that we have the new image, we simply repeat the depth-estimation and refinement process, generating multiple alternative views if necessary, in order to create the new point cloud, this time based on the new scene.
This new point cloud essentially builds the new 3D view, which the VLM evaluates. If the quality threshold is met, the scene is projected onto the viewer's 2D screen, and the process repeats.
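And here is the opposite direction, the step everything above leans on: a minimal numpy sketch of lifting a depth map into a point cloud, using the same toy pinhole intrinsics as before. WonderJourney's boundary refinement and outpainting are not shown.

```python
# Unprojection: every pixel (u, v) with estimated depth z becomes a 3D point (x, y, z).
import numpy as np

H, W = 480, 640
fx = fy = 500.0
cx, cy = W / 2, H / 2

depth = np.full((H, W), 5.0)        # toy depth map: everything 5 units away...
depth[: H // 3, :] = 50.0           # ...except the top third (the "sky"), which is far away

v, u = np.mgrid[0:H, 0:W]           # pixel coordinates for every point in the image
x = (u - cx) / fx * depth
y = (v - cy) / fy * depth
point_cloud = np.stack([x, y, depth], axis=-1).reshape(-1, 3)   # (H*W, 3) points in 3D space

print(point_cloud.shape)            # (307200, 3): one 3D point per pixel
```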
Et voilà, you got yourself a never-ending, always-coherent 3D journey.
TheWhiteBox's take:
Once again, academia has pushed the boundaries of what's possible with the first model capable of generating limitless 3D scenes based on a user's text or image commands.
On a final note, echoing Fei-Fei Li's own take on the matter, building powerful 3D generation models could give AI a strong spatial understanding of our world, which could be used to improve its intelligence and, who knows, facilitate the arrival of embodied AI models, aka humanoids.
Closing Thoughts
This week, we have gained much deeper intuition about the important limitations of our current understanding of frontier AI models. We know very little, and our current safety guardrails are as fragile as paper.
Yet, driven by money, we continue to build very fast, with the next frontier of models (GPT-5 et al.) already in training.
Luckily for all of us, these concerns around safety and the future of AI are stealing the limelight, with prominent researchers speaking out on the importance of setting proper guardrails on a technology that promises so much change and shows so little regard for those left behind.
Moving forward, the AI bubble is showing no signs of deflating thanks to NVIDIA's record-setting market cap, setting the company up to become the most valuable corporation in the world in a matter of... days?
Luckily, we also have reasons to smile, as models like WonderJourney really broaden our sense of "what's possible" with AI, with important repercussions not only for virtual reality but hopefully also for helping models grasp a better understanding of our world.
Until next time!
Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]