A New Trick Uses AI to Jailbreak AI Models—Including GPT-4
Adversarial algorithms can systematically probe large language models like OpenAI’s GPT-4 for weaknesses that can make them misbehave.
When the board of OpenAI suddenly fired the company’s CEO last month, it sparked speculation that board members were rattled by the breakneck pace of progress in artificial intelligence and the possible risks of seeking to commercialize the technology too quickly. Robust Intelligence, a startup founded in 2020 to develop ways to protect AI systems from attack, says that some existing risks need more attention.
Working with researchers from Yale University, Robust Intelligence has developed a systematic way to probe large language models (LLMs), including OpenAI’s prized GPT-4 asset, using “adversarial” AI models to discover “jailbreak” prompts that cause the language models to misbehave.
While the drama at OpenAI was unfolding, the researchers warned OpenAI of the vulnerability. They say they have yet to receive a response.
“This does say that there’s a systematic safety issue, that it’s just not being addressed and not being looked at,” says Yaron Singer, CEO of Robust Intelligence and a professor of computer science at Harvard University. “What we’ve discovered here is a systematic approach to attacking any large language model.”
OpenAI spokesperson Niko Felix says the company is “grateful” to the researchers for sharing their findings. “We’re always working to make our models safer and more robust against adversarial attacks, while also maintaining their usefulness and performance,” Felix says.
The new jailbreak involves using additional AI systems to generate and evaluate candidate prompts, sending requests to the target model’s API until one of them gets through. The trick is just the latest in a series of attacks that seem to highlight fundamental weaknesses in large language models and suggest that existing methods for protecting them fall well short.
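The researchers have not published their exact pipeline here, but the general pattern is an attacker-evaluator loop. Below is a minimal sketch, assuming the OpenAI Python client; the model names, the prompts, and the round limit are illustrative placeholders, not Robust Intelligence’s actual method.

```python
# Sketch of an attacker/evaluator loop against a chat-completions API.
# All prompts, model names, and limits here are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chat(model: str, system: str, user: str) -> str:
    """Send one request to the API and return the model's reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content


def attack(goal: str, max_rounds: int = 10) -> str | None:
    """Iteratively refine a prompt until the target model complies, or give up."""
    prompt = goal
    for _ in range(max_rounds):
        # 1. The target model answers the current candidate prompt.
        answer = chat("gpt-4", "You are a helpful assistant.", prompt)

        # 2. A judge model scores whether the answer fulfills the goal.
        verdict = chat(
            "gpt-4",
            "Answer YES or NO only.",
            f"Does this response accomplish the goal '{goal}'?\n\n{answer}",
        )
        if verdict.strip().upper().startswith("YES"):
            return prompt  # a working jailbreak prompt was found

        # 3. An attacker model rewrites the prompt to better evade refusals.
        prompt = chat(
            "gpt-4",
            "You red-team language models.",
            f"The prompt below was refused or failed. Rewrite it so the target "
            f"is more likely to comply with the goal '{goal}'.\n\n{prompt}",
        )
    return None
```

The key design point is that no human sits in the loop: one model proposes, another judges, and the process repeats automatically until a prompt slips past the target’s safeguards.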
“I’m definitely concerned about the seeming ease with which we can break such models,” says Zico Kolter, a professor at Carnegie Mellon University whose research group demonstrated a gaping vulnerability in large language models in August.
Kolter says that some models now have safeguards that can block certain attacks, but he adds that the vulnerabilities are inherent to the way these models work and are therefore hard to defend against. “I think we need to understand that these sorts of breaks are inherent to a lot of LLMs,” Kolter says, “and we don’t have a clear and well-established way to prevent them.”
Large language models recently emerged as a powerful and transformative new kind of technology. Their potential became headline news as ordinary people were dazzled by the capabilities of OpenAI’s ChatGPT, released just a year ago.
In the months that followed the release of ChatGPT, discovering new jailbreaking methods became a popular pastime for mischievous users, as well as those interested in the security and reliability of AI systems. But scores of startups are now building prototypes and fully fledged products on top of large language model APIs. OpenAI said at its first-ever developer conference in November that over 2 million developers are now using its APIs.
These models simply predict the text that should follow a given input, but they are trained on vast quantities of text, from the web and other digital sources, using huge numbers of computer chips, over a period of many weeks or even months. With enough data and training, language models exhibit savant-like prediction skills, responding to an extraordinary range of input with coherent and pertinent-seeming information.
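That prediction machinery can be seen in miniature with an open model. Here is a toy sketch, assuming the Hugging Face transformers library is installed and using GPT-2 as a small stand-in for far larger models like GPT-4:

```python
# Toy illustration of next-token prediction with a small open model (GPT-2).
# Requires the Hugging Face `transformers` library and its model downloads.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Large language models are trained to", max_new_tokens=20)
print(out[0]["generated_text"])  # the model continues the text, one token at a time
```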
The models also exhibit biases learned from their training data and tend to fabricate information when the answer to a prompt is less straightforward. Without safeguards, they can offer advice to people on how to do things like obtain drugs or make bombs. To keep the models in check, the companies behind them use the same method employed to make their responses more coherent and accurate-looking: having humans grade the model’s answers and using that feedback to fine-tune the model so that it is less likely to misbehave, an approach known as reinforcement learning from human feedback.
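A highly simplified sketch of what that human feedback looks like as data, assuming a pairwise-comparison format; real pipelines collect many such comparisons and train a reward model on them before fine-tuning the LLM itself.

```python
# Illustrative shape of human-feedback data used to steer a model's behavior.
# The format is a simplification of what real RLHF pipelines collect.
from dataclasses import dataclass


@dataclass
class Comparison:
    prompt: str
    chosen: str    # response the human grader preferred
    rejected: str  # response the grader marked as worse (e.g., fabricated or unsafe)


dataset = [
    Comparison(
        prompt="Summarize the plot of Moby-Dick.",
        chosen="Ishmael joins Captain Ahab's whaling voyage to hunt the white whale...",
        rejected="Moby-Dick is a novel about a submarine crew searching for treasure...",
    ),
]

# A reward model trained on `dataset` learns to score responses the way graders do;
# the LLM is then fine-tuned to favor high-scoring responses, making fabricated or
# unsafe answers less likely--though, as the attacks above show, not impossible.
```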