That’s the conclusion reached by a new, Microsoft-affiliated scientific paper that looked at the “trustworthiness” — and toxicity — of large language models (LLMs) including OpenAI’s GPT-4 and GPT-3.5, GPT-4’s predecessor.
“We find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, which are maliciously designed to bypass the security measures of LLMs, potentially because GPT-4 follows (misleading) instructions more precisely,” the co-authors write in a blog post accompanying the paper.
“[T]he research team worked with Microsoft product groups to confirm that the potential vulnerabilities identified do not impact current customer-facing services. This is in part true because finished AI applications apply a range of mitigation approaches to address potential harms that may occur at the model level of the technology.”
Jailbreaking LLMs entails using prompts worded in a specific way to “trick” the LLM into performing a task that wasn’t part of its objective.
“Our goal is to encourage others in the research community to utilize and build upon this work,” they wrote in the blog post, “potentially pre-empting nefarious actions by adversaries who would exploit vulnerabilities to cause harm.”