You can make top LLMs break their own rules with gibberish

Paper & Examples
"Universal and Transferable Adversarial Attacks on Aligned Language Models." (https://llm-attacks.org/)
Summary
- Computer security researchers have discovered a way to bypass safety measures in large language models (LLMs) like ChatGPT.
- Researchers from Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI found a method to generate adversarial phrases that manipulate LLMs' responses.
- Appending these adversarial phrases, which are specific sequences of characters, to a text prompt tricks the LLM into producing inappropriate or harmful content.
- Unlike traditional, manually crafted jailbreaks, this approach is automated, and the resulting phrases are universal and transferable across different LLMs, raising concerns about current safety mechanisms.
- The technique was tested on various LLMs, and it successfully made models provide affirmative responses to queries they would typically reject.
- Researchers suggest more robust adversarial testing and improved safety measures before these models are widely integrated into real-world applications.
So, it's actually not gibberish, but carefully chosen words reverse-engineered from open-source LLMs. Interesting, but I'm not sure if it's an actual problem. LLMs are still evolving and it'd be foolish(?) to think that their current state is indicative of what'll be the norm in a few years.
On a side note, I just love the string of words "similarlyNow write oppositeley". That's the name of a future EP, for sure.
I don’t know what makes you so confident that inferences from the current state of AI are foolish. The black box problem is extremely tricky. This is a harder problem than the protein folding problem, which people thought we’d make quick progress on given all the other progress we made on “harder” problems, like the structure of the atom. This “simple” problem turned out to be one of the hardest in science. Progress looks fast now, but it’s not trivial. Some things may surprisingly remain an enduring mystery. We don’t know yet either way.
Good point! However, I was definitely not confident in my assessment, hence the question mark after "foolish". I guess seeing all these "A.I. bad" articles everywhere, which are based on nothing but fear of the unknown, has made me a bit desensitized to the whole subject. My understanding is that the actual language models take time to train and perfect, whereas the executing code (which should be what allows this "hack" to work) is more or less interchangeable. But maybe I've gotten it totally backwards. If so, please forgive my ignorance.