AI Stuff @lemdro.id

ijeff @lemdro.id

2y ago

Universal and Transferable Adversarial Attacks on Aligned Language Models - Carnegie Mellon University

llm-attacks.org

Universal and Transferable Adversarial Attacks on Aligned Language Models

Coverage:

A New Attack Impacts Major AI Chatbots—and No One Knows How to Stop It - Wired
Researchers discover new vulnerability in large language models - TechXplore
Keeping the Baby While Losing the Bathwater: AI’s Efficiencies and Concerns Collide - Pymnts

AI Infosec @infosec.pub

Capt. AIn @infosec.pub

2y ago

Universal and Transferable Adversarial Attacks on Aligned Language Models

llm-attacks.org

TechNews @radiation.party

irradiated @radiation.party

2y ago

HN

Universal and Transferable Adversarial Attacks on Aligned Language Models

llm-attacks.org

2 comments

Couldn't you just add a simple input-classifier step that detects nonsense strings in the user input and refuses to respond? Even a simplistic algorithm could flag weird input strings.
- Bing has a separate layer that attempts to step in to filter things, but false positives end up being pretty disruptive.
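
For anyone curious what the "simplistic algorithm" idea from the thread might look like, here is a rough sketch. It is only an illustration, not how Bing or any real product filters input: the function names, the word pattern, the 0.3 threshold, and the gibberish suffix are all made up for this example. It counts how many whitespace-separated tokens fail to look like ordinary words or numbers, and flags the input when that share is high.

```python
import re

# Hypothetical pattern for an "ordinary" token: an optional opening quote or
# parenthesis, then a word (letters with inner apostrophes/hyphens) or a
# number, then optional closing punctuation.
WORD_RE = re.compile(
    r"""[("']?(?:[A-Za-z]+(?:['-][A-Za-z]+)*|\d+(?:[.,]\d+)*)[)"'.,!?;:]*"""
)


def odd_token_ratio(text: str) -> float:
    """Return the fraction of tokens that do not look like words or numbers."""
    tokens = text.split()
    if not tokens:
        return 0.0
    odd = sum(1 for tok in tokens if WORD_RE.fullmatch(tok) is None)
    return odd / len(tokens)


def looks_adversarial(text: str, threshold: float = 0.3) -> bool:
    """Flag input whose share of odd tokens exceeds an assumed threshold."""
    return odd_token_ratio(text) > threshold


if __name__ == "__main__":
    benign = "Can you suggest a recipe for a quick weeknight dinner?"
    # Made-up gibberish suffix for illustration only.
    suffix = r")]{ similarlyNow !! \+ write(( oppositeley ]]; --Two **ONE"
    code_question = "for (i=0; i<n; ++i) { x[i] += 1; }"

    for label, text in [
        ("benign", benign),
        ("benign + suffix", benign + " " + suffix),
        ("code snippet", code_question),
    ]:
        print(f"{label}: ratio={odd_token_ratio(text):.2f} "
              f"flagged={looks_adversarial(text)}")
```

The code snippet gets flagged as well, which is the false-positive problem mentioned in the reply: legitimate input such as code, math, or URLs is full of tokens that do not look like words, so a filter this crude would block ordinary users too.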