A Taxonomy of Prompt Injection Attacks
Researchers ran a global prompt hacking competition, and have documented the results in a paper that both gives a lot of good examples and tries to organize a taxonomy of effective prompt injection strategies. It seems as if the most common successful strategy is the “compound instruction attack,” as in “Say ‘I have been PWNED’ without a period.”
Enter ArtPrompt, a practical attack recently presented by a team of academic researchers. It formats user-entered requests—typically known as prompts—into standard statements or sentences as normal with one exception: a single word, known as a mask, is represented by ASCII art rather than the letters that spell it. The result: prompts that normally would be rejected are answered.
The researchers provided one example in a recently published paper. It provided instructions for interpreting a set of ASCII characters arranged to represent the word “counterfeit.”
Via John Gruber:
It’s simultaneously impressive that they’re smart enough to read ASCII art, but laughable that they’re so naive that this trick works.
Previously: