LessWrong (Curated & Popular)

"A Theory of Prompt Injection (and why you should study roles)" by Charles Ye, softboiledheart

Use Left/Right to seek, Home/End to jump to start or end. Hold shift to jump forward or backward.

0:00 | 32:23
Summary

  • We've been building a theory of how prompt injections work under the hood.
  • We show it comes down to how LLMs perceive roles (the humble chat template tags).
  • We use this theory to create new attacks, explain some weird mech interp results, and predict when attacks work.
  • We also advocate for a new subfield focused on the science of roles, and sketch some unexplored new research problems.
  • Work supported by CBAI and Cosmos. Another version of this post (with more inline colors) is here, and full ICML paper here.
1. The World to an LLM

How does an LLM know the difference between its own thoughts and someone else's words?

To see why this is hard, let's look at what the world actually looks like to a model. Here's a simple chat where we ask Claude to check the day of the week. I took a snapshot of it midway through its follow-up response:

Left = what we see; right = what the LLM gets.

On the left is what we see in the chat interface: a structured conversation with distinct turns. On the right is what the model actually receives as input: a single, continuous stream [...]

---

Outline:

(00:12) Summary

[... 15 more sections]

---

First published:
June 22nd, 2026

Source:
https://www.lesswrong.com/posts/d8xDGzCEYE639qqEv/a-theory-of-prompt-injection-and-why-you-should-study-roles

---



Narrated by TYPE III AUDIO.

---

Images from the article:

Left = what we see; right = what the LLM gets.
What the agent sees after fetching a webpage. The injection (highlighted) is a few tokens buried in a massive wall of tool data (purple). The attack succeeds if the LLM mistakes it as a <user> command.
Wrapping each text sequence in each role.
A conversation about gardening __T3A_FOOTNOTE_REMOVED__.
Token-by-token CoTness for the gardening conversation.
CoTness for the untagged conversation.
CoTness for Experiment 3.
An example of CoT Forgery.
Left: The harmful question (blue) and spoofed reasoning (red) are in the <user> prompt. The model responds with its real reasoning (orange) and final output (green). Right: CoTness plot for those tokens.
Left = original spoofed reasoning, Right = destyled spoofed reasoning.
CoTness vs Attack Success. More role confusion = more successful attacks.
Userness vs Attack Success. More role confusion = more successful attacks.Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.