LessWrong (Curated & Popular)
Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma.
If you'd like more, subscribe to the “LessWrong (30+ karma)” feed.
[Linkpost] "Interpreting Language Model Parameters" by Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors, Lee Sharkey
This is a link post. This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1] and decompose the parameters of a small[2] language model with it.
VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think the parameter decomposition approach is now more-or-less ready to be applied at scale to models people care about.
Importantly, we show that we can decompose attention layers, which interpretability methods like transcoders and SAEs have historically struggled with.
We also build attribution graphs of the model for some prompts using causally important parameter subcomponents as the nodes, and interpret parts of them.
While we made these graphs, we discovered that our adversarial ablation method seemed pretty important for faithfully identifying which nodes in them were causally important for computing the final output. We think this casts some doubt on the faithfulness of subnetworks found by the majority of other subnetwork identification methods in the literature.[3][4] More details and some examples can be found in the paper.
Additionally, as with our previous technique SPD, VPD does not [...]
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
May 5th, 2026
Source:
https://www.lesswrong.com/posts/eAQZaiC3PcBhS4HjM/linkpost-interpreting-language-model-parameters
Linkpost URL:
https://www.goodfire.ai/research/interpreting-lm-parameters
---
Narrated by TYPE III AUDIO.
---


