https://www.lesswrong.com/posts/Km9sHjHTsBdbgwKyi/monitoring-for-deceptive-alignment
In this post, I want to propose a clear, concrete coordination task that I think might be achievable soon given the current landscape, that would generate a clear coordination win, and that would be highly useful in and of itself. Specifically:
I want DeepMind, OpenAI, and Anthropic to commit to actively monitor and look for evidence of deceptive alignment in their models—as well as run experiments to try to predict when and where deceptive alignment might occur before it does.
Notably, I am specifically referring only to the narrow case of deceptive alignment here, not just any situation where models say false things. Deceptive alignment is specifically a situation where the reason the model looks aligned is because it is actively trying to game the training signal for the purpose of achieving some ulterior goal.[1]
I think this is a pretty minimal ask that would nevertheless be a clear win. Among all AI failure modes, deceptive alignment is one of the most unambiguously bad: though I expect lots of disagreement about its likelihood, there should be little disagreement about its severity. Furthermore, things that I’m not asking for:
Nevertheless, as I stated, I think this would still be quite useful.