LessWrong MoreAudible Podcast
"Monitoring for deceptive alignment" by evhub
September 28, 2022
Robert

https://www.lesswrong.com/posts/Km9sHjHTsBdbgwKyi/monitoring-for-deceptive-alignment

In this post, I want to propose a clear, concrete coordination task that I think might be achievable soon given the current landscape, that would generate a clear coordination win, and that would be highly useful in and of itself. Specifically:

I want DeepMind, OpenAI, and Anthropic to commit to actively monitor and look for evidence of deceptive alignment in their models—as well as run experiments to try to predict when and where deceptive alignment might occur before it does.

Notably, I am specifically referring only to the narrow case of deceptive alignment here, not just any situation where models say false things. Deceptive alignment is specifically a situation where the reason the model looks aligned is that it is actively trying to game the training signal for the purpose of achieving some ulterior goal.[1]

I think this is a pretty minimal ask that would nevertheless be a clear win. Among all AI failure modes, deceptive alignment is one of the most unambiguously bad, which means that, though I expect lots of disagreement on its likelihood, there should be little disagreement regarding its severity. Furthermore, here are some things that I'm not asking for:

  • that they make such a commitment highly public or legally binding,
  • that they commit to any specific way of monitoring for or addressing the problem,
  • that any organization has to share anything they find with any other organization, or
  • that they commit to anything other than testing and monitoring.

Nevertheless, as I stated, I think this would still be quite useful.
