Thanks to Chris Olah, Neel Nanda, Kate Woolverton, Richard Ngo, Buck Shlegeris, Daniel Kokotajlo, Kyle McDonell, Laria Reynolds, Eliezer Yudkowksy, Mark Xu, and James Lucassen for useful comments, conversations, and feedback that informed this post.

The more I have thought about AI safety over the years, the more I have gotten to the point where the only worlds I can imagine myself actually feeling good about humanity’s chances are ones in which we have powerful transparency and interpretability tools that lend us insight into what our models are doing as we are training them.[1] Fundamentally, that’s because if we don’t have the feedback loop of being able to directly observe how the internal structure of our models changes based on how we train them, we have to essentially get that structure right on the first try—and I’m very skeptical of humanity’s ability to get almost anything right on the first try, if only just because there are bound to be unknown unknowns that are very difficult to predict in advance.

Certainly, there are other things that I think are likely to be necessary for humanity to succeed as well—e.g. convincing leading actors to actually use such transparency techniques, having a clear training goal that we can use our transparency tools to enforce, etc.—but I currently feel that transparency is the least replaceable necessary condition and yet the one least likely to be solved by default.

Nevertheless, I do think that it is a tractable problem to get to the point where transparency and interpretability is reliably able to give us the sort of insight into our models that I think is necessary for humanity to be in a good spot. I think many people who encounter transparency and interpretability, however, have a hard time envisioning what it might look like to actually get from where we are right now to where we need to be. Having such a vision is important both for enabling us to better figure out how to make that vision into reality and also for helping us tell how far along we are at any point—and thus enabling us to identify at what point we’ve reached a level of transparency and interpretability that we can trust it to reliably solve different sorts of alignment problems.

The goal of this post, therefore, is to attempt to lay out such a vision by providing a “tech tree” of transparency and interpretability problems, with each successive problem tackling harder and harder parts of what I see as the core difficulties. This will only be my tech tree, in terms of the relative difficulties, dependencies, and orderings that I expect as we make transparency and interpretability progress—I could, and probably will, be wrong in various ways, and I’d encourage others to try to build their own tech trees to represent their pictures of progress as well.

Share Episode

Share on Facebook Share on Twitter Share on LinkedIn Download

Spotify RSS Feed

Buzzsprout

Listen on

Spotify RSS Feed

Share Episode

Share on Facebook Share on Twitter Share on LinkedIn

https://www.lesswrong.com/posts/nbq2bWLcYmSGup9aF/a-transparency-and-interpretability-tech-tree

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

LessWrong MoreAudible Podcast

"A transparency and interpretability tech tree" by evhub

Listen to this podcast on