"Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" by Ajeya Cotra

LessWrong (Curated & Popular)

Chapters
6:41 - Cartoon: XKCD
21:25 - Premises of the hypothetical situation
23:44 - Basic setup: an AI company trains a “scientist model” very soon
33:51 - “Racing forward” assumption: Magma tries to train the most powerful model it can
36:28 - “HFDT scales far” assumption: Alex is trained to achieve excellent performance on a wide range of difficult tasks
39:00 - Diagram: Alex
41:24 - Diagram: Timestep
41:50 - Diagram: Chain of timesteps
46:31 - Why do I call Alex’s training strategy “baseline” HFDT?
49:06 - What are some training strategies that would not fall under baseline HFDT?
51:51 - “Naive safety effort” assumption: Alex is trained to be “behaviorally safe”
56:28 - Key properties of Alex: it is a generally-competent creative planner
1:00:50 - Alex learns to make creative, unexpected plans to achieve open-ended goals
1:02:30 - How the hypothetical situation progresses (from the above premises)
1:04:12 - Alex would understand its training process very well (including human psychology)
1:04:47 - A spectrum of situational awareness
1:08:29 - Why I think Alex would have very high situational awareness
1:12:09 - While humans are in control, Alex would be incentivized to “play the training game”
1:20:26 - Naive “behavioral safety” interventions wouldn’t eliminate this incentive
1:26:54 - Maybe inductive bias or path dependence favors honest strategies?
1:28:08 - As humans’ control fades, Alex would be motivated to take over
1:30:20 - Diagram: Many instances
1:31:41 - Deploying Alex would lead to a rapid loss of human control
1:36:53 - Image: Increasing copies via R&D
1:40:44 - In this new regime, maximizing reward would likely involve seizing control
1:47:05 - Even if Alex isn’t “motivated” to maximize reward, it would seek to seize control
1:54:23 - Giving negative rewards to “warning signs” would likely select for patience
1:57:16 - Why this simplified scenario is worth thinking about
2:09:35 - Acknowledgements
2:10:24 - Appendices
2:10:32 - What would change my mind about the path of least resistance?
2:16:01 - “Security holes” may also select against straightforward honesty
2:20:36 - Simple “baseline” behavioral safety interventions
2:21:49 - Using higher-quality feedback and extrapolating feedback quality
2:24:26 - Using prompt engineering to emulate more thoughtful judgments
2:26:51 - Requiring Alex to provide justification for its actions
2:31:24 - Making the training distribution more diverse
2:37:33 - Adversarial training to incentivize Alex to act conservatively
2:40:43 - “Training out” bad behavior
2:43:01 - “Non-baseline” interventions that might help more
2:46:06 - Examining arguments that gradient descent favors being nice over playing the training game
2:47:04 - Maybe telling the truth is more “natural” than lying?
2:48:52 - Maybe path dependence means Alex internalizes moral lessons early?
2:51:32 - Maybe gradient descent simply generalizes “surprisingly well”?
2:55:15 - A possible architecture for Alex
2:56:22 - Diagram: Basic Alex Function
2:57:50 - Diagram: Alex + Heads
3:00:32 - Diagram: Alex + Heads + Memory
3:03:11 - Plausible high-level features of a good architecture
3:03:28 - Diagram: High-level architecture of Alex

Sep 27, 2022

https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I think that in the coming 15-30 years, the world could plausibly develop “transformative AI”: AI powerful enough to bring us into a new, qualitatively different future, via an explosion in science and technology R&D. This sort of AI could be sufficient to make this the most important century of all time for humanity.

The most straightforward vision for developing transformative AI that I can imagine working with very little innovation in techniques is what I’ll call human feedback[1] on diverse tasks (HFDT):

Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.
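
The post describes HFDT only at this high level and does not commit to a specific algorithm. As a concrete anchor, here is a minimal toy sketch of the kind of loop it gestures at: a policy acts on many "tasks," a reward model is fit to simulated pairwise human preferences, and the policy is then trained against that learned reward. The environment, the simulated rater, the Bradley-Terry preference loss, the REINFORCE update, and all hyperparameters are illustrative assumptions, not details from the post.

```python
# Toy sketch of "human feedback on diverse tasks" (HFDT) as described above.
# Everything below (tasks, simulated rater, network sizes, losses) is an
# illustrative assumption; the post itself specifies no particular algorithm.

import torch
import torch.nn as nn
import torch.nn.functional as F

N_TASKS, DIM, N_ACTIONS = 8, 16, 4
torch.manual_seed(0)

# Each "task" is a random context vector; a hidden preference table stands in
# for what a human rater would approve of on that task.
task_contexts = torch.randn(N_TASKS, DIM)
hidden_prefs = torch.randn(N_TASKS, N_ACTIONS)

def human_feedback(task_id: int, a: int, b: int) -> int:
    """Simulated rater: 1 if action `a` is preferred to action `b`, else 0."""
    return int(hidden_prefs[task_id, a] > hidden_prefs[task_id, b])

policy = nn.Sequential(nn.Linear(DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
reward_model = nn.Sequential(nn.Linear(DIM + N_ACTIONS, 64), nn.ReLU(), nn.Linear(64, 1))
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-2)
opt_rm = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

def rm_score(task_id: int, action: torch.Tensor) -> torch.Tensor:
    """Reward model's scalar score for a (task, action) pair."""
    x = torch.cat([task_contexts[task_id], F.one_hot(action, N_ACTIONS).float()])
    return reward_model(x).squeeze()

for step in range(2000):
    task_id = torch.randint(N_TASKS, ()).item()
    logits = policy(task_contexts[task_id])
    dist = torch.distributions.Categorical(logits=logits)

    # 1. Collect a pairwise comparison from the (simulated) rater and fit the
    #    reward model with a Bradley-Terry style preference loss.
    a, b = dist.sample(), dist.sample()
    if a != b:
        pref = human_feedback(task_id, a.item(), b.item())
        margin = rm_score(task_id, a) - rm_score(task_id, b)
        rm_loss = F.binary_cross_entropy_with_logits(margin, torch.tensor(float(pref)))
        opt_rm.zero_grad()
        rm_loss.backward()
        opt_rm.step()

    # 2. Update the policy with REINFORCE against the learned reward, a crude
    #    stand-in for the PPO-style RLHF loops used in practice.
    action = dist.sample()
    reward = rm_score(task_id, action).detach()
    pi_loss = -dist.log_prob(action) * reward
    opt_pi.zero_grad()
    pi_loss.backward()
    opt_pi.step()
```

In the scenario the post goes on to describe, the policy is a far more capable model and the feedback comes from real human evaluators plus other performance metrics; the toy loop only illustrates how that feedback becomes a reward signal that gradient descent then optimizes.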

HFDT is not the only approach to developing transformative AI,[2] and it may not work at all.[3] But I take it very seriously, and I’m aware of increasingly many executives and ML researchers at AI companies who believe something within this space could work soon. 

Unfortunately, I think that if AI companies race forward training increasingly powerful models using HFDT, this is likely to eventually lead to a full-blown AI takeover (i.e. a possibly violent uprising or coup by AI systems). I don’t think this is a certainty, but it looks like the best-guess default absent specific efforts to prevent it.