"Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" by Ajeya Cotra

LessWrong (Curated & Popular)

Chapters
6:41 - Cartoon: XKCD
21:25 - Premises of the hypothetical situation
23:44 - Basic setup: an AI company trains a “scientist model” very soon
33:51 - “Racing forward” assumption: Magma tries to train the most powerful model it can
36:28 - “HFDT scales far” assumption: Alex is trained to achieve excellent performance on a wide range of difficult tasks
39:00 - Diagram: Alex
41:24 - Diagram: Timestep
41:50 - Diagram: Chain of timesteps
46:31 - Why do I call Alex’s training strategy “baseline” HFDT?
49:06 - What are some training strategies that would not fall under baseline HFDT?
51:51 - “Naive safety effort” assumption: Alex is trained to be “behaviorally safe”
56:28 - Key properties of Alex: it is a generally-competent creative planner
1:00:50 - Alex learns to make creative, unexpected plans to achieve open-ended goals
1:02:30 - How the hypothetical situation progresses (from the above premises)
1:04:12 - Alex would understand its training process very well (including human psychology)
1:04:47 - A spectrum of situational awareness
1:08:29 - Why I think Alex would have very high situational awareness
1:12:09 - While humans are in control, Alex would be incentivized to “play the training game”
1:20:26 - Naive “behavioral safety” interventions wouldn’t eliminate this incentive
1:26:54 - Maybe inductive bias or path dependence favors honest strategies?
1:28:08 - As humans’ control fades, Alex would be motivated to take over
1:30:20 - Diagram: Many instances
1:31:41 - Deploying Alex would lead to a rapid loss of human control
1:36:53 - Image: Increasing copies via R&D
1:40:44 - In this new regime, maximizing reward would likely involve seizing control
1:47:05 - Even if Alex isn’t “motivated” to maximize reward, it would seek to seize control
1:54:23 - Giving negative rewards to “warning signs” would likely select for patience
1:57:16 - Why this simplified scenario is worth thinking about
2:09:35 - Acknowledgements
2:10:24 - Appendices
2:10:32 - What would change my mind about the path of least resistance?
2:16:01 - “Security holes” may also select against straightforward honesty
2:20:36 - Simple “baseline” behavioral safety interventions
2:21:49 - Using higher-quality feedback and extrapolating feedback quality
2:24:26 - Using prompt engineering to emulate more thoughtful judgments
2:26:51 - Requiring Alex to provide justification for its actions
2:31:24 - Making the training distribution more diverse
2:37:33 - Adversarial training to incentivize Alex to act conservatively
2:40:43 - “Training out” bad behavior
2:43:01 - “Non-baseline” interventions that might help more
2:46:06 - Examining arguments that gradient descent favors being nice over playing the training game
2:47:04 - Maybe telling the truth is more “natural” than lying?
2:48:52 - Maybe path dependence means Alex internalizes moral lessons early?
2:51:32 - Maybe gradient descent simply generalizes “surprisingly well”?
2:55:15 - A possible architecture for Alex
2:56:22 - Diagram: Basic Alex Function
2:57:50 - Diagram: Alex + Heads
3:00:32 - Diagram: Alex + Heads + Memory
3:03:11 - Plausible high-level features of a good architecture
3:03:28 - Diagram: High-level architecture of Alex

Sep 27, 2022

https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I think that in the coming 15-30 years, the world could plausibly develop “transformative AI”: AI powerful enough to bring us into a new, qualitatively different future, via an explosion in science and technology R&D. This sort of AI could be sufficient to make this the most important century of all time for humanity.

The most straightforward vision for developing transformative AI that I can imagine working with very little innovation in techniques is what I’ll call human feedback[1] on diverse tasks (HFDT):

Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.
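
The post describes HFDT only at this high level and does not commit to a specific algorithm. As a concrete anchor, here is a minimal toy sketch of the kind of loop it gestures at: a policy acts on many "tasks," a reward model is fit to simulated pairwise human preferences, and the policy is then trained against that learned reward. The environment, the simulated rater, the Bradley-Terry preference loss, the REINFORCE update, and all hyperparameters are illustrative assumptions, not details from the post.

```python
# Toy sketch of "human feedback on diverse tasks" (HFDT) as described above.
# Everything below (tasks, simulated rater, network sizes, losses) is an
# illustrative assumption; the post itself specifies no particular algorithm.

import torch
import torch.nn as nn
import torch.nn.functional as F

N_TASKS, DIM, N_ACTIONS = 8, 16, 4
torch.manual_seed(0)

# Each "task" is a random context vector; a hidden preference table stands in
# for what a human rater would approve of on that task.
task_contexts = torch.randn(N_TASKS, DIM)
hidden_prefs = torch.randn(N_TASKS, N_ACTIONS)

def human_feedback(task_id: int, a: int, b: int) -> int:
    """Simulated rater: 1 if action `a` is preferred to action `b`, else 0."""
    return int(hidden_prefs[task_id, a] > hidden_prefs[task_id, b])

policy = nn.Sequential(nn.Linear(DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
reward_model = nn.Sequential(nn.Linear(DIM + N_ACTIONS, 64), nn.ReLU(), nn.Linear(64, 1))
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-2)
opt_rm = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

def rm_score(task_id: int, action: torch.Tensor) -> torch.Tensor:
    """Reward model's scalar score for a (task, action) pair."""
    x = torch.cat([task_contexts[task_id], F.one_hot(action, N_ACTIONS).float()])
    return reward_model(x).squeeze()

for step in range(2000):
    task_id = torch.randint(N_TASKS, ()).item()
    logits = policy(task_contexts[task_id])
    dist = torch.distributions.Categorical(logits=logits)

    # 1. Collect a pairwise comparison from the (simulated) rater and fit the
    #    reward model with a Bradley-Terry style preference loss.
    a, b = dist.sample(), dist.sample()
    if a != b:
        pref = human_feedback(task_id, a.item(), b.item())
        margin = rm_score(task_id, a) - rm_score(task_id, b)
        rm_loss = F.binary_cross_entropy_with_logits(margin, torch.tensor(float(pref)))
        opt_rm.zero_grad()
        rm_loss.backward()
        opt_rm.step()

    # 2. Update the policy with REINFORCE against the learned reward, a crude
    #    stand-in for the PPO-style RLHF loops used in practice.
    action = dist.sample()
    reward = rm_score(task_id, action).detach()
    pi_loss = -dist.log_prob(action) * reward
    opt_pi.zero_grad()
    pi_loss.backward()
    opt_pi.step()
```

In the scenario the post goes on to describe, the policy is a far more capable model and the feedback comes from real human evaluators plus other performance metrics; the toy loop only illustrates how that feedback becomes a reward signal that gradient descent then optimizes.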

HFDT is not the only approach to developing transformative AI,[2] and it may not work at all.[3] But I take it very seriously, and I’m aware of increasingly many executives and ML researchers at AI companies who believe something within this space could work soon. 

Unfortunately, I think that if AI companies race forward training increasingly powerful models using HFDT, this is likely to eventually lead to a full-blown AI takeover (i.e. a possibly violent uprising or coup by AI systems). I don’t think this is a certainty, but it looks like the best-guess default absent specific efforts to prevent it.