"The Waluigi Effect (mega-post)" by Cleo Nardo

LessWrong (Curated & Popular)

Chapters
0:32 Background
0:56 Prompting LLMs with direct queries
3:40 Prompting LLMs with flattery and dialogue
5:05 Simulator Theory
9:03 The limits of flattery
12:16 Derrida — il n'y a pas de hors-texte
15:17 The Waluigi Effect
17:00 (1) Rules are meant to be broken.
19:45 (2) Traits are complex, valences are simple.
22:03 (3) Structuralist narratology
25:42 Superpositions will typically collapse to waluigis
27:48 Evidence from Microsoft Sydney
28:42 Waluigis after RLHF
31:19 (2) Empirical evidence from Perez et al.
33:11 (3) RLHF promotes mode-collapse
40:22 Conclusion
Mar 08, 2023

https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post

In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc). This article will be folklorish to some readers, and profoundly novel to others.
