"The Waluigi Effect (mega-post)" by Cleo Nardo

LessWrong (Curated & Popular)

Chapters
0:32 Background
0:56 Prompting LLMs with direct queries
3:40 Prompting LLMs with flattery and dialogue
5:05 Simulator Theory
9:03 The limits of flattery
12:16 Derrida — il n'y a pas de hors-texte
15:17 The Waluigi Effect
17:00 (1) Rules are meant to be broken.
19:45 (2) Traits are complex, valences are simple.
22:03 (3) Structuralist narratology
25:42 Superpositions will typically collapse to waluigis
27:48 Evidence from Microsoft Sydney
28:42 Waluigis after RLHF
31:19 (2) Empirical evidence from Perez et al.
33:11 (3) RLHF promotes mode-collapse
40:22 Conclusion
Mar 08, 2023

https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post

In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc). This article will be folklorish to some readers, and profoundly novel to others.
