Deep Papers

Sleep-time Compute: Beyond Inference Scaling at Test-time

Arize AI

What if your LLM could think ahead—preparing answers before questions are even asked?

In this week's paper read, we dive into a groundbreaking new paper from researchers at Letta, introducing sleep-time compute: a novel technique that lets models do their heavy lifting offline, well before the user query arrives. By predicting likely questions and precomputing key reasoning steps, sleep-time compute dramatically reduces test-time latency and cost—without sacrificing performance.

We explore results on new benchmarks (Stateful GSM-Symbolic, Stateful AIME, and a multi-query extension of GSM-Symbolic) that show up to 5x lower compute at inference, 2.5x lower cost per query, and up to 18% higher accuracy when sleep-time compute is scaled up.

You’ll also see how this method applies to realistic agent use cases and what makes it most effective. If you care about LLM efficiency, scalability, or cutting-edge research, this episode is for you.

Explore more AI research, or sign up to hear the next session live: arize.com/ai-research-papers

Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
