De Nederlandse Kubernetes Podcast

#119 Your Web App Scaling Tricks Don’t Work for LLMs

Ronald Kers en Jan Stomphorst Season 3 Episode 41

Episode length: 36:46

In this episode, we talk with Abdel Sghiouar and Mofi Rahman, Developer Advocates at Google and (guest) hosts of the Kubernetes Podcast from Google.
Together, we dive into one central question: can you truly run LLMs reliably and at scale on Kubernetes?

It quickly becomes clear that LLM workloads behave nothing like traditional web applications:

  • GPUs are scarce, expensive, and difficult to schedule.
  • Models are massive, some reaching 700 GB, which makes load times, storage throughput, and caching critical.
  • Container images become huge, so the usual advice to “build small containers” is nearly impossible to follow.
  • Autoscaling on CPU or RAM doesn’t work; new signals like GPU cache pressure, queue depth, and model latency take over.
  • LLM requests don’t parallelize the way stateless web requests do, so batching and model-aware routing through the Gateway API Inference Extension (the “Inference Gateway”) become essential.
  • Device Management and Dynamic Resource Allocation (DRA) are forming the new foundation for GPU/TPU orchestration.
  • Security shifts as rootless containers often no longer work with hardware accelerators.
  • Guardrails (input/output filtering) become a built-in part of the inference path.
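
Two of the points above come down to simple arithmetic, sketched here in Python. The first function gives a rough lower bound on how long it takes to read model weights from storage. The second applies the proportional scaling rule used by the Kubernetes Horizontal Pod Autoscaler, ceil(currentReplicas × currentMetric / targetMetric), to queue depth instead of CPU. The function names, the 2 GB/s throughput, and the queue-depth targets are illustrative assumptions, not figures from the episode.

```python
import math


def model_load_seconds(model_size_gb: float, throughput_gb_per_s: float) -> float:
    """Rough lower bound on the time needed to read model weights from storage."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return model_size_gb / throughput_gb_per_s


def desired_replicas(
    current_replicas: int,
    queue_depth_per_replica: float,
    target_queue_depth: float,
    max_replicas: int,
) -> int:
    """HPA-style proportional scaling on queue depth, clamped to [1, max_replicas]."""
    if target_queue_depth <= 0:
        raise ValueError("target queue depth must be positive")
    wanted = math.ceil(current_replicas * queue_depth_per_replica / target_queue_depth)
    return max(1, min(wanted, max_replicas))


if __name__ == "__main__":
    # Assumed numbers: a 700 GB model read at 2 GB/s (hypothetical throughput).
    print(model_load_seconds(700, 2.0))
    # Assumed numbers: 4 replicas, 30 queued requests each, target of 10 per replica.
    print(desired_replicas(4, 30, 10, max_replicas=16))
```

With these assumed numbers, a new replica needs close to six minutes just to load weights, which is why scaling decisions for LLM workloads have to be made earlier and on different signals than for a web app that starts in seconds.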

And then there’s the occasional request from customers who want deterministic LLM output, to which Mofi dryly responds: “You don’t need a model. You need a database.”


Powered by: ACC ICT

Send us a message.

DevOps Conference
The Conference for CI/CD, Kubernetes, Platform Engineering & DevSecOps 

Use code k8_Podcast for 15% off

Support the show

Like and subscribe! It helps out a lot.

You can also find us on:
De Nederlandse Kubernetes Podcast - YouTube
Nederlandse Kubernetes Podcast (@k8spodcast.nl) | TikTok
De Nederlandse Kubernetes Podcast

Where can you meet us:
Events
