Slight Reliability
Episodes
124 episodes
Being a Digital Nomad with Amin Astaneh (Episode 122)
Can you combine your career with personal adventure? What would it be like to live in a truck and travel the country while working remotely? This week I'm joined again by SRE & DevOps legend Amin Astaneh in one of the most human inter...
Four Golden Signals to Kickstart SRE (Episode 121)
When you first start implementing SRE it's a good idea to find early wins. Implementing monitoring of the four golden signals + availability is something I'm experimenting with at the moment to give our SRE team momentum and to pave the way for...
Staying Motivated as a Leader with Cads Oakley (Episode 120)
As an engineer you get constant dopamine hits by solving technical problems. As a leader you're often working toward long term goals that span months or years. How do you stay motivated in that context? This week I'm joined by technology ...
A Beginner's Guide to SRE (Episode 119)
This week I repurpose a talk I just did at the JuniorDev meetup in Auckland. If you're new to SRE or observability then this is the talk for you. For the more seasoned listeners, it's a chance to see how my perspective and understanding has cha...
Freeing Observability Data Hostages with Jacob Leverich (Episode 118)
How do you ingest and store petabytes of telemetry every day in a cost-effective and high-performing way? How can you do this in a way that gives engineers the operational data they need to keep services running? How has this challenge be tack...
How to Change the World with Rob Roe (Episode 117)
How do you take all the utopian ideas you read about in books and apply them to the reality of the organisations we work in? This week I'm joined by leader, mentor, and coach Rob Roe to tackle this question. We discuss... 🌪️ The pit...
Human Software with Richard Bown (Episode 116)
We spend a third of our life at work. It needs to be something we enjoy and something with purpose. Our work experience also impacts our family, friends, and our personal lives. This week I'm joined by tech engineer, leader, and author Ri...
Leadership Gym with Xiao Zhang (Episode 115)
When you become a people leader there is no manual. How can we not only learn leadership skills but practice them and build leadership muscle? This week I'm joined by Orion Group Limited co-founder Xiao Zhang to discuss... 👑 The cha...
Starting a New Role (Episode 114)
This week I kick off the 2026 season with some news and we explore how to prepare for a new role. You can buy Slight Reliability merch here (Note: you cannot order the mugs outside of New Zealand):
AI Use-cases for SRE with Shmuel Kliger (Episode 113)
From the day we invented computers we've been struggling to keep applications running and delivering services to the business. Is this latest wave of AI helping or hurting us? This week I'm joined by Causely founder Shmuel Kliger to dive ...
Operational Intelligence with Adam Kinniburgh (Episode 112)
What is operational intelligence and how is it different from observability or BI? This week I'm joined by SquaredUp's VP of Innovation Adam Kinniburgh to answer that question and many more including... ❓ What is operational intelli...
Leading Platform Teams with Dinesh Sukhija (Episode 111)
How does leading platform teams differ from leading product teams? This week I'm joined by experienced technology leader Dinesh Sukhija to answer that question and many more including... ❓ What is a platform team? ⚽ Coaching engi...
Leadership Round One! (Episode 110)
How have my first two years as a manager in tech been? What have I learned? What do I need to work on? This week I share my experiences over the past couple of years. I cover: 🔥 My recent close call with burnout 🫶 How I attempted ...
The Implications of AI on Observability with Aaron "Checo" Pacheco (Episode 109)
How could AI help human beings negotiate the mountains of telemetry we collect to get simple and fast insight? This week I'm joined by Ottermon AI CEO and founder Checo Pacheco to talk about the lifecycle of observability coverage and tooling wit...
Chaos Engineering with Kolton Andrus (Episode 108)
What is chaos engineering and how is it being used in 2025? This week I'm joined by Gremlin CEO and founder Kolton Andrus to discuss... 🌪️ What is chaos engineering and what are its origins? 🪴 How has it evolved over the years?...
Team Topologies with Luke McManus (Episode 107)
What are Team Topologies? How can they be used to deliver value more simply and effectively (and in a more humane way)? This week I'm joined by Luke McManus to discuss... ⛰️ What are the four team topologies? 🏆 Can we have too m...
Contributing to Open Source with Wendy Ha (Episode 106)
How do you begin contributing to an open source project? What's it like? What do you get out of it? This week I'm joined by Wendy Ha who shares her unique story of joining the Kubernetes project and becoming a contributor. We explore...
Influencing Leadership with Nora Jones (Episode 105)
As an #SRE how do you influence senior leadership to get support and priority for the things you care about? To answer this question I'm joined by Nora Jones, founder of Jeli and now Head of Pricing, Product Strategy and Growth at PagerDu...
Slight Reliability Podcast Retrospective (Episode 104)
This week I do a retrospective on the Slight Reliability podcast. 👂 How many people listen to it? ❤️ How do I feel about the show? 🎉 What's going well? 🪴 What could be better? ❔ What's next for the show? If you want to c...
Burnout with Colette Alexander (Episode 103)
Have you burned out at work? What was your experience? How did you work through it? This week I'm joined by the incredible Colette Alexander to discuss what burnout is, what it means, and we both share our personal experiences burning out...
Mobile Observability with Hanson Ho (Episode 102)
This week I'm joined by the wonderful Hanson Ho to discuss the unique challenges and opportunities in making our mobile apps observable! We cover... 📱 The mobile/backend observability divide ✍️ The challenge of distributed tracing on m...
Intro to Resilience Engineering with Michelle Casey (Episode 101)
This week I'm joined once more by SRE leader Michelle Casey who gives a broad and shallow introduction to resilience engineering. We cover... 🏋️‍♀️ Reliability VS Robustness VS Resilience 🧩 What is a complex system? 🔢 Safety ...
Learning with John Allspaw (Episode 100)
This week on the 100th episode I'm joined by DevOps and Resilience Engineering legend John Allspaw to talk about learning (especially from incidents). We discuss... 📒 Classroom VS situated learning 🤝 The myth of the perfect ha...
Focusing on What Matters with Trent Hornibrook (Episode 99)
This week I'm joined by SRE leader Trent Hornibrook who shares a story about how he improved on-call early in his career, and then we explore the broader theme of focusing on the things that matter in observability, incident response, on-call, ...
The Root Cause Fallacy with Andrew Hatch (Episode 98)
This week I'm joined by SRE leader Andrew Hatch from Cisco ThousandEyes to talk about a dirty word in the resilience community... root cause. In this excellent conversation we explore... 🌌 Is the root cause of every incident the big bang?...