Slight Reliability
Learning SRE, one day at a time.
Episodes
115 episodes
AI Use-cases for SRE with Shmuel Kliger (Episode 113)
From the day we invented computers we've been struggling to keep applications running and delivering services to the business. Is this latest wave of AI helping or hurting us? This week I'm joined by Causely CEO and founder Shmuel Kliger ...
Season 2 • Episode 113 • 31:51
Operational Intelligence with Adam Kinniburgh (Episode 112)
What is operational intelligence and how is it different from observability or BI? This week I'm joined by SquaredUp's VP of Innovation Adam Kinniburgh to answer that question and many more including... ❓ What is operational intelli...
Season 2 • Episode 112 • 31:17
Leading Platform Teams with Dinesh Sukhija (Episode 111)
How does leading platform teams differ from leading product teams? This week I'm joined by experienced technology leader Dinesh Sukhija to answer that question and many more including... ❓ What is a platform team? ⚽ Coaching engi...
Season 2 • Episode 111 • 32:01
Leadership Round One! (Episode 110)
How have my first two years as a manager in tech been? What have I learned? What do I need to work on? This week I share my experiences over the past couple of years. I cover: 🔥 My recent close call with burnout 🫶 How I attempted ...
Season 2 • Episode 110 • 19:59
The Implications of AI on Observability with Aaron "Checo" Pacheco (Episode 109)
How could AI help human beings negotiate the mountains of telemetry we collect to get simple and fast insight? This week I'm joined by Ottermon AI CEO and founder Checo Pacheco to talk about the lifecycle of observability coverage and tooling wit...
Season 2 • Episode 109 • 38:27
Chaos Engineering with Kolton Andrus (Episode 108)
What is chaos engineering and how is it being used in 2025? This week I'm joined by Gremlin CEO and founder Kolton Andrus to discuss... 🌪️ What is chaos engineering and what are its origins? 🪴 How has it evolved over the years?...
Season 2 • Episode 108 • 31:16
Team Topologies with Luke McManus (Episode 107)
What are Team Topologies? How can they be used to deliver value more simply and effectively (and in a more humane way)? This week I'm joined by Luke McManus to discuss... ⛰️ What are the four team topologies? 🏆 Can we have too m...
Season 2 • Episode 107 • 23:10
Contributing to Open Source with Wendy Ha (Episode 106)
How do you begin contributing to an open source project? What's it like? What do you get out of it? This week I'm joined by Wendy Ha who shares her unique story of joining the Kubernetes project and becoming a contributor. We explore...
Season 2 • Episode 106 • 43:52
Influencing Leadership with Nora Jones (Episode 105)
As an #SRE how do you influence senior leadership to get support and priority for the things you care about? To answer this question I'm joined by Nora Jones, founder of Jeli and now Head of Pricing, Product Strategy and Growth at PagerDu...
Season 2 • Episode 105 • 28:16
Slight Reliability Podcast Retrospective (Episode 104)
This week I do a retrospective on the Slight Reliability podcast. 👂 How many people listen to it? ❤️ How do I feel about the show? 🎉 What's going well? 🪴 What could be better? ❔ What's next for the show? If you want to c...
Season 2 • Episode 104 • 27:28
Burnout with Colette Alexander (Episode 103)
Have you burned out at work? What was your experience? How did you work through it? This week I'm joined by the incredible Colette Alexander to discuss what burnout is, what it means, and we both share our personal experiences burning out...
Season 2 • Episode 103 • 38:36
Mobile Observability with Hanson Ho (Episode 102)
This week I'm joined by the wonderful Hanson Ho to discuss the unique challenges and opportunities in making our mobile apps observable! We cover... 📱 The mobile/backend observability divide ✍️ The challenge of distributed tracing on m...
Season 2 • Episode 102 • 31:57
Intro to Resilience Engineering with Michelle Casey (Episode 101)
This week I'm joined once more by SRE leader Michelle Casey who gives a broad and shallow introduction to resilience engineering. We cover... 🏋️‍♀️ Reliability VS Robustness VS Resilience 🧩 What is a complex system? 🔢 Safety ...
Season 2 • Episode 101 • 39:36
Learning with John Allspaw (Episode 100)
This week on the 100th episode I'm joined by DevOps and Resilience Engineering legend John Allspaw to talk about learning (especially from incidents). We discuss... 📒 Classroom VS situated learning 🤝 The myth of the perfect ha...
Season 2 • Episode 100 • 48:17
Focusing on What Matters with Trent Hornibrook (Episode 99)
This week I'm joined by SRE leader Trent Hornibrook who shares a story about how he improved on-call early in his career, and then we explore the broader theme of focusing on the things that matter in observability, incident response, on-call, ...
Season 2 • Episode 99 • 29:28
The Root Cause Fallacy with Andrew Hatch (Episode 98)
This week I'm joined by SRE leader Andrew Hatch from Cisco ThousandEyes to talk about a dirty word in the resilience community... root cause. In this excellent conversation we explore... 🌌 Is the root cause of every incident the big bang?...
Season 2 • Episode 98 • 32:22
Synthetic Monitoring with David Dick (Episode 97)
This week I'm joined by David Dick from 2 Steps to (finally!) discuss synthetic monitoring. We cover... 🤖 What is synthetic monitoring? 🦾 What are the benefits and drawbacks to using it? ☢️ Non-web based synthetics (the tough stuff)...
Season 2 • Episode 97 • 33:04
Tech Leadership with Milan Brown (Episode 96)
This week I'm joined by Cin7 Engineering Director Milan Brown to unpack the challenges of technology management and leadership. We discuss... ✖️ Theory X vs Theory Y management 🗣️ Intention based leadership and communication 🏢 Condi...
Season 2 • Episode 96 • 31:27
Finding Tech Work with Leon Adato (Episode 95)
This week Leon Adato and I break down the state of applying for roles in tech. We cover... 📝 What a resume or CV is and is not 🤝 Leveraging your connections rather than relying on applying cold 🪄 How most job descriptions are works ...
Season 2 • Episode 95 • 36:26
Getting a Start in SRE with Priyam Kumar (Episode 94)
This week Priyam Kumar shares his story of moving from a massive organisation to a startup and the challenges and growth that came from that. We discuss... 🪖 War stories and examples of production incidents 🩹 The "hacks" we build to ke...
Season 2 • Episode 94 • 31:09
SRE Leadership with Michelle Casey (Episode 93)
This week Michelle Casey shares her insights as a 'head of' engineering manager in the SRE context. This was one of my favourite conversations on the podcast so far. We cover topics such as... 🤷🏽 Why move into leadership? 👁️ Learning f...
Season 2 • Episode 93 • 39:29
Observability Maturity with Ádám Tóth (Episode 92)
This week Adam and I get philosophical about what constitutes maturity in the field of observability. We tackle questions such as... 💸 Does your org treat observability as a cost centre or a value add? 🔥 Are you using observability rea...
Season 2 • Episode 92 • 30:09
Head in the Clouds (Episode 91)
In this episode I explore the challenges of achieving unified observability when integrating with SaaS products and services. I cover: 🌊 The new wave of mega-complex SaaS ⚗️ Challenges integrating SaaS with our observability pipelines...
Season 2 • Episode 91 • 15:43
Non-Prod Reliability Engineering + 2024 Wrap (Episode 90)
This week I check in and give an update on work, life, and my attempts at bringing to life SRE practices in the world of non-production environment management. You can find the official Slight Reliability podcast website at:
Season 2 • Episode 90 • 18:13
Slight Reliability Episode 89 - Blameless Post-mortems with Karanveer Anand
This week I'm joined by Karanveer Anand, SRE Technical Program Manager at Google to discuss blameless post-mortems. We cover: 🦅 The recent CrowdStrike outage and their public post-mortem 🚑 When do we do a blameless post-mortem? 😕 H...
Season 2 • Episode 89 • 26:06