Oliver Leaver-Smith - On how "just a monitoring change" took down the entire site and resilience engineering - #5
Oliver is a distinguished technologist currently working as a Senior DevOps Engineer at Sky Betting and Gaming.
Listen to the Software Misadventures Podcast on Apple or Spotify, or watch us on YouTube.
Oliver Leaver-Smith, better known as Ols, is a distinguished technologist currently working as a Senior DevOps Engineer at Sky Betting and Gaming. His topics of expertise include OpenBSD, automation, chaos and resilience engineering, and Nerf warfare. He is also interested in security, privacy, open source, decentralisation and federation, hardware hacking, and cyberdecks. He has been interested in technology (specifically how it breaks) from a young age. His first foray into Linux came in 2003 when, while blindly following the installation instructions in Sams Teach Yourself Red Hat Linux, he upgraded his dad’s Windows XP machine to Red Hat 9. This resulted in him getting his own computer, so it has been widely regarded as a good thing.
We had a lot of fun speaking with Ols! We discuss how a seemingly simple monitoring change ended up taking down the entire site. We also talk about chaos and resilience engineering, a topic Ols cares deeply about. We discuss how his team at Sky Betting and Gaming conducts fire drills (chaos engineering exercises) that test the resilience not only of their software systems but also of their people systems. We walk through a recent example of a fire drill, how the drills have evolved over the past few years, and the lessons learned in the process. Please enjoy this fun conversation with Ols!
Show Notes:
It’s just a monitoring change - Blog post in which Ols writes about the outage we discuss in the show
Stay in Touch:
👋 Send feedback or say hi: softwaremisadventures@gmail.com