Chaos Engineering

Tammy Bryant Butow - On failure injection, chaos engineering, extreme sports and being curious - #6

Tammy Butow is a Principal SRE at Gremlin, where she works on Chaos Engineering. In this episode, we discuss how her curiosity led her to the world of infrastructure engineering, an outage from her early days in which a core switch took down half the datacenter, and her experience running a disaster recovery test, which taught her the importance of injecting failures into a system to make it more resilient. We also touch on advanced failure injection techniques and how chaos engineering is evolving. Tammy has some great advice for teams looking to get started with chaos engineering.

Oliver Leaver-Smith - On how "just a monitoring change" took down the entire site and resilience engineering - #5

Ols is a Senior DevOps Engineer at Sky Betting and Gaming. In this episode, we discuss how a seemingly simple monitoring change ended up taking down the entire site. We also talk about chaos and resilience engineering, including how the team at Sky Betting and Gaming conducts fire drills (chaos engineering exercises) that test the resiliency of not only their software systems but also their people systems. We walk through a recent example of a fire drill, how the drills have evolved over the past few years, and the lessons learned in the process.