site reliability engineering

Tammy Bryant Butow - On failure injection, chaos engineering, extreme sports and being curious - #6

Tammy Butow is a Principal SRE at Gremlin, where she works on Chaos Engineering. In this episode, we discuss how her curiosity led her to the world of infrastructure engineering, an outage from her early days in which a core switch took down half the datacenter, and her experience running a disaster recovery test, which taught her the importance of injecting failures into a system to make it more resilient. We also touch on advanced failure injection techniques and how chaos engineering is evolving. Tammy has some great advice for teams looking to get started with chaos engineering.

Ryan Underwood - On debugging the Linux kernel - #4

Ryan Underwood is a Staff SRE and tech lead on the Helix and Zookeeper SRE team at LinkedIn. Prior to LinkedIn, he was an SRE at Machine Zone and Google. Apart from his regular responsibilities, Ryan’s interests and expertise include debugging production kernel, I/O, and containerization issues. His conviction that software should not be treated as a black box and his persistent approach to debugging complex problems are truly inspiring.

David Henke - On building a culture of "Site Up" at LinkedIn and Yahoo! - #3

David is LinkedIn’s former SVP of Engineering and Operations. In this insightful conversation, he shares stories from the early days at LinkedIn and what it took to develop the culture of "Site Up and Secure". We also talk about David’s three retirements over the course of his career, his advice on developing operational excellence, and his lessons on being an effective leader.