site reliability engineering

Bruno Connelly - Building and leading the global SRE org at LinkedIn - #14

Bruno Connelly is a VP of Engineering at LinkedIn. He leads the Site Engineering org responsible for LinkedIn's production infrastructure. He joins the show to talk about his journey in tech - from teaching himself how to code at a young age, building, maintaining and reverse engineering software as a teenager, and building ISPs in the early part of his career (there are some fun stories that involve sleeping in the data center) to leading the SRE org at LinkedIn over the last decade. He talks about the early days at LinkedIn, which involved a lot of firefighting to keep the site up, and how the team built technical stability and scaled the platform. We also dive into how he grew the SRE org globally and overcame the challenges that came with that growth. Throughout the conversation, he shares various nuggets of wisdom - like how to stay calm under pressure and how to make people feel at ease - as he describes his leadership style, the people who have influenced him and what he thinks is a positive way to collaborate with people.

Lorin Hochstein - On how Netflix learns from incidents, software as socio-technical systems, writing persuasively and more - #13

With 5+ years of experience building resilient systems at Netflix scale, Lorin joins the show to chat about his favorite incident story, the path that led him to chaos engineering (and later away from it), and why he advocates for a dedicated analyst to talk to people after an incident. Throughout the conversation, Lorin shares his philosophy and tips on how to learn from incidents, what engineers can gain from writing better, and why some metrics may not be as useful as you think.

Todd Underwood - On lessons from running ML systems at Google for a decade, what it takes to be a ML SRE, challenges with generalized ML platforms and much more - #10

Todd is a Sr Director of Engineering at Google where he leads Site Reliability Engineering teams for Machine Learning. Having worked on SRE at Google for more than 12 years, Todd recently gave a talk on how ML breaks in production, drawing on more than a decade of outage reports and postmortems. In this conversation, we go into different aspects of what makes it difficult to do ML well in production - like why it’s not enough to just look at aggregated statistics for ML monitoring, all the caveats of building a generalized platform for model training, and figuring out who’s on the hook when ML models don’t perform as expected in production. We also chat about what Todd looks for when hiring ML SREs, his impressive skill at getting LinkedIn skill endorsements and much more.

Charity Majors - On database outages, journey as a co-founder, thriving under pressure and growing as an engineer - #7

Charity Majors is the co-founder and CTO of honeycomb.io. We had a lot of fun speaking with Charity in this lively conversation! We learned about her journey from being an engineer to co-founding Honeycomb, what it was like being on-call when she was only 17, and staying calm during production incidents. We talked about various production outages throughout the episode. Charity also shares what it takes to build an awesome engineering culture, the engineer/manager pendulum, and qualities Charity looks for when hiring senior engineers.

Tammy Bryant Butow - On failure injection, chaos engineering, extreme sports and being curious - #6

Tammy Butow is a Principal SRE at Gremlin where she works on Chaos Engineering. In this episode, we discuss how her curiosity led her to the world of infrastructure engineering, an outage from her early days where a core switch took down half the datacenter, and her experience running a disaster recovery test, which taught her the importance of injecting failures into a system to make it more resilient. We also touch on advanced failure injection techniques and how chaos engineering is evolving. Tammy has some great advice for teams looking to get started with chaos engineering.

Ryan Underwood - On debugging the Linux kernel - #4

Ryan Underwood is a Staff SRE and tech lead on the Helix and ZooKeeper SRE team at LinkedIn. Prior to LinkedIn, he was an SRE at Machine Zone and Google. Apart from his regular responsibilities, Ryan’s interests and expertise include debugging production kernel, I/O and containerization issues. His view that software should not be treated as a black box and his persistent approach to debugging complex problems are truly inspiring.

David Henke - On building a culture of "Site Up" at LinkedIn and Yahoo! - #3

David is LinkedIn’s former SVP of Engineering and Operations. In this insightful conversation, he shares stories from the early days at LinkedIn and what it took to develop the culture of "Site Up and Secure". We also talk about David’s three retirements throughout his career, his advice on developing operational excellence and lessons on being an effective leader.