sre

Cory Watson - Leading observability teams at Twitter & Stripe, how to succeed in a new org, effective ways to advocate for your team and more - #16

Cory is currently a Solutions Engineer at Jeli.io and is well known in the community for his work on observability. His career in observability began at Twitter, where he managed the observability team. He then joined Stripe, where he created and led the observability team, this time as a Principal Engineer. We talk to him about how he got his start in customer support and the role it played later in his career. We discuss his time at Twitter, where a power outage hit the data center on the day he joined and where he once had to stay up all night dealing with file handle leaks. We also discuss how he created and led the observability team at Stripe as an individual contributor, how to succeed in a new org, how to navigate information asymmetry in the workplace, effective ways to advocate for your team, and how we are all just humans trying to get stuff done.

Bruno Connelly - Building and leading the global SRE org at LinkedIn - #14

Bruno Connelly is a VP of Engineering at LinkedIn. He leads the Site Engineering org responsible for LinkedIn's production infrastructure. He joins the show to talk about his journey in tech: teaching himself to code at a young age, building, maintaining and reverse engineering software as a teenager, building ISPs early in his career (there are some fun stories that involve sleeping in the data center), and leading the SRE org at LinkedIn over the last decade. He talks about the early days at LinkedIn, which involved a lot of firefighting to keep the site up, and about how the team built technical stability and scaled the platform. We also dive into how he grew the SRE org globally and overcame the challenges that came with that growth. Throughout the conversation, he shares nuggets of wisdom, like how to stay calm under pressure and how to make people feel at ease, as he describes his leadership style, the people who have influenced him and what he considers a positive way to collaborate with people.

Lorin Hochstein - On how Netflix learns from incidents, software as socio-technical systems, writing persuasively and more - #13

With 5+ years of experience building resilient systems at Netflix scale, Lorin joins the show to chat about his favorite incident story, the path that led him to chaos engineering (and later away from it), and why he advocates for a dedicated analyst to talk to people after an incident. Throughout the conversation, Lorin shares his philosophy and tips on how to learn from incidents, what engineers can gain from writing better, and why some metrics may not be as useful as you think.

Spoons (Daniel Spoonhower) - On building Lightstep, being customer focused, developing systems at Google scale and much more - #12

Spoons is the Co-founder and Chief Architect of Lightstep. He joins the show to talk about building systems at Google scale and the various aspects that make Google different from other companies. We talk about Spoons's journey of leaving Google and deciding to join Lightstep as a co-founder. We dig into the challenges of Lightstep's early days and discuss the importance of speaking to customers to build the right product. We talk about what it's like to start a family while running a startup and how one can be intentional about building a company’s culture. As always, we go through some misadventures, one of which involves a cable being cut under the English Channel.

Ryan Underwood - On debugging the Linux kernel - #4

Ryan Underwood is a Staff SRE and tech lead on the Helix and ZooKeeper SRE team at LinkedIn. Prior to LinkedIn, he was an SRE at Machine Zone and Google. Apart from his regular responsibilities, Ryan’s interests and expertise include debugging production kernel, I/O and containerization issues. His view that software should not be treated as a black box and his persistent approach to debugging complex problems are truly inspiring.

David Henke - On building a culture of "Site Up" at LinkedIn and Yahoo! - #3

David is LinkedIn’s former SVP of Engineering and Operations. In this insightful conversation, he shares stories from the early days at LinkedIn and what it took to develop the culture of "Site Up and Secure". We also talk about David’s three retirements over the course of his career, his advice on developing operational excellence and his lessons on being an effective leader.