Site Reliability Engineering

Todd Underwood - On lessons from running ML systems at Google for a decade, what it takes to be an ML SRE, challenges with generalized ML platforms and much more - #10

Todd is a Sr Director of Engineering at Google, where he leads Site Reliability Engineering teams for Machine Learning. Having worked on SRE at Google for more than 12 years, Todd recently gave a talk on how ML breaks in production, drawing on more than a decade of outage reports and postmortems. In this conversation, we go into what makes it difficult to do ML well in production: why it’s not enough to look only at aggregated statistics for ML monitoring, the caveats of building a generalized platform for model training, and who’s on the hook when ML models don’t perform as expected in production. We also chat about what Todd looks for when hiring ML SREs, his impressive skill at collecting LinkedIn skill endorsements, and much more.

David Henke - On building a culture of "Site Up" at LinkedIn and Yahoo! - #3

David is LinkedIn’s former SVP of Engineering and Operations. In this insightful conversation, he shares stories from the early days at LinkedIn and what it took to develop the culture of "Site Up and Secure". We also talk about David’s three retirements over his career, his advice on developing operational excellence, and lessons on being an effective leader.