Todd Underwood - On lessons from running ML systems at Google for a decade, what it takes to be an ML SRE, challenges with generalized ML platforms and much more - #10
Todd Underwood is a Sr Director of Engineering at Google where he leads Site Reliability Engineering teams for Machine Learning. Prior to that, he was in charge of operations, security, and peering for Renesys, a provider of Internet intelligence services; and before that he was CTO of Oso Grande, a New Mexico ISP. He has a background in systems engineering and networking.
Having recently presented on how ML breaks in production, drawing on more than a decade of outage postmortems at Google, Todd joins the show to chat about why many of the ways ML systems break in production have nothing to do with ML, what's different about engineering reliable systems for ML versus traditional software (and the many ways they are similar), what he looks for when hiring ML SREs, and more. Throughout the episode, Todd shares insight and advice on topics such as "do you need to be an ML expert to succeed as an ML SRE?", why it's so difficult to build generalized platforms for ML workflows, and who's on the hook when ML models don't perform well in production.
- Todd on Twitter
- Todd on LinkedIn
- Todd’s talk - How ML Breaks: A Decade of Outages for One Large ML Pipeline
Music Credits: Vlad Gluschenko — Forest
License: Creative Commons Attribution 3.0
- Emmanuel Ameisen - On production ML at Stripe scale, leading 100+ ML projects, iterating fast, and much more - #11
- Cory Watson - Leading observability teams at Twitter & Stripe, how to succeed in a new org, effective ways to advocate for your team and more - #16
- Bruno Connelly - Building and leading the global SRE org at LinkedIn - #14
- Lorin Hochstein - On how Netflix learns from incidents, software as socio-technical systems, writing persuasively and more - #13
- David Henke - On building a culture of "Site Up" at LinkedIn and Yahoo! - #3