Machine Learning Infrastructure

Todd Underwood - On lessons from running ML systems at Google for a decade, what it takes to be an ML SRE, challenges with generalized ML platforms and much more - #10

Todd is a Sr Director of Engineering at Google, where he leads Site Reliability Engineering teams for Machine Learning. Having worked on SRE at Google for more than 12 years, Todd recently gave a talk on how ML breaks in production, drawing on more than a decade of outage reports and postmortems. In this conversation, we go into different aspects of what makes it difficult to do ML well in production - like why it’s not enough to just look at aggregated statistics for ML monitoring, all the caveats of building a generalized platform for model training, and figuring out who’s on the hook when ML models don’t perform as expected in production. We also chat about what Todd looks for when hiring ML SREs, his impressive skill at collecting LinkedIn skill endorsements, and much more.