Having led 100+ ML projects at Insight and built ML systems at Stripe scale, Emmanuel joins the show to chat about how to build useful ML products and what happens once the model is in production. Throughout the conversation, Manu shares stories and advice on topics like the common mistakes people make when starting a new ML project, what’s similar and different about the lifecycle of ML systems compared to traditional software, and what it's like to write a technical book.
Todd is a Sr Director of Engineering at Google, where he leads Site Reliability Engineering teams for Machine Learning. Having worked on SRE at Google for more than 12 years, Todd recently gave a talk on how ML breaks in production, drawing on more than a decade of outage reports and postmortems. In this conversation, we go into different aspects of what makes it difficult to do ML well in production, such as why it’s not enough to just look at aggregated statistics for ML monitoring, all the caveats of building a generalized platform for model training, and figuring out who’s on the hook when ML models don’t perform as expected in production. We also chat about what Todd looks for when hiring ML SREs, his impressive skill at collecting LinkedIn skill endorsements, and much more.