Todd is a Sr. Director of Engineering at Google, where he leads Site Reliability Engineering teams for Machine Learning. Having worked on SRE at Google for more than 12 years, Todd recently gave a talk on how ML breaks in production, drawing on more than a decade of outage reports and postmortems. In this conversation, we go into different aspects of what makes it difficult to do ML well in production: why it’s not enough to just look at aggregated statistics for ML monitoring, the caveats of building a generalized platform for model training, and how to figure out who’s on the hook when ML models don’t perform as expected in production. We also chat about what Todd looks for when hiring ML SREs, his impressive skill at collecting LinkedIn skill endorsements, and much more.
Evan is a Director of Engineering at Flatiron Health, where he leads software engineering teams focused on building Machine Learning products. Throughout this episode, Evan shares stories of recommendation systems that didn’t work as expected, like the time members saw the mathematically worst recommendations for meetups near them. He also explains why Schenectady, NY pops up on some lists of most popular cities and tells the story behind the Wall Street Journal article titled 'Orbitz steers Mac users to pricier hotels'. We also discuss the skills Evan looks for when hiring ML engineers, how to give constructive feedback, filter bubbles, and much more.