Early Twitter's fail-whale wars | Dmitriy Ryaboy
A veteran of early Twitter's fail-whale wars, Dmitriy joins the show to chat about the time 70% of the Hadoop cluster got accidentally deleted, the financial reality of writing a book, and how to navigate acquisitions.
Ronak & Guang’s Picks
#1 What does your manager do when your team accidentally takes out the Hadoop cluster?
As he wrote the emails updating the company on the status, Dmitriy thought he was going to get fired. Although he wasn’t the one who deleted the cluster, he managed the team and was responsible for its actions.
But instead, his boss told him, “#1, this happens, don't worry about it. I've got you. And #2, eventually this was going to happen. It doesn't feel like it right now, but it's going to be such a relief that this happened now and won't happen anymore in this company than if it had happened three or four years from now.”
#2 “Pressure makes diamonds”
Having joined Twitter early in 2010 to help build the data platform, Dmitriy found himself on an engineering team that was constantly putting out fires caused by explosive user growth. But those firefighting efforts were also how the team built grit and camaraderie.
“Later, I heard people say that the joke was that I love hiring ex-Twitter people because, no matter how much everything is exploding, they just go, ‘Eh, I've seen worse,’ because things were really, really bad.”
“But also, sometimes the worst times are the best times.”
Segments:
(00:00:00) The infamous Hadoop outage
(00:02:36) War stories from Twitter's early days
(00:04:47) The fail whale era
(00:06:48) The Hadoop cluster shutdown
(00:12:20) “First restore the service then fix the problem. Not the other way around.”
(00:16:16) The importance of communication in incident management
(00:19:07) That time when the data center caught fire
(00:21:45) The "best email ever" at Twitter
(00:25:34) The importance of failing
(00:27:17) Distributed systems and error handling
(00:29:49) The missing README
(00:33:13) Agile and scrum
(00:38:44) The financial reality of writing a book
(00:43:23) Collaborative writing is like open-source coding
(00:44:41) Finding a publisher and the role of editors
(00:50:33) Defining the tone and voice of the book
(00:54:23) Acquisitions from an engineer's perspective
(00:56:00) Integrating acquired teams
(01:02:47) Technical due diligence
(01:04:31) The reality of system implementation
(01:06:11) Integration challenges and gotchas
Show Notes:
Dmitriy Ryaboy on Twitter: https://x.com/squarecog
The Missing README: https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838
Chris Riccomini on how to write a technical book: https://cnr.sh/essays/how-to-write-a-technical-book
Stay in touch:
👋 Make Ronak’s day by leaving us a review and let us know who we should talk to next! hello@softwaremisadventures.com
Music: Forest by Vlad Gluschenko
License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en