Oliver Leaver-Smith - On how "just a monitoring change" took down the entire site and resilience engineering - #5 | Transcript

Oliver Leaver-Smith

You can see the show notes for this episode here.

This transcript was generated by an automated transcription service and hasn’t been fully proofread by a human. So expect some inaccuracies in the text.

Ronak: Hey Ols! Welcome to the show. We are super excited to have you here.

Ols: Thank you very much. It’s good to be here.

Ronak: So we were researching for this episode and I was reading about you on the internet. There was one bit which stood out; at least, I was fascinated by it and I want to know more about this. I found a bio which said that back in 2003 you were learning about setting up Red Hat, and you unintentionally upgraded your dad's Windows XP machine to Red Hat 9. I'm very curious about how that happened. Can you share that story with us?

Ols: Yeah, so I use the term upgraded; he used the term ruined. I think I upgraded it. Basically, he had this book, Sams Teach Yourself Red Hat, on his bookshelf, and it had a CD in it to run a live instance of Red Hat 9. So I put that in his computer, because I didn't have my own at the time, and I clicked around the live CD a bit and thought, this is quite interesting, I'm going to install it. I thought it was just like when you install a game or anything; there was literally an install button on the desktop. So I clicked that, went through the install steps, and then it said it's now safe to restart your computer. So I thought, okay, I don't know why it needs to restart, but fine. And when I went to restart, all I had was GRUB and the option for Red Hat 9. And I couldn't work out what I'd done, because I was at the stage where I was dangerous enough to know how to do things, but not why I was doing them and what the actual effects would be. This did actually result in getting my own computer to tinker on, though, so I see it as a positive, really.

Ronak: Oh yeah, there is a bright side to it. I'd imagine your dad was pissed off?

Ols: Pretty much, yeah. Oh, I got a terrible computer out of it. It was like a reject from work or something that nobody wanted.

Ronak: Well, at least you got your own computer to play with.

Ols: Exactly. Yeah.

Ronak: So can you tell us a little bit about your background? I know a lot of listeners would want to know how you started off. I saw your LinkedIn profile and it said that you started off as a network engineer, and now you're more in the DevOps space. So we would love to hear about that.

Ols: Yeah. So I started off on a help desk type thing at an ISP. The natural progression there for me was to go into networking as a discipline, so I went up through the ranks in the help desk and then started being a real network engineer.

And then I moved to another ISP and got more into the weeds of networking. Then I branched out from ISPs and started in the gambling sector, still as a network engineer, but I found the environment, the fast pace, the ridiculously short downtimes that you were permitted, all that sort of thing, really interesting. And I saw what these DevOps engineers and infrastructure engineers were doing, how they were not automating themselves out of a job, but doing more with the time they had by doing less actual toil and spending more time working out how they could automate that job.

So I did quite a bit to automate the boring stuff that we had to do as network engineers, like device config audits and all that sort of thing, and that really piqued my interest in the automation side of things. Then I saw a job advert for a DevOps engineer, and I'd always had Linux in my back pocket and all that sort of stuff.

So I thought, you know what, I'll make the jump. I know this DevOps thing from a networking perspective, I know a bit about Linux, so why not? And it's gone from there. But it is good to have a specialism that isn't necessarily just DevOps, because if you want to get that full view of the whole stack as a team, you really need people that have got that T-shaped engineer kind of specialism. So it's worked out all right for me.

Ronak: Yeah. I mean, I know a lot of DevOps engineers who come from many different backgrounds, including Austin and myself. I don't know if you would call ours unconventional, if everyone is coming from unconventional backgrounds. But yeah, having expertise in one domain certainly helps. So now that you're a DevOps engineer at Sky Betting and Gaming, can you tell us a little bit about your team, your role, what you do day-to-day, and what your team structure looks like?

Ols: Yes. So Sky Betting and Gaming itself uses the tribal model that Spotify invented and made famous. The tribe I am in is called Core. The way the tribes work, it's kind of like they're all individual companies that take resources off each other, as if they are individual businesses. So in the Core tribe where I am, what we focus on is a lot of the key account functionality. We don't really deal with the betting or the casino side of things; we're primarily user registration, identity verification and payments, like taking payments from customers and sending withdrawals out. So we're sort of the beating heart, if you like, that a lot of other tribes within the company utilize. The team I'm in is a specific platform team. There are feature squads that have expertise in different domains and different applications, but the platform squad that I'm in kind of sits on duty for all of that and supports the development and the rollout of new features and new products.

So we do it in a few different, rather interesting ways. Sometimes we'll get parachuted into a team to be some SWAT-style platform resource that just needs to spin up a database cluster or something quickly, to allow the team to start developing something. Other times we'll be pulled into what we call a pop-up squad, which is like a single-use squad made up of people from different domains that can all come together and do good things. The most recent example of this is some GDPR work we had to do, which, for those that don't know, is the EU data privacy regulation stuff. That needed some developers to make changes on their systems, and it required platform there to ensure that backups were being captured the right number of times, on the right basis, and all that sort of thing. So it's playing quite hard and loose with the definition of a feature squad, but like I say, it works for us, and it's good to get exposure to different parts of the business that you wouldn't necessarily get if you were just being a platform engineer working on just platforms.

Ronak: That makes sense, and it's actually very interesting, the tribe structure that you mentioned. I want to dig into that a little bit, if you don't mind. So you mentioned you're part of the Core tribe. Does a tribe have multiple teams within it? And the other teams you work with, are they part of different tribes, or would they be part of the same tribe?

Ols: So, the teams that are in the Core tribe. I'll try not to leave any out in case anyone from work is listening, because that would be terrible.

So obviously there's us, the most important one, which is Platform. Then there's a squad that is focused on account as a service. That includes the actual account bit you see when you log in, so changing your details and your credentials and everything, and also things like any exclusions you want to put on your account if you feel that you're spending too much money on site; all the tools that we have there to help you manage that as a customer are part of that team. There's also the payments squad, which solely looks after taking all the money and giving it back. And then another squad is the onboarding squad, which handles getting customers through the door in a responsible way, and also ensures that we can verify they are who they say they are, whether that be using third-party identity providers or manually verifying documentation that the customer provides. I think that's it. There are also a lot of principal engineers that kind of float around different squads depending on where the resource is needed, but those are the main squads. And that pattern of a tribe made up of multiple squads, each with a specific domain to look after, is what is replicated across the business in different tribes.

Ronak: I see, makes sense. It's a fascinating concept. And how many people are in the Core tribe in general? How many engineers?

Ols: In terms of engineers, I would guess probably around 60 to 80, including all disciplines, like test and software dev and platform.

Ronak: I see, pretty good. And for some of our listeners who might not be fully aware, can you tell us a little bit about what Sky Betting and Gaming does as a company?

Ols: Yes. So we are, I think, the biggest online bookmaker in the UK. We do traditional sportsbook betting, so betting on football (soccer) and horse racing and things like that. Then we also have online gaming platforms, so the traditional sort of slot machines online, live casino with croupiers spinning roulette wheels, and things like that. And we also have a lot of products that are free to play. We have things like a prize machine where it's free to spin and you can win money or free spins elsewhere, and things where you can put a free guess on the outcome of a few different football matches, and if that matches the actual results, then you win money. We're lucky in that we're closely affiliated with Sky, the company, which is quite a good brand, and it's very much the brand you think of when you think about sports, at least in the UK, because they've been sort of the home of Premier League football for a long time.

Ronak: Makes sense. So considering there are a lot of payments involved and people are betting, I would imagine performance and reliability would be of paramount importance for all the systems that you're working with, and the requirements would be extremely tight.

Ols: Yes. We're unfortunate, I guess you could say, in that everyone in the business relies on us and our availability. If one of the other tribes, say the Bet tribe, has a problem with their part of the website, the Gaming tribe can continue to build their products. Whereas if our services go down, then every single consumer of our services is having the same problem. So, rightly so, we are held to a very high standard in terms of our system performance.

Ronak: Nice. So, for our topic today: you published a blog post recently on your website about how a seemingly benign monitoring change turned into an outage, resulting in your systems trying too hard. We're going to dig more into that. Austin here is on the monitoring infrastructure team at LinkedIn, so I'm going to let him drive this part, because he is extremely excited to talk to you about this.

Austin: Yeah, so I'm on the monitoring infrastructure team. We provide a monitoring platform for pretty much the whole variety of applications at LinkedIn, and we expect it to run smoothly all the time and not affect the applications in most circumstances. So this is really interesting for me. Can you give us a little bit of background on the systems that you were monitoring for this particular incident?

Ols: Yeah. So I'm fine talking about this, because I was the one that did it. I don't mind throwing the engineer that did it under the bus at all, because the engineer was me.

And it's healthy to talk about your failures, right? So it's good to talk about it. The specific application that we were wanting to monitor in this instance is part of the sort of voodoo backend, the very legacy backend that talks directly to the database systems, rather than anything further up the stack. And we're in the situation where this particular application that talks directly to the database is one that is provided to us by a third party, and it's closed source. We have a route into them for bug fixes and feature releases and that sort of thing, but that's on a consultancy basis.

So something that we requested from them was a metrics endpoint that we could scrape, to tell us how many unfulfilled payments were in the queue of payments waiting to be fulfilled. The way payment fulfillment works, not just at Sky Betting and Gaming but anywhere, is that there'll be an initial sort of hold on the bank account that says, is this money available? The bank says, yes, that's fine. And then at a later date the actual fulfillment, taking the money from the bank, will happen. So this queue is the payments that are in that state between "yes, the money's there" and actually having taken the money. From this we can see, if the queue grows and doesn't seem to be coming down, that maybe there's a problem with actually taking the money that customers have asked us to take from their account.

And it's easily rectified: we just need to talk to whoever owns that service, get them to maybe restart it, and everything's happy again. So that's what we wanted to monitor. We asked the third party that manages the application for us to provide that metrics endpoint, and they did, and it worked.

Yep, there's a metrics endpoint, there are some metrics on it, cool, we'll come back to that in a bit when we've got more time to actually implement a proper monitoring check around it. And then it kind of stopped there for a good few months; the people that were working on it initially moved on to different things, different projects. Then that's when I came in and started actually looking at it again, and that's when the fun started.

Austin: Yeah. It's interesting that you mention the third-party application, a third party providing the metrics endpoint for this particular use case.

You mentioned that there's this legacy backend database; was your team unable to access the database directly, and was this just something that the third party had sole access to? What I'm trying to get at is, I'm really interested in the trade-off of asking the third party to provide a solution to you, versus whether this was something you could also have built yourself, but it's one of those things where it's just not worth your time.

They're the subject matter experts on this, so let's let them handle it?

Ols: Yeah. So we have access to the databases if we need them, as Sky Betting and Gaming, but not necessarily my team, because we wouldn't have a reason to get into that database, given the information that's contained within it.

So, separation of privileges and all that sort of thing. We do, as a company, have access to those databases, but my team specifically doesn't. We had built something that does this kind of monitoring in the past, based on the information that we had to hand, which was basically tailing the log files and checking for any errors there. That gave us an indication that there were failures to fulfill the payments, but what it didn't tell us is whether payment fulfillment was just not running for a particular reason. That slight nuance is why we needed to actually get into the application to get that further detail. And like I say, it's not an open source application, so the only route we had was via the third party.

Austin: Got it. All right, that's really interesting. I kind of want to take a step back a little bit. Ronak and I are very familiar with this; we've worked with Prometheus as a monitoring solution. You mentioned this query exporter from the third party. Can you briefly explain, for the audience that may not be familiar with this, what a Prometheus query exporter is and what people may not be aware of?

Ols: So I'm also new to Prometheus and the world of query exporters, which is where a lot of the failure came from in this.

My understanding, at least, is that a query exporter is something that's built into the application which will provide metrics on how an application is behaving or not behaving, so that your Prometheus server, which is a time series database, will be able to scrape that endpoint, pull in those metrics, and observe what the application is doing. My understanding, and this is where it all fell down really, my understanding was that a Prometheus query exporter will just present a static metrics page that is updated by the application. However, what I've since learned, after looking into this, is that best practice from Prometheus actually dictates that when you hit that metrics endpoint, it then does the work to generate the metrics. And that's the bit that I wasn't aware of, and it's what made this so entertaining.
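For listeners who want to see that distinction concretely, here is a minimal sketch of a scrape-time collector written with the Python prometheus_client library. This is an illustration, not the vendor's actual exporter; the metric name, port, and simulated 16-second query are hypothetical stand-ins. The key point Ols describes is visible in the code: collect(), and therefore the expensive work behind it, runs on every scrape of the metrics endpoint rather than reading a pre-computed value.

```python
# A minimal sketch (not the vendor's actual exporter) showing why a Prometheus
# custom collector does its work at scrape time: collect() runs on every
# request to the /metrics endpoint. Names, port and query are hypothetical.
import time

from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY


class UnfulfilledPaymentsCollector:
    def collect(self):
        # This runs on EVERY scrape, so if the query is heavy and the scrape
        # interval is short, load on the database scales with scrape frequency.
        count = self.count_unfulfilled_payments()
        yield GaugeMetricFamily(
            "payments_unfulfilled_total",
            "Unfulfilled payments awaiting fulfilment (hypothetical metric)",
            value=count,
        )

    def count_unfulfilled_payments(self):
        # Placeholder for the expensive SQL, e.g. counting a week's worth of
        # unfulfilled payment rows; the sleep simulates the ~16s Ols mentions.
        time.sleep(16)
        return 12345


if __name__ == "__main__":
    REGISTRY.register(UnfulfilledPaymentsCollector())
    start_http_server(9400)  # expose the metrics endpoint on an arbitrary port
    while True:
        time.sleep(60)
```

Under the "static page" mental model the exporter would refresh a cached number on its own schedule; under the actual collector model above, every scrape triggers the query, which is exactly the behaviour that caused trouble.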

Austin: Yeah, that's super interesting, because I intuitively would have thought exactly what you were describing, that the process itself is responsible for updating it, and that kind of has a nice separation of concerns.

It seems like an interesting trade-off that Prometheus made around how fresh the data is. So that's super interesting. And when the third party provided this query exporter to you, I'm curious, was this going to be something that just ran on one machine, or was it something that you would have to roll out to multiple VMs?

And how did that rollout process work, given that it's a third party?

Ols: So the query exporter application itself was going to run on all the machines that were responsible for doing that fulfillment process. We have multiple machines that do it, on a round-robin queue basis, and this metrics endpoint was going to run on all of them. But because it was looking at what was left in the database, the number of items that were left in the database, it didn't matter which of the servers was running the fulfillment process at that moment in time, because any of them could be hit on the metrics endpoint and still return the same data.

Austin: Got it. So you mentioned you had rolled this out and it sat there for several months, and I recall reading in the blog that there were some firewall things you were trying to work through, which probably added to the delay. So fast-forwarding to maybe the exciting part, once that firewall finally said, cool,

you're good to go: can you talk a little bit about the events that unfolded after that?

Ols: Yeah, sure. So I got everything running as far as Prometheus was concerned. It was attempting to scrape the endpoint with our default settings, which were a timeout of 10 seconds and a scrape every 30 seconds. When you look at the list of targets Prometheus is scraping, on the web UI, you'll see whether each target is healthy or not, and the query exporter we were looking at said connection reset, or something along those lines. So we think, oh, firewall, right.

Put the firewall request in, I'm going home, see you tomorrow sort of thing. The way our firewall requests work is that they're largely automated, in terms of working out which firewall the rule needs to go on, which interfaces, which groups of IP addresses, et cetera.

And the actual implementation is automated as well. So this went through the automated process and the firewall rules were put in place. At that point Prometheus says, right, let me at it, and it starts polling the metrics endpoint. Now, here is the interesting bit: as I mentioned earlier, it's not a static metrics

page that is populated by the application; it's something that runs every time the metrics endpoint is hit. And the request that was being made is quite a big one, because we are looking at the total number of unfulfilled payments in the past week, which is a big number. It's bringing back millions of records

every time this request is made to the database. So that starts to slow down the database a little, because it's doing quite a lot of work, and it's taking probably 16 seconds to return the data. We're timing out after 10 seconds and we don't really care; and the query exporter doesn't care that Prometheus has timed out, because it's already run the query and it's waiting for the response regardless of what Prometheus thinks.

So it starts to take a little longer: 16, 20 seconds. It starts to creep up, and then we're at the stage where it's taking longer to run the query than the interval of the query itself. So we've got multiple queries, ten copies of this query, queued up, and all of a sudden the database that contains these payment records, which also contains things like user credentials, is not able to be read anymore because it's just too busy. That results in logins failing, for a start, and issues with people being able to place bets when they are already logged in.

This is a total outage, essentially, because this query is just running the database into the ground.
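As a rough illustration of the failure mode Ols is describing, the sketch below models what happens when a scrape-triggered query takes longer than the scrape interval and slows down further as more copies run concurrently. The 30-second interval and 16-second baseline come from the conversation; the "creep" and contention numbers are invented for illustration. This is back-of-the-envelope arithmetic, not a model of their actual database.

```python
# Crude, illustrative model of the pile-up: Prometheus keeps scraping on its
# 30s schedule, each scrape starts a fresh query, and every query already in
# flight makes the next one slower. Constants are invented for illustration.
def simulate_pileup(scrape_interval=30, scrapes=16, base=16, creep=3, contention=30):
    finish_times = []  # absolute times (seconds) at which in-flight queries will complete
    for n in range(scrapes):
        now = n * scrape_interval
        finish_times = [t for t in finish_times if t > now]      # queries still running
        duration = base + creep * n + contention * len(finish_times)
        finish_times.append(now + duration)                      # the newly started query
        print(f"t={now:4d}s  new query ~{duration:3d}s  queries in flight: {len(finish_times)}")


if __name__ == "__main__":
    simulate_pileup()
```

Once a single query's duration crosses the scrape interval, the in-flight count ratchets upward on every scrape, which in turn slows each new query further: the feedback loop that took the database down.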

Austin: Interesting. Yeah. The blog mentions the breakdown of communication in understanding what the query exporter application was doing. But beyond that, not everyone's familiar with query exporters; people are just learning and figuring this stuff out. From the third-party team, were they able to provide any sort of documentation about

this thing that they had just shipped to you, or was this also maybe something new to them too?

Ols: So there wasn’t any documentation that I saw. It was just like a handover from one team member to another. But when they found out what we were doing with that query, they were very shocked that that’s how we decided to do things.

Austin: Oh, interesting.

Ols: We were not following their best practices for how to get that data. They said, yeah, that's a pretty heavy query to be running every 30 seconds; you should be doing that every 20 minutes or so. If you're trending how many payments have failed to be fulfilled over the past week, that's not really data you need refreshed every 30 seconds. You can have a half-hour to an hour delay on that data.

Austin: Got it. So moving forward, now that you were able to root-cause it, like, okay, this query pattern is generally going to be expensive

and we can't afford to keep thrashing the database like this, where did you end up in trying to balance this whole aspect of "is my data the freshest it can be right now" versus "can it wait"?

Ols: Yeah. So we made a couple of changes to the actual application itself, the query exporter application, in that it won't run

if there are already two processes of it running, which would have been a nice thing to have from the beginning, but you live and learn, and it's certainly something we'll put into things in future. And then we went back to the team that specifically looks after the payments side of things, and we had a conversation with them about how fresh they need this data to be. We did some calculations with the database team, and we worked out that on a really busy week this query could take upwards of two minutes to return all the data.

So that now runs every half an hour, and that was the trade-off, like you say, between the freshness of the data and the stability of the database. Really, it could now run every minute, because we've got this safeguard in place that it's not going to run

if there are already one or two running. We could make it more frequent, but it's not on anyone's roadmap to make it more frequent, just in case. I don't think anyone's going to be arguing for that; I think everyone remembers and they're like, yeah, let's step away from that a little bit.
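Here is a sketch of the kind of "refuse to start if two are already running" guard Ols mentions, using Unix advisory file locks. It is an assumption about how such a safeguard could be built, not the vendor's actual change; the lock file paths are hypothetical.

```python
# A sketch (assumption, not the vendor's actual fix) of a guard that allows at
# most two concurrent instances: try to grab one of two advisory file locks,
# and if both are held by other processes, exit instead of piling a third
# heavy query onto the database. Requires a Unix-like OS (fcntl).
import fcntl
import sys

LOCK_PATHS = ["/tmp/query_exporter.0.lock", "/tmp/query_exporter.1.lock"]  # hypothetical paths


def acquire_slot():
    """Return an open, locked file handle for a free slot, or None if both slots are taken."""
    for path in LOCK_PATHS:
        handle = open(path, "w")
        try:
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking exclusive lock
            return handle  # keep the handle open for the lifetime of the query
        except BlockingIOError:
            handle.close()  # slot busy, try the next one
    return None


if __name__ == "__main__":
    slot = acquire_slot()
    if slot is None:
        print("two instances already running, refusing to start", file=sys.stderr)
        sys.exit(0)
    # ... run the expensive unfulfilled-payments query here ...
```

Because the lock is released automatically when the process exits, a crashed run can't leave the guard stuck, which is one reason advisory locks are a common choice for this pattern.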

Austin: Yeah. So after all was said and done, it sounded like there were definitely going to be a lot of eyes on this. What were some of the big learnings that your team, or maybe even other teams, got out of this incident?

Ols: So our team took a lot of learnings from it, sort of procedurally, about handing off work to other people. If you pick up a piece of work that has been dormant for a while, you really need to put the effort in to understand exactly what the state of things is. And if you don't feel that you're knowledgeable enough to pick up that specific bit of work, then the onus is on you to seek out that extra information, either from the person who worked on it previously or from the internet, because when I eventually Googled Prometheus query exporters, it said, oh yeah, the best practice is to run the command every single time the endpoint is hit.

If I'd done that at the start, then we wouldn't have been in that situation. The other big learning, which had a lot of focus from higher-ups in the company, was the fact that it wasn't me, as the engineer owning that system, that put the check live. It was the automated firewall rule that ran at some point in the evening that put it live.

When I noticed that the check was failing because it couldn't talk to the endpoint, at that point I should have removed the check, or disabled it, sorted the firewall access out and then re-enabled it. But that's where the whole "it's just a monitoring change" misnomer comes in; it's like, how much harm can it do, really?

Just let that sit there and wait for the firewall rule to go through. And then there are some little things about the application itself that we had to think about. I mentioned the fact that we now have a safeguard to only allow two instances of it to be running, and we also have a real-time backup of the data in that database.

There's no reason why we shouldn't be querying that backup instead of the live database; query the replica instead, right? So it's things that should be best practice but maybe weren't thought about at the time. But yeah, it's been a really interesting learning experience, for sure.

Austin: Awesome. And stepping back, you mentioned that a lot of it was more about the process side of things. Were there any larger organizational practices put in place for the future around third-party applications? I think we've probably also been bitten by this; I'm not personally aware of a specific case at LinkedIn, but we also use third-party software where we have a license with a vendor and we're kind of subject to whatever client they've provided to us. And a lot of the time it works, and I think that's really the part where it's tough: 95, 99 percent of the time, the software vendors give us just works out of the box.

So then it's like, oh well, what's wrong with just one more, right? So I'm just curious on that side.

Ols: Yeah. I think it's difficult to say in this instance, because it wasn't really a failure of the third party; it was more a misunderstanding between what they thought we were going to use it for and how we were actually using it. They thought we were going to use it in a different way; we thought it did something completely different. I don't know that there have been any specific organization-wide policies put in place to do with that, but I know that our team specifically now goes over things with a much finer-toothed comb when we're picking things up from third parties.

Austin: Fair enough.

Ronak: Makes sense. I want to take a step back. You mentioned that this database was also processing a lot of other tasks, and that during this full outage people weren't able to log in. So in terms of categorizing the issue, if we had three categories, say major, medium and minor, this would count as major, I assume.

Ols: Yeah, this is the top priority. Everyone gets paged, even if you don't know what the thing is about, because it might affect your system.

Ronak: Interesting. So when this happened, you mentioned in the blog as well that banners went up on your website saying, hey, we know our systems are affected and we're working to fix it.

What does that incident management process look like? What happened after that?

Ols: So after we started seeing the problem, you mean?

Ronak: Yeah. Once you see the problem, you know there is an issue and people aren't able to log in, how do you go about fixing the system?

Ols: So we're pretty slick at incident management throughout the company, not just within Core. The process here was: the banners go up, we say, okay, lots of different services are all having problems talking to the database, let's get the database people to look at this. They instantly see, and I'm talking within minutes, this is the query, this is running loads of times, I don't know what this is, I've not seen this before, this is something brand new. At which point someone in the payments team in Core says, that looks like a query for all unfulfilled payments in the last week. And then you've got enough people there to inject the context of, well, I know that that query has just gone live on these servers.

Let's stop these servers from doing anything, let's firewall them off, and get the database back into a healthy state. Like I say, it's probably 10, 15 minutes before we're in a situation where we can say, okay, we've identified the cause of this problem, we've mitigated it by putting banners up,

we've actually fixed the problem by getting rid of the query being made from these servers, we've tested from behind the banners to check that everything is working as expected, and we can now go and remove the banners and let people back onto the site. There's a lot of really quick-moving work in our incident management, purely because we want to get people back

on site as soon as possible, because it's very costly if people are not able to get on site, especially around certain sporting events. An outage during the afternoon is bearable; an outage in the evening when there's a big sporting event on is terrible.

And yeah, there's a lot of pressure to get things back up as soon as possible.

Ronak: Yeah, and it certainly establishes trust with the users of the system as well. What you described is a really quick recovery: as soon as things started going south, your team was paged, multiple teams were paged, came together, and recovered the system really quickly. So, talking about incident response, I know you have mentioned, in some of the other blog posts on your website, that you do something which both Austin and I, and many other folks in this domain, are also interested in. Some people like to call it chaos engineering, or more recently resilience engineering. You refer to it as fire drills: you simulate failures in your system, not in production of course, but in a controlled environment, so that everyone on the on-call rotation gets used to how the system works and can resolve issues, and so that you can recover the systems fast when they actually go down. Can you tell us a little bit about how this process of fire drills started and how it has evolved over the last few years?

Ols: Yeah. So fire drills for us are a way to run chaos engineering experiments on our systems, the computer systems, to see how they respond when we pull the rug out from underneath them, like disk or network. But we also use them as a really effective tool for chaos engineering experiments on our people systems, like the on-call team, which is very, very important, because I think it was Dave Rensin from Google who said that employees are basically buggy microservices, which is so true.

They need as much, if not more, attention than your computer systems.

Ronak: Oh yeah, for sure. Having sound processes in place is equally as important as having sound systems.

Ols: Yeah, exactly. So we started doing fire drills just within Core a few years ago now. Every Thursday morning we would break something, and the people that were actually on call would get paged. Up until recently we did have that same pattern every Thursday morning, primarily with the platform squad breaking something, but we noticed it was getting a bit stale. It was nearly always Platform breaking something, so the scenarios were getting a bit samey, a bit "oh, the disk is broken again", purely because we didn't have the in-depth knowledge that the engineers building the systems themselves have of their systems.

So we made a pledge that we were going to rotate around all the different squads on a weekly basis, and each of them would run a scenario on their own systems. That's been in place for maybe six, seven months now, maybe longer, and it's been really effective, because not only are

all the scenarios more realistic and more engaging, but the owners of the systems that are breaking them are doing it in a way where they can try to understand what happens when their systems break. By trying to catch their colleagues out with an interesting problem, they're inadvertently running resilience engineering experiments on their own systems.

So yeah, this change has been really, really successful.

Ronak: Makes sense. Having the teams who understand the system more deeply create these scenarios helps, because I would imagine that for the platform group itself, after a while it's hard to come up with new ideas for breaking systems, and having the subject matter experts do that for you results in more engaging outcomes.

Can you describe one of the last fire drills that either your team or one of the other teams simulated, if that's okay to share on this platform?

Ols: Yeah. So I did one yesterday.

Ronak: Oh, nice.

Ols: It's fresh in my mind. And this one was good because it was a cross-tribe fire drill, so it involved us as Core and also the Bet tribe.

What we did was make a change to one of the Core systems, removing some API keys, which meant that putting a selection onto the bet slip to actually place a bet would fail and give a "bet placement unavailable" error. The way this ran was, I made the change and then I was slowly restarting Kubernetes pods rather than doing it in one big bang, so it was sort of a slow degradation of service. Then the engineer was paged, saw the errors, thought this looks like something to do with Core, let's call Core out, and everyone's happy; everyone enjoys a good investigative scenario, don't they?

Ronak: Oh, yes.

Ols: What we've spent time doing is focusing a lot on the realism and the immersion of the fire drills. So we've got this Slack bot where you, as the exercise coordinator, can type in what you want to say, and who you want to say it as.

Ronak: Oh, nice. That's interesting.

Ols: Yeah. So you can post as, say, the tech desk, and it says we're seeing a lot of calls coming through from the contact centre to say that customers are unable to place bets. It's just another one of those things that helps keep people in the moment and treat it like it's

real, because it's all too easy to think, I haven't got time for this, I've got more important work to be doing, I'll leave other people to deal with that problem. Whereas if it's actually engaging and entertaining, then it's a lot more interesting, and a lot easier to get people involved.
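For the curious, here is a hedged sketch of the sort of Slack bot Ols describes, where the exercise coordinator chooses both the message and the persona it appears to come from. It assumes a Slack bot token with the chat:write and chat:write.customize scopes (needed for the username and icon override on chat.postMessage); the channel name, personas, and environment variable are hypothetical, and this is not their actual bot.

```python
# Sketch of a fire-drill "persona" bot: post a message into the drill channel
# as a chosen persona, using Slack's chat.postMessage with a username/icon
# override (requires the chat:write.customize scope). Names are hypothetical.
import os

import requests

SLACK_TOKEN = os.environ["SLACK_BOT_TOKEN"]  # hypothetical env var holding a bot token

PERSONAS = {
    "tech desk": {"username": "Tech Desk", "icon_emoji": ":telephone_receiver:"},
    "contact centre": {"username": "Contact Centre", "icon_emoji": ":headphones:"},
}


def post_as(persona: str, text: str, channel: str = "#fire-drill") -> None:
    """Post `text` into the drill channel, displayed as the chosen persona."""
    payload = {"channel": channel, "text": text, **PERSONAS[persona]}
    resp = requests.post(
        "https://slack.com/api/chat.postMessage",
        headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()
    if not resp.json().get("ok"):
        raise RuntimeError(f"Slack API error: {resp.json().get('error')}")


# Example: inject a realistic signal into the drill
# post_as("tech desk", "We're seeing a lot of calls from customers unable to place bets.")
```

The point of the design is immersion: injected signals arrive looking like they came from the tech desk or the contact centre rather than from the coordinator, which keeps responders treating the drill like a real incident.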

Ronak: Oh yeah, sure. I think it's more of a cultural change, or more of a culture that people buy into. So, first of all, how long do some of these fire drills go on for?

Ols: So we book out the morning, but it doesn't take that full time. We allocate one hour, purely because we want to put a window on it, so that if somebody needs to do something in the environment in which we're running the drill, we're not blocking them from doing what they need to do.

Because while we don't use customer-facing production, we do use our production disaster recovery environments, so that we have a truly representative environment to do the testing in, in terms of application scale and everything like that. So we timebox that to an hour, and then

what we were doing previously was having a retrospective as if it was a post-incident review of a real incident, and then raising any actions and sending them off to the relevant squad to deal with. What we do now is have a specific hour after the end of the fire drill where we have the retrospective straight away, while everything is fresh.

And then, if it's small bits like documentation changes, we just do them then and there instead of necessarily passing them off to someone else. It's been really good, and it's helped clear a lot of low-hanging fruit that would otherwise go and sit on someone's backlog for X number of years before it actually becomes important enough to do.

Ronak: Oh yeah, doing the retrospective right away sounds like a good idea, because the incident is so fresh in your mind and you know exactly the improvements to make. Can you tell us a little bit about what the anatomy of a fire drill looks like before you actually start? You mentioned you do it every week.

So I'm assuming you or other team members would be thinking of certain scenarios beforehand; you don't decide what to break on the day itself. And the scenario that you create would also be something, and this is again an assumption, that you might share with your team members for learning at a later point.

So what does that look like? How do you structure these in docs? When do you prepare for these things? Do you have a list of scenarios that you want to cycle through?

Ols: So for Platform specifically, now that we don't own every fire drill, we no longer have visibility of what the other squads are planning, unfortunately, or fortunately, because it makes it more realistic. But there are two main sources we pull our scenarios from. One is past incidents. Because we're using the fire drills not just to experiment on the computer systems but on the people systems as well, we can say, that process kind of broke down the last time we had this incident, so let's run it again and see how people respond this time.

And the other source is just people's brains: figuring out what's the worst that could happen, or what would happen if X. We as Platform have a list of potential scenarios to run, along the lines of: if you want to simulate this happening, run this command on this server; here's what you should see; here's where you'll see the evidence that it's having the desired effect; here's how you back it out quickly; and here's how people would probably go about fixing it.

Ronak: I see, nice, makes sense. So you mentioned that, now that the other tribes are also doing this, you don't always have visibility into what will be happening, which in a way is good:

it's more realistic. So say, for instance, one of your on-call team members gets paged. How do they differentiate between a real page and a page from a fire drill?

Ols: I'm afraid we're a bit of a cop-out there. When we raise the pages, we prefix them with "fire drill".

Ronak: Okay. That makes sense.

Ols: I know. In an ideal world we'd not only not be doing that, but we'd be doing it in production as well, in customer-facing environments.

Ronak: Oh, that's risky, and it's hard to get right. It's very hard to get right.

Ols: But we can all dream.

Ronak: Oh, yes. Curious: you mentioned you don't necessarily do this on production systems, which makes sense. Have any of the fire drills gone sideways, where someone tried to simulate a failure but it got worse than what they were planning?

Ols: I can't think of any that have gone worse. I can think of lots where they've gone not at all how we expected.

Ronak: Okay, I would love to hear a scenario, if you can share one.

Ols: So we had one where we thought, right, what we're going to do is take this database down, and this is going to break everything for everyone. Non-production, of course. We ran what we thought would happen, and the systems just seemed to handle it and not be bothered at all. So we're there waiting to page all these people and say, top priority, priority one incident, everybody all hands on deck, and nothing's broken at all.

Ronak: How rarely does that happen?

Ols: It’s very rare. I wish it happened more often.

Ronak: Yeah, nice. So you also touched on this a little bit: you've been doing it every week, which is a pretty good frequency in my opinion, and there is a trade-off between spending time on a fire drill versus, like you mentioned, doing other things like project work, because everyone's planning for new features and new things they want to get out. As an organization, how do you balance that trade-off and justify the cost of doing fire drills every week, relative to the amount of time you invest in the project work that needs to happen?

Ols: This is something I feel very strongly about, and it's a horn I blow a lot to get people to listen.

It is something that the company accepts, thankfully, but I can imagine in other organizations that may not be the case, and you may need to do a lot of bargaining. The way I see it, and the way I put it to people, is that if you have a team that is focusing solely on features and new shiny things in your application, that's fine.

But there comes a point where it doesn't matter how many new features you add: if you suddenly have an outage and every system crashes because there's been no thought put into the resiliency of that system, it doesn't matter how fancy your application is if no one can get to it, because you've not thought about how it handles failure.

People have no loyalty, right? As soon as that happens, they're going to go to the competitor whose website may not be using the latest and greatest JavaScript framework, but it works. As long as it works, I can place a bet.

Ronak: Oh, that is really well put. So do you have any advice or thoughts for organizations who are thinking about chaos engineering or resilience engineering and are just getting started? This is not something they've done before, but they are thinking about starting.

Ols: Yeah. The first thing I think you need to have in place, before you can even start thinking about breaking your system, is having the observability nailed.

If you're going to expend the effort to have your engineers breaking the systems, but they haven't got the ability to deep-dive into exactly what the application is doing when it's being broken, then it's wasted effort. The first thing you need to do, before you even think about breaking stuff, is ensure that you have total knowledge of what's going on in your platform.

It doesn't necessarily have to be distributed-tracing level, down that deep, but you do have to be able to see when your systems and services are misbehaving. Then, in terms of actually getting started, there is a temptation to go with the easy, obvious things to break, like the network going away. Sure, that's going to happen, but it's not very exciting; you're not going to get your engagement up. The best thing, and we learned this too late, this is why our fire drills went stale, the easiest and best way to get buy-in from people in the business is to involve people in the business and get them thinking about how their own systems can break. Instead of the platform team coming in and saying, we're going to break your system and tell you what's wrong with it and how you need to fix it, it's about saying, right, as a team, as a collective, let's look at your system and see, how could it break?

Have you thought about this? You don't know what happens if this goes away? Well, let's take this downstream dependency away and see how your application behaves.

Austin: Yeah, these have been great discussions. All the talk about the fire drills, I think this would be a wonderful onboarding tool even for new engineers.

This is something that happens in many organizations, many companies: new engineers come in and they don't know the lay of the land. But with these fire drills, I think it's a very real way to immerse them into the environment, so that they can quickly figure out, oh, my application talks to these other applications, and those sorts of things.

Without that, unfortunately, it's learned on call, which I think is what happens at a lot of companies, and it's fair for the on-call engineers to go in and be like, I'm terrified, it's going to take some time. But with these, I think it's probably less stressful for them, and it's a wonderful experience for new engineers to come in and be like, I can do this in a safe environment.

And when I do go on call for real, it's not as scary, which is a great feeling.

Ols: It's throwing people in at the deep end, but you've given them like a rubber ring, they've got flotation devices all over them; they're not going to sink. They might feel scared for the first 10 seconds or so, but actually they're going to realize that it's safe.

And by the time they get rid of the flotation devices and they're actually on call, it's like, the deep end? That's fine. As part of going onto our on-call rotation, you have to have gone through a number of fire drill experiences before you can actually go on call.

Austin: That's perfect.

Cool. So this is a question that we ask all of the folks that come on to our podcast. You have a huge breadth, given that you've put together these fire drills and you've probably worked with a lot of tools at this point, in the DevOps space and other places. So what was the last tool that you discovered and really enjoyed using?

Ols: It might seem kind of a cop-out, because it's not what you might think, but there is no wrong answer here. I recently went back from bash to zsh, and I found this theme called Powerlevel10k.

What it is: if you have loads of plugins in zsh, it kind of slows your prompt down, and you press enter and you just get these gaps in your terminal. This, I don't know how it does it, it's magic, it sort of lazy-loads your plugins but gives you a prompt straight away. And then it fills your prompt with all these super low latency utilities. So you get Git or Subversion or whatever version control in your prompt, and it gives you a clock that actually counts up the seconds in your prompt, instead of showing the time you last pressed enter, which I think is amazing. I think every prompt should come with that.

So, yeah, I don't know if a zsh theme is going to be the most exciting tool that you're going to get on this segment ever, but it amazed me purely because of how it manages to take something that would take literal seconds to load up your prompt and make it maybe 10 milliseconds before you have a prompt.

I just found it amazing.

Austin: Yeah, no, that's huge. For anyone who's working in this space, probably one of the most frustrating things is you're trying to run something and you have to wait a few seconds. Even three seconds is enough to make any of us go a little bit crazy, so that's really neat.

Ronak: Can you share that theme name again?

Ols: It's Powerlevel10k.

Ronak: Okay, nice. Yeah.

Austin: And so where can people find you on the internet and learn more about what you're up to these days?

Ols: I tweet occasionally as heyitsols, all one word. I sometimes mess about on the Fediverse, but I'm getting a bit bored of that.

So maybe not. My website is ols.wtf, which I sometimes write blog posts on and sometimes don't, but if I'm going to be active anywhere, it's on there, basically.

Austin: Awesome. And is there anything else that you would like to share with our listeners today?

Ols: No. Oh, well, actually, yeah: go and break stuff, because you don't know how things work until you've broken them.

Ronak: Yes, that's true. Plus one to that.

Austin: Well, it’s been a blast having you on our podcast. So thank you so much for coming onto the show.

Ols: Cheers. It’s been brilliant.

Listen on

Apple | Google | Spotify | YouTube | Stitcher | Overcast | Castro | Pocket Casts | Breaker
