Tammy Bryant Butow - On failure injection, chaos engineering, extreme sports and being curious - #6 | Transcript

You can see the show notes for this episode here.

This transcript was generated by an automated transcription service and hasn’t been fully proofread by a human. So expect some inaccuracies in the text.

Guang: Hi Tammy, I’m super excited to have you with us today. Welcome to the show.

Tammy: Thanks so much for having me. Great to be here.

Guang: I noticed that you were doing full-stack software engineering before doubling down on SRE. So what made you decide to join the dark side and focus on infrastructure?

Tammy: Yeah, it’s actually a really fun story. I always loved computers. Since I was super little, you know, I had a computer, had the internet, since I was like 11. But the big thing that happened to me was, you know, I finished school, went to university, studied computer science, did programming, loved it. I really loved everything, so I was trying to figure out what I would like to focus on. And then I got my first job out of university at the National Australia Bank, in their graduate program, which is a rotation. What they do is put you in three different teams. That was pretty cool for me because I thought I’d get to try a bunch of different things out and see what I really wanted to double down on. The first team I went on was mortgage broking. That was super cool, because it’s very critical systems. But every single time they were like, can you build some new features on the front end? Can you also do some business logic work? Can you fix some issues with the database? Can you fix some issues with the load balancer? Can you be on call? Okay, I was like, do everything. And I realized that every time I tried to build anything on the front end, I had to go back a layer. I was like, okay, this isn’t working. It’s super slow. Why is it not working? Oh, the middle tier is really bad. Wait, actually it’s the SQL queries. Can I fix those? No, the database structure is really bad. And then there were issues with load balancing too. So I would just go back and back and back. And then I was going all the way to the hardware, like, what kind of hardware is this running on? This is the kind of person that I am. I just can’t stop going all the way back to the very bottom level.
Like, you know, how is this data center powered? I went and visited the data center to learn more about it. But that’s just me, I’m curious. And I realized that it’s really hard to make things amazing on the front end if they’re really not very great on the back end. So you kind of have to fix the back end first a lot of the time. Yeah, that’s why.

Guang: And now you’re a principal SRE at Gremlin. What does a day in your life look like as a principal SRE?

Tammy: Yeah, it really varies. I’ve been at Gremlin for three years now, so I’ve done a lot of different things. I joined as the ninth employee. And, you know, normal things like you would imagine: we are primarily on AWS. We’ve got a really nice layout in terms of architecture, and we also use a lot of new services. When AWS releases a new service, we try to be at the forefront. That was one of the reasons I liked joining Gremlin, because we weren’t like, we’re just going to stick with the old stuff that we know well. Gremlin’s written in Rust. We use SQS, SNS. We’ve used Lambdas for some stuff. So that’s been really fun. One of the reasons I wanted to join Gremlin was that I’d always worked on-prem, because Dropbox is on-prem, DigitalOcean’s on-prem, National Australia Bank is on-prem. I worked on a little AWS project, but it was always building the cloud, or cloud-related products, for other people. And Gremlin was founded by two engineers who worked at AWS, building AWS. I was like, that’s awesome, if I want to learn about AWS, I’ll go work with them. And so, yeah, I’ve learned so much from them about how to build reliable, scalable systems on AWS, which is a whole new area to me. So even though I’m a principal SRE, I’d say over the last three years I’ve just learned so much about building reliable systems on AWS, especially by doing a lot of chaos engineering and failure injection. Obviously I’ve been on call over the years, but I pretty much never got paged. I think in my first year I got paged like once. It was like nothing, and that’s very different from the past. I used to get paged hundreds of times a week, you know, until we would fix issues. And you also get the advantage of joining a startup when it’s small. Being the ninth employee, it’s
super early on, right? Gremlin had just come into existence. We had just got a web UI, because before that it was a command-line tool for the agent. It was all really new. We did a big migration to React, and that was something that happened over the years. But yeah, lots of cool stuff. I’d say a lot of my day-to-day now is actually helping other people understand how to build reliable systems, how to do chaos engineering, how to do failure injection, and also thinking through strategy and future work, which is something I’m really excited about. You know, where is SRE heading? Where is chaos engineering going? What are the new platforms that are coming out in the future? What are some easier ways that we can get big reliability wins, instead of just doing it the same way that we’ve always done it? Because I have a few tricks up my sleeve, but I always like to think of new ways too. So yeah, that’s me.

Guang: Okay. That’s really cool. Moving from on-prem to cloud, what was your biggest pleasant or unpleasant surprise?

Tammy: I mean, yeah, it was really different than I expected. I honestly thought it was going to be a lot more complicated and difficult. You know, no offense to AWS, but there are a lot of services and everything’s really different. Before I started to do it, it felt like, wow, this is a lot of services and a lot of things to learn. And it didn’t seem to me like the pieces all connected really well together. But now it doesn’t feel like that at all, because I’ve done it for three years. So now I just feel like, oh yeah, just grab this bit, put it over here, grab that and do that. You know, it’s all actually been very easy to do. And then injecting failure enables you to see what happens when systems don’t work well together. I like a lot of the features, but I’ve learned the ins and outs of how they don’t work. Like, you know, auto scaling. That’s a really cool product that AWS built, but I’ve also seen a lot of outages related to auto scaling because of configuration issues and throttling problems there. So, you know, I’m now at that level where I’m going into that next layer of how you can actually make sure that this is reliable, down to the really, really tiny little details. With on-prem, it’s totally different. I was focused on such different stuff: hardware, performance tuning, firmware upgrades, kernel versions. I mean, I’m just not looking at, you know, kernel versions for EC2. There are just different things that I used to do as well, like picking hardware. You know, I’d be part of buying and hardware decisions. What hardware do we want to buy for our databases? Let’s look at all the options. Let’s have hardware vendors come in and demo to us, and pick the best hardware and do capacity planning. Totally different.
You still have to do capacity planning, but I don’t have to do it like, you know, oh, we have to buy this hardware and get it shipped to the US and then put it in our data centers, find data center space. There are just a lot of projects that you don’t do. But I honestly love on-prem work. I really love it. It’s super fun and cool. I know you do it at LinkedIn, and then you also have Azure as well with Microsoft. So it’s cool, you get to do both, and I think it’s good. Yeah.

Ronak: LinkedIn is actually in the process of migrating to Azure. And some of the challenges that you’re describing is like, there is a set of problems that you don’t have to think about anymore, but then there is a whole new

Tammy: That’s exactly what happened to me. Yeah. And then you’re like, oh wow, these are really different problems than I saw before. And I never solved those problems, so it’s like learning totally from scratch. Yeah.

Guang: Gremlin is a chaos engineering company. How did you get into chaos engineering?

Tammy: So I started back at the National Australia Bank, and one of the things they told me I had to do when I first started was, they were like, alrighty, so we’re volunteering you to run our disaster recovery testing. That’s what they called it. And basically they were like, so for mortgage broking, you have to make sure that it fails over from one data center to another. And you’ve got to go to this secret location on the weekend. Someone’s going to come around and ask you to fail over your system, and then they’re going to check that it actually worked. And if it doesn’t work, then they’re going to mark it down on a piece of paper, and then you’ll have to fix everything and test it again in a quarter. I was like, wow, this is super serious. And I’d just graduated too, so I was like, do I even know how to do this? Oh my goodness. But my boss was like, you know, I think you’re going to be good at this. He just volunteered me for it, which is really cool. And yeah, he was a great boss. I was lucky to have an awesome boss coming out of university who gave me a lot of good opportunities. And yeah, I went to this secret unmarked building, did the failover exercise, and passed the first time. Because you do a lot of work to prep, a lot of failure injection on purpose, proactively, making sure that you’re injecting that chaos so that you will pass that region failover. And the big thing too is that we have to do it for compliance reasons, right? Because you’re a bank, so you have to pass these big, massive exercises. And I did so many over the next six years. Sometimes I passed, sometimes I failed, but it was on big systems: mortgage broking, internet banking, foreign exchange trading. So I love that, working on critical systems. And then I moved to America and worked at DigitalOcean.
And I was like, oh, I think the easiest way for us to learn about really complex systems is to inject failure. I still believe that. And DigitalOcean was like 14 data centers, so it’s massive scale for on-prem. And, you know, things go wrong. Failure happens. You need to be ready to handle it. It’s such a cool scale too, having that many data centers. It was really fun, and we did a lot of cool work. First we started with just drills, you know, kind of like tabletop exercises. And then you just think through what else you can do. You obviously want to try to inject failure. That’s the best way to do it. And then at Dropbox, I went there and did it straight away. In my first three months, I reduced incidents by 10x with my team, the databases team, by injecting failure on purpose to identify those areas of weakness that we needed to fix. And then we did a reliability sprint. So two weeks dedicated just to reliability, even though we are SREs and obviously you do reliability. But we were like, nobody can book meetings with us. We’re focused on this. We want to fix all these things. We’re just going to work really, really fast and hard. And we came out of it and it was awesome. We were never as bad again. For the next 12 months, we never had a high-severity incident. So yeah, that was really cool, and that’s how I got started. And it’s been such a fun journey. I’ve learned a lot over the years.

Guang: I remember my first exposure to the concept, I think, was watching a talk at AWS re:Invent where the Netflix people were talking about Chaos Gorilla and Chaos Monkey, and just being sort of mesmerized by, you know, when they show the graphic of doing a live sort of, you know, and redirecting. Yeah. I’m curious, how does cloud, or the massive adoption of cloud, play into this? On one hand, I can definitely see that these cloud providers abstract more and more of these things away, such that you don’t have to worry about it. But on the other hand, because of more abstraction, you can move faster, so there are more errors and more room for failures, and there’s more of a need to preemptively test it. I’m curious to get your thoughts.

Tammy: Yeah, that’s a great question. I get asked that a lot as well, and there are two things that come to mind. One is, you know, moving to the cloud, a lot of folks think, yeah, this is going to be easy. I won’t have to worry about reliability. It’ll happen out of the box. But that totally doesn’t happen. It’s definitely not going to be as easy as you think. And I think as SREs, we know that going into it. You totally know that, right? But a lot of folks don’t have that reliability background. So a lot of what I’ve been doing is actually helping people create an understanding or culture of reliability too, because that’s really important. Why is reliability important to us? How do we show that we care about it, by demonstrating that and doing actual work to be proactive and focus on reliability? And specifically, if we look at Kubernetes, there are a lot of outages reported just for Kubernetes on the cloud. Like, lots. There’s a whole GitHub repo, k8s.af. If you check that out, there are tons of outages. And I did some research to analyze those outages. That’s something I do for fun. I’m a super nerd. And I created this diagram, and it was like, wow, you know, twenty-five percent of those outages were related to just CPU issues, like spiking CPU or CPU throttling. As an SRE, you’re like, whoa, that’s crazy. And then the other thing was just clusters being unavailable. So that’s just, you know, shut down or unavailable machines, usually an unavailable node, maybe not even at the pod level, which is more complicated. But yeah, looking at that, you just go, wow, we still have a lot of basic things to fix.
Like, you know, you can’t handle CPU spikes and shutdown of nodes with Kubernetes, which is supposed to be reliable. That’s kind of where we are right now as an industry. Obviously in like 10 years we’ll be somewhere way better, but it’s still just the beginning phases. That’s what I really think.

Ronak: Yeah, it’s interesting. What other kinds of failures or patterns do you see with Kubernetes failures in general, in the cloud?

Tammy: It’s really interesting. So, you know, half of it was CPU and hosts or nodes going away. And a lot of that has to do with how you set up your clusters, because there are limitations there around AZs and regions, and that can cause outages. It’s those fine little details that you need to look into. And there have been outages of two regions at the same time, you know? So just knowing that is something you’ve got to prepare for. And that has caused people to lose a lot of money. If folks don’t know, at some enterprise companies, if you have an outage, you might have to pay a fine to your customers, like an SLA downtime fine. It can be millions of dollars in some cases. So, you know, that’s why SRE teams are so important, and their work saves lots of money for a company as well, just from that perspective. And then the other side of the issues was actually mostly related to networking and resources. And when you think about networking, DNS is a big one. You know, that’s always a big one that has failures.

Ronak: Yeah. If you don’t know what it is, it’s probably DNS.

Tammy: Exactly. It causes a lot of issues. You know, you can do redundant DNS, or have more reliable DNS infrastructure by having a backup and stuff like that. But a lot of folks don’t do that. And then the other area for networking issues is usually latency or packet loss. Just being able to say, what happens if my system experiences latency? The first thing I always think is, would we know? And then, do we have good tools to be able to identify that? Can we pinpoint where the actual problem is? And then can we remove that problem from the path? I think it’s really good, super powerful, for an SRE to understand networking. It’s very handy. Wherever I worked, I was always super good friends with the network engineering team. And they’re awesome. You know, I’m like, what tools do you have? Can I have access to your tools? And they’re like, yeah, sure. They’d let me log into ThousandEyes, and I’d be like, this is a dream. I can see the network diagram. And then they taught me all about peering and how we would, you know, make the network a lot better. And yeah, that’s a really cool thing to focus on, because you can improve your system a lot, but you need to be able to work with other teams to do that. Right? That’s a big thing. Yeah.

Ronak: Yeah, absolutely. I think networking gets more complex now with overlay networks and containers in there, where every Linux machine becomes a router of sorts. So there are so many layers of abstraction you have to understand.

Tammy: Yeah, you’re so right. And then you can have latency between all of those, you know, or total packet loss, or a little bit of packet loss. Yep.
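
The questions raised above about latency (would we know, and can we pinpoint where the problem is?) can be illustrated with a small simulation. This is only a sketch under assumptions: the hop names, the `Hop` class, the latency figures, and the 50 ms threshold are all made up for illustration and do not come from any real tool. It injects extra latency at one layer of a request path and checks whether a simple per-hop threshold points at the right layer.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    """One layer in a request path (hypothetical names and numbers)."""
    name: str
    base_latency_ms: float
    injected_latency_ms: float = 0.0

    def latency_ms(self) -> float:
        return self.base_latency_ms + self.injected_latency_ms


def slow_hops(path, threshold_ms):
    """Return the names of hops whose latency exceeds the alert threshold."""
    return [hop.name for hop in path if hop.latency_ms() > threshold_ms]


path = [
    Hop("load_balancer", 2.0),
    Hop("overlay_network", 1.0),
    Hop("app_container", 15.0),
    Hop("database", 8.0),
]

# Baseline: nothing should alert before any failure is injected.
baseline_alerts = slow_hops(path, threshold_ms=50.0)

# Inject 200 ms of latency at a single hop, as a latency experiment would.
path[1].injected_latency_ms = 200.0
alerts_after_injection = slow_hops(path, threshold_ms=50.0)
total_ms = sum(hop.latency_ms() for hop in path)
```

If the per-hop view were missing and only `total_ms` were monitored, the experiment would show that the team can tell something is slow but not where, which is the gap the questions above are meant to expose.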

Guang: Job creation, right. So at Dropbox, you were an engineering manager. I guess the question is kind of based on this post I read from Charity Majors a while back, talking about how straddling the IC track and the management track is very hard, to do both well at the same time. But you also don’t want to get too far away from either one. So do you feel like your experience managing a team helps you in this more leadership-style role as an IC?

Tammy: Yeah, definitely. I’ve read that post too, and I think it’s really good. I’ve definitely flipped between the two roles over time. So, you know, obviously most folks probably start out as an IC, like I did, and then you get your first opportunity. My first opportunity was to lead an intern, which was really cool. They were like, can you help lead this intern and guide them through their internship? And I count that as leadership experience, right? If you’re told to lead an intern, then you are a manager for that intern. After that I was leading a new grad. So I had one person, and then gradually led more and more folks. But then I decided I wanted to go back to IC work. And this is still at the National Australia Bank. So I went back to IC work. The way to describe it is, when you first become a manager, managing people, you look behind this red curtain. You open it up and you’re like, whoa, this is what they care about. I did not know how I was being measured for performance, or what people thought was important for me to do. You know, I didn’t know what a round table discussion for performance conversations was. There’s all this stuff that you can’t imagine until you see it, and the only way to see it is to become a manager. It’s hard to describe to people. I remember going into my first round table after a performance review cycle. It was people saying, I think this engineer should get promoted because they delivered this and this. Actually, no, I think this one should get promoted first and have a bigger pay rise, because they delivered this other project, which was much more important and helped all these other teams. And I just didn’t know before that.
Because no one had explained to me how we were being assessed. So you’re kind of just trying to do your best work based on what your manager tells you. But going into that room, I was like, oh, now I know how to, you know, really do well at my job. And it definitely helps. So I tell everyone, if you can have a stint as a manager, just do it, so you can see behind the curtain and then be like, okay, now I get it.

Ronak: At least you would know how the system works then.

Tammy: Yeah. Yeah, totally. And like, as engineers, we love to know how the system works. Right? Like you’re like curious and you want to understand it and just even doing it for three months. It’s really good. I wish there was an easier way to just like understand it and see that. Yeah, it’s a good thing to just do it. I’ll recommend it for everyone.

Ronak: So when you, when you became a manager did you miss some of the IC work that you did because a lot of your time would then go into people management?

Tammy: No, because I started really small with leadership, and I recommend that as well. I started with leading one person, one engineer. He was a new grad, and he was super smart and switched on. I would say it was very much more like being a tech lead manager, you know, so mentoring him, doing one-on-ones, doing his performance reviews, figuring out what work he could deliver for the company, and assigning him to projects. You kind of have to be like, my engineer can work on all these things, shopping around the projects and stuff. That’s how it works in a bank, which is really different. And they have internal charge codes for projects. It’s really different. But yeah, when we did that, most of my work with him was actually code review and architecture discussions. It was very technical, because he was new to the industry and he needed to learn a lot of that. And then after that I managed, you know, two people, so I still had a lot of time to do IC work, and I had IC projects assigned to me as well. And then when I was at DigitalOcean, I managed like 33 people at my highest point. That was a big team. I was having 15-minute one-on-ones every two weeks, and I didn’t even have the ability to meet all of them enough. But then we ended up scaling that out. That’s startup life, you know, you grow so fast and eventually hire managers. But I always knew that that was way too high. At Dropbox, my largest team was 14 people across database and block storage, Magic Pocket, at the same time, which was fine. That was really cool. And I was just involved in a lot of technical conversations, and I did do some IC work. So I suppose I’m someone who can never totally be away from what we’re doing.
I feel like otherwise you wouldn’t be very good at your job. You know, I was involved in picking what firmware there should be, in kernel discussions. I was the one doing the chaos engineering experiments, even though I was managing the teams. I was doing a lot of the failure injection, but more as a validation for my team. All right, I’m going to fail it. Is it ready? Yes. Yeah, it’s ready. Let’s do it. Cool, it survived, we did great. Or, oh no, look, we identified some issues, let’s fix them. But yeah, I’m lucky to have worked with so many awesome engineers. I think as a manager, when you have an amazing team, you’re just like, wow, I get to work with all of these folks every day. That’s super inspiring. Yeah, a lot of great people I’ve worked with.

Ronak: Awesome. When you have a manager who’s also getting into the technical details, and some managers might hate me for saying this, they get a lot of street cred from their entire team. Because it’s like, hey, I know my manager understands all these details. You know, it’s not just about the eventual goal, but also what it takes to get there.

Tammy: Yeah. And that’s what I think is really important. Like, you know, say you’ve got a project like this. This is just a really simple example, but we had to do these database migration projects. And I looked at all of the tables, understood the schema, and interviewed every engineer that had worked on all the different services that touch that database, because we needed to migrate from MySQL to our new distributed data store. And I was like, I feel like this is going to be a big project. I can’t just be like, hey guys, I want you to deliver this in six months, you know, without looking into it. That’d be crazy. So I did all these interviews, looked at the code, and understood it. I looked into what data was even there, trying to think through, could we just get rid of these tables, or do we actually need to migrate them over? And I talked to people about that. So really looking into the details. And then when we looked at it, I was like, I think this is going to be over a hundred engineers and it’s going to take over 12 months. That’s what I put forward, and that’s what it ended up being. But that’s the thing too. I think a great manager has to talk to their team, their engineers, and ask, hey, this is a big project. What do you think is going to be the hardest thing for us? What’s going to take us time? Do you think we’re going to come across any issues here? Let’s chat about it. Let’s get into a room and draw out a diagram of the architecture. But I think the engineering manager has to be interested in that. It’s like, be passionate, have purpose, but also be curious. And you can’t fake that, right? You can’t fake it. Sorry.

Ronak: Well, yeah. Could not agree more with that. So we’ve been talking about some of the outages in the public cloud, and I actually want to touch on that. But before we get there: you were managing some of the storage teams at Dropbox, and managing stateful systems, in my opinion, is hard. Anyone who does that would attest to it. We love discussing production outages and lessons learned on the show. Are there any such outages that you could share with us today?

Tammy: Yeah, totally. So, yeah, I definitely agree that managing anything that is related to data is hard and scary, right? It just has to be, because it’s critical data. And you always think through, what if we lost our data? Then it would be gone forever, and you can’t just get it back if you don’t even have backups or something, or there are no extra layers there. So that’s why it’s so scary. And also just around consistency issues. That’s the other thing. So I always thought this about engineers who are going to work on storage and databases: whoa, you’ve got to be brave to do that one. It’s really like that. And things change so fast, and a lot of teams depend on you, because if you’re at the data layer, pretty much every team needs that data. So you have a lot of internal customers, and you need to make it available, always reliable, and accurate. So, yeah, that’s a big thing. But in terms of outages, everywhere that I’ve worked over the years, pretty much, except Gremlin, knock on wood, has had a data-related outage. And right before I joined Dropbox, there was a big outage that had happened. If you look up, you know, Dropbox outage, I think it was, when was it, 2014, something like that. There was like a three-day outage related to the databases. So, yeah, that’s a big outage, down for three days, and it took a long time to get everything back up and running. And it was just from human error. You know, somebody did something that they didn’t mean to do, and there weren’t enough guard rails in place, and that’s why that ended up happening.
So it was just a thing where, yeah, of course, after that you put the guard rails in place. It’s just that they should have been there in the first place and they weren’t. A big part of what I did when I joined was to try and think through, how can we make this better? What can we do to not be in that situation again? Work with the team to put all these different guard rails in place, and lock down the ability to do certain actions. That’s really important. But even before that, that’s what I’d say with databases. It’s always a thing where you only want some people to have access, those who are really careful. And, you know, they’re typing like this, like, oh my goodness. Yeah, it’s a big thing when you’re doing something. Most people don’t want access if they don’t know what’s going to happen, or something like that. You really want to limit it. And I think that’s better than just having it available to everyone. Yeah.
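
One way to picture the guard rails described above is a check that sits in front of a production database console. The sketch below is hypothetical throughout: the operator allowlist, the list of statement types treated as destructive, the second-reviewer rule, and the `check_statement` function are assumptions for illustration, and nothing here reflects how Dropbox actually implemented its safeguards.

```python
import re

# Hypothetical allowlist: only these operators may run destructive statements.
DESTRUCTIVE_OPERATORS = {"alice", "bob"}

# Statements treated as destructive in this sketch (not an exhaustive list).
DESTRUCTIVE_PATTERN = re.compile(r"^\s*(DROP|TRUNCATE|DELETE|ALTER)\b", re.IGNORECASE)


def check_statement(user, statement, reviewer=None):
    """Return (allowed, reason) for a statement before it reaches production."""
    if not DESTRUCTIVE_PATTERN.match(statement):
        return True, "non-destructive statement"
    if user not in DESTRUCTIVE_OPERATORS:
        return False, "user is not allowed to run destructive statements"
    if reviewer is None or reviewer == user:
        return False, "a second person must review the statement first"
    return True, "approved by " + reviewer
```

For example, `check_statement("alice", "DROP TABLE users", reviewer="bob")` is allowed, while the same statement from an operator outside the allowlist, or without a separate reviewer, is refused before it can run.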

Ronak: Oh, I remember. So one of my colleagues, we were running a maintenance on a database and one of my colleagues was like, I want you to look over my shoulder to see every command I type before I enter let’s double check the exact thing, and then we’ll hit enter on this production system. I’m like, yeah, I can relate to that.

Tammy: That’s a great way, though, having that person peer review you running those really important commands. Because that’s the thing, you check with others, but it’s that moment when you run it, that’s the moment that counts. And yeah, that’s a great tip. It’s normal. I would do that. I would be like, hey, can you check that this is right? I’d usually send it in Slack. I’m going to run these. Does this look good? Yeah, that looks good. Or, oh, maybe you could improve it by changing it to be this. All right, cool. Now I’m going to run it. I’ve done it. All right, cool. And then you get more confident, but you don’t want to be so confident that you make mistakes. So it’s always good to actually just take that time to check it. So that’s data-related outages, and there are plenty of other really interesting outages too. From working on-prem, there are just a lot of different types of failures. You know, one was core switches that took down half a data center. Do you remember details about this one, where a core switch took down half the data center? Yeah, so, I mean, there were a lot of problems that I’ve worked on in the past where it could be just a configuration problem or an upgrade issue. I’ve even worked on outages where there was a power failure within the data center. That happens too, right? And then a lot of outages related to firmware as well, at that next level up. So I think, you know, the way that I always think about it now is that everything’s going to break in all these different ways, and you just have to really try to build in your failover mechanisms and think through, what would happen if this failed in this way? And it is really different, on-prem versus the cloud.
So at least you don’t have to think through, you know, core switches with the cloud.

Ronak: That is definitely one big advantage of moving to the cloud. So I want to talk about the cloud stuff. Before we do that, you mentioned that when you went to Dropbox, one of the first responsibilities was to kind of harden these systems. Now, one of the ways to do that, as you mentioned, is you identify these failures, or you inject failures and do these experiments in a more controlled environment. Personally, I feel like when it comes to orchestrating stateless systems, we have gotten much farther as an industry, but when it comes to stateful, everything is so specific. So when you’re doing these chaos engineering experiments on stateful systems to identify these failures, what kind of challenges did you see in the past? Or what kind of challenges do you see today in running those on stateful systems?

Tammy: So definitely agree there. When we were thinking about what to run as our first chaos engineering experiments for the stateful systems, for us it was, you know, thousands of MySQL machines, and then also some proxy machines, so hosts running a proxy for the database. Cache, like memcached, as well. That was the main area where we started. I think a good thing to do is to get into a room and brainstorm different types of experiments you could run, using the scientific method: what is our hypothesis? What kind of failure do we want to inject? What do we expect is going to happen? And then we ran it on staging first. We didn’t start in production. We gradually worked towards production. It didn’t take us that long, but it was good to start in staging for sure. And the first ones that we decided to run were actually process killer, which is actually a pretty advanced type of attack. Not a lot of people use those. Now that I work at Gremlin, I see tons of people practicing chaos engineering all over the world. And process killer, for us, was very much based on our specific circumstances. So that’s why it’s important to get in a room and talk it through. We were like, we want to make sure that if mysqld dies, then the entire machine is wiped away. It’s taken away and we get a fresh one. We don’t want anything else to happen. That’s exactly what we want to happen: give us a fresh new machine. It should already be in the free pool of machines, so it should already be pre-built. And then it should just go into the cluster and everything should be great. That’s what we wanted to test. And then we sort of had a feeling that sometimes that process doesn’t happen as fast on certain days of the week due to networking load. So we were like, let’s do it on Monday morning versus Friday night and see if it changes. And it did change. We saw it: sometimes it’d be super fast, and other times it was really long. So you want to know all of that, right? Then we started to take much more detailed metrics. So, how long does it take for us to replace machines if the process dies? This was a really detailed project that went on for several months, and we just kept getting better. But that whole time we were injecting the failure to learn from it, and then making improvements: going in and talking to the networking team, sharing our data and results with them, figuring out what we could do to make it faster. We realized at one point we were being throttled by the networking team. You know, with QoS, you can pick who gets what traffic, and just some tiny portion of throttling was happening, but not at a large scale. So we resolved that. So yeah, you’ve got to be in that level of detail. You’ve got to be comfortable climbing into that detail and figuring things out.
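
As a rough illustration of the measurement described here, the sketch below groups process-kill runs by time slot and checks a hypothesis of the form "a killed host is replaced within a budget". The `KillRun` type, the slot labels, the 300-second budget, and all the numbers are made up for the example; they are not Dropbox data or tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass
class KillRun:
    slot: str               # when the experiment ran, e.g. "Mon AM" (illustrative label)
    replace_seconds: float  # time from killing the process to a fresh host serving


def summarize(runs, budget_seconds):
    """Group runs by time slot and check the hypothesis for each slot."""
    by_slot = defaultdict(list)
    for run in runs:
        by_slot[run.slot].append(run.replace_seconds)
    return {
        slot: {
            "median": median(samples),
            "worst": max(samples),
            # Hypothesis: every replacement finishes within the budget.
            "hypothesis_held": max(samples) <= budget_seconds,
        }
        for slot, samples in by_slot.items()
    }


# Made-up measurements, for illustration only.
runs = [
    KillRun("Mon AM", 95.0), KillRun("Mon AM", 110.0), KillRun("Mon AM", 102.0),
    KillRun("Fri PM", 140.0), KillRun("Fri PM", 610.0), KillRun("Fri PM", 180.0),
]
report = summarize(runs, budget_seconds=300)
for slot, stats in report.items():
    print(slot, stats)
```

Comparing slots side by side is what makes a pattern like "slower on certain days" visible, and it gives a concrete table to bring to another team.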

Ronak: Yeah, totally. So one thing that engineers like is numbers and metrics and wins. Once you did all these experiments, of course it would change over time, but over, like, the first span of six months to a year, were you able to identify some of the low-hanging fruit, to say we caught these in staging or in production, and these were just ticking bombs that would have taken down the system?

Tammy: So it’s actually really interesting, I love that topic. So yeah, I’m a really big fan of metrics, like everyone is. And one thing that I really loved when I joined Dropbox was that we had these automated metric emails that got sent. I think you do it at LinkedIn as well. I’ve heard that, that every day there are automated emails that get sent out with the top metrics for the systems. I think that’s awesome. I hadn’t seen that anywhere else that I’d worked. For us it was also color coded. It would be red if it was below expected and green if it was above expected. And it’s very easy for everyone to just sign up and get those emails for every system that’s critical. And that’s cool because everyone sees your system getting better. Like, your system is super green and it used to be really red, and it’s like, hey, you guys are fixing things over there. So that’s a really cool way to do it. And the other thing was, I’m actually big on creating presentations based on your data and trying to tell a story around it. Like, hey, we identified these problems. This is the data set that we first looked at. We identified that these would be key areas to fix, and then this is what we did to fix it, and these are our results afterwards. Telling that story. And that really helped us a lot to get buy-in, and all the teams then were like, can you show us how to do this? And then we started to help them. And then we built tools for them to do it themselves, so it was more self-service. We built a dashboard called Scout. That was an internal tool, so any engineer across the company could add their PagerDuty service to it and then see the metrics for their incidents. But I’m actually a little bit different, in that the way that I like to pick what problems to solve is, I’m like, let’s go after the big problems. Like the big fish, the ones where if we fix this problem, then it’s going to knock out like 80% of our issues. You know, thinking about it from the Pareto principle, that sort of 80/20 rule. And that’s always scary to folks a lot of the time, but I think it’s more fun, and I’m into extreme sports too. So it’s like, let’s go for the big fish. Like, this one’s really bad, let’s fix that. And they’re like, ooh. And it’s also, you know, that system that people are scared of, that no one wants to write code for, because no one’s written code for it for like 10 years. Oh my goodness. But yeah, I’m like, let’s do it. And then we would just do that. We’d be like, let’s go after that system, or let’s decommission that flaky thing, get rid of it. And it feels so awesome when you do it, you know?
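
The 80/20 idea mentioned here can be made concrete with a small sketch: count incidents per cause and keep the largest causes until they cover roughly 80% of incidents. The function name `pareto_cut`, the cause labels, and the counts are all hypothetical.

```python
from collections import Counter


def pareto_cut(incident_causes, target_percent=80):
    """Return the fewest top causes whose incidents reach target_percent of the total."""
    counts = Counter(incident_causes)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    selected, covered = [], 0
    for cause, n in counts.most_common():
        selected.append(cause)
        covered += n
        # Integer comparison avoids floating-point edge cases at the threshold.
        if covered * 100 >= target_percent * total:
            break
    return selected, covered / total


# Made-up incident log: one entry per incident, labeled by root cause.
incidents = (
    ["legacy-batch-system"] * 12
    + ["flaky-proxy"] * 4
    + ["dns"] * 2
    + ["disk-full"]
    + ["cert-expiry"]
)
big_fish, coverage = pareto_cut(incidents)
print(big_fish, coverage)
```

In this toy data two of the five causes account for 16 of 20 incidents, which is the kind of result that justifies going after the scary system first.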

Ronak: Yeah, absolutely. I mean, once you get past all of that, it’s like a background thread in your head has just gone away. Now all this mental capacity has been freed up.

Tammy: Yep. Totally. You don’t have to be thinking about it at all. You’re like, we just removed this really bad part of the system and it’s gone. It feels so much better. I think about it a lot like this: say you live on this great street and there’s one house that’s super ugly and real smelly and bad. No one wants to go in there. That’s like some of our systems. Someone will get rid of it eventually.

Guang: What extreme sports do you do? You’re Australian. So I feel like that already sets the bar kind of high. You say it’s true supportive.

Tammy: I like lots of stuff. Like, yeah, definitely. You know, everyone’s into surfing. That’s not so extreme, but I definitely love skateboarding. I used to go in skateboarding competitions and got sponsored as well. So yeah, I love that. It was really fun. I’ve been doing that since I was little. And snowboarding, love that too. Dirt bikes, mountain bikes, BMX. I actually love BMX jumping. That’s super fun, but also very painful. It’s like the most painful, but also so fun to fly through the air. So it’s a trade-off.

Guang: I would argue that surfing is quite extreme, especially when you’ve got sharks in the water.

Tammy: Yeah. That’s so funny. I’ve never had a shark issue. I’ve been in the water when there were sharks, but I’ve never, like, you know, had a problem. I’ve just got out and it was okay. My worst surfing injury was one time I was riding a wave and got right to the end of it and got sort of toppled over. And my board went flying up in the air and came right down on my foot, and it was bad. Like, it almost broke my foot. It was so painful, but it sounds so funny, because you don’t think of that sort of thing.

Ronak: Well you’re very brave, Tammy is what I would say. Oh, go ahead. Go ahead.

Tammy: I was going to say, I always invite everyone to come along too. So yeah, if you ever want to go skateboarding, we can give it a go.

Ronak: I’d need some pep talk before we actually started doing that. I imagine some of your interest in extreme sports would also help with preparing you mentally for dealing with production systems and things which are scary.

Tammy: Yeah, that’s something a lot of people don’t understand, and they think that it seems weird when I tell them, visualize your system. I’m like, visualize it before you go on call. Just think through what could happen. But it’s exactly right, you know, because a lot of athletes do that, right? Like if you read about, yeah, basketball athletes and stuff like that, professional folks, they’ll visualize themselves getting the ball in the hoop and being like, yes, I did it, before they do it. And there’s really good research that shows that helps you do it. Definitely, you’re going to visualize yourself doing a skateboard trick before you do it. You spend a lot of time, we call it amping yourself up, getting ready for the trick and thinking through all the details. Like, where will I put my foot? How fast will I move my foot? What direction? You know, all these things. What’s the wind? Everything. If we do that with systems, it makes it a lot easier as well. And it doesn’t take that long, right? Like you said, even with the example of, what am I going to type when I’m running this command? Just taking the time to be patient and think it through, that’s way better than just running off on your skateboard, off the edge of the steps, and being like, good luck, hope it works out. It’s probably not gonna work out.

Ronak: hope is not a strategy, right?

Tammy: Exactly. No, not at all.

Ronak: So you mentioned running chaos engineering experiments like science experiments of sorts, where you have this hypothesis, you think about the failure you want to inject, and then you have an expectation of what should happen. Have there been instances where your hypothesis kind of went sideways, where you’re like, I think this is what should happen, but a completely unrelated, different thing happened to the system?

Tammy: Yeah. That actually didn’t use to happen to me until more recently. When I was at Dropbox, sometimes we would learn something new, but it was more like the detail of how we could fix that problem. In particular, related to proxy chaos engineering, you’d have different types of failure modes, if you did a hard shutdown or a really slow shutdown, non-graceful, like hanging threads, stuff like that. But when I’ve been doing it recently, over the last few years, especially with Kubernetes and on the cloud, I’m definitely seeing unexpected things. And a lot of it, I think, is around, you know, one is dependency analysis, like you said, because there’s a lot more complexity. So it’s like, well, I have containers. How many containers inside each pod? And then I have the pods, and then I have all the orchestration on top of it. And then I’ve got multiple nodes. And even just doing something simple, like saying, if I fail this service, I expect that this other service will fail, but I think everything else should be good. That’s really hard to guess now. You’re really guessing a lot of the time, because it’s a super complex system. And I used to do a thing where I would print out the code, to read the code of all the systems I worked on. Those were more like monoliths, and you’re learning about a specific area, and it was easier to get how things connected. But now, with distributed systems and containerization, it’s way harder. So pretty much every time, something unexpected happens, and I’m like, why is that failing? Oh my goodness. Look at the code and see why: this is a hard-coded dependency. Or, why is there a problem between these two services?
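
The "if I fail this service, I expect this other service to fail" guess can be written down before an experiment and compared with what was observed. The sketch below walks a declared dependency map to predict the blast radius, assuming every declared dependency is a hard one. The service names, the map, and the observed set are hypothetical.

```python
from collections import deque


def blast_radius(depends_on, failed):
    """Services expected to break if `failed` goes down, following hard dependencies."""
    # Invert the map: for each service, who depends on it?
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != failed and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


# Declared dependencies (hypothetical): service -> services it calls.
depends_on = {
    "checkout": ["payments", "cart"],
    "payments": ["ledger"],
    "cart": ["cache"],
    "search": ["cache"],
    "ledger": [],
    "cache": [],
}

predicted = blast_radius(depends_on, "ledger")
observed = {"payments", "checkout", "search"}  # made-up experiment result
surprises = observed - predicted               # failures nobody predicted
print(sorted(predicted), sorted(surprises))
```

Anything left in `surprises` points at an undeclared or hard-coded dependency, which is the kind of finding described above.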

Ronak: Yeah. Like microservices just make this entire graph super hard.

Tammy: It’s like a really complicated graph. That’s like the, if you Google microservices death ball,

Ronak: I haven’t Googled it, but I’m definitely going to do that this afternoon after this chat.

Tammy: Yeah. And that’s exactly what it feels like. When you look at that diagram, you’re going to be like, oh my goodness. Just trying to draw the architecture diagram for microservices is horrible.

Ronak: Oh yeah. Well, well, it’s, it’s a work of art at the end. Right?

Tammy: I like that. Especially if you picked cool colors or something.

Ronak: Yeah, exactly. So do you remember any of the weird things any of the latest, weird things you discovered in one of these experiments?

Tammy: Yeah. So one area is definitely around dependency analysis. And there’s something that we do a lot of work on when we’re doing tests. We’re always trying to think through, what are new types of chaos engineering attacks that we should create or run? I have a list of like 60 plus that are on my list of things that would be cool to have available. And so with that, I’m always trying to think of, what different types of failure will I inject? But lately, the ones that are most interesting to me are when I get to work on systems that are payments related. That to me is really interesting, when you see different failure modes there. That’s going to be really different depending on different people, what they’re doing and how they’re processing their payments. But I feel like lately that’s just an interesting area, and it’s probably because I worked in banking too. If you think about it, you go, okay, what’s going to happen with payments, how could that fail? There are all these different things. There’s the shopping cart process, the payment processing, sending the information back. Then it depends too on whether you’re doing just flat credit card processing or something like PayPal or Stripe. And if you’ve ever worked on PayPal-related checkout reliability issues, PayPal has so many error codes. This is like a thing. When you’ve worked on that, you’re like, whoa. You’ve got to look up PayPal error codes. There’s like a huge encyclopedia of error codes that you need to handle. And they all mean different things. Like, person has not enough balance, person’s PayPal isn’t working right now, or person’s PayPal has been locked, all this different stuff. And so then your system might not let people process anything. And then you could have just different outages related to payment providers. So to me, that’s an area that I’m interested in in particular. Yeah, just all the payments failures.
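
One way to keep a long list of provider error codes from turning into "nobody can pay" is to map each code to an explicit action and give unknown codes a safe default. The sketch below is only an illustration: the code strings and the `Action` names are invented placeholders, not real PayPal or Stripe error codes.

```python
from enum import Enum


class Action(Enum):
    ASK_CUSTOMER = "ask the customer to fix it or use another method"
    RETRY_LATER = "retry with backoff"
    FAIL_OVER = "route to another payment provider"
    ESCALATE = "record it and alert, but keep checkout open"


# Hypothetical provider codes; a real integration would use the provider's documented list.
ACTIONS = {
    "INSUFFICIENT_BALANCE": Action.ASK_CUSTOMER,
    "ACCOUNT_LOCKED": Action.ASK_CUSTOMER,
    "PROVIDER_UNAVAILABLE": Action.FAIL_OVER,
    "TIMEOUT": Action.RETRY_LATER,
}


def decide(code):
    """Pick an action for a provider error; unknown codes must not block all payments."""
    return ACTIONS.get(code, Action.ESCALATE)


for code in ("INSUFFICIENT_BALANCE", "PROVIDER_UNAVAILABLE", "SOMETHING_NEW"):
    print(code, "->", decide(code).name)
```

A failure-injection test could then feed each code to the checkout path and check that only the affected customer, not every customer, is stopped.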

Ronak: It makes sense. Then with different payment providers, I assume it gets more complex because the error codes are different.

Tammy: Yeah, exactly. And then you have to go out to a lot of third parties related to processing different types of payments and transactions. So you have to learn about all those different companies and what kind of information they give back. So yeah, that’s a whole interesting world that you learn about, and all those systems have to work crazy fast, right? That’s why I think it’s fun. It’s like, wow, you’re trying to process this, like, that fast. People don’t realize. And it’s like, whoa, sending all that information.

Ronak: Yeah. So we talked about Kubernetes a little bit. What are some of the patterns that you’re seeing, or some of the ways to inject failure in Kubernetes? I’ve seen some of your blogs around doing site reliability engineering for Kubernetes, or some of the common failures that you referred to before. What are the recent patterns that you’re seeing that people could use to inject failure?

Tammy: Yeah, lately I’ve actually seen a lot of folks using AKS. So this is more and more popular now. I’ve seen a huge spike in people using that. A lot of our customers use AKS. So I think that’s really cool. And a lot of the time, first we’ll start on things that maybe sound simple but are really not simple, like around region failover and making sure that, you know, if one region goes away, what happens? My biggest tip there when folks are starting to do it is this. A lot of folks will go, well, I’ll just shut down my cluster and see if it fails over to the other one. I’ll just shut down my nodes. But what I like to say is, well, at Gremlin we have an attack called a blackhole attack. And what that does is it makes the target unavailable. So it’s a networking attack. Instead of tearing down a whole cluster or a whole region, or a ton of machines, and then having to build them back up again, which is really time consuming and can also cause a lot of unnecessary issues. Because taking something down to bring it back up, that’s adding more opportunities for failure to happen. So if you just do a blackhole, then you say, oh, let’s just make this unavailable for a period of time, whatever it is, the pod, the node, the whole cluster, and then turn it back on. And you could do it for like 60 seconds or 30 seconds. It’s just gone, and now it’s back. So I like to recommend that. I think at the moment we’re still at a point where what folks have to focus on is building out really good architecture with your clusters. And thinking about that too: how many regions you’re in, what happens if one fails, and just the configuration of how you’ve got it all set up as well. Yeah, I’d say that’s my main tip. Once we’re more advanced, then I’ll come back and be like, okay, now let’s get ready for this.
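
The idea of making a target unreachable for a bounded time, rather than tearing it down, can be simulated without any real infrastructure. The sketch below is a toy model, not Gremlin’s API: `FakeNetwork`, `blackhole`, `route`, and the region names are all hypothetical, and the scope of the `with` block stands in for the 30 or 60 seconds a real attack would run.

```python
from contextlib import contextmanager


class FakeNetwork:
    """Toy network: traffic to any target in `dropped` is silently discarded."""

    def __init__(self):
        self.dropped = set()

    def reachable(self, target):
        return target not in self.dropped


@contextmanager
def blackhole(network, target):
    """Make `target` unreachable for the duration of the block, then restore it."""
    network.dropped.add(target)
    try:
        yield
    finally:
        # Always revert, even if the check inside the block raises.
        network.dropped.discard(target)


def route(network, regions):
    """Send traffic to the first reachable region, in priority order."""
    for region in regions:
        if network.reachable(region):
            return region
    return None


net = FakeNetwork()
regions = ["region-a", "region-b"]

before = route(net, regions)
with blackhole(net, "region-a"):
    during = route(net, regions)   # hypothesis: traffic fails over to region-b
after = route(net, regions)
print(before, during, after)
```

Nothing is destroyed or rebuilt here, so the only thing being tested is the failover path itself.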

Ronak: Nice. So I mean, my introduction to chaos engineering was very similar to Guang’s, actually, when I saw Chaos Monkey and got started. Like, CPU exhaustion is the hello world of chaos engineering experiments. But as you see teams maturing in their chaos engineering practices, what are some of the sophisticated practices that you’ve seen in the industry?

Tammy: Yeah. So I’d say one of the interesting things is, where do folks get started with chaos engineering, and thinking through the use cases of chaos engineering. The first one that a lot of folks start with is validating monitoring and alerting, which makes a lot of sense, right? Kind of like a smoke test. So do a CPU attack. Check that that fires, that you actually can see it in your monitoring, say in your dashboards. And then also validate monitoring and alerting for whether an alert fires when it needs to, because you’ve breached an SLO. And I’m seeing a lot of people, which I think is cool, do aspirational SLOs for new services that they’re building. I think that’s a great thing to do. And then being like, let’s set up the alerts for them. Let’s validate that that works, that we actually get a page based on that. And I think that’s really different to what people were talking about in the past, but it’s really connecting the dots between a lot of things that folks are focusing on, like SLOs and SLIs. And I like the idea of doing it all really automated. So at Gremlin we built something called status checks, which enables you to first actually say, what is my monitoring at now? What is the current level for my system, for whatever you’re looking at, if it’s a specific resource limit or something like that. Then run an attack in an automated way if that first check is okay, and then check back again and see if an alert fired or everything’s still good. And if it’s great, then progress even more. So that’s really cool. And it’s different, right? It’s like, let’s automate this. Let’s tie the monitoring and alerting into the automation. So it’s check first, run it, check again. Yep, all good. And run it again, check. And then you can just have that running on a cycle, and if it fails, then you know, hey, something broke. Something unusual happened that we weren’t expecting.
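
The "check first, run it, check again, then go further" cycle can be sketched as a small loop. This is a generic illustration, not Gremlin’s status checks feature: the function names, the fake `state`, the 5% error-rate threshold, and the assumption that the toy service degrades above 50% CPU pressure are all invented.

```python
def run_with_status_checks(check, attack, magnitudes):
    """Escalate an attack step by step, halting as soon as a health check fails."""
    completed = []
    for magnitude in magnitudes:
        if not check():                    # check first
            return completed, f"halted before {magnitude}"
        attack(magnitude)                  # run it
        if not check():                    # check again
            return completed, f"halted after {magnitude}"
        completed.append(magnitude)
    return completed, "all passed"


# Toy system under test; a real check would query monitoring instead.
state = {"error_rate": 0.0}


def healthy():
    return state["error_rate"] < 0.05


def cpu_attack(percent):
    # Assumption for the example: the service drops requests above 50% CPU pressure.
    state["error_rate"] = 0.2 if percent > 50 else 0.0


completed, outcome = run_with_status_checks(healthy, cpu_attack, [25, 50, 75])
print(completed, outcome)
```

Because the loop stops at the first failed check, it can run on a schedule and only needs attention when the outcome is something other than "all passed".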

Ronak: Yeah. Trying to see how far you can push a system, actually.

Tammy: Exactly. Yeah, really pushing it. That’s a good thing. Like stress testing. Sometimes people will think about chaos engineering like that, and yeah, that makes sense. You are trying to stress test your system. And the other thing I like to see lately is a lot of integrating chaos engineering into CI/CD pipelines. Lots of folks are still using Jenkins. That’s still really popular. And I’m seeing them run these attacks, like, you deploy code to staging and automatically run a set of, like, a chaos gauntlet, or sometimes folks call it a reliability blueprint, which is a set of scenarios that every piece of code, every new service, needs to pass. And this is nice. It’s super automated. I pass my reliability blueprint, now I’m good, and I can go to production. I think that’s cool. And if you didn’t pass, then you know why, right? It’s not strange to you. You can just go and fix it and then be like, yeah, cool, now I should pass. Do it again. Pass, right?

Ronak: Yeah. That’s very interesting. Can you share like some of the attributes of what this reliability contract would look like?

Tammy: Yeah, sure. So something that we focused a lot on over probably the last year was this idea. We have the idea of an attack. So say an attack is a process killer, or spike CPU, or shut down a node or a pod. Then what you want to do is think through the scientific method. So what is my hypothesis? What is the attack or the failure I’m going to inject? What do I expect to see happen after? So what we did at Gremlin is we built something called scenarios, where you could have one or more attacks, and one or more attack types, in one scenario. So it can get pretty complicated. And some of our customers have built it out so that there are over a hundred scenarios that have to pass, because these are really complicated systems where they have to meet a ton of compliance requirements. So if it’s a bank or a finance company, they have to also prove that their code and their services passed those scenarios. Now it’s a check that it has to get through. And so that’s probably the most advanced, complex style of what I’ve seen. And then what else I think is great too is when folks are getting started, before they get to that point, they’re doing something like, let’s figure out 10 scenarios that we want to create, that we should be able to pass, and that we want to make sure we can run and everything goes well. Ideally everything just passes all those, and it’s just a check that’s in place, and it’s just running. But a lot of the time it doesn’t pass, right? And then you’re like, good, we caught this early. And it could be, like, let’s blackhole a service. That’s an example, right? Let’s make this service unavailable, a third-party dependency. Does our whole system crash? Check that out before you go to prod.
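
A set of scenarios used as a gate can be modeled as data plus a pass/fail check. The sketch below is a generic illustration and not Gremlin’s Scenarios API: the `Scenario` fields, the attack labels, and the hard-coded `verify` results are hypothetical stand-ins for real experiment outcomes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Scenario:
    name: str
    hypothesis: str              # what we expect to happen
    attacks: List[str]           # one or more failures to inject
    verify: Callable[[], bool]   # True if the hypothesis held after the attacks


def run_blueprint(scenarios: List[Scenario]) -> Tuple[bool, List[str]]:
    """Run every scenario; the gate passes only if none of them failed."""
    failed = [s.name for s in scenarios if not s.verify()]
    return (not failed), failed


# Stand-in results: a real verify() would inject the failure and inspect the system.
blueprint = [
    Scenario(
        name="db-process-kill",
        hypothesis="A fresh host replaces the killed one",
        attacks=["process_killer"],
        verify=lambda: True,
    ),
    Scenario(
        name="third-party-blackhole",
        hypothesis="Checkout degrades but stays up",
        attacks=["blackhole"],
        verify=lambda: False,
    ),
]

passed, failed = run_blueprint(blueprint)
print("PASS" if passed else f"FAIL: {failed}")
```

A pipeline step could exit non-zero when `passed` is false, and the list of failed scenario names tells the service owner which experiment to rerun.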

Ronak: Yeah, absolutely. So for teams who want to adopt or start doing chaos engineering, apart from just building good tooling and trying to make the systems more resilient, it also requires a cultural buy-in, where they want to buy into the entire idea of breaking things on purpose to try and make them more resilient. So for teams who are early in their journey of testing their systems like this, do you have any words of advice?

Tammy: Yeah, definitely. There are like three areas to focus on. The first one to me is helping educate folks on what chaos engineering is, and sometimes, what is reliability engineering? What even is SRE? You’ve got to do that first, so that people understand why we’re doing this work. Then the next step to me is trying to figure out, how do we move towards a culture of reliability where we’re injecting failure? And a lot of the time for that, I would say, you want to think about, are we going to do this as a centralized team that’s doing all of this chaos engineering work, or do we want to do it more like self-service, where we just make it available to everybody? Those are two really different strategies. If you pick centralized, you’ve got more control over it. You can help guide folks and you can do the work. But if you’re going with self-service, then I think what you want to do is really think through all the things you need to build, like a wiki, and you need to make a lot of things much more accessible and available to folks, like little videos or tutorials to get your team members ramped up and started. Yeah, that’s my main tip there.

Ronak: Have you seen differences in making it more self-service versus running it centrally?

Tammy: Yeah, definitely. I’ve seen lately a lot of folks moving to more of a self-service kind of practice of chaos engineering, especially when they’re integrating into CI/CD. So it’s like, we want you to pass these scenarios. If you don’t pass, here’s how you can rerun it yourself on your service and make sure that you understand why your service isn’t passing and what’s not working well. Because, you know, I worked on a build team for a while, and a lot of the time people were like, I’m so mad that it hasn’t passed. It’s like, okay, well, you’ve got to figure out why, you know, diving into it. But sometimes it’s not clear because the tests aren’t clear. It’s like, what does this test even test for? Who wrote this test? I always think through, should this test even be there? Is this a really old test? Is it relevant anymore? There are a lot of issues with that. And so with this, you want to make it as easy as possible for people to reproduce the tests that you ask them to pass, and then to understand how to fix that issue. So that’s what this whole idea of self-service is. It’s sort of, get out of the way, give people tools, give them education, and enable them to uplift and build more reliable services themselves. But you have to continuously be guiding them. So there has to be that team that’s doing the work to help figure out, what are the scenarios we need everyone to pass? How have they changed? What new types of failure do we want to inject? What new systems are we going to be using in the next few months? Doing more high-level strategy work too.

Ronak: That makes sense.

Guang: We’re getting close to the end of the chat, but this is something super cool that I wanted to make sure we touch on. We saw that you were the co-founder of Girl Geek Academy, where the goal, I think, is to teach 1 million girls technical skills by 2025. We would love to learn more about it. How did all this get started?

Tammy: So yeah, this is a really fun thing. I started off doing this work in Australia a really long time ago. While I was in university studying computing, my lecturer, she was really cool, the head of the computer science faculty, Ruth, she said, hey Tammy, do you want to help more girls study technology at university? I was like, yeah, that sounds fun. And so then she gave me a project, to run kind of like a day at university for high school students. And I thought that was really cool. I’d never thought of doing that before, but it was really fun. It felt great to help them learn, and they had a great day. And then when I moved to Melbourne I started to go to some meetups. There’s one group called Girl Geek Dinners. And I liked that, it was fun to meet other women in tech that way, but I’m super nerdy. So I was like, I want to do hands-on stuff. I want to build things. I want to learn new languages or new technologies or new platforms, whatever it is. And then I asked some friends, what do you reckon? Should we build our own group and make it more like workshops and hands-on stuff, and do hackathons and whatever we want to do? I remember my friend was like, yeah, I want to do 3D printing, and I’m like, sweet, let’s do it. And so we’ve done some super cool stuff. We had one weekend, it was like a makers sort of weekend, and we had a 3D scanning machine that scans your whole body, and then you could print yourself out. We were doing really fun stuff like that. But yeah, it was just like, let’s do whatever we want to do. Why not? There are no rules here. So that’s why we created that. And we’ve helped so many women and girls over the last few years, all over the world as well. So it’s been really cool. And we’ve worked with a lot of great companies too. Microsoft has been super supportive of our work. We do a lot of workshops and classes with them. If you go to Girl Geek Academy, there are some Microsoft partner classes that are coming up really soon. And so, yeah, it’s just a ton of fun. I’ve met so many awesome people through that.

Guang: That’s awesome. Do you find it difficult to balance? Is there a lot of work to balance it with the full-time job?

Tammy: No. So yeah, Girl Geek Academy has a full-time CEO, Sarah, my friend. We asked her, hey Sarah, will you be the CEO? She was like, yep. And so we got funding from the Australian government to run it. Our first grant was a million dollars. So it’s really cool. The Australian government really cares about this and was happy to back it with money. So she’s been able to do that for the last six years or so as a full-time job. And then I help out, but I don’t really have to do that much work. So my tip is, if you have some things you want to do, think of what your ideal dream life looks like, visualize it first, like we talked about, and then just make it happen. And that’s it. And then you do it.

Guang: That’s awesome. So the, the fun question that we’d like to end on is what was the last tool that you discovered and really liked?

Tammy: Oh, that’s cool. So lately the main thing I’ve been looking at is actually load testing tools. I’m currently looking at a lot of different tools, like Gatling, NeoLoad, Locust and JMeter, kind of comparing things. So yeah, Gatling is very popular, and why do people like that? And it’s interesting too, because the reason to look at load testing is that you obviously want to be able to make sure that you can simulate load on your other environments, like staging, when you’re doing your chaos engineering work. So yeah, I would check out those different tools. JMeter has been around for a really long time, but then there are newer things like NeoLoad that are becoming more and more popular. So yeah, that’s what I’d say. Check that out.

Guang: Anything else you’d like to share with our listeners?

Tammy: Thanks so much for listening. And yeah, if you’re interested in SRE or chaos engineering, you can find me on Twitter. I’m always happy to answer questions. My Twitter handle is Tammy X Bryant.

Guang: Awesome. Thank you so much, Tammy, for taking the time.

Tammy: Really appreciate you having me. It was fun.
