Uma Chingunde - On managing migrations, growing engineering teams and much more - #8 | Transcript

Uma Chingunde

You can see the show notes for this episode here.

This transcript was generated by an automated transcription service and hasn’t been fully proofread by a human. So expect some inaccuracies in the text.

Ronak: [00:00:00] Hey Uma, super excited to talk to you today. Welcome to the show.

Uma: [00:02:19] Great. Thank you. Thank you for having me.

Ronak: [00:02:21] So we thought we would start by asking you about your background and how you entered the infrastructure engineering space.

Uma: I actually kind of joke that my more recent work is higher up in the stack than my early career work.

Uma: [00:02:37] So my first major job in the U.S. was working at VMware, after I actually interned with them. I worked in the hypervisor management group, on part of the product called vSphere. If you had to describe it in a very simple way, it managed a cluster of hypervisors, which were VMware’s hypervisors, so it was kind of like the cluster management software.

And hypervisors are actually what companies like AWS, GCP, or Azure use in their backend. So essentially, when someone says “server”, it’s usually someone else’s real-life server that’s somewhere in the cloud. So I started at VMware, then worked at Delphix, which was building a very similar product but for databases. But I was very cognizant of the fact that tech as an industry has, for over a decade now, been moving to SaaS.

So it was an intentional effort to move to companies that are essentially software as a service. The natural extension for me was to look for a role on an infrastructure team, because that’s closest to what my previous experience was. I did that at Stripe most recently, for a few years, in the compute group.

And now I’m at Render, which is, I think, another interesting abstraction, where we’re building the next level of abstraction for people wanting to deploy to the cloud.

Ronak: Yeah, that’s very interesting. So you mentioned you were in the compute group at Stripe. Can you tell us more about your role there and what the compute group looked like?

Uma: So the name “compute” is intended to be a little self-explanatory. The idea is that Stripe is essentially providing a payment API to users across the world, and everything that Stripe runs is essentially running on provisioned compute resources that my team managed.

I mean, we were hosted on a cloud provider ourselves, but if you are a product engineer at any company like Stripe, you don’t necessarily want to have to deal with the nitty-gritty of the cloud provider API. So what my team did was essentially abstract away what a typical cloud provider provides in terms of compute instances, and build abstractions on top of those instances, essentially a layer on top of that.

So if you are a product engineer, you have a lot of that abstracted for you, and you just focus on building your service versus also having to deal with the nitty-gritty of running the service. We made it easy for you to build your service without thinking about where and how to run it.

So reliability, scale. Essentially we built our internal Kubernetes layer, and we also managed the service-to-service communication via Envoy, which is a service mesh that we had adopted internally.

Ronak: [00:05:40] So would it be fair to say that your team built the abstraction layer for the rest of engineering to say, okay, I’m going to tell you my service, I’m going to tell you the compute I need, and you just run it somewhere in a data center or on the cloud?

Uma: [00:05:52] Yeah. I wouldn’t say we were 100% there, but that was the reason for my team’s existence, basically.

Ronak: I see. What’s interesting is I’ve seen a lot of compute teams who build this abstraction layer. I mean, I work on a compute team myself, so I can relate to a lot of the challenges you might be dealing with.

Ronak: [00:06:09] What’s interesting there is you mentioned you also built the Envoy proxy layer to provide the service mesh capabilities. Did you have to collaborate with the network team on this as well, or the traffic team?

Uma: With the traffic team, yes, very heavily.

Uma: [00:06:23] So we had a traffic team at Stripe. I should clarify that among the things my team managed, we had multiple clusters, and the edge cluster was actually managed by our traffic team. So they had their own rollout of Envoy. My team managed the internal compute cluster, whereas they managed the edge network and the edge cluster.

But yes, we collaborated heavily. The separation of responsibilities was that we managed the network, the service-to-service communication, for the internal cluster that we managed, and they managed it for the edge. But there was heavy collaboration, because with the network you don’t have such a clear boundary.

Ronak: [00:07:05] Yeah, that makes sense. And I’d usually say your product is essentially a product for all the engineers at Stripe. So the customers are all internal, which is good and also sometimes challenging, because they will give you feedback really quickly. When it’s good, it shows right away and it’s extremely gratifying, but sometimes different teams have different requirements, and sometimes a single layer of abstraction doesn’t work for everyone.

How has your team, or how in your experience have you, dealt with these requirements in general, when you know that this is going to work for 80% of the use cases, but there is going to be this 20% for whom we’ll either have to give access to the APIs under the hood, or we need to do something else?

Uma: [00:07:50] Yeah. I think this is a really good articulation of a common problem for internal teams. And honestly, it’s kind of a subset of a problem that any infrastructure product actually has. We actually have a version of this at Render itself, right? I think one way to think of it is as cohorts of users: you have your bulk users, and then you have what we would call your super or admin users.

So at Stripe it was essentially an ongoing conversation, right? What is the class of user that you can optimize for and build your interfaces for, and to whom do you just say, actually, yours is a specialized case? So it was always an ongoing conversation, an open point of dialogue.

But you also had the luxury, because it’s an internal team, that you could just give them more fine-grained access. And there were definitely teams that had that privilege. One example was our data infra team, so the team that managed data; they had a much more specific set of requirements.

And there were also some, for instance, where not all of their workloads could eventually be moved to Kubernetes. So they were just a separate cohort of users. Our overall strategy was thinking of these as cohorts of users and approaching them that way.

And I think something that we also realized over time is that you do, in the end, have to draw a kind of dotted line from these users to your external company users. So you have to either directly or indirectly optimize for the business. For Stripe it’s your payment users, and the most important pieces of its products are where you focus most of your attention.

But then, with key critical investments being made into emerging businesses that maybe had different needs, or any other kind of one-off use cases, you do have to make strategic bets where maybe you let them go develop on their own. So it was a combination of engineering and business decisions, I guess, is a good summary.

Ronak: Mm, that’s really good insight.

Ronak: [00:10:04] I hadn’t thought about it that way, but what you said makes a lot of sense: along with working with the internal teams, tying it back to the business itself, and then optimizing some of the decisions you make and how you prioritize them. So now you’ve moved on to Render. By the way, congratulations on the job.

I know you joined recently, and as you mentioned, Render is also building a kind of compute with another abstraction on top of it. And in this case, your consumers aren’t actually internal teams; that’s the product itself. So a lot of your experience ties into the product at Render really well.

I’m curious, can you describe your role at Render, and also what kind of similarities or differences you see in the product that you’re building, or the challenges that you see as users are using the product?

Uma: Oh, so this actually touches on one of my motivations for joining Render, because I had been following them for a while, because I was like, okay, this is interesting.

Uma: [00:11:05] This is an interesting product. We used to joke internally at Stripe that what our users really want is for us to abstract everything else and just give them a way to run their services, right? And that really is what all developers want. So I had been following Render. There was also the Stripe connection, where the CEO and co-founder is also ex-Stripe.

There’s just this association. So I’d already been following them with this interest in mind. When they reached out wanting to talk to me, I was like, yeah, obviously. I mean, at a minimum, I wanted to learn how they’re tackling this problem. The way I see it, the team that I was working on was part of a larger foundation org.

The name was Foundation, but it’s the larger infrastructure team, right, and an equivalent exists at pretty much every company. LinkedIn, as you just talked about, has the same thing; Slack, Lyft, Uber all have similar versions. So it’s a pretty standard problem. And the way this appealed to me was: at a larger scale, like Stripe scale, you build a team to fix the problem for you.

It’s a similar model: if you’re at Google or Facebook scale, you have a data center. At the next layer, if you are Stripe or LinkedIn or Uber or Lyft, you have an internal infra team and you’re probably hosted on a cloud. But then what about the next layer, which is even smaller developers? If you want to develop something, you still have to learn infrastructure.

You have to balance learning the infrastructure to run your service against actually building your service. So I could see the need for this, and that was exciting. I had never worked on this problem at this scale, however, so it was really appealing to try out something completely different in terms of scale, to build something from the beginning versus work with an existing system.

And also, for me, I really like growing engineering teams, and the people side of it was really exciting. So the opportunity to build a startup from the beginning was something I thought was awesome, basically.

Ronak: Yeah, that’s certainly exciting.

Ronak: [00:13:23] I know that, at least for startups, if infrastructure is not the core product, the priorities are constantly competing against each other: do I build the application that makes money, or do I build the infrastructure to support it? So Render makes total sense. And as you mentioned, it’s an early-stage startup, and there is an amazing opportunity to build out the engineering team from the ground up.

In terms of how you’re thinking about building this engineering team and what you see at Render right now, what are some of the things you are thinking about these days?

Uma: So right now, I would say, since I’ve started, a lot of my focus is just growing the team itself, because currently we actually have such good traction, in that users want to use us; there’s a clear need.

Uma: [00:14:06] Essentially what you just described is the need. So essentially we’re pretty much gated on our bandwidth to keep delivering new features. It’s a very good problem to have, right, in that your solution is adding people and growing the team. And that’s what my immediate focus is, and doing it in a way that is sustainable.

It’s a growing and scaling problem. So that’s the biggest thing that we’re doing. And we’re actually really transparent with our roadmap. There’s actually feedback.render.com, where folks can see what we’re building. And currently the things constraining our opportunity are pretty much our own bandwidth and our ability to execute.

So that’s where just onboarding new people in a sustainable way, and just building, comes in. And that was partially my excitement as well: after having done different things, I was kind of missing the focus, and more of being in the weeds and just executing on stuff.

Yeah. Being in the new role, do you get any bandwidth for yourself to do these deep dives with the team or design discussions, or does your time go into other things?

Uma: I would say, so far, not a lot. But the refreshing thing, which is very different, and I kind of understood it but hadn’t really seen how much it would actually be the case, is how much day-to-day overhead there is in a much larger organization, right? Just your volume of emails, the volume of meetings, the overhead of communication that comes from that many people.

It is so much different from my total team. Render as a company is 14 people right now. So when you think of that, you just cut through a lot of that. So I do have a lot more time to actually sit down and absorb the product. So far, I think I’m still scratching the surface, but I’m actually really excited to be able to.

Austin: [00:16:16] Nice, nice. Yeah, that sounds like a pretty exciting transition, going from Stripe, which has been growing at a very rapid pace over the last few years, I’d say, to a much smaller startup. I wanted to take a step back: you’ve pretty much always worked in these spaces where any abstractions that you’re working on are generally going to be pretty huge.

And the impacts are going to be pretty big. And on the compute side, as you were growing this platform, a big part was of course to keep delivering these features. And I would also assume that a lot of customers while you were at Stripe may not have been on that platform already.

So I want to pull back onto the whole concept of migrations. You wrote an excellent blog post on this a while back, talking about managing migrations. It was a great read, and we’ll definitely put that in the show notes. I would imagine on the compute side there are definitely migrations that are going to be needed.

Some are maybe easier than others, and some are a little bit more, yeah, scary to even encounter. But I think the blog post that you wrote gave a very good rundown of your thought process of how you go about it. And I want to talk about that more today and jump into that.

For starting any sort of migration, I’m assuming the first part always is the planning part. If you don’t plan for it, I’m assuming you’re pretty much set up for failure. I think I’ve seen this firsthand on my side: migrations that have gone well, some that have gone awful.

So yeah, I just wanted to get your perspective on that.

Uma: [00:17:44] I wrote a little checklist in that as well, which is things to think of even before you have written a single line of code or anything to migrate. Some of it depends, though, on the time you have, right?

Some migrations are planned and some are last minute, like the Spectre and Meltdown migration that we had to do. So I think it really depends on how much time you have, but I do think it’s one of those things where the metaphor of measure twice, cut once really helps. The way I like to think of it is: you have to invest in planning, and depending on your bandwidth, you can obviously constrain the planning to be a quick iteration and then keep going, versus actually spending a lot of time doing the planning upfront.

But at a minimum, it’s this checklist that I put together in my blog post, which is just sitting down for at least an hour and asking some questions.

What does this migration mean? Why are we doing it? What are the goals? Is 80% the goal? 50%? A hundred percent? Almost in OKR style: what’s the hundred percent, what’s the 80%? What’s the priority? What are the constraints? Is it an execution constraint? Is it a technology constraint? Things like that.

So I tried to really summarize the checklist, which I put in the blog post, into the things that I saw were really repeatable: at a minimum, having this one meeting with the key stakeholders. I like having this actually put in a document, which is just, you know, why are we doing this migration?

It was actually someone on the team who came up with this idea, which was “Why Envoy”, one of the first ones we did this for. And that’s typically the output of this conversation, listing out all the constraints. So that, I would say, is the MVP of the planning, but you can do a lot more.

Austin: [00:19:49] Got it. Yeah, that makes a lot of sense, starting with the why. I think there are books about this as well, and it definitely applies here. Otherwise people just say, oh, of course, why are we doing this? I also really liked that you touched on how far we want to go with this migration.

What do we want to call done? I think I myself would also say, okay, yeah, we’re going to get this migration to the end, but we don’t really specify it, stamp it down, making it very clear what that is. And you talk about how you want to have some sort of metrics to track this progress, which I’m assuming means that what you call done is going to be reflected in the metrics that you are able to capture, right?

Uma: [00:20:34] Yup. And the metrics and what you call done can really help drive alignment between all the stakeholders. Spectre and Meltdown was actually the one where we started doing the metrics and found them to be super useful, because there we actually had a commitment to our external users, which we had decided on. Essentially we had different percentages that our security team felt comfortable committing to.

And then we communicated those to external users, which was: this percentage of our fleet is going to be running on this update by this time. So that really helped frame the importance of the metrics to us, and because it was such a purely urgent thing, the urgency helped drive the metrics. That experience then helped me realize that even if you don’t have that urgency, you can still frame the “why are we doing this?” and the “what does done look like?”

That works even in the non-urgent case, because then it helps teams prioritize things relative to each other. And it drives clarity. In this case, you have your account managers talking to their users, you have the leadership team wanting to know what our exposure is, and you have the security team wanting to know how fast different teams are working on it. And everyone can just look at this one dashboard, or different versions of the same dashboard, and get the same information.

Austin: Yeah, that makes a lot of sense. For the metrics on the compute side, I can imagine, let’s say if it was to patch the cluster or something, it would be: what do we want to call done at that point? For the full migration, it could be 95%, a hundred percent, whatever it is.

For some of the metrics, I can imagine you can get alignment. Some of these metrics may exist, some may not. Have you been able to strike a good balance where there’s a metric that’s the perfect metric we want, but we don’t have access to it, and we would have to put a lot of time into creating that metric, let alone the migration?

Austin: [00:22:43] And how do you balance those two sides of it? At some point we have to say, okay, this is good enough as a proxy; we don’t want to go too far down.

Uma: Yeah, no, I think this is actually a really good topic to talk about.

Uma: [00:22:56] Sometimes building the metrics and extracting them takes more time than a lot of the other things. So in that case, I think as long as you have a good enough proxy, right, you can do something as simple as someone manually updating a spreadsheet. That’s okay, as long as it’s a good enough proxy and as long as it’s not too much work. So essentially you need an MVP of the metric.

That is: okay, I can get all the actual versions in this way, and then someone has to maybe clean the data and pipe it into the spreadsheet. It’s a hodgepodge, but it’s fine, it works, whereas it’ll take someone a week to get it all automated. Then you just want to do the quick and dirty thing.
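A quick proxy metric of the kind described here can be very small. The following is a minimal sketch, not anything Stripe actually ran: it assumes a hand-maintained spreadsheet exported as CSV with hypothetical `host`, `team`, and `version` columns, an assumed target version, and an assumed 95% definition of done.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export of a hand-maintained spreadsheet: one row per host.
# Column names, team names, and versions are invented for illustration.
SPREADSHEET_CSV = """host,team,version
web-1,payments,5.4.2
web-2,payments,5.4.2
db-1,data,5.3.9
db-2,data,5.4.2
batch-1,data,5.3.9
"""

TARGET_VERSION = "5.4.2"  # assumed "patched" version
DONE_THRESHOLD = 0.95     # assumed definition of "done" (95% of the fleet)


def patched_fraction(rows, target):
    """Return (overall_fraction, {team: fraction}) of hosts on the target version."""
    totals = defaultdict(int)
    patched = defaultdict(int)
    for row in rows:
        team = row["team"].strip()
        totals[team] += 1
        if row["version"].strip() == target:
            patched[team] += 1
    per_team = {team: patched[team] / totals[team] for team in totals}
    total_hosts = sum(totals.values())
    overall = sum(patched.values()) / total_hosts if total_hosts else 0.0
    return overall, per_team


rows = list(csv.DictReader(io.StringIO(SPREADSHEET_CSV)))
overall, per_team = patched_fraction(rows, TARGET_VERSION)

# One shared view of progress: overall fraction, per-team breakdown, done or not.
print(f"overall: {overall:.0%} patched (done at {DONE_THRESHOLD:.0%})")
for team, fraction in sorted(per_team.items()):
    print(f"  {team}: {fraction:.0%}")
print("done" if overall >= DONE_THRESHOLD else "not done yet")
```

The per-team breakdown is the part that lets different stakeholders read the same numbers; the data cleaning stays manual on purpose, since the point is to drive the migration rather than to perfect the metric.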

I do think it’s important to focus on the right version versus having clean, beautiful data for this. The goal of the metric is to drive the migration, not the metric, right?

Austin: Yeah. That’s a good call. And for a lot of these migrations, it’s usually not just one team that’s running some platform.

It’s usually one team that’s managing the platform, interfacing with many, many customers, so you’ve got to work with many other people. So for a lot of these migrations in general, it’s good to have a few folks that are driving this entire thing, which I’m assuming begins even in the planning segment as well.

Austin: [00:24:22] Were there any key characteristics that would make for a good tech lead, or just a general lead, in these migrations?

Uma: I think someone that has breadth is useful. So people that typically understand breadth, and if they don’t have an existing understanding, are able to dig through roadblocks as they come up against them.

Uma: [00:24:46] So an example would be, you know, we have these kind of weird stateful machines that are relatively harder to patch. The best-case scenario is you have someone with a lot of knowledge of your systems who says, oh yeah, we have these weird systems that are going to be harder to patch.

Let’s get started on them, and start special-casing them in the metric. And if not that, then you want people that are able to problem-solve and essentially debug. That being said, though, in the end it’s honestly an alignment and trust question, because the bigger problem often is that everyone actually knows what needs to be done, but they’re just too busy to do it, right?

So I would say, for the broader team, as long as you have alignment where the leadership team and engineering overall say, okay, this migration is the most important, or the second most important, and we want to spend X amount of time on it, then the problem usually gets wrapped up.

So the way I like to think of it is: you have the core team, and then you have representation across the board, so the core team has people they can go to. If they don’t have people they can go to when they get stuck, that’s when everything starts taking much longer.

Ronak: Makes sense.

Ronak: [00:26:08] I can actually relate to a lot of the things you’re saying, because I was involved in the Meltdown and Spectre patching at LinkedIn, and stateful systems are interesting, I’ll just say. But I won’t go into the details, because that’s probably another conversation.

Since you mentioned that: as you’re migrating these systems, you have this representation across different teams, and there’s obviously buy-in and alignment at the leadership level. But the execution and migration stage can go on for maybe a quarter or more, depending upon what you’re trying to do, and priorities evolve based on the business and on what’s current within that team.

And sometimes you’ll see one of these teams that has a unique requirement, and like you said, there is some custom work that has to be done; they just have to put in those hours. And sometimes there are conflicting priorities and they cannot. What are some of the effective ways you’ve seen to still push the migration forward?

I’m not saying forcing someone to do the work, but repetition helps: repeating that, hey, this is why we’re doing this, this is how it helps. So I’m curious, what are some of the effective ways you’ve seen?

Uma: I think one effective way is trying to make it as easy as possible for them to do it.

Uma: [00:27:21] Right. So if there is prep work, especially generalizable prep work, like staging everything and then saying, we have staged this, this is where you go, this is the setting you change, these are the scripts to run, this is the how-to. Essentially, that’s the extra upfront support you can offer to teams that are struggling.

Some tactical things we did: the team that was doing this would run office hours and say, hey, if you’re stuck, come to these hours and we will help you do this work during them. Things like that. So that’s the plank of being nice upfront and offering a lot of white-glove support, which can get you a lot of the way.

And then I think the other thing is aligning. If they’re not able to do it, what is the underlying reason? Is it something else that is higher priority that has to be delivered instead? That often comes down to business alignment, where their org leader has to align with your org leader and say, okay, maybe for this particular set of things in this migration, we are either going to get an exception or we’re going to get some other team to help out. For that second step, it’s almost always a business or team alignment decision versus a technical solution.

And then the technical solution might be, oh, the core team actually does the migration for them, but it’s usually a discussion. Those are the two broad categories, I would say. And this is what I alluded to in the blog post: getting specialized program or project management help, via a program manager or a TPM, can be very helpful, because they are trained to think of these as holistic systems and people problems.

Oh yeah, that is true. I cannot emphasize enough the role of a TPM in a migration effort. A team could go crazy just doing that task itself and not the migration, if you don’t have that support.

Uma: Exactly. It’s the separation of responsibilities: there’s the technical work, and then there’s the organizational and people side. And I would almost say that in most migrations, the second one is where the bulk of the energy goes.

Yeah. I should just say, shout out to all the TPMs who support all the engineering teams. We talk a lot about engineering and business, so I just want to say shout out to them as well, and to anyone who is moonlighting as one.

Austin: So you mentioned, and this was actually really neat to hear just now, that for these migrations you can set up these sort of staging environments for teams, to say, hey, this is the first transition before we move fully onto the migration.

Austin: [00:30:18] And it’s just kind of personal preference, but I think when anybody talks about a migration, it’s generally not a happy sort of thing, or an “oh yeah, let’s totally do it” sort of thing. It’s usually more like, oh gosh, what’s going to mess up in this? Maybe because they’ve been burned by other migrations in the past.

I mean, in general migrations are just hard, right? And I really liked that staging idea because, at least for me, it proves that it’s going to work, but it also de-risks some of the bad fallout while going through this migration. Those are the technical things, and I’m sure there are other ones too, but have you found other ways to help teams de-risk that and be less fearful?

Uma: [00:31:08] I think going after teams that have maybe I think getting some wins, Andrea, Andrea bed is actually key. And I also wanted to pull on this thing that you just on the comment that you made, which is like, no one likes my creations. I think depending on it, there might actually be. There might I have found, for instance, are depending on the migration, there were always, actually a few super users that were actually like careering to go that wanted to be opted into the new system.

Like, like the one sec wanted to be on our new Kubernetes cluster for instance. Right. Like they were like, can like sign us up right now. And I think one, that’s actually like really interesting check to pull, pull on because that’s like, Oh God touches on. Most users problems, direct users are willing to bear some pain if you can give them a reward.

Right. So, what’s in it for me? If you can actually find a carrot in your migration, which is, at the end of this, this is what you will be getting, that’s actually the most powerful thing. And that goes back to the why: why are we doing this? What’s in it for the user? Is it a compliance thing?

But even then there’s, okay, you will be the first to be compliant, or, I won’t have to manage my infrastructure, I will get these better tools, or I will get something. So if at all possible, you should have a very compelling reason for the people being migrated to be part of the migration, because then you’re halfway there.

And that can really save a lot of time. Yeah, I guess that emphasizes, again, going back to the planning part and the why. If you don’t have that, I’m assuming that’s how a lot of migrations don’t go well, and it feels like you’re pulling teeth most of the time. And for these migrations, there are the folks working on it directly, the ones owning the migration, and potentially other engineers more on the customer side who have to work on their part of the migration.

Austin: [00:33:07] And you’ve hinted at this: we need to find a way for them to see a benefit. I can see a lot of engineers coming in, especially newer ones, who don’t see migrations as a shiny new feature that they want to implement. This probably wasn’t even part of their interview process, and they’re just like, well, I have to do this thing.

That’s just moving data from point A to point B. How have you been able to communicate that to other engineers so that they can understand the full impact of the work they’re doing? It’s not a feature, but it’s still very high impact. Yeah, I think that’s where one thing I’ve noticed is you actually have to leverage multiple channels to make this effective.

Uma: [00:33:45] So one channel is, you know, you just have a landing page for everyone, which is the why: why Kubernetes, why Spectre, why Envoy? Why do I, as an engineer, care about this migration? And you have to make sure it’s really crisp messaging. And for every organization there are often preferred delivery mechanisms. Stripe was a very written culture.

So, you know, emails got read. You could send out a wide email and be sure that the majority of people would read it and process it. So in a company like that, you use that mechanism. At other companies there’s an all-hands presentation that maybe everyone goes to, or different channels; Slack is maybe a better one for some organizations.

So you have to find the preferred communication channels for your organization and leverage them. And then, regardless of what the main preferred one is, you actually have to repeat it multiple times. So you kind of have to, you know, send the email, have the all hands, have to be pure fame, send that email.

So you have to make sure the message is getting repeated. And that’s where the alignment comes in, which is your first step in the prep work. The alignment plan is, you know, you get your org leader on board with the fact that you are going to be sending this email. So either they send that email on your behalf, or you send the email and they do a tap back, which is, yes, everyone, this email is super important.

Everyone should treat it as highest priority. Or they reference it in their notes, or they reference it in the next all-hands meeting. So you have to figure it out for your org, and tailor it to what the scope and the scale are, right? Is it the team, the org, or the entire company that’s being affected?

So based on that, you create different channels for the impact and then tailor your message accordingly. And this is also the why, right? That’s where you would tailor it: why should I care? What’s the impact to me? And that’s where the delivery of the message and the communication matter. An example: in written communication, the casual user should get the most important information.

Austin: [00:35:56] Like, in that first above-the-fold part, which is: what’s happening, what do I need to do, and why should I care? All the details. Got it. And being in more of this leadership role, I think a big part has always been to try to recognize the work that’s being done by all the other engineers.

Now that you’ve worked on many other projects as well, including many migrations, have there been specific, or maybe less commonly used, ways you recognize this type of work? Yeah, for sure. I think the recognition actually goes hand in hand with incentivizing the work, because if you don’t have the culture of treating this work as important, you’re basically deprioritizing it, right?

Uma: [00:36:42] Because every engineer is like, okay, and a somewhat extreme scenario is, I want to get promoted in the next year. If I finish this project, I have a clear path to promotion, versus if I spend time doing this migration, then my path to promotion is less obvious. So if you’ve created a culture like that, either implicitly or explicitly, then you have a problem getting this work done.

Right? So in essence, it’s through reward and recognition that work gets prioritized. So the way to do it is to explicitly have the conversation, frankly, and that’s on the managers and leaders, which is: this is extra sticky work that everyone has to do. And so avoid the urge to tap the same people. The other really bad signal is if you just always pull the newest hire on the team to do the work, who doesn’t know what they’re getting into.

Right. That’s a big red flag. So you try to distribute the work evenly and fairly, and recognize it. And the recognition is, again, org-based, right? Call it out in the promotion packet, call it out in the all-hands. Maybe your company does cash bonuses. Maybe it’s peer recognition or something. The same level of reward and recognition system that you have for everything else should apply to this work.

Ronak: [00:38:01] Otherwise you’re implicitly making the decision that no one will want to do the work. Yeah, that makes sense. Those are some really good ideas on how a team in an organization can go about conveying, not just to their customers, hey, this is how important the migration is, but also to the team doing the migration, you’re doing impactful work, and this is how it helps the business.

So in your blog, you mentioned that at one point the compute group at Stripe was doing, I think, five migrations simultaneously with a team size of less than 20. I mean, that sounds extremely hectic. I’m curious, how did the team manage it? So I should, I should clarify that when, so a team member actually like went in actually deffer the term Charles Hooper.

Uma: [00:38:49] So he actually just counted. They were not all active, right? They were ongoing. Essentially there were migrations that were old things, like, you know, we got stuck on this migration from version one to version two of our internal platform, and it had just been slowly progressing for multiple years at that point.

So that was one. And then there was another one to move to Envoy. There was the one to move to Kubernetes. An OS upgrade. So those kinds, and all of them were being done with different levels of urgency and in different stages. So not all of them were things that people were actively picking up and doing, and that’s the problem.

He saw it and cited it. Basically he was like, we have these half-done things, right? We have five parallel work streams. So essentially what would happen is, we were in different places on each, and people would go chip away at one of these, or upgrade a few more boxes while they’re active.

Or do some extra cleanup here. That was the kind of ongoing problem. As I said, it wasn’t five of them all active at the same time. They had just been slowly proceeding, in the most extreme case for multiple years. It had basically just accumulated. Yeah. Take that as a sign of a rapidly growing organization.

Ronak: [00:40:13] So I forget where this was said, I think it was about Google or somewhere, that there’s no final version of the product: the one in use will be deprecated, and the one that we want to use is being built right now. And as you mentioned, there were these multiple migrations going on, and some of them were moving forward slowly.

Yeah. I think it would be appropriate to say that migrations are a marathon, not a sprint. And as the team is moving forward with this work, if the migration, or any work, goes on for too long, it’s only natural as humans that one would start losing some interest. Not because someone wants to, but that’s just the nature of the work.

So as a leader, how do you ensure the team stays motivated to continue on, and prevent burnout in the process? Yeah, I think that’s a really important topic, especially for infra teams. I would almost turn it around and say that migrations, and just long-running projects, should for this reason actually be avoided, in the sense that even if it’s going to take multiple years, you should have your cutoff points. And that’s the solution we took with those, you know, up to five migrations, which was that the team collectively decided.

Uma: [00:41:35] And the manager, Ian, decided that they would actually just focus and burn it down in one go, because there’s a cognitive cost, and just a tax, that continues if you let it linger. So they actually made it a goal to finish all of the ongoing ones.

And the other side of it is that no one actually benefits from an incomplete migration, because no one has access to it. You always have this cohort of people that can’t use the newest tool. There’s actually a really cool blog post by a manager of mine, Will Larson, which kind of seeded my blog, about how migrations are actually your way to fix tech debt, because as you finish the migration, you’re fixing the tech debt.

So I think the way to look at it is, one, keep it actually time-bound. And this goes back to the kickoff: define the done point, and then explicitly have it be on or be off. The problem with those migrations that kept lingering was that we had not made an explicit decision to prioritize or deprioritize.

So they were just kind of lingering. So I think that’s where you have that explicit decision: what does that look like? Do we stop at 50%, or do we wait until a hundred percent? What are the costs of stopping? What are the costs of finishing? What are the benefits of finishing? So just being very intentional at every stage is what counts.

And I think if you’re doing that, being intentional and timeboxing things, that’s when you prevent burnout, because a good description of burnout is when people don’t feel like they’re in control of their own destiny. And so if they are constantly dealing with two versions of a system, right, like, we fixed this bug in the new system, but we’re still having to field support for the old system because the migration isn’t complete, that leads to frustration.

So that’s why you have the honest conversation and say, okay, we’re going to deprioritize other stuff and get this thing to a hundred percent. That’s how you prevent burnout, versus if you just let it linger.

Ronak: [00:43:49] Hmm. That makes sense. I think the takeaway, personally, for me is: ensure there is an end state to the migration, and ensure there’s clarity on what that looks like for the team working on it and also for the stakeholders. Yep. I mean, if the migration goes on for long, you’re just ending up supporting two systems, which puts more cost on the team.

So as teams work on these migrations, well, once you’re done with the migration, people are super excited to work on the new system and support it, because usually it’s a better, faster, improved version of what you had before. In that regard, there are two aspects to it. One is, let’s build a new system, and then there is the migration part, which is, let’s move the customers from one to the other.

How have you seen, or, let me rephrase that. How have you seen engineers balance this work themselves, and also, as a leader, do you move people around between building the system and doing the migration?

Because I can imagine, as an engineer, wanting to build more and spend less time on the migration. I think this goes back to, it’s a similar problem to the reward and recognition, right? My philosophy is that if it’s work that everyone wants to do, if it’s bright, shiny work, then you spread out the opportunity so that everyone gets an opportunity.

Uma: [00:45:21] And if it’s grunt work, then you also spread it out. So essentially that’s your way to fairness, right? So on the broader question, you just have to make sure that everyone participates in all phases and not just the fun stuff. That’s the way.

Ronak: [00:45:45] Hmm. It makes sense. And I know we’re getting to this a little late, but we love to talk about war stories and production outages on this show, and migrations are notorious for creating them, I should say. At least in my experience. The intentions are always good. It’s just that we cannot foresee every scenario.

So I’m curious, are there any war stories related to migrations or otherwise that you could share with us today? I think it actually might be harder to pick which ones I can share, because pretty much all of them had some. I was impressed by the team that drove the Spectre/Meltdown migration though, honestly, because for the tightness of the timeline and for the scope, which was our entire fleet, we actually had very minor hiccups.

Uma: [00:46:34] Given that, though, the interesting one was where we ran into a weird incompatibility between our OS version and the underlying machines we were running on, and it caused an interesting production outage. It was basically an incompatibility between the hypervisor and the OS and the packages we were running.

And it took a lot of digging to figure out exactly where that was. I think we were not the only ones who saw it, and that’s also an interesting anecdote maybe: if it’s something that’s industry-wide, which Spectre/Meltdown was, you’re not the only people facing it.

And that was the first time I actually saw the leverage of the network. I was relatively new to the team then, so I was mostly observing and coming in to see what the existing team was working on. But we reached out to a lot of our peers and got lots of support, and also from the vendors that we were using, in figuring this one out.

So I think that’s maybe one thing that we haven’t touched on: when you run into production outages of a large scale, sometimes you can see who else is running into these problems. That’s maybe one anecdote, but happy to talk about other ones. There are war stories from pretty much every migration.

I definitely want to talk more about those. One question I have on this one: it sounds like a very nasty bug, with incompatibilities between different abstraction layers. What was the impact? Say, for instance, you rolled out the new patch. What did the team see? So it was basically, the machines that had been patched would just, and this was a while ago, so I’m probably stating it incorrectly, with that caveat, but basically there was just a weird

thing where they would just reboot, basically. The cluster that this was happening on was our internal batch cluster. So they would just randomly reboot. And so then it was this trade-off: do we essentially roll back these machines? Because they were relatively isolated, there was nothing external-network-facing.

So essentially we made a trade-off there: for a very short period of time, we actually rolled back the patches on them and continued working while we figured out what the right fix was. And I think this is a case where it really is, you know, kudos to all the engineers that worked on that incident, figuring it out live.

Austin: [00:49:14] Yeah. It sounds like even the detection of this was pretty quick in that particular case. But I imagine, while managing such a large fleet, the worst kind of bugs are the ones that are intermittent, that just happen every now and then. So it’s hard to know if it’s because of that or because of something else.

Uma: [00:49:35] Is that something that the team has always considered for any of these sorts of things? It’s like, let’s let this bake for a week? I’m not sure. I think the staging and the rollout are important. And that’s why you can pick the less critical workloads first and then continue.

In this particular case, we were lucky that it was somewhat localized. It was a particular cluster. Again, if you roll it out completely, then it becomes harder to pinpoint which of the steps has caused the problem that you’re seeing. Which is why you always want to work in stages. That way, at every stage, as you hit problems, you know what’s likely to have introduced a change, and then you can roll it back, or hold, or keep going.

I think that’s an important thing. Finding the issues early is very important, because rolling back when you are more than halfway through is way too expensive. And doing it in a way where you can partially roll back is also important.
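
The staged approach described above can be sketched in code. This is only a minimal illustration under stated assumptions, not the tooling discussed in the episode: the `Stage` type, the `criticality` field, and the `apply`, `healthy`, and `rollback` callbacks are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stage:
    name: str              # hypothetical cluster or workload name
    criticality: int       # lower means less critical, so it is migrated first
    migrated: bool = False


def staged_rollout(
    stages: List[Stage],
    apply: Callable[[Stage], None],
    healthy: Callable[[Stage], bool],
    rollback: Callable[[Stage], None],
) -> Tuple[List[str], Optional[str]]:
    """Migrate stages from least to most critical, one at a time.

    Stops at the first unhealthy stage and rolls back only that stage.
    Returns the names of completed stages and the failed stage name, if any.
    """
    done: List[str] = []
    for stage in sorted(stages, key=lambda s: s.criticality):
        apply(stage)
        if not healthy(stage):
            # Partial rollback: only the stage that just changed is reverted,
            # so the likely cause of the problem stays easy to pinpoint.
            rollback(stage)
            return done, stage.name
        done.append(stage.name)
    return done, None
```

Because each stage is checked before the next one starts, a failure points at the most recent change, and earlier stages can stay on the new system while the fix is worked out.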

Ronak: [00:50:40] Yeah. I think that would also touch on how a team thinks about the mechanism to migrate, because one is, you make the decision. The other is, you have to think about this as a new feature of sorts that you’re ramping. And if something goes bad, you need that undo button when you want it.

So thinking about the undo is super important, not just for features, but also for the migration part. So you mentioned you have other war stories to share. We would love to dig into more. Can you share another one? I think it was mostly just around, the more critical the infrastructure, the more careful you have to be with the migration.

Uma: [00:51:23] Right. And I think, just never underestimate how critical infrastructure is. So with Envoy, it’s a great piece of software, and you can build so many things, but the key thing is that when you migrate to it, as we did, we were very careful with the setup, but then at some critical point you do have to start serving production traffic.

And I think it was just interesting how many issues we had. There was a time where we had a few issues that, because of the critical nature of a service mesh, were essentially bringing down large parts of production.

The lesson that we learned there was that having a lot more diagnosability baked into any critical infrastructure is really the key. That was our big learning. And then another big learning was making sure that knowledge of a new system is really being spread out early on, because what happened in that particular instance was that a really small key team was in charge of the migration and had been working on it for a long time.

But then as it went to production, they essentially became the go-to people for on-call. And so when these incidents started happening, they were the ones who were constantly dealing with the incidents. So what we had to do then was essentially tell everyone, we’re just going to stop all other work and go fix the reliability issues.

Ronak: [00:53:16] Then continue. Makes sense. Yeah. It’s so important to recognize that we’re making the trade-off. Okay, it’s time to pause, go back and fix the issues that are affecting site-up. It’s a question of site-up at this point, so let’s go and fix them and then move forward. So Render is still in very early stages.

Uma: [00:53:37] Have you seen any migrations yet? Not really migrations per se. But yeah, we have a booth and I mean, I think if you’re smaller, it’s almost more fun because everything is so new. And I got a bit of cat can be, can be broken where you sitting, because there’s how you to just do it a different stage.

Ronak: [00:54:01] One question that we like asking everyone is, what is a recent tool that you discovered and really liked?

Uma: [00:54:10] I think one of the joys I’ve had of starting at a startup is, because we are so small, I get to experiment with a lot of tools that are just new, because our requirements, for scale and just overall, are so small, right?

So I would say one tool that I have liked is this tool called Linear, which I’m very new to. Essentially, think of it as issue tracking software. And it’s a really interesting problem, right? And a notoriously hard problem, especially for someone like me who lived through the original bug tracking software that I worked with, like Bugzilla.

Right. So my expectations are already low. And also it’s a very hard problem to solve at scale. So I think Linear is actually very refreshing in the way they are approaching this problem. I didn’t expect to actually like using something like issue tracking software, but that’s actually been nice.

Ronak: [00:55:14] That’s pretty much the first time I’ve ever heard someone say they like their issue tracking software, because for whatever reason, engineers, managers, everyone, they just don’t like either Bugzilla, JIRA or whatever the new software is that they’re using. But we’ve got to check it out.

Uma: [00:55:33] Yep. That’s definitely. Is there anything else you would like to share with our listeners today? I think maybe, since we touched so much on migrations, my big thing is that there’s just this theme of things that people hate. Actually, bug tracking software is a good example, right?

Bug tracking software, migrations, meetings, right? People hate all of those things. And one of the things that I would maybe leave the listeners with is, it’s not the thing that you hate. It’s the way the thing is done. If you hate meetings, it’s the way the meetings are being run.

If you hate your issue tracking software, in that case, yes, it is the issue tracking software, but it’s also the way it’s being used. If you hate the migration, it’s the way it’s being done, not the fact that it’s going to unlock some new capability. Often with these things, the answer is to step back and ask, why do you hate the thing?

Everyone hates incidents, but they shouldn’t be a source of misery, right? To the theme of this podcast: why is something a misadventure versus an actual adventure? It’s always in the how of the hard work. Yeah.

Austin: [00:56:54] Things are inherently just hard. Like you stated, hard problems. So depending on how much effort is being put into them, they can get executed not as well as other things that are probably easier in that regard.

Ronak: [00:57:06] Well, one thing which I would say about incidents is that incidents are actually unintentional investments in learning. You didn’t plan for that, but there is a lot of learning that comes out of it.

Uma: [00:57:16] Agreed. And then maybe this might be self-serving, but Render itself is pretty delightful software to use. So maybe I can end with that plug.

Austin: [00:57:27] Yeah, for sure. We’ll definitely link to Render in our show notes and we encourage our listeners to check it out. Yep. Thank you so much again for coming on the show.

Uma: [00:57:36] It was really enjoyable to speak with you about migrations and your experiences. Thank you for having me. This was fun.

Ronak: [00:57:44] Thank you so much for your time. Really appreciate it. Thank you.
