Tractable

In this episode of Tractable, Kshitij Grover sits down with special guest Uma Chingunde to explore the intricacies of managing and optimizing complex technical platforms. Uma delves into the critical importance of constant investment in infrastructure to ensure reliability and stay ahead in the ever-evolving landscape of technology. The discussion also digs into the necessity of transparency and communication during critical incidents, emphasizing a blameless approach and the impact of decision-making on key operational changes. Uncover the challenges and strategies involved in running mission-critical workloads, the value of developer experience, and the intricate balance of serving different user segments. Tune in for a deep dive into the world of technical platform management and the thought processes behind significant decisions.

What is Tractable?

Tractable is a podcast for engineering leaders to talk about the hardest technical problems their orgs are tackling, whether that's scaling products to deal with increased demand, racing towards releases, or pivoting the technical stack to better cater to a new landscape of challenges. Each Tractable podcast is an in-depth exploration of how the core technology underlying the world's fastest growing companies is built and iterated on.

Tractable is hosted by Kshitij Grover, co-founder and CTO at Orb. Orb is the modern pricing platform which solves your billing needs, from seats to consumption and everything in between.

Kshitij Grover [00:00:04]:
Hello, everyone, and welcome to another episode of Tractable. I'm Kshitij, cofounder and CTO here at Orb. Today, I have with me Uma, who's the VP of engineering at Render. Render is a platform that helps you build, deploy, and scale applications across a broad range of technical stacks and services. Render is used by companies like Equal, Readme, Felt, and a ton of other start-ups. So to give Uma a little bit of an introduction, Uma has a ton of experience in high-growth environments and previously helped to scale internal compute and runtime at Stripe in a period of growth for the company both in terms of usage and headcount. So lots and lots of experience to draw on, and really excited to have Uma here.

Uma Chingunde [00:00:41]:
Thanks so much for having me.

Kshitij Grover [00:00:44]:
Well, maybe before we dive into technical details, which I'm sure will be the meat of our conversation, tell me a little bit about Render today. Obviously, you spend a lot of time on the engineering side, but if we just talk about the core thesis of the company, how would you frame that?

Uma Chingunde [00:01:02]:
The core thesis is really what you touched on. We try to be a modern cloud for developers. So, you know, it's like we're gonna be the cloud for developers and then the right question is, how is that different from what already exists? And the key here is we want to abstract away everything that developers don't want to worry about and focus on the core development experience. So all the underlying infrastructure, all the undifferentiated toil, we take that away. We wanna provide a true platform as a service. So that's the core value of Render.

Kshitij Grover [00:01:38]:
Yeah. And I'm sure that a lot of the inspiration for that thesis, at least in your journey, comes from having seen companies like Stripe or previous experiences where engineers did have to endure a lot of toil, and then inevitably companies will spin up internal platform teams that focus on that and try to abstract away parts of the, maybe, build process or deployment process or even debugging process from them. Are there specific things where you feel like Render does a really good job, or even just specific pain points where you feel like that toil continues to show up and there's an opportunity to really get at it?

Uma Chingunde [00:02:21]:
Yeah. The core thesis, the way I think of it, is what developers want is a workflow versus what cloud providers provide are resources. So we are bridging that key gap. What cloud providers are providing is, like, here's compute, here's data, here's storage, networking, go build something with it. But what developers want is a workflow where they are able to bring their GitHub account and push it and have it live. And so we're abstracting that away. So the key difference is we're taking a bunch of resources and making them available as workflows. So that's the key workflow that a developer wants and needs, and that's the similarity.

Uma Chingunde [00:03:08]:
So the reason I was sort of excited when I started talking to Render, and I could see that it made sense, was that after spending time at Stripe and then talking to a bunch of peer companies, it was actually remarkable how many similarities there were in the internal platforms that a bunch of folks had built. So, you know, fast-type companies that were cloud native had all essentially developed the same tools and techniques independently. So, essentially, each team was building the same thing. And whenever you see these patterns, it's, like, okay. There's a clear core product that exists here that can be built.

Kshitij Grover [00:03:46]:
Yeah. And that makes a lot of sense to me. I think one of the things I wonder about is if there is room to continue to kind of push the envelope in how prescriptive you can be as a platform. Like, in some sense, it's hard to imagine that people don't want point-in-time recovery of their data. Right? It's just like something that you kind of have to probably piece together in a lot of systems. It's something you have to kind of build on top of maybe what a cloud provider gives you. Are there specific things where you feel like these abstractions could be stronger, like, just categories of workflows where you think everyone intuitively feels like, okay, this is the direction we wanna head in, but they're having to build that themselves over and over again?

Kshitij Grover [00:04:33]:
Yeah. Maybe it would just be interesting to get some examples of that.

Uma Chingunde [00:04:36]:
Yeah. So some key examples are really how the deployment workflow is so fundamentally different from, you know, piecing together compute and storage and a bunch of other things. Like PR previews, where you are able to sort of, you know, iterate on a change at a point in time, is so intuitive and something that we have built at Render, which is very core for the developer experience, but doesn't yet exist at the underlying resource level. So those are a few things. Something that I'm also excited about is anything that's potentially very complex and, you know, things that have potentially paper cuts, things like permissions or security settings or, you know, the classic S3 bucket being left exposed to the Internet. Like, those are things where we provide strongly opinionated defaults, but there will always be a category of users that do want to sort of, like, peek under the hood. So you do wanna allow that sort of backup option, and that's where, you know, people do want, for instance, the ability to SSH into instances. We provide that. But do the vast majority of users want that? Not necessarily.

Uma Chingunde [00:05:49]:
So I think really strong, opinionated, sane defaults are the way for the majority, I think, with the option to sort of go another layer and another layer below for power users, basically.

Kshitij Grover [00:06:04]:
Yeah. That makes a lot of sense. I think it's a good example, the permissions one, because I feel like that's the sort of thing where if you're a small team and you get dropped into AWS, it's not easy to fall into the successful outcome there. Right? It is very easy to misconfigure it or maybe even go the opposite way and add a bunch of processes or bottlenecks on your engineering velocity because you're very concerned about it. And sometimes maybe for no good reason. Right? You're kind of trying to add security in the spots where it doesn't actually matter. I'm curious, like, in developing those opinions, of course, you're taking your background, but how does Render think about building a product in an opinionated way? Is it that you have to talk to a bunch of companies? Or is it pretty clear what you should build, and it's just a matter of execution? Like, I'm wondering

Uma Chingunde [00:06:52]:
I think that's all of

Kshitij Grover [00:06:53]:
typically ambiguity. Yeah.

Uma Chingunde [00:06:55]:
I think that it's really all of the above. It almost, like, depends on the scenario. So for some things, you know, there's, like, a lot of deep intuition that the team has built. Like Anurag, Meagan is our head of product. Anurag is the CEO. They have built a lot of deep intuition. Or there's in some cases, we have sort of our own experience internally on dealing with this with many users that has sort of, you know, taught us. But the key is you always have to for a start-up, as I'm sure you understand, is, like, we always talk to our users.

Uma Chingunde [00:07:27]:
So, yeah, we have many different ways in which we talk to our users. There's a lot of proactive approach. We also have a strong opinion on this, which is potentially a bit different, but we actually take a very sophisticated support view. So our view is that our audience, developers, are actually, like, you know, highly sophisticated users. So when they are reaching out for support, they have typically tried a bunch of things themselves. So they want highly technical support. So we have a highly technical support team that actually helps our users, and they are an extremely good source of insights.

Uma Chingunde [00:08:03]:
I actually often say this. I've said this publicly, but they're actually, like, our secret resource. We very deliberately hired a team that is extremely hands-on, essentially developers themselves. So they understand developer pain points, and they do a great job of funneling. They're really advocates for a lot of our users, and they funnel that pain back and advocate for changes in the product.

Kshitij Grover [00:08:29]:
Yeah. That's interesting. I imagine having a really tight feedback loop with support is pretty key to improving the product. And it's almost like when someone reaches out to support, there's a lot of value in that conversation even though I imagine in the moment the priority is, like, let's just fix whatever production issue there is, and then let's, like, kind of analyze and retrospect on that conversation to see how we could have made someone feel more comfortable in that situation, understand the system better, kind of whatever the prompt is. I'm sure you get asked this a lot and you just referenced this, but it sounds like there is a potential at least that cloud providers like AWS want the same thing, right? They wanna make their abstractions less complicated and more developer-friendly. How do you think this plays out over the medium term? Of course, no one really knows, but I'm curious for your take on, is there a reason why this relationship would stay symbiotic in this way versus having, you know, someone like AWS just take their primitives and continue building up the chain?

Uma Chingunde [00:09:32]:
Yeah. So, like, I think it's actually plausible that there's literally no business that you can start that Amazon as a whole, not necessarily AWS, could not be disrupting. Right? So for example, at Stripe, it would actually be a frequently asked question at all hands, which is what if Amazon enters the payment space? There is actually literally no space that they are not capable of entering. So if you take that as the sort of answer or that as a thesis, a lot of start-ups wouldn't exist. You know? So I think they can and they do enter a lot of spaces. So that being said, it would be naive to sort of not think about, okay, why would they not build in this space? The thesis here is, well, it's certainly different, and I've lived through this myself. When you are in the business of providing a highly profitable product that is actually, like, extremely sticky, which is what the cloud business is already, and it was hugely disruptive, there actually isn't an immediate need to kind of go up the stack.

Uma Chingunde [00:10:43]:
Mhmm. So that's one key thing. But then there is also a certain DNA that companies need to build, and that is sort of sometimes interestingly, like, you know, it's hard to actually build that in-house and sometimes that DNA comes from outside, and we have seen this

Kshitij Grover [00:10:59]:
Yeah.

Uma Chingunde [00:10:59]:
Many times. Yeah. For instance, AWS itself should have, in my view, been built by the company that I used to work at called VMware, which actually pioneered virtualization, and they did not. Right. So you have to actually sometimes be on the other side to build that value add. And that's sort of, like, the core thing. But I also think that it's, again, very symbiotic in that there is a value. There's folks who want what Render is providing and we build off cloud providers and there are people who still want to go directly to the cloud provider.

Uma Chingunde [00:11:35]:
Obviously, our hope is that we will continue growing this piece of the pie over time.

Kshitij Grover [00:11:40]:
Yeah.

Uma Chingunde [00:11:40]:
But in our case, it's actually a growing pie. The market for the number of things that people want to develop and sort of, you know, host online is only growing, and we want to be part of that growing market. And we think that there's more people that want to be served than are being served by a platform like Render.

Kshitij Grover [00:12:04]:
Yeah. And it makes sense to me that there's also, as you're saying, a core competency sort of difference. Right? It's possible that the folks who wanna go work at AWS are really excited about building something like Dynamo where you have to think at a very low level or scaling, you know, something like S3 where, again, you're thinking about a class of guarantees or problems that are different than the ones you're thinking about at Render or at least adjacent to, where maybe it is just like a different set of concerns that you're focusing on rather than, you know, an engineer at AWS. I'm curious about actually digging into that a little bit. Where is your engineering team spending time today? Like, if we break down the team's capacity, what buckets would you put that in, in terms of maybe which products are they spending time on? And also just I'm sure there are a set of folks who are thinking purely about reliability and scaling. So how does the team break down?

Uma Chingunde [00:13:01]:
Yeah. So no. That's a great question. And I think we've actually designed our teams to map to how we sort of, you know, want to think of our investments because, you know, if your architecture is going to reflect the org, the way to think of it is to actually build your org to reflect what you want the architecture to be. So we have a pretty strong investment in infrastructure and security. Those are sort of our foundational teams. Security is smaller, but we have a pretty large infrastructure team and they're the ones that build the key foundational infrastructure abstractions on top of cloud providers. And then we have the core product teams, and we can kind of think of them in terms of the user journey.

Uma Chingunde [00:13:41]:
So we have a team that's sort of thinking of new users that's called activation, and that's a very small nimble team that iterates very quickly on, you know, user onboarding and that part of the user journey. And then core, within the product, are two teams that focus on basically data stores, which is our managed database products, and the rest of the capabilities. So all of the other platform capabilities that all users of Render will use. Everything that's sort of in the product. And then we have another team that focuses on our largest users and sort of enterprise users. And that's the sort of split as we think about it.

Kshitij Grover [00:14:25]:
That makes sense. One thing that comes to mind when you describe that is, and I think just even in the way that you frame the first team versus the others, how do you think about risk tolerance differently across the stack, right? So I imagine there are some teams where you have to be, maybe security is an example, but even in some of these core infrastructure teams, the cost of a mistake is very high.

Uma Chingunde [00:14:46]:
Yep.

Kshitij Grover [00:14:47]:
As you know, no one wants a bug in the UI, especially if you're a developer experience company. But in some sense, it is less costly than a correctness bug in the core layers. So is that something that has changed over time or yeah. Like, how do you message that internally, the risk tolerance piece?

Uma Chingunde [00:14:23]:
I think that's actually something that has changed many times almost, and I think it's a bit almost like the weather, where it actually changes a bit based on many different things. And the reason I say it's like the weather is that there are actually things outside of our control that affect risk tolerance. So as an example, over a year ago, we had this external event where a competitor was going through some struggles, and then they also announced the end of their free tier. And we had a huge influx of new users. And the platform was, like, really creaking at the edges. So we sort of looked at it and we said, okay, for the next sort of, like, 4 weeks, no risky changes. Everyone is going to be, like, super careful. All we're gonna do is, like, fix things that are breaking and sort of, you know. So it is a bit dependent on many different things.

Uma Chingunde [00:15:57]:
But in the steady state, the way we think about it is sort of, like, almost on the change itself. So the idea is that even on infra, there are many, like, safe and easy-to-roll changes, and we try to push the decision-making to the individual engineers as much as possible, and speed is always at a premium. So the idea is that if it's a low-risk change, you sort of use your standard process and roll it. If it's a risky change, for example, a large migration, then you have to be, like, really careful about it.

Kshitij Grover [00:16:29]:
Yeah.

Uma Chingunde [00:16:30]:
And something that we do is when we have incidents, we talk about them. And the idea is that sort of folds back into the learning and the growing of the organization.

Kshitij Grover [00:16:29]:
Are there specific technical decisions you think in the process of building the platform today that have given you a lot of mileage in terms of derisking future changes? Is there an abstraction layer, or it could even just be a process that you have instilled, that has made people more confident with changes that would otherwise feel really risky?

Uma Chingunde [00:17:03]:
I think there are some changes that we've made, which is sort of very deliberately investing in a lot of understanding of infrastructure, which has been a journey, but I would say that the team very deeply understands infrastructure to a point where they are able to unpack things much more quickly than even, like, you know, some of the cloud. That's been, like, a core investment. And then I think other key decisions have been, when we have temporarily deviated from, like, the good architectural path, to sort of invest in bringing it back into the fold, which has allowed people to move faster. So that's actually something we've talked about publicly in our blog. For example, we launched a free tier two and a half years ago, and that was launched very quickly. And then over time, we realized that was a source of, you know, rough edges because it was a separate sort of almost like a fork. So there was a conscious decision made to bring it back in.

Uma Chingunde [00:18:09]:
So trying to sort of prevent very custom or, like, separate instances as a very deliberate design decision has been, like, a key thing, where the closer you keep all of your code base to not being, like, you know, unique forks, that's

Kshitij Grover [00:18:26]:
a key one.

Uma Chingunde [00:18:08]:
But I would say that we're still very much a work in progress. And this is something that we're actually investing in right now. We found that the sort of, like, development of our own developer workflow, like CI and end-to-end tests, has basically accumulated tech debt over the last few years, and we then decided, okay, that's slowing us down. So we're actually investing. A couple of people are actually taking time right now to essentially redo it, and we're actually moving off an older CI system to GitHub Actions currently. And the idea is making those key investments is maybe the most conscious thing that we've been doing.

Kshitij Grover [00:19:07]:
Yeah. And one of the things you said there was not letting the code base diverge too much so that you're not managing multiple mental models of the world. I imagine one challenge you face is building a platform like Render where you have to service the hobbyists, right, so a couple of people versus much larger companies. I imagine it's easy to let enterprises drive the roadmap and in some cases that probably makes sense, right? Revenue is important and in fact, it's just important to have feedback from the more mature companies to inform what problems the smaller companies might hit in the future. But on the other hand, you're of course building a platform that should work for these small companies and not make it too complex and not expose them to that level of maturity too early. I'm curious if that resonates and if there are things you do to avoid having larger enterprises kind of dominate the conversation in terms of what investments you make.

Uma Chingunde [00:20:10]:
No. That's 100%, I think, which is sort of why the team split again. You know, like, the larger enterprise team is the one. So the idea is that investment then dictates, like, you know, that's the amount of investment we will make. But that being said, we think of our largest and fastest-growing users as sort of being the leading edge. So they're sort of, you know, in many ways where we want to be. And our goal is always to stay ahead of where they are. So when they say that, hey, they're running into these things.

Uma Chingunde [00:20:09]:
We're like, oh, that's sort of where we need to be, and we need to be there before them. And that's enabled by the tight feedback loop. The way we work on that sort of request and that sort of life cycle is essentially to keep that tight feedback loop, listen to those requests, and wherever possible, we actually try to build the things that can then be generalized. And so, essentially, productizing those features as quickly as possible. So we'll actually, like, you know, build them early based on user feedback. But we are able to, generally, for the most part, quickly understand, like, "Hey, is this something that applies only to this user? Or is it that they are the first ones asking for it and many others are soon gonna be able to use it?" And as long as it is something that can be generalized for, like, all of our users, we tend to actually build it and then productize it. The interesting thing we found is that when users want a feature, they typically want it to be, like, the GA version of the feature.

Uma Chingunde [00:21:45]:
Most users don't want to be on the bleeding edge of the platform. Right? So that actually works both ways. As long as we're, okay, this is something that makes sense, probably all users are gonna need this in 3 months, 6 months, they're just the first ones to ask for it, we build it and productize it. We do have what we call early access where we do opt some of these users in early. But most of them actually will say that they want to wait for GA.

Kshitij Grover [00:22:15]:
The GA to work. Yeah. And, actually, I imagine that the thing that becomes tempting as a result of that is to continue expanding the product suite. Right? There are a lot of services you already offer and a lot more you could. One, I'm curious, like, in general, how do you define the boundaries of the product? And I think part of the answer must be what you said earlier, which is you give escape hatches and you don't try to productize everything. But in the kind of horizontal way, you still have to manage the boundaries. And maybe as a part of that, how do you think about build versus buy as an internal decision, right, where there are certain products that, you know, you can quote unquote wrap and certain products that you will want to build? What is the distinction there?

Uma Chingunde [00:23:02]:
This is, like, a great question because I spend a lot of time thinking about it. So one of the key grounding things is what's part of the core developer experience for us. The thing that has evolved is that developing individual developers into teams is a core part for us, and that's why we are optimizing this for the developer experience for individuals and teams, and then we have these personas internally. And whatever makes sense for that core developer experience, we will build ourselves. Now there can be a difference, though. There's some things that are operationally hard to build, and then there are things that are operationally hard to manage. And then some fall into both buckets.

Kshitij Grover [00:23:46]:
Right.

Uma Chingunde [00:23:00]:
Right? So an example of something that is operationally hard to both build and manage is something that's, like, very deep and infrastructure-heavy. And so an example of if that falls into a core

Uma Chingunde [00:23:58]:
developer experience, that is, developers think of it as the fundamental thing that they want, then we build it. It's hard to build and hard to manage, but it's something that developers will absolutely want as part of the core developer experience. Like, if you're building an app, you likely have data you want to store somewhere. Right. That's part of it. So the idea is, okay, we will build it. And then it's, like, sort of how we will build it: whether we will wrap an external thing but, like, still manage it ourselves, or will we offer a plugin? So if it's not part of the core developer experience, then I'd like to sort of take the buy approach, where it's something that people want, but not something that we want to develop core competency in. That's something like becoming like a CDN or, you know, preventing DDoS attacks. And that's where we use Cloudflare, and we're pretty open about the fact that we use them.

Uma Chingunde [00:24:56]:
So that's, like, something you absolutely want. If you're hosting something on the web, you actually want it to stay up and not be DDoSed. But it's not something that we as a company want to become experts in. So that's sort of like the build versus the buy. And there's another category which is purely internal tools. Right? So for those, I strongly believe, unless it's something that's, like, really fundamental, and particularly if it's not load bearing, we tend to buy. So, you know, buy or use open source versions of things like CI or, you know, incident management tools, many others.

Kshitij Grover [00:25:30]:
Right. Yeah. That makes sense to me. It almost sounds like the more you can kind of separate it from the rest of the workflow, it makes sense to kind of buy something like DDoS protection. Whereas, the closer it is and the more it has to fit together with the rest of the workflow, you don't want it to be a disjointed experience. Right? And so you do need to think about it as a single unit. So I think in the process of, you know, whether it's building these or wrapping them, regardless, you serve a lot of web traffic. And so Render, as a result, cannot be a cheap product to run.

Kshitij Grover [00:26:08]:
Also, there are a lot of different use cases you're serving. Some are very mission-critical and some may be less so. But either way, I think you're incurring costs there. So how do you think about the cloud economic story, and how do you structure your pricing and packaging, or even just think about internal margins regardless of what the public pricing is?

Uma Chingunde [00:26:30]:
Yeah. It's such a great topic and one that no one would have asked, like, in, like, you know, 3 years or

Kshitij Grover [00:26:34]:
so. Right? 2020.

Uma Chingunde [00:26:07]:
2020, who margins? What's that? And then now in this current era, margins are what everyone talks about. So, no, it's actually both a really interesting problem, but also an opportunity. So sort of the same way as we talk about undifferentiated toil and when you amortize that undifferentiated toil across the board, you can think of cloud economics in the same way. You know? It's much more cost-efficient for us to manage all of these resources on behalf of our users instead of each of these users trying to negotiate individually with each of the providers. Right? Because cloud providers are very incentivized to have volume discounts, similarly. Right. We're talking about, like, packaging up a huge amount of that. And the same way on the technical side, it's much simpler, or at least you can actually develop a strategy when you are dealing with a volume or level of usage.

Uma Chingunde [00:27:28]:
And you can, you know, build ways to kind of essentially optimize that. And that's what we have basically spent time on. We have spent a year and a half at this point developing this, so there are a lot of ways in which, particularly at high volumes, and all of our compute and data network usage is extremely high volume. Right. This is what our infrastructure team has basically spent time doing, where they have really optimized the usage of these resources. So, you know, for everything you use, it isn't as simple, it's very much not a binary decision.

Kshitij Grover [00:28:06]:
Right.

Uma Chingunde [00:28:07]:
It's a bunch of, like, architecting, and

Kshitij Grover [00:28:09]:
Yeah.

Uma Chingunde [00:28:09]:
this is again the thing where we can become. We actually made a decision that we would become the experts on this, which is working with someone else and/or, like, you know, even tooling doesn't actually fully satisfy. So where we are at this point is we deeply understand how usage works under the hood, and so we are able to very strategically optimize that usage. You know, just like very simple examples. Like, how should traffic be routed between regions? Like, you know, where does it make sense to send data? Where does it make sense to store data? And, you know, what's, like, local versus being sort of offloaded to something like Glacier? We are able to make all of those decisions at scale for each of our vendors. And so that, over time, adds up to significant margins.

Kshitij Grover [00:29:00]:
Yeah. That makes a lot of sense to me. And it also sounds like this is a very hard problem for you to not internalize because it affects how you ultimately build the technical infrastructure. Right? It's not an afterthought or, like, a side effect. It is pretty core to the way data flows. And maybe even in some cases, the sort of performance, latency, like, even consistency guarantees you can offer as a product on top of that. It's also really interesting you talk about this idea because you're managing so many different tenants, you get to then run that optimization problem in a way that a single tenant couldn't. Right? And then they're maybe not in a position to go to their cloud provider and say, hey.

Kshitij Grover [00:29:44]:
We need a volume discount, whereas Render is. Right?

Uma Chingunde [00:29:47]:
Yeah.

Kshitij Grover [00:29:48]:
So maybe the other thing that's interesting to explore is this idea of, you know, startups big or small trusting you on the reliability front in this mission-critical workload. How do you think about, I wanna say the pressure that comes with that, but maybe not even the stress or pressure, just in general the responsibility of running all of these? I know, as you just said, you have a team staffed around that, but in general, is this a problem where, when you're in these conversations on the go-to-market side, it's confidence building? Is it, like, showing people, hey, here's how the technical architecture is built? Like, how do you navigate those conversations with companies?

Uma Chingunde [00:30:29]:
No. This is, again, what I think is so important. Right? It's like, when you're a platform, you know, you're selling a lot of different things, but the core thing you're selling is peace of mind. Right? And that's something we talk about a lot internally. It depends on our users. Some users want to get into the nitty-gritty of how. But what we have found is what people want is for us to show the work, not talk about it.

Uma Chingunde [00:30:58]:
Right? It doesn't matter how well you're doing if the nines aren't there. How resilient are we? How well do we respond? How well do we take responsibility when we do sort of, you know, have an incident? And we have a very high degree of rigor around this. So we're pretty proud of our stability and resilience at this point. It wasn't always the case. It's taken a long time to get here. Right? But it's always a work in progress.

Uma Chingunde [00:31:35]:
It's the kind of thing we have to constantly invest in to stay ahead of the curve. Because, again, it's like everything you're building and every new set of users kind of, you know, starts stress testing it. And then the more things you add, the more, you know, your underlying provider, for example, can also have their own issues, and those start.

Kshitij Grover [00:31:57]:
Right.

Uma Chingunde [00:31:57]:
The framing that I do think is important, especially for, like, smaller users, is, you know, we are building this and we're gonna be as reliable as possible. And in many ways, I think of it a bit like, you know, a sense of control. There's a sort of word that I'm forgetting, but, like, you know, people feel less safe flying in planes versus driving, and that's sort of a bit of education. Right? Like, you're not gonna have fewer incidents if you're running it yourself. It's just that you might feel more confident because you're in control.

Kshitij Grover [00:32:37]:
Yeah. And Yeah.

Uma Chingunde [00:32:38]:
So we've had to sort of, you know, message that, but also make sure our reliability is completely top notch.

Kshitij Grover [00:32:46]:
And this is probably where this idea of having a very technical support team comes into play. Right? Because if there is an incident and they reach out and they're talking to your support team, you know, the job of support, of course, is to fix the issue, but also to instill the confidence that, like, hey, we're on it. We understand what the issue is. We can help you navigate this issue. And I have to imagine, for a company that's 3 or 4 or 10 people, that can be a scary moment, which is, hey, if none of us can figure it out, we need to lean on someone else for the expertise. And, yeah, that's always a conversation that you need to have internally. Yeah.

Uma Chingunde [00:33:22]:
And oversharing, honestly. Like, we just never take our users for granted. Right? A key thing is just sort of, you know, our users are basically people like us, and we never take them for granted. And we always imagine that they're having the conversations that we would have in their situation. So always being upfront and transparent. Like, pretty much for every critical incident, or above a certain degree, we actually publish RCAs. So we take very much an approach of, like, communicate as much as possible and share upfront. And the idea is that by being transparent and communicating, users can sort of, you know, join us in that journey.

Kshitij Grover [00:34:05]:
Yeah. One thing I'm curious about is how that level of transparency translates to internal team culture. Right? Because I imagine from a business standpoint, that makes a lot of sense. That being said, if I'm an engineer at Render and I'm thinking, like, if I mess this up, this might end up in a blog post where we have to share it from the transparency angle. Do you feel like people get nervous about that? Or do you think there's a lot of buy-in that, hey, this is the responsibility of working on this a while?

Uma Chingunde [00:34:34]:
It's transparency but not blame. Right? I think that's the key. So it's like, you know, incidents are always 100% blameless. Right? Like, in the incident, the first goal is resolve the thing, not blame.

Kshitij Grover [00:34:49]:
Yeah. Yeah.

Uma Chingunde [00:34:50]:
It's never, like, you know, we never even, for instance, reference who pushed the PR. It's always, like, how are we rolling back? How are we getting to a good state? The same thing in debriefs. After any major incident, we have a debrief. I think people sometimes equate transparency with, like, you know, finger pointing, and that's the opposite. So there's never an element of blame. I think there's always, again, the responsibility and transparency, but never, you know, "Oh, why did x do so and so?" It's just like, "Oh, the system allowed us to do this, and how do we fix the system to get better?"

Kshitij Grover [00:35:33]:
Yeah. So, I mean, suppose there's a production incident. I imagine, you know, as an engineering leader, there's always a temptation to get into the weeds and fix the thing. And then I'm sure there's a voice in the back of your head that says, like, look, I just need to trust the team to go and fix this. What is your role in production incidents these days? How often are you all hands on deck, like, digging into the technical issue versus maybe doing customer comms or helping organize the team around it?

Uma Chingunde [00:36:00]:
I think the key learning that I actually had prior to Render, at Stripe, is leaders being in the weeds, honestly, is almost always a distraction to the actual leader. Right? This is the culture which I was very grateful to learn: the idea was to let the people fixing things fix it, because that's the fastest path to resolution. Really, that's just in everyone's best interest. What leaders can do is comms and any sort of traffic routing. And that's how I see my role as well as that of engineering managers. So traffic routing and kind of being calm and making sure that the team has what it needs. So

Kshitij Grover [00:36:41]:
Mhmm.

Uma Chingunde [00:36:42]:
You know, in the kind of incident where you have to, like, have a plan A and a plan B and a plan C, that's something that I or another senior person might do in an incident, and that's what the team knows to rely on us for. And they know, like, if it's, you know, hey, can we ask the user x versus y? Then I can, like, weigh in on that comms or review stuff. But that very much is the right way. Or I might ask questions like, "Hey, should we page in others?" or things that, you know, sometimes in the heat of the moment, people get tunnel vision, or they might be hesitant to say, like, hey, should we? So that's the role of a good incident commander, and that's how I see, obviously, my role and that of others. Like, you don't sort of just wait on the sidelines. That's a key distinction. You do want to be involved if it's particularly bad, but in a way that's actively helpful and not just sort of running around yelling at people.

Kshitij Grover [00:37:38]:
Yeah. Yeah. Exactly. Not just causing more panic when people are already stressed. One thing I'm curious about is whether there was a technical decision that, if you look back on it, you made, or an improvement that you made, that reduced the rate of incidents. I mean, obviously, you know, in general, the trend line is the platform gets more and more stable. But was there, like, a step jump where you're, like, "Oh, we, you know, switched from this strategy to this other strategy, or this service to this other service, that really helped us?"

Uma Chingunde [00:38:05]:
I think multiple ones. The one I referenced a while ago, like, just earlier today, around the sort of rearchitecting of the free tier, and that's actually on our blog. That was a big one. So there's, like, multiple sorts of key things, but the free tier one was a big one because it was causing a lot of internal instability until, basically, a couple of engineers were like, it's time. We're gonna do it, and just, like, went and did it. And one of them is the author of the blog post. Another sort of, like, step difference was, like, in the summer of 2020, we had a really bad incident, multiple hours. And then after that, we basically stopped a bunch of work to essentially rearchitect our underlying clusters in a particular, much more resilient way and redo a bunch of things, and we basically spent a couple of months cleaning up.

Uma Chingunde [00:38:59]:
And that, I think, again, showed a step function difference. And what we try to also do consciously is look at things that we learn from them and how do we sort of, like, incrementally make them better. So that one was another key decision. And then a key decision, in addition, was sort of like a technical decision, in this case to go with a particular vendor. For a while, we were actually, like, you know, trying to build a muscle ourselves to deal with DDoS attacks. And then we decided, actually, this isn't the right use of engineering time, and we are going to put ourselves behind a vendor. And that was actually a core investment. So it isn't like a drop-in replacement.

Uma Chingunde [00:39:42]:
You don't just, like, buy something and just drop it in. Right? So that was again a core investment, where that again addressed a certain class of incidents. Like, think of these as cases where an investment took care of a whole class of incidents. So these 3 are sort of, like, key examples.

Kshitij Grover [00:40:01]:
Yes. That makes sense. It sounds like one common pattern or thread between those is the incremental improvements weren't really working, and so you had to kind of zoom out. Maybe there's some thrash to saying, okay. We're gonna stop here and we're going to just take a completely different approach. That might mean, you know, 2, 3 months of pain, but it's going to pay off in the end. And ultimately, that's a bet you make. Right? And I'm sure it doesn't always pay off, but it is a sort of thing you have to be willing to do or at least make a bet on in the leadership position where

Uma Chingunde [00:40:34]:
Yep.

Kshitij Grover [00:40:35]:
I'm sure, like, other people are not confident enough to say, "Hey, we should back out of our plan."

Uma Chingunde [00:40:40]:
Yep. I think that's key. I do think in all of these cases, it was not like me sort of, you know, joining a Zoom and saying we're doing x. Right? The process to come to the decision was always much more, you know, data gathering, for instance. Actually, the free-tier change was completely engineering-driven because they were feeling the biggest pain. And they were just like, we're just gonna make this change. Like, rearchitect this, and then it'll get rid of. So the reason I think those three examples are sort of different is because each one actually had a very different decision-making process.

Uma Chingunde [00:41:16]:
It was actually, like, engineers saying, no. This is too much. Like, we just have to rewrite this whole thing. And, really, my decision in that case was really just to say, okay. Like, yes. We have to do it. In the DDoS protection case, there was actually, like, a pretty in-depth investigation on even, like, picking the vendor and how we would integrate and similar.

Uma Chingunde [00:41:36]:
And in the other case, the incident, that was actually a series of prioritization efforts almost that we made between, like, myself and all different parts of the engineering team. So each one was almost like a different engineering possibility. I think the common theme is reaching that point where you say, okay, something different has to be done here, and then working backward to a solution from that.

Kshitij Grover [00:42:04]:
Yep. That makes sense. Okay. Sweet. Well, maybe the last thing I wanted to ask you is what's coming next? Like, what are you most excited about? It could be a product launch. It could be, you know, a technical release, but what's next for Render? And personally, what are you most excited about?

Uma Chingunde [00:42:20]:
Yeah. I'm really excited about a bunch of different features that are coming up. And what we've seen, and I'm sure, again, everyone is probably, like, you know, very well aware, more and more AI applications are being built. So what we have started seeing is a difference in the requests we're getting to support them. And the surprising thing is that it's less things like GPUs and more things like vectorized databases or object store or similar things. And I'm actually very excited that a few of our teams are building features. So something like vectorized databases actually was already launched last year, pgvector as a plug-in.

Uma Chingunde [00:43:08]:
So I was very excited about that. And now what I'm excited about coming up is we just kicked off building object storage, and that's something where we'll actually be building object storage into the Render platform. And that's one of those, again, big, meaty features that a lot of our users have been asking for for a very long time. That's, you know, something that is, again, core and very complex to build, but very exciting to be able to offer our users. I think that's actually, like, a key sort of differentiator, that sort of feature for us.

Kshitij Grover [00:43:42]:
Awesome. Well, that sounds very exciting, and thank you again for your time and for coming on the show. Yeah. Appreciate it.

Uma Chingunde [00:43:48]:
It was great chatting, Kshitij. Thanks so much for having me.

Kshitij Grover [00:43:51]:
Of course.
