Episode #12: Reducing MTTR in Serverless Environments with Emrah Şamdan

Jeremy chats with Emrah Şamdan about discovering problems before they cause issues, what constitutes signals of failure in serverless applications, and how we can be better prepared to respond to incidents.

Show Notes

About Emrah Şamdan

Emrah Şamdan is the VP of Product at Thundra, a tool to provide serverless observability for AWS Lambda environments. With the development team, Emrah is obsessed with helping the serverless community with their debugging and monitoring effort both in production and during development. He is responsible for making trouble for the Thundra engineering team while finding solutions to ease the life of serverless teams.

Twitter: @emrahsamdan
Thundra: Thundra.io
Blog: blog.thundra.io
Demo: demo.thundra.io

Transcript

Jeremy: Hi, everyone. I'm Jeremy Daly and you're listening to Serverless Chats. This week, I'm chatting with Emrah Şamdan. Hi, Emrah. Thanks for joining me.

Emrah: Hey, Jeremy. Thanks a lot for having me today.

Jeremy: So you're the VP of product at Thundra. So why don't you tell the listeners a little bit about yourself, your background and what Thundra is up to.

Emrah: Yeah, sure. So you know me as a product manager for Thundra. I started as a product manager at Thundra while it was an internal project in OpsGenie. We were some engineers, me, and some designers who began an internal product for OpsGenie engineers. Then it turned out to be a product and a company, and now we are serving serverless developers with observability. So in 2017, Serkan, our CEO, who is actually also acting as CTO, was developing some modules of OpsGenie with AWS Lambda. He had some problems with observability and he couldn't find any solution that fit his purposes. And he said, hey, I can write general libraries (they were writing in Java at that time) which can give me some idea of how my Lambda functions are performing. He developed this as an extracurricular activity for OpsGenie, and he made it available, and it was sending data to Elastic at that time. They were seeing the data that Thundra produced. And even before I joined OpsGenie, they thought, why don't we make it a separate product? Why don't we make it a separate company? Then I joined; they hired me as a product manager for that. In October last year, in 2018, we decided to spin it off as a separate company because, you know, OpsGenie was sold to Atlassian and Thundra would continue as a separate company. And we are helping serverless developers with observability by aggregating traces, metrics and logs.

Jeremy: Very cool. All right, so I wanted to talk to you today about reducing MTTR in serverless environments, because I think when we think about mean time to repair, normally we have a lot of control. Like if we're running our applications on-prem, then we likely have access to the physical servers and the hardware components, and even if we're running our applications on something like EC2, we still have access to the operating systems, the VM instance sizes, the attached storage, and the same is typically true with containers as well, right? So we have a lot of ways in which we can affect the time it takes to repair some of these hardware or even scale issues. But if you are in a serverless environment, then it’s quite a bit different, especially if you're using a lot of managed services from the cloud provider, you really don’t have access to the underlying operating systems or hardware anymore. And I know some people have changed the “R” in MTTR to mean “recovery” or “resolution” since it’s really less about actually repairing hardware. But maybe we can start there, maybe you can give us your thoughts on what's different with how we respond to incidents in serverless versus how we would respond to incidents with more traditional applications.

Emrah: Definitely. So in traditional applications, as you say, there are some resources from which we can easily gather information when we see some problems, some incidents in our system. But in serverless, on the other hand, you have different piles of logs, which come out of the box from CloudWatch, from the resources that the cloud vendor provides. But these are actually separate, and they are not giving the full picture of what happened in the distributed serverless environment. And the problems here are different. In a traditional environment, the problem, most of the time, was actually about scalability, and you were responding to that by giving more resources, by just increasing the power of your system. But with serverless, a problem can occur in any part of a distributed system, and you need more than log files. You need all three pillars of observability: traces, metrics, and logs. In our case, traces means distributed traces, which show the interaction between Lambda functions and the managed APIs, the managed resources, and third-party APIs, and local traces, which show what happens inside the Lambda function, and then the metrics and the logs.

Jeremy: Yeah, right. And I think you’re right that the distributed nature of serverless is something that might be relatively new to people as well, so just figuring out where the problem is, or what component is causing the issue, is a challenge in and of itself. So the point about metrics is interesting too, because as you just said, the scalability is handled by the cloud provider for you with most of these services. So we’re likely not as worried about low level metrics like CPU usage anymore. So what are the signals of failure, like, how do we know that something is broken? What are the things that tell us something might be wrong in our application that we might want to address?

Emrah: Yeah, sure. So if scalability is not the problem, what might be the problem? What might be the good metrics that we should look at? In this case, there are some metrics which are actually very predictable for everyone listening here, but they are actually saving our system’s availability a lot. The first, and the most important, is actually latency, because our aim is to not receive timeout alerts, right? So the latency metric, the duration of invocations, is something that we should keep our eye on. You need to keep an eye on how long your functions take. You need to see if the duration is approaching the timeout, and if it is, you should check what the reason might be. Is there a problem with the third-party APIs? Is the problem with the resources that we are using? When you see a latency increase, you should approach it very, very carefully because you don't want to be in a storm of timeout errors. The second metric, in my opinion, is memory usage. So you know, we are provisioning memory for a Lambda function, and this is the only thing that we control in serverless. Most of the time, developers give the function more memory than it actually requires just in order to speed up the IO and throughput and, let’s say, the CPU. But in this case, we have problems with the cost. You know, whenever your function gets triggered, it runs for a time, and our billing is decided in GB-seconds. So if you allocate more memory than necessary, you may be losing some money on Lambda. This might be negligible if you don't use Lambda excessively. But if you're using Lambda in production and you're mostly on Lambda and serverless, you'll have some problems with the costs. In order to avoid that, you should tune your memory accordingly, and I love what Alex Casalboni does about this. 
You should see which memory configuration your function performs best with. Even after that, you should keep track of memory usage and see if there's a jump that you don't expect, and then you can tune it again. You should tune it again because there might be some changes in the managed resources you're using, or some changes in the inputs that you are processing. So memory is something that you should keep your eye on even after you successfully and carefully tune it. And the other metric is actually not related to Lambda itself, but to the resources that we are using. So let's say, for example, you are using Kinesis or you are using SQS; you need to keep an eye on how your Lambda function is performing in terms of these managed resources. You should keep an eye on the Kinesis iterator age. You should take a careful look at the SQS queue size in order to not overflow with messages and not have data loss.
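
The cost and timeout points above come down to simple arithmetic. The following is a minimal sketch, not an official pricing calculator: the price constant is an assumption (actual Lambda prices depend on region and architecture), the helper names are hypothetical, and the per-request fee and billing granularity are ignored.

```python
# Assumed price per GB-second in USD; check current AWS Lambda pricing
# for your region and architecture before relying on this number.
PRICE_PER_GB_SECOND = 0.0000166667


def invocation_cost(memory_mb: int, duration_ms: float) -> float:
    """Estimated compute cost in USD of one invocation (request fee excluded)."""
    gigabytes = memory_mb / 1024
    seconds = duration_ms / 1000
    return gigabytes * seconds * PRICE_PER_GB_SECOND


def monthly_cost(memory_mb: int, duration_ms: float, invocations: int) -> float:
    """Estimated compute cost in USD for a number of similar invocations."""
    return invocation_cost(memory_mb, duration_ms) * invocations


def timeout_headroom(duration_ms: float, timeout_ms: float) -> float:
    """Fraction of the configured timeout that is still unused (0.0 = timed out)."""
    return max(0.0, 1 - duration_ms / timeout_ms)


if __name__ == "__main__":
    # Compare two memory settings for the same workload. More memory usually
    # shortens the duration, so the cheaper option is not obvious up front.
    for memory_mb, duration_ms in [(512, 400), (1024, 180)]:
        cost = monthly_cost(memory_mb, duration_ms, invocations=10_000_000)
        print(f"{memory_mb} MB at {duration_ms} ms: ${cost:.2f} per 10M invocations")
    print(f"headroom: {timeout_headroom(2700, 3000):.0%}")
```

Comparing a few memory settings this way shows why tuning is worth repeating: a larger setting can be cheaper overall if it shortens the duration enough.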

Jeremy: Right. And so then, all this stuff is telling us when we see the increased number of timeout errors, for example, then we could assume that it’s potentially a third-party service or one of the managed services we’re using is taking longer than it needs to. And if we see things, like you said, the iterator age for Kinesis or the SQS queue size growing, those sorts of things give us indications that either our application isn't performing correctly or maybe some sort of downstream service isn't performing correctly. But distributed systems have failures all the time, right? Random connectivity or network latency issues. So should we be worried about occasional timeout errors here and there, or is it more in the aggregate that we want to look at?

Emrah: Yeah. So when a timeout error happens, it depends on your use case with Lambda. A single timeout message might not be the end of the world for you. But again, it depends on your use case. When you have something that's critical, one timeout error can be significant. But sometimes, when you are doing data processing and you can just retry from Kinesis, let's say, it doesn't mean that much. And the signals can also come from latency again. When you are processing your messages, it may not be that problematic when the invocation duration increases to some extent, because that might be the steady state of your functions: when there is a lot of load on your system, the latency can increase to some extent and it may not be that problematic. So what I'm trying to say is that there might be a range in which your function performs daily, weekly, maybe monthly, and there might be some spikes because of traffic or because of third-party API slowdowns. These may not always signal failure, but something normal in this steady state, and you should know about it by constantly observing your system.

Jeremy: And what about things, though, that you maybe could address? So I talked to Hillel Solow the other day and we were talking about flooding Kinesis streams with junk data and then having trouble draining those. And that was more on the security side, but I can see things like bad messages in an SQS queue without the proper redrive policy causing lots of problems too. So how do we know those sort of things are happening?

Emrah: So in this case, you should have an observability tool that helps you see the request and response going through your Lambda function. So let's say that you have this function, it gets triggered by Kinesis, it gets some message, and it doesn't throw any timeout, but it just throws some input error. In this case, you should set an alert for this condition: when I see this error type excessively, more than 10 times in a row, I should get an alert, and I should just go to SQS or go to Kinesis and take out the poisonous message.
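
The "more than 10 times in a row" rule described above can be sketched as a small streak counter. This is an illustration under assumptions, not how Thundra or any other tool implements alerting; the class name and the error type string are hypothetical.

```python
from typing import Optional


class ConsecutiveErrorAlert:
    """Fires once when the same error type repeats more than `threshold` times in a row."""

    def __init__(self, threshold: int = 10):
        self.threshold = threshold
        self.last_error: Optional[str] = None
        self.streak = 0

    def record(self, error_type: Optional[str]) -> bool:
        """Record one invocation outcome; pass None for a successful invocation.

        Returns True only on the invocation that pushes the streak past the
        threshold, so a poison message produces one alert, not one per retry.
        """
        if error_type is None or error_type != self.last_error:
            # A success or a different error type breaks the current streak.
            self.last_error = error_type
            self.streak = 0 if error_type is None else 1
        else:
            self.streak += 1
        return self.streak == self.threshold + 1


if __name__ == "__main__":
    alert = ConsecutiveErrorAlert(threshold=10)
    for attempt in range(1, 15):
        if alert.record("InputValidationError"):
            print(f"alert on attempt {attempt}: inspect the queue or stream for a poison message")
```

A success or a different error type resets the streak, so occasional unrelated failures do not add up to an alert.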

Jeremy: Okay, so I feel like the SQS situation, where maybe you have a poison message in there, and it's just causing the function to fail over and over again, that those are fairly obvious when you start getting those exceptions. But what about something like, so for example. So this happens to me on one of my projects. I get an error that says something like, and it includes a stack trace, but it's something like a mapping error. And it happens maybe once or twice a week, you know, it's something like 0.001% of my invocations. So, very, very small. It's this sort of edge case, and I know it's just something that I need to address, but I see other errors like that too where something will just kind of pop up and they feel like they're anomalies. Maybe, or maybe just outliers, but, how worried should we be about things like that?

Emrah: So again, it depends. The first thing that you should be aware of is that outliers happen. These kinds of situations happen, and as I said, even the outlier can be part of this steady state, but you should be able to understand the reasons for the outliers. I talk with many customers, many Lambda users, and they have this information. They’ll tell you, “The 99th percentile of my function invocation duration is, let's say, one second. And normally it’s like 200 milliseconds.” And when I ask them, “What is the reason?”, most of the time [they say,] “I don't know. Maybe this third-party API that we are using. I see it’s slow sometimes.” But some of the time, it’s not this third party, but, let's say, the DynamoDB table that they use. So when your function has an abnormal invocation duration, you should know, maybe not exactly at the same time, but later, as a retro, what caused it. So in Thundra we provide a kind of heat map to our customers and they see what the outliers are in the heat map. And when they focus on that, they see what the normal usage of, let’s say, Dynamo is and what the value of Dynamo is in this outlier region. It helps them see what is jumping during the outliers. After that, they can also have a closer look at the outliers. What are the inputs? What are the outputs? What made my Dynamo run slower than normal? It might be because of the input, it might be because of a bad coding practice, and it might be just because of Dynamo itself. We actually did an experiment with Yan Cui, maybe two months ago, and it was about DynamoDB keep-alive connections. He was seeing some spikes in DynamoDB duration, even though he was using keep-alive. And we saw with Thundra that it's actually renewing the connection even with keep-alive. So we caught how DynamoDB actually renews the connection, and that it causes an outlier. 
So if we just trust DynamoDB keep-alive by itself, it may sometimes be causing problems. But detecting this and knowing about it actually makes us prepared for such kinds of frustration.
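
The outlier analysis described above (compare the 99th percentile with the typical duration, then look at what jumped inside the slow invocations) can be approximated with the standard library. This is a rough sketch: the record layout with `duration_ms` and `spans_ms` is a hypothetical stand-in for whatever your tracing tool exports, and the factor of 3 is an arbitrary assumption.

```python
from statistics import median, quantiles


def p99(durations_ms):
    """99th percentile; quantiles(n=100) returns 99 cut points, the last is p99."""
    return quantiles(durations_ms, n=100, method="inclusive")[-1]


def outliers(invocations, factor=3.0):
    """Invocations whose total duration is more than `factor` times the median."""
    typical = median(inv["duration_ms"] for inv in invocations)
    return [inv for inv in invocations if inv["duration_ms"] > factor * typical]


def slowest_span(invocation):
    """Name of the dependency that took the longest inside one invocation."""
    spans = invocation["spans_ms"]
    return max(spans, key=spans.get)


if __name__ == "__main__":
    # Hypothetical trace data: 99 normal invocations and one slow one.
    normal = {"duration_ms": 200, "spans_ms": {"dynamodb": 20, "third_party_api": 150}}
    slow = {"duration_ms": 1000, "spans_ms": {"dynamodb": 900, "third_party_api": 80}}
    invocations = [dict(normal) for _ in range(99)] + [slow]

    print("p99:", p99([inv["duration_ms"] for inv in invocations]), "ms")
    for inv in outliers(invocations):
        print("outlier dominated by:", slowest_span(inv))
```

In this made-up data the usual suspect, the third-party API, is not what slowed the outlier down, which is the situation described in the answer above.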

Jeremy: Yeah, so that's really interesting about the keep-alive thing, because AWS doesn't put it in the AWS-SDK, but a lot of people use it because it does speed up subsequent HTTP calls. But that's interesting that it could potentially cause a problem if it needs to be reset. And I would assume that some of these outliers will be things like cold starts too, which I know most observability tools will say, “this is a cold start” or “this is a warm invocation.” Yeah, alright, so let's move on to talking about what exactly a “failure” is. You mentioned earlier about not getting an alert every time there's a timeout error or something like that, and I know failure conditions are sort of different based on different customer preferences. But maybe, we could just talk about when does an error or group of errors constitute a failure. And maybe a better way to put it would be, in the serverless world, what are the signals of failure that we can actually address. Like, when should we take action?

Emrah: Yeah, this is something that we are thinking [of], and we [have been] trying to find answers for a long while. You know, with most of the tools, both with CloudWatch and with the other monitoring tools, the alerts are just for a single error. So you have one Lambda invocation, then an error happens, and most of the time they're paging an alert. But we thought, is this something that is actually wanted by people? Is this something that protects people from alert fatigue? You know, we are coming from OpsGenie, and that's why we are very, very careful about not putting people into alert fatigue. So we asked people, “What is the definition of failure for you?” We asked tens of people, “What do you think? When do you think that this serverless architecture has failed?” And the response is that it's not a single error; most of the time, it’s not an error at all. They call something an incident when it causes cascading failures. So I have a problem with a Lambda function, and this Lambda function should have triggered another Lambda function through SNS, and that triggers another Lambda function through, let's say, SQS. I'm just throwing out a scenario here. So this first Lambda function fails and the others couldn't even start. In this case, we can understand that we are in very big trouble, that we lost some transactions there. That’s a failure for most of the people that we talk with. And the other thing is that, especially for upper management, the invocation duration, the invocation count, any kind of abnormality in these metrics, is not very important. They see cost as a signal of failure. So let's say they want to allocate $10 per day to the serverless architecture and, let's say, one dollar per function. In such cases, they want to get alerted when the cost exceeds this threshold. They are not interested in whether the function is running longer than expected because of a third-party API. 
They're not interested in whether the problem happens because of an input error. They just want to see if the cost is exceeding some threshold, because all of their motivation, when moving to Lambda, was to save cost. And they don't want to ruin that with a problematic situation.

Jeremy: Right, yeah. I think cost is always a good indication even, you know, just to see if things were taking longer to run than expected. But I always like this idea. I mean, you mentioned tuning for the memory side of things, but I think tuning for the timeout is also important, because I think it's easy to say, “well, let's just set the timeout to 30 seconds and then that way if something's running slow, then, you know, we'll just absorb it in the 30 seconds,” and maybe that's right for certain situations. But I really like this idea of failing very fast, especially to make sure that the user experience is good. And speaking about user experience, you mentioned alert fatigue. And I think that actually is really, really important because I know I've worked for several organizations that have had all different types of alerts for their applications and infrastructure. And there's always that one error that you just get every day, like 10 times a day, something random that’s completely benign because you just know, like, “Oh, yeah, every once in a while, this runs for a little bit longer,” or whatever. And maybe you go in and you tweak your alerting system to say, “don't show me these alerts,” but in my experience, most people almost never do. So then you get to the point where you've got an email folder that's flooded with alerts and you just start ignoring all alerts and that gets very, very dangerous. So how do you fight against that? Is it something like anomaly detection? Like, what can the providers do? And this doesn't have to specifically be about Thundra, but what can you do, or the tools that you use, what can they do to prevent alert fatigue, because I think that it is a serious problem in many organizations.

Emrah: Definitely. This is a problem that most of our customers are facing. And whether you are using Thundra or not, you usually know what types of errors you're facing and, when you face one of them, how many more are coming for you. So when you see a specific error type, it means that for a while you will have tens of them, hundreds of them, thousands of them, the kind of thing we call an alert storm. In this case, you should either configure [an] alerting mechanism that says, don't raise an alert until the errors of this type exceed, let's say, 100, 10, whatever fits your case, or you can say that after I get the first alert about this error type, I don't want to get alerted again for, let's say, two hours, or, let's say, 10 minutes. In this case, you don't see lots of messages in your mailbox. If you're using OpsGenie or some other tool, you don't need to page an alert for every single error; you can just get the first alert, understand the severity, and go and focus on the problem and solve it. You can do this yourself if you are shipping your logs from CloudWatch to, let’s say, some other system, and you can do it with Thundra by writing your own query. Let's say that I'd like to get alerted when there are more than 10 errors of this specific type, and after I get the first alert, I want to throttle it: I don't want to get more than one alert in two hours. So this is again about your SLAs, your rules, but you should be able to understand what's happening. And for now, what we do with Thundra, as I said, is give people a flexible querying system, which lets them configure the alerts as flexibly as possible. 
But we're also now working on some learning capabilities around how problems occur in their systems, and most probably we will be offering this option to them by the end of the year, where we will actually understand when to raise an alert for you. And this will be more flexible. And even when they want us to do this, you know, they also want to keep control, so we will continue to let them use the queries.
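
The rule described above (alert once there are more than 10 errors of one type, then stay quiet for two hours) can be sketched as follows. This is an illustration under assumptions rather than Thundra's query language: the class and parameter names are hypothetical, timestamps are plain seconds, and for brevity the error count is not limited to a time window.

```python
class ThrottledAlertRule:
    """Alert after more than `min_errors` errors of one type, then mute that type."""

    def __init__(self, min_errors: int = 10, mute_seconds: float = 2 * 60 * 60):
        self.min_errors = min_errors
        self.mute_seconds = mute_seconds
        self.counts = {}       # error type -> errors seen since the last alert
        self.muted_until = {}  # error type -> timestamp when alerts resume

    def on_error(self, error_type: str, now: float) -> bool:
        """Register one error at time `now` (seconds); True means send an alert."""
        if now < self.muted_until.get(error_type, 0.0):
            return False  # still throttled: swallow the error silently
        self.counts[error_type] = self.counts.get(error_type, 0) + 1
        if self.counts[error_type] > self.min_errors:
            self.counts[error_type] = 0
            self.muted_until[error_type] = now + self.mute_seconds
            return True
        return False


if __name__ == "__main__":
    rule = ThrottledAlertRule(min_errors=10, mute_seconds=2 * 60 * 60)
    # An alert storm: one timeout error per second for ten minutes.
    sent = sum(rule.on_error("TimeoutError", now=float(second)) for second in range(600))
    print(f"600 errors produced {sent} alert(s)")
```

Each error type is throttled independently, so a new kind of failure still pages someone while a known storm is muted.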

Jeremy: Yeah, and I think it’s hard to build tools that are self-learning like that because our architectures do change quite a bit, so I imagine keeping up with the changes wouldn’t be easy. All right, so what about failures outside, I don't want to say outside of our control, although they probably are outside of our control? But things like SQS errors and DynamoDB errors, like the inability to connect for some reason. Or I get SQS errors all the time that just say “Internal Error” or something like that when you try to put a message to SQS. And it happens every once in a while and I know that there are retries in there that happen automatically for you. But when you get messages like that, I don't know if I want to be alerted on those because they just seem to be like normal distributed system errors that I really can't do anything about other than have a good retry mechanism in there. So what are your thoughts on those types of errors like, should we be alerting on those, or should we not?

Emrah: So this is something that is very controversial with our customers as well. Some of our customers want to know in any case; they want to understand what happened even if there's nothing that they can do. They want to report the situation to their directors, their team leads. And some of our people say, “Hey, there's nothing to do about this. Let's grab a beer. These are SQS errors and a retry will fix it anyway.” But in such situations what you need to do is this: when you first receive this error, whether you get alerted or not, you should check with a distributed tracing application like ours and see if there's a problem, something abnormal on your side, in order to verify that it is something caused by the service provider, something caused by Dynamo, something caused by SQS. In this case, you check the message that you sent, you see if there's a problem in the operation name, if there's a problem with the timing, and you see what happened in your code, maybe line by line, maybe at the method level. How did you prepare this message for them? And when you are sure that there is nothing wrong with the message that you sent, you can then blame the others, “Hey, this is not related to us,” and you can just sit and wait until it is fixed.

Jeremy: All right, so I think this is a good segue, maybe to the next topic, so I mentioned actionable alerts. So we know bad things happen, right? We know that something can't communicate and APIs slow down and a managed service might become unavailable for a period of time, or there's some bad code in your app somewhere, so we can detect some of that stuff, or you can detect some of that stuff with observability tools. You can alert users, but I kind of want to talk about how you respond to those things. Because first of all, there's this responsibility shift, maybe, right where we used to have operations teams that would address the scalability and the hardware issues. And, you know, and as we moved to sort of DevOps, that sort of translated into the developer and Ops working together to try to figure out, if it is a code problem or configuration, or things like that. So we're obviously not using as many operations people in serverless, especially for smaller customers. I don't think you're gonna have operations teams at all, startups and things like that that are using serverless are going to be almost entirely serverless or mostly serverless. So now you get a developer who gets this alert. So when a developer now gets an alert that is actionable, that they can actually do something about, what’s the first step? What does the response process look like?

Emrah: Yeah, as you say, in previous systems, in non-serverless systems, most of the time it was a scalability issue. There were Ops people who [were] actually on top of that, and they were scaling systems and solving issues even without asking the developers. And they had these issues. Now with serverless, as you say, there are teams that don't even have any operations people inside. And even front-end engineers are building very nice products with serverless these days. But there are still some problems which actually impact our systems, which actually impact our end-user experience. The problem, most of the time, is errors and latencies. So we don't have the scalability issues now, but we have these latencies and errors. And the operations people actually turn themselves into the people who teach their colleagues how to respond to such kinds of errors. So in this case, when you get an alert, and I also talked about this: what makes an alert actionable? An actionable alert shows the stack trace of the error, and it shows how it is actually cascading between functions. We provide such alerts for our customers when that happens. They can see the stack trace. They can see what happened. What were the request and response of this function? And in this case, they can actually understand if the problem is occurring because of them or because of the message that they just received. Is it because of that piece of code? Is it because of a misconfiguration? They can understand it from that perspective. And then there are the alerts because of latency. So what makes a latency alert actionable? Let's say that you set up an alert which says, I want to get alerted when this function, or a whole transaction as a chain of invocations of several functions, exceeds one second. 
So alert me when this condition happens more than five times in the last 10 minutes. Let's say you have such an alert. When you receive this alert, the first thing that you will actually want to see is what caused the latency jump. So in the alert body itself, there should be some[thing] somewhere saying, “Hey, your function is performing worse than before, worse than normal, and the cause is that you are reaching out to this API on some URL, and this URL started to respond slower. Normally, it was like 50 milliseconds. Now it is 500 milliseconds.” In this case, you can say, “Hey, this URL is performing slowly. Is there something that I can do?” This is something [where] you can actually take action.
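
The latency rule above (more than five invocations over one second within the last 10 minutes) and the idea of naming the slow dependency in the alert body can be sketched like this. It is a minimal illustration under assumptions: the class, function, and dependency names are hypothetical, and the per-dependency timings are assumed to come from your tracing data.

```python
from collections import deque


class LatencyAlert:
    """Fires while more than `max_breaches` slow invocations fall inside the window."""

    def __init__(self, limit_ms: float = 1000, max_breaches: int = 5,
                 window_seconds: float = 600):
        self.limit_ms = limit_ms
        self.max_breaches = max_breaches
        self.window_seconds = window_seconds
        self.breaches = deque()  # timestamps of invocations that exceeded the limit

    def observe(self, duration_ms: float, now: float) -> bool:
        """Record one invocation at time `now` (seconds); True means alert."""
        # Forget breaches that have slid out of the time window.
        while self.breaches and self.breaches[0] <= now - self.window_seconds:
            self.breaches.popleft()
        if duration_ms > self.limit_ms:
            self.breaches.append(now)
        return len(self.breaches) > self.max_breaches


def slowest_dependency(baseline_ms: dict, current_ms: dict) -> str:
    """Describe the dependency whose latency grew the most versus its baseline."""
    name = max(current_ms, key=lambda n: current_ms[n] - baseline_ms.get(n, 0.0))
    return f"{name}: normally {baseline_ms.get(name, 0.0):.0f} ms, now {current_ms[name]:.0f} ms"


if __name__ == "__main__":
    alert = LatencyAlert()
    baseline = {"payments-api": 50, "dynamodb": 10}
    current = {"payments-api": 500, "dynamodb": 12}
    for i in range(6):
        if alert.observe(duration_ms=1500, now=i * 10.0):
            print("latency alert:", slowest_dependency(baseline, current))
```

Putting the "normally X, now Y" line into the alert body is what makes the alert actionable: the responder can see which call to look at first.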

Jeremy: Yeah. So what I'm thinking is that when you get alerts, and that’s one of the big things that observability tools certainly help with, is this idea of identifying what type of error it is. Because certainly, if it's things like IAM permission errors, which can happen, right? Somebody on your team changes an IAM permission and then all of a sudden one of your functions that doesn't often use the DeleteItem call in DynamoDB now needs to delete an item, and you start getting errors around that. Obviously, things like the timeout errors and the memory errors those are all things that certainly would kind of get you to start looking at stuff. But maybe that's where this, this ounce of prevention is worth a pound of cure saying comes in, right? So basically if you prepare for this sort of stuff, if you're in better shape for these types of failures, because you can sort of anticipate them or build in the resiliency to protect against these things, then you’re way ahead of the game. And I had a whole episode where we talked to Gunnar Grosch about chaos engineering, and he actually mentioned Thundra as being one of the tools that can sort of automatically inject latency and errors into your application to help plan for this kind of stuff. So I don't want to talk too much about chaos engineering, but I think it is a really, really interesting topic. So from the standpoint of mean time to repair and the ability for you to recover quickly if there's an outage, it might be that some of these things could almost be automated for you, in a sense, especially if you did things like graceful degradation, right? So if a third party API isn't responding, there’s better ways to deal with that. So what are some of the other ways that chaos engineering could help us improve our MTTR?

Emrah: Yeah, so, first of all, I listened to the episode with Gunnar and it was very, very nice, and you talked a lot about the basics of chaos engineering and how important it is for serverless. But I can talk about how it is actually useful for responding to incidents. The best way to get prepared for an incident is actually to have experienced it before. But no one wants to experience something bad over and over again, right? And the nice thing that we can do with chaos engineering is that you can get yourself prepared by actually simulating these kinds of problems. So you can ask yourself, what if this third-party API that I'm using starts to respond slower? What if the DynamoDB table that I'm leaning on completely stops responding? You can run such chaos engineering experiments, and in this case, you will know what will happen. And you will know it not just from the perspective of what to do, but also how to inform the customers, how to inform upper management, how to run, let's say, the retro. You can understand how to respond to these kinds of situations from many different perspectives. So let's say you set up an HTTP latency chaos experiment and you see what you are going to do when this kind of situation happens. You can set up a graceful degradation policy there, and in this case you can still serve your customers, even if this third-party API is slow. Or let's say that you can simulate this with Thundra: you can inject an error into DynamoDB calls, like an illegal access exception or an IAM permission exception, so that you can't reach Dynamo. Maybe Dynamo can experience some throttling, and maybe DynamoDB (I don't know, I hope it never happens) becomes unreachable for most of the people in the region. So you should have seen what will happen in your system. And you should actually implement some exit points for that. 
So this is all that chaos engineering is about actually: getting prepared for the incidents. So you should maybe, for example, for a DynamoDB issue, you can put a circuit breaker. You can put like another version off your Lambda function with the circuit breakers to Dynamo and you can just upload this. You can just deploy this Lambda function and in this case, you can return some default responses to your customers. But in any case, you won't collapse. You will be able to still answer your customers with some dummy answers, but customers won't be able to see an error message and won't be able to see white screen. In this case, you'll be able to understand what's happening. And the last thing that I'd like to talk about chaos engineering is that you get yourself prepared about like how actually you will respond to the incidents as a team. So when you receive an alert from Thundra or whatever, this will be most [likely] the first time you receive this. And with chaos engineering, it won't be the first time, and you get prepared. You will be have a run book then when you have this error.
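
[Editor's sketch: the circuit breaker with default responses that Emrah describes could look roughly like the Python below. This is a minimal illustration, not Thundra's API or the code discussed on the show. The names `CircuitBreaker`, `InjectedFault`, `make_read`, `DEFAULT_RESPONSE`, and `handler` are hypothetical, and the DynamoDB read is simulated so the failure can be injected without AWS access.]

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors, then retries after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency until the reset window has passed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None   # half-open: allow one trial call

        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()

        self.failures = 0           # a success closes the circuit again
        return result


class InjectedFault(Exception):
    """Stands in for an error injected into a DynamoDB call during a chaos experiment."""


def make_read(fail):
    """Build a simulated DynamoDB read plus a counter of how often it was really called."""
    calls = {"count": 0}

    def read_item():
        calls["count"] += 1
        if fail:
            raise InjectedFault("simulated DynamoDB failure")
        return {"source": "dynamodb", "value": 42}

    return read_item, calls


# Hypothetical default answer served while the dependency is unavailable.
DEFAULT_RESPONSE = {"source": "default", "value": None}


def handler(breaker, read_item):
    """Return real data when possible, otherwise the default response."""
    return breaker.call(read_item, lambda: DEFAULT_RESPONSE)
```

[With the fault injected, every request still gets an answer, and once the breaker opens the failing dependency is no longer called until the reset window elapses.]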

Jeremy: So that's actually a question that I have because again, we might be putting circuit breakers in, and gracefully degrading the experience for the user because the latency is too high for something. But when do we want to know about that? So this is something I spoke with Gunnar about, where we said, even if we can't process a credit card transaction through Stripe right now, that's fine. We just buffer those requests in SQS, and then process them later when the service comes back up. So say we do that, we build in these circuit breakers and we build in these things where, you know, we have a short timeout on purpose. Maybe we say, if we can't reach Stripe within five seconds, then we're just gonna fail it and we'll buffer the event and we'll just replay it later. Should we still get alerts on these? And maybe these are the non-actionable ones like we talked about, where it's okay. You know, Stripe is down again, although I think that doesn’t happen very often. But let's say it's some small third-party API, and we say, “Oh, yeah, that's down again. It's fine. It'll come back up. We know everything is being buffered.” But can we get alerts on those sorts of things to let us know an incident is happening? Just so you're aware of it, but you don't need to take any action?
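
[Editor's sketch: the buffer-and-replay pattern Jeremy describes, with an informational notice instead of a page, might be sketched as follows. Everything here is an assumption made for illustration: `InMemoryQueue` stands in for an SQS queue so the example runs without AWS access, `PaymentTimeout`, `simulated_outage`, and `simulated_charge` stand in for a payment provider, and `notices` stands in for whatever alerting channel a team uses.]

```python
import json
from collections import deque


class PaymentTimeout(Exception):
    """Raised when the payment provider does not answer within the timeout."""


class InMemoryQueue:
    """Stand-in for an SQS queue; a real deployment would send to SQS instead."""

    def __init__(self):
        self.messages = deque()

    def send(self, body):
        self.messages.append(body)

    def receive(self):
        return self.messages.popleft() if self.messages else None


def simulated_outage(payment, timeout_s):
    """Hypothetical provider call that always times out."""
    raise PaymentTimeout(f"no response within {timeout_s}s")


def simulated_charge(payment, timeout_s):
    """Hypothetical provider call that succeeds and returns a receipt id."""
    return "rcpt-" + payment["order_id"]


def process_payment(charge, payment, queue, notices, timeout_s=5.0):
    """Charge now if possible; otherwise buffer the event and record a low-severity notice."""
    try:
        return {"status": "charged", "receipt": charge(payment, timeout_s)}
    except PaymentTimeout:
        queue.send(json.dumps(payment))
        # Informational only: nobody needs to act, but the team can see it happening.
        notices.append({"severity": "info", "event": "payment_buffered",
                        "order_id": payment["order_id"]})
        return {"status": "buffered"}


def replay(charge, queue, timeout_s=5.0):
    """Try each buffered payment once; anything that still times out is re-queued."""
    processed = 0
    for _ in range(len(queue.messages)):
        payment = json.loads(queue.receive())
        try:
            charge(payment, timeout_s)
            processed += 1
        except PaymentTimeout:
            queue.send(json.dumps(payment))
    return processed
```

[Keeping the notice at "info" severity is one way to separate "an incident is happening" from "someone must act"; counting such notices over time also gives a record of how often the dependency fails.]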

Emrah: Yeah, this is actually very [educational] for the team, so you should know that the resources you are using are not performing well from time to time. And you should know this statistically: what happens with this third-party API? Stripe doesn't fail that much, of course, but take this third-party API, which is something not very manageable. These kinds of statistics give you an idea of whether you should continue to use it, whether you should continue to make requests to these third-party APIs, or whether you can start searching [for] some alternatives. So one of our customers did this with some APIs that they're using. Actually, because of us, they switched to another product, because they saw that the latency with this third-party API started to increase continuously, and they saw that it was not getting any better. And they switched to some alternatives, some competitors of this third party.
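
[Editor's sketch: one simple way to turn "the latency started to increase continuously" into a statistic is to compare the 95th-percentile latency across consecutive time windows. The function names, the choice of p95, and the three-window threshold are assumptions for illustration, not how Thundra or the customer in the story measured it.]

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies (pct in 1..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_per_window(windows):
    """One p95 value per window, where each window is a list of latencies in ms."""
    return [percentile(window, 95) for window in windows]


def is_degrading(windows, min_rises=3):
    """True if p95 latency rose in each of the last `min_rises` window-to-window steps."""
    p95s = p95_per_window(windows)
    if len(p95s) < min_rises + 1:
        return False  # not enough history to call it a trend
    recent = p95s[-(min_rises + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

[For example, four daily windows whose p95 values are 120, 150, 190, and 260 ms would be flagged, while four windows that all sit at 120 ms would not.]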

Jeremy: Awesome. Okay, well, listen, Emrah, I really appreciate you being here. Why don't you tell everybody how they can find out more about you and more about Thundra.

Emrah: Yeah, so thanks a lot for having me, first of all. You know, I'm on Twitter with my name and surname, @emrahsamdan, and my DMs are open. And I'm talking with many people about serverless and observability. I actually love to speak with serverless thought leaders like you, and like many others. Thundra is reachable at Thundra.io and we're continuously releasing blog posts, like how-to articles and some of our feature updates, on blog.thundra.io. We also have an open demo environment in which you can see what Thundra does using sample data that we produce continuously. It's reachable at demo.thundra.io, and this is more or less what I can tell about Thundra. If you need something like distributed tracing combined with local tracing, Thundra is, for now, the only solution that you can find. We can talk about it on Twitter, or on our Slack if you join, whenever you want.

Jeremy: Awesome. All right. We'll get all that information into the show notes. Thanks again.

Emrah: Thanks. Thanks. Thanks a lot, Jeremy.

What is Serverless Chats?

Serverless Chats is a podcast that geeks out on everything serverless. Join Jeremy Daly and Rebecca Marshburn as they chat with a special guest each week.
