Watch this episode on YouTube: https://youtu.be/31LHFQ1lT78

Transcript
Jeremy: Hi, everyone. I'm Jeremy Daly and this is Serverless Chats. Today, I'm chatting with Alex Casalboni. Hey, Alex, thanks for joining me.
Alex: Hi, Jeremy. Thanks for having me.
Jeremy: So, you are a senior developer advocate at AWS. So, why don't you tell the listeners a little bit about your background, and what you do as a senior developer advocate?
Alex: Sure. So, I come from the web development, software engineering, and also startup world. I combine all of that to help customers use AWS and discover all the different services. I used to travel the world; now I do a lot of virtual conferences.
Jeremy: So, as a DA, I know you've been working a lot with serverless. And one of the things that you publish a lot about, and you've got an open source project that we'll get into, is optimizing and tuning the performance of Lambda functions. And I think a lot of people sort of assume that all of this stuff is done for you, that it's just a matter of putting your code up there, and it just automatically does what you need it to do. Now, if you look at some of the surveys and some of the research data, it shows that people typically just use the defaults, so I think people do assume that quite a bit. But there are ways to optimize your Lambda functions. So, what are some of the main things that we have control over when it comes to optimizing Lambda functions?
Alex: Sure. So, there is a lot that the Lambda team, and actually the AWS team, is doing to optimize the service itself for performance, to make it faster, to make it more reliable, to make it cheaper, eventually. But of course, it's a service and you can configure it. And with every configuration, you can fine-tune it for your specific use case, right? So, whether you are developing a RESTful API, or an asynchronous service for ETL, whatever you're building, you might have very different needs in terms of performance, or you may need to bring the cost down to as little as possible. So, there are many things you can do. And maybe we'll talk about some of those. I got very passionate about this topic because I keep meeting a lot of developers that are just mind-blown when I tell them about the power-tuning side of things, and how they can actually get a lot more performance, and sometimes even a lot less cost. They can make their functions cheaper just by tuning the memory of their Lambda functions. So, that's what I'm really passionate about.
Jeremy: Yeah, so I mean, and if you think about building any type of application, I mean, obviously, there's a couple of major components to it. I mean, you have to think about latency. You need to think about throughput, especially if you're transferring large files back and forth from S3. And you have to think about cost, right? Cost is always sort of an important factor. So, what are some of the things, maybe we start there. What are some of those things? Do you have control over those three factors? Are there ways besides just the memory manipulation that you can really focus in on those?
Alex: Well, when it comes to the speed itself, the execution time itself, you actually do have control, and there are many things you can do to speed up the average execution. We can talk about cold starts, or about the large majority of your executions as well, when it comes to Lambda. There is also a lot you can do to optimize for throughput. Usually, it's something you do at the architectural level. It's not just a configuration parameter where you increase throughput. There are services like that. Like, I don't know, Kinesis Streams, where you have some configuration that allows you to have more throughput in megabytes per second, maybe.
But for Lambda itself, you don't really have a parameter to increase throughput. All that is managed by the service. What you can do is at the architectural level: maybe if you have 10 gigabytes of data to analyze, instead of doing it in series, 10 megabytes at a time, you can do it in parallel. So, that's an architectural change, a pattern change, but it's not really a configuration of Lambda itself. It's more in the way you use all these services together, in my opinion.
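As a rough sketch of what that kind of architectural change could look like in Node.js, here is one way to fan out one asynchronous Lambda invocation per chunk instead of processing the chunks in series (the worker function name and the chunking are hypothetical, not something from the episode):

```javascript
// Hypothetical fan-out: invoke one worker Lambda per chunk, in parallel.
const { Lambda } = require("aws-sdk");
const lambda = new Lambda();

async function processInParallel(chunks) {
  // Each invocation handles one chunk; the service scales the workers.
  const invocations = chunks.map((chunk) =>
    lambda
      .invoke({
        FunctionName: "analyze-chunk", // hypothetical worker function
        InvocationType: "Event",       // asynchronous, fire-and-forget
        Payload: JSON.stringify({ chunk }),
      })
      .promise()
  );
  return Promise.all(invocations);
}
```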
When it comes to cost, you also have control over that. There are many guardrails you can put in place to control the spend, to control the parallelism, how many Lambda instances you can run. You can set it to zero if you want to stop everything. So, there are many ways you can control costs. What happens usually in the wild, what I see, is that most developers I meet are more interested in optimizing for performance, just because the cost of Lambda itself is a relatively small percentage of their bill. So, they're not so much concerned about, "I want to make Lambda as cheap as possible." Usually, they want to make the overall architecture cheaper. And they are happy to pay a little bit more for Lambda if they can make it faster.
Jeremy: Right. Yeah. All right. So, let's talk about cold starts because this is the thing that always comes up. I think there's a lot of confusion around how often you get cold starts, how much of an impact they actually have, when they happen and things like that. So, let's start there. How much of an impact do cold starts actually have on your application?
Alex: It kind of depends on the use case. I would say there's no typical answer, I know, I'm sorry. But it depends how often you invoke your function. It depends how often you're scaling out to additional runtimes of your Lambda functions. So, depending on the use case it might be 1%, or 5%, or 10% of your executions. If your function is a daily cron job, it's probably 100% of your executions.
So, it really depends. In my experience, unless you really have like one execution per day or one execution per hour, you can typically expect that a relatively small percentage of your executions, maybe 1% or 5%, will be cold starts. That means typically it doesn't matter that much. And I understand why many developers are concerned about it, because when you go and test it in the console, that's all you see, and you go, "Ah, it's slow." But at scale, and scale here means if you have more than 10 or 20 invocations per minute, or per second, or per hour, it's not like every customer that is using your API is going to experience that.
So, you may want to work really hard there to try to optimize it. And there are many ways to optimize for cold start times if your application is latency sensitive and customer facing. So, if you want to optimize the user experience. And there are many ways to do that, depending on the language, and the runtime, and the framework. Sometimes there is no way to avoid it. I can tell you, the first time I used Lambda in my life, it was 2016, early 2016. And I was deploying a service based on SciPy and NumPy. These are pretty heavy Python libraries, and the cold start was about two seconds. So, now it's a lot better already. But hey, there was no way to get rid of it. I really needed those libraries. There are many other cases where even just working on the libraries that you are using can have an impact on that. So, that's a long discussion.
Jeremy: Right. Yeah, no, but I think that you can boil it down to, like you said, if your application is latency sensitive. And I think the latency sensitive ones might be applications that are responding to API events or something like that, the synchronous things. I find that a vast majority of the functions that I build now are asynchronous processing functions. And so, if the cold start takes 300 milliseconds or something like that to start up, that has a very, very minimal impact. But what are some of those optimizations that you can do, maybe just to the code base itself, to minimize those cold starts?
Alex: Yeah, sure. So, one of the best suggestions I can give you, and it's not something you do overnight, is to avoid those monolithic functions that are 10 or 20 megabytes of code and libraries and that do 99% of what your application is doing. If you end up with a very large deployment package, what happens under the hood is that we have to move a lot of bytes over the network. And that means you need more time to spin up your runtime, and then to execute your code. So, that's the simplest, most common answer I can give you.
There are many other examples of very fine-tuned optimizations. Like, if you're using the AWS SDK for JavaScript, for example, but only the DynamoDB client, you can go and just import and initialize only that single client, not the whole SDK. And that might give you, I don't know, 100 milliseconds back, which is impressive if you think about it. But there are many other techniques. Usually, you will have to use some tricks in your code, like lazy initialization of libraries, or moving things around.
For example, I like to have many functions per file in my code, so I can have a full picture of maybe a domain of my application. And very often you end up with a lot of initialization code up there, which you don't always need in every single function. So you're just maximizing the cold start of all the functions in that file. So there are ways to optimize it. Either you use one function per file, or you do some lazy initialization for the heaviest objects or libraries that you need to use.
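For a rough illustration of those two tricks in a Node.js function, here is a minimal sketch; the heavy library, its method, and the table name are placeholders, and the exact cold start savings depend on the runtime and SDK version:

```javascript
// 1) Import only the client you need instead of the whole AWS SDK.
// const AWS = require("aws-sdk");                    // loads every service client
const DynamoDB = require("aws-sdk/clients/dynamodb"); // loads just DynamoDB
const dynamodb = new DynamoDB();

// 2) Lazily initialize heavy dependencies, so handlers that never use them
//    don't pay for them at cold start.
let heavyClient; // stays undefined until first use

function getHeavyClient() {
  if (!heavyClient) {
    const SomeHeavyLibrary = require("some-heavy-library"); // placeholder module
    heavyClient = new SomeHeavyLibrary();
  }
  return heavyClient;
}

exports.lightweightHandler = async () => {
  // Only touches DynamoDB; the heavy library is never loaded on this path.
  return dynamodb
    .getItem({
      TableName: "my-table", // placeholder table
      Key: { pk: { S: "example" } },
    })
    .promise();
};

exports.heavyHandler = async (event) => {
  // The heavy library is loaded on the first invocation that actually needs it.
  return getHeavyClient().process(event); // placeholder method
};
```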
Jeremy: Right. So, I want to get into those in detail. But before we get into that, I want to talk about some other optimizations. And maybe they're not optimizations, but they're sort of services and service improvements that were made that help with cold starts. So one place where we saw a lot of cold start problems, or I think a high latency for cold starts, was when it came to VPC. That has been, I mean, dramatically reduced, which is great. But then the other thing that came up was provisioned concurrency. So, how does that fit into optimizing cold starts? And also, how does it affect the scaling of your Lambda functions?
Alex: Absolutely. So yeah, the VPC impact now is basically zero compared to a year ago. And it's been like that for the last 9 or 10 months, since the last re:Invent, since November last year, more or less. Provisioned concurrency, that's also been there for, wow, almost nine, 10 months as well. And that's basically a managed solution to the cold start problem. What developers were doing before is that they were maybe using some CloudWatch Events to keep their functions warm. But let's say that only works at a very small scale. You cannot pre-provision 200 Lambda runtimes that way if you know that you're going to have a peak in half an hour and you want to have 200 warm environments.
So, that's exactly what provisioned concurrency allows you to do. So, a typical use case is, I know that, I don't know, my streaming platform is going to handle a live soccer match or whatever live event, and you know you're going to have a peak on your website, which results in a peak on your backend. So, there's a lot you can do there. You can do caching. You can try to reduce the load. But at some point, the requests will hit your backend. And you can try to estimate that, and you can try to ramp up enough warm runtimes so that those customers, or those users, or those consumers are not going to experience a cold start.
The way you do it is you put a number in your Lambda console, or in your CloudFormation template, or Terraform template, or Serverless Framework template. It takes some time. So, if you need 10,000, it might take a few minutes. I think the ramp-up rate is still around 500 runtimes per minute. So, for most use cases, if you know you're going to have a peak well in advance, at least a few minutes or hours or days in advance, you can plan for it and ramp up to that provisioned concurrency. And then you can keep it up there as long as you need it. You can also auto-scale it. So, it integrates with AWS Auto Scaling, so you don't really have to over-provision. One of the things that some developers told me is, "Wait, do I get throttled above that concurrency?" No, you never get throttled. You are just going to pay for regular Lambda invocations above that provisioned threshold.
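Alongside the console and infrastructure templates, the same setting can also be applied programmatically; here is a minimal sketch with the AWS SDK for JavaScript (the function name and alias are hypothetical, and provisioned concurrency has to target a published version or alias, not $LATEST):

```javascript
// Reserve warm execution environments ahead of a known traffic peak.
const { Lambda } = require("aws-sdk");
const lambda = new Lambda();

async function warmUpForPeak() {
  await lambda
    .putProvisionedConcurrencyConfig({
      FunctionName: "checkout-api",         // hypothetical function name
      Qualifier: "live",                    // version or alias to keep warm
      ProvisionedConcurrentExecutions: 200, // ramp-up can take a few minutes
    })
    .promise();
}
```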
Jeremy: Yeah. And there's certainly a balance of cost and performance there. Because there is a cost to it. Even if you're just running one function warm... I think one unit of provisioned concurrency, assuming you keep it for the whole month, is like $14 or something like that in USD. But that's for the ramp-up stuff. So, I do think if you're just worried about a cold start every now and again, that is something that I would not worry about too much. But if you do have to really ramp up thousands and thousands of warm instances so that you can handle that peak, then that's definitely something to look at. But anyways, all right.
So, let's go back to just this idea of sort of general tuning of the Lambda function. So, another thing that I think I hear is a complaint sometimes is that Lambda doesn't have a lot of knobs, right? There's not a lot of things you can turn on there. I mean, you have very few things. Really, the main one is just the amount of memory that you allocate to a Lambda function. So what is the relationship though? Because I think people don't always understand this. What's the relationship between the memory setting and CPU and network throughput?
Alex: That's a good question. And that's, I would say, the most counterintuitive side of serverless. And I keep meeting people, developers or operations people, who will just go and say, "Hey, we have a lot of Lambda functions that we are maybe over-provisioning for memory, because we have this one gigabyte default configuration, maybe." And then they look at the logs, and the function is only using maybe 50 megabytes of RAM, and they're like, "We don't need all that RAM." But in reality, what you are provisioning is power. That's why I always think of it as power. Don't think of it as memory. If you allocate more memory, you obviously get more CPU, more I/O throughput, you get more power. You get a chance to run faster, in simple words.
So, you can go from 128 megabytes up to three gigabytes today, which is usually a pretty good range. And one thing that many people do not know is that you actually get 64 different values you can choose from. It's not just 128, 256, 512. You can actually go all the way up in 64-megabyte increments. So they give you a lot of choice. And even going with one gig by default, there are two potential problems. One is you might actually be over-provisioning: if you don't need the speed, you might be able to save cost, for example if your function is asynchronous and you don't care about speed. The second problem is that you might actually go even faster if you allocate more power, depending on what type of workload you are implementing.
Jeremy: Right. And so the biggest problem, I think, with setting that knob, setting that value, is obviously understanding what impact that has. Because like you said, if you turn the power up, it could actually run faster. That execution time could be faster, and it could actually be cheaper. Or it would just get it done faster and give you more performance. And we can certainly get into cost versus performance. But measuring that is really difficult. You mentioned logs, but again, am I going to go in there and go up 64 megabytes at a time and then try it again and try it again? And then is that an accurate sample, and so forth? So you built the Lambda Power Tuning project, which is an open source project on GitHub that helps you measure this stuff and run these experiments. So, can you tell us what this project is and how it works?
Alex: Sure. So, it is an open source project. You can find it on GitHub. Maybe we'll share the link later.
Jeremy: Absolutely.
Alex: And it allows you to deploy into your account a Step Function, a state machine, that will take your function as input and run it enough times to get enough data, looking at the logs, looking at the invocations. And then, as the output, it will give you the optimal memory configuration for your specific function with that specific payload. So, the idea was that you could actually automate this process of tuning your functions, eventually integrating it in some kind of CI/CD environment. And the problem I really wanted to solve: in 2017, I was an AWS customer, and I was really just spending my days back then, it was March 2017, just tuning and testing, and tuning and testing. And of course, it's a very manual process. There are 64 different values you can test. And you might actually have very different payloads in production that will have different performance implications. So, you just don't want to do it manually.
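To give a sense of what that looks like in practice, the state machine is started with a JSON input along these lines (the parameter names reflect the project's documented options at the time of writing and may change between versions, so check the README; the ARN is a placeholder):

```javascript
// Illustrative execution input for the power-tuning state machine.
const input = {
  lambdaARN: "arn:aws:lambda:us-east-1:123456789012:function:my-function",
  powerValues: [128, 256, 512, 1024, 1536, 3008], // memory settings to test
  num: 50,                  // invocations per power value, for a decent average
  payload: { test: true },  // event sent on every test invocation
  parallelInvocation: true, // run the samples in parallel instead of in series
  strategy: "balanced",     // "cost", "speed", or "balanced"
};
```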
Jeremy: And the tool itself has evolved, right? I mean, you've mentioned it's open source. I know you've gotten a lot of contributions to it. And so, it has a lot of great features, right? It's pretty simple to set up. I mean, you put in the ARN of the function, which power values you want to test, the number of samples you want to run for each one. Because I think that's another important factor, where you can't just test one sample. You've got to run it 10, 15, 20, 100 times or whatever at each configuration level to get a good average. Because especially if you're connecting to third-party components, those could vary greatly.
So, being able to see that. Like you said, the payload option is great, the parallelization factor is pretty cool, so you don't have to wait and run these serially and have it take hours to run; you can do it within minutes. But you have some other things in here that I'd love to talk about, because I think these are more advanced ways to think about it. And one of them is the strategy input that lets you choose between... is it speed and cost? What are the options for that strategy input?
Alex: So, typically, you want either your function to go as fast as possible, so that's speed, or you want your function to be as cheap as possible, so that's cost. But in reality, what I figured is that you often want somewhere in between. You don't always want the cheapest or the fastest, because the fastest might be the most expensive, and the cheapest might be the slowest. So, you probably want somewhere in between, like the sweet spot between these two dimensions. And that's why you also have a third option, which is balanced. And if you choose balanced, you can actually decide what's more important for you. Is it speed? Then you can provide a weight of zero, meaning you really care about speed, and that's the same as selecting speed as your strategy. Or you can give it a value of one. By default, this value is 0.5, so you're giving equal importance to speed and cost.
So, I really recommend the balanced option. It's a formula devised by an Italian mathematician that I know personally. We just talked about it for a whole day, and we came up with a formula that makes sense. But ultimately, what I really recommend is to run the tool, look at the numbers, and look at the visualization, actually. It's very, very easy for our brain to find a sweet spot visually without actually looking at the precise numbers. The formula, the balanced optimization strategy, is useful if you want to automate the whole thing.
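Conceptually, a balanced strategy blends normalized cost and normalized execution time into a single score per power value. Here is a simplified sketch of that idea, not necessarily the exact formula the tool implements, with a weight of 0 caring only about speed and 1 only about cost, as described above:

```javascript
// Simplified illustration of a "balanced" score: lower is better.
// weight = 0 -> only execution time matters (same as the "speed" strategy)
// weight = 1 -> only cost matters (same as the "cost" strategy)
function balancedScore(avgCost, avgDuration, maxCost, maxDuration, weight = 0.5) {
  const normalizedCost = avgCost / maxCost;         // scale both metrics to 0..1
  const normalizedTime = avgDuration / maxDuration; // so they can be compared
  return weight * normalizedCost + (1 - weight) * normalizedTime;
}

// The optimal power value is the one whose samples produce the lowest score.
```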
Jeremy: The auto optimize function, what does that do for you?
Alex: So, that was intended for use cases where you wanted to run the state machine and leave the function in a state where it was already optimized. So, at the end of the execution it configures the power to the optimal value for you. Initially, I thought of it as a way to automate it in your CI/CD, maybe. In reality, you probably want to take that optimal value back, put it into your CloudFormation or Terraform template, and then redeploy it, right? So the auto-optimize option, I wouldn't use it in production. It might be useful in a development environment where you really want to automate it and basically auto-optimize at every deployment. But really, it was more of an experiment back then. If you find a very useful way to use it, let me know. I think there are more advanced CI/CD strategies there.
Jeremy: Right. So, what about the output, though? So, you mentioned the visualization, but what do we actually get out of the state machine when it's done running?
Alex: Right. So, by default, you get a few parameters. You get the optimal power value, and the average cost and the average execution time for that optimal power value. First of all, it's in a JSON format. So, if you really want to automate it, you can just parse it and do something else with it. And that's really just about the optimal configuration. But then I also decided to provide some metadata. So, there is a constant, I wouldn't call it problem, maybe challenge, with AWS services, which is, as a developer, I invoke a service, and I don't know how much I pay for it.
Alex: There are a few services like Amazon Polly that would tell you, "Hey, I've computed 20 characters for this text-to-speech operation." And so, you have a sense of how much you might have spent for that single API call. So what I decided to do is to add that to the output: the cost of that state machine execution, for Step Functions and for Lambda. So, you also find that in case you want to keep track of it, or log it somewhere, or decide it's too much. Usually, it's not too much. Usually, it's less than one cent, one tenth of a cent of a dollar, a lot of zeros. So, I consider it free for most use cases. And the last thing you find in the output is a URL. This is a URL that you don't really have to use. But if you decide to open and visit that URL, you will find the same data for all the power values that you have tested, visualized in a chart that is interactive, and you can play with it.
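As an indication of the shape of that output, it looks roughly like the following; the field names and values are illustrative and may not match the current version of the project, and the visualization hash is a placeholder:

```javascript
// Illustrative output of the state machine execution (values are made up).
const output = {
  results: {
    power: 512,        // optimal power value found
    cost: 0.0000002,   // average cost per invocation at that power
    duration: 3.5,     // average execution time at that power
    stateMachine: {
      executionCost: 0.00045, // cost of the Step Functions execution itself
      lambdaCost: 0.00003,    // cost of all the test invocations
      visualization: "https://lambda-power-tuning.show/#<serialized-results>",
    },
  },
};
```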
Jeremy: Yeah, and I think that, that is the thing that... Again, if you look at a bunch of numbers on a spreadsheet or in a JSON structure or document it's very hard to say, "Yeah, that's better than the other one," and kind of do that. So, the visualization piece, and that's something I think that goes to show the whole power of open source, right? That was something... You didn't build that. That was brought by the community?
Alex: Yeah, that was the idea. My initial idea was that you would just take the output of my state machine and trust it. What I actually realized is that you want to find the optimal sweet spot visually. It's really impossible to do it with raw numbers in a table or in any other format. So, what happened is, I actually got a university student, the same mathematician I was talking about. Hi, Matteo, if you're watching. He decided to help me build this. And actually the visualization is a static website that you also find on GitHub. There is no database. It's not a service or a platform. It's as simple as a client-side, static website that is just visualizing some data.
So, when you click or when you visit that URL, what happens is that all the data about your power values, the cost for each power value, and the execution time for each power value is serialized into the hash part of the URL. Meaning, it's not actually sent to the server. It's all available on the client side, so that the JavaScript code in the browser can visualize it for you. And it's completely anonymous and completely GDPR compliant, and there is no data privacy concern there. But if you still do have privacy concerns, or if you still want to maybe customize it and build a better one, you can actually provide a custom data visualization URL that you built yourself, and you can provide it at deploy time. So feel free to build your own.
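A tiny sketch of why that stays private by construction: the URL fragment, everything after the '#', is never sent to the server, so the page can decode and chart the data entirely in the browser. The encoding below is deliberately simplified and is not the site's actual scheme:

```javascript
// Simplified client-side decoding of results packed into the URL hash.
// The fragment (everything after '#') never leaves the browser.
const fragment = window.location.hash.slice(1); // e.g. "128,256,512;120,60,35;0.2,0.21,0.24"
const [powers, durations, costs] = fragment
  .split(";")
  .map((part) => part.split(",").map(Number));

// The chart is then rendered locally from these arrays.
console.log({ powers, durations, costs });
```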
Jeremy: So, just with this topic of open source. So, how did you find working with open source? Because I know I do a lot of open source projects, including one that keeps Lambda functions warm, but I don't really use it that much anymore. But did you learn anything because I just I'm always fascinated by people working in open source because it is such a... It's sometimes thankless. It's also sometimes really rewarding.
Alex: I think it is. I totally agree with you. If you're not used to it, you could find it frustrating. I'm a developer. I like to sit down and code and develop new features. And sometimes you just have to open an issue there and wait for people to provide their opinion, which is what open source is all about. You want to get opinions, ideas, improvements. So, I actually learned a lot through that process. And now I don't just sit down and implement something I really want to implement. I sit down, and maybe I'll be surprised because in two weeks there will be an open source contribution that implements it. And so, in the process, you meet people, you build relationships, you build trust, and you might learn something you wouldn't learn otherwise.
So, I'm really happy about the learning I've done myself. And some of the best features, like the visualization and others, have come from other people's ideas. So, I could never have done that by myself. Last but not least, sometimes you figure out that people are using it in very creative ways. I've met a person, a developer, that was using Lambda Power Tuning as a kind of stress testing tool. They would use a very large number of executions just to see what happens to the downstream services. Like, are we ready for production? So you do two tasks in one, basically.
Jeremy: Interesting. Yeah, so speaking about learning, right? Because this is another thing. I think people are going to start asking this question. And we haven't gotten into too much detail about tweaking knobs and that kind of stuff yet. But this is a lot of work, right? I mean, even if you... Let's say you have 10 functions, and you have to run this tool on 10 functions. That's probably not that big of a deal. You could handle that. But what if you have 1,000 functions, and you have to run this and get the optimization?
And so, I think one thing that I've noticed from this tool is you sort of quickly get to understand the types of workloads that benefit from different types of optimizations. What benefits from more power versus less power? And I know you have some examples, and I'd love it, I think you have a GitHub repository where you can actually run these examples and see the different ways to do it. And I'll get that into the show notes. But I'd like to talk about this a little bit. So what type of performance benefit do you see when you're at 128 megabytes versus three gigs if you're just running a standard no-op type service? So, you're just doing something quick, nothing CPU intensive, nothing network intensive. What type of performance difference are you going to see when you turn that power all the way up?
Alex: Well, so for the no-op kind of function, meaning no API calls, no long-running compute task or anything, what typically happens is that you cannot make this kind of function cheaper, because you're probably running in five, or 20, or 30 milliseconds, right? And that's as cheap as you can get. So, if you give it more power, what happens is that the cost is going to go up. But what might happen, and you may not expect it, is that the execution time might actually go down. So, if it's running in five milliseconds, okay, it might not be worth it, because how much better can you do? Maybe two milliseconds.
There are cases, though, where you run for 20 or 30 milliseconds. And maybe you are part of a microservices orchestration workflow or whatever. And if you can shave 10 or 15 milliseconds off each step of your workflow, it might actually be noticeable for the end customer. So, even for these kinds of functions, you can sometimes shave off a few milliseconds. Usually, if you go from 128 to 256, that's the most you can do, in my experience. There might be edge cases. You might see a 5, 10, 15 millisecond execution time gain. So, that's sometimes very interesting. Sometimes it's not at all, and you do not want to pay 10% more to gain five milliseconds, I get it. I think the important point is, "Hey, you have the data. You can make an informed decision about it." I think that's the most important achievement of the tool.
Jeremy: All right. So, what about tasks that are CPU heavy, like the NumPy ones and those sorts of things that need to use a lot of the CPU? What do you see there? How do the memory settings affect that?
Alex: Well, CPU-bound functions are, in my opinion, the most interesting ones for memory and power optimization, because the effect of CPU on cost and performance is the most noticeable. So you might have something like a function that runs in 10 seconds, or 20 seconds, or 30 seconds, even more. And unless you're just invoking APIs and waiting, if you're just crunching data, or transcoding videos, or doing something CPU intensive, it's quite likely that with more power you go from, maybe, 10 seconds to five seconds. And so, in a way, it's all proportional. So, if you double the power and you halve the execution time, the cost is flat. The cost is constant, basically, right? So, why shouldn't you give it more power? You have the same cost, twice better performance, just by tuning one parameter. And I think this is the kind of workload that can benefit the most.
I've seen use cases where you run for 20 or 30 seconds, and with maybe one and a half gigabytes, or even two or three gigabytes, you can run in less than a second. So, we are talking 20, 25x performance improvement, right? And usually, the cost is flat, meaning that you're either paying the same, you might be paying 10, 15% more, or in some cases you're paying 10, 15% less. And that's like a no-brainer situation, right? You can be 10 times faster and pay 15% less. There's no reason why you shouldn't do it.
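To make the flat-cost intuition concrete: Lambda compute is billed in GB-seconds, so doubling the memory while halving the duration leaves the compute cost roughly unchanged. A quick back-of-the-envelope check, using an indicative on-demand price and ignoring the small per-request fee and billing granularity:

```javascript
// Back-of-the-envelope Lambda compute cost: memory (GB) x duration (s) x price.
const PRICE_PER_GB_SECOND = 0.0000166667; // indicative price, check current pricing

const cost = (memoryMB, durationSeconds) =>
  (memoryMB / 1024) * durationSeconds * PRICE_PER_GB_SECOND;

console.log(cost(1024, 10)); // 1 GB for 10 s -> ~$0.000167
console.log(cost(2048, 5));  // 2 GB for 5 s  -> ~$0.000167, same cost, half the latency
```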
Jeremy: Right. Another use case are ones that are network intensive. Because I've seen this quite a bit. I think I wrote an article about this was that when you're just making a network call, especially if you have to wait because you're waiting on the network, that very, very low memory settings really do not affect how fast any of that stuff works. Is that what you've seen in your experience too?
Alex: Yeah, that's the case, especially if you're talking about a third-party API external to AWS, where your power cannot in any way affect their performance, right? There are cases where you are transmitting a lot of data, so you might see some kind of improvement with slightly more power. But usually, the cheapest option would be 128 megabytes. I've seen cases where you want to go to 256 or 512 just to gain maybe 20% performance, but you need to be ready to pay something more. Usually, you cannot really make these kinds of functions cheaper, which is to be expected. It's their performance, not your performance. You're calling someone else. And unless you can control that other API and optimize it too, you can't do much about it.
It's quite different, though, if we're talking about internal APIs, because you're not leaving the data center. Well, there are cases where you invoke a third-party API and you are not leaving the data center anyway, because you're invoking, I don't know, a third-party API that is also hosted on AWS, but not always, right? If you're invoking DynamoDB, though, or if you're invoking some other internal API, and it's pretty common, I see, that you invoke DynamoDB a couple of times during a Lambda execution, you can actually get a lot better performance, usually for the same cost. I've seen an example where you were running three DynamoDB queries in series, so there is no way you can parallelize it. You have to run them in series.
So, I think the total was about 250 milliseconds, something like that. And if you double the power a couple of times, you start seeing that the execution time is going down one level, and then down another level, to about 90 milliseconds. So that means you can basically get the same cost for two or three times the performance. So if that's what you're doing, invoking internal AWS APIs, it's also worth power tuning. I encourage you to have a look, at least. You might not change everything, but it's common that you can get a considerable performance gain for the same cost or slightly more.
Jeremy: Right. Yeah, no, and I think that part of it too is, if we go back to having to run this on every function, and you maybe make a change to a function, you do something like that, it just becomes very, very tedious, and probably a lot of work to run this on every single function. But as you start to see... You run it on a few functions, maybe different types of workloads, and those patterns start to emerge, right?
Alex: Absolutely. There is not an infinite set of patterns.
This episode is sponsored by New Relic and Amazon Web Services.