Show Notes:
On event-driven architectures...
Mike Deck: (
Episode #5) I think it's probably easiest to understand when contrasted against a command-driven architecture, which is what we're mostly used to. It's this idea that I've got some set of APIs that I go out and call, and I issue commands there, right? So maybe I have an order service, and I'm calling create order, and downstream from that there's some invoicing service, so the order service goes out and calls that and says, "Create the invoice, please." That's the standard command-oriented model that you typically see with API-driven architectures. In an event-driven architecture, instead of creating specific, directed commands, you're simply publishing events that describe facts that have happened; these are signals that state has changed within the application. So the order service may publish an event that says, "hey, an order was created," and now it's up to the downstream services to observe that event and do the piece of the process that they're responsible for at that point. It's a subtle difference, but it's really powerful once you start taking it further down the road, in terms of the ability to decouple your services from one another. When you've got a lot of services that need to interact with a number of other ones, you end up with a lot of knowledge about all of those downstream services getting consolidated into each one of your microservices, and that leads to more coupling; it makes things more brittle. There's more friction as you're trying to change them, so that's a huge benefit you get from moving to an event-driven architecture. And then in terms of the relationship to serverless: with services like AWS Lambda, you have a fundamentally event-driven service. It's about being able to run code in response to events. So when you move to this model of "hey, I'm just going to publish information about what happened," it becomes super easy to add additional custom business logic with Lambda functions that subscribe to those various events, and that gives you the ability to build serverless applications really easily.
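To make the contrast concrete, here is a minimal sketch of the publishing side of that pattern using Amazon EventBridge with boto3. This is an editor's illustration, not code from the episode; the event source, detail type, table of contents of the event, and handler names are all hypothetical.

```python
# Sketch of the event-driven pattern described above (illustrative only).
# The order service publishes a fact ("OrderCreated") instead of commanding
# the invoicing service directly; source and bus names are hypothetical.
import json
import boto3

events = boto3.client("events")

def publish_order_created(order):
    # Publish the fact that state changed; we don't know or care who listens.
    events.put_events(Entries=[{
        "Source": "com.example.orders",     # hypothetical event source
        "DetailType": "OrderCreated",       # a fact, not a command
        "Detail": json.dumps(order),
        "EventBusName": "default",
    }])

def invoicing_handler(event, context):
    # A downstream Lambda subscribed to OrderCreated events via an
    # EventBridge rule; it owns its own piece of the process.
    order = event["detail"]
    print(f"Creating invoice for order {order['orderId']}")
```

The key design point is that `publish_order_created` has no reference to invoicing at all; adding another subscriber later requires no change to the order service.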
On understanding the connectivity of microservices...Ran Ribenzaft: (
Episode #8) We broke them from being a big monolith, a single giant monolith, into multiple microservices; you can call them microservices, services, nanoservices, but the fact that one giant thing broke into tens or hundreds of resources suddenly presents a different problem. A problem where you need to understand the interconnectivity between these resources, where you need to keep track of messages that are going from one service to another, and once something bad happens, you want to see the root cause analysis. This is a repetitive thing that you can hear over and over, this root cause analysis: the ability to jump from the error, and the error can be a performance issue or an exception in the code, all the way back to the beginning. The beginning can be the user that clicked a button on your business website and caused this chain of events. So these are the kinds of things that you want to see, where in traditional APMs, in traditional monitoring solutions, you don't have it. And in the future, you'll find it more and more like that.
On monitoring interconnectivity...Emrah Şamdan: (
Episode #12) In serverless, on the other hand, you have different piles of logs, which come out of the box from CloudWatch, from the resources the cloud vendor provides. But these are separate, and they don't give you the full picture of what happened in a distributed serverless environment. And the problems here are different. In a normal environment, the problem most of the time was scalability, and you responded to that by giving more resources, by just increasing the power of your system. But with serverless, the problem is that something goes wrong somewhere in a distributed system, and you need more than log files. You need all three pillars of observability: traces, which in our case means distributed traces that show the interaction between Lambda functions, the managed APIs, the managed resources, and third-party APIs, plus local traces that show what happens inside the Lambda function, and the metrics and the logs.
On the purpose of AWS X-Ray...Nitzan Shapira: (
Episode #2) You can do it to some extent. X-Ray integrates pretty well with the AWS APIs inside the Lambda function, for example, and will tell you what kind of API calls you made. It's mostly for performance measurements, so you can understand how much time a DynamoDB putItem operation took, or something of that sort. However, it doesn't try to go into the application layer and the data layer. So if information is passed from one function to another via an SNS message, then goes into S3 and triggers another function, all of this data layer is something that X-Ray doesn't look at, because it's meant to measure performance. That's why it's not able to connect asynchronous events going through multiple functions. Again, this is not the tool's purpose. The purpose is to measure and improve the performance of certain specific Lambda functions that you want to optimize, for example.
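For reference, the performance-measurement use case described here looks roughly like the following sketch: the X-Ray SDK for Python patches boto3 so AWS calls such as DynamoDB PutItem show up as timed subsegments. This assumes active tracing is enabled on the function and the aws_xray_sdk package is bundled; the table name is hypothetical.

```python
# Minimal sketch: let AWS X-Ray time AWS SDK calls made inside a Lambda
# function. patch_all() instruments boto3/botocore so each DynamoDB call
# appears as a timed subsegment in the trace. Table name is hypothetical.
import boto3
from aws_xray_sdk.core import patch_all

patch_all()  # instrument boto3/botocore (and other supported libraries)

table = boto3.resource("dynamodb").Table("orders")

def handler(event, context):
    # The PutItem call below is recorded as a subsegment, so you can see
    # how long it took, which is the performance view X-Ray is designed for.
    table.put_item(Item={"orderId": event["orderId"], "status": "NEW"})
    return {"ok": True}
```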
On instrumentation...Ran Ribenzaft: (
Episode #8) Instrumentation is the way, or a technique, that allows a developer to, let's call it hijack, or add something to every request that he wants to instrument. For example, say I'm making calls using Axios to a REST API, either my own or an external third-party API. I want to be able to capture each and every request and response that is coming in and out of that resource, from that Axios request. Why would I want to do that? Because I want to capture vital information that I'll be able to ask questions about later on. For example, if my Axios call goes to Stripe to make a purchase or to send an invoice to my customer, I want to know how long it takes, because I don't want my customer to wait on the purchase page or wait for the invoice to reach their email. I want to measure that and put it as a metric in CloudWatch metrics or in any other service. And then I'll be able to ask, "Well, was there any operation against Stripe that took more than 100 milliseconds?" If so, that's bad, and this is only accomplished using instrumentation. I mean, the alternative is to wrap my own code every time I'm calling Stripe or any other service, but the amount of annotation you'd have to add to your code is almost unlimited, so you won't get away without proper instrumentation in your code.
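The episode's example uses Axios in Node.js; here is the same idea sketched in Python with requests and boto3 as an editor's illustration: wrap the outbound call, time it, and push the duration to CloudWatch so you can later ask whether any call took more than 100 ms. The namespace, metric name, and dimension are made up.

```python
# Sketch of the instrumentation idea: time every outbound call and record
# the duration as a CloudWatch metric. (The episode's example uses Axios in
# Node.js; all names below are illustrative.)
import time
import boto3
import requests

cloudwatch = boto3.client("cloudwatch")

def timed_request(method, url, **kwargs):
    start = time.perf_counter()
    try:
        return requests.request(method, url, **kwargs)
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        cloudwatch.put_metric_data(
            Namespace="MyApp/Outbound",                    # hypothetical namespace
            MetricData=[{
                "MetricName": "ThirdPartyLatency",
                "Dimensions": [{"Name": "Host", "Value": url.split("/")[2]}],
                "Value": elapsed_ms,
                "Unit": "Milliseconds",
            }],
        )

# Later you can query or alarm on it: "was any Stripe call slower than 100 ms?"
```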
On the problems with manual instrumentation...Nitzan Shapira: (
Episode #2) It's not just the fact that you can forget. It's also going to take you a certain amount of time, always, that you're basically going to waste instead of writing your own business software. Even if you do remember to do it every time, it's still going to take you some time. One way that can work is embedding it in the standard libraries that you work with. If you have a library that is commonly used to communicate between services, you want to embed that tracing information or extra information there, so it will always be there. This kind of automates a lot of the work for you. Then it's just a matter of what type of tool you use. If you use X-Ray, you're still going to have to do some kind of manual work. And it's fine, at first. The problem is when you suddenly grow from 100 functions to 1,000 functions; that's where you're probably going to be a little bit annoyed or even lost, because it's going to be just a lot of work, and it doesn't seem like something that really scales. Anything manual doesn't really scale. This is why you use serverless, because you don't want to scale servers manually.
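One common way to "embed it in your standard libraries", as suggested here, is to hide the instrumentation behind a shared decorator so individual developers never have to remember it. A hedged sketch follows; the decorator, field names, and trace-ID convention are all made up for illustration.

```python
# Sketch: put instrumentation in one shared decorator so it is applied
# automatically and cannot be forgotten. All names are illustrative.
import functools
import json
import time
import uuid

def traced(service_name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(event, context):
            trace_id = event.get("traceId", str(uuid.uuid4()))
            start = time.perf_counter()
            try:
                return fn(event, context)
            finally:
                # One structured line per invocation, picked up by CloudWatch Logs.
                print(json.dumps({
                    "service": service_name,
                    "traceId": trace_id,
                    "durationMs": round((time.perf_counter() - start) * 1000, 2),
                }))
        return wrapper
    return decorator

@traced("order-service")
def handler(event, context):
    return {"statusCode": 200}
```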
On adding distributed tracing early...Sheen Brisals: (
Episode #20) So we do structured logging that evolved from simple log messages. We have a decent level of logging, so if you look at the logs, we are now able to trace things through. Then at one point we put a monitoring system in place, so we stream the logs to ElasticSearch as well as to the monitoring system. With the structured logging in ElasticSearch, we are able to go through and try to identify any issues, and the engineers work with that. But one area we didn't focus on, that we didn't put in place, was the distributed tracing side of things. That's why I once stated that if you're starting your serverless journey, please start with distributed tracing. You can start with X-Ray, or bring in a third-party tool. It's a really cool thing that gives lots of confidence to the team.
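For readers new to the term, "structured logging" here usually just means emitting JSON instead of free-form strings, so the logs stay queryable once they are streamed to ElasticSearch. A minimal sketch using Python's standard logging module follows; the logger name and fields are illustrative.

```python
# Minimal structured-logging sketch: emit JSON log lines instead of plain
# strings, so they remain queryable after being streamed to ElasticSearch.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            # extra context attached per call, e.g. an order or correlation id
            "orderId": getattr(record, "orderId", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment captured", extra={"orderId": "A-1042"})
```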
On the best way to prepare for an incident...
Emrah Şamdan: (
Episode #12) So the best way to get prepared for an incident is actually to experience it beforehand. But no one wants to experience something bad over and over again, right? And the nice thing we can do with chaos engineering is get ourselves prepared by actually simulating these kinds of problems. You can ask yourself: what if this third-party API that I'm using starts to respond slower? What if the DynamoDB table that I'm leaning on completely stops responding? So you can run these kinds of chaos engineering experiments, and in that case you should know what will happen. And not just from the perspective of what to do, but how to inform the customers, how to inform upper management, how to have, let's say, the retro. You can understand how to respond to these kinds of situations from many different perspectives.
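As a rough, hand-rolled sketch of what such an experiment can look like in code (an editor's illustration, not the tooling discussed in the episode), a function can wrap calls to a dependency so that extra latency or failures are injected under the control of environment variables; all names are made up.

```python
# Hand-rolled chaos-experiment sketch: inject extra latency or failures into
# calls to a dependency, controlled by environment variables, to observe how
# the rest of the system behaves. All names are illustrative.
import os
import random
import time

def chaotic(call):
    """Wrap a dependency call with configurable latency/error injection."""
    def wrapper(*args, **kwargs):
        delay_ms = int(os.environ.get("CHAOS_DELAY_MS", "0"))
        failure_rate = float(os.environ.get("CHAOS_FAILURE_RATE", "0"))
        if delay_ms:
            time.sleep(delay_ms / 1000)        # "what if it responds slower?"
        if random.random() < failure_rate:
            raise RuntimeError("chaos: injected dependency failure")
        return call(*args, **kwargs)
    return wrapper

# Example: wrap a DynamoDB get_item call for the duration of the experiment.
# get_order = chaotic(table.get_item)
```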
On chaos engineering...
Gunnar Grosch: (
Episode #9) Well, the background is that we know that sooner or later, almost all complex systems will fail. So it's not a question of if, it's rather a question of when. So we need to build more resilient systems, and to do that, we need to have experience with failure. Chaos engineering is about creating experiments that are designed to reveal the weaknesses in a system. What we actually do is inject failure intentionally, in a controlled fashion, to gain confidence that our systems can deal with these failures. So chaos engineering is not about breaking things. I think that's really important. We do break things, but that's not the goal. The goal is to build a more resilient system.
On the difference between resiliency and reliability...
Gunnar Grosch: (
Episode #9) Resiliency isn't only about having systems that don't fail at all. We know that failure happens, so we need to have a way of maintaining an acceptable level of operations or service, so that when things fail, the service is still good enough for the end users or the customers.
On the purpose of these experiments...Gunnar Grosch: (
Episode #9) So we do the experiments to find out both how the system behaves and how the organization, the operations teams, for example, behaves when failures occur.
On the differences when testing serverless applications...Slobodan Stojanović: (
Episode #10) There are a few different things, but in general, testing is still the same. You want to check if your application works the way that you want it to work. But some things are not your responsibility anymore. For example, infrastructure is the responsibility of your vendor, such as AWS or Microsoft or someone else, so there's no point in really testing that part, because they have their own testing and things like that. But you still need to be sure that your business logic works the way you expect. And also, all serverless applications are basically microservices working together. Most of the time, you don't have one monolithic application that is just uploaded to AWS Lambda or something like that. Most of the time you have many different functions. For example, in Vacation Tracker we have, I think, more than 80 functions now working together. So it's really important to be sure that all those small services are working together the way we want them to work together, and that our end users have a decent experience and can use our application.
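A small pytest-style sketch of the point being made: test your own business logic and treat the vendor-managed pieces as a boundary you stub out. This is an editor's illustration; the module, function, and repository names are hypothetical.

```python
# Sketch: unit-test your own business logic and stub out the vendor-managed
# boundary (here, a DynamoDB save), rather than testing AWS itself.
# Function and parameter names are hypothetical.
from unittest.mock import MagicMock

def apply_discount(order, percent):
    """Pure business logic: this is what's worth testing."""
    order = dict(order)
    order["total"] = round(order["total"] * (1 - percent / 100), 2)
    return order

def handle_order(order, repo):
    discounted = apply_discount(order, percent=10)
    repo.save(discounted)          # the AWS call lives behind this seam
    return discounted

def test_handle_order_applies_discount_and_saves():
    repo = MagicMock()
    result = handle_order({"id": "A-1", "total": 100.0}, repo)
    assert result["total"] == 90.0
    repo.save.assert_called_once_with(result)
```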
On how software development has changed...Chase Douglas: (
Episode #2) Yeah. So the way that we've always developed software, up until very recently, was that it would, in the end, be running on servers, whether in a data center or in the cloud. But these servers were a monolithic compute resource. A typical architecture might be a LAMP-style stack: you've got a Linux server, and you've got a MySQL database off to the side somewhere, maybe on the same machine, maybe on a different machine. But mostly, as a developer, you're focused on that one server, and that means you can run that same application on your laptop. So we became very comfortable. We built up tooling around the idea of being able to run an entire application on our laptop or desktop that faithfully replicated what happens when it gets shipped into production in a data center or in the cloud. With serverless, everything works a little differently. You don't have a monolithic architecture with a single server somewhere, or a cluster of servers all running the same application code. You start to break everything down into architectural components. So you have an API proxy layer. You have a compute layer that oftentimes is made up of Lambda, though it can include other things like AWS Fargate, which is a Docker-based approach that is serverless in the sense that you don't manage the underlying servers. So you've got some compute resource, and if you need to do queuing, instead of spinning up your own cluster of Kafka machines, you might take something off the shelf, whether it's SQS from AWS, or their own Kafka service, or Kinesis streams. There's a whole host of services that are available to be used off the shelf. And so your style of building applications is around how to piece those components together, rather than figuring out how to merge them all into a single monolithic application.
On the tools for serverless...Efi Merdler-Kravitz: (
Episode #13) There are a lot of tools today that enable you to package, to upload, to deploy your code. You have tools today that help you to monitor and debug. Use them. Don't write something on your own. Don't waste your time on it. And I think one of the first things you need to learn is tools like AWS CloudFormation or Terraform. These are the basic tools, the building blocks that enable any serverless packaging technology to deploy your code and your various resources. So no matter what serverless framework you choose, whether the Serverless Framework or Chalice, in the end, behind the scenes, everyone is using either CloudFormation or Terraform. So I think it's very important to learn those basic building blocks. And I think you need to learn how to automate your tools and automate your testing. So use good testing libraries like Pytest or Jest, and there are many others that are very good. And also use serverless plugins to test some of your flows locally, like DynamoDB or API Gateway.
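On the "test some of your flows locally, like DynamoDB" point, one common approach is to run DynamoDB Local and point boto3 at it via endpoint_url, so the same data-access code can be exercised in tests. A hedged sketch follows; it assumes DynamoDB Local is already running on port 8000, and the table name and keys are illustrative.

```python
# Sketch: point boto3 at a locally running DynamoDB Local instance so data
# access code can be exercised in tests. Assumes DynamoDB Local is listening
# on port 8000; table name and key schema are illustrative.
import boto3

dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://localhost:8000",   # DynamoDB Local, not AWS
    region_name="us-east-1",
    aws_access_key_id="local",              # dummy credentials are fine locally
    aws_secret_access_key="local",
)

table = dynamodb.create_table(
    TableName="orders",
    KeySchema=[{"AttributeName": "orderId", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "orderId", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

table.put_item(Item={"orderId": "A-1", "status": "NEW"})
assert table.get_item(Key={"orderId": "A-1"})["Item"]["status"] == "NEW"
```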
On testing locally...Efi Merdler-Kravitz: (
Episode #13) I think it's a painful point right now in serverless testing. And the only thing I can say right now is that testing locally, just as you said, won't give you the quality that you're expecting. In the end, local testing will give you a certain amount of validation of your code. But I think the best way to increase your testing velocity is to let the developers who build it run their code easily and fast in the cloud. That's the only way to actually test and make sure that the code you wrote is working.
On the challenges of testing locally...Chase Douglas: (
Episode #2) You start with some code, and that code for a Lambda function has this handler that gets invoked. One of the things that I did early on, when I was starting to play with this to try and speed up the iteration workflow, was: "well, I could write a little wrapper script that invokes that handler function with some test data, just to get it running locally without having to deploy it all out." And there came along some tools that help facilitate this mechanism. AWS SAM has sam local invoke, where it will take your function code, spin it up in a Docker container, and run it as though it's in a proper Lambda environment. The Serverless Framework has a similar thing. But even there you have a challenge, where the permissions that your function has are based on the permissions that you have locally on your laptop. Now, a lot of developers have sort of administrator permissions on their laptop; they can, if they want, interact with any resources inside their AWS account. Whereas the function that you're building is tied to a very specific set of permissions, and you don't normally give it full administrator access. So a lot of times you get your code working, and then, as a second step, you have to figure out whether the code still works when you deploy it to the cloud with the permissions set the right way. And then, lastly, you've got the challenge of that service discovery piece: if I'm running on my laptop, how does my function know which DynamoDB table it should be interacting with, or which SQS queue it should be sending messages to? So you've got to solve these problems through some mechanism, and a lot of people come up with their own little test scripts on the side that help here and there. But there's a real challenge around having a workflow that a whole team within an organization can uniformly use and that provides them with the sense that they're bringing the cloud to their laptop locally.
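The "little wrapper script" mentioned here is often just a few lines: load a sample event from a file and call the handler directly. A minimal sketch follows; the module path, event file, and fake context fields are hypothetical, and the caveats from the episode still apply, since this runs with your local credentials rather than the function's IAM role.

```python
# Minimal "wrapper script" sketch: invoke a Lambda handler locally with a
# sample event, without deploying. Module and file names are hypothetical.
# Note: this runs with *your* local credentials, not the function's IAM role.
import json

from src.app import handler   # hypothetical module containing the handler

class FakeContext:
    function_name = "local-test"
    memory_limit_in_mb = 128
    aws_request_id = "local-request-id"

if __name__ == "__main__":
    with open("events/sample_order.json") as f:   # hypothetical sample event
        event = json.load(f)
    print(json.dumps(handler(event, FakeContext()), indent=2))
```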
On serverless security...Hillel Solow: (
Episode #11) One of the things that you see in the move to serverless, and again I don't think it's a serverless-only thing, but I think it applies in serverless more than anywhere else, is that the divide between security people and developers is not really tenable anymore. In serverless, a lot of the security controls that security used to own are now controls that developers own, like configuring IAM roles and setting up VPCs and things like that. So in a lot of ways, we've actually put more responsibility on developers, but we haven't necessarily empowered them in real ways to make security decisions, and at the same time, we haven't given security people a way to meaningfully understand and audit some of those things when they don't necessarily understand what the application does or what the code wants to do. So I think that's been a big change, and I think that's true across a lot of cloud applications, but it's just truer in serverless applications; you're kind of forced to reconcile it. The other thing about serverless applications specifically that we like to talk about is the fact that developers have gone from an application that comprises 10 containers to one that comprises 150 functions. That can create all sorts of nightmares in testing and monitoring and deployments and things like that. But for security there's an interesting win, where you get to apply security policy, IAM roles, runtime protection, at a very fine-grained level, at kind of a zero-trust, small-perimeter level. And if you can do it right, if you can do it at scale and automatically, that could potentially be a huge win, mainly for least privilege, reducing attack surface, and reducing blast radius. Say something goes wrong; my developer accidentally left a back door in a function. But now that function can really only write to one particular table, as opposed to the old world, where that gave an attacker a lot more capability. So I think that's an opportunity that is on the table. It is challenging to capitalize on. Like you said, there's less time; there are fewer gates between developers and running production code. And that means: how do we automate and capitalize on a lot of that value without trying to slow everybody down? That's the big challenge.
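The "can only write to one particular table" idea maps to a per-function IAM policy like the following sketch, shown here as the policy document a deployment tool would attach to the function's role. The account ID, region, and table name are placeholders, not values from the episode.

```python
# Sketch of a least-privilege, per-function IAM policy: this function may
# only write items to one specific DynamoDB table. Account ID, region, and
# table name are placeholders.
INVOICE_WRITER_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:PutItem", "dynamodb:UpdateItem"],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/invoices",
        }
    ],
}
```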
On building fat functions...Brian Leroux: (
Episode #17) I think it's totally appropriate to build out your first versions with just a few fat functions. But as time goes on, you're going to want that single responsibility principle and the isolation that it brings. One last small, interesting advantage of this technique is that the security posture is just better. You have less blast radius. If your functions are locked down to their least privilege and their single responsibility, you're just going to have a way better risk profile for security.
On developers understanding cloud costs...Erik Peterson: (
Episode #6) If you're a SaaS vendor, you know your value delivery chain is built on top of the cloud. That's your cost of goods. That's your gross margin. You need to understand that if you're going to deliver a profitable product to the market, and you want that conversation to be part of your entire organization, because the reality is that the buying decision is being made by your engineering team now, right? They choose: am I going to use this type of instance or that type of instance? Am I going to implement this kind of code or that kind of code? They make a buying decision every moment of every day. Essentially, with every line of code they write, they're making a buying decision, and so you have to think about that. And then it gets even more complicated, because there are so many intertwined services, particularly in the serverless world. I'm sure our listeners here will appreciate our point of view: we believe serverless is the future of all computing. But it's even more powerful because you create these very interesting applications that are composed of lots of different services. It's not just Lambda compute. It's Lambda connected to SNS, passing to SQS, DynamoDB, Kinesis, all these things flowing together. And I'm not just going to the cheat sheet on Amazon and saying, "well, how much does it cost for one hour of compute?" to try to estimate my costs. No, I now have to think through that whole story, and I think it's kind of a shame that most organizations consider the state of the art there to be "well, let's just try it and see what happens." And a lot of times they try it and test it and they go, "Oh, looks like it's gonna cost a couple bucks. Great. Let's ship it." And once it gets into production, it's a much different story, and organizations really struggle with this. It's unfortunate.
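To make "thinking through the whole story" concrete, a back-of-envelope estimate has to add up every hop in the chain, not just Lambda compute. The sketch below is an editor's illustration; the traffic figures, memory size, duration, and every unit price are assumptions for demonstration only, not current list prices, so check the pricing pages before relying on any of them.

```python
# Back-of-envelope cost sketch for a Lambda -> SNS -> SQS -> DynamoDB chain.
# All unit prices and traffic numbers below are ILLUSTRATIVE ASSUMPTIONS,
# not current list prices; the point is that every hop contributes to the bill.
requests_per_month = 10_000_000
lambda_gb_seconds = requests_per_month * (256 / 1024) * 0.200  # 256 MB, ~200 ms avg (assumed)

cost = {
    "lambda_requests": requests_per_month / 1e6 * 0.20,   # $ per 1M requests (assumed)
    "lambda_compute": lambda_gb_seconds * 0.0000167,       # $ per GB-second (assumed)
    "sns_publishes": requests_per_month / 1e6 * 0.50,      # $ per 1M publishes (assumed)
    "sqs_requests": requests_per_month / 1e6 * 0.40,        # $ per 1M requests (assumed)
    "dynamodb_writes": requests_per_month / 1e6 * 1.25,     # $ per 1M write units (assumed)
}
print({k: round(v, 2) for k, v in cost.items()}, "total:", round(sum(cost.values()), 2))
```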
On making sure your developers are aware of costs...Efi Merdler-Kravitz: (
Episode #13) I think that people who come to serverless for the first time sometimes forget how easy it is to scale serverless. In a matter of minutes, you can easily get hundreds or thousands of Lambdas running simultaneously, and millions of requests to DynamoDB. And at the end of the day, you suddenly see a bill of a couple of hundred dollars, and you ask yourself, "What?" As a manager, I check the cost on a daily basis, and I try to understand the trends. I use the Cost Explorer in AWS quite a lot. In addition, we also use our own tools; we have our own monitoring tools, which also give us a cost breakdown. And as part of the code review checklist I mentioned earlier, we ask the developers why they chose, for example, this amount of memory for this specific Lambda, or why they added another index to DynamoDB. Each index costs more money because you are duplicating the data. Or, for example, why they are using Kinesis and not Firehose. So there are many questions we ask along the way when doing the code review. Again, it's not something that can be done automatically; people need to see the code and understand what's going on. But you ask the questions to make sure that developers understand the trade-off, and understand that it can cost a lot of money. And you know, especially for startups, where money's always tight, suddenly paying thousands of dollars per month can be really dangerous. It's not just "Oh no, we'll use the corporate credit card." It can be really dangerous for the startup. So you need to pay attention to it.
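Checking costs daily, as described here, can also be scripted against the Cost Explorer API rather than done only in the console. A hedged boto3 sketch follows; the date range is hard-coded purely for illustration.

```python
# Sketch: pull a daily, per-service cost breakdown from the Cost Explorer API,
# the same data the console's Cost Explorer shows. Dates are hard-coded for
# illustration only.
import boto3

ce = boto3.client("ce")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2020-06-01", "End": "2020-06-08"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for day in resp["ResultsByTime"]:
    for group in day["Groups"]:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(day["TimePeriod"]["Start"], service, round(amount, 2))
```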
On premature optimization...Alex DeBrie: (
Episode #1) In terms of purity versus practicality there, you need to think about your use cases and what matters to you. If you're not gonna have a user-facing application, I wouldn't worry that much about optimizing it, or if your bill's not that high right now, don't worry about optimizing it. Most importantly - I think this is true of serverless or non-serverless, but I think it's been a focus in the serverless community - focus on building a product that brings value. If speed is something that brings value to your customers because they want a quick, responsive app, then maybe focus on speed. Otherwise, focus on building those features in that core experience that your users are really going to care about. Focus on that first rather than some of the optimization techniques.
On best practices...
Michael Hart: (
Episode #19) Don't try and prematurely optimize. Because I'll tell you, at Bustle we have very few, very large Lambdas, and we do billions of invocations a month. We do many, many page views, our latencies are very low, and it's not perhaps as bad as you think. There are some tricks you might need to do; like, we webpack all of our JavaScript into one single file, so that there are no file system calls being made whenever a module is required, and we minify it. So there might be people that go, "Oh, that's kind of a gross hack," but, well, alright, for us that's fine. We've got plenty of developers that know how to do that, are comfortable doing that, and would be less comfortable managing 50 or 100 tiny functions, maybe, and dealing with the ops of that, because it's not free. A function isn't a zero-cost piece of infrastructure. You still need to monitor them, you still need to maybe tune them. There's a whole bunch of things that, for every function you have, you need to think about a little bit, and monitor, and that sort of thing. So yes, even things like that. I think there's a spectrum for best practices, and I would say, try things out first. Be aware that it's a lever you can pull, but try things first and don't stress too much about having it perfect. There's no single way to do these things, basically.