Episode #82: Continuously Improving Serverless Standards at the LEGO Group with Nicole Yip

On this episode, Jeremy chats with Nicole Yip about the continued growth of the LEGO Group's serverless development teams, the evolving audit process they use to improve serverless standards, the challenges they faced adopting those standards, and much more.

Show Notes

About Nicole Yip

Nicole Yip is an Engineering Manager at the LEGO Group. She has been working as an Infrastructure and DevOps engineer for over 5 years mostly as a consultant helping teams of all sizes get their services into AWS. Her roles have often become the catch-all for everything non-application-developer but that matches her passions for AWS, Infrastructure as Code, CI/CD, and Security. Nicole is a frequent speaker at various conferences, with past events including Serverless Architecture Conference, DevOps Con, and ServerlessDays Virtual.

Twitter: https://twitter.com/Pelicanpie88
LinkedIn: https://www.linkedin.com/in/nicole-yip-5b792292/
Medium: https://medium.com/lego-engineering
LEGO: https://lego.com
ServerlessDays Virtual 2020: https://www.youtube.com/watch?v=9oYS_5eL610

Watch this episode on YouTube: https://youtu.be/_09c6maJ3Uc

Transcript

Jeremy: Hi everyone, I'm Jeremy Daly and this is Serverless Chats. Today I'm chatting with Nicole Yip. Hey Nicole, thanks for joining me.

Nicole: Thanks for having me.

Jeremy: So you are an Engineering Manager at the LEGO Group, so I'd love it if you could tell the listeners a little bit about yourself and your background and what you do at the LEGO Group.

Nicole: Sure, so as you said, I'm an Engineering Manager. I joined the company about a year and a half ago as a Senior Infrastructure Engineer and then moved up to Engineering Manager, and I joined the Direct Shopper Technology Team. So we look after www.lego.com, all of the pages where you're browsing for products, completing your order through checkout, and redeeming VIP vouchers, that's what my team looks after. And specifically, I look after the platform. So I head up the platform squad there where we look after the infrastructure and hosting, developer experience, CI/CD, security, and operations of the site. So quite a big remit there, and specifically my background is in AWS and managing production workloads, and that's really where my interest is and what I'm doing at the LEGO Group.

Jeremy: Awesome, well so I am super excited that you are here because I love the LEGO Group, not just because I loved LEGOs as a kid, but also because I started talking with Sheen Brisals a long, long time ago and he was so super excited about the whole serverless process and just building things with serverless. And so it was really interesting to hear the process that the LEGO Group has gone through. And it's been, I think over a year since I talked to him on the show here. And so I'd love to set the stage here because there is this talk that you gave at ServerlessDays Virtual recently about this audit process that you do at the LEGO Group in order to make sure that you're following best practices with serverless and that you're always kind of upgrading. And I wanna get into that but I think to set the stage for everybody to know just how serverless the LEGO Group is, maybe you could give us just a quick timeline of where it started and where you are now from a serverless perspective in your engineering group.

Nicole: Yeah sure. So the story starts a little bit before I joined back in 2017 when we had, it was kind of an event that was the last straw, the straw that broke the camel's back, so they say. And yeah, so there was a launch event that was highly anticipated, we didn't survive. And that led to us looking at options that weren't on-premise, that weren't hosted on-premise. So we started back in 2018, actually scoping out serverless and AWS and seeing if it would work for us. So we migrated a single user-facing service and a couple of backend services over to the cloud, got them running and they were handling high season traffic by the end of 2018. And so yeah, fully in production and ready to go. And that then led on to us making the entire lego.com site serverless. So the pages that I mentioned that are within our team's remit, we moved all of them into Fargate instances and serverless Lambda functions. And so the only on-premise system we have is our source of truth, our warehousing system and we've wrapped that up in Lambda functions so we only talk to that asynchronously.

Jeremy: Nice, so now where are you now? You've mentioned Fargate, you've grown the number of engineers that are working in serverless, so like where are your just rough numbers, like how many Lambda functions do you have? Things like that.

Nicole: Oh, this is fun. So we had four Lambda functions in 2018 during Black Friday, Cyber Monday. In 2019, which was last year, we had just gone fully serverless. The platform was handling high season levels of traffic and at that point we had four Fargate instances and I think it was 36 serverless services. And a serverless service can be made up of many Lambda functions. I think we had around 150 to 200 Lambda functions. No, I think it was around 150 Lambda functions in production at that time. And now that we've gone through another Black Friday, Cyber Monday high season period, we're now over 260 Lambda functions in production with over 56 serverless services and still the same original four Fargate services. So we're growing pretty quickly.

Jeremy: I would say. So going back to the engineers 'cause this is another thing that fascinates me is how different companies group engineers together to manage different services and to make sure that, you know again, everyone's sharing information across teams and so that everyone's working together. And you mentioned you're on the platform squad so you are broken into squads. 'Cause I find it fascinating but I think the listeners might be interested in, how are those squads set up? What's the makeup of them? Like what are the disciplines that are there? You know just how does that work?

Nicole: Yeah so within the Direct Shopper Technology Team we have seven different squads right now. But back when we went serverless, we had two squads. We had a back end and a front end focus squad. So we largely had Lambda focused and serverless focused engineers in one squad and React, Fargate, Express, and Apollo engineers in the front end focus squad. And then once we launched the platform live we reorganized all of the squads into being product-focused. So they split up into five different squads each focused on different areas of the site. So that was one squad for checkout, one squad for VIP rewards, one squad for just exploration and how you discover products, and so on. And these squads are as full stack as we can get them at the moment. So the application engineers, QA engineers, the product owners, and now we're including stakeholders in there as well.

So people from other teams that are actually setting these requirements, these business requirements. So we've got quite a few people in each of those squads but we still have a central platform squad. And the reason for that is it's a brand new platform that literally was written and launched like a year and a couple months ago. So we're still in that pattern of having a dedicated operations or platform squad but we're actively trying to move away from that. We're trying to train up each of those application engineers in the product squads to become operations focused, to have that mindset of building things in a way that makes operations easier, because next year, they don't know it yet, they're going to be operating their services there. They're going to be the ones that are on call. They're going to be the ones who have to wake up in the middle of the night. And yeah, that's where we're going.

Jeremy: That's awesome, I mean I love that idea too, of cross-functional teams. It's always been a big thing that, I shouldn't say always, but more recently has been something that I think works really well. And I love that idea of training up the application developers to be responsible for their applications. They're not responsible for the servers, right, 'cause there are servers but there are no servers. So that's, I think, an interesting approach. I'm gonna have to have you back on to tell me how that works out later on 'cause I'll be interested to see that. All right so this idea of training up application developers to understand how to run or to be able to manage their own applications, their own stacks, and just in terms of making sure that what it is that each team is building now, sort of what you're responsible for on the platform team. I mean a big thing here is ensuring best practices, right? I mean to make sure that everything is written in a way that makes sense, that you're not duplicating a lot of code, that anything that can be shared again, I think the security aspect of it.

What else? I mean just in terms of maybe Middleware and things like that. Like the common things that you need to have, there are a lot of unknowns in the serverless world, right? I mean this is something that didn't really exist five years ago and certainly to the level now that people are using managed services. The amount of knowledge you need, the amount of best practices that has just expanded dramatically. So I'd love to talk about the serverless audit that you did at the LEGO Group. Because again, I may be overusing this word, but I find it fascinating, I love this idea of being able to say, okay in our organization we have a set of best practices and we're now adopting a new technology that we can go to conferences and we can see some of these people talking about this stuff, but in order for it to work for us, in order to make sure that we're the most productive and secure and move as quickly as we can, we need to adopt our own standards. So let's start right at the beginning. So why, and maybe I gave it away, but why did you say, "Hey we need to do this audit of all of our serverless applications."

Nicole: Yeah so it's really two reasons why. So the first one is that in order for us to have written so many serverless functions over the last year and a half, we've had to grow the team. We've got a lot more engineers now. So we went from having around 20 engineers back in like July last year. Now we have around 60. So it's a three times increase in the number of engineers in the squad. And they're at all different levels of their careers. They're juniors, mid-seniors, and not all of them have been exposed to AWS, let alone serverless and serverless best practices. So when we started building out the new platform we had some engineers who knew serverless and it was manageable for them to talk to a single architect and have a shared vision of how we were gonna build our services.

Now with 60 engineers and still one solutions architect that's not scalable. And so the reason to try and the way that we're trying to share out that knowledge and really make sure that we're maintaining that high standard of building functions that are scalable and meet all of those requirements of security and all of that is really just by writing it down and creating a standard. And that's where the audit came from. So we took all of the best practices that we were already doing in our newer services. So every time we write a service, you improve a little bit here or there and we just said, "Okay all of our new services are implementing these things. Let's write that down. Let's have a look at our old services and see what needs to improve." And so really the audit came out of us having a lot of old services that weren't really being maintained or kept up to date with the new standard that we were setting with our new services.

And so then came the fun of auditing these 36 services that we had last year and really calling out all of the areas that we needed to improve them on. And we also took the opportunity to start thinking, "Well we're growing the team so quickly. We can't have a central operations team anymore either." So at the time there was four infrastructure engineers if you could believe it for this global website. And we couldn't handle having from 20 to 60 engineers throwing functions at us. So we took this opportunity to also add an operations related thing. So making them have Canary deployments and basic alerting to implement on their services. And this is really setting the stage for where we want to go next year, of well now you've written these services, now you can own them and operate them. And that's how the standard came together. Both from existing practices and new ones to help operate the services.

Jeremy: Right, now we were talking the other day and you referred to the older services as legacy serverless applications, which I think is quite funny 'cause it is amazing. I mean they're only a couple of years old. You said 2017 so just these services that are just a few years old. I mean so much has changed with serverless over the last couple of years. I mean again, we just finished re:Invent or we're in the middle of re:Invent right now. There's so many new things that are happening and I'm sure these will evolve over time as well. So let's go through the focus areas. 'Cause I thought this was a really interesting approach and if you're listening to this take some notes because honestly this is really good. I mean your team did an amazing job of outlining what these focus areas should be and helping standardize that. And so if anybody's listening, this is just a roadmap for you, if you're listening to do this. So let's go through that. So what were these eight focus areas? What were the main things that you said this is what we need to make sure we get right in our serverless applications.

Nicole: So we started out with, because we're an infrastructure team and we had the frustration of having to get this operations burden off of us and into the squads, we started with alerting, observability, and logging guidelines. So those were the three that are really key for if you want to know what's going on with your service in production. If something goes wrong you need to have observability and traceability through your stack. And you also need the logging to know if particular transactions or batches have been dropped. So those were the three first focus areas. We then started thinking about how do we get engineers to get their own code into production. So we added, we thought about safety and deployment. So deployment mechanism being another focus area that we added in.

So I mentioned before Canary deployments adding in that safety net when you're deploying into production, that's something that we needed to, in order to actually hand off deploying services to the squads. And then the others kind of come out of the best practices of what we were already doing in our newer services. So integration testing, making sure that you've written integration tests that are run in the right parts of the pipeline. So unit tests on PRs, integration tests at that stage as well, and also in QA and acceptance. We also have a standard around secrets management. So we've mandated that we use SSM Parameter Store as our secrets manager but also we want to use it in a specific way so that it's easier to audit when we come through and say, "So which service does this secret belong to?" So we've got a certain naming convention for that. And then there's another thing for Middleware. So Middleware has made a huge difference to our services. Where one of our developers put it really nicely, "Where it takes all of the bits that you always have to do as a developer that no matter what-

Jeremy: The bootstraps stuff, right.

Nicole: Exactly yeah, yeah. The stuff that you're always writing whenever you start up a new service, Middleware takes care of that and then your handler just becomes business logic. And that's essentially what we're trying to focus on, right. We're trying to get our functions to do something with the business outcome. And so taking all of that out of the handler makes it not only easier to read but it tells you exactly what the service does, 'cause you can read the handler and it's all there. And then things like parsing, like validating that the request that comes through is in the expected format. Making sure that responses that go out are consistent, things like that. So Middleware has been amazing. So that was another focus area to get onto all of our services.
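To make the middleware idea concrete, here is a minimal sketch in Python, using decorators to stand in for a middleware library. It is an illustration only, not the LEGO Group's code: the decorator names, the field names, and the response shape are all assumptions. The point it shows is the one Nicole makes, that parsing, validation, and response formatting live outside the handler, so the handler reads as pure business logic.

```python
import json
from functools import wraps


def parse_json_body(handler):
    """Hypothetical middleware: parse the raw JSON body before the handler runs."""
    @wraps(handler)
    def wrapper(event, context):
        try:
            body = json.loads(event.get("body") or "{}")
        except json.JSONDecodeError:
            return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}
        if not isinstance(body, dict):
            return {"statusCode": 400, "body": json.dumps({"error": "expected an object"})}
        return handler({**event, "body": body}, context)
    return wrapper


def require_fields(*fields):
    """Hypothetical middleware: reject requests that lack the expected fields."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(event, context):
            missing = [f for f in fields if f not in event["body"]]
            if missing:
                error = {"error": "missing fields", "fields": missing}
                return {"statusCode": 400, "body": json.dumps(error)}
            return handler(event, context)
        return wrapper
    return decorator


def json_response(handler):
    """Hypothetical middleware: give every successful response the same shape."""
    @wraps(handler)
    def wrapper(event, context):
        return {"statusCode": 200, "body": json.dumps(handler(event, context))}
    return wrapper


@parse_json_body                    # runs first
@require_fields("sku", "quantity")  # runs second, on the parsed body
@json_response                      # wraps only the handler's own result
def handler(event, context):
    # Business logic only: no parsing, validation, or response plumbing here.
    body = event["body"]
    return {"sku": body["sku"], "reserved": body["quantity"]}
```

Calling `handler({"body": '{"sku": "10280", "quantity": 2}'}, None)` goes through all three layers; a malformed or incomplete body is rejected with a 400 before the business logic is reached.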

And the final one was documentation, which I know it's been around for a while. It's not fun but it's so important when you're growing a team. 'Cause what we found directly from one of our engineers was saying that, "You've got new engineers going into these new squads. I've told you we've changed them up two or three times already. We're gonna keep changing them up." And so when you're inheriting services or when you're onboarding a new person, not only are you pairing with them but you can give them a document that says here's exactly what the service does. Here's how you get it up and running, here's all of the service limitations around it. Here's the architecture diagram. And so that was really key to add into the service audit to make sure that not only can we maintain our services, are they built according to best practices, but we can also maintain them with new engineers. So that's where we got to with our focus areas.

Jeremy: Right, so now you come up with these focus areas and this is all based off of learning from multiple years of implementing this stuff. You're building new services, you're learning from that. So you put together this set of focus areas and I've worked with a lot of engineering teams, I've worked on a lot of engineering teams, I've managed some engineering teams, and I can tell you, and I'm sure you know this, that the best way to get somebody not to do something is say, "Hey, here's something that we came up with that you now need to implement and we know better than you so this is the way that we do it." So that's like, I guess the declaration from on high doesn't always go over well. So how did you take all these squads with all these engineers and get this big of a project or this big of an initiative implemented?

Nicole: Yeah so the rollout was really interesting. We actually had to do it in two phases. So the first way we did it was we conducted the audit within the platform squad. We went through the code base, looked at every single service, compared them with each of the checkboxes and the checklist, and raised tickets for each of them. So if we saw that there was no alerting on that service then we raised the ticket, linked it back to our guidelines and said, "Implement these alerts as a basic level." Did the same thing about Canary, same thing for each of the focus areas. So we ended up with a bunch of Jira tickets that then went out to all of the squads that owned each of these services. And then they sat in the backlog.
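The audit pass described here, comparing each service against the checklist and raising a ticket per gap, can be sketched roughly as follows. This is a hedged illustration in Python: the focus-area list is taken from the areas named in this episode, while the service name, the ticket wording, and the `audit_service` function are hypothetical.

```python
# Focus areas as named in the episode; the real checklist is more detailed.
FOCUS_AREAS = [
    "alerting",
    "observability",
    "logging",
    "deployment mechanism",
    "integration testing",
    "secrets management",
    "middleware",
    "documentation",
]


def audit_service(name, implemented):
    """Return one ticket title per focus area the service does not yet cover.

    `implemented` is the set of focus areas a reviewer ticked off for the service.
    """
    return [
        f"[{name}] Implement {area} per the serverless standard"
        for area in FOCUS_AREAS
        if area not in implemented
    ]


# Hypothetical service with only two boxes ticked.
tickets = audit_service("order-status", {"logging", "secrets management"})
for title in tickets:
    print(title)
```

Each resulting title would then become a ticket linked back to the relevant guideline, which is the step Nicole describes next.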

Jeremy: I was gonna say, and how did that go?

Nicole: Yeah, it didn't go too well. We went and talked to all of the product owners. We did demos on why we were putting in each of these tickets for each of these services, but it was still in a time when we had just formed these squads. We had just told them that they were owning these services. And now we were throwing tickets at them saying "You own this service now can you please fix it?" And so that sat for a couple months. And then we got onto the... We then tried to like reinvent our approach and go, okay let's focus on one overarching goal. We just want observability. We just need to know exactly what's going on in our platform. And that went really well.

So we just said to each of the squads, "Can you just implement our monitoring and traceability tool? Here's a guide on exactly how to do it." And we did a couple demos on here's all the benefits, here's all the information that gets pulled out of them and that really started to gain traction. I think within the first three months we had a couple services on, within six months we had even more. And now we're, so this was back in March, when we started that roll out. And now in December, we have all of our services into our monitoring tool. And not only that, the ones who adopted it early are in and creating their own dashboards and using these metrics already and of their own initiative. So we haven't given them any guidance or motivation to go and start actually owning and operating their services. We're still trying to focus on the last few to get your logs and your monitoring and the ones who adopted it early are already taking it and running with it which is amazing and exactly where we wanted to go.

Jeremy: It's amazing. All right, so then the other thing is, and this is probably people asking this question, is there's something called the Well-Architected Framework, you know about this and you know about the Serverless Lens for this. So I think this came slightly after you started working on this auditing process. So what did you do ... now AWS is publishing a bunch of standards saying, "Hey here's how we're supposed to do it." Now I think that it's very broad, right. It doesn't get into the weeds. It is very helpful but so how did you reconcile those eight focus areas you had with the Serverless Lens that came out?

Nicole: Yeah so it's quite interesting actually because preparing for our talk, the AWS Well-Architected Framework has been out for about four or five years already and it was across all of the AWS best practices and services and very, very generic. And it was really just a guideline, you came across it in the certifications and things like that. Then the Serverless Lens came along and it was really focused around best practices in the serverless space, but still something you only really encounter when you're getting guidance from architects or in the certification process. Then they launched the Well-Architected Tool, so different from Framework, where they integrated it into the AWS console and I think that was in 2018, I believe. And that was really just a checklist where you could check through and say you've considered these things. And then the bit that came just after we implemented our audit was the Serverless Lens as part of that tool.

So the Serverless Lens was around when we were creating this. It's just that we weren't overly aware of it. But when the blog post came out saying it was part of the tool, then we realized, oh we could use this and see maybe it can highlight some gaps in our audit process because we weren't really seeing the Serverless Lens as an audit. It was more like guidance on how to write a good service. And then we had to think about what our audit was. Oh, it's guidelines of how we build good services. So there was an overlap for sure. And when we did the comparison we realized actually we have two gaps or like two pillars that we haven't really covered in our serverless audit. And so that's giving us somewhere else to really go and define and call out for our engineering team.

And I mean, on reflection, the reason that those gaps are in our audit is because we're already doing them. So you know how I said that the guidance, like the Well-Architected and the Serverless Lens guidance, is built into the core principles of how you build on AWS well. And so because we've had a core AWS architect in the team from the beginning, because anyone who's gone through the certification process is vaguely aware of each of these pillars, you know operational excellence, performance, cost optimization, reliability, security, we're already building really well in those spaces and so we didn't feel the need to bring our older services up to date in those spaces. So that's why we haven't explicitly defined what we do in those spaces. We already have practices and patterns that are best practice for us. And so long as you're following those patterns you're doing okay. But for completeness, it's really good for us to add it to the standard because, I don't know, maybe we won't always have an AWS architect who knows those guidelines, maybe we switch to a new service, they won't always be the same. So that will be like our next evolution of the audit.

Jeremy: Right and that's one of the things that I actually really liked about how you built your own standard first and then took the Well-Architected Framework and said where does this fit in? And maybe where do we have those gaps? 'Cause I think that is a good way to do it. I mean the Well-Architected Framework and the Serverless Lens specifically, I mean it's really just a bunch of questions, right. It isn't like here's exactly how you do it. It's just have you considered these things and you kind of check them off. So let's just quickly talk about that, where those overlaps were and how you took one of these pillars and said okay we're gonna go a little bit deeper and we're gonna add additional standards to it. So let's start with the operational excellence pillar. So what else did you add, sort of as part of your audit on top of that?

Nicole: Yeah so it's really about what did we define as within this pillar? So within that pillar there were one, two, three, four, five focus areas. So alerting, observability, and logging, so everything around operations, and then also the bits around how to get your service into production. So the deployment part, deployment mechanism and integration testing. So all of those five focus areas were within the operational excellence pillar. And that's because that was always held centrally within the infrastructure team, within the platform squad. And so we really had to define this is how we do it because we were about to get all of the application engineers to do it on their own. So that's why we had so much definition within that pillar.

Jeremy: Right and then the security pillar.

Nicole: Yep, so security pillar would be the secrets management focus area. Really most of the way that we were building was already following some pretty secure patterns. I mean it's an E-commerce shop, we have to be pretty secure from the get-go. And it was really that tidy up of how we manage our secrets.
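The episode does not spell out the naming convention the team uses for secrets, so the following Python sketch assumes a hypothetical one, `/<environment>/<service>/<secret>`, purely to show why a convention makes the "which service does this secret belong to?" question answerable during an audit. The function names and example parameter names are likewise assumptions.

```python
from collections import defaultdict


def parameter_name(environment, service, secret):
    """Build a name under the assumed /<environment>/<service>/<secret> convention."""
    for part in (environment, service, secret):
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return f"/{environment}/{service}/{secret}"


def owning_service(name):
    """Return the owning service, or None if the name breaks the convention."""
    parts = name.strip("/").split("/")
    if len(parts) != 3 or not all(parts):
        return None
    return parts[1]


def audit_parameters(names):
    """Group parameter names by owning service and collect the ones nobody owns."""
    by_service = defaultdict(list)
    unowned = []
    for name in names:
        service = owning_service(name)
        if service is None:
            unowned.append(name)
        else:
            by_service[service].append(name)
    return dict(by_service), unowned


# Hypothetical parameter names; the last one predates the convention.
owned, unowned = audit_parameters([
    parameter_name("prod", "checkout", "payment-api-key"),
    parameter_name("prod", "vip-rewards", "voucher-signing-key"),
    "legacy-api-key",
])
```

With a convention like this, any name that fails to parse is itself an audit finding, since nobody can say which service depends on it.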

Jeremy: Okay and then the reliability pillar.

Nicole: Yep so reliability and the next one, the performance pillar, are the ones where we haven't defined guidance around them because as I said before, our services were built to be pretty performant and pretty reliable. E-commerce throwing them through Black Friday, Cyber Monday.

Jeremy: Right, now is that something within that pillar though, I mean thinking about resiliency or chaos engineering, things like that or just even request rates and stuff like that. That's all stuff though that's sort of being considered within your audit, right?

Nicole: So again within our audit explicitly is areas of the services that we needed to bring up to speed for our older services. The ones that are being built now are following our existing patterns and so they're still quite implicit. So long as you're following one of the pattern, architectural patterns that we've set out, that we've proven work for us, then reliability and performance come along with that. And the next evolution of the audit we'll want to define and explore each of those areas. So we do have chaos engineering on the roadmap and that should be part of our exploration into the reliability pillar of defining what do we do now to increase our reliability but what can we do in the future? And so chaos engineering is going to be part of that.

Jeremy: Right, all right so then cost optimization. The cost optimization pillar. That's one of those things where, I mean maybe, I mean I've always thought about costs, right. 'Cause I've always been very early in companies and having to think about that and infrastructure costs. I think as companies get bigger, engineers don't typically think about that. And I had a guest, Eric Peterson, who said, "Every line of code an engineer writes for the cloud, they're making a cost decision," which I think is a brilliant way to think of it. So from a cost standpoint, what did you implement there?

Nicole: So within our audit we only have the Middleware component and specifically this adds to our cost optimization because we use the SSM Middleware. So SSM or the parameter store, there are two tiers; there's a free and a paid tier. And we've had to opt into the paid tier purely because of the rate limiting that is on the free tier. So by introducing Middy, by caching secrets when a container starts up, rather than on each Lambda invocation, saves us quite a lot. And we're doing a lot more in cost optimization on the infrastructure side, on the supporting infrastructure for our Fargate containers rather than our serverless services because they're pretty performant already. And so that's why we don't have too much more in the cost optimization area for our engineers to pick up on just yet.
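The saving Nicole describes comes from fetching a secret once per container rather than once per invocation. Below is a minimal Python sketch of that caching pattern, not of Middy itself (which is a Node.js library). The `CachedSecrets` class, the five-minute TTL, and the parameter name are assumptions, and the fetcher is a stand-in that counts calls where a real function would call Parameter Store.

```python
import time


class CachedSecrets:
    """Cache secret values in memory so warm invocations skip the remote call."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch      # callable that performs the real lookup
        self._ttl = ttl_seconds  # assumed TTL; tune to how often secrets rotate
        self._clock = clock
        self._cache = {}         # name -> (value, expires_at)

    def get(self, name):
        now = self._clock()
        hit = self._cache.get(name)
        if hit is not None and hit[1] > now:
            return hit[0]
        value = self._fetch(name)
        self._cache[name] = (value, now + self._ttl)
        return value


calls = []


def fake_fetch(name):
    """Stand-in for a Parameter Store lookup; records each remote call."""
    calls.append(name)
    return f"value-of-{name}"


# Created at module scope, so it survives across invocations in a warm container.
secrets = CachedSecrets(fake_fetch)


def handler(event, context):
    # Hypothetical parameter name; only the first call per container hits the fetcher.
    api_key = secrets.get("/prod/checkout/payment-api-key")
    return {"used_key": api_key}
```

Invoking `handler` repeatedly in the same process triggers a single fetch until the TTL lapses, which is what keeps request volume against the parameter store down.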

Jeremy: Right, right and I know your team uses EventBridge quite a bit and I mean there's all kinds of pricing around, so I just think that's an interesting way or an interesting thing for developers to start thinking about is what services I use and how that's gonna affect overall costs when sort of building out and planning that stuff, so very interesting there. Okay, so you started getting people implementing it and it sounds like you're doing pretty well with that which is amazing 'cause that in and of itself getting your engineers to adopt something new is a huge hurdle. But what about some of the challenges people have faced or what are the responses you're getting from these developers? 'Cause I know they're doing it, right, but what has been the impact on performance or morale or just how people are dealing with it?

Nicole: Yeah and as a platform squad, our customers are the engineers. And so their feedback is paramount to how we approach anything really. And that's why we had to change our approach right at the start of instead of putting tickets into their backlog, we had to reframe it as, "Let's just know what's happening in our platform." That's something that everyone can align with, right. I mean if you're gonna put a service out there, you kind of want to, whoever's operating it, whether it's yourself or another team, you want to help them know what's going on. So that was an easy one for them to really align with and get on board with, which was great. And then as I mentioned before, off their own back, they then started using it themselves which is amazing. And also jumping on more of the audit categories and saying, "Okay, well that was really useful. What else is in this audit? What else can I do?" And so like half of the services have alerting on them now even though we haven't made that as our key focus to get rolled out. And so it was really that feedback of silence for the first part. And then they started getting on board and adopting it quicker and quicker. And then not only did they start implementing other parts of the audit, they started adding into the audit. So they added the documentation thing, that wasn't in there from the start.

There's also something about shared code. So we have a monorepo structure and so shared code tends to be using Lerna namespacing. We would refer to it as "at namespace slash shared code package." And that's tricky when you look at how we release code because if you make an update to that package, it then triggers off the pipelines for all of those services. So it gets a bit messy checking what's going on. So we've published our shared code as private GitHub packages. And now one of the squads has added that into the audit of use the GitHub packages if you're gonna use any of the shared code. So that's something that as an infrastructure or platform squad, we don't really see that too much. And so it's really the engineers who are adding in those aspects of things that we wouldn't consider but are definitely part of really like best practices when you're writing services. It's a mesh of infrastructure and engineering, right. And so that's kind of the spectrum of reactions of how we've rolled it out.

And then one team has actually said, "Okay, we're gonna own this. We're gonna take all of these tickets that you've given us at the start of the year. And we're gonna put them into a roadmap and a timeline." And they listed out all of their services, all of the audit categories, added in their own, that were really just opinionated as part of their squad. So they decided that they wanted all of their functions to be named. They didn't want any default exports which yeah good thing to have. And they put it into a timeline, they set themselves deadlines. This was not imposed from any other squad let alone the platform squad and they delivered it. And now all of their services are pretty compliant with the audit. And they inherited, I think, three or four old services that they knew nothing about beforehand. And they learned the services, they put in the test, all of the things to bring it up to speed and now they can operate the services.

Jeremy: I'm just curious 'cause you said I think you had 36 services that you were applying the audit to, and I'm sure they were all in different levels of compliance just because as they were being built. But how did you balance and hopefully you can answer this, but how did the teams balance that idea of going back and refactoring all this code as well as continuing to move forward to support new features and things like that? Like how do you get to that sort of operational maturity but still balance this idea of still releasing new features?

Nicole: Yeah, that's something that we've had to tackle with each squad. So each squad has their own PO with their own, I guess idea of what should be prioritized, delivering features or engineering work, and how you strike that balance. And each squad has found their own balance. I don't think any two squads have been the same. So I mentioned that one squad who created their tech investment roadmap. They were building features all the way up until I think it was September, and then they switched and said, okay we're going to bring in, I think it was like 30% tech investment for every single sprint, until they got it done. And so they were fairly structured in their approach. There are some squads who just brought in one or two tickets every sprint. There were some squads who didn't pick it up at all and have kind of made the last push right at the end. So it was we really left it up to each of the squads to figure out how they wanted to strike that balance with the one overriding goal of, we just want to know what's going on in the platform if you can help us there as a minimum, that's great. And all of the squads helped us achieve that, and some of the squads went way beyond.

Jeremy: Yeah, that's awesome. All right. So let's talk about the Middleware again for a second because this is another thing where, I know it was always a big frustration in the past was, you're bootstrapping new services and you're doing the same thing over and over and over again. And the promise of serverless, or at least one of the promises of serverless, was just write your code, focus on your code. And so there's a lot of different ways that you can bring in some of the bootstrapping stuff whether that's with layers or custom images and now they just released containers, so now you could have your own packages or your own containers or base containers that you could use. So you chose the Middleware route so how was that sort of implemented and what are the things that that does?

Nicole: So we chose Middleware before layers were even introduced. So that's another thing that maybe we'll switch over to layers, who knows. So at the moment what Middleware does is it gives us a consistent way to handle errors, to manage logging and also to manage secrets. So those are the three Middlewares that we have customized and have implemented on most of our services and the rest are covered by the service audit. So it's giving us a consistent logger so that when you're looking in our logging system which merges all of the logs for all of those applications together, you can filter it out consistently and performantly as well, because there's a way that you can do structured logging where you can say, function name is this and logging level is that. And having that consistent across all of the services makes it a lot easier to go and cross-reference and find areas that are affecting multiple services.
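To make the middleware idea concrete, here is a minimal sketch of a wrapper that gives every handler the same structured log fields and the same error handling. It is not the LEGO Group's implementation: the names `buildLogEntry` and `withStandards`, the log field names, and the 500 response shape are all assumptions for illustration, and secrets handling is left out.

```typescript
// Hypothetical sketch: none of these names come from the LEGO Group's codebase.
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogEntry {
  level: LogLevel;
  functionName: string;
  message: string;
  timestamp: string;
}

// Build one structured log entry. Emitting the same fields from every service
// lets a log aggregator filter on functionName and level consistently.
function buildLogEntry(
  level: LogLevel,
  functionName: string,
  message: string,
  now: Date = new Date(),
): LogEntry {
  return { level, functionName, message, timestamp: now.toISOString() };
}

type Handler<E, R> = (event: E) => Promise<R>;

interface ErrorResponse {
  statusCode: number;
  body: string;
}

// Wrap a handler so that logging and error handling are identical everywhere.
// `sink` receives each JSON log line; a real service would likely pass a
// function that writes to stdout.
function withStandards<E, R>(
  functionName: string,
  handler: Handler<E, R>,
  sink: (line: string) => void,
): Handler<E, R | ErrorResponse> {
  return async (event: E) => {
    sink(JSON.stringify(buildLogEntry("info", functionName, "invocation started")));
    try {
      return await handler(event);
    } catch (err) {
      // Log the failure in the shared format and return a generic error response.
      const message = err instanceof Error ? err.message : String(err);
      sink(JSON.stringify(buildLogEntry("error", functionName, message)));
      return { statusCode: 500, body: JSON.stringify({ error: "Internal error" }) };
    }
  };
}
```

A service would wrap its handler once, for example `withStandards("checkout", handler, (line) => console.log(line))`, so individual engineers never have to decide how to format a log line or what to return on an unexpected error.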

Jeremy: Yeah so that's an interesting thing where, when you start standardizing things and again, the audit process I think, again it's clear cut, it makes a lot of sense. Like once people start using it, you said the developers were like, "Oh, well we could do more things." So you mentioned documentation as one of those things where the developers just said, "Hey you know what, this is great. But we need to add some documentation to this process." I mean did the development teams come up with other improvements? Like did they adopt new standards or other things that they kind of pushed back up to the top?

Nicole: Yeah, so I mentioned the thing around shared code and introducing the standard to use published packages. I'm trying to think of if there was more, we added in another one around doing um, no, I'm not sure about that one. Yeah, I think the main one that was added was around how we handle shared code. And it's really exciting to see the engineers are actually going back and feeling that empowerment to go and update or even add to an audit. Some of the focus areas that we had written from a platform perspective they went and rewrote after we had paired and taught them what to do. They wrote it in a way that made way more sense to them because it's different when you've come from an infrastructure background and when you've come from an engineering or development background. The words don't have the same meanings and the way that you phrase things is slightly different. And so rewriting it into their own words, means that it's going to be a lot easier for the next engineer who comes along who has to adopt it. So those have been some of the really great contributions back into the audit.

And we're hoping that it's not something that the platform team have to add to any more at all. We're hoping that it's going to be maintained by the application engineers, who you know, they're the ones who have been building up all of these patterns and practices already. We kind of just put it into our own words and said, "Here we wrote it down," and I'm hoping that as they keep building and improving their services they keep adding it in. I mean one of the squads I know is using Lambda pre-warming on their services. So rather than using provisioned concurrency, they actually have a package that's going and triggering their Lambda to make sure that at least one or two containers are warm. That's something that if they find that it's really giving them an improvement on their cold start times, maybe they'll add it into the audit specifically for services that aren't invoked often, but often enough to have like a really big impact on cold start time. So we're hoping it becomes a living thing that the engineering team keeps and uses as, I guess, a way for them to communicate with each other. 'Cause they're different product squads, right, we need to try and figure out how to share the knowledge that each of the different squads are gaining when they're building new things. So this is one mechanism to try and centralize that.
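The pre-warming approach can be sketched from the handler's side: a scheduled caller invokes the function with a marker event, and the handler returns early so the container stays warm without running business logic. The package that squad uses is not named in the episode, so the `warmer` field and the helper names below are assumptions.

```typescript
// Hypothetical marker field; the real warming package and its event shape
// are not named in the episode.
interface WarmUpEvent {
  warmer: true;
}

// Type guard: true only for objects carrying the warm-up marker.
function isWarmUpEvent(event: unknown): event is WarmUpEvent {
  return (
    typeof event === "object" &&
    event !== null &&
    (event as { warmer?: unknown }).warmer === true
  );
}

type Handler<E, R> = (event: E) => Promise<R>;

// Wrap a handler so warm-up invocations short-circuit before any real work.
function withWarmUp<E, R>(handler: Handler<E, R>): Handler<E | WarmUpEvent, R | "warmed"> {
  return async (event) => {
    if (isWarmUpEvent(event)) {
      return "warmed" as const;
    }
    return handler(event as E);
  };
}
```

The trade-off against provisioned concurrency is that this keeps only as many containers warm as the scheduler pings, which is why it suits services that are invoked infrequently.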

Jeremy: Right, now have you standardized around like a particular programming language or is it just whatever the squad wants to use?

Nicole: So we started off with one or two squads, right. And so we said everything is written in Node.js and the front end has React and Node.js. So that's the languages that have been selected and we're sticking with them.

Jeremy: Right, it seems to be popular. Although some people do like to jump into, Go or Python or something like that.

Nicole: Yeah, I mean every language is fit for a certain purpose, right? So if we run into a situation where one of the other languages suits us better, because it's balancing that need of being able to have any engineer maintain any part of the platform. So having everything written in a standard language or does this thing really need to be performant, does it really need to do a specific thing? In which case it kind of has to be written in another language, then that's when we'll start making those calls. But at the moment we want every engineer to maintain any part of the platform.

Jeremy: Yeah and I like that. I like the fact that the front end and the back end are both essentially written in JavaScript. It's just that way, especially with cross-functional teams, I mean even if somebody has to go in and make a small change on either side, I mean it's just good to have. So all right that's awesome. So what about sort of advice to others? And then when I say others, I mean other engineers maybe, that are getting an audit or some sort of thing like, "Hey you need to do this." But also to the teams that were like you, that are trying to implement these standards to make the quality of your applications better, to make the standards better, the best practices better, so what's some of that advice that you might be able to give to people?

Nicole: I mean, the core advice, whether you're on the receiving end or the implementation end of this kind of thing is empowerment. So feel that you are empowered to really make a difference and make your services better. And don't feel like anything is ... don't try to make it an imposed standard, make it something that is there to help. Because the main thing around this checklist is if you do these things, you can own and operate your services. You'll be able to find the bugs quickly when you're under pressure. That's what we're trying to empower our engineers to have in their mindset, of you can build good services and here's our tips and tricks on how to do that. If you follow these things, we know that they work. And so it's really about taking that line of we're here to help, we're here to empower you to do good things and do great things. And not have to try and guess that, if I do this, is that going to work? Or if I do this, will that work with the whole stack that we've got? Because no one can know every part of the platform, right.

And so that's really been the consistent message that I would want to get across with this process of, if you're implementing it, make sure it comes through from an empowerment and an enabling perspective of, this is what you should know and here's some starting points on how to do it. And if you're on the receiving end, maybe don't wait to be on the receiving end. Maybe start writing down what your best practices are and say, start sharing it out, start sharing the knowledge because once you start leveling up as a team, then you know your whole platform benefits, right. So yeah and also keep it as a living document, best practices don't stay still. So I mentioned that we implemented Middleware before layers were even introduced. We now have a layer as part of the standard because our monitoring system has written a layer. And so the standard has been evolved to implement this layer, implement Middleware for other things. Maybe the things in Middleware will move into layers as well. So keep these up to date, keep them as the document of, here's what a good service is for your team. And then any new joiners, anyone looking at the legacy serverless applications if you have them, knows what to do.

Jeremy: Love that, great advice. All right, last thing. You said it's a living document or it's an ongoing thing. You mentioned chaos engineering on your roadmap. What are the other things that the LEGO Group is considering adding to this process?

Nicole: So this process started at the start of this year, right? It was a big table in Confluence. It's very hard-coded. It's very manual. The most automation we have is we've linked Jira tickets, so that they automatically get ticked off when they're done. It's not great. The actual implementation of the process is not great. The outcomes are amazing. So what we want to do with this, is actually figure out how we can show all of our engineers, all of the services they own, what state they're in according to our audit. So you can maybe give each service a score and say you're lagging behind a little bit on operations. So it's written really well but you're gonna find it a bit hard to operate. Or maybe it's written well, it's really easy to operate, but a new person will have no idea what's going on when they join so get that documentation up. We want to try and figure out how to get that feedback back to our squads in an application that's not yet another application, right. So we're trying to figure out where that fits within our ecosystem. And then I've got this secret, it's not really a secret anymore. I talked about it in a conference. I want to put a leaderboard in. So I want to make it competitive. I want to make it a point of pride of, my service is the best in the platform, that kind of thing. And I'm hoping that that will put a bit more fun around an audit process, you know, and really drive our platform to continually be better. And I mean if a new team introduces a new standard, that then tanks all of the other squads' scores, I mean I'm all up for it, you're continually improving, right.
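The scoring and leaderboard idea described above could be sketched roughly as follows. This is only an illustration of one possible design: the category names, the equal weighting, and the functions `score`, `laggingCategories`, and `leaderboard` are all assumptions, not something the LEGO Group has described building.

```typescript
// Hypothetical audit categories; the real audit's categories are only
// partly named in the episode.
const CATEGORIES = ["observability", "alerting", "documentation", "testing"] as const;
type Category = (typeof CATEGORIES)[number];

interface ServiceAudit {
  service: string;
  squad: string;
  passed: Record<Category, boolean>;
}

// Score is the percentage of categories passed, weighted equally (an assumption).
function score(audit: ServiceAudit): number {
  const passedCount = CATEGORIES.filter((c) => audit.passed[c]).length;
  return Math.round((passedCount / CATEGORIES.length) * 100);
}

// Categories a service still needs to work on, for feedback to its squad.
function laggingCategories(audit: ServiceAudit): Category[] {
  return CATEGORIES.filter((c) => !audit.passed[c]);
}

// Highest score first; ties are broken alphabetically by service name.
function leaderboard(audits: ServiceAudit[]): { service: string; score: number }[] {
  return audits
    .map((a) => ({ service: a.service, score: score(a) }))
    .sort((x, y) => y.score - x.score || x.service.localeCompare(y.service));
}
```

Because the score is computed from the category list, adding a new category lowers every service's score until it complies, which matches the "new standard tanks everyone's scores" effect Nicole describes.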

Jeremy: There we go. Awesome. Well, that's amazing and like I said, honestly if you're listening to this and you're thinking about anything even remotely close to this, this is a great sort of roadmap for you. And I think the talk from ServerlessDays Virtual is up. You can get that, I think search serverlessdays.io or virtual.serverlessdays.io. You can find that. So again, Nicole thank you so much for being here, for sharing this, for continuing to do this great work. And again putting this information out there is awesome because again, I think it can help a lot of teams. So if people want to get a hold of you, how do they do that?

Nicole: Yep so the best place to find me is on Twitter @Pelicanpie88. I know it doesn't resemble my name but I picked it several years ago and LinkedIn as well if you want to talk to me on a professional level, but mainly Twitter for AWS and serverless stuff. And I guess the main message here is DevOps is a journey, infrastructure is a journey, and I'm just kind of sharing where we are right now. And I'd love to hear more stories about where everyone else is at the moment in their journey.

Jeremy: Yeah, it's amazing. All right and then don't forget to check out the LEGO Engineering Medium blog, that's at medium.com/lego-engineering and then of course, lego.com. This will probably be after Christmas, you won't have time to put your orders in but maybe some January LEGO purchases for the kids and family. So again, Nicole thank you so much. I'll put all this stuff in the show notes. It was great having you.

Nicole: Thank you for having me, this was a great talk.

What is Serverless Chats?

Serverless Chats is a podcast that geeks out on everything serverless. Join Jeremy Daly and Rebecca Marshburn as they chat with a special guest each week.
