The fourth episode in a five part series on the Big Bang rewrite completed at Saltside in 2014/15. This episode discusses how we split the monolith and created a new architecture to support Saltside well into the future.
Adam Hawkins presents the theory and practices behind software delivery excellence. Topics include DevOps, lean, software architecture, continuous delivery, and interviews with industry leaders.
[00:00:00] Hello and welcome. I'm your host, Adam Hawkins. In each episode I present a small batch with the theory and practices behind building a high velocity software organization. Topics include DevOps, lean, software architecture, continuous delivery, and conversations with industry leaders. Now let's begin today's episode.
[00:00:26] Hello again, Adam here to continue the Saltside Chronicles. Previous episodes covered the run-up to the rewrite and the factors that created the big bang rewrite. This episode is a deep dive into the technical plan for paying back technical debt and creating the next generation architecture.
[00:00:48] Allow me to begin by describing the system as it was and the specific challenges we faced. When the rewrite started, there were a handful of services without any real bounded contexts. There was a monolithic Rails app that served the desktop website and handled an extraordinary amount of back-office functions. There was a web service that served the mobile and feature phone versions of the website.
[00:01:11] This integrated with the monolith through a mix of APIs and shared libraries. There was also an admin service that managed the entire customer support end of the business. This included the business critical features for reviewing, approving, and rejecting ads. The admin service communicated with the same DB as the monolith. As you can imagine, there were multiple problems with this architecture.
[00:01:34] The first is the config loading, which I discussed in depth in episode two. Next is the horrendous and egregious boundary violation of sharing the same database across different services. That folds into the next issue: there were no real APIs for core business functions like searching ads or signing up users.
[00:01:53] Lastly is the gravity of the monolith. The monolith just did a lot of stuff. The more I learned, the more surprised I was. Even in hindsight, I'm surprised by just how much it contained. We discovered entirely unknown business support features that were split off into separate services over the course of the
[00:02:12] rewrite. This is not an exhaustive description of the system. All of the issues discussed in the previous episodes are layered on top and even stem from the underlying technical architecture. There were many ways to redesign this system from the ground up. My goal was to achieve these business and technical objectives.
[00:02:32] One, a single API built, deployed, and operated by the platform team that supported customer-facing applications like the websites and mobile apps. Two, separate the business logic and configuration. Three, enable a microservice architecture inside the platform team. Four, speed up the time to launch a new market by making config editable by hand. Five, use independent infrastructure for each market-service-environment combination. Six, automate deployment of each market-service-environment through shared tooling.
[00:03:07] Seven, use shared tooling for every service built by the platform team and the web team. And eight, last but not least, avoid architecting the team into a corner. I was confident that we could achieve all these goals, especially since the team had already built the system once, and we had a clear picture of the current technical issues.
[00:03:27] So I got to work sketching out and iterating on an architecture that could meet these objectives. The first piece of that puzzle was how to reorganize the current state of services into independently built, deployed, and operated services with data autonomy. Let's start with the design considerations for achieving that.
[00:03:47] A big factor in the final architecture was setting bounded contexts for each specific business domain, defining APIs, and avoiding cycles in the final service graph. Change rate was another factor. Certain areas of the system changed daily, some monthly, and some not even for years. I did not want to mix these concerns because tech debt matters more or less depending on the velocity.
[00:04:10] Another factor was ownership. I use the term loosely because of the org structure at the time: the platform team managed all the APIs and infrastructure, the web team owned all the web experiences, and the mobile team built the Android and iOS apps. I led the platform team.
[00:04:30] We were responsible for all the services that composed the quote-unquote backend. The scale of product requirements made it infeasible for one team member to operate at max capacity across all the different parts of the product domain. Examples of this are sending transactional emails to users, maintaining config, payments, or handling ads.
[00:04:50] I wanted bounded contexts that represented areas different members of the backend team could own, in the sense that they'd be primarily working in those areas. These people would pair with others, review PRs, and make sure that changes met product requirements and were usable by downstream consumers.
[00:05:08] And most importantly, met our engineering standards. I'm still surprised to this day that such a complex system really only had two core data objects: ad and user. Every one of these bounded contexts would operate on one or both of these in some way. So I did not think it was wise to allow each service to have its own representation of these two concepts.
[00:05:36] I know some domain-driven design people are shaking their finger at me, but oh well. Allowing each service to own its ad representation could not work because of the relationship between ad and config. This acted as a guardrail against creating services that map exclusively to entities; instead, services focused on verbs instead of nouns.
[00:05:59] So how would all these services communicate? Bear in mind, this was 2015. The company had used HTTP and JSON for years, and that was the status quo. I thought that using HTTP and JSON for all our internal services would be a miscalculation. At the time, using HTTP and JSON would require that every service write API specs for requests and responses, and then the underlying validation
[00:06:24] code. This opens the door to bikeshedding over proper JSON design, which in turn leads to the creation of client code repeated in all consumer services. I wanted to avoid all that and instead focus on what mattered: easy communication across services. That led me to investigate RPC frameworks. We needed to support Ruby, Go, and Node.js.
[00:06:48] I investigated Protobuf, but ultimately settled on Thrift. Choosing a binary format and generated code model effectively mitigated all those concerns and got the team focused on writing specs in the Thrift IDL. This was an amazing choice for multiple reasons. We could dry run the system design and topology simply by writing out the Thrift files,
[00:07:08] then tracing flows across services and verifying there were no cycles. Plus, we could dry run the RPCs and structs themselves, such that the data required to complete a request was provided by the structs. Using Thrift also helped in defining differences in bounded contexts. We defined common ad, user, and config structs in Thrift. Services exchanged these structs and then decorated them internally as needed.
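Here is a hedged sketch of what those shared Thrift definitions might have looked like. The names and fields are illustrative reconstructions, not Saltside's actual IDL:

```thrift
// common.thrift -- shared structs exchanged by every backend service (illustrative).
namespace rb Saltside

struct User {
  1: string id
  2: string email
  3: string phone
}

struct Ad {
  1: string id
  2: string user_id
  3: string category_id
  4: string title
  5: string description
  6: map<string, string> fields   // config-driven, category-specific data
}

// Services focused on verbs rather than nouns, exchanging the shared structs.
service SearchService {
  void index_ad(1: Ad ad)
  list<Ad> search(1: string category_id, 2: string query)
}
```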
[00:07:33] That gave us some aspects of domain-driven design, but internally this led us to use value objects by default in our architecture. Another factor was infrastructure isolation. We needed to move away from a single instance serving all markets to an instance per market. This meant moving from a single infrastructure to N infrastructures, where N is the number of markets. Engineering was a bottleneck in this aspect of the business strategy.
[00:08:02] The business strategy called for moving into new markets when the time was right. The previous market launch required months of just writing configuration in SQL. So I wanted to move engineering off the critical path here, in a way that a new market could be quickly created whenever the business desired without impacting any of the other markets.
[00:08:23] This move to per-market infrastructure meant embracing automation and configuration such that market X could run at scale X and market Y at scale Y. This meant customizable vertical sizing of application instances and backing data stores like MongoDB or Elasticsearch, and customizable horizontal scaling for each market and each service.
[00:08:45] This serves as a nice segue between software architecture and infrastructure architecture. We decided that the service was the unit of design in the system. I say service in the twelve-factor sense. A service was composed of one to N processes. Some processes could be web servers; they could be Thrift servers, background job processes, cron processes, et cetera.
[00:09:09] You know, the process itself made no difference. It was just a process. The only things that did matter were load balancing (was it HTTP or TCP), vertical size, horizontal scaling (was it an autoscaler or fixed), ownership of data stores, language-agnostic infrastructure, and independent development and deployment.
[00:09:31] I say independent development and deployment in the sense that a developer could clone the Git repo for one service, run the test suite, and then, if the tests passed, deploy straight to production. The test suites covered all functional requirements and guarded against regressions. Given that any dependent service had a strict contract defined in the Thrift IDL, it was necessary and sufficient to say that if the service calls external service X, and service X returns Y, then A, B, and C should happen. This behavior is testable completely through mocks and stubs. We had no reliance on an integrated environment like QA or staging. That was magical.
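As a rough illustration of that contract-driven testing style, here's a minimal RSpec-flavored sketch. The class, client, and RPC names are hypothetical, not Saltside's actual code:

```ruby
# Hedged sketch: testing against a Thrift contract with a stub instead of a live service.
require 'rspec/autorun'

# A thin application class that depends on a generated Thrift client.
class AdLister
  def initialize(search_client:)
    @search_client = search_client
  end

  # Contract: given a category id, the search service returns a list of ad ids.
  def ads_for_category(category_id)
    @search_client.search(category_id)
  end
end

RSpec.describe AdLister do
  it 'returns whatever the search service contract says it returns' do
    search_client = double('SearchService::Client')
    allow(search_client).to receive(:search).with('cars').and_return(%w[ad-1 ad-2])

    lister = AdLister.new(search_client: search_client)

    expect(lister.ads_for_category('cars')).to eq(%w[ad-1 ad-2])
  end
end
```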
[00:10:19] Back to the infrastructure and software architecture. Docker had hit 1.0 around this time. We bet Docker would unlock three capabilities for our team. One, language- and runtime-agnostic infrastructure and tooling, by designing everything for containers. Two, removing language and runtime limitations from developers. Thrift supported the languages we used at the time, some we were interested in, and those we had zero interest in at the time, like Java, but that did come in handy years down the line.
[00:10:41] So if Go was the right choice for a service, then Go could be used. Just figure out how to build a Docker image, and then you're good to deploy. And three, development environment standardization. It would be impractical to require every developer to maintain working environments for N different languages on their machine.
[00:11:01] Containerizing the development environment solved that: if a system had Make and Docker, then they were good to go. Docker did deliver on all three of these goals, but with a caveat on the first point around runtime-agnostic infrastructure. Docker had just hit 1.0, so this was the very early days of the ecosystem. The best you could do was build a Docker image and push it to a registry.
[00:11:26] Getting that container running in production was up to you. There were no easily available, community-supported, production-ready container orchestration tools. Kubernetes was still in development, and Mesos was probably the top contender at the time. Mesos, remember that? What about DC/OS? How about that one today?
[00:11:45] We take Kubernetes YAML, its support, and its stability for granted. Now it's easy enough to just pay for Kubernetes: AWS has EKS, GCP has GKE, Azure has AKS, and even DigitalOcean has a hosted offering. None of this existed at the time. Our choice was to bind ourselves to a fledgling Mesos or DC/OS open source project that might meet our requirements, or bite the bullet and build something Saltside-specific from scratch.
[00:12:17] We had no intention of building an entire container orchestration system. We just needed a way to get Docker containers running behind a load balancer. Our previous deployment system was a simple blue-green strategy using golden images, so moving to container orchestration would be a big jump in terms of complexity, let alone whether the orchestration system even worked. Ultimately, we decided to build our own solution to handle our per-market service and environment requirements. It was the less risky approach, and it didn't preclude us from adopting a container orchestration system in the future. Our solution was named Apollo. I say our solution, but the credit truly goes to my good friend Peter. Peter and I had brainstormed on what a solution would look like. It was largely inspired by Heroku. There was an apollo.json file that combined the Procfile with config for vertical and horizontal scaling in each market and environment pair, as well as what data stores were required, like MongoDB, MySQL, et cetera. Again, this was not a container orchestration system. This was a solution for automating infrastructure that could run processes via a container.
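To make that concrete, here's a hedged sketch of what an apollo.json might have contained, reconstructed from the description above. The field names, commands, and instance types are illustrative, not the actual file format:

```json
{
  "name": "core-service",
  "processes": {
    "web":    { "command": "bin/web" },
    "thrift": { "command": "bin/thrift-server" },
    "worker": { "command": "bin/worker" }
  },
  "data_stores": ["mongodb", "elasticsearch"],
  "markets": {
    "market-x": {
      "production": {
        "web":    { "instance_type": "m3.large",  "count": 4 },
        "thrift": { "instance_type": "m3.medium", "count": 2 },
        "worker": { "instance_type": "m3.medium", "count": 1 }
      },
      "staging": {
        "web":    { "instance_type": "t2.small", "count": 1 },
        "thrift": { "instance_type": "t2.small", "count": 1 },
        "worker": { "instance_type": "t2.small", "count": 1 }
      }
    }
  }
}
```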
[00:13:30] The solution generated a CloudFormation template that loaded EC2 instances with whatever was required to run the container. The instances ran in an auto scaling group that was connected to a load balancer. Rolling deploys happened with a script that used HAProxy to swap incoming connections between the running container and the next version.
[00:13:49] This provided zero-downtime deploys. Infrastructure-level changes, say to a backing data store or to vertical or horizontal scaling, were deployed by a CloudFormation stack update. It worked out amazingly well. We used this tooling and infrastructure for a long time after the rewrite, but we did end up migrating to Kubernetes years later after the project had stabilized. That's an entirely separate episode of the Saltside Chronicles.
[00:14:17] Docker plus Apollo provided us everything we needed from the infrastructure side. All we had to do now was just write the services. We ended up creating what you would call a hub-and-spoke design, although I think of it more like a solar system, with a sun and orbiting bodies. Recall the state of the system before this whole thing started.
[00:14:38] There was a monolith that did all sorts of stuff, a web service, and an admin service. Each service shared the same data stores and coordinated through client libraries for different things. There was no true ownership of different things; different concerns smeared across the system like cream cheese on a bagel, except it wasn't tasty.
[00:14:56] It was the opposite: hard to swallow and completely unenjoyable. My mission was to untangle this mess by creating a service topology that relied on boundaries and did not have circular dependencies. Let me first address the god object at the root of many of these issues: the config. Recall that config has all the information about a market, such as the locations on a map, the categories an ad may be posted in, and what data each ad requires. Given its importance, it's global state that must be accessible to any part of the system.
[00:15:29] The previous system stored config in a MySQL database. Code accessed the database via a client library whenever and wherever it was needed. This resulted in tens or even hundreds of database queries, depending on the context. Also, config rarely changed. I think this turned out differently than originally designed.
[00:15:48] I gather it was assumed that config could be changed on demand through some sort of market admin dashboard. That never happened because of how intertwined the config and business logic were. Changing config required changing business logic, which meant code changes. The end result was that config was static in practice.
[00:16:06] It was dynamic in the sense that systems had to understand that different config flags could trigger different code paths, but not in the sense that config would vary in real time. Also, the quote-unquote whenever-wherever database architecture was not suitable for a mobile app. So we needed something different.
[00:16:25] The fact that config was static enabled us to switch to a load-once, use-always model. Config became simple structs read at boot time, stored in memory, and then accessed that way. Zero latency, zero network overhead, and zero external dependencies. We created a DSL to write the config in a way that made sense to engineers and product owners.
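I can't reproduce the real files here, but a minimal sketch of that kind of config DSL in Ruby might look something like this. The class, field, and market names are hypothetical:

```ruby
# Hypothetical sketch of a market config DSL: plain Ruby, evaluated once at boot,
# then held in memory as simple structs for the lifetime of the process.
class MarketConfig
  Field = Struct.new(:name, :type, :required)

  attr_reader :market, :fields, :features

  def self.define(market, &block)
    config = new(market)
    config.instance_eval(&block)
    config
  end

  def initialize(market)
    @market   = market
    @fields   = []
    @features = {}
  end

  def field(name, type:, required: false)
    @fields << Field.new(name, type, required)
  end

  def feature(name, enabled:)
    @features[name] = enabled
  end
end

# A market's config file reads like this; product owners could review it on GitHub.
CONFIG = MarketConfig.define(:market_x) do
  field :title,   type: :string,  required: true
  field :mileage, type: :integer, required: false

  feature :paid_promotions, enabled: true
end

puts CONFIG.features.fetch(:paid_promotions) # => true
```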
[00:16:47] The same strategy applied to the backend services, web app, and mobile app: load config once, and then update only when instructed to do so. Just a sidebar here. Product owners actually reviewed PRs for config correctness, since it was expressed in ways that engineers and product owners understood. It was also readable enough that a PO could make slight changes via the edit button on GitHub.
[00:17:11] That was a thousand percent better than the previous workflow. Anyway, this implied another change. Recall that there was a single instance for all markets. This required parameterizing every bit of code with the market, since that informed which config to use. This created messy internals, since there were many different ways to pass quote-unquote market around through the different layers. Some parts used HTTP headers, some parts identified it from the ad
[00:17:40] in question, and others accepted it as a string function parameter. The move to one instance per market implied that code would only serve one market for the lifetime of the process. As a result, there was no need for a market parameter in any external API or other function call. We also took significant steps to replace checks like if market equals X or category ID equals Y with feature flags
[00:18:09] in config. That made it easier to understand what the intended business purpose was. It also made it much easier to test. Surfacing a feature flag exposed the logic to all applications in the system so that they could be updated and act accordingly. I was ruthless in this area and encouraged all developers to be as well.
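Here's a hedged before-and-after sketch of what that shift looked like. The market names and flag are made up, and CONFIG refers to the in-memory config object from the earlier sketch:

```ruby
# Before: behavior inferred from the market itself, coupling code to specific markets.
def show_promoted_ads?(market)
  market == :market_x || market == :market_y
end

# After: behavior driven by an explicit feature flag in that market's config.
# CONFIG is the config struct loaded once at boot (hypothetical name).
def show_promoted_ads?
  CONFIG.features.fetch(:paid_promotions, false)
end
```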
[00:18:29] Any PR that inferred behavior from market M or any other aspect of config was immediately rejected. This approach plagued the previous system; I would not repeat it. So at this point, we had moved away from the idea of dynamically loaded, real-time config. Instead, we settled on config loaded once and stored in memory. Such an amazing change.
[00:18:50] Now, the next question is: who owns the config? Config is the sun in my solar system analogy, or the hub in the hub-and-spoke model. Config combined with ad and user created the holy trinity of the system. They cannot be separated. So I put a bounded context around these three entities. This became the core service.
[00:19:12] The core service was the source of truth for all configuration, ads, and users. It contained three processes: a web server for the API used by the mobile and web applications, a Thrift server for internal RPCs, and a background job processor. The design focused on creating the user-facing API and encapsulating common concerns across internal backend services.
[00:19:35] Nonessential logic was implemented in a command-query architecture. The core service didn't index ads for searching; it told the search service to do that. The core service didn't handle reviewing ads; it told the review service to enqueue ads for review. The core service didn't send welcome emails; it told the email service to do that. In this way, the core service centralized as much of the business workflow as possible and delegated the implementation to other services.
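In practice, each of those delegations was just a call on a generated Thrift client. Here's a hedged sketch from the core service's side using the Ruby thrift gem, where the host name and the Saltside classes stand in for code generated from the earlier IDL sketch:

```ruby
require 'thrift'

# Saltside::SearchService::Client and Saltside::Ad stand in for generated code;
# the host name is made up for illustration.
socket    = Thrift::Socket.new('search-service.internal', 9090)
transport = Thrift::BufferedTransport.new(socket)
protocol  = Thrift::BinaryProtocol.new(transport)
client    = Saltside::SearchService::Client.new(protocol)

ad = Saltside::Ad.new(id: 'ad-1', user_id: 'u-1', category_id: 'cars',
                      title: 'Red bicycle', description: 'Barely used')

transport.open
client.index_ad(ad)   # the core service delegates indexing to the search service
transport.close
```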
[00:20:03] In some cases, other services reported back to the core service for further processing. This is where the hub-and-spoke model fits: most traffic flows from the hub out to the spokes, and sometimes it flows back. Let me demonstrate using a key business workflow. It goes something like this: a user posts an ad, and the ad is submitted to the review team.
[00:20:25] The customer support team reviews the ad and then decides to approve or reject it. If the ad is approved, then it's live on the site. If it's rejected, then the user is notified and requested to make changes. Once the user edits the ad, the review cycle repeats. I must interject something at this point.
[00:20:43] This is an extremely terse summary. Saltside had hundreds of different business rules for this process. Ads could be auto-approved or auto-rejected. A reviewer could reject an ad for a multitude of reasons. There were rules around automatically re-enqueuing ads. There were rules around delay times, spamming, and all sorts of stuff.
[00:21:05] In other words, this area of the system was extremely complicated. It was so complicated that no one person, or even group of people, including engineering and product, fully understood the requirements. They were only discovered by comparing the new system to the old one. This was a powerful case for creating an architectural boundary between the core service, the review queue, and the internal admin app.
[00:21:28] So here's how we did it. The core service called the enqueue-ad RPC implemented by the review service. The review service managed the review queue, so it implemented all the business rules that could automatically approve an ad, reject it, or mark it as requiring manual review, and it also tracked business KPIs,
[00:21:47] like how long an ad waited in the queue. It provided a set of RPCs to access and manage the queue. The admin service represented the middleman between the review queue, the core service, and the customer support team handling all the back-office work. It had two processes. One was a Thrift server with RPCs for the review service.
[00:22:08] The second was an HTTP and JSON API used by the admin app. The admin app was the internal tool used by the customer support team to review ads and a host of other critical business functions. Once the customer support team decided to approve or reject an ad, the admin service called the core service's approve-ad or reject-ad RPC and told the review service to remove the ad from the queue.
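Putting that loop together, a hedged reconstruction of the RPC surface in the Thrift IDL might read like this. The service and method names are inferred from the description, not the actual files:

```thrift
// Hypothetical reconstruction of the admin-loop RPCs described above.
include "common.thrift"   // the shared Ad struct from the earlier sketch

service CoreService {
  void approve_ad(1: string ad_id)
  void reject_ad(1: string ad_id, 2: string reason)
}

service ReviewService {
  void enqueue_ad(1: common.Ad ad)           // called by the core service
  list<common.Ad> pending_ads(1: i32 limit)  // called by the admin service
  void remove_ad(1: string ad_id)            // called once a decision is made
}
```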
[00:22:35] This separated all the different concerns, primarily cleaving the user-facing and internal-facing flows into different code islands. It solved all of our problems and enabled new capabilities. In fact, it probably saved the rewrite a few times. Two engineers were entirely responsible for this whole admin area: one platform engineer and one web engineer. The web engineer needed a running version of the admin service to develop their application. All backend services
[00:23:05] used hexagonal architecture, so the platform engineer built a development version of the admin API with a fake review service so it could be shared with the web engineer. All of this was independent of any staging environment or any other parallel work. This established a fast and isolated feedback loop between the two of them and the product owners that enabled them to discover business requirements that no one knew existed at the start.
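Here's a minimal sketch of that hexagonal (ports-and-adapters) idea, assuming hypothetical class names: the admin API depends on a review-queue port, and a fake adapter stands in for the real Thrift client during development:

```ruby
# Fake adapter implementing the same interface as the generated review service client.
class FakeReviewService
  def initialize
    @queue = []
  end

  def enqueue_ad(ad)
    @queue << ad
  end

  def pending_ads(limit)
    @queue.first(limit)
  end

  def remove_ad(ad_id)
    @queue.reject! { |ad| ad[:id] == ad_id }
  end
end

# The admin API only knows about the port, not which adapter sits behind it.
class AdminAPI
  def initialize(review_service:)
    @review_service = review_service
  end

  def ads_awaiting_review
    @review_service.pending_ads(50)
  end
end

# In development the API runs against the fake; in production the same class is
# wired to the real generated Thrift client.
fake = FakeReviewService.new
fake.enqueue_ad({ id: 'ad-1', title: 'Red bicycle' })

api = AdminAPI.new(review_service: fake)
puts api.ads_awaiting_review.inspect # => [{:id=>"ad-1", :title=>"Red bicycle"}]
```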
[00:23:30] This architecture paid dividends for years to come in other areas. The rules around approving and reviewing ads had some of the highest change rates across the system. Admin-focused product owners could quickly iterate on this process, independent of everything else. So at this point, the design had the following services: core service, review service, and admin service. The core service managed config, ads, and users. The review service managed the review queue. The admin service coordinated things between the admin app and API, the review service, and the core service. This became the admin loop in the system, because ads flowed from the core service through the review service via the enqueue-ad RPC, and through the admin service,
[00:24:12] and finally back to the core service via the approve and reject RPCs. So the admin loop got ads into the system. How do users find them? That's where the search service comes in. What good is a classifieds site if users can't find what they're looking for? And of course, like most other things in this product, searching was far more complicated than it appeared on the surface.
[00:24:37] Why? Because of config and other product requirements. Our solution was to declare a boundary and write the RPCs. We had many different search requirements. We decided to represent them with two different RPCs. One RPC handled searches done in the user-facing apps. The second RPC was a general purpose search intended for use by any internal service; you can kind of think of this as the lowest common denominator.
[00:25:05] This was a better solution than shifting unique search requirements into individual services, because each service would have to index data. So whenever the core service changed an ad, it called the search service's index-ad RPC. This worked because the core service was the only actor allowed to directly modify ads. The same solution applied to any other searchable entity. User searches hit the core API. The core API, that's the API used by the user-facing mobile and web applications, accepted calls in HTTP and JSON,
[00:25:40] transformed those into the relevant Thrift RPC call to the search service, and transformed the result back to HTTP and JSON. The admin service and review service were also heavy search consumers. The review service needed to make all kinds of searches for all the auto-approve and auto-reject business rules. One such rule was: automatically reject this ad
[00:26:01] if something was posted in the same category with the same title and description in the past three days. The admin service used the general purpose search RPC to implement these rules and the massive searches inside the admin app that the customer support team needed, with all kinds of filters.
[00:26:17] Regardless of the consumer, the core service told the search service to index stuff, and then consumers could search for it through the declared RPCs. This brings the service count up to: core service, search service, review service, and admin service. These four services accounted for the bulk of the product. However, there were still more boundaries to create.
[00:26:38] I was lucky to work on a key feature before the rewrite started. It was my first serious contribution to the business: I wrote and launched the first paid feature in the product. Prior to that, everything had been free. But with the introduction of paid features, users could purchase upgrades like promoted ads or bump ups. Promoted ads were featured in searches.
[00:27:02] Bump ups would continually move their ad to the top of the search results. These paid upgrades were designed to help users sell their stuff. Launching this feature included a web service for processing payments through various payment providers. This was a key requirement because we needed different payment providers for each market.
[00:27:22] This was especially complicated because it was hard enough to find payment providers for these developing markets. The payment service served up a list of payment gateways. A user could click on one, be redirected through the payment flow on the gateway, and then come back through to our system.
[00:27:38] This piece was already in place, so all we had to do was add the Thrift RPCs to integrate the existing payment service with the core service and the core API. So here are the services up to this point: core service, search service, review service, admin service, and payment service. So what happens after completing a payment?
[00:27:59] Well, the user needs an email receipt. Enter the email service. The email service represented a large chunk of business complexity. Again, it followed the established pattern of being more complicated than it appeared on the surface. The full requirements were not known at the beginning. Instead, they were discovered by comparing the old and new systems, then making whatever changes were needed.
[00:28:23] Saltside email had the added quirk of dealing with translations, or localizations, depending on your preference. Also, the product sent a lot of emails. Combine that with email-specific business logic and a unique role to play in the business process, and it was easy to declare a boundary. The email service declared an RPC for every email sent by the product.
[00:28:46] Any service, likely the core service, could call the RPC to deliver the email. The caller would provide everything required, like the ad or user, in the RPC. So the email service had a functionally stateless design. We also built a live development environment using fixture data where developers could make code changes and then see the HTML emails in their browser.
[00:29:10] It was great fun. Plus, it moved the email implementation, like customization and localization, away from the caller. That kept everything cleanly separated and accessible to any actor in the system. This brings the services to: core service, search service, review service, admin service, payment service, and now email service. Now comes a group of services that fall into the not-changed-often, no-product-owner, or generally-less-important bucket.
[00:29:40] These were typically marketing related. We didn't have a dedicated marketing person, let alone a team for it. At this point, people had undertaken various initiatives that had made it to production, and unfortunately people had come and gone, leaving abandoned projects. But of course, a lot of it was still running in production.
[00:30:00] One such project generated a sitemap XML file. Luckily, one engineer had already created a library for this functionality. We extracted the existing bits into a new repo, and this became the sitemap service. I don't think it received any commits after we got it working. No changes in requirements equates to maintenance-only in my mind.
[00:30:21] So let's keep that one over here with similar support levels. Another was Google remarketing. This had something to do with keyword generation and Google AdWords. I can't remember exactly what the point of this thing was, other than that there was certainly no person who could give clear requirements for
[00:30:37] it. All I can remember is that it loaded ads from the core service, generated a CSV file of keywords, and uploaded that to a place our Google AdWords account could read from. Then there was a third service for something I can't remember. That goes to show you how unimportant it was in the grand scheme of things.
[00:30:57] So here's the count so far: core service, search service, review service, admin service, payment service, email service, sitemap service, remarketing service, and the unknown marketing service. There are still a few services that didn't make it into the summary, but those are the highlights. Each service was independently deployed and developed.
[00:31:24] Each had full ownership of its own data stores and API, with declared Thrift RPCs. All services shared a common definition of the config, ad, and user entities. All contextual information was passed into the RPC call to avoid cyclic calls among services. All services were deployed to separate per-market infrastructure.
[00:31:46] This architecture solved the problems that plagued the previous system. I know that because I continued to work at Saltside for two and a half years after the rewrite. That allowed me to observe the long-tail impact of these architectural changes. I witnessed the architecture scale up to support twice as many developers. I witnessed those developers build, deploy, and operate their own services.
[00:32:08] I witnessed architectural boundaries hold and empower future product work. Saltside eventually did have their first profitable year. That would have never happened without the rewrite. I'm very happy with the role I played in this massive project and with the engineers I worked with. It was truly a technical success, but God damn, it took way more time and effort than anyone expected.
[00:32:30] Now we're reaching the end of the Saltside Chronicles. All that's left is to revisit this undertaking through the lens of everything I know now. Join me tomorrow for the final retrospective episode that wraps up this batch. Visit smallbatches.fm for the show notes. Also find Small Batches FM on Twitter and leave your comments in the thread for this episode.
[00:32:53] More importantly, subscribe to this podcast for more episodes just like this one. If you enjoyed this episode, then tweet it or post it to your team Slack, or rate this show on iTunes. It all supports the show and helps me produce more Small Batches. I hope to have you back again for the next episode. So until then, happy shipping.
[00:33:16] Are you feeling stuck trying to level up your skills in shipping software? Then apply for my software delivery dojo. My dojo is a four-week program designed to level up your skills in building, deploying, and operating production systems. Each week, participants go through theoretical and practical exercises led by me, designed to hone the skills needed for continuous delivery. I'm offering this dojo at an amazingly affordable price to Small Batches listeners. Spots are limited, so apply now at softwaredeliverydojo.com.
[00:33:51] Like the sound of Small Batches? This episode was produced by Podsworth Media. That's podsworth.com.