Screaming in the Cloud with Corey Quinn features conversations with domain experts in the world of Cloud Computing. Topics discussed include AWS, GCP, Azure, Oracle Cloud, and the "why" behind how businesses are coming to think about the Cloud.
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. About, oh I don’t know, two years ago and change, I wound up writing a blog post titled, “Developer Portals are An Anti Pattern,” and I haven’t really spent a lot of time thinking about them since. This promoted guest episode is brought to us by our friends at OpsLevel, and they have sent their CTO and co-founder Ken Rose, presumably in an attempt to change my perspective on these things. Let’s find out. Ken, thank you for agreeing to, well, run the gauntlet, for lack of a better term.
Ken: Hey, Corey. Thanks again for having me. And I’ve heard, you know, heard and listened to your show a bunch, and really excited to be here today.
Corey: Let’s begin with defining our terms. I’m curious to know what a developer portal is. ‘What would you say a developer portal means to you?’ Like it’s a college entrance essay.
Ken: Right? Definitely. You know, so really, a developer portal is this consolidated place for developers to come to, especially in large organizations, to be able to get their jobs done more easily, right? A large challenge that developers have in large organizations is that there’s just a lot to do and a lot to take care of. So, a developer portal is a place for developers to be able to better own, manage, and run the services they’re responsible for that run in production, and they can do that through easy access to self-service tooling.
Corey: I guess, on some level, this turns into one of those alignment charts of, like, what is a database and, like, how prescriptive you want to be. It’s like, well is a senior engineer a database because you can query them and they have information? Would you consider, for example, Kubernetes to be a developer platform, and/or would the AWS console?
Ken: Yeah, that’s actually an interesting question, right? So, I think there’s actually two—we’re going to get really niggly here—there’s developer platform and developer portal, right? And the word portal for me is something that sits above a developer platform. I don’t know if you remember, like, the late-90s, early-2000s, like, portals were all the rage.
Like, Yahoo and AltaVista were like search portals, they were trying to, at the time, consolidate all this information on a much smaller internet to make it easy to access. A developer portal is sort of the same thing, but custom-built for developers and trying to consolidate a lot of the tooling that exists. Now, in terms of the AWS console? Yeah, maybe. Like, it has a suite of tools and suite of offerings. It doesn’t do a lot on the, well, how do I quickly find out what’s running in production and who is responsible for it? I don’t know, unless AWS shipped, like, their, you know, three-hundredth new offering in the last week that I haven’t, you know, kept on top of.
But you know, there’s definitely some spectrum in terms of what goes into a developer portal. For me, there’s kind of three main things you need. You do need some kind of a catalog, like, what’s out there and who owns it; you need some kind of a way to measure, like, how good are those services, like, how well built are they; and then you need some access to self-service tooling. And that last part is where, like, the Kubernetes or AWS could be, you know, sort of a dev portal as well.
Corey: My experience with developer portals—there was a time when I loved it. RightScale was what I used—at some depth—back in I want to say 2010, 2011 because the EC2 console was clearly not built or designed for anyone who had not built EC2 themselves with their bare hands and sweat of their brow. And in time, the EC2 console got better where it wasn’t written in hieroglyphics, as best we could tell, and it became ‘click button to launch instance.’ And RightScale really didn’t have a second act and they wound up getting acquired by our friends over at Flexera years later. And I haven’t seen their developer portal in at least eight years as a direct result of this.
So, the problem, at least when I was viewing it purely in the context of AWS services, it feels like you are competing against AWS iterating forward on developer experience, which they iterate slowly, sometimes, and unevenly across their breadth of services, but it does feel like at some level by building an internal portal, you are, first, trying to out-innovate AWS, in some ways, and two, you are inherently making the trade-off of not using recent features and enhancements that have not themselves been incorporated into the portal. That’s where the, I guess the start, the genesis of my opposition to the developer portal approach comes from. Is that philosophy valid these days? Not as much. Because I can see an argument for it shifting.
Ken: Yeah, I think it’s slightly different. I think of a developer portal as, again, something that sort of sits on top of AWS or Google Cloud or whatever cloud provider you use, right? You gave an example with RightScale and EC2. So, provisioning instances is one part of the activity you have to do as a developer. Now, in most modern organizations, you have, like, your product developers that ship features. They don’t actually care about provisioning instances themselves. There’s another group called the platform engineers or platform group that are responsible for building automation and tooling to help spin up instances and create CI/CD pipelines and get everything you need set up.
And they might use AWS under the covers to do that, but the automation built on top and making that accessible to developers, that’s really what a developer portal can provide. In addition, it also provides links to operational tooling that you need, technical documentation, it’s everything you need as a developer to do your job, in one place. And though AWS bills itself as that, I think of them as more, they have a lot of platform offerings, right, they have a lot of infra offerings, but they still haven’t been able to, I think, customize that. Unless you’re an organization that builds—that has kind of gone all-in on AWS and doesn’t build any of your own tooling, that’s where a developer portal helps. It really helps by consolidating all that information in one place, by making that information discoverable for every developer so they have less… less cognitive load, right? We’ve asked developers to kind of do too much that we don’t… we’ve asked them to shift left, and well, how do we make that information more accessible?
Regarding the point of, you know, AWS adds new features or new capabilities all the time and, like, well you have this dev portal, that’s sort of your interface for how to get things done. Like, how will you use those? Dev portal doesn’t stop you from doing that, right? So, my mental model is, if I’m a developer, and I want to spin up a new service, I can just press a button inside of my dev portal in my company and do that. And I have a service that is built according to the latest standards, it has a CI/CD pipeline, it already has a—you know, it’s registered in PagerDuty, it’s registered in Datadog, it has all the various bits.
And then there’s something else that I want to do that isn’t really on the golden path because maybe this is some new service or some experiment, nothing stops us from doing that. Like, you still can use all those tools from AWS, you know, kind of raw. And if those prove to be valuable for the rest of the organization, great. They can make their way into the dev portal; they can actually become a source of leverage. But if they’re not, then they can also just sit there on the vine. Like, not everything that AWS ever produces will be used by every company.
Corey: Many years ago, I got a pair of Cisco certifications because a recession was hitting and I needed to be better at networking. And taking those certifications, in those days before Cisco became the sad corporate dragon with no friends we all know today, they were highly germane and relevant. But I distinctly remember, even now, 15 years later, that there was this entire philosophy of pretend that the entire world is Cisco only, which in networking is absolutely never true. It feels like a lot of the AWS designs and patterns tend to assume, oh yeah, you’re going to use AWS services for everything. I have never yet found that to be true, other than when I’m just trying to be obstinate.
And hell is interoperability between a bunch of different things. Yes, I may want to spin up an EC2 instance and an AWS load balancer and some S3 storage or whatnot, but I’m also going to want to monitor it with PagerDuty, I’m going to want to have a CDN that isn’t CloudFront because most CDNs these days don’t hate you in quite the same economic ways and are simpler to work with, et cetera, et cetera, et cetera. So, there’s definitely a story wherein I’ve found that there’s an—the interoperability of tying these things together is helpful. How do you avoid falling down the trap of oh, everyone should be multi-cloud, single pane of glass, et cetera, et cetera? In practice that always seems to turn to custard.
Ken: Yeah, I think multi-cloud and single pane of glass are actually two different things. So multi-cloud, like, I agree with you to some sense. Like, pick a cloud and go with it, like, unless you have really good business reasons to go for multi-cloud. And sometimes you do, like, years ago, I worked at PagerDuty, they were multi-cloud for a reliability reason, that hey, if one cloud provider goes down, you don’t want [crosstalk 00:08:40]—
Corey: They were an example I used all the time for that story—
Ken: Right.
Corey: —specifically the thing that woke you up was homed in a bunch of different places, whereas the marketing site, the onboarding flow, the periphery stuff around it was not because it didn’t need to be.
Ken: Exactly.
Corey: Like, the core business need of wake you up was very much multi-cloud because once upon a time, it wasn’t and it went down with the rest of us-east-1 and people weren’t woken up to be told their site was on fire.
Ken: A hundred percent. And on the kind of like application side where, even then, pick a cloud and go with it, unless there’s a really compelling business reason for your business to go multi-cloud. Maybe there’s something like credits or compliance or availability, right? There might be reasons, but you have to be articulate about whether they’re right for you.
Now, single pane of glass, I think that’s different, right? I do think that’s something that, ultimately, is a net boon for developers. In any large organization, there is a myriad of internal tools that have been built. And it’s like, well, how do I provision a new topic in the Kafka cluster? How do I actually get access to the AWS console? How do I spin up a new service, right? How do I kind of do these things?
And if I’m a developer, I just want to ship features. Like, that’s what I’m incented to do, that’s what I’m optimizing for. And all this other stuff I have to do as part of my job, but I don’t want to have to become, like, a Kubernetes guru to be able to do it, right? So, what a developer portal is trying to do is be that single pane of glass, bringing this common set of tools and responsibilities that you have as a developer into one place. They’re easy to search for, they’re easy to find, they’re easy to query, they’re easy to use.
Corey: I should probably have asked this earlier on, but let’s disambiguate for a little bit here. Because when I’m setting up to use a new service or product and kick the tires on it, no two explorations really look the same. Whereas at most responsible mature companies that are building products that are—services that are going to production use, they’ve standardized around a number of different approaches. What does your target customer look like? Is there a certain point of scale, a certain level of complexity, a certain maturity of process?
Ken: Absolutely. So, a tool like OpsLevel or a developer portal really only makes sense when you hit some critical mass in terms of the number of services you have running in production, or the number of developers that you have. So, when you hit 20, 30, 50 developers or 20, 30, 50 services, an important part of a developer portal is this catalog of what’s out there. Once you kind of hit the Dunbar number of services, like, when you have more than you can keep in your head, that’s when you start to need tooling like this. If you look at our customer base, they’re all, you know, kind of medium to large-sized companies. If you’re a startup with, like, ten people, OpsLevel is probably not right for you. We use OpsLevel internally at OpsLevel, and you know, like, we’re still a small company. It’s like, we make it work for us because we know how to get the most out of it, but like, it’s not the perfect fit because it’s not really meant for, you know, smaller companies.
Corey: Oh, I hear you. I think I’m probably… I have a better AWS bill analytic system running internally here at The Duckbill Group than some banks do. So, I hear you on that front.
Ken: I believe it.
Corey: But that also implies to me that there’s no OpsLevel prospect or customer deployment that has ever been greenfield. It’s always you’re building on existing things, there’s already infrastructure in place, vendors have been selected across the board. It isn’t someone starting a company on day one going, “All right, time to spin up our AWS account and we’re also going to wind up signing up for OpsLevel,” from the sound of it.
Ken: Correct—
Corey: Accurate? Inaccurate?
Ken: I think that’s actually accurate. Like, a lot of the problems we solve are problems that come as you start to scale both your product and your engineering team. And it’s the problems of complexity.
Corey: What do those painful problems look like? In other words, what is someone sitting at home right now listening to this, or driving to work debating whether they want to ram a bridge abutment or go into the office depending on their mental state today, what painful problem do they have that OpsLevel is designed to fix?
Ken: Yeah, for sure. So, let’s help people self-select. So, here’s my mental model for any [unintelligible 00:12:25]. There are product developers, platform developers, and engineering leaders. Product developers, if you’re asking questions like, “I just got paged for the service. I don’t know what this does.” Or, “It’s upstream from here. Where do I find the technical documentation?” Or, “I think I have to do something with the payment service. Where do I find the API for that?”
You know, when you get to that scale, a developer portal can help you. If you’re a platform engineer and you have questions like, “Okay, we got to migrate. We’re migrating, I don’t know, from Datadog to Honeycomb, right? We got to get these fifty or a hundred or thousands of services and all these different owners to, like, switch to some new tool.” Or, “Hey, we’ve done all this work to ship the golden path. Like, how do we actually measure the adoption of all this work that we’re doing and if it’s actually valuable?” Right?
Like, we want everybody to be on a certain set of CI tooling or a certain minimum version of some library or framework. How do we do that? How do we measure that? OpsLevel is for you, right? We have a whole bunch of stuff around maturity.
And if you’re an engineering leader, ultimately, the questions you care about are, like, “How fast are my developers working? I have this massive team, we’ve made this massive investment in hiring all these humans to write software and bring value for our customers. How can we be more efficient as a business in terms of that value delivery?” And that’s where OpsLevel can help as well.
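The platform-engineering question Ken raises here, how to measure whether every service is on a certain minimum version of some library, can be sketched in a few lines. This is only an illustration under stated assumptions: the `Service` record, the `requests` dependency, and the `adoption_rate` and `laggards` helpers are hypothetical and are not OpsLevel's data model or API. It also assumes plain dotted numeric versions with the same number of components.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical catalog record: a service, its owning team, its dependencies."""
    name: str
    owner: str
    # Dependency name -> installed version string, e.g. {"requests": "2.31.0"}.
    dependencies: dict[str, str] = field(default_factory=dict)


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '2.31.0' into (2, 31, 0)."""
    return tuple(int(part) for part in version.split("."))


def laggards(services: list[Service], library: str, minimum: str) -> list[Service]:
    """Services that use `library` but sit below the `minimum` version."""
    floor = parse_version(minimum)
    return [
        s for s in services
        if library in s.dependencies
        and parse_version(s.dependencies[library]) < floor
    ]


def adoption_rate(services: list[Service], library: str, minimum: str) -> float:
    """Fraction of services using `library` that meet the `minimum` version.

    Services that do not depend on the library are left out of the ratio.
    """
    users = [s for s in services if library in s.dependencies]
    if not users:
        return 1.0  # nothing to migrate, so nothing is behind
    behind = laggards(users, library, minimum)
    return (len(users) - len(behind)) / len(users)


services = [
    Service("payments", "team-a", {"requests": "2.31.0"}),
    Service("search", "team-b", {"requests": "2.25.1"}),
    Service("docs", "team-c"),  # does not use the library at all
]

rate = adoption_rate(services, "requests", "2.28.0")
behind = [(s.name, s.owner) for s in laggards(services, "requests", "2.28.0")]
```

Because each record carries an owner, the list of laggards doubles as a list of teams to follow up with, which is the "all these different owners" part of the migration problem.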
Corey: Guardrails, whether they be economic, regulatory, or otherwise, have to make doing things correctly easier than doing things incorrectly because one of the miracle aspects of cloud also turns into a bit of a problem, which is shadow IT is only ever a corporate credit card away. Make it too difficult to comply with corporate policies and people won’t. And they’re good actors; they’re trying to get work done. They’re not trying to make people’s lives harder, but they don’t want to spend six weeks provisioning an EC2 cluster. So, there’s always that weird trade-off.
Now, it feels—and please correct me if I’m wrong—once someone has rolled out OpsLevel at their organization, where it really shines is spinning up a new service where okay, great, you’re going to spin up the automatic observability portion of it, you’re going to spin up the underlying infrastructure in certain ways that comply with our policies, it’s going to build the CI/CD pipelines around it, you’re going to wind up having the various cost instrumentation rolled out to it. But for services that are already extant within the environment, is there an OpsLevel story for them?
Ken: Oh, absolutely. So, I look at it as, like, the first problem OpsLevel helps solve is the catalog and what’s out there and who owns it. So, not even getting developers to spin up new services that are kind of on the golden path, but just understanding the taxonomy of what are the services we have? How do those services compose into higher-level things like systems or domains? What’s the whole set of infrastructure we have?
Like, I have 50 AWS accounts, maybe a handful of GCP ones, also, some Azure. I have all this infrastructure, and, like, how do I start to get a handle on, like, what’s out there in prod and who’s responsible for it? And that helps you get in front of compliance risks, security risks. That’s really the starting point for OpsLevel: building that catalog. And we have a bunch of integrations that kind of slurp all this data to automatically assemble that catalog, or YAML as well if that’s your thing. But that’s the starting point: building that catalog and figuring out this assignment of, like, okay, this service and this human, or this—sorry—team, like, they’re paired together.
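The catalog Ken describes, services composed into systems and each one paired with an owning team, can be pictured with a small sketch. Everything here is an assumption made for illustration: the field names (`name`, `system`, `owner`), the sample entries, and the helper functions are hypothetical, and the list of dicts stands in for whatever an integration or a YAML file would actually supply.

```python
from collections import defaultdict

# Hypothetical catalog entries, shaped like what a per-service YAML file
# might declare once it has been loaded into Python.
CATALOG = [
    {"name": "payments-api", "system": "billing", "owner": "team-payments"},
    {"name": "invoice-worker", "system": "billing", "owner": "team-payments"},
    {"name": "search-indexer", "system": "search", "owner": None},
]


def group_by(catalog, key):
    """Group service names by the given field, skipping entries without it."""
    grouped = defaultdict(list)
    for entry in catalog:
        value = entry.get(key)
        if value:
            grouped[value].append(entry["name"])
    return dict(grouped)


def unowned(catalog):
    """Names of services with no owning team: the gap a catalog surfaces first."""
    return [entry["name"] for entry in catalog if not entry.get("owner")]


by_owner = group_by(CATALOG, "owner")    # team -> services it is paired with
by_system = group_by(CATALOG, "system")  # system -> services that compose it
orphans = unowned(CATALOG)
```

The unowned list is the part that speaks to the compliance and security point: a service running in production with nobody paired to it is exactly what a spreadsheet tends to miss.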
Corey: There are a number of offerings in this space, and honestly, my exposure to them is bounded simultaneously to things that are ten years old and no one uses anymore, or a bunch of things I found on GitHub. And the challenge that both of those products tend to have is that they assume certain things to be true about a given environment: that they’re using Terraform to manage everything, or they’re always going to be using CloudFormation, or everyone there knows Python or something else like that. What are the prerequisites to get started with OpsLevel?
Ken: Yeah, so we worked pretty hard to build just a ton of integrations. I would say integrations are just a continuing thing we have going on in the background. Like, when we started, like, we only supported GitHub. Now, we support all the gits, you know, like GitHub, GitLab, Bitbucket, Azure DevOps, like, we’re building [unintelligible 00:16:19]. There’s just a whole, like, long tail of integrations.
The same with APM tooling. The same with vulnerability management tooling, right? And the reason we do that is because there’s just this huge vendor footprint, and people, you know, want OpsLevel to work for them. Now, the other thing we try to do is we also build APIs. So, anything we have as, like, a core integration, we also have kind of like an underlying API for, so that, no matter what, you have an escape hatch. If like, you’re using some tool that we don’t support or you have some homegrown thing, there’s always a way to try to be able to integrate that into OpsLevel.
Corey: When people think about developer portals, the most common one that pops to mind is Backstage, which Spotify wound up building, internally, championing, open-sourcing, and I believe, on some level, turned into a product because if there’s one thing people want, it’s to have their podcast music company become a SaaS vendor, which is weird to me. But the criticisms that I’ve seen about it across the board have rung relatively true, including from people internal at Spotify who have used the thing. The first is underestimating the amount of effort that is necessary to maintain Backstage itself, that the build versus buy discussion is always harder to bu—engineers love to build, but they shouldn’t be building things outside of their core competency half the time, and the other is driving adoption within the org where you can have the most amazing developer portal in the known universe, but if people don’t use it, it may as well not exist and doing the carrot and stick approach often doesn’t work. I think you have a pretty good answer that I need not even ask you to elaborate on, “Well, how do we avoid having to maintain this ourselves,” since you have a company that does this, but how do you find companies are driving adoption successfully once they have deployed OpsLevel?
Ken: Yeah, that’s a great question. So, absolutely. Like, I think the biggest thing you need first, is kind of cultural buy-in and that this is a tool that we want to invest in, right? I think one of the reasons Spotify was successful with Backstage and I think it was System Z before that was that they had this kind of flywheel of, like, they saw that their developers were getting, you know, better, faster, working happier, by using this type of tooling, by reducing the cognitive load. The way that we approach it is sort of similar, right?
We want to make sure that there is executive buy-in that, like, everybody agrees this is, like, a problem that’s worth solving. The first step we do is trying to build out that catalog again and helping assign ownership. And that helps people understand, like, hey, these are the services I’m responsible for. Oh, look, and now here’s this other context that I didn’t have before. And then helping organizations, you know, what—it depends on the problem we’re trying to solve, but whether it’s rolling out self-serve automation to help developers, like, reduce what was before a ton of cognitive load or if it’s helping platform teams define what good looks like so they can start to level up the overall health of what’s running in production, we kind of work on different problems, but it’s picking one problem and then you know, kind of working with the customers and driving it forward.
Corey: On some level, I think that this is going to be looked down upon inherently just by automatic reflex of folks with infrastructure engineering backgrounds. It’s taken me some time to learn to overcome my own negative reaction to it. Because it’s, I’m here to build things and I want to build things out in such a way that it’s portable and reusable without having to be tied to a particular vendor and move on. And it took me a long time to realize that what that instinct was whispering in my ear was in fact, no, you should be your own cloud provider. If that’s really what I want to do, I probably should just brush up on you know, computer science trivia from 20 years ago and then go see if I can pass Google’s SRE interview.
I’m not here to build the things that just provision infrastructure from scratch at every company I wind up landing at. It feels like there’s more important, impactful work that I can do. And let’s be clear, people are never going to follow guardrails themselves when they have to do a bunch of manual steps. It has to be something that is done for them. And I don’t know how you necessarily get there without having some form of blueprint or something like that, provided for them with something that is self-service because otherwise, it’s not going to work.
Ken: I a hundred percent agree, by the way, Corey. Like, the take that, like, automation is the only way to drive a lot of this forward is true, right? If for every single thing you’re trying—like, we have a concept called a rubric and it’s basically how you measure the service health. And you can—it’s very customizable, you have different dimensions. But if, for any check that’s on your rubric, it requires manual effort from all your developers, that is going to be harder than something you can just automate away.
So, vulnerability management is a great example. If you tell developers, “Hey, you have to go upgrade this library,” okay, some percentage [unintelligible 00:20:47], if you give developers, “Here’s a pull request that’s already been done and has a test passing and now you just need to merge it,” you’re going to have a much better adoption rate with that. Similarly with, like, applying templates being able to [up-level 00:20:57], you know, kind of apply the latest version of a template to an existing service, those types of capabilities, anything where you can automate what the fixes are, absolutely you’re going to get better adoption.
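The rubric idea Ken mentions, a set of checks that together measure service health, can be sketched roughly as below. This is an assumption-laden illustration and not how OpsLevel implements rubrics: the level names, the four example checks, the service fields, and the rule that a service holds the highest level whose checks, and all lower levels' checks, pass are all hypothetical choices made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical levels, ordered from lowest to highest.
LEVELS = ["bronze", "silver", "gold", "platinum"]


@dataclass(frozen=True)
class Check:
    """One rubric check: a name, the level it gates, and an automated test."""
    name: str
    level: str
    passes: Callable[[dict], bool]


# Every check reads data that tooling can collect, so no developer
# has to fill anything in by hand.
CHECKS = [
    Check("has owner", "bronze", lambda s: bool(s.get("owner"))),
    Check("has runbook", "silver", lambda s: bool(s.get("runbook_url"))),
    Check("has on-call rotation", "gold", lambda s: s.get("on_call", False)),
    Check("no critical vulnerabilities", "platinum",
          lambda s: s.get("critical_vulns", 0) == 0),
]


def failing_checks(service: dict, checks: list[Check]) -> list[str]:
    """Names of the checks this service currently fails."""
    return [c.name for c in checks if not c.passes(service)]


def service_level(service: dict, checks: list[Check]) -> Optional[str]:
    """Highest level for which that level's checks and all lower ones pass."""
    achieved = None
    for level in LEVELS:
        if all(c.passes(service) for c in checks if c.level == level):
            achieved = level
        else:
            break  # a failed level blocks every level above it
    return achieved


payments = {
    "owner": "team-payments",
    "runbook_url": "https://example.invalid/runbooks/payments",
    "on_call": True,
    "critical_vulns": 1,  # one open issue holds the service below the top level
}

level = service_level(payments, CHECKS)
todo = failing_checks(payments, CHECKS)
```

Keeping each check as a function over collected data is what makes the automation point concrete: the score moves when the underlying fact changes, for example when a prepared pull request that upgrades the library is merged.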
Corey: As you take a look at your existing reference customers—which is something I always look for on vendor websites because, like, oh, we have many customers who will absolutely not admit to being customers, it’s like, that sounds like something that’s easy to say—you have actual names tied to these things. Not just companies, but also individuals. If you were to sit down and ask your existing customer base, “So, why did you wind up implementing OpsLevel and what has the value that’s delivered to you been since that implementation?” What do they say?
Ken: Definitely. I actually had to check our website because we, you know, land new customers and put new logos on it. I was like, “Oh, I wonder what the current set is out right now?”
Corey: I have the exact same challenge. Like oh, we have some mutual customers. And it’s okay. I don’t know if I can mention them by name because I haven’t checked our own list of testimonials [unintelligible 00:21:51] lately because say the wrong thing and that’s how you wind up being sued and not having a company anymore.
Ken: Yeah. So, I don’t—I definitely, you know, want to stay [on side 00:22:00] on that part, but in terms of, like, kind of sample reference customer, a lot of the folks that we initially worked with are the platform teams, right? They’re the teams that care about what’s out there, and they need to know who’s responsible for it because they’re trying to drive some kind of cross-cutting change across the entire, you know, production footprint. And so, the first thing that generally people will say is—and I love this quote. This came—I won’t name them, but like, it’s in one of our case studies.
It was like, “I had, like, 50 different attempts at making a spreadsheet and they’re all, like, in the graveyard, like, to be able to capture what’s out there and who’s responsible for it.” And just OpsLevel helping automate that has been one of the biggest values that they’ve gotten. The second point, then, is now being able to drive maturity and being able to measure how well those services are being built. And again, it’s sort of this interesting thing where we start with the platform teams. And then sometime later security teams find out about OpsLevel, and they’re like, “Oh, this is a tool I can use to, like, get developers to do stuff? Like, I’ve been trying to get developers to do stuff for the longest time.”
And they—I file Jira tickets and they just sit there and nothing gets done. But when it becomes part of this, like, overall health score that you’re trying to increase across the board, yeah, it’s just a way to kind of drive action.
Corey: I think that there’s a dichotomy of companies that emerge. And I tend to see the world through a lens of AWS bills, so let’s go down that path. I feel like there are some companies presumably like OpsLevel, whereas if I—assuming you’re running on top of AWS—if I were to pull your AWS bill, I would see upwards of 80% of your spend is going to be on this application called OpsLevel, the service that you provide to people. As opposed to the other side of the world, which is large enterprises, where they’re spending hundreds of millions of dollars a year, but the largest application they have is a million-and-a-half a year in spend because just, they have thousands of these things scattered everywhere. That latter case is where I tend to see more platform teams, where I start to see a lot of managing a whole bunch of relatively small workloads. And developer platforms really seem to be where a lot of solutions lead, whereas 80% of our workload is one application, we don’t feel the need for that as much. Is that accurate? Am I misunderstanding some aspect of it?
Ken: No, a hundred percent you’ve hit the nail on the head. Like, okay, think about the typical, like, microservices adoption journey. Like, you started with, you know, some small company—like us—you started with a monolith. Ah, maybe you built out a second app—
Corey: Then you read on Hacker News and realize, “Oh, if we want to hire people, we’ve got to be doing what all the cool kids are up to.”
Ken: Right. We’ve got to microservice all the things—but that’s actually, you know, microservices should come later, right, as a response to you needing to scale your org and scale your—
Corey: As someone who started building some application with microservices, I could not agree more.
Ken: A hundred percent. So, it’s as you’re starting to take steps to having just more moving parts in your production infrastructure, right? If you have one moving part, unless it’s like a really large moving part that you can internally break down, like, kind of this majestic monolith where you do have kind of like individual domains that are owned by different teams, but really the problem we’re trying to solve, it’s more about, like, who owns what. Now, if that’s a single atomic unit, great, but can you decompose that? But if you just have, like, one small application, kind of like the whole team is owning everything, again, a developer portal is probably not the right tool for you. It really is a tool that you need as you start to scale your engineering org and as you start to scale the number of moving parts in your production infrastructure.
Corey: I used to tend to think of that in terms of boring companies versus innovative ones and I don’t think that’s accurate. I think it is the question of maturity and where companies lead to. On some level, if OpsLevel starts growing and becomes larger and larger in different ways and starts doing acquisitions and launching into other areas, at some point, you don’t have just one product offering, you have a multitude of them. At which point having something like that is going to be critical. But I have to ask, given that you are sort of not exactly your target customer profile, what have the sharp edges been on using it for your use case?
Ken: Yeah. So, we actually have an internal Slack channel we call OpsLevel on OpsLevel. And finding those sharp edges actually has been really useful for us. You know, all the good stuff, dogfooding and it makes your own product better. Okay, so we have our main app, we also do have a bunch of smaller things and it’s like, oh yeah, you know, we have, like, I don’t know, various hack day things that go on, it’s important we kind of wind those down for, you know, compliance, we have our marketing site, we have, like, our Terraform.
Like, so there’s, like, stuff. It’s not, like, hundreds or thousands of things, but there’s more than just the main app. The second though, is it’s really on the maturity piece that we really try to get a lot of value out of our own product, right? Helping—we have our own platform team. They’re also trying to drive certain initiatives with our product developers.
There is that usual tension of, like, our own product developers are like, “I want to ship features. What’s this security thing I have to go take care of right now?” But OpsLevel itself, like, helps reflect that. We had an operational review today and it was like, “Oh, this one service is actually now”—we have platinum as a level. It’s in gold instead of platinum. It’s like, “Why?” “Oh, there’s this thing that came up. We got to go fix that.” “Great. Let’s actually go fix that so we’re back into platinum.”
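The gold-versus-platinum exchange above can be read as a simple rule: a service sits at the highest level for which every check at that level, and at every level below it, passes. What follows is only a minimal sketch of that idea, not OpsLevel's implementation. The `bronze` and `silver` levels, the check names, and the `Check` and `service_level` names are all assumptions made for illustration; only gold and platinum are mentioned in the conversation.

```python
from dataclasses import dataclass

# Ordered lowest to highest. Only "gold" and "platinum" are named in the
# conversation; "bronze" and "silver" are assumed for illustration.
LEVELS = ["bronze", "silver", "gold", "platinum"]


@dataclass(frozen=True)
class Check:
    name: str     # hypothetical check name
    level: str    # the level this check is required for
    passed: bool  # result of the most recent evaluation


def service_level(checks):
    """Return the highest level whose checks, and all lower levels' checks, pass.

    Returns None if a check at the lowest level fails.
    """
    achieved = None
    for level in LEVELS:
        required = [c for c in checks if c.level == level]
        if not all(c.passed for c in required):
            break
        achieved = level
    return achieved


# Hypothetical service: one platinum-level check has started failing.
checks = [
    Check("has_owner", "bronze", True),
    Check("has_on_call_rotation", "silver", True),
    Check("has_runbook", "gold", True),
    Check("dependencies_patched", "platinum", False),
]
print(service_level(checks))  # gold
```

Under this reading, fixing the single failing check is what moves the service back to platinum, which matches the operational review Ken describes.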
Corey: Do you find that there’s often a choice you have to make internally, where you could make the product more effective for your specific use case, but that also diverges from where your typical customer needs or wants the product to go?
Ken: No, I think a lot of the things we find for our use case are, like, they’re more small paper cuts, right? They’re just as we’re using it, it’s like, “Hey, like, as I’m using this, I want to see the report for this particular check. Why do I have to click six times to get?” You know, like, “Wouldn’t it be great if we had a button?” Right?
And so, it’s those type of, like, small innovations that kind of come up. And those ultimately lead to, you know, a better product for our customers. We also work really closely with our customers and developers are not shy about telling you what they don’t like about your product. And I say this with love, like, a lot of our customers give us phenomenal feedback just on how our product can be better and we try to internalize that and you know, roll that feedback into the product.
Corey: You have a number of integrations of different SaaS providers, infrastructure providers, et cetera, that you wind up working with. I imagine that given your scale and scope and whatnot, those offerings are dictated by what customers say, “Hey, we’re using this thing. Are you going to support that or are you not going to maintain our business?” Which is a great way to wind up financing a lot of product development and figuring out what matters to people. My question for you is, if you look across the totality of your user base, what are the most popularly used integrations, if you can say?
Ken: Yeah, for sure. I think right now—I could actually dive in to pull the numbers—GitHub and GitLab—or… I think GitHub, like, has slightly more adoption across our customer base. At least with our customers, almost nobody uses Bitbucket. I mean, we have, like, a small number, but, like, it’s… I think, single-digit percentage. A lot of people use PagerDuty, which you know, hey, I’m an ex-PagerDuty person [crosstalk 00:28:24] and I’m glad to see that.
Corey: I have a free tier PagerDuty account that will automatically page me for my home automation stuff. Specifically, if you know, the fire alarm goes off. Like, yeah, okay, there are certain things I want to be woken up for, but it’s a very short list.
Ken: Yeah, it’s funny, the running default message when we used to test PagerDuty was, “The server is on fire.” [unintelligible 00:28:44] be like, “The house is on fire.” Like you know, go get that taken care of. There’s one other tool that’s used a lot. Datadog actually is used a ton just across our entire customer base, despite its… we’re also Data—we’re a Datadog partner, we’re a Datadog customer, you know? It’s not cheap, but it’s a good product for, you know, monitoring and logs and there are [crosstalk 00:29:01]—
Corey: Now, other than cloud infrastructure providers, the number one most common source of inquiries I get is Datadog optimization. It has now risen to a board-level concern in many cases because observability is expensive. That’s a sign of success, on some level. Meanwhile, I’m sitting here, like, Date-a-dog? Oh, my God, that’s disgusting. It’s like Tinder for Pets. Which it turns out is not at all what they do.
Ken: Nice.
Corey: Yeah.
[audio break 00:29:23]—optimizing their Slack integrations, their GitHub integration, et cetera. Or are they starting with the spinning up the servers piece of it?
Ken: A lot of the time—and again, that first problem they’re trying to solve is just get me a handle on everything we have running in production. You know, if you have multiple AWS accounts, multiple Kubernetes clusters, dozens or even hundreds of teams, God help you if you’re going to try to, like, build a list manually to consolidate all that information. That’s really the first part is, like, integrate Kubernetes, integrate your CI/CD pipelines, integrate Git, integrate your Cloud account, like, we’ll integrate with everything and we’ll try to build that map of, like, here’s everything that’s out there, and start to try to assign it to, like, and here’s people that we think might be responsible in terms of owning the software. That’s generally the starting point.
Corey: Which makes an awesome amount of sense. I think going at it from the infrastructure first perspective is where I’ve seen most developer platforms founder. And to be fair, the job is easier now than it was years ago because it used to be that you were being out-innovated by AWS constantly. Innovation has slowed down there. And you know that because of how much they say the pace of innovation has only sped up.
And whenever AWS says something in a marketing context, they’re insecure about it. I’ve learned this through the fullness of time observing that company. And these days, most customers do not use the majority of features available for any given service. They have solidified to a point where you can responsibly build on top of these things. Now, it seems that the problem is all the ‘yes, and’ stuff that gets built on top of it.
Ken: Yeah. Do you have an example, actually, like, one of the kinds of, like, ‘yes, and’ tools that you’re thinking about?
Corey: Oh, absolutely. We have a bunch of AWS environment stuff so we should configure CloudWatch to look at all these things from an observability perspective. No, you should not. You should set up Datadog. And the first time someone does that by hand, they enable all of the observability and the rest and suddenly get charged approximately the GDP of Guam.
And okay, maybe we shouldn’t do that because then you have the downstream impact of that on your CloudWatch bill. So okay, how do we optimize this for the observability piece directly tied to that? How do we make sure that we get woken up when the site is down or preferably before that, but not every time, basically, an EBS volume starts to get a little bit toasty? You have to start dialing this stuff in. And once you’ve found a lot of those aspects, being able to templatize that and roll that out on an ongoing basis and having the integrations all work together feels like it’s the right problem to be solving.
Ken: Yeah, absolutely. And the group that I think is responsible for that kind of—because it’s a set of problems you described—is really, like, platform teams. Sometimes service owners for like, how should we get paged, but really, what you’re describing are these kind of cross-cutting engineering concerns that platform teams are uniquely poised to help solve in an [unintelligible 00:32:03] organization, right? I was thinking what you said earlier. Like, nobody just wants to rebuild the same infra over and over, but it’s sort of like, it’s not just building an [unintelligible 00:32:09]; it’s kind of like solving this, like, how do we ship? Can we actually run stuff in prod? And not just run it but get observability and ensure that we’re woken up for it and, like, what’s that total end-to-end look like from, like, developers writing code to running software in production that’s serving traffic? And solving all the problems [unintelligible 00:32:24], that’s what I think of as platform engineering.
Corey: So, my last question before we wind up wrapping this episode comes down to, I am very adept at two different programming languages, and those are brute force and enthusiasm. What implementation language is most of what you find yourself working with? And why is it invariably going to be YAML?
Ken: Yeah, that’s a great question. So, I think there’s, in terms of implementing OpsLevel and implementing a service catalog, we support YAML. Like, you know, there’s this very common workflow, you just drop a YAML spec, basically, in your repo, if you’re a service owner. And that, we can support that. I don’t think that’s a great take, though.
Like, we have other integrations. Again, if the problem you’re trying to solve is I want to build a catalog of everything that’s out there, asking each of your developers hey, can you please all write YAML files that, like, describe the services you own and drop them into this repo? You’ve inverted this, like, database that essentially you’re trying to build, like, what’s out there and stored it in Git, potentially across several hundreds or thousands of repos. You put a lot of toil now on individual product developers to go write and maintain these files. And if you ever had to, like, make a blanket update to these files, there’s no atomic way to kind of do that, right?
So, I look at YAML as, like, I get it, you know? Like, we use YAML for all the things in DevOps, so why not their service catalog as well, but I think it’s toil. Like, there are easier ways to build a catalog: by, kind of, just integrating. Like, hook up AWS, hook up GitHub, hook up Kubernetes, hook up your CI/CD pipeline, hook up all these different sources that have information about what’s running in prod, and let the software, let the tool, automatically infer what’s actually running as opposed to requiring humans to manually enter data.
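The integrate-and-infer approach Ken describes can be pictured as merging what each connected source reports into one catalog entry per service, with ownership treated as a guess to be confirmed rather than something developers type into a file. This is a rough sketch of that idea under stated assumptions, not OpsLevel's API or data model: the record shape, the `build_catalog` function, and the service and team names are all hypothetical.

```python
from collections import defaultdict


def build_catalog(discovered):
    """Merge per-source discovery records into one entry per service.

    Each record is a dict with a "service" name, the "source" that reported
    it, and an optional "owner_hint" (all hypothetical field names).
    """
    catalog = defaultdict(lambda: {"sources": set(), "candidate_owners": set()})
    for record in discovered:
        entry = catalog[record["service"]]
        entry["sources"].add(record["source"])
        # Ownership is only a hint here; a person still confirms it later.
        if record.get("owner_hint"):
            entry["candidate_owners"].add(record["owner_hint"])
    return dict(catalog)


# Hypothetical records, as if reported by three connected integrations.
discovered = [
    {"service": "checkout", "source": "kubernetes", "owner_hint": None},
    {"service": "checkout", "source": "github", "owner_hint": "team-payments"},
    {"service": "search", "source": "aws", "owner_hint": "team-discovery"},
]

catalog = build_catalog(discovered)
for name in sorted(catalog):
    entry = catalog[name]
    print(name, sorted(entry["sources"]), sorted(entry["candidate_owners"]))
```

Because the catalog lives in one place, a blanket change is a single update to this structure rather than an edit to a YAML file in each of several hundred repos, which is the toil Ken is pointing at.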
Corey: I find that there are remarkably few technical holy wars that I cannot unify both sides on by nominating something far worse. Like, the VI versus Emacs stuff, the tabs versus spaces, and of course, the JSON versus YAML folks. My JSON versus YAML answer is XML: God’s language. I find that as soon as you suggest that, people care a hell of a lot less about the differences between JSON and YAML because their job is to now kill the apostate, which is me.
Ken: Right. Yeah. I remember XML, like, oh, man, 2002. SOAP. I remember SOAP as a protocol. That was a thing.
Corey: Some of the earliest S3 API calls were done in SOAP, and I think they finally just used it to wash their mouths out when all was said and done.
Ken: Nice. Yeah.
Corey: I really want to thank you for taking the time to do your level best to attempt to convert me, and I would argue in many respects, you have succeeded. I’m thinking about this differently than I did half an hour ago. If people want to learn more, where’s the best place for them to find you?
Ken: Absolutely. So, you can always check out our website, opslevel.com. We’re also fairly active on LinkedIn. If Twitter hasn’t imploded by the time this episode becomes launched, then they can also check us out at twitter.com/OpsLevelHQ. We’re always posting, just different content on, like, how to be successful with service maturity, DevOps, developer productivity, so that you know, ultimately, that you can ship out to customers faster.
Corey: And we will, of course, put links to that in the show notes. Thank you so much for taking the time, not just to speak with me, but also for sponsoring this episode. It is appreciated.
Ken: Cheers.
Corey: Ken Rose, CTO and co-founder at OpsLevel. I’m Cloud Economist Corey Quinn and this has been a promoted guest episode of Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment which, upon further reflection, you could have posted to all of the podcast platforms if only you had the right developer platform to pull it off.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.