Kevin Kohn: Hey everyone. Welcome back to the Tech Sessions podcast. Today I have a really special guest, Roy Douber, who's the guru of all things observability. We're going to talk about why observability is important today, how it impacts your environment, and why we all need to be considering it as an aspect of our daily lives.
So without further ado, let's get started.
Kevin Kohn: Welcome everyone to our Cloud Tech Sessions Observability Masterclass. My name is Kevin Kohn, Vice President of Cloud and Application Engineering here at e360. With me today, I have our observability guru himself, Roy Douber. Roy, please tell us a little bit about yourself. Introduce yourself.
Roy Douber: Roy Douber. I've been here at e360 for a few years now, working on observability, DevOps, AI, and various solutions for various clients, primarily observability. Prior to that, I was a lead site reliability engineer at Qualcomm. Thanks for the kind introduction, and let's roll right into it.
Kevin Kohn: Sure thing. Well, you know, I want to pick on your introduction a little bit, because we brought you here to the team to be this amazing observability guru, and you are, but you're far more than that to the organization. I mean, you have done everything from database optimization for really creative structures and customers in the cloud space to helping us re-architect data centers.
There have been all kinds of things that your skills have lent themselves to, which I think gives you that much greater value for this particular discussion, because you bring the whole end-to-end view of a data center, and much of what you do, to this conversation. Wouldn't you agree?
Roy Douber: Yeah, so I guess, let's delve a bit deeper into the background. In my career, I've owned various observability products within the companies I've worked for, and I've worked as a site reliability engineer, which means you get thrown into an outage and the ask is: fix it, please. And "fix it, please" usually means there are millions of dollars on the line every hour.
There are angry clients, potentially, there are angry stakeholders, potentially, and there is a service that is not working for either the client or the internal stakeholder. During the outage, you obviously help provide a solution, and that means you have to have a strong understanding of complex systems: databases, applications, logging, metrics, events. You have to be able to read various programming languages, help with CI/CD processes, and what have you. And really the crux of it all is that you find the answer, but the answer is generally not enough, because you've arrived at a place where the service is no longer operating.
And so a lot of the work comes after the fact. You resolve the outage, and then the real work starts: you work with the engineering team to solution a better observability story around their application. That's been the last probably eight years of my life.
Going in, implementing, writing, thinking of solutions, devising solutions, showcasing the data in a way that's meaningful for the clients or the stakeholders, and being able to tell the story, not only to a developer necessarily. Sometimes the end solution is engineering-focused, sometimes it's operator-focused, and many times it's also executive-focused. And I think that's the essence of observability: it's a culture.
It's a tool, it's a mechanism to acquire data and then showcase it in a way that helps the business drive revenue. At the end of the day, really, if your application is down, you're losing revenue. If your application is up and hardened, you're driving revenue and better client outcomes.
Kevin Kohn: Very well said. And, you know, all of us who have been in the data center space have been involved in an outage a time or two, right? For those of us who've been in it a long time, many, many different times. One of the things I've observed, no pun intended there, is that oftentimes if a company has properly invested in observability,
they may have had more advance notice, or they might have had the ability to either remediate or avoid that outage altogether. Can you talk a little bit to that and what your experience has been in that regard?
Roy Douber: Oh, the symptoms prior to the outage. It's an interesting topic, right? Because the world is mixed when it comes to solutioning around this. Often the data is right there in front of you if you just had a moment to look, and sometimes it's not. So I see this dichotomy out there in the business world: some companies are just starting their journey, and some companies have all the data right in front of them, ready to go. They just don't know where to look at the right time, right?
Or where to look ahead of time, at the small symptoms that lead to, you know, the bigger problem. And that's a tooling conversation, that's an acquisition-of-data conversation, generally speaking.
If you have all the pieces in place, then you start moving towards the culture of observability. I mean, if you think about DevOps and how it's evolved, it used to be a position, but I would say it's more of a culture now. I think observability nowadays is more of a culture too. You have to have quite a few pieces together, and the people that understand those pieces, in order to answer the questions that the business is seeking.
Kevin Kohn: Well, on that note, why don't we get into what exactly observability is? Let's define it and go there, since you already started that topic.
Roy Douber: I think the essence of observability is being able to answer questions about your systems that you didn't know you'd need to ask beforehand.
So being able to ask a question about what the system is doing in real time and say, okay, can I go look at my instrumentation, my data, my dashboards, and craft an answer for this question? But I would say it's also a culture of well-instrumented tooling and technology, and the mechanisms that allow you to acquire, process, store, visualize, and act on that data.
And with that, you get the capability to answer those questions, hopefully, or devise a path towards those answers. You may not have the data in front of you right away, but maybe there's a way to access that data and then answer complex questions around business systems, business metrics, or applications that exist in your ecosystem.
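Roy's definition, answering questions you didn't anticipate, comes down to capturing raw measurements up front, so new questions become queries over data you already have. Here is a minimal Python sketch; the recorder class and endpoint name are hypothetical, not any real product's API:

```python
import random
import statistics
from collections import defaultdict

# Hypothetical in-process recorder: every request's latency is captured
# as it happens, so a question nobody planned for ("what's the p95 for
# /checkout?") can still be answered later from the raw measurements.
class LatencyRecorder:
    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, endpoint, seconds):
        self._samples[endpoint].append(seconds)

    def percentile(self, endpoint, pct):
        # quantiles with n=100 yields the 1st..99th percentile cut points
        data = sorted(self._samples[endpoint])
        return statistics.quantiles(data, n=100)[pct - 1]

recorder = LatencyRecorder()
random.seed(7)
for _ in range(1000):
    recorder.record("/checkout", random.uniform(0.05, 0.30))

p95 = recorder.percentile("/checkout", 95)
print(f"/checkout p95: {p95:.3f}s")
```

A real deployment would export this to a time-series backend instead of keeping it in memory, but the principle is the same: store the raw signal, ask questions later.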
Kevin Kohn: Now, I think a lot of that is incredibly powerful and required, especially for large environments. Whenever you get into hundreds or even thousands of servers, tons of networking nodes, and a whole storage environment to maintain and manage,
it becomes very complex pretty quickly, so you can see a justification for investing in observability for large environments. Do you feel the same justification exists for smaller environments as it does for the large, and why?
Roy Douber: To answer it simply, if you ever want to grow big enough, I think it makes sense to start early.
But really, I struggle with the fact that a lot of companies still see observability as a checkbox item: you just throw it in there, the checkbox is in place, and we're all good to go. I think the investment needs to come in the training of people and in the engineering time spent on developing it. So if an engineer spends 80 percent of their time driving a product, say building a product internally, roughly 20 percent should be allocated towards observability and improving the environment they operate in. Otherwise you'll end up in a scenario where the application falls into dysfunction,
or the people that know the product inside and out leave, and there is no collateral left behind to understand what was done, how it operates, why it operates, what the problems are, where the skeletons lie: all the kinds of questions you'd want answered about any complex application.
Kevin Kohn: I think that is applicable regardless of how mature your organization is, would you agree?
Roy Douber: Yeah, I think a small business ought to start early. They don't have to start with any premium tooling; they can start with open source. But I would advise starting early: build your application so that you can easily instrument the metrics, the logs, the traces.
It'll serve you in the long run as your customer base grows, as your use cases grow, and as you leverage new technologies, such as AI, and implement them into your application. Having a modular enough piece of software that allows for observability is also very, very key.
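One cheap way to start early, in the spirit of Roy's advice, is structured logging from day one, so every log line is machine-queryable. A standard-library-only sketch; the logger name and fields are made up, and a real open-source stack would ship these lines to a collector:

```python
import json
import logging
import sys

# Minimal structured logging with only the stdlib: each line is a JSON
# object, so later questions ("how many declines per tenant?") become
# queries over fields rather than regex hunts through free text.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Attach any structured fields passed via `extra={"fields": ...}`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment declined", extra={"fields": {"tenant": "acme", "code": 402}})
```

Swapping the stream handler for an exporter later doesn't change any call sites, which is the modularity Roy is pointing at.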
Kevin Kohn: Well, that takes me into the next aspect here. I asked you about a more traditional data center architecture just a few moments ago, but we all know the world is a lot more complex than that, and has been for some time: the adoption of serverless, the adoption of Kubernetes, and now the adoption of AI, which is taking the world by storm.
The landscape is changing. We're beginning to have more and more ways of addressing and serving workloads and providing applications than we have historically had. So in the context of these new means and methods, and the cloud computing environments we're all part of, how do you see the observability story intersecting with the complexity of these new environments?
Roy Douber: Such a good question. Look, I think workloads are becoming more and more ephemeral, right? Your Kubernetes pod could go away at any moment, and your AI answers change depending on who asks the question and how they ask it. All of that data needs to be fed back to the devs, and potentially even to the clients, right? Clients often care; they go to a status page and want to see that the service you're providing is up and running.
Serverless is almost entirely ephemeral, right, like the function runs and then it goes away. If you have a problem with that function, what do you do?
I think traditionally the tooling has been: I'm going to set this up, install it on my machine, and then leave it alone to run in the background. But the use cases have evolved, with ephemeral jobs that come and go. You have to have a solution that supports that, and I think oftentimes businesses don't have that solution in place yet.
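For ephemeral workloads, the usual fix is push-based telemetry: the invocation itself emits its measurements before it disappears, rather than waiting around to be scraped. A hypothetical sketch, with an in-memory list standing in for a remote collector:

```python
import time
import functools

# Stand-in for a push to a remote collector; in a real serverless setup
# this would be an exporter flushed before the runtime freezes.
TELEMETRY_SINK = []

def instrumented(fn):
    """Wrap an ephemeral handler so its telemetry leaves with the invocation."""
    @functools.wraps(fn)
    def wrapper(event):
        start = time.perf_counter()
        outcome = "ok"
        try:
            return fn(event)
        except Exception:
            outcome = "error"
            raise
        finally:
            # Emit inside the invocation: this instance may never run again,
            # so there is no later chance to scrape it.
            TELEMETRY_SINK.append({
                "handler": fn.__name__,
                "outcome": outcome,
                "duration_s": time.perf_counter() - start,
            })
    return wrapper

@instrumented
def resize_image(event):
    return {"status": 200, "size": event["size"] // 2}

resize_image({"size": 1024})
```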
And so that's one of the things we specialize in: coming in, seeing where the gaps are, looking at their most critical applications, assessing what can and can't be done with the current tool set, and adding solutions where it makes sense.
The observability landscape is massive, right? There's a lot happening: there's consolidation within the landscape, and there are all these new vendors. Oftentimes new solutions come into play that are not yet ready for prime time, and that's where we have the expertise to say, go with this, or add this, or this will make your solutioning more flexible and give you the ability to pivot as needed.
Take a specific solution: you want to add AI to your product? Great, let's add this, so you can see what people are asking, when they're flagging a question, where there's room for improvement with your inference, where there's room for improvement with your runtime, and how you improve the response time of the AI itself or the model.
Is there room to use a different model that gives a very similar answer? How do you answer those questions if you don't have observability? I see it nowadays as a must-have. You cannot work in a world like this without it.
Kevin Kohn: No, that's so well put. If I may drill into something you said at the beginning of that explanation: observability allows you to see inside what your apps are doing, how they're running, and what your environment is currently doing.
There's a lot in that statement, right? It's not just, okay, my app is up and it's running. Is it performing at its optimum? Is it doing the best that it could from a performance standpoint? And it's not just that. It's also: am I selecting the right infrastructure for it to run on, to make sure it's operating in the most efficient way possible?
So there are multiple aspects to this. Can you dive into that at all? What are some of the things that are very interesting from the perspective of those other aspects?
Roy Douber: Yeah, so there's definitely a cost-optimization angle to this whole equation, right? Should you be running in containers? Can it be serverless? Can it just run on one server? It's very dependent on the type of application and where the bottlenecks in the application are. Is it a memory-heavy application? Is it a CPU-bottlenecked application? Is it a storage-bottlenecked application? All of these things need to be answered.
All of that data needs to be processed, and based off of that, you can make a proper architecture decision as you architect and re-architect your application. Ideally, you always architect your application optimally the first time, but we all know the realities of the world, right?
Kevin Kohn: Sure.
Roy Douber: Sometimes you have to make changes as the world evolves, right? Some of these AI use cases require, you know, more CPU, more memory, so that you can answer questions against the model, or fine-tune the model, or rebuild the application so that it compiles in time and you can release it more quickly to the public. All of these questions get answered via observability.
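One of the bottleneck questions Roy lists, CPU-bound versus waiting on I/O, can be probed crudely by comparing CPU time to wall-clock time. This is a rough illustration using only the standard library, not a substitute for a real profiler:

```python
import time

def cpu_share(workload, *args):
    """Rough bottleneck probe: ratio of CPU time to wall time.
    Near 1.0 suggests CPU-bound; near 0 suggests waiting (I/O, sleep)."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    workload(*args)
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return cpu / wall

def busy(n):
    # CPU-bound stand-in: pure arithmetic, no blocking
    total = 0
    for i in range(n):
        total += i * i
    return total

def waiting(seconds):
    # "I/O"-like stand-in: blocked, burning almost no CPU
    time.sleep(seconds)

print(f"busy loop: {cpu_share(busy, 2_000_000):.2f}")
print(f"sleeping:  {cpu_share(waiting, 0.2):.2f}")
```

In production you would get the same signal from host and container metrics rather than measuring in-process, but the interpretation is identical.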
Kevin Kohn: And that example you just shared is so fluid because with AI, you first have to define the use case, and then you have to get started on that path, maybe that use case works out the way you thought, maybe it doesn't, and you have to tweak how you're approaching it. So the observability has to be able to ebb and flow and morph with your activity.
You can't necessarily be a slave to the observability tool and how it's set up. It's got to be malleable, for the use cases that you're putting before it, would you agree?
Roy Douber: Yeah, you've got to be able to. I mean, oftentimes in any application you run out there, there's a CI/CD process, right? And in a blue-green release cycle, you want to see how the application is performing in real time. So if you're using Kubernetes, you could say, okay, I'm going to release this application to 5 percent of my user base.
I want to be able to see how that 5 percent behaves compared to the 95 percent that remained on the prior version, and slowly release to more and more users in my community of users. And again, if you're adding AI components and you've just changed to the latest model, say GPT-4o, I want to see how GPT-4o performs over GPT-3.5, right? What kind of answers is it giving, how are people using it, and what questions are they asking in aggregate?
All of those things can be defined via observability, giving you the insights you need to make proper decisions about whether it even makes sense. Like, is the extra cost of paying for GPT-4o giving my user base a better answer around the questions I've embedded within my application?
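The 5-percent rollout Roy describes maps to a standard Kubernetes canary setup. As one concrete, hypothetical example, the ingress-nginx controller supports weighted canaries via annotations; the host and service names below are made up:

```yaml
# Hypothetical canary Ingress for the ingress-nginx controller: routes
# roughly 5% of traffic to the new release while the stable Ingress for
# the same host keeps the other ~95%.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shop-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"
spec:
  rules:
    - host: shop.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: shop-v2   # new release; compare its metrics to shop-v1
                port:
                  number: 80
```

Raising `canary-weight` in steps, while watching the new version's error and latency metrics against the old one's, is the slow-expansion pattern Roy describes.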
Kevin Kohn: That's perfect. If I may, you know, pivot a little bit now into: how does observability increase or improve your code quality? And after that, I'll ask how it helps you drive revenue. But let's start off with code quality.
Roy Douber: So I think developers are challenged with keeping up with the constant evolution of software engineering. You now not only have to write software; you have to understand cloud technology, understand various database technologies, be able to write in various languages, and keep up with the multiple new models coming out in the AI space. All of these things lead to some level of confusion, and it makes it very, very difficult, I would say, to deliver quality software.
And so having observability in place gives developers a level of confidence in their code. They know they're going to have the insights they need to answer questions, either in dev or in production, in real time, as things unfold and users use the system in ways they maybe didn't anticipate originally. Without it, your code confidence goes way down, and you have to think and rethink about how adding those lines of code or those functions will propagate through the environments.
With it, you have answers in real time about what's really happening in your environment, how users are using it, and how you can make the next set of changes to make the product better. So in my mind, a product cycle that maybe would take a month before could take hours, days, maybe a week, right?
As developers grow more confident, they have more code confidence to push code to production. To me, this is a key driver of revenue that's often overlooked, because you really can't put a number to it. What's the value of an engineer who's able to push more code to production? It's really tough to answer. But if you have all the necessary tooling and mechanisms in place, then the engineer can feel comfortable that they'll be able to answer questions about it, push a hotfix, or deal with the fallout of any bug they've introduced into the service pretty quickly. And so to me, it's crucial. That's why I say it's no longer a checkbox; I think it's absolutely crucial to have observability in place.
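The kind of gate that turns this code confidence into faster releases can be sketched as a simple comparison of a canary's error rate against the baseline's. The thresholds and counts here are illustrative, not anything stated in the episode:

```python
# Hypothetical promotion gate: after a partial rollout, compare the
# canary's error rate to the baseline's and decide whether to continue.

def error_rate(errors, requests):
    return errors / requests if requests else 0.0

def promote_canary(baseline, canary, max_regression=0.01):
    """Allow promotion only if the canary's error rate is no more than
    max_regression (here, one percentage point) above the baseline's."""
    return error_rate(*canary) <= error_rate(*baseline) + max_regression

# (errors, requests) observed over the same time window
baseline = (12, 10_000)    # 0.12% errors on the old version
good_canary = (3, 2_000)   # 0.15% on the new version: within tolerance
bad_canary = (90, 2_000)   # 4.5%: roll back instead of promoting

print(promote_canary(baseline, good_canary))  # True
print(promote_canary(baseline, bad_canary))   # False
```

Real gates usually also compare latency percentiles and saturation, but the shape is the same: an automated decision backed by instrumented data, instead of a developer guessing.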
Kevin Kohn: Great, so this becomes a truly unique, do-more-with-less story for the engineer, right? We all know how busy we are, we all know how much is demanded of us, and we all know companies want to squeeze every last bit of productivity out of their staff, who are very valuable to the end result of their companies, their products, and what they're doing in the market.
And so having observability in place allows you, from what I'm hearing you say, to optimize the limited resources that each company has and make them go as far as they possibly can. Did I summarize that right?
Roy Douber: Yes! So I would say it gives the developers an extra hand in figuring out issues. It also becomes super valuable when you're dealing with the impact of an outage. Those do happen; we talked about them before. Usually in a big enterprise, you have an operations center, and when there's an outage with impact across the application, multiple servers, and networking devices, they see the alerts firing off, and the question is whether they can get to the right individuals more quickly.
When the observability has been instrumented, the architecture is good, and the operators understand what's going on, it lowers the barrier for helping with triage. It gives the engineers an extra hand, and ultimately, in many cases, the outages cost a lot more in brand reputation and business impact than actually having those observability tools, and the manpower behind them, properly instrumented and ready to go, so you can resolve that outage in minutes instead of hours or days in some cases.
Kevin Kohn: I somewhat smiled when you said that only because we're all so familiar with a recent outage that a software company just had where they released code that wasn't ready and it bricked a ton of machines all across the world.
Well, if only they had observability working in that case, I think we may have avoided a major catastrophe. Instead it cost billions of dollars in lost revenue across airlines and many other companies. So that's why I smirked a little bit: it's so apropos to the real-world use cases we're all experiencing today, it seems.
Roy Douber: Yeah, I mean, we can name it: the CrowdStrike outage is the classic case of not having the proper gates in place as you push code to production.
Realistically, if you had only pushed to a portion of your clientele before pushing the entire payload to everybody in the world at once, the signs would have been there to stop pushing to the remainder of your clientele. So I think there was potentially a mistake there.
The engineers in charge of that functionality maybe skipped an environment, or skipped testing across various operating systems as they pushed the code, and instead of pushing to dev, they pushed straight to production, which sometimes happens. But the other gate would be a slower rollout, where you first push to a subsection of your clientele and then slowly expand. This is the standard Kubernetes rollout where you do A/B testing: you say, okay, A is the new production release and B is the old production release, and I want to see how A behaves in this new environment. That was thoroughly missed in this case.
So if they had sent it just to dev, their own dev systems, or a portion of their clientele, they might have avoided this catastrophe.
Kevin Kohn: So again, pivoting a little bit, what are some misconceptions of observability that you've run into? What are some of the things you're seeing out there? What do people not understand about observability that you'd like to share?
Roy Douber: So I think a lot of companies still don't view it as a must-have. But in my experience over the years working with clients, those that do have it, even those doing it only marginally well (they don't have to do it perfectly), generally have a cleaner product, a better product, a higher stock price, you name it.
If you correlate the way a company runs its observability practice with everything about its application, how it behaves, what happens when you go to a page and it's broken, how it degrades, you can see this in various products out there.
It correlates greatly with a strong engineering culture and an ability to resolve issues more quickly. So that to me is one misconception: people still, for some reason, believe it's a checkbox item. I believe it's crucial, and I would encourage most companies out there to instrument the heck out of their mission-critical applications, because if you don't, you're moments away from an overnight outage that comes knocking exactly when you don't need it to.
Kevin Kohn: If I may, I think you're being a little too generous in saying that most people consider it a checkbox item. I would go as far as to say a lot of people think observability is optional. How many environments do I go into in a week, or clients do I talk to in a week, that don't even have an observability story to tell? They don't have a strategy, they don't have a plan, they haven't picked a vendor. Or maybe they have multiple observability tools in their environment, which makes it equally hard to track down issues, because you have to interface them all together and compile all the forensics to bring back.
Roy Douber: Yeah, I agree. More often than not, even today, we go into clients and either they have 30 tools and can't make sense of any of the data those tools represent, or there's one guy in each division of the company who understands a tool really well, but nobody else has access to it and nobody knows what's going on.
So one recommendation I would make there is: democratize your data. Make sure everybody has access to the tooling, and ideally consolidate as well, right? Generally, if you have 30 tools, you're 25 tools too many, and in order to democratize your data you want to consolidate, while being very mindful of the gaps each tool covers.
Kevin Kohn: Yeah, so I really like what you were saying about how you might have 30 tools, or 25 tools, and yet it's not even as good as having five, or even one, which is the point I think you were trying to make, right?
Sometimes less is more. Standardizing, and having a process around it where end-to-end the teams understand the one or two or three tools you're using, allows you to better remediate problems as you have them. Is there a component that I missed?
Roy Douber: No, I think that was good. So yeah, consolidate your tooling, and give access to the tooling. You don't want your teams pointing fingers at each other, right? You want everyone looking at the data from a vantage point of: hey, here's my data, here's your data, let's come together and try to solve this together.
I think siloed triage is a big problem in a lot of companies, and if you properly architect your observability story, you can build a solution that allows access for all your engineers. That brings a lot of engineering prowess to bear too, because maybe right now an engineer is a network engineer or a storage engineer, but in their previous life they were also an operator, and they understand a few things about this kind of issue or that kind of issue.
So the idea is that you facilitate, and have the mechanisms in place, to allow for better triage stories and for triage along multiple paths. People can look at a problem within a piece of the application and go down a path until they say: either this is still a valid issue I'm seeing in the environment based on a symptom, or I've exhausted all my options down this path and I'm crossing it off the list.
Now I can go join the triage of some other team, a tiger team chasing this problem from a different angle. And I think it's key to have all the data in one place, or as close to one place as possible, so that you can properly perform this level of triage.
Kevin Kohn: Thank you! I mean, that was really informative. I think we covered a ton of ground today with regards to observability and the aspects of it. So Roy, I wanted to thank you for taking some time out of your day to go through and walk us through some of the very important aspects of observability in a customer's environment and our environments, making sure our data centers are running effectively, efficiently, and accommodating for all the new technologies that are coming out.
Roy Douber: Thank you, Kevin. I really enjoyed this. Maybe there's a part 2 in the future. I think there are a lot more misconceptions and things we can discuss, maybe specific to particular technologies we haven't covered here, but this was a broad stroke and I think it was very, very interesting.
Kevin Kohn: Well, thanks for getting us kicked off with the 101 observability class and let's look at a 201 later on. So that would be great.