Explore the evolving world of application delivery and security. Each episode will dive into technologies shaping the future of operations, analyze emerging trends, and discuss the impacts of innovations on the tech stack.
00:00:05:03 - 00:00:33:09
Lori MacVittie
Welcome back to Pop Goes the Stack, your front row seat to emerging tech, digital duct tape, and the slow unraveling of sanity in production. I'm Lori MacVittie. Brace yourself, because today we're going to talk about the most expensive lie in modern infrastructure: logging is free. To do that, we're going to explore OpenAI's discovery that logging isn't free
00:00:33:15 - 00:00:37:28
Lori MacVittie
and a little bit of depth with our co-host, Joel Moses. Joel.
00:00:38:01 - 00:00:39:16
Joel Moses
Good to see you, Lori.
00:00:39:18 - 00:00:45:05
Lori MacVittie
And our observability expert, Chris Hain. Chris, welcome back.
00:00:45:08 - 00:00:47:11
Chris Hain
Hey, guys. Thanks for having me. Good to be here.
00:00:47:13 - 00:01:27:21
Lori MacVittie
No pressure, no pressure.
Chris Hain
No, yeah.
Lori MacVittie
All right. So what spurred this discussion was OpenAI reclaimed 30,000, let me repeat that, 30,000 CPU cores simply by tweaking Fluent Bit, some logging software. Now, they didn't rewrite the models. They didn't buy more GPUs. They fixed telemetry overhead. Because it turns out that your observability pipeline can be one of your biggest workloads, especially once AI shows up and starts generating logs like it's trying to win some kind of contest.
00:01:27:23 - 00:01:48:04
Lori MacVittie
So let's dive in and talk about what happened--right, so a little bit of the details--and what this really means for regular enterprises that don't have 30,000 cores to begin with, but who want to make sure that they're optimizing every resource that they have. So, Joel, you want to dive in a little bit?
00:01:48:07 - 00:02:14:16
Joel Moses
Well. I mean Chris, of course, is an expert in observability, so I'm going to defer to him to kind of explain exactly why Fluent Bit was so absolutely CPU hungry in this particular case. But this represents something that, you know, the assumption is and has always been that observability or monitoring or logging is basically free. And it is free, but it's free in the same sense that air conditioning in the summer in Texas is free.
00:02:14:19 - 00:02:28:19
Joel Moses
It's absolutely free because you don't think about what it actually costs because you need it. And so that's one of the reasons this wasn't noticed. But, Chris, why don't you tell us: what was Fluent Bit doing that was so busy.
00:02:28:22 - 00:02:37:24
Chris Hain
Sure. Well, so first I'll just say, if you really think logging is free, you've probably never paid a Splunk bill at a medium to large size organization, so.
00:02:37:28 - 00:02:38:14
Joel Moses
Good point.
00:02:38:20 - 00:02:39:20
Chris Hain
Anyway
00:02:39:22 - 00:02:44:06
Lori MacVittie
DataDog. Maybe DataDog bill?
00:02:44:09 - 00:03:07:27
Chris Hain
So basically, Fluent Bit is running as a DaemonSet on all of their nodes, which means all of the nodes are running Fluent Bit, basically watching these logging files, reading the data that's coming off of them, and sending it off to some probably centralized logging place somewhere. I think they mentioned ClickHouse. So Fluent Bit is basically everywhere, and it's doing work on every single node in their network, or in all of their clusters.
00:03:07:29 - 00:03:28:12
Chris Hain
And one of the things that it's doing is watching for updates to these files. And actually, you brought up a very good point that I hadn't thought of; I hadn't dug into it as deeply, at that level. But basically it was doing a bunch of overhead work that was not providing the amount of value that they felt was, you know,
Joel Moses
Right, right.
00:03:28:15 - 00:03:44:00
Chris Hain
good enough for the amount of CPU overhead that it was adding. So when you added up this extra work that was being done and multiplied that by the number of nodes in their system, it wound up that it's eating, you know, what it...I didn't do the math, it's like $1 million a month worth of
Lori
It was
Chris Hain
something.
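To see why an agent's overhead shows up on the bill at that scale, a back-of-envelope calculation helps. The per-core price here is a hypothetical placeholder, not OpenAI's actual rate:

```python
# Back-of-envelope: fleet-wide cost of reclaimed telemetry CPU.
# The price per core-hour is an assumed placeholder, not a real quote.
cores_reclaimed = 30_000        # cores freed by the Fluent Bit fix
price_per_core_hour = 0.04      # hypothetical on-demand $/core/hour
hours_per_month = 730           # average hours in a month

monthly_cost = cores_reclaimed * price_per_core_hour * hours_per_month
print(f"~${monthly_cost:,.0f}/month")
```

At those assumed rates the total lands in the high six figures per month, which is roughly the "$1 million a month" ballpark mentioned here.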
00:03:44:03 - 00:03:45:06
Joel Moses
That's a lot.
00:03:45:09 - 00:03:47:18
Lori MacVittie
It was a lot. Yeah, it was a lot.
00:03:47:21 - 00:04:06:04
Joel Moses
So, effectively, what Fluent Bit was doing is it wants to see whether a file has changed. It's a pretty simple operation: if you're going to watch for log file additions and you want to see if additional lines have come in, you get an update from the system that new data has been attached to a file, or a file has been deleted, or a file has been moved, etc.
00:04:06:07 - 00:04:33:18
Joel Moses
And the facility to do that under Linux is something called inotify. It's one of them; it's not the only one, but it's one of them. inotify, though, on 64-bit systems, allocates a buffer of about one kilobyte of memory every single time you use it. And if you call inotify repeatedly on a large number of files, it allocates that memory and deallocates that memory in rapid succession.
00:04:33:18 - 00:04:54:02
Joel Moses
And so what they were actually dealing with is the fact that they were allocating a whole lot of memory, a whole lot of times, through this inotify interface. And so the fix was to simply turn that off and go back to something that's fairly simple--they just periodically scan the system to see what changed. And that solved the issue.
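The periodic scan Joel describes can be sketched in a few lines: compare each file's size and mtime against the last pass instead of asking the kernel for notifications. This is a minimal illustration of the idea, not Fluent Bit's actual code (in Fluent Bit, as I understand it, the corresponding knob is the tail input's `Inotify_Watcher` setting):

```python
import os

def scan_for_changes(paths, last_state):
    """Poll files for changes by comparing size and mtime to the last scan.

    A periodic scan like this trades notification latency for a
    predictable, bounded cost: one stat call per file per interval.
    Returns (changed_files, new_state).
    """
    changed = []
    new_state = {}
    for path in paths:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            continue  # file was rotated or deleted between scans
        sig = (st.st_size, st.st_mtime)
        if last_state.get(path) != sig:
            changed.append(path)
        new_state[path] = sig
    return changed, new_state
```

A log collector would call something like this on a timer and then read the new bytes from each changed file.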
00:04:54:04 - 00:05:19:24
Joel Moses
The funny thing to me about that is turning off a feature like that turns off some functionality that gives you notifications faster. But just turning off the feature does not help you understand why the problem was actually occurring. And that's memory back pressure. And so what that shows me is that the solution to this problem was not fixing Fluent Bit.
00:05:19:24 - 00:05:46:08
Joel Moses
The solution to this problem probably was: optimize your host system well. They were hitting limits. And adjusting the limits or managing the limits, which is SRE work, that probably would have solved the problem, too. You know, it's interesting, if the simple fix is to turn off a feature, the real bug in the system is probably hubris.
00:05:46:10 - 00:06:17:07
Chris Hain
Definitely. Yeah, one of the things I liked about this story, before I thought about it at that level of depth, was just the fact that, you know, problems like this exist everywhere, especially in, you know, large-scale systems. You can pretty much just throw a dart at your architecture diagram and go dive into it, and, like he was kind of mentioning in the article, really try to understand what's going on at a deep level, and just throw all your assumptions about, you know, where things are costly or where resources are being burned.
00:06:17:10 - 00:06:31:29
Chris Hain
Throw all those away, start from scratch, and just really deeply understand it and you're probably going to find something that you can, you know, spend a little bit of time to optimize. Probably not 30,000 cores worth of optimization in most cases,
Joel Moses
Ha, no.
Lori MacVittie
You could get 5.
Chris Hain
but still meaningful amounts.
00:06:31:29 - 00:06:33:03
Lori MacVittie
Five would be good.
00:06:33:06 - 00:07:08:21
Joel Moses
Now the article did point out something that I think is a good thing. And they found this problem simply by running system profiling. Meaning
Chris Hain
Yep.
Joel Moses
they were measuring how many system calls were taking place and what types of system calls in particular that Fluent Bit was calling. And they came across one particular call. It's fstatat64, which is a fairly expensive call that inotify makes. And when they finally figured out exactly what function was doing that inside of Fluent Bit, they noticed that you could adjust that by simply flipping the feature off.
00:07:08:23 - 00:07:31:15
Joel Moses
I think you can learn two things from that. Number one is you should know how to run a system profiler.
Chris Hain
Absolutely.
Joel Moses
Number two is, you should know how to figure out within an application that you're running, where bottlenecks are. And this is a system bottleneck that occurred inside of Fluent Bit. So, kudos to the team that found it.
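As a toy version of lesson number one, here's the same workflow using Python's built-in cProfile as a stand-in for system profilers like perf or strace (which is where you'd actually see syscall counts). The `noisy_stat` function is a hypothetical hot loop, not anything from Fluent Bit:

```python
import cProfile
import io
import os
import pstats

def noisy_stat(paths, repeats):
    """Hypothetical hot loop that hammers the filesystem with stat calls."""
    for _ in range(repeats):
        for p in paths:
            os.stat(p)

# Profile the loop and rank functions by call count, the same way
# syscall counts pointed the engineers at fstatat64.
profiler = cProfile.Profile()
profiler.enable()
noisy_stat(["."], 5_000)
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("ncalls").print_stats(5)
report = out.getvalue()
print(report)  # os.stat dominates the call counts in the report
```

The habit is the same at any layer: measure where the calls actually go before assuming where the time is spent.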
00:07:31:15 - 00:07:51:06
Joel Moses
Now, as for the solution to turn off the feature, okay, you can do that. There's probably other solutions that rely on system tuning that they could have taken advantage of. But it's an interesting find. And it brings up a really important distinction which is that, you know observability again is not free. It's not free in terms of storage.
00:07:51:13 - 00:07:58:18
Joel Moses
And it's not free in terms of CPU. And there's probably better ways we should be observing systems, aren't there?
Chris Hain
Absolutely.
00:07:58:21 - 00:08:30:29
Lori MacVittie
Yeah. Well it also says that, I mean 30,000 cores. So if you're reclaiming 30,000 cores, that assumes that 30,000 cores were working on observability. Which, right, ten years ago we would have never believed you would need that much. Right? Observability was, well, just write a log a little bit, write a little bit of information. But AI multiplies the type of information that you need and then also the different places it needs to go.
00:08:30:29 - 00:08:51:12
Lori MacVittie
If it's collecting real-time stats about how the system is performing, what it's doing, how many resources it's using, that might go to one system because it's more real-time, almost AIOps-like. And if you've got things like, I don't know, audit logs and, you know, keeping track of what's coming in and out, that's a different system.
00:08:51:12 - 00:09:02:05
Lori MacVittie
So you've got multiple pieces of observability data that are absolutely important today. You can't skip it either.
Joel Moses
Yeah.
Chris Hain
Yeah.
00:09:02:07 - 00:09:19:15
Joel Moses
And the other thing is the amount of information. I mean, Chris, help us out here like, profile for us a particular system that is running an AI workload and how many microservices is that? And how many log files, how many counter tables? What are we talking about? How much stuff is
Lori MacVittie
Oo, two.
Joel Moses
actually, relative to five years ago, how much stuff are we monitoring?
00:09:19:15 - 00:09:23:26
Lori MacVittie
Two, my lord. Two.
00:09:23:29 - 00:09:46:15
Chris Hain
Well, I mean, you just think about it like your access log at the front end, the thing that, you know, receives that request.
Joel Moses
Right.
Chris Hain
Maybe 25, ten years ago, that was the thing you logged. Today, yeah you've got microservices behind it, so at least a span for every one of those, hopefully. They're all generating their own logs about what they're doing at the point in time. Like the overhead of logging and tracing
00:09:46:15 - 00:10:03:21
Chris Hain
and modern observability practices can way outstrip that, you know, original little bit of processing you were doing on that one request. So, yeah, it's not uncommon for the observability stack in an organization to be the predominant thing that you're spending money to run.
00:10:03:28 - 00:10:22:24
Joel Moses
Wow, that's wild. So, you know, it's good, I've heard it said that "you can't optimize what you don't measure." And so I think that by default lets someone think that you should just collect everything
Chris Hain
Sure.
Joel Moses
and then store everything and then you can optimize later on if you happen to need it.
Chris Hain
Yep.
00:10:22:24 - 00:10:31:26
Joel Moses
But this is what happens when you measure everything, right? So what should someone do to prevent that from occurring?
00:10:31:28 - 00:10:59:03
Chris Hain
Yeah. Well, I mean, one of the other things about like legacy logging versus kind of more modern LLM things is like originally logging was a thing that some like guy wrote and said, "hey, I'm going to leave you this little cookie in case this event ever occurs." But now it's like this arbitrarily large block of data that an LLM has generated that you might need to go back and parse out for, you know, performance or for, you know, training your models as you go forward.
00:10:59:03 - 00:11:23:28
Chris Hain
So the volume of things is like no longer bounded by the number of requests that are coming in. It's like, what did the models decide to do with that request? And how much text did they decide to spit out? So that's another huge shift. But yeah, the other thing that's kind of happening is like the metadata that is associated with any particular log is becoming increasingly important.
00:11:24:00 - 00:11:41:13
Chris Hain
You know, it's no longer enough just to say, "oh yeah, it came off this pod on this node and here's the message and whatever other attributes might have attached to it." Now you also need to know all this additional context. You know, what model was being generated? What tool was being called? Or, you know, some of these other things that just weren't originally there.
00:11:41:13 - 00:11:54:18
Chris Hain
So like the idea of having more structure attached to the log as additional info versus just having the log line be the thing that gets you all the information you need, is a thing that's kind of evolving a lot as well.
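A concrete way to picture "more structure attached to the log" is emitting records as structured data, where the AI context rides along as fields instead of being buried in the message text. The field names below are purely illustrative, not any standard schema:

```python
import json
import time
import uuid

def make_llm_log_record(message, *, model, tool=None, trace_id=None, **attrs):
    """Build a structured log record carrying AI-specific context.

    Hypothetical schema: the point is that model, tool, and trace
    metadata travel as queryable attributes rather than being
    stuffed into a free-text log line.
    """
    record = {
        "timestamp": time.time(),
        "trace_id": trace_id or uuid.uuid4().hex,
        "message": message,
        "model": model,
        "tool": tool,
    }
    record.update(attrs)  # pod, node, or any extra context
    return json.dumps(record)
```

A backend can then filter on `model` or `tool` directly, rather than regex-parsing every line after the fact.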
00:11:54:20 - 00:12:03:27
Lori MacVittie
It's tracing on steroids. Like, I mean, this is, that's crazy. Like, who did what, right? It's the five W's plus; we have to know.
00:12:03:29 - 00:12:07:18
Chris Hain
Yeah. Who did what and what was going on while they were doing it. And like, everything that you need to know.
00:12:07:18 - 00:12:12:12
Lori MacVittie
Why did they do it?
Chris Hain
Yep.
Lori MacVittie
Now we need to know why. Why model? Why did you do that?
00:12:12:14 - 00:12:37:11
Joel Moses
So if you're doing a high degree of monitoring local to the same host where this is being generated, obviously you're going to get into situations like this which are contention; where observability is contending with the active workload.
Chris Hain
Yep.
Joel Moses
What sorts of things might we be able to do to reduce, or how about eliminate, contention? Can we observe elsewhere?
Chris Hain
Potentially.
00:12:37:14 - 00:12:58:17
Chris Hain
I mean, so typically sampling is a thing that is done very heavily, right? Like I can't afford to take everything out of here. So I will take the ones that hopefully I have some degree of insight that they are important, right. If I'm doing a thing called tail sampling where I can say, "hey, this took too long, I'm going to ship that up to the back end to figure out maybe why."
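The latency-based tail sampling Chris describes can be sketched as: hold each trace's spans until it finishes, then ship it only if it crossed a threshold. This is a minimal sketch of the idea; real collectors (for example the OpenTelemetry tail sampling processor) also sample on errors and attributes:

```python
def tail_sample(traces, latency_threshold_ms):
    """Keep only traces whose end-to-end duration exceeds the threshold.

    `traces` maps a trace id to a list of (start_ms, end_ms) spans.
    Everything under the threshold is dropped at the edge instead of
    being shipped to the backend.
    """
    kept = {}
    for trace_id, spans in traces.items():
        start = min(s for s, _ in spans)
        end = max(e for _, e in spans)
        if end - start > latency_threshold_ms:
            kept[trace_id] = spans
    return kept
```

The trade-off is buffering: you have to hold a trace's spans somewhere until you can decide whether it was "too slow."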
00:12:58:19 - 00:13:23:04
Chris Hain
But generally speaking, yeah, I think there's a lot more effort happening to kind of shift that observability work to other places than it's traditionally been. So you'll see a lot of eBPF-based profilers, as well as just like auto instrumentation elements showing up in agents these days. So, you know, it doesn't help the too much data problem, right?
00:13:23:04 - 00:13:32:24
Chris Hain
Because now you don't even have to wait for the software engineer to go instrument his app, it's just done for you automatically. So now you've got more data than even the...
00:13:32:26 - 00:13:53:25
Lori MacVittie
Yeah, but you have to pay attention to how it's doing it. I mean, you mentioned sampling and then I think to the different kinds of telemetry that we're collecting and I go to--of course I like the Ops side--so I'm thinking about: performance stats, what's going on, lots of numbers, lots of things that don't necessarily have a lot of change--
00:13:53:25 - 00:14:19:23
Lori MacVittie
one hopes. Right. And, you know, you've talked about this, right, using sampling on what's basically time-series data in order to make sure that you get the important changes, like, "hey, something changed!" You notice it right away because you're not sending everything and having to comb through it. But it's also faster. It reduces how much bandwidth, CPU, I mean, everything that you need, by only taking what you need.
00:14:19:23 - 00:14:30:28
Lori MacVittie
So that's one place you can immediately look at to optimize. Right, a high-speed telemetry plane for things like number-based metrics.
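One simple version of what Lori is describing, shipping only the numbers that actually changed since the last sample, looks like this (a sketch of the idea, not any particular product's telemetry plane):

```python
def changed_metrics(previous, current):
    """Return only the metrics whose values differ from the last sample.

    For numeric telemetry that's mostly stable, this "report on change"
    filter cuts bandwidth and backend load while still surfacing the
    moments when something actually moved.
    """
    return {k: v for k, v in current.items() if previous.get(k) != v}
```

A collector would run this each tick and hold back anything empty; a value that appears for the first time counts as a change.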
00:14:31:00 - 00:14:33:05
Chris Hain
Sure. Yeah. I mean,
00:14:33:08 - 00:14:35:06
Lori MacVittie
Sure. That's it. All right, yeah, there we go guys.
00:14:35:08 - 00:14:36:28
Joel Moses
Absolutely, you got it.
Lori MacVittie
That's the whole thing.
Chris Hain
Problem solved.
00:14:36:29 - 00:14:39:20
Lori MacVittie
We got it. We're done. Thanks.
00:14:39:22 - 00:14:54:26
Joel Moses
So what things are occurring in modern observability systems to kind of help this along? Like the ability to do, say, tunable collection is one. So, like, on demand collection of something; I don't need it right now, but I need to prepare for the time when I do need it. What's going on out there?
00:14:54:26 - 00:15:15:11
Chris Hain
Or like the local collection, but that you can then, you know, drain a certain amount of history when you find there's a problem. So it's instead of always on, it's like locally, you know, buffering a bunch of data that maybe you wouldn't otherwise pay to store and forward. But when you need it, you can click the button and away it goes.
00:15:15:18 - 00:15:36:24
Chris Hain
Some of those concepts are very interesting, and they could even extend down to like the user device, right? So if I've got an app running on a phone, I can, you know, use a little bit of a ring buffer there to hold the interesting stuff. And then when things are really breaking and it's their client specifically, I can now start getting it instead of having to take all of the data from all of the clients all the times.
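The client-side ring buffer Chris describes maps neatly onto a bounded deque: keep the last N records locally, and drain them only when something breaks. A minimal sketch:

```python
from collections import deque

class DebugRingBuffer:
    """Hold the last N log records locally; ship them only on demand.

    Instead of streaming everything from every client all the time,
    the device buffers recent history and uploads it only when a
    problem is detected or support asks for it.
    """

    def __init__(self, capacity=256):
        self._buf = deque(maxlen=capacity)

    def record(self, entry):
        self._buf.append(entry)  # oldest entries fall off automatically

    def drain(self):
        """Return the buffered history for upload and reset the buffer."""
        history = list(self._buf)
        self._buf.clear()
        return history
```

The capacity is the knob: big enough to capture the lead-up to a failure, small enough that idle clients cost nothing upstream.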
00:15:36:27 - 00:15:53:05
Chris Hain
So that kind of stuff is super interesting, I think. But yeah, it's a never ending battle between balancing what you need to solve a problem or to identify a problem and what you want to actually pay to have available to you.
Joel Moses
Right.
00:15:53:07 - 00:16:04:04
Lori MacVittie
Well, let's be fair. That came about because in the beginning, in the olden times, we didn't have...I, yeah, when the deep magic was born, we
00:16:04:06 - 00:16:07:18
Joel Moses
When dragons roamed the land.
00:16:07:21 - 00:16:09:18
Chris Hain
Too much data was not the problem.
Joel Moses
Yes.
00:16:09:18 - 00:16:27:12
Lori MacVittie
No. Right, it was almost a dearth of data and
Chris Hain
Yeah, exactly.
Lori MacVittie
you couldn't solve the problem because you didn't have enough information. So we kept adding information. And now we've almost got to the point where we have too much information, now we can't find what we need in that. So. Yeah.
00:16:27:14 - 00:16:45:14
Chris Hain
Yeah. So many security events and things that occur; like, the data that you needed to detect it was there, it's just that it wasn't being looked at or whatever. So yeah, this is a problem that is new, that is different than the problems we had ten years ago. But it's, you know, still a problem.
00:16:45:17 - 00:17:18:17
Joel Moses
Yeah. Nothing says modern distributed system quite like spending more compute on observing reality than creating it.
Lori MacVittie
Oo, oooo.
Joel Moses
I mean, it's, that's amazing. But it's true. I mean, the more elements that you include in a system, the more observability components you have to have, the more data it generates; it's a compounding problem. And it's something that I'm looking forward to things like the OpenTelemetry effort to help solve some of these things. The ability to have systems that can kind of configure on the fly or request the observability components.
00:17:18:17 - 00:17:39:17
Joel Moses
Those are exciting things to me. Personally, I'm excited about the prospect of DPU technology being involved in monitoring workloads. So instead of doing it on the CPU that the host has to do all the other things, you do it on a sub-CPU that is nearby and tasked for observability. I think those things are pretty cool.
00:17:39:19 - 00:17:52:23
Chris Hain
Yeah. Offloading a lot of, you know, the data enrichment pieces in terms of what is being collected and how it's being shipped out and being able to do that in hardware locally there at the node, that's potentially a game changer.
00:17:52:25 - 00:18:04:02
Joel Moses
I agree.
Lori MacVittie
Oh it's, yeah, it's cool. And I think eventually we're going to have many, many PUs that are tasked with doing different things. You almost have to.
00:18:04:02 - 00:18:06:22
Joel Moses
Oh, I hate that, Lori. I just hate that.
00:18:06:25 - 00:18:07:10
Lori MacVittie
I know you hate it but
00:18:07:10 - 00:18:11:15
Joel Moses
I have a GPU and a CPU and a DPU and an NPU and a...pew.
00:18:11:17 - 00:18:46:15
Lori MacVittie
IPU, QP, BBQ.
Joel Moses
Nice.
Lori MacVittie
I just want a barbecue on my ser-...we're past time with that, because if we go further it's going to get terrifying. So, you know, what can your typical enterprise do--again, it doesn't have 30,000 GPUs or CPUs to reclaim, but it still wants to explore this--to get it right, to do something about it before it becomes a problem?
00:18:46:15 - 00:19:05:04
Lori MacVittie
Because AI's not going away; it is multiplying the problem. You know, aside from throwing a dart--which, I mean, I like that, Chris, but I don't think that's really a strategy that most, you know, CIOs would take--you know, what's your advice? What should they learn from OpenAI's mistakes?
00:19:05:06 - 00:19:24:03
Chris Hain
Start by, you know, top down. Where are you spending all your money? What are the expensive things happening? Because probably if you can shave a percent or two off that, then you're going to have more impact than shaving, you know, 100% off the thing that doesn't matter. So use your observability data to figure out, you know, where the resources are going.
00:19:24:06 - 00:19:40:25
Chris Hain
And then start there. Dive in. Go as far as you can with the metrics and the high level stats about what time is being spent doing which things. As you drill into that, you'll find naturally that at some point you'll get to a space where you say, "okay, this is taking a long time. Does that make sense or not?"
00:19:40:27 - 00:20:00:22
Chris Hain
Something like a profiler could really help you figure out whether that is, you know, what is expected or not. So being able to kind of drill down from the very high level like, "hey, the CPU usage across our fleet is this," down to the very specific "this, you know, process is running," "this is called a billion times and that looks bad."
00:20:00:25 - 00:20:12:29
Chris Hain
That, you know, having the availability of data to do that, is key. And it exists in more places today than it definitely did ten years ago. So let's go use it, I think.
00:20:13:01 - 00:20:37:09
Joel Moses
Yeah, and I agree. I think monitoring, and learning how to run profiling, I think that's a really important aspect, and not forgetting to optimize. A lot of time is spent optimizing applications; very little is spent optimizing the host that runs the application, and I think that's important to realize. What I've learned is that, you know, logging is not free.
00:20:37:11 - 00:20:52:24
Joel Moses
Logging is only free in the sense that it's not like free parking or free lunch; it's more like free puppy, free kitty, or free child. There are downstream costs to this transaction.
00:20:52:26 - 00:21:19:00
Lori MacVittie
Oh. Wow. I would say, right, "measure what matters." You know, figure out what matters to you if it's cost, if it's performance, right, measure those things. Actually pay attention to it. And cost is going to be a big one. And, right, don't measure necessarily things that don't matter. You don't need to, right, resolve every IP address in realtime.
00:21:19:03 - 00:21:36:23
Lori MacVittie
Okay. Like, you can do that later. Probably. It's really, yeah, you want the basics. So with that, that's a wrap for Pop Goes the Stack. Tap sub-subscribe. That, why is that word so hard to say? Subscribe. Say that. You guys
00:21:36:26 - 00:21:39:11
Joel Moses
Subscribe! Hit that subscribe.
00:21:39:14 - 00:21:52:25
Lori MacVittie
See! Subscri- tap the button that lets you know when another episode is out, okay? Stash some backups and rendezvous here for the next meltdown. Keep your pager on ready.