Technology Explorations in Data & AI | Azure Log Analytics Costs Are Out of Control

Azure Log Analytics costs often take up 20% or more of a cloud bill, even though most teams only check logs when something breaks.

Azure's default analytics logs are powerful, but they're also expensive and often unnecessary for day-to-day log inspection. Switching application logs to Basic Logs can reduce Log Analytics costs by up to 60%.

In this episode, Niels walks us through a real customer case where logging costs dropped by thousands per year. They explain the difference between Analytics, Basic, and Auxiliary logs, show when Basic Logs are sufficient, and discuss practical setups using Azure Container Insights and FluentBit. This includes building a custom FluentBit plugin in Go as well as real-world gotchas like missing short-lived pods and why dynamic credentials matter.

Resources:

Custom FluentBit plugin: https://github.com/nclaeys/fluent-bit-go-azure
Full playlist: https://www.youtube.com/playlist?list=PLJ_da7qdfL80rA7byzC_CmyrfJWjcCTnb

Chapters:

(00:00) - Intro: why optimize Azure log costs?
(03:10) - What kind of logs are we dealing with?
(06:13) - Plan types & the cost difference
(09:59) - FluentBit vs Azure Container Insights
(13:33) - How FluentBit works in K8S
(16:41) - Can you lose log data?
(17:36) - A custom plugin for Azure Workload Identity
(21:05) - Why not use Azure Container Insights?
(22:35) - Do all clients benefit?
(23:41) - Summary & takeaways

Data & AI: Technology Explorations is a biweekly show from Dataminded. Each episode a Dataminded engineer demos a tool or technique worth knowing about -- working code, honest takes, no hype.

Music by Aleksandr Karabanov from Pixabay

Creators and Guests

Host

Jonny Daenen

Head of Knowledge at Dataminded

Guest

Niels Claeys

Partner at Dataminded

What is Technology Explorations in Data & AI?

Deep dives and practical demos on the technologies shaping modern data and AI development. Join the Dataminded team as we explore, unbox, and critically review the latest tools, from building AI agents and RAG systems to optimizing cloud costs and accelerating data pipelines. We cut through the hype to show you what actually works in real data engineering practice, complete with demo code!

Jonny Daenen (00:00)
How can you reduce your Azure cloud logging costs by more than 60%? Quite often we see cloud bills where 20% or more is logging costs. You can reduce those costs with one simple trick, and that is changing analytics logs to basic logs. In this video, we'll show you how. Let's have a look.

Jonny Daenen (00:26)
Hi everyone, welcome to Technology Explorations at Dataminded. In this series, we give you a first look at new or interesting technologies. My name is Jonny, knowledge lead here at Dataminded, and today we'll have a look at Azure log optimization. And for that, I've brought with me Niels, one of our lead experts at Conveyor. Hi Niels.

Niels Claeys (00:44)
Hi, thanks for having me.

Jonny Daenen (00:46)
Before we start, Niels, could you briefly introduce yourself, what is your role at Dataminded, and what do you do day to day?

Niels Claeys (00:51)
Yeah, sure. Like Jonny already mentioned, I work for Conveyor. Conveyor is a data product workbench, as we call it. It's an accelerator that we have built to service our customers. One of the things we do is manage their infrastructure. Today we will mainly talk about the Azure part.

Jonny Daenen (01:10)
Now, tell me why is this relevant to talk about log optimization? It's quite a technical topic.

Niels Claeys (01:15)
Yeah, yeah, I know. But I think it's a super good use case of how a platform and platform team within an organization can have a big impact. So log optimization is important mainly from a cost perspective. We want to have logs in order to be able to troubleshoot things, but if you don't govern or control which logs you store, you quickly end up with a huge bill. So some optimization is required there.

Jonny Daenen (01:44)
So basically what you're telling me is logs can be a big factor of your cost and you want to reduce those costs.

Niels Claeys (01:50)
Yes, well that's basically it. This is an example of one of the environments where we run. You see on top the full cost, which is roughly 16k for our deployment, and we see that the logs themselves cost 1,500 euros a month, which is quite a big chunk. These logs are used from time to time if there is a problem, but it's not something that everyone checks all the time.

Jonny Daenen (02:16)
Yeah, so if I look at this, I see the green part. That's almost the same size as your storage cost even. So this is a really big part of log information that you pay for.

Niels Claeys (02:26)
Yeah, that's partly because, well, since we run a lot of batch jobs, there are a lot of logs. So of course, well, that means that the costs are high, but also the fact that by default logging in Azure is quite expensive if you compare it to AWS or GCP. So it's vital to make sure that you only store the logs that you need.

So one of the things we worked on for this customer specifically is indeed to get their logging costs down. And you see that it was rolled out in the middle of November, so you already see a big reduction. If you would look at the end of December, then the total cost will be even a bit lower than this, roughly a third, I think, of what the initial cost was, which is quite a big difference for them: around a thousand euros a month that they save.

Jonny Daenen (03:10)
Yeah, okay, that's quite significant. And for me to understand: what kind of logs are we basically talking about?

Niels Claeys (03:17)
Yeah, so what can you store within Log Analytics workspaces on Azure? You can have both metrics and logs. But what we focus on here is purely application logs, or in our case, batch jobs. If you run a Spark cluster, the logs of Spark; if you run dbt, the output of what dbt runs in your container. We also run Airflow, so all the Airflow logs are also in there.

For the applications that you run, in our case on Kubernetes, all those logs get ingested into Log Analytics, into those tables, and then people can inspect them or view them through our UI, which is basically doing queries in Log Analytics on top of one of these tables.

Jonny Daenen (03:57)
Okay, I see. So basically we run jobs in containers. All of these containers spit out logs, whether it's dbt, Spark, or Airflow, and they are collected in Log Analytics workspaces in Azure. And then on top of that, you can connect your own tools or applications to read those logs and do analytics on top of that. Okay.

Niels Claeys (04:14)
Yeah, that's it.

Of course, they also have their own UI, but we provide it integrated within ours. I think it's good to quickly showcase this.

So let's take our staging environment. This is what the Log Analytics workspace looks like. You get basically a bunch of tables, and in our case there will be quite a lot, for all different types of use cases. You see here the Kubernetes control plane logs, you can see the container logs, and then you see some custom logs, which I will dive into a bit further. What is also important to notice is this plan column, which we will discuss later: you have different types of plans in Azure, and by default all tables are on the analytics plan, but one of them here is basic.

If you want to have a look at them, to interact with them, you also have a UI where you can again see all these same logs. And basically this is a table where you can use the Kusto Query Language (KQL), which is a custom language that Azure provides, where you can just query your logs. You see this is Kubernetes: you see the pods, the namespace, etc. This is a Go container, and then you see basically what the log entry is here.

This gives a bit of information on how the ingestion part works. So we have our infrastructure, which in our case is the Kubernetes cluster. We use the Logstream API, which is basically an endpoint where you can just push your logs to. And then there is a pipeline on the Azure side where you can again do some filtering and in the end decide where the data should end up.

In our case, we don't do much in Azure because we already did it in FluentBit, but if you would use Azure Container Insights, for example, you could do some filtering here and say, okay, I'm only interested in logs from the kube-system namespace and write them to the kube-system table, or you could split them up and say, okay, the other namespaces, write to a different table. So you're flexible there.

Alright, so that's a bit on the logs. What I mainly want to highlight when using Log Analytics on Azure is that you have these different types of plans. You have analytics logs, which is the default, basic logs, and then auxiliary logs. And all of them have certain restrictions or certain performance features.

I think what's most important for using logs is that the analytics logs are mainly created to be very expressive. You can make very complex queries and join different tables together, which is nice, but most of the time it's not what you need when you're just looking at logs. It's more for when you want to do a real analysis. Our logs are all from one application, so we mainly just want to query one table and then do some nice filtering, some grouping, et cetera on there. So we don't need these analytics tables.

The consequence is that you have fewer capabilities on basic tables, but they're also way cheaper to ingest. There is a factor five difference between ingesting logs in an analytics table and in a basic table. It's good to know these differences, and I think it's important to switch from using analytics tables for logs to using basic tables.

Jonny Daenen (07:29)
And what is this auxiliary column?

Niels Claeys (07:32)
Yeah, so the auxiliary column is another type of table that they also support.

The main problem with that is that the performance when doing queries on these types of tables, that is not that high. It's mainly used for archiving or, I don't know, certain kind of data that you query very rarely. And if you query it, the speed is not that important, then you can use that. It's also, again, quite a bit cheaper to store. But the performance is way worse if you query it.

Jonny Daenen (08:00)
Okay, I see. So basically the analytical method allows you to do more complex queries across different tables, but you pay a lot more. And then the basic version does not have these cross table capabilities, but you can still query your own table. It's a lot cheaper as well. And the auxiliary ones, they're basically not in scope for most of the workloads.

Niels Claeys (08:15)
Yep.

Good summary.

So what is important, what I maybe also want to highlight, is how you can ingest data in these tables. Almost any method can ingest into an analytics table, which is why this is the default. For basic logs, you have some constraints, and they call it here DCR-based custom tables.

So that's basically their new logstream API. That's the one that can ingest data in here. With the old API, you were not able to use basic tables.

So if you use the new ingestion API, you can still choose between analytics or basic tables, depending on what you want.

Alright, so a quick view on what the cost difference is. These are both pay-as-you-go prices; you can also reserve some capacity. You pay roughly 2.5 euros per gigabyte for analytics tables, and for basic logs tables you pay only 50 cents, or 56 cents.

Jonny Daenen (09:15)
Yeah, and so that means only one fifth of the cost. So I would think you would end up with 20% of the cost if you convert everything to this type of logs.

Niels Claeys (09:26)
Yes, so that's roughly the case. There is one caveat, which is that for basic tables you do need to pay if you query them. For analytics tables, that's included in the cost. So if you query basic tables, the query cost gets added on top of the ingestion cost. That's why it's not fully a reduction to paying only 20%. It will be a bit higher, but that depends on how often you use them. I think in the case when you store logs, this is a really interesting model because logs you don't query that often. So that makes it a good model.
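Editor's sketch: the arithmetic behind this caveat can be written down in a few lines of Go. The two ingestion prices are the rough figures quoted in the episode. The per-gigabyte query price and the monthly volumes are assumptions chosen for illustration, so check current Azure pricing before relying on the result.

```go
package main

import "fmt"

// Prices per GB in euros. The ingestion prices are the rough figures quoted
// in the episode; the query price is an assumed placeholder, not a quoted rate.
const (
	analyticsIngestPerGB = 2.50
	basicIngestPerGB     = 0.56
	basicQueryPerGB      = 0.006 // assumption: verify against current pricing
)

// analyticsCost returns the monthly cost of ingesting ingestedGB into an
// Analytics table. Query cost is included in the ingestion price.
func analyticsCost(ingestedGB float64) float64 {
	return ingestedGB * analyticsIngestPerGB
}

// basicCost returns the monthly cost of ingesting ingestedGB into a Basic
// table plus the cost of scanning scannedGB with queries.
func basicCost(ingestedGB, scannedGB float64) float64 {
	return ingestedGB*basicIngestPerGB + scannedGB*basicQueryPerGB
}

func main() {
	ingested := 600.0 // hypothetical GB of logs ingested per month
	scanned := 200.0  // hypothetical GB scanned by queries per month

	a := analyticsCost(ingested)
	b := basicCost(ingested, scanned)
	fmt.Printf("analytics: %.2f EUR, basic: %.2f EUR (%.0f%% of analytics)\n", a, b, 100*b/a)
}
```

The more data your queries scan, the further the basic plan drifts above the 20% mark, which is why the saving depends on how often the logs are actually read.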

So that gives you the high-level idea, and the question then becomes: how can we use these basic logs? I think what's important to note here is that if you use Azure Container Insights, for example, then it will by default create this endpoint, ingest your data, and ingest it in a table which they call ContainerLogV2. So that's the new way Azure would ingest these logs. And you also see that by default the plan is an analytics plan, but you can update this yourself and you can manage this table yourself. And I think you're able to switch this to basic.

This new Azure Container Insights can use both an analytics and a basic table because the schema that they use is compatible with both. So that's a nice thing. Now, in our case, we don't use Container Insights. We use FluentBit, as I already mentioned.

That means that we need to configure these data collection rules and data collection endpoints ourselves. As I mentioned, it's just a pipeline that you configure, but it also means that you can manage your table yourself. And you see this: this is what we call a custom table, which is why the suffix is custom. And you can create this just through Terraform and say that you want the basic plan.

So we have this endpoint and then we can just push metrics using FluentBit. And if you would just look at the FluentBit page on the Log Ingestion API, they describe how you could interact with this. You see this is the major trick. You specify that it's Log Ingestion. You can use a client ID and secret. And then you specify the data collection endpoint. And then this is the data collection rule, and this is the rule that I described before. So if you specify these two, then it will know where it needs to write the data to.

Jonny Daenen (11:53)
And how do the endpoint and the rule collaborate in the ingestion part?

Niels Claeys (11:58)
Well, the endpoint is basically an HTTP endpoint where you receive the data, and actually in the call to the endpoint you also need to specify which individual rule it is for. Here they do some validation that you use the correct rule, because one endpoint can be used by multiple rules. So you can distinguish that. In our case, we always use one-to-one, but you're more flexible.
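Editor's sketch: one way to see how the endpoint and the rule fit together is the upload URL itself. The path layout below follows the publicly documented Azure Monitor Logs Ingestion API, where the rule's immutable ID and the stream name are part of the path on the data collection endpoint. The endpoint host, rule ID, and stream name are hypothetical values, and the `api-version` should be checked against the current documentation.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// ingestionURL builds the upload URL for a data collection endpoint (DCE),
// a data collection rule (DCR) immutable ID, and a stream declared in that rule.
func ingestionURL(endpoint, dcrImmutableID, stream string) string {
	base := strings.TrimRight(endpoint, "/")
	return fmt.Sprintf("%s/dataCollectionRules/%s/streams/%s?api-version=2023-01-01",
		base, url.PathEscape(dcrImmutableID), url.PathEscape(stream))
}

func main() {
	// All three values are made up for illustration.
	u := ingestionURL(
		"https://my-dce.westeurope-1.ingest.monitor.azure.com/",
		"dcr-00000000000000000000000000000000",
		"Custom-AppLogs_CL",
	)
	fmt.Println(u)
}
```

Because the rule ID is part of the request, several rules can share one endpoint and the service can still tell which pipeline a batch belongs to.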

Jonny Daenen (12:16)
I see, okay.

Yeah.

And so what you're showing here is basically the alternative for the thing you showed first. At first you showed that when you have the native Azure setup, they collect your metrics and they also manage that log table for you, which is by default an analytics log. You can swap that to become basic, but then Azure manages it. What you show now is the alternative where you run FluentBit next to your container and push data to the endpoint, and there that custom rule makes sure that this is going to the right log table. And that you manage with Terraform, right? You create that log table, the basic version, with Terraform.

Niels Claeys (12:52)
Yeah, that's correct.

Yes, we manage everything of this with Terraform.

Jonny Daenen (13:01)
Okay. And the FluentBit part, how does that work?

Niels Claeys (13:03)
FluentBit is typically run as a daemon set, which means it runs on the host. That is needed because all the logs for all the containers on one host are in one specific directory on that host machine. So the daemon set can read these, and it will then push all these logs to, in our case, this data collection endpoint.

Jonny Daenen (13:25)
So it is not on the container level; this runs on the node level. Okay.

Niels Claeys (13:29)
Yes, yes, yeah, it always runs on the node.

Alright, for people who want to know something more about FluentBit: basically, FluentBit by itself is also a bit of a pipeline. You specify, okay, what's the input that I use, which in the Kubernetes case basically means: which files am I processing? You can parse these and then you can also filter them. And we use this quite extensively to filter out logs of certain applications that we will not look at in Log Analytics anyway.

So we filter them out here, then the buffering happens, and then you push it to a certain output. And the output is what I showed before. That's the endpoint.
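Editor's sketch: the filter stage described here can be pictured as a function that drops records before they are buffered and sent. This is not FluentBit configuration and not the team's actual filter rules; the `Record` type and the excluded namespace are assumptions for illustration.

```go
package main

import "fmt"

// Record is a simplified log record carrying the Kubernetes metadata
// that a filter stage would typically look at.
type Record struct {
	Namespace string
	Pod       string
	Message   string
}

// dropNamespaces returns only the records whose namespace is not excluded,
// so that unwanted logs never reach the (paid) ingestion step.
func dropNamespaces(records []Record, excluded map[string]bool) []Record {
	kept := make([]Record, 0, len(records))
	for _, r := range records {
		if !excluded[r.Namespace] {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	records := []Record{
		{Namespace: "kube-system", Pod: "coredns-1", Message: "health check ok"},
		{Namespace: "default", Pod: "spark-job-1", Message: "stage 3 finished"},
	}
	// Hypothetical rule: system namespace logs are never inspected, so drop them.
	kept := dropNamespaces(records, map[string]bool{"kube-system": true})
	for _, r := range kept {
		fmt.Printf("%s/%s: %s\n", r.Namespace, r.Pod, r.Message)
	}
}
```

Filtering this early matters for cost because every record that is dropped on the node is a record that is never billed for ingestion.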

Jonny Daenen (14:07)
That's the endpoint you configured.

So you have two parts of filtering. Basically you have the filtering that happens here already at FluentBit level. And then in your rules for that endpoint, you can actually also do some extra filtering or routing to the right log.

Niels Claeys (14:20)
Yeah, but we don't really use that, to be honest, because we have the flexibility here and we're more flexible to filter here. But if you would use only Azure Container Insights, for example, then it might be useful to have the filtering on their side, because you have no flexibility to filter before you ingest.

Jonny Daenen (extra) (14:40)
I had one more question on how the Kubernetes setup works. So you described that we are typically launching containers, and we're collecting these logs from all these containers in FluentBit, which runs on the node. So containers run on the node. How does this log information get to FluentBit?

Niels Claeys (extra) (14:56)
That's a very good question. What is important first: if you run containers, you write all your logs to standard out.

And if you do that then Kubernetes will make sure that the logs of every application are bundled together and they will reside on the node. I can quickly show this. I have a terminal open here.

This is one node. I will exec into the FluentBit component. Why do I do that? This FluentBit is a daemon set on the node, so it has access to certain directories. And of course, one of the directories that it has access to is the log directory, because otherwise it cannot do its job.

And then we see here all the containers that exist on the node, and we see that they have all this append log, and if you would get...

then you would see these are just all the log entries of the Karpenter component that are visible here. And this is basically what FluentBit uses. It just watches this directory for every new append that happens, gets these events in, and then pushes them to, in our case, our data collection endpoint. I can show this. The output you saw before is basically the output that you also see in the standard output from this container itself.

Jonny Daenen (extra) (16:12)
Yeah. And so FluentBit continuously monitors these files, basically does a tail on these logs, and then pipes it batch by batch towards the endpoint.

Niels Claeys (extra) (16:23)
Yeah, that goes a bit back to the slide that I showed you. So indeed it watches the files, every new event that comes in. It buffers a couple of them to not send every individual log line itself. It can buffer them, but it will then send a batch of new events to the endpoint.
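Editor's sketch: the buffering step can be modelled as a small batcher that collects lines and hands them to a send function once a batch is full. FluentBit's real buffering is more involved (it also flushes on a timer and can persist chunks), so treat the `Batcher` type and its size-only trigger as a simplification for illustration.

```go
package main

import "fmt"

// Batcher collects log lines and sends them in groups, so that not every
// individual line results in its own request to the endpoint.
type Batcher struct {
	maxSize int
	pending []string
	send    func(batch []string)
}

// Add appends a line and flushes automatically once the batch is full.
func (b *Batcher) Add(line string) {
	b.pending = append(b.pending, line)
	if len(b.pending) >= b.maxSize {
		b.Flush()
	}
}

// Flush sends whatever is pending; it does nothing when the buffer is empty.
func (b *Batcher) Flush() {
	if len(b.pending) == 0 {
		return
	}
	batch := b.pending
	b.pending = nil
	b.send(batch)
}

func main() {
	b := &Batcher{
		maxSize: 2,
		send: func(batch []string) {
			fmt.Printf("sending %d lines: %v\n", len(batch), batch)
		},
	}
	for _, line := range []string{"starting job", "reading input", "done"} {
		b.Add(line)
	}
	b.Flush() // send the remainder, for example on shutdown or on a timer
}
```

Anything still sitting in the pending buffer when the process dies is lost, which is one concrete way the log loss discussed next can happen.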

Jonny Daenen (extra) (16:41)
Yeah. Is there also a risk of losing some of this log data if something crashes?

Niels Claeys (extra) (16:47)
Well, yeah, if your FluentBit, for example, crashes repeatedly, it might miss certain events, or certain events might be removed. There is also a very famous issue that is still not resolved. And that's basically that if you have certain containers that start up very quickly, earlier than your FluentBit component starts up, then FluentBit will not detect that a component has been running. So let's imagine that my pod is super quick and it does three things or something and it writes two or three entries. If it finishes before FluentBit has correctly started, FluentBit will not see that these logs were generated and it will not push these events. It's sometimes an annoying thing if your FluentBit component for some reason takes a long time to start up.

Jonny Daenen (extra) (17:28)
Okay

Niels Claeys (17:36)
There is one small addendum. We made our own version of the output plugin because we wanted to support Azure workload identity, which basically means we want our containers to have a certain identity, and that was not supported by the default FluentBit plugin. By default, you saw this client ID and client secret, which are static credentials. We don't like to use static credentials. We want to have dynamic credentials that get fetched and get refreshed every hour or every x minutes, depending on what you specify.

Jonny Daenen (18:11)
So that is a workload identity to be used by your FluentBit daemon.

Niels Claeys (18:15)
Yes.

Because it needs to be authenticated, of course, to be able to push the right metrics to this endpoint. The endpoint checks who is the one sending this, so that you, Jonny, for example, cannot send me logs or spam my endpoint. So they validate: is the data that I get here sent by an authorized identity? It was quite a cool thing to do because FluentBit is written in

Jonny Daenen (18:27)
Yes, indeed.

Niels Claeys (18:37)
C.

I cannot write C, but I can write Go. In Go you have this option to expose your Go binary as a C shared library, so that you can basically compile it and it can be called from C. That's also something that FluentBit describes in detail. I had never used it, and it's quite a cool mechanism.

Jonny Daenen (18:49)
Okay.

Yeah, so you basically create your own library and hook it up to the FluentBit software.

Niels Claeys (19:06)
Yeah, so this is basically the interface that they provide. FluentBit requires an output plugin to have certain methods. These are called by C, and these are just what you need to implement yourself. Basically you need to be able to register your plugin, and then you need to be able to process the data. You get the data as input, and you need to be able to convert it and then send the entries to the endpoint, in our case. So if we then look at the detail, we see this is the Azure logs client and we just upload the data.

Jonny Daenen (19:41)
And you basically provided them the authentication mechanism so that you don't have to work with this client secrets mechanism.

Niels Claeys (19:45)
Yep.

Yes, yes, so that's it. We just use the standard Azure Identity SDK and say, OK, we want to load the credentials with the default credential chain that it uses, in the way you configure it. It will check: are there environment variables, or which mechanism should it use? And basically it then gives you credentials, and with these credentials you can fetch a temporary token and use that.

Jonny Daenen (20:13)
So this is quite a lightweight wrapper basically to send your logs if I understand correctly, where you use the SDK and then the sequence it does to fetch your secrets. And in this case, so if you would supply an environment variable, it will pick it up and then the plugin will be used to forward all the logs to the right endpoint.

Niels Claeys (20:25)
Yep.

Yeah, so this is very lightweight because the bulk of the work is basically underneath this in the SDK.

So you see that we set some typical environment variables on the container, which are then used by the SDK to know what is the login.

Jonny Daenen (20:48)
Okay, cool. This is freely available, I assume.

Niels Claeys (20:51)
This is open source, this is freely available. Anyone who doesn't like long-lived credentials being passed to a Kubernetes pod can use this plugin to authenticate.

Jonny Daenen (21:05)
I have a few follow-up questions, Niels. You mentioned FluentBit, and you also mentioned the out-of-the-box way of working that Azure provides. Why would I want to go for the FluentBit implementation? Also considering you had to write your own custom plugin to connect, why would I not just do the swap to basic in the UI and keep that?

Niels Claeys (21:17)
Mm-hmm.

It's a very good question. It depends a bit on your setup. I mentioned already that FluentBit gives us some more flexibility on filtering certain entries, certain logs, before we send them. But one of the main reasons for us is that it consolidates our stack. We run Conveyor on both AWS and Azure, and on both of these setups we now have FluentBit. Otherwise, we would have two different stacks.

Jonny Daenen (21:44)
Okay.

Niels Claeys (21:47)
And then the third reason: we started initially with what was back then not called Container Insights, it was called something different. But the issue that we had with the standard component was that it used quite a lot of memory and quite a lot of CPU for processing these logs, which means that on every machine that we had, we lost quite some resources. That is why we went for FluentBit, which is a very lightweight and very small component that leaves more room on our nodes for our customers to run workloads. So that's why we chose it.

Jonny Daenen (22:24)
Yeah.

Niels Claeys (22:25)
The setup is similar, but it requires 5 or 10 times more memory than FluentBit does.

Jonny Daenen (22:33)
Okay.

So it's quite resource hungry.

So you implement this also as part of Conveyor. Does that benefit all our clients, or do they still need to do something themselves? Because this is quite a deep dive that you need to do to fix your log cost, I can imagine.

Niels Claeys (22:47)
Yeah, so we manage it for our customers. They don't see anything of this. The only thing that they see from this full stack is that at the end they can query and see the logs in our application. The rest is fully abstracted away. I agree that it's quite some work to dive into these things if you are setting up your own stack, and that it's a core feature of a platform team to see which improvements we can make to our setup in order to either reduce costs or improve performance, without the users being impacted.

Jonny Daenen (23:19)
Yeah.

And so in this case, if one of our clients would actually use Conveyor, we solve that issue for them.

Niels Claeys (23:25)
Yes, because most of Conveyor runs in the AWS or Azure account or subscription of the customer and they have full control over their data. So this runs there, which means that all the reduction that we bring in costs they immediately see on their bill.

Jonny Daenen (23:41)
Okay, then I think we can wrap it up. From your side, what is your total view on this?

Niels Claeys (23:46)
Yeah, so in summary, I think it's an important skill of a platform team to look into new developments and changes at the cloud provider, new features that pop up. Be aware of them, integrate them, or look into what they mean for your stack. Can they help me? Can they reduce my cost? Can they improve my efficiency?

And then specifically for the Azure logs, I think it's crucial to look at, okay, what is our logging cost? Do we pay a lot for logs? Then, okay, how can we improve it? Improving can be done first by filtering out certain components, certain logs that we don't need anyway. And secondly, if we log to a Log Analytics workspace, try to focus on using basic log tables, because they are five times cheaper than the analytics tables.

Jonny Daenen (24:35)
Yeah, okay. What I take away from this from a very high level is basically: if you're running a platform, you treat it as a product, so you need to observe your costs anyway. That is part of having a good platform and good hygiene. And in many cases, we see clients that have a very high log cost. Even though logging seems simple, it takes up a lot of your cloud bill.

And so keeping an eye on what is possible there, and some seemingly simple tricks, can reduce your cloud bill by a lot. But also make sure that your technical people know what they're doing. Like you did in this case: you managed to actually change that high cost and reduce it by two thirds.

Niels Claeys (25:11)
Yeah, from my level, that's a perfect summary.

Jonny Daenen (25:14)
Niels, thanks a lot for your explanation. If you can share your repository, we will put it in the description of this video, and other people might be using your FluentBit plugin very soon. Thanks a lot for being here today. Thanks everybody for watching and we'll see you next time. Bye bye!

Niels Claeys (25:32)
Bye bye.