Postgres FM

Nikolay and Michael discuss real-time analytics — what it means, what the options are, and some tips if you're trying to implement it within Postgres.

Here are links to a few things we mentioned:

Loose index scan / skip scan with recursive CTE (wiki)
Zheap (wiki)
cstore_fdw (now part of Citus)
Timescale compression docs
Hydra founders interview (on Postgres TV)
Materialised views episode
pg_ivm
Timescale continuous aggregates docs
Clickhouse
Snowflake
Replication episode
Timescale bottomless storage on S3 (blog post)
pg_partman
Querying Postgres from DuckDB (blog post)
Heap blog (filter by “Engineering”)
Incremental View Maintenance (wiki)
PostgreSQL HyperLogLog
Faster counting (by Joe Nelson on the Citus blog)

------------------------

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:

Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:

Jessie Draws for the amazing artwork

Creators & Guests

Host

Michael Christofides

Founder of pgMustard

Host

Nikolay Samokhvalov

Founder of Postgres AI

What is Postgres FM?

A weekly podcast about all things PostgreSQL

033 Real-time analytics
===

Michael: [00:00:00] Hello and welcome to Postgres fm weekly show about all things Postgres Cure. I am Michael, founder of PG Mustard, and this is my cohost Nikolai, founder of Postgres ai. Hey, Nikolai, what are we talking about today?

Nikolay: Hi Michael. Let's talk about one more controversial topic. Uh, can we run analytical queries in pogs and what to expect from it?

Michael: Yeah, right. So like real time analytics workloads. maybe in, combination with transactional workloads. Sometimes people say hybrid. but it, I think it's a fascinating topic and luckily another listener request. So thank you. And sorry it's taken us a while to get to this one, but, really appreciate it.

Nikolay: Hybrid is, just, term OTP guys invented to, to pull, everything to their side. I mean, it's still ltp, but let's run, aggregate. Queries, in a regular LTP database and try to achieve better performance. Right? Let's try to get rid of, uh, analytical databases like, like vertical or snowflake and so on.

This is H T A [00:01:00] P because it's not, it's, it's not vice versa. It's, it works less softened, right? Like analytical for ltp, it, it usually doesn't work because, uh, latencies, they, they of are awful.

Michael: Well, I was, well, we are diving straight into the deep part of the conversation, but I have seen some, like, I have seen some go after trying to do both, like. Single store and like a few, I was gonna bring this up later on, but I do see some analytical providers like Click House, starting to add features that allow them to do some of the, like, or to, to mitigate some of the, um, trade

Nikolay: do, do we have already updates on deletes?

Michael: I haven't checked.

but I think,

Nikolay: do, do, do they work? Okay? Uh, do they have already good? isolation for, for sessions, I mean, isolation control,

Michael: But this is

Nikolay: snapshot, isolation, and so on.

like analytical data systems. Uh, were not designed to be a i d compatible, uh, if they don't need it, they need to sacrifice some aspects, for, for [00:02:00] the sake of processing huge data volumes.

Right. So I doubt like for example, Verica, if it would be so easy, like I cannot imagine someone runs Verica to handle some like, I dunno, like social media or e-commerce kinds of workloads. I mean user facing workloads when customers expect, , all pages to work below a hundred millisecond. You Right.

Better. One millisecond. On average, it should be below one millisecond, uh, on server side. And I cannot expect like that. Someone can run vertical for that or snowflake. It's like strange. So that's why I say HT a p. It's all, it's all p With attempt to improve performance of analytical queries. You, you don't agree obviously,

Michael: Well, it's inter, I think it's, I think this is a really interesting time, and [00:03:00] maybe that's true at the moment, but I think there's an interesting almost, race between OTP systems, adding, adding the ability to do more analytical work and the more analytical side of things, trying to do more, transactional processing.

And there's, at the moment, most people either have two separate. . So like extracting the data from the transactional side and putting it into the analytical side, or they try and get by as long as possible with the transactional side. And, optimizing for some, maybe a smaller amount of, analytical workload.

But I wanted to start by thinking like, what if you think of real time analytics, like what's. , what do you think that includes? Or what's that the first thing that that makes you think?

Nikolay: my bachelor thesis, uh, in university was about et. and how to, how to, uh, make clean, clean up data to get, to get rid of Azure, to have a unified schema for all data from various [00:04:00] sources. And, then my first job was in the company called Program Bank. We program, uh, software systems for, for banks using Delph and Oracle.

It was interesting time. It was more than 20 years ago already, and I learned that banks, obviously they have process of data load during nights and a lot of processes during weekends, and then I later when I started to fly, a lot Europe to, to US and back and forth. And, uh, I've noticed that, since I have, bank accounts on both continents, I noticed that even 20 years later, some banks still have nightly operations, so you cannot access your bank account during, your daytime in Europe when it's night in the US and so on.

So, when. Here, realtime analytics. I imagine that this should go away, this approach and systems should be accessible always. And even if you need to provide some reports, like monthly reports or [00:05:00] something, it should be done asynchronously, at least or better, like in, in the same system. Like, okay, we have LTP system and we know how to handle, analytical queries.

Which process? Uh, like month, month, Of data. So this is what realtime analytics to me, I, I not, not like, like banks do. Some banks, there are, there are new banks which already don't have this approach, and they follow, good practices.

Michael: Yeah. Well, I mean, some of those sound like they're not even showing you the transactional data, nevermind the kind of aggregated data. But that, for, for me, this, for me, this makes me think of, let's say you're looking at a, software as a service dashboard and you want to look at your, like, total of some number or some average of some number, but it includes Up to like today and up to this minute of information. So you, you might be looking at page views on a website, but it includes the last hour or the last few minutes. And if somebody visits your website now, they start showing up in that dashboard as well. It's not, there's not [00:06:00] really a delay.

Nikolay: There is no delay. And it also, it doesn't go down view count unless some bots are involved. And also if you open two tabs of the same page, uh, you have. , at least similar numbers, right? You have the joke. Uh, how many like, uh, Mongo GB reached 1 million downloads, or according to, uh, one of our notes, uh, 9 99 downloads, and according to the other one, 1,000,002 downloads, right?

You know this right? And

Michael: I hadn't heard that joke before. No.

Nikolay: eventual consistency joke. So yes. I agree. And, this is historically a very difficult task for Row Store as pogs and with Mvcc, implementation POGS has with all, uh, that two, that tu Polar top, I forgot. Uh, Processing, using vacuum process and, blood issues and index simplification and so on.

So, historically, posts is known for very [00:07:00] slow. Select count, star, or select count one. Doesn't matter in this case. but it was improved. some aggregate queries can benefit from, index only scans. Some can benefit from. Loosen the scans, but unfortunately, POCUS doesn't have it. automatically loosen index scan.

You need to program it using the CAR cte. So you need to tell POCUS how to skip, uh, certain, sub trees in b3. but, many aggregate queries can be improved. So it's, it's possible to achieve with pores,

Michael: Yeah, you bring up a good point about, I mean, all the design decisions that make Postgres so good at concurrently doing inserts, updates, the leads, the design decisions that make that really efficient. Make running queries on, let's say one column across all customers. If you're trying to sum those or average those or do some aggregation, it fun is fundamentally difficult.

Whereas a lot of these, an analytical systems are designed for the exact opposite, and [00:08:00] they, and as such, we've got a hard, almost a hard trade off here, right? Like I don't see, there's not an obvious, way of having it having the best of both worlds.

Nikolay: It's possible if storage engines and there is. Work for many years already. And it looks like, some sometime soon. We will have some. There is, there are projects. I always mix it z store that heap. I don't remember attempts to have plugable storage, engine engine, which will provide, column store There is also, project from CTOs, um, called, called C Store, maybe like column store, but it was using F DW. . So it's not like native, storage. Uh, like, it's not like in my, in, in my sequel when you say, this table needs to use this, uh, storage engine and, and that's it. But in general, decision, strategic decision and POCUS project was made to have, multiple storage.

Engines like to have API for that and so on. So in future we can expect that it'll be easy to define some, table as, column store storage [00:09:00] based. Let's, let's talk about differences. You, you raised a good question about when. One works better than the other. If we have, uh, all gp, we usually need to extract, uh, user data with all columns of that user, including like email, last name, and so on and so on.

And of course, if, these columns are stored in, in, in, ideally in the same data page, uh, post has usually like eight kilobyte data pages. It's, it's just one buffer read or. Speaking of buffers, right? Our favorite topic. But cone store, like click house, they store, uh, each column of table in separate file.

In this case, you need to deal with many files and then, uh, extract this data and then aggregate it, uh, in memory to, to provide the role. but this is how OTP queries dealing with single, uh, role. If you have puration, like 25 rows ordered by some, column, it, it becomes [00:10:00] less efficient to, to extract the data.

But, but for queries, like let's calculate some of, some, orders or some aggregates for. you deal just with single file. And, uh, if it's stored historically, and you need just one month, you deal with fewer pages. It's, it's all about IO as usual. Right. And the amount of memory you need to allocate and, and work with.

Michael: Yep. And while we're, while we're talking about systems, Allow for you to pick your trade off per on, maybe on a per table basis. we've also, I think, probably should mention that there are extensions attempting to do this kind of thing as well. And the most obvious one being timescale of doing some level of, it's not exactly column store, I don't think, but it ba I think it works the same way.

Uh, but it's very clever. But the, the, the other thing that they make take advantage of and a lot of the other analytical.

Nikolay: You mean compression, uh, in timescale four times series, which, uh, understands like, [00:11:00] which uses the knowledge about what stores in column and, and applies is to compress, right?

Michael: Yeah, but that, and that's exactly what I was gonna say, because data within a column, is very likely to be similar across rows. Um, it compresses extremely well. And another thing that can reduce IO for those analytical queries is, is that there's just less.

Nikolay: I encourage everyone to read their articles, especially one, uh, explaining details about compression approaches, uh, for time series and timescale. I read it from time to time and recommend, keep recommending. It's, it's brilliant article and in general, work is brilliant for, for this, uh, um, types of data.

Michael: Yep.

Nikolay: But, wanted to mention also Hydra, which is a very. project, which is aiming to build analytical type of, pogs like POGS flavor. And they also work in, in this area. they took CTOs, this store, [00:12:00] uh, for Com store, like via f, DW and, and improve and so on. And we had a episode on post TV interview with founders.

it was one, one month ago or so. Uh, but it's still getting a lot of good feedback. They shared feedback. They get, get personal feedback that that's great. I mean, it's interesting project and I encourage also checking out. This interview. So, yeah. by default, I think pe old people sh should start from pocus, right?

And then we don't want to be like banks, right? We don't want to have ETL every night. It, it doesn't scale well, this process, I mean, it, at some point it'll take more and more hours and, uh, you don't have so, so much time. And like it's, it's hard to manage this process and also customers. Because if they want after date for no, you need to wait until tomorrow to get this info.

No, it's not good. And customers can be internal in this case because, for example, business owner might want to see some reports right [00:13:00] now, but we would tell return tomorrow. No, no. So, We start with pogs and then we have, two options basically. First is to take something for analytical, and this is our default option still.

otp, pogs. DBAs usually recommend and I usually do the same. don't use POGS for analytics. This is mantra actually. We, we use this mantra all the time because the problem is not only, it's less efficient for heavy. queries with aggregates. It also, uh, affects your health, health of your pos. because if you run it on the primary node, if you have very long query lasting half an hour, for example, and during that times other queries deleted in any table, they deleted that topples Auto cannot delete these topples until your long query finishes because you hold xin Horizon.

Michael: So yeah, we've talked about this a few times and I think this is, this might be the one time I disagree with you [00:14:00] because we're talking about realtime analytics here. I think if we're talking about. Things that update, that a user facing people are gonna have problems way before half an hour in terms of user experience.

So I think we're talking, more in the region of the customers I see doing, uh, this kind of optimization work within Postgres, it tends to be they're trying to get something to load within. Ideally, you know, a few hundred milliseconds or five, five seconds, kind of 10 seconds kind of upper limit of how long somebody will wait for a a, maybe if they're changing some dropdowns in a, in a ui, and then they want to refresh their data.

If we're talking about that being able to finish within a few seconds, I, I think that can still live within Postgres for a while. Maybe it doesn't scale forever, but, it's smaller at smaller sizes. I definitely

Nikolay: You are right. You are right. But only until some point if, uh, you, okay, so you, you basically say realtime analytics means that we have a requirement to, uh, to have statement time out, 15 seconds, for example, for [00:15:00] all, all user facing queries. That's, that's great. I agree. But, to achieve that, you probably will.

Face, challenges, uh, when you have, select count or select some or anything, uh, lasting to prolong, and then you will implement some generalization, uh, synchronous via some, uh, queuing mechanism through Kafka or something, to pre-process some data, for example, on hourly basis. And those queries you won't. right? And those queries will your horizon, back backend, how to say them, how to, how to name them. Backend queries or something, not user facing. Some, batch processing queries, something like that. And, to keep everything in ous, right, you need to, to have some pre-processing, asynchronous and parallel cured and so on.

Michael: Uh, and we've talked about like, uh, we should probably give a shoutout to a previous episode we did on, materialized views, which are one kind of pre-processing tool. probably gonna give them a couple more shoutouts in this episode, so

Nikolay: PG I [00:16:00] v m

Michael: But I was gonna, I was gonna say timescale actually with their, uh,

Nikolay: Oh, timescale is great. It is great. Timescale is great Again,

Yes,

Michael: the continuous aggregates are kind of this pre-processing, right? Where you have this continuous aggregation of new data coming in, and then that getting filed away. So I think there are some ways of avoiding what you're talking about, but yeah, only up to a

Nikolay: Continuous segregates and timescale, and this is a part of their open source offering. So it's That's great. And I, I know cases, we are, it works very good. I also need to, I, I, I made a joke today about like, someone from Timescale offered me to register in Timescale Cloud. First of all, of course, I'm registered there and, uh, I need to say that I, I, I said, no, uh, don't go there because, uh, there is no.

in Timescale Cloud. It was a job cause about their bottomless feature. We, we probably also discussed it, of course, go and register on Timescale Cloud. It's, it's great. A lot of, very [00:17:00] new approaches, uh, very interesting approaches. and I think the timescale is one of the companies who, which drive, positive ecosystem forward, using innovation.

That's great. So we, we, if we want to keep everything in pocus, we will have these type of challenges along the line, running queries. Sometimes we can avoid them if we choose smaller batches. We, if we, process this, if we implement this asynchronous process, uh, more often, every minute, for example, but also worth mentioning, it's hard to engineer it because POCUS doesn't.

Uh, what Oracle provides, uh, autonomous pragma, pragma autonomous, something like you. I want to detach my, auxiliary transaction for background processing. For example, we inserted some data and we want to implement some count. We would like to do it as synchronously, not blocking them main insert, right?

Or main. And to do that, you need to engineer it using some, some Kafka or some, I don't know, [00:18:00] like, uh, reim queue. Cannot recommend Reim Queue or Sidekick or something, some queuing. By the way, there is a new project, released a couple of days ago, uh, which should consi is considered as a replacement for p gq.

Uh, but it's in our story Q in pogs. We, we want, we need to discuss it one day. But if you manage to engineer it, probably you can, can keep your transactions below some duration, but if you don't, you have a trouble and you cannot even offload some long running selects. To, to stand by node because either you need to be a single user on that standby node because others, other users will, will notice long legs, big legs when you, your query lost or you need to turn on hosted by feedback, report the primary and have all the issues with bloat again.

So what to do, back to my idea that we have two options. One option [00:19:00] is find a way to remain in positive using hasam. optimization. Some this store, approaches some, some innovation or to use some, uh, regular approach, but with continuous ETL basically, which is called cdc change data capture.

So you take some something, you can take for example, click house and use it as a secondary database right from your applic. It's kind of, there might be problems with, consistency, because probably POCUS knows that you, you made thousand or, uh, sales today, but Click House thinks, uh, it's slightly less or, or more, usually POCUS is considered as a primary source of truth being ultimate system.

And then we have logical. Replication to Snowflake, Verica or, or Creek house or anything else. There are ways to do it, but you need to deal with logical replication. This, this is challenging. , it's possible, but it's challenging. I must admit it's improving over every year. [00:20:00] POGS receives, uh, improvements in this area and there are tools, uh, including commercial tools, open source division, but there are commercial tools I cannot recommend.

Click the, their implementation click is bad. Former opportunity. Opportunity, I think, right? It's just wrong implementation of POS plugin. But for example, um, five trend is quite good if you consider commercial systems.

Michael: And we, we have a whole episode on this, right?

Nikolay: right, it's, it's

Michael: on replication

Nikolay: different story. And so the, the, our option is realtime analytics is to improve queries.

Let's talk about the ways to improve.

Michael: well, and, Another thing I was gonna mention that I think is important here is generally once we have performance issues or once we have a decent amount of data, we're doing analytics, we probably are in the realm of needing or wanting to use partitioning of some description.

Nikolay: partitioning is huge. partitioning is must have if you, if [00:21:00] you need to deal with big data volumes like ru rule of thumb is you need partitioning. If you have for a table which exceeds, for example, uh, 100 gigabytes,

Michael: Yeah, which is a lot, but also not that much these days. Right. It's

Nikolay: And petitioning is needed in many, many, many aspects, uh, including data ology.

So all data lives in particular pages, data blocks on disc, and, they are all frozen, very rarely attached. So tobacco just keeps them very quickly and also indexes, for those old partitions are in like almost frozen shape. efficient, maybe rebuild, not degrading. And they, these pages are barely present in our buffer pool and, uh, operation system cash because if everything is mixed and, each page has a couple of new tops and a lot of old topples, it blows your shell buffer.

This is one of [00:22:00] the biggest reasons of partitioning of, to have partitioning.

I just recently revisited this topic, uh, several times with a couple of customers and, uh, of also like in pocus, 14, I think. Yeah. In 14 Reindex, concurrently and index concurrently, they got optimization not to hold the screen horizon.

It was so great. So you reindex and even if you have, uh, terabyte. Five terabyte size table. Uh, you don't, you can perform index maintenance because we all need, uh, rebuild indexes from time to time. Even if it's not ous. In all TP systems, they degrade their health degrades. So, you hold, uh, you don't hold gi spin horizon, so it doesn't matter if, if it takes hours, right?

But in, uh, it was a tre. And then 14.4, uh, it was reverted. So it means if you have huge table, you need to rebuild, index, holding screen horizon. It, it's the same problem as we just [00:23:00] discussed with, inability of, of vacuum to delete tops, which became that recently. Right. So, so that's why you need to, to, to have partitioning and, and everything is improves, including index behavior and, uh, and so on.

Yeah. Sorry.

Michael: I think

Nikolay: long speech.

Michael: yeah, well, I think there's another good reason and that's, I think it encourages good practice around maintenance and around retention. I think a lot of people before they have partitioning, don't even consider. Deleting old data or, or, or how long they should be holding certain data for.

And a lot of the time it is partitioned on time or at least one aspect of it as time. So I think it does encourage you to think about how many of these do we need to keep around and how easily can we get rid of them? That kind of thing.

But Yeah.

maybe not. So analytics.

Nikolay: here again, the time scale pops up. Like they, they do do very good job to achieve h HTP for POS four time series because again, this bottom list I mentioned, this is how executive, how they, uh, [00:24:00] implemented it. And I like you. I like it. So for partitions with archived data, you, you can choose s3.

Right. And, and, and for hard partitions, uh, it's on, on e bs volume. In a s for example, if it's a s and, uh, for your application, it's, everything is kind of transparent. So, so you, you don't notice it. Of course. Uh, archive data is slower. , and everything is, you can see it even if even explaining lies that, uh, S3 was, was, uh, involved.

Michael: Really? That's cool. I haven't seen that.

Nikolay: yeah. It's a fresh, feature, I think bottomless. Uh, and, but this works only in cloud, unfortunately, as I understand.

Michael: okay. Yeah, makes sense. that's a good point. That they automate the partitioning for you or, uh, once you choose certain things. But the other, the normal way of doing this norm, like the place that's the, the extension that seems to be most popular in, uh, is PD department. Right. The,

Nikolay: [00:25:00] Yeah, I, I, I, I must confess, I never use it.

Michael: yeah.

Nikolay: lot of folks around did it. But I, I don't, I didn't I, but I implemented partitioning myself before partitioning became declarative. So I, I , like I have, uh, some experience with partitioning kind of due, declare partitioning as well. Uh, but, well with partitioning the general advice you need to, to choose.

As fresh measure version of POS as possible because again, a lot of progress was made during last five to seven years. ability to detach position concurrently and so on. in, in not non-blocking way, but I wanted also to mention that the same thing as, timescale did, you probably can do yourself using F DW if you, convert your old partitions to foreign tables using POCUS F dw, for example, you can store them on remote server with slower disc and so on,

Michael: Yeah, you've just reminded me actually of a whole different, it's a different slant on that, but did [00:26:00] you read The Post by Duck db? they're an analytics database and they've done some incredible work. I didn't really understand it, but you can now query Postgre, like you can have your data in Postgre, but query it through Duct db, like, kind of like an F D W.

but in reverse, and get some ridiculous performance improvements on, on the same queries.

Nikolay: But we, like we, when we use F dw, I, I'm afraid we will break our requirement. We, we won't meet our requirement to be below 15 seconds at some point, at some data volumes. I, I'm afraid my gut tells me, you know, but maybe I'm wrong. Uh, I, I, I actually. Don't follow all new stuff happening in Poal ecosystem.

It was so like at least five years ago, uh, uh, definitely 10 or 15 years ago. Right now I'm feeling I cannot catch up with everything. It's too much. Too many things are happening.

Michael: I highly recommend this post product. I will, I'll link [00:27:00] to it as well. but they, some people I really respect us singing their praises at the moment, so I'm, uh, looking forward to understanding that one a bit better. And the oth the other one I wanted to make sure we mentioned. What I don't feel like we should talk about Postgres and realtime analytics without mentioning heap.

They do quite a lot. blog post related to Postgres, and I think, I could be wrong, I could be a bit out of date here. I think they're running Heap, which isn't analytics product. pretty much exclusively I think on Postgres May, maybe with Citus, I think.

Nikolay: I, I didn't see fresh articles from them. but like, I think it was four or five years ago when I, I learned, and I remember they used Zfs with ARC two with, uh, data stored on, uh, EBS volumes and with local disks and local N V nv. Like, it was time when, nitro, in AWS was not released, I think.

And, EBS volumes were much slower than local. N v [00:28:00] e disks, which are ef al. So if machinery boots probably you, you lose data there. So what they did, as I remember they did, of course partitioning was involved. I, I'm quite sure a lot of Node and I think also Shing was involved. I, I, I, right. But, uh, they store data on base volumes quite cheaper.

And ARC two cas the first, uh, Uh, was, was on, uh, local, in Vietnamese. So it's like interesting system, but, I, I never tried to implement it myself. This approach with Zfa,

despite of the fact that in, in my product database, lab engine we use is the first, but we use only regular arc in memory.

like alternative to operation system cash. never played with Arc two. . I, I cannot tell that this is, uh, works Well, maybe, but it's an interesting idea.

Michael: yeah. is there, um, I, I suspect we could talk about this for a quite a while. is there anything else you wanted to make sure we

Nikolay: Well, there's [00:29:00] also, uh, materialized views are very important topic, and this is also not fully solved, so, I think engineers should know what poss can do, right? And provide out of, out of the box. And what poss cannot do in this area of realtime analytics, you should be ready for degradation and you should experiment with larger volumes.

Definitely you should tune auto vacuum to be quite aggressive to keep, uh, visibility maps up to date. and this is must-have thing, so. Your benefit from index only scans. For example, you loaded data, you account once visibility map is fresh and your select count or, or some select with some aggregate works quite fast because it, it doesn't consult to.

Hip pages because, uh, hip fess is zero because visibility maps are very fresh, but then a lot of updates are deletes are coming and it degrades. And if auto vacuum is not aggressive enough, by default it's not at all. Your [00:30:00] will see hip fetches in plants growing and performance degree degrading. So you need to tune not to work, but three things, poss doesn't have, uh, one, I mentioned this like asynchronously detach.

Sessions with transactions, which will help us to normalize data, right? Second partitioning is not fully automated in terms so, uh, nobody will create partitions, uh, for you automatically, right? Or, or all prunes old one. You need to engineer additional pieces here. Of course, let's where, where, uh, timescale shines again, because it

Michael: or use. Or use PG parman

Nikolay: Right. So, right, right, right. And third thing is, uh, that, again, ized, you can be refreshed only as a whole. It can be done concurrently. So ongoing queries are working, but it's, it can be refreshed as a whole and it also can get bloat and so on and so on. There is no incremental, incrementally refreshed materialized use.

[00:31:00] This idea is one of the hugest ideas. Biggest ideas pog, uh, project has, and, uh, there are projects around it. There is Wiki page that discusses how to implement yourself. It's possible. But some engineering effort is needed, definitely. but if you manage to have incrementally updated motorized use and a synchronous, a synchronous workers, which, uh, normalize your data and also implement automatic partitioning, you can achieve real time on analytics imposs, uh, at at scale few terabytes and or dozens of terabytes at least.

Also, I forgot, I wanted to mention one extension. I think it was developed at site was back in the days before Microsoft as well, called hyper log log.

It helps with count distinct, right? Hyper log log, right?

Michael: It's like one of the options for providing a good estimate for it. Right.

Nikolay: Yes. also like I never use it myself in my projects, but I recommended some folks used it. [00:32:00] it doesn't solve all problems, but in some cases it, it, it's useful.

Michael: Yeah, there's an excellent post on the Citus blog that we actually linked to within PG mustard for when they, when we spot slow counts. They did a, a really good post listing, a bunch of ways that you can speed up counts, whether that, whether you need an exact number or whe when an estimate can be, uh, good enough.

Nikolay: Yeah, I remember good old trick of estimates if your auto vacuum is, tuned aggressively. So auto, like it has, uh, analyzed part to maintain raw row statistics and if, uh, uh, so, you know, implement some function. if you provide some query instead of returning data, it returns. raw estimate from it, extracts it and say, okay, raw estimate roughly is this, but you, you need to be friends with your statistics and auto vacuum to rely on this information.

Sometimes it's very off, depending on, on filters and so on, but, okay. I think despite, a lot of name [00:33:00] dropping, , right? We, provided some principles as well, what to expect. And, uh, I think, uh, definitely it's achievable at the scale of like, Terabytes are dozens of terabytes and even thousands or maybe dozens of thousands of transactions it's possible to handle.

But it's still not as good as, uh, performance of, of column store for agates. And unless you use one of the projects we mentioned.

Michael: Yeah, exactly. There is a limit, and if you are, if you're building an analytics product where people need. Analytics is their main thing. There's a good chance that you won't want to use Postgres for that data. Right.

Nikolay: Right, right.

Michael: But yeah. Awesome. Thank you so much Nikola. Thanks everybody. Thanks for the suggestion as well.

keep those coming and

Nikolay: No, no, no. We, we have enough, actually.

Michael: Oh,

Yeah. We have

a lot,

Nikolay: let's

pause,

a little bit. We need to process what we have I'm joking. I'm joking. I'm joking.

Michael: Well, take care. Have a good one.[00:34:00]

Nikolay: Bye-bye. Thank you.

Real-time analytics

Real-time analytics Real-time analytics

More episodes

Real-time analytics

Real-time analytics

Chapters

Creators & Guests

What is Postgres FM?