A weekly podcast about all things PostgreSQL
Michael: Hello, and welcome to
Postgres.FM, a weekly show about
all things PostgreSQL.
I am Michael, founder of pgMustard,
and I'm joined, as usual,
by Nikolay, founder of Postgres.AI.
Hey, Nikolay.
Nikolay: Hi, Michael.
How are you?
Michael: No, not falling for it.
Today, we are delighted to be joined
by 2 guests.
Firstly, Joe from Hydra.
Welcome, Joe.
Joe: Thank you.
Michael: And also Jelte from MotherDuck.
Welcome, Jelte.
Jelte: Hi.
Hi all.
Michael: So thanks so much both
for joining us.
We are going to be discussing a
new-ish project called pg_duckdb,
which, very interestingly, is a
cross-organization collaboration.
So Jelte, I saw the recording
of your excellent lightning
talk introducing pg_duckdb at PGConf
EU. Nice job.
Perhaps you could start us off
with how you describe pg_duckdb
to a Postgres savvy audience who
may or may not know what DuckDB
is.
Jelte: All right, so it's an extension
for Postgres and it embeds
DuckDB.
And DuckDB is an in-process analytics
database, a SQL database.
So it's a bit like SQLite, but
more for analytics: big queries
with lots of joins and stuff like that.
It's less like SQLite, or like Postgres
in general, which are good for smaller
datasets, like getting a hundred rows
or something like that with index scans,
stuff like that.
Indexes are not something that's
commonly used in DuckDB land.
It's like, oh, we just scan the
whole table, and we compute stuff
very fast on it.
Yeah, and that is then embedded
into Postgres so that you can
use it for some of your heavier
queries that don't rely
on indexes.
Michael: Nice, and I'm sure we're
going to come back to indexes
in a bit once we dive into the
details.
But Joe, I think the origin story
of this project is super interesting.
I'd be fascinated.
I was reading the MotherDuck blog
post and it mentioned that
Hydra kicked off the effort.
So I'd love for you to share a
little bit more about that if
you can.
Joe: Sure, Sure.
So our team had been working on
a crazy weekend project that
we initially were calling pg_quack.
It was called pg_quack because
well, ducks go quack, but also
because we didn't necessarily expect
this to work.
We kind of were writing it off
as a quacky project.
So it was a little bit of an internal
joke.
But the performance of DuckDB was
looking super great when it
came to all kinds of analytics
queries and aggregates like Jelte
mentioned.
And we had this aha moment that
we could actually take pg_quack
pretty seriously and do it for
real in collaboration with folks
that know a whole lot about DuckDB.
You know, we're Postgres guys by
background mostly.
So not to get too far ahead, but
we decided to build pg_duckdb
for real as a production-grade
extension together with the folks
over at DuckDB Labs, the creators
of DuckDB, which is Hannes
and Mark, as well as the folks
at MotherDuck, like Jelte.
So we've been working together
with them on it, oh, for at least
the last 6, 7 months now.
Michael: Yeah, super cool.
And I saw a few other organizations
mentioned as well.
How do they all fit in?
Joe: Yeah, so this is open source,
it's an MIT licensed extension.
So for folks that want to participate,
or can see pg_duckdb being
really helpful in their use cases,
they can either tell us directly
what they'd like to add or open
a PR or open issues.
I mean, certainly there are a good
number of those as the project
is currently in beta.
So be gentle with it, but also
tell us the things that could
be improved on.
One of the things that was surprising
to me was the immediate reaction
we saw to how DuckDB could
be useful for accelerating analytics
queries in Postgres.
And there were a lot of organizations
that went, oh, yes, okay,
we want to support this, we want
to get involved.
So those are probably some of the
names you were seeing.
Michael: Yeah, I think I saw, was
it Neon and Microsoft? Maybe
some others as well.
Joe: Yeah, there's a whole range
of folks that
are interested in us moving this
forward, and they're providing
either direct or indirect support
for the project.
Michael: Yes.
Jelte, from your side, what are
you... I should get both of your thoughts
on this, but what are
you seeing early users most excited
about?
Like what features are they most
keen to see added or improved?
Jelte: I think one of the big things
is that you can sort of read
Parquet files or read CSV files
from blob storage, like from
S3 or from Azure, from all the
big clouds basically, and Cloudflare
as well.
And I think that's a big feature that people are very interested
in.
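
As a rough illustration of what Jelte describes, reading a Parquet file from S3 with pg_duckdb looks roughly like this. Bucket, file, and column names are made up, S3 credentials still need to be configured, and, as discussed later in the episode, current versions need an explicit column list:

    -- Hypothetical bucket and columns; the column definition list reflects
    -- the current requirement to spell out names and types.
    SELECT count(*)
    FROM read_parquet('s3://my-bucket/events/2024-12.parquet')
         AS (event_id bigint, user_id bigint, created_at timestamptz);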
And I mean, there are so many open issues of people wanting a specific
thing that it's hard to pick one thing that people want.
But one that has come up a few times is support for spatial stuff,
like PostGIS things.
That works slightly differently in DuckDB, so we haven't started
on that; we're first building out a bit more of the base.
But that's definitely one of the features that comes up: people
wanting to do spatial data, basically.
Yeah.
But that's not supported right now.
Nikolay: Yeah.
So, one of the biggest features is that data can be on, I mean,
it's not a feature, it's like a requirement here.
Data is on object storage like S3, right?
In Parquet file format, for example, and it's already
column store.
Which is unlike Postgres, right?
But this sounds like, I remember some attempts to have access
to Parquet files on object storage as an extension before pg_duckdb,
right?
And it could also be done the FDW way, the foreign data
wrapper way.
I think that was just some SQL interface over those files and that's it.
But DuckDB brings something else.
I've never tried it myself, but I've heard many opinions about very
good performance.
Right?
Jelte: Yeah, yeah.
So, that's one of the big differences between pg_duckdb and some
of the other foreign data wrapper based ones.
We sort of flip it around a bit: instead of saying,
oh, this table is stored somewhere else,
we'll read that using DuckDB or something else,
we're saying, okay, we push the whole query
into DuckDB, and then let DuckDB read the Postgres data.
So Postgres sort of becomes a foreign data wrapper
inside DuckDB.
Does that make sense?
So it's more of a query stealing approach, is what Mario from
Hydra, one of the developers there, called it.
Instead of letting the Postgres execution engine do
a lot of things, we don't want the Postgres execution
engine to do a lot of things, because the DuckDB execution engine
is much better for this analytic stuff, like the sorting and
the grouping and the joining.
We want to do as much as possible inside DuckDB.
Nikolay: Yeah, that's interesting.
Joe: Yeah, and also, you know, kind of back to Michael's question
a little bit too, A really common pattern for a lot of application
developers is to have Postgres, but also to have an S3 bucket.
There's a lot of event data that can get generated that really
should not live on disk in Postgres.
We were talking to a company yesterday that has 2.6 terabytes
of event data that they really shouldn't necessarily
be storing in Postgres, but they could offload it into
Parquet format in S3.
But that data still needs to be accessible from Postgres.
Because one of the important things is that pg_duckdb can perform
the join between a heap table in Postgres and the Parquet
files in object storage, so that there's no siloing or need
for ETL processes between Postgres, external analytics databases,
and S3.
So how do you smooth over the differences between columnar format
and row-based format?
And I think pg_duckdb goes a long way to do that because ultimately
when you're developing applications, thinking about table format
is pretty low level.
So if you can avoid that as much as possible, that's probably
just easier.
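
A sketch of the kind of join Joe describes, combining a regular heap table with archived Parquet data in S3; the table, bucket, and column names here are hypothetical:

    -- Join live Postgres data with archived events stored as Parquet in S3.
    SELECT u.name, count(*) AS events
    FROM users u
    JOIN read_parquet('s3://my-bucket/events/archive.parquet')
         AS e(user_id bigint)
      ON e.user_id = u.id
    GROUP BY u.name;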
Nikolay: Yeah, but here I'm confused a little bit, because this
thought is like, let's have tiering of storage.
And, for example, I remember several cases, big companies where
this was needed, and the decision was made for scaling purposes
of the main clusters: we need to offload archive data somewhere else,
maybe to a different cluster or somewhere else.
Maybe sometimes we just delete it.
It could be, for example, e-commerce and old orders, order history
exceeding, for example, a couple of years; we don't need that.
But offloading this data to S3, this sounds great, and I remember
there's also the extension pg_tier from Tembo, again I haven't tried
it yet, it's on my to-do list.
But offloading that, I'm thinking, if I also get column storage
for this kind of data, is it right?
Because I still need only one row in these cases, right?
In some cases, I do need aggregates, like, I don't know,
time series, those I would love to offload to S3 in column store format,
but not in these cases when I just need one row.
Joe: Yeah, I mean, I think it's different needs for different
users.
If you are a, you know, application developer just trying to
build something quickly, then if we can tell you, hey, dump a
lot of your event data into Parquet format in S3, or archive
it out of Postgres on some increment every 7 days, every 30 days
or something, then that's easy enough.
Then maybe the thinking about columnar versus row should subside.
But yeah, I mean, data tiering and a data lake are really not
that different.
It's just a function of what the
user interacts with.
Nikolay: Right, but my concern
is if I need the row store, like
I need to find all columns for
one row to present it.
For example, again, orders, right?
I need one order.
If we pull it from archive, I expect
performance to not be good.
But if it's column store, I would
be concerned about the performance
of collecting the whole row versus
row storage.
Let me formulate the question differently.
Let's talk about the use cases
for this particular pattern,
where we need to offload a lot of
data to S3 and benefit from column
store, analytical queries like
aggregates.
What kind of data could that be
in an existing Postgres database?
It could be SaaS or something, I don't
know; what kind of data
is beneficial for us to offload
to this?
Jelte: A very common case is
time series data.
I think that's by far one of the most common
cases, where it's like, okay, at
least the old data we store there.
I mean, you're not going
to look up a single row in
some series of events.
Or if you do, if it's like a week
old, it's not so terrible; you
don't look that up too often.
So it's like, okay, it's fine
if that takes a little bit longer,
even if you fetch a bit
more data than you actually
need, if that means that you can do these
aggregates much quicker.
I think that's one of the most common
cases, I would say.
Nikolay: Yeah, that makes sense.
And that means that, for example,
Timescale Cloud, they have
it, bottomless Timescale, offloading
some data transparently, so
we don't see it, but some
archive data goes to S3. But
this is not open source.
pg_duckdb is open source, which I like.
You mentioned the MIT license, that's
great, but I see some kind
of, not competition, but an alternative
here.
I can choose between going to Timescale
Cloud if I need this,
or just using this if I can.
I mean, if I can install the extension,
which is not available
in managed services yet.
But yeah, it makes sense to me.
Indeed, especially if data is old
and I expect it to go to object
storage, I'm usually not interested in a
single row, I'm interested in aggregates.
Right?
Yeah, this makes sense to me.
Cool.
Michael: That's a good question
though.
Are there any, I'm sure you're
talking to lots, but are there
any Postgres managed services that
already offer it in a preview
or developer mode?
Jelte: I mean, Hydra is offering
it, but other than that, not
yet.
But yeah, we're definitely in talks
with a bunch of them.
Yeah, it's a project that many
people are interested in.
Michael: Yeah, nice.
Joe: Yeah, so Hydra currently has
a closed beta of a fully managed
Postgres service that has the pg_duckdb
extension installed.
Nikolay: Nice.
Let's talk about some numbers,
if you can pull them from the top
of your head.
For example, say I need to aggregate
over many millions of rows which
are sparsely distributed.
We know that in the worst case,
if I touch, I don't
know, 1 million rows, each
one stored in a separate buffer,
then very roughly I need to touch
1 million buffers, I mean pages, right?
And in the worst case, it's not
hits, it's reads: I need to get
1 million pages from disk.
And in the absolutely worst case,
I can imagine maybe 1
millisecond for each if latency
is bad, and if we do it without
parallelization, that's 1,000
seconds, which is terrible, right? And
this is what I can see in Postgres
in some cases. What should I expect
here if I need, for example,
to calculate a sum over
1 million rows stored in this
format?
Is it seconds, minutes?
Jelte: You mean if it's sort of
blob storage?
Like how many rows?
Nikolay: 1 million rows and some
simple aggregate like count or sum
or something.
What do you see?
Jelte: I would say it's generally
within a second, I would say,
or within a few seconds at least.
Nikolay: Including data transfer
from S3, everything, right?
Jelte: Yeah, yes.
So because it's Parquet format,
you only get the columns that
you actually requested.
So even if it's a lot of data
overall, just because you have a bunch
of columns, you only grab the data from
S3 that you actually need.
And that is then sort of streamed
through the DuckDB engine to
compute these aggregates.
Nikolay: Yeah, that's, that's super
cool.
Jelte: But I mean, like with
all these questions, it always
depends a bit on what
you're doing exactly.
But for instance, there are ClickBench
results.
ClickHouse has a benchmark, and
there pg_duckdb is quite...
From the Postgres-compatible ones,
I don't think it's at the top.
The one with MotherDuck support is
almost at the top, I think, but
the one with Parquet is a bit
lower, though still much higher
than tuned Postgres.
I don't know by heart, but I think
it's at least 10 times faster
than the tuned Postgres version,
but maybe even more.
Nikolay: Yeah, that's great.
And I can imagine this opens possibilities
to stop saying don't
do analytics in Postgres.
So this can be a transaction which
does aggregates in Postgres
if it's within seconds.
We don't have negative consequences
of long-running transactions
anymore, right?
It's not only about user experience.
It's also about how Postgres performs
vacuuming, still, and so
on.
This xmin horizon problem and so
on, right?
Jelte: Yeah, yeah.
Joe: Yeah, I mean, pg_duckdb was
really kicked off to try and
solve the known limitations of
analytics in Postgres.
You know, even at scales that aren't huge,
Postgres can be fairly inefficient
at analytical processing.
So, you know, DuckDB, that's what
it excels at.
We're trying to get the project
to the point where the user doesn't
necessarily need to know very much
about DuckDB at all.
Since these are Postgres users,
the service should look and feel
exactly like Postgres for the most
part.
And then we make those differences
clear: if you were to run
EXPLAIN, the plan will show you
exactly where and when DuckDB
is taking over execution,
if you want to look a little
bit deeper.
Michael: Would that be like a custom
scan node or how would it
show?
Jelte: Yeah, that's a custom scan
node.
And basically, it either takes over
completely or it doesn't take over at all.
There's no partial handoff.
I mean, that's one of the things we'll probably
want to add at some point,
to say, oh, this part
actually needs to be executed
in Postgres because DuckDB doesn't
know how to do it.
Right now, it's either everything
or nothing.
If you want to do something that
DuckDB doesn't support, but
that still needs DuckDB execution,
you'll just get told that that query
is not supported at the moment.
But yeah, it's one big custom scan
node, basically, that does the
DuckDB execution.
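
To see the behaviour Jelte describes, you can run EXPLAIN on a query pg_duckdb handles; the plan should contain a single custom scan node wrapping the whole DuckDB-executed query (the object names below are placeholders):

    -- The plan for a DuckDB-executed query shows one custom scan node
    -- rather than the usual tree of Postgres scan/join/aggregate nodes.
    EXPLAIN
    SELECT user_id, count(*)
    FROM read_parquet('s3://my-bucket/events/archive.parquet')
         AS (user_id bigint)
    GROUP BY user_id;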
Nikolay: And do you see situations
when you need to tune something?
There are some tuning capabilities
inside DuckDB, right?
But if we see only one node in the
plan, we cannot do it from there.
What's the workflow to understand how efficient it is?
Jelte: So, at the moment, the main things you can tune
in pg_duckdb are the number of threads that DuckDB
uses and the amount of memory that it's allowed to use.
Because, I mean, when you're running it inside... normally
DuckDB really likes to take all the resources of the system and
go as fast as it can.
Your laptop tries to fly away and overheats.
That's what it wants to do.
But if you're inside Postgres, that's probably not
what you want, because you also want to run some other queries.
So that's why we have some limits.
I think by default it will try to use as many threads as
are available, just so that it completes quickly, but the memory
it does try to limit.
But you can lower both or increase both.
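
A minimal sketch of the kind of tuning Jelte mentions; the exact setting names are an assumption here, so check the pg_duckdb documentation for your version:

    -- Setting names below are illustrative assumptions, not confirmed API.
    SET duckdb.max_threads = 4;      -- cap the threads DuckDB may use
    SET duckdb.max_memory = '4GB';   -- cap the memory DuckDB may use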
Nikolay: And we can run these queries on Postgres standby nodes,
right?
On replicas?
Jelte: Yes, you should be able to.
I don't think I've personally tried yet, but I think that should
be fine.
And if it doesn't work, please open a PR or an issue.
Nikolay: I'm already thinking about a common approach. I
would say a traditional approach right now is to have an OLTP Postgres
database for everything, but for analytics we sometimes offload data to
some other system.
Or, like with Hydra's columnar, it's a different table with column storage,
and as I remember it's also additional effort to keep data
synchronized.
Here, okay, maybe it also makes
sense in some cases to have the original table stored in Postgres
in row storage, but this table, maybe with projections,
some columns only, goes to this additional table which is stored
in this format in S3, Parquet, DuckDB-processed, right?
And this is also an interesting question.
Do you see it?
But I wanted to ask a different question.
So if we do this, we can run analytical queries right inside.
And I already see that if it's just selects, it makes sense to do
them on replicas, not on the primary.
We have only one single primary in a single cluster, so it's not
good to take memory for this and so on, and affect OLTP workload
processing.
In this case, it's great if it's fast, because with hot_standby_feedback
on, the xmin horizon is almost not affected, great.
But there, I probably would like to give all memory, all CPU
power to this processing, if it's maybe a dedicated replica.
Since I'm new to this, sorry if my question is not right, but
what about compression?
So do I have any capability to
choose the compression method,
ratio, and so on in the case of Parquet?
Joe: Well, if you were to copy
the heap table to Parquet format
using pg_duckdb, then, okay, you
know, I think you'd see what,
5 times compression.
Nikolay: Depends on data, right?
Joe: At worst, depending on the
data, 5 times, sometimes you
see 10 times compression.
But this can also be the case for
the other columnar formats
as well.
pg_duckdb supports the Iceberg reader
as well as the Delta table
reader.
You know, you could use Delta tables
or Iceberg files in object
storage as well.
All of those will have better compression
than, you know, than
just standard Postgres heap tables.
Nikolay: Yeah, this is what I expected.
And I'm just...
As I understand, there's no control,
you just get it, that's
it.
For now, you can...
Jelte: I'm not entirely sure.
It might be that the copy...
I mean, we sort of expose all the
copy options that DuckDB exposes.
So I think they might expose
a compression option where you
can say which compression
type you want to use to write
the Parquet file.
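
A sketch of what that could look like, assuming the DuckDB COPY options are passed through as Jelte describes; the path is a placeholder and the exact option names should be checked against the docs:

    -- Archive old rows into a compressed Parquet file on S3.
    COPY (
      SELECT * FROM orders
      WHERE created_at < now() - interval '2 years'
    ) TO 's3://my-bucket/archive/orders.parquet'
      WITH (FORMAT parquet, COMPRESSION zstd);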
Nikolay: Yeah, because usually
it's a trade-off between CPU utilization
and
the size.
Jelte: Yeah, yeah.
Nikolay: Yeah, this discussion of CPU triggers
this question in my head,
and I know column storage is beating
Postgres completely in terms
of compression.
I remember a funny case when I
said 1 billion rows is 1 terabyte,
but the VictoriaMetrics CTO said,
how come, it's just 1 gigabyte?
And we could not understand each
other, right?
Because in Postgres it's indeed
1 terabyte.
But yeah, 5X you said is the
lowest, that's great.
And I think it's also a cost saving.
And it's not only disk saving
if you offload
to S3, but also memory, right?
If we have one primary and multiple standby
nodes, in many
cases the shared buffer pool and
page cache state is mirrored
between standby nodes.
And if this is data which we don't
need often, it's not good
in terms of cache efficiency, right?
If you offload it, you need less memory
to process your workload,
and this is probably a huge cost
saving as well, right?
Jelte: If you mean offloading the compute
to standby nodes?
Nikolay: Offloading in general.
We take some data,
we offload it, like if it's archived
data we don't need often, and it
also has some patterns
where column storage makes sense.
We offload it to S3 using this approach.
And if it's processed by DuckDB,
well, it still needs
memory, right?
It still needs memory if we often
execute some aggregates, so
we consume memory on the Postgres standby
anyway, right?
Jelte: Yes.
Yeah, but even then the memory
can still be less than it would
be for Postgres tables, because
in memory it's sometimes
compressed too.
Like if there are many values that
are the same, sometimes it will
just store the count of values
and the actual value.
It's just repeated N times,
basically.
So that also helps the compute...
That's one of the tricks that these
column storage engines generally
use to improve the compute, and
also the memory usage.
Nikolay: Right, and again, if we
do it in row storage, we pull
all those blocks with all columns,
and it's very inefficient in terms
of memory consumption.
Jelte: Yeah, and they all have
the exact same value, basically.
Nikolay: Yeah, yeah, yeah.
So this makes sense completely
to me.
Joe: You know, the common pattern
for so many SaaS companies
is that really only the last 30
days of data or so is being updated
very frequently.
But after that, the data maybe
shouldn't be deleted; it should
still be accessible.
And it can live in object storage,
which is 2 cents per gigabyte
per month, as opposed to on disk
in Postgres, where it can be 10 or
15 cents per gigabyte per month
for some services.
So not only is it cheaper per
raw gigabyte, but the
compression, as you mentioned
just earlier, is also quite good.
So you can archive data substantially
and still have it accessible
to the application.
And pg_duckdb also supports
caching.
So you can cache the
Parquet files from S3 to where
Postgres is, to avoid
going over the network to
S3, because that would really
be one of the major
costs: going from Postgres
to S3 to retrieve those
files.
Nikolay: Are you talking about
caching in memory or on disk,
like an EBS volume or something?
Jelte: On-disk caching.
Basically, it downloads the file,
and it's like, oh, this
one we're going to use a lot
of times, and it doesn't change.
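
The transcript doesn't spell out the interface for this; as an assumption, based on how the project documents it, it could look something like the following, so treat the function name as hypothetical and check the pg_duckdb docs:

    -- Pull the Parquet file down to local disk once so later queries
    -- don't have to go over the network to S3 (function name assumed).
    SELECT duckdb.cache('s3://my-bucket/events/archive.parquet', 'parquet');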
Nikolay: Yeah, that's interesting.
And also, I noticed that, of course,
writes are supported, right?
So we can update this data, right?
Insert, update, and so on.
And in this case, I'm very curious
about the performance of writes
as well.
How does it compare?
What's the sense of it?
Jelte: I think writes are, I mean,
you can store a Parquet file,
but I don't think we currently
support updating values in a Parquet file.
We do support the regular COPY:
you put in a query and you put all
the results into one big Parquet
file.
Yeah, that's the main writing
to blob storage we support.
What we also support is
writing to MotherDuck, which is
a sort of managed DuckDB.
That's the company
I work for.
It's like we have DuckDB in
the cloud, basically.
So there we do support inserting
into tables that live there
and updating them, more
similar to how you would
insert into a Postgres table.
Does that make sense?
Nikolay: Yeah, it makes sense.
And an additional question: if I insert
data, can I do it inside
a complex transaction, which also
inserts data into a regular
table in Postgres?
Jelte: So, right now you cannot
write to both a Postgres table
and a MotherDuck table in the same
transaction, specifically because
of what happens when a crash occurs.
I mean, it kind of works.
It's just that when a crash happens,
maybe you only wrote to
one of the places.
So that's something we want to
protect against.
Nikolay: Because 2PC would protect
here.
Jelte: Yes, 2PC would protect here,
but 2PC is not trivial to
implement.
Nikolay: And it's super slow as
well.
It's not fun to use either.
Jelte: Yes. So that's definitely
something we're considering
for a future version.
But for now, it's like, okay, you
can write to MotherDuck or Postgres.
You can write to MotherDuck in a
transaction, you can do multiple
writes, but you cannot also write to
Postgres in that same transaction.
Nikolay: Will it just produce an
error or just, you mean, it's
not good?
Jelte: Yeah, right now it will
produce an error.
It will say it's a protection.
Maybe we'll lift that; like, we'll have
a setting to say, okay, you
know what you're doing, or you
don't care about slight inconsistencies
in case of a crash.
Because it will still roll back, the happy
rollback path: if a query fails or
something, it will still roll
back.
That still works. But in actual crash
cases, where Postgres really
goes away without
doing its normal cleanup, yeah,
you have some race conditions where
you write only to MotherDuck
and not to Postgres.
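
Illustrating the limitation Jelte describes, with both table names hypothetical: duck_events stands for a MotherDuck/DuckDB-backed table and pg_events for a regular heap table.

    BEGIN;
    INSERT INTO duck_events VALUES (1, now());  -- fine
    INSERT INTO duck_events VALUES (2, now());  -- multiple DuckDB writes are fine
    INSERT INTO pg_events   VALUES (1, now());  -- errors: writing to both stores
                                                -- in one transaction is blocked
    ROLLBACK;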
Michael: 1 core feature we haven't
talked about yet is creating
temporary tables within Postgres
using the DuckDB columnar format.
When would it make sense to do
that versus caching a parquet
file locally?
When are you seeing people use
that?
Jelte: I don't know if I've seen
many people use that.
One of the main reasons it's implemented
is because we wanted to
add MotherDuck support, and this
was really easy to add as well;
we sort of got it for free.
So it wasn't necessarily like, oh,
this is a feature we super need.
It was like, oh, this is only a little
extra code, and now we have
temporary table support.
And it also makes testing a
bit easier, because for the
MotherDuck testing you need an
account and everything, while
these DuckDB tables you can have
locally.
Cool.
Yeah, but one of the things it
could be useful for is
ETL kinds of jobs, where you
create a temporary table to do some
local transforms, you run a query,
you create another temporary table
to do some more transforms,
and then you store the result somewhere
else.
So it's not completely useless
or something.
It's just that it wasn't our main,
like, oh, we need this.
Yeah, so that's it.
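
A sketch of the ETL-style use Jelte mentions, assuming the extension's DuckDB table access method is called duckdb; the table, path, and column names are made up, so check the docs for the exact syntax:

    -- Stage intermediate results in a DuckDB-backed temporary table.
    CREATE TEMP TABLE stage1 (user_id bigint, n bigint) USING duckdb;
    INSERT INTO stage1
      SELECT user_id, count(*)
      FROM read_parquet('s3://my-bucket/events/archive.parquet')
           AS (user_id bigint)
      GROUP BY user_id;
    -- Further transforms can build on stage1 before the final result
    -- is written somewhere else.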
Michael: I think that's what I
find.
I found this project quite
confusing to get, and not just
this one, all DuckDB-Postgres-related
projects; I find them
hard to get my head around,
partly because it
feels like a bag of features.
Firstly, I can copy to Parquet
format, and I can offload queries,
like I can query through Postgres
to Parquet format, but I can also
execute a query that's completely
local to Postgres using the DuckDB
engine.
That's a totally different feature,
but it's also the same extension.
It took me a while to get my head
around that these are different
use cases, but all provided by
the same extension, if that makes
any sense.
Jelte: Yeah, I totally understand that it's
a bit confusing.
One reason, I guess, is because DuckDB
can do so many things,
and you kind of want to be able to combine
all of these things.
So it's like, oh yeah, you want
to read a Parquet file
and join it with some Postgres
table to, for instance,
hydrate the user IDs with the
recently updated user information,
so that the username is actually
what it is now and not what
it was like two months ago when the
record was created in the Parquet
file.
And that means that DuckDB
really needs to be able to
read Postgres tables.
But then, if you want,
you could also just make it read
only Postgres tables; that's also fine.
So I think both Hydra and MotherDuck
had some idea
about what features we think
are important.
You start a bit from
there, like, oh, these are
the use cases we want to support,
and then, how can we glue all those
use cases together in a way
that works nicely?
And then you also end up with some
other use cases you didn't
even think about.
Joe: Yeah,
it's a really good point.
I've seen Postgres itself described
by many people, even myself,
as a multi-tool type of database.
Adding DuckDB into Postgres
makes it probably even more
of a multi-tool database, where
the average user is not going
to use every feature, nor should
they try to.
But if there's something there
that's useful to them that just
kind of comes along with DuckDB,
certainly there's no reason
for us to try and remove it or
disable it.
You kind of get all of these cool DuckDB
features along with
Postgres, as a complete
kind of all-in-one multi-tool
type of database.
In fact, I've kind of started to
use the term multi-store
a little bit.
I think single store and unistore
were taken.
But really what's happening is you have
two different databases kind of
working together, and the
data can also be stored in multiple
formats, both heap as well
as Parquet.
So that could be another
way to think about it.
Michael: Yeah.
Yes.
Because of time, I'm starting to
think about wrapping up.
Do you guys want to talk a little
bit about what you're working
on at the moment or what your plans
are for the future?
Jelte: Sure.
So, I mean, we released
the latest version
like a week ago or something.
And we're now working basically
on two big things.
One is redoing the whole way we read
from Postgres, because before
we were doing low-level
reads of the actual
bytes, sort of re-implementing
the heap reader in a sense.
But that means you don't get index
support automatically; then
you also need to implement the
index reader and stuff like that.
So now we're instead building a
Postgres query again, giving it
to the Postgres planner, and
letting Postgres figure it out.
And that's working
very well, because then
you get foreign data wrapper support,
so that you can call foreign
data wrappers from Postgres from
within DuckDB again.
So it goes all the way
down.
So that's one big feature.
And the other is one thing we've
seen people run into in
the first two versions: to be
able to read from Parquet,
you use the DuckDB functions
to read Parquet files,
Iceberg files, Iceberg catalogs,
and you need to know what the columns
are.
Right now you need to specify
those columns and their types
in your query, which is a bit of
a hassle.
So I'm working on a way to make
it so that you don't have to
do that anymore.
Because Postgres wants to be very
strict about these things,
and that's one of the big differences between
DuckDB and Postgres.
Postgres is very strict about certain
things, and DuckDB is like, oh, you typed
this, you probably meant to
do this thing, this kind of makes sense,
I can understand what you're trying to do
here, let me give you the sensible result.
And Postgres is much more like,
oh, this is not exactly how I
expected it.
So that's where there's
a bit of a problem, and I'm trying
to make the UX there
a lot nicer.
And I have a way now that's working
quite well, but it needs
a bit more polish.
So yeah, probably after Christmas
or after New Year's, I'll
create a PR to make that easier
to work with.
So you could just do SELECT *
FROM read_csv or read_parquet of
a file, without having to specify
the columns.
That's sort of the user experience
we want.
But right now you need to do SELECT *
FROM read_parquet and
then specify all the columns that
you actually want to read
from the Parquet file, which kind
of defeats the purpose of the
star.
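
In concrete terms, with the path and columns as placeholders, the difference Jelte describes is roughly:

    -- Today: the column names and types have to be spelled out.
    SELECT * FROM read_parquet('s3://my-bucket/events/archive.parquet')
         AS (user_id bigint, created_at timestamptz);

    -- Desired UX: let the file's own metadata supply the columns.
    SELECT * FROM read_parquet('s3://my-bucket/events/archive.parquet');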
Michael: Makes perfect sense.
And how about from your side, Joe?
Joe: Yeah, I think Jelte covered
it well, as far as the way that
we scan Postgres tables.
That should be merged in, I think,
relatively soon, and then
we'll let it bake before the next
official release.
As Jelte mentioned, we released
the latest version just a week ago,
so we have a pretty good clip at
this point of shipping new releases.
If there's something that folks
want to see, certainly head on
over to the repo and open an issue
or let us know.
And then as far as Hydra goes,
as I mentioned, we have a managed
cloud Postgres service that's in
beta that has integrated pg_duckdb.
So we're looking to roll that
out into public beta, so people
can begin to use pg_duckdb without
having to self-manage it.
Today, you can obviously take pg_duckdb
and put it on an EC2 box
with Postgres or whatever you want
to do.
So yeah, looking forward to shipping
that out.
Nikolay: Yeah, that's great.
I just thought, like, 20 years
ago, MySQL was dominating the open
source database field, and they
had several engines.
I remember you needed to pick:
one was not crashing, and the
other had full-text search.
I mean, MyISAM and InnoDB,
it was, yeah, a trade-off.
And now, like 20 years later, we
talk about various engines, basically
cloud-friendly ones, because you
can upload the data to object storage.
It's amazing that we can combine
these in Postgres right now,
and it looks like it's only starting,
right?
A lot of stuff can be added, developed,
and so on.
It's only the beginning as I see
it.
Joe: Yeah, it's pretty interesting
to see the data lake come to Postgres.
It furthers your point
that many of the layers of
modern databases are being
pulled apart.
There are multiple engines, there's
object storage, there's compute
with Postgres.
So I think that trend is continuing
to play out.
As far as the average app developer
is concerned, like I was
indicating before, I don't think
they necessarily need to know
or care so much about the differences
between a database and
a data lake.
So hopefully the interface being
Postgres helps bridge the gap
of using something really cool
like DuckDB, together with
what you're used to in Postgres.
Nikolay: Yeah, that's cool.
Michael: Nice one.
Yeah.
Well, thank you so much, both, for
joining.
Thank you, Nikolay, as always,
and catch you soon.
Nikolay: Yeah.
Thank you guys.
Jelte: Thank you.