Postgres FM

Michael and Nikolay are joined by Joe Sciarrino and Jelte Fennema-Nio to discuss pg_duckdb — what it is, how it started, what early users are using it for, and what they're working on next.
 
Here are some links to things they mentioned:

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:

With special thanks to:
  • Jessie Draws for the elephant artwork 

Creators & Guests

Host
Michael Christofides
Founder of pgMustard
Host
Nikolay Samokhvalov
Founder of Postgres AI
Guest
Jelte Fennema-Nio
Software Engineer at MotherDuck & Postgres contributor
Guest
Joe Sciarrino
Co-founder & CEO at Hydra

What is Postgres FM?

A weekly podcast about all things PostgreSQL

Michael: Hello, and welcome to
Postgres.FM, a weekly show about

all things PostgreSQL.

I am Michael, founder of pgMustard,
and I'm joined, as usual,

by Nikolay, founder of Postgres.AI.

Hey, Nikolay.

Nikolay: Hi, Michael.

How are you?

Michael: No, not falling for it.

Today, we are delighted to be joined
by 2 guests.

Firstly, Joe from Hydra.

Welcome, Joe.

Joe: Thank you.

Michael: And also Jelte from MotherDuck.

Welcome, Jelte.

Jelte: Hi.

Hi all.

Michael: So thanks so much both
for joining us.

We are going to be discussing a
new-ish project called pg_duckdb,

which very interestingly is a cross
organization collaboration.

So Jelte, I saw the recording
of your excellent lightning

talk introducing pg_duckdb to PGConf
EU, a nice job.

Perhaps you could start us off
with how you describe pg_duckdb

to a Postgres savvy audience who
may or may not know what DuckDB

is.

Jelte: All right, so it's an extension
for Postgres and it embeds

DuckDB.

And DuckDB is an in-process analytics
database, like SQL database.

So it's a bit like SQLite, but
then more for analytics, like

big queries with lots of joins
and stuff like that.

And lots of joins and less like
SQLite or like Postgres in general

for like smaller datasets, like getting
a hundred rows or something

like that.

It's index scans, stuff like that.

Indexes are not something that's
commonly used in the DuckDB

land.

It's like, oh, we just scan the
whole table, and we compute stuff

very fast on it.

Yeah, and that is then embedded
into Postgres so that you can

use that for some of your more
hazard queries that don't rely

on indexes.

Michael: Nice, and I'm sure we're
going to come back to indexes

in a bit once we dive into the
details.

But Joe, I think the origin story
of this project is super interesting.

I'd be fascinated.

I was reading the MotherDuck blog
post and it mentioned that

Hydra kicked off the effort.

So I'd love for you to share a
little bit more about that if

you can.

Joe: Sure, Sure.

So our team had been working on
a crazy weekend project that

we initially were calling pg_quack.

It was called pg_quack because
well, ducks go quack, but also

because we didn't necessarily expect
this to work.

We kind of were writing it off
as a quacky project.

So it was a little bit of an internal
joke.

But the performance of DuckDB was
looking super great when it

came to all kinds of analytics
queries and aggregates like Jelte

mentioned.

And we had this aha moment that
we could actually take pg_quack

pretty seriously and do it for
real in collaboration with folks

that know a whole lot about DuckDB.

You know, we're Postgres guys by
background mostly.

So not to get too far ahead, but
we decided to build pg_duckdb

for real as a production-grade
extension together with the folks

over at DuckDB Labs, the creators
of DuckDB, which is Hannes

and Mark, as well as the folks
at MotherDuck, like Jelte.

So we've been working together
with them on it, oh, for at least

last 6, 7 months now.

Michael: Yeah, super cool.

And I saw a few other organizations
mentioned as well.

How do they all fit in?

Joe: Yeah, so this is open source,
it's an MIT licensed extension.

So for folks that want to participate,
or can see pg_duckdb being

really helpful in their use cases,
they can either tell us directly

what they'd like to add or open
a PR or open issues.

I mean, certainly there are a good
number of those as the project

is currently in beta.

So be gentle with it, but also
tell us the things that could

be improved on.

One of the things that was surprising
to me was the immediate reaction

that we saw to how DuckDB could
be useful to accelerating analytics

queries in Postgres.

And there are a lot of organizations
that went.

Oh, yes, okay.

We want to support this we want
to get involved.

So that's probably some of the
names you were seeing

Michael: Yeah, I think I saw was
it Neon and Microsoft maybe

some others as well.

Joe: Yeah, there's a whole range
of folks that

are interested in us moving this
forward, and they're providing

either direct or indirect support
for the project.

Michael: Yes.

Jota, from your side, what are
you...

I should mention both of your thoughts
on this, but what are

you seeing early users most excited
about?

Like what features are they most
keen to see added or improved?

Jelte: I think one of the big things
is that you can sort of read

Parquet files or read CSV files
from blob storage, like from

S3 or from Azure or from all the
big clouds, basically in Cloudflare

as well.

And I think that's a big feature that people are very interested

in.

And I mean, there's so many open issues of people wanting a specific

thing that it's hard to pick 1 that's the thing people want.

But 1 that has come up a few times is support for spatial stuff,

like PostGIS things.

And that works slightly different in DuckDB so we haven't started

on that we're first sort of building a bit more the base but

that's definitely 1 of the things that 1 of the features that

comes up like people wanting to do spatial and let's do spatial

data basically.

Yeah.

But that's not supported right now.

Nikolay: Yeah.

So, 1 of the biggest features is that data can be on, I mean,

it's not a feature, it's like a requirement here.

Data is on object storage like S3, right?

In Parquet format, file format, for example, and it's already

column store.

But like, I mean, it's unlike Postgres, right?

But this sounds like, I remember some attempts to have access

to Parquet files on object storage as extension before pg_duckdb,

right?

And it's also like It could be done in FDW way, foreign data

wrapper way.

I think some SQL interface over those files and that's it.

But DuckDB is like bringing something else.

I've never tried myself, but I've heard many opinions about very

good performance.

Right?

Jelte: Yeah, yeah.

So, that's 1 of the big differences between pg_duckdb and some

of the other foreign data wrapper based ones.

Is that we sort of flip it around a bit, instead of saying like,

oh, this table is stored somewhere else.

We'll read that using DuckDB or something, or something else.

Like, instead We're saying like, okay, we push the whole query,

we put it in DuckDB, and then let DuckDB read the Postgres data.

So the Postgres sort of becomes a foreign data wrapper, kind

of inside DuckDB.

Does that make sense?

So it's more like a query stealing approach, is what Mario from

Hydra, 1 of the developers there, called it.

It's like, instead of letting the Postgres execution engine do

a lot of things, it's like, we don't want the Postgres execution

engine to do a lot of things because the DuckDB execution engine

is much better for these analytic stuff, like the sorting and

the grouping and stuff like that that are joining.

We'd wanna do as much as possible inside DuckDB.

Nikolay: Yeah, that's interesting.

Joe: Yeah, and also, you know, kind of back to Michael's question

a little bit too, A really common pattern for a lot of application

developers is to have Postgres, but also to have an S3 bucket.

There's a lot of event data that can get generated that really

should not live on disk in Postgres.

We're talking to a company yesterday that had generated, you

know, has 2.6 terabytes of event data that they really shouldn't

be storing necessarily on Postgres, but they could offload into

Parquet format in S3.

But that data still needs to be accessible by Postgres.

Because 1 of the important things is that pg_duckdb can perform

the join between heap table and Postgres together with the Parquet

files in object storage so that there's not this siloing or needing

for ETL processes between Postgres and external analytics databases

and S3.

So how do you smooth over the differences between columnar format

and row-based format?

And I think pg_duckdb goes a long way to do that because ultimately

when you're developing applications, thinking about table format

is pretty low level.

So if you can avoid that as much as possible, that's probably

just easier.

Nikolay: Yeah, but here I'm confused a little bit, because this

thought is like, let's have tiering of storage.

And, for example, I remember several cases, big companies where

this was needed and the decision was made for scaling purposes

of main clusters.

We need to offload archive data somewhere else, maybe to different

cluster or somewhere else.

Maybe sometimes we just delete it.

It could be, for example, e-commerce and old orders, order history,

exceeding for example a couple of years, we don't need that.

But offloading this data to S3, this sounds great and I remember

there's also extension pgt from Tembo, again I haven't tried

it yet, It's in my to-do list.

But offloading that, I'm thinking, if I also receive column storage

for this kind of data, is it right?

Because I still need only 1 row in these cases, right?

In some cases, I do need like aggregates, like, I don't know,

time series, I would love to upload to S3 in column store format,

but not in these cases when I just need 1 row.

Joe: Yeah, I mean, I think it's different needs for different

users.

If you are a, you know, application developer just trying to

build something quickly, then if we can tell you, hey, dump a

lot of your event data into Parquet format in S3, or archive

it out of Postgres on some increment every 7 days, every 30 days

or something, then that's easy enough.

Then maybe the thinking about columnar versus row should subside.

But yeah, I mean, data tiering and data lake is not as different

as, you know, they're really not
that different.

It's just a function of what the
user interacts with.

Nikolay: Right, but my concern
is if I need the row store, like

I need to find all columns for
1 row to present it.

For example, again, orders, right?

I need 1 order.

If we pull it from archive, I expect
not good performance.

But if it's column store, I will
be concerned about performance

to collect all the rows versus
row storage.

Let's formulate the question differently.

Let's talk about the use cases
where this particular pattern

when we need to upload a lot of
data to S3 and benefit from column

store, analytical queries like
aggregates, what kind of data

it could be in existing Postgres
database?

It can be SAS or something, I don't
know like what kind of data

is beneficial for us to offload
to this?

Jelte: A very common case is like
time series data.

I think

that's 1 of the, that's
by far 1 of the most common

cases where it's like, okay, at
least sort of the old data we

store in, I mean, you're not going
to look up a single row in

some series of events.

Or if you do, if it's like a week
old, it's not so terrible that

you don't look that up too often.

So it's like, okay, it's fine.

If that takes a little bit longer,
make sure fetching a bit too

much more data that you actually
need.

If that means that you can do these
aggregates much quicker.

I think that's 1 of the most common
cases, I would say.

Nikolay: Yeah, that makes sense.

And that means that, for example,
Timescale Cloud, they have

it bottomless timescale of loading
some data transparently, so

we don't see it, but it goes, some
archive data goes to S3, but

this is not open source.

PjWB is open source, which I like.

You mentioned MIT license, that's
great, but I see some kind

of, not competition, but alternative
here.

I can choose between going to Timescale
Cloud if I need this,

or just use it if I can.

So, I mean, if I can install extension,
which is not available

in managed service yet.

But yeah, it makes sense to me.

Indeed, especially if data is old,
and I expect it to go to object

storage, I definitely not in the
usually I'm not interested in

single row, I'm interested in aggregates.

Right?

Yeah, this makes sense for me.

Cool.

Michael: That's a good question
though.

Are there any, I'm sure you're
talking to lots, but are there

any Postgres managed services that
already offer it in a preview

or developer mode?

Jelte: I mean, Hydra is offering
it, but other than that, not

yet.

But yeah, we're definitely in talks
with a bunch of them.

Yeah, it's a project that many
people are interested in.

Michael: Yeah, nice.

Joe: Yeah, so Hydra currently has
a closed beta of a fully managed

Postgres service that has pg_duckdb
extension installed.

Nikolay: Nice.

Let's talk about some numbers,
if you can pull them from top

of your head.

For example, if I need to aggregate
over many million rows which

are sparsely distributed.

We know like in the worst case,
for example, if I touch, I don't

know, 1000000 rows, which each
1 is stored in separate buffers

in the worst case, for example,
right?

Very roughly, I need to touch,
And in the worst case, it's not

hits, it's reads.

I need to get from disk 1 million
buffers, I mean, pages, right?

And in the worst case, it can be
absolutely worst.

I can imagine it may be like 1
millisecond for each if latency

is bad and we do it in not without
parallelization it's 1000000

seconds it's terrible right and
this is what I can see in Postgres

in some cases what should I expect
here if I need for example

to calculate sum of of rows like
1 million rows stored in this

format?

What like is it seconds, minutes?

Jelte: You mean if it's sort of
blob storage?

Like how many rows?

Nikolay: 1000000 rows and some
simple aggregates like count sum

or something.

What do you see?

Jelte: I would say it's generally
within a second, I would say,

or within a few seconds at least.

Nikolay: Including data transfer
from S3, everything, right?

Jelte: Yeah, yes.

So because it's a per cave format,
you only get the columns that

you actually requested.

So it's already, even if it's a
lot of data, you generally just

because you have a bunch of columns.

So you only grab the data from
S3 that you actually need.

And that is then sort of streamed
through the WB engine to get

these, to get these aggregates.

Nikolay: Yeah, that's, that's super
cool.

Jelte: But I mean, it's like with
all these questions, it always

depends a bit on what, on what
you're doing exactly.

But for instance, we, we, we, we
like do, there's click bench

results.

ClickHouse has a benchmark and
their pg_duckdb is quite...

From the Postgres compatible ones,
it's...

I don't think it's at the top.

The one with MotherDuck support is
almost at the top, I think, but

the one with Parquet is a bit
lower, but still much higher

than tuned Postgres, I think.

I don't know by heart, but I think
it's at least 10 times faster

than the tuned Postgres version,
but maybe even more.

Nikolay: Yeah, that's great.

And I can imagine this opens possibilities
to stop saying don't

do analytics in Postgres.

So this can be a transaction which
does aggregates in Postgres

if it's within seconds.

We don't have negative consequences
of long-running transactions

anymore, right?

It's not only about user experience.

It's also about how Postgres performs
vacuuming, still, and so

on.

This xmin horizon problem and so
on, right?

Jelte: Yeah, yeah.

Joe: Yeah, I mean, pg_duckdb was
really kicked off to try and

solve the known limitations of
analytics in Postgres.

You know, not even at huge scales,
Postgres can be fairly inefficient

for doing analytical processing.

So, you know, DuckDB, that's what
it excels at.

We're trying to get the project
to the point where the user doesn't

necessarily need to know very much
about DuckDB at all.

Since these are Postgres users,
the service should look and feel

exactly like Postgres for the most
part.

And then we make those differences
clear if you were to run an

EXPLAIN plan, it will show you
exactly where and when DuckDB

is taking over for that execution
if you want to look a little

bit deeper.

Michael: Would that be like a custom
scan node or how would it

show?

Jelte: Yeah, that's a custom scan
node.

And basically, it's always like
it either takes completely over

or it doesn't take over at all.

There's no, I mean, that's sort
of one of the things we'll probably

want to add at some point, like
to say like, oh, this part is

actually, needs to be executed
in Postgres because DuckDB doesn't

know how to do it.

Right now, it's either everything
or nothing.

If you want to do something that
DuckDB doesn't support, but

still needs DuckDB execution, then
just look at this query.

It's not supported at the moment.

But yeah, It's one big custom scan
node, basically, that does the

DuckDB execution.

Nikolay: And do you see situations
when you need to tune something?

There are some tuning capabilities
inside DuckDB, right?

But if we see only one node in the
plan, we cannot do it from there.

What's the workflow to understand how efficient it is?

Jelte: So, I mean, at the moment, the main things you can tune

in pg_duckdb are like just that amount of threads that DuckDB

uses and the amount of memory that it's allowed to use.

Because it's, I mean, because you're running it inside... normally

DuckDB really likes to take all the resources of the system and

go as fast as it can.

So that's, I mean, your laptop tries to fly away and overheats.

That's what it wants to do.

But, I mean, if you were inside Postgres, that's probably not

what you want, because you also want to run some other queries.

So that's why we have some limits.

So I think by default, it will try to use as many threads as

are available, just so that it completes quickly, but at least

the memory is trying to limit.

But you can lower both or increase both.

Nikolay: And we can run this queries on Postgres standby nodes,

right?

On replicas?

Jelte: Yes, you should be able to.

I don't think I've personally tried yet, but I think that should

be fine.

And if it doesn't work, please open a BR or like an issue.

Nikolay: I'm already thinking about a common approach, like I

would say a traditional approach right now is to have OLTP database

Postgres for everything but for analytics we offload data to

some another system sometimes.

Or like with Hydrate it's a different table with column storage

as I remember it's also like additional effort to maintain data

synchronized.

Here if some data, okay, again, maybe also here, maybe it makes

sense in some cases to have original table stored in Postgres

and raw storage, but this table, maybe with maybe projections,

some Columns only, go to this additional table which is stored

in this format in S3, Parquet, DuckDB processed, right?

And this is also an interesting question.

Do you see it?

But I wanted to ask a different question.

So if we do this, we can run analytical queries right inside.

And I already see that it makes sense if it just selects to do

it on replicas, not to, like, primaries.

We have only 1 single primary in a single cluster, so it's not

good to take memory for this, and so on, and affect all TP workload

processing.

In this case, it's great that if it's fast, because host-and-buy

feedback is on, like, spin-horizon is almost not affected, great.

But there, I probably would like to give all memory, all CPU

power for this processing if it's maybe it's dedicated replica.

Since I'm new to this, sorry if my question is not right, but

what about compression?

So do I have any capabilities to
choose compression method and

ratio and so on in case of Parquet?

Joe: Well, if you were to copy
the heap table to Parquet format

using pg_duckdb, then, okay, you
know, I think you'd see what,

5 times compression.

Nikolay: Depends on data, right?

Joe: At worst, depending on the
data, 5 times, sometimes you

see 10 times compression.

But this can also be the case for
the other columnar formats

as well.

pg_duckdb supports the Iceberg reader
as well as the Delta table

reader.

You know, you could use Delta tables
or Iceberg files in object

storage as well.

All of those will have better compression
than, you know, than

just standard Postgres heap tables.

Nikolay: Yeah, this is what I expected.

And I'm just...

As I understand, there are no control,
you just get it, that's

it.

For now, you can...

Jelte: I'm not entirely sure.

It might be that the copy...

I mean, we sort of expose all the
copy options that DuckDB exposes.

So I think they might expose like
a compression option that you

can say like which compression
type you want to use to write

the Parquet file?

Nikolay: Yeah, because usually
it's a trade-off between CPU utilization

and

the size.

Jelte: Yeah, yeah.

Nikolay: Yeah, this discussion of CPU triggers
this question in my head,

and I know Column storage is beating
Postgres completely in terms

of compression.

I remember a funny case when I
said 1 billion rows is 1 terabyte,

but VictoriaMetrics CTO said,
how come it's just 1 gigabyte?

And we could not understand each
other, right?

Because in Postgres it's indeed
1 terabyte.

But yeah, 5X, you said it's the
lowest, that's great.

And I think it also cost saving.

And it's not only disk saving in
terms of if you're offloaded

to S3, but also memory, right?

If we have 1 primary multiple standby
nodes, we like, in many

cases, the shared buffer pool and
page cache state is mirrored

between standby nodes.

And if this is data, which we don't
need often, it's not good

in terms of cache efficiencies
right if you're if loaded if you're

floated to this you like less memory
to process your workload

and this is probably a huge cost
saving as well, right?

Jelte: If you offload the computer
to standby now, if you mean,

Nikolay: or are you offloading?

In general, we take some data,
we offload it, like if it's archived

data, we don't need it often, and
it also have some patterns

where column storage makes sense.

We offloaded to S3 using this approach.

And if it's processed by a DuckDB,
like, well, it still needs

memory, right?

It still needs memory if we often
execute some aggregates, so

we consume memory on Postgres standby
anyway, right?

Jelte: Yes.

Yeah, but even then the memory
can still be less than it would

be on Postgres tables, because
also in memory sometimes it's

compressed.

Like if there's many values that
are the same, sometimes it will

just distort the amount of values
and the actual value.

It's just like repeat end times,
basically.

So that also makes the compute...

That's 1 of the tricks that this
column storage engine generally

uses to improve this compute, and
also the memory usage.

Nikolay: Right, and again, if you
do it in row storage, we pull

all those blocks with all columns
and very inefficient in terms

of memory consumption.

Jelte: Yeah, and they all have
the exact same value, basically.

Nikolay: Yeah, yeah, yeah.

So this makes sense completely
to me.

Joe: You know, the common pattern
for so many, so many SaaS companies

is that really only the last 30
days of data or so is being updated

very frequently.

But then after that, the data maybe
shouldn't be deleted, but

it should be accessed.

And it can live in object storage,
which is 2 cents per gigabyte

per month, as opposed to on disk
and Postgres, it can be 10 cents,

15 cents per gigabyte per month
for some services.

So not only is it cheaper from
a cost perspective of the raw

gigabyte per cost, but also the
compression as you mentioned

just earlier is quite good.

So you can substantially archive
data, still have it accessible

to the application.

And pg_duckdb also does support
caching.

So you would be able to cache the
parquet files from S3 to where

Postgres is, so that you can avoid
running over the network to

connecting to S3, because that
would be really 1 of the major

costs, is running over from Postgres
to S3 to retrieve those

files.

Nikolay: Are you talking about
caching in memory or on disk,

like a BUS volume or something?

Jelte: On disk caching.

Basically, it downloads the file
and then it's like, oh, this

1, we were going to use it a lot
of times, it doesn't change.

Nikolay: Yeah, that's interesting.

And also, I noticed that, of course,
rights are supported, right?

So we can update this data, right?

Insert update and so on.

And in this case, I'm very curious
about the performance of writes

as well.

Is it like, how to compare it?

What's the sense of it?

Jelte: I think writes are, I mean,
you can do, like you can store

like a Parquet file, but I don't
think we currently support updating

values in a Parquet file.

We do support the real-life copy.

You put a query and you put all
the results in 1 big Parquet

file.

Or, yeah, that's the main writing
to Blob Search to support.

What we do support is also like
writing to MotherDoc which is

like a sort of managed DocDB.

That's I mean that's the company
I work for.

It's like we have like DocDB in
the cloud basically.

So there we do support inserting
into tables that live there

and updating them, like sort of
more similar to how you would

insert it to Postgres table.

Does that make sense?

Nikolay: Yeah, it makes sense.

And additional question, if I insert
data, can I do it inside

a complex transaction, which probably
inserts data to a regular

table in Postgres?

Jelte: So, right now you cannot
write to both a Postgres table

and a MotherDev table in the same
transaction, specifically because

when a crash happens, I mean, it
kind of works.

It's just that when a crash happens,
maybe you only wrote to

1 of the places.

So that's something we want to
protect against.

Nikolay: Because 2PC would protect
here.

Jelte: Yes, 2PC would protect here,
but 2PC is not trivial to

implement.

Nikolay: And it's super slow as
well.

It's not fun to use as well.

Jelte: Yes, So there's definitely
something you're considering

for like a future version.

But for now, it's like, okay, you
can write to MotorDoc and Postgres.

You can write to MotorDoc in a
transaction, you can do multiple

writes, but you cannot write to
Postgres in that same transaction.

Nikolay: Will it just produce an
error or just, you mean, it's

not good?

Jelte: Yeah, no, right now it will
produce an error.

Like it will say, like, oh, it's
a protection against, like,

maybe we'll lift, like, we'll have
a setting to say, okay, you

know what you're doing, or you
don't care about slight inconsistencies

in case of crash, because it will
still roll back, like the happy

rollback path, or query fails or
something, it will still roll

back.

That still works, but actual crash
cases where Postgres really

really goes away without sort of
doing its normal cleanup yeah

you have some race conditions where
then you write only to MotherDuck

and not to Postgres.

Michael: 1 core feature we haven't
talked about yet is creating

temporary tables within Postgres
using the DuckDB columnar format.

When would it make sense to do
that versus caching a parquet

file locally?

When are you seeing people use
that?

Jelte: I don't know if I've seen
many people use that.

The main reason...

1 of the main reasons it's implemented
is because we wanted to

add MotherDuck support.

And this was really easy to also...

I mean, it was like, we sort of
get it for free.

So that was...

It wasn't necessarily like, oh,
this is a feature we super need.

But it was like, oh, this is like
that lines of code extra, and

now we have temporary table support.

And it also makes the testing a
bit easier, because then you

don't have to...

Like, for the MotherDuck testing,
you need an account, and you

need everything, so you can have
this...

These DuckDB tables, you can have
them locally.

Cool.

Yeah, but 1 of the things that
could be useful for us, like for

like ETL kind of jobs, where you
have like a, like you create

some temporary tables, you do 1
query, you do 1 query, create

a temporary table to do some local
transforms, you create another

temporary table to do some transforms,
and then you store somewhere

else.

So it's also not completely useless
or something.

It's just that it wasn't our main,
like, oh, we need this.

Yeah, so that's it.

Michael: I think that's what I
find.

Like, I found this project quite
confusing to get, and not just

this 1, like all DuckDB Postgres
related projects, like I find

it hard to get my head around,
and partly because it's like a,

feels like a bag of features, like
firstly I can copy to Parquet

format and I can offload queries,
like I can query through Postgres

to Parquet format, but I can also
execute a query that's completely

local to Postgres using the DuckDB
engine.

That's a totally different feature,
but it's also the same extension.

It took me a while to get my head
around that these are different

use cases, but all provided by
the same extension, if that makes

any sense.

Jelte: Yeah, that's definitely...

I totally understand that it's
a bit confusing.

1 reason, I guess, is because DuckDB
can do so many things.

Yeah.

So it's like, okay, and you kind
of want to be able to combine

all of these things.

So it's like, oh yeah, you want
to join a Parquet file with,

you want to read the Parquet file
and join it with some Postgres

table to, for instance, sort of
hydrate the user IDs with the

recently updated user information.

So that the username is actually
what it is now and not what

it was like 2 months ago when the
record was created in the Parquet

file.

And that sort of means that DuckDB
really needs to be able to

read Postgres tables.

But then, I mean, if you want,
you could also just make it read

Postgres tables also fine.

So it's, I think both, I mean,
both Hydra and MotherDuck, we

had some sort of, we had some idea
about what features we think

are important.

And I mean, you start a bit from
there and like, oh, these are

the use cases we want to support.

And then how can we glue all those
use cases together in a way

that works nice.

And then you also end up with some
other use cases you didn't

even think about.

Joe: Yeah,

it's a really good point.

I've seen Postgres itself described
by many people, even myself,

as a multi-tool type of database.

With adding DuckDB into Postgres
makes it probably even more

of a multi-tool database where
the average user is not going

to use every feature, nor should
they try to.

But if there's something there
that's useful to them that just

kind of comes along with DuckDB,
certainly there's no reason

for us to try and remove it or
disable it.

You kind of get all of these DuckDB
cool features along with

now Postgres, just as a complete
kind of all-in-one multi-tool

type database.

In fact, I've kind of started to
use the term like multi-store

a little bit.

I think single store and unistore
was taken.

So, you know, but really what's
what's happening is you have

2 different databases kind of working
working together and the

data is also can be stored in multiple
formats both in, you know,

heap as well as Parquet.

So, I mean, that could be another
way to think about it.

Michael: Yeah.

Yes.

Because of time, I'm starting to
think about wrapping up.

Do you guys want to talk a little
bit about what you're working

on at the moment or what your plans
are for the future?

Jelte: Sure.

So I think there's a, I mean, we
released like the latest version

like a week ago or something.

And the things we're working on
now, we're working basically

on 2 big things.

1 is redoing the whole way we read
from Postgres, because before

we were sort of doing low-level
reads, sort of like the actual

bytes, sort of re-implementing
the heap reader in a sense.

But that means you don't get index
support automatically, then

you also need to implement the
index reader and stuff like that.

So, so now we're trying to more
sort of build a query again to

do, to do that, like to build a
Postgres query again, and then

give it to the Postgres planner
and then make Postgres figure

out, and that's, that's working
very well because then, then

you get foreign data wrapper support,
so that you can call foreign

data wrappers from Postgres from
within DuckDB again.

So it's definitely all the way
down.

So that's 1 big feature.

And the other is, 1 thing people
have been running into, sort

of we've seen people run into in
the first 2 versions is that

to be able to read from Parquet,
like you use the DuckDB functions

to read Parquet files and reads
Iceberg files, Iceberg catalogs,

you need to know what the columns
are.

And right now you need to specify
those columns and the types

in your query, which is a bit of
a hassle.

So I'm working on a way to make
it so that you don't have to

do that anymore.

Because Postgres wants to be very
strict about these things.

And the DuckDB, like, there's 1
of the big differences between

DuckDB and Postgres.

Postgres is very strict about certain
things.

And DuckDB is like, oh, you typed
this, you probably meant to

do this thing.

This kind of makes sense.

I can understand what you're doing,
what you're trying to do

here.

Let me give you the sensible result.

And Postgres is much more like,
oh, this is not exactly how I

expected it.

So, that's sort of where there's
a bit of a problem.

So I'm trying to make the UX there
a lot nicer.

And I have a way now that's working
quite well, but it needs

a bit more polish.

So yeah, probably after Christmas
or after New Year's, I'll also

create a VR to get that easier
to work with.

So you could just do SELECT *
from read CSV and .f2, or read

parquet and .f2 to specify the
callouts.

That's sort of the user experience
we want.

But right now you need to do SELECT *
from read parquet and

then specify all the callouts that
you actually want to read

from the parquet file, which kind
of defeats the purpose of the

star.

Michael: Makes perfect sense.

And how about from your side, Joe?

Joe: Yeah, I think Jelte covered
it well, as far as the way that

Postgres scan tables.

That should be merged in, I think,
relatively soon, and then

we'll let it bake before the next
you know official release.

As Jelte mentioned, we released
a past version just a week ago

so we have a pretty good clip at
this point of shipping new releases.

If there's something that folks
want to see, certainly head on

over to the repo and open an issue
or let us know.

And then as far as Hydra goes,
as I mentioned, we have a managed

cloud Postgres service that's in
beta that has integrated pg_duckdb.

So we're looking to roll that up
out into public beta, so people

can begin to use pg_duckdb without
having to self-manage it.

Today, you can obviously take pg_duckdb
and put it on EC2 box

with Postgres or whatever you want
to do.

So yeah, looking forward to shipping
that out.

Nikolay: Yeah, that's great.

I just thought, like, 20 years
ago, MySQL was dominating in open

source database field, and they
have several engines.

I remember you needed to pick up,
1 was not crashing, and the

other was had the full tech search.

I mean, my son and then the DB,
it was, yeah, trade off.

And now like 20 years later, we
talk about various engines, basically,

cloud friendly, because it's, you
can upload it to object storage.

It's amazing that we can combine
these in Postgres right now,

and it looks like it's only starting,
right?

A lot of stuff can be added, developed,
and so on.

It's only the beginning as I see
it.

Joe: Yeah, it's pretty interesting
to see Data Lake come to Postgres.

It goes to further your point,
which is many of the layers of

the modern databases are being
pulled apart.

There's multiple engines, there's
object storage, there's compute

with Postgres.

So I think that trend is continuing
to play out.

As far as the average app developer
is concerned, like I was

indicating before, I don't think
they necessarily need to know

or care so much about the differences
between a database versus

a data lake.

So hopefully, The interface being
Postgres helps bridge the gap

of using something really cool
like DuckDB, but together with

what you're used to using in Postgres.

Nikolay: Yeah, that's cool.

Michael: Nice 1.

Yeah.

Well, thank you so much both for
joining.

Thank you, Nikolaj, as always,
and catch you soon.

Nikolay: Yeah.

Thank you guys.

Jelte: Thank you.