How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.
Nicolay Gerold: Many companies use Elastic or OpenSearch, but only use 10 percent of its capacity. They also have to build ETL pipelines, get data normalized, and worry about race conditions. All in all, at the moment, when you want to do search but you already have existing data in a data warehouse or in a Postgres instance, you are forced to build a distributed system. But not anymore.
ParadeDB is building an open
source Postgres extension to
enable search within your database.
And today I'm talking to Philippe
Noël, the founder and CEO of ParadeDB.
And we talk about how they build
ParadeDB, how they integrate into the
Postgres query engine, and how you
can build search on top of Postgres.
Let's do it.
Philippe Noël: That's why we went with Rust. The only two options if you're building Postgres extensions are C or Rust. Although now there's another company trying to make it happen in Zig, at the time that was not possible. And I don't think their project is mature enough to really build a state-of-the-art product on it yet anyway. Zig is just beginning to be possible, and the Zig ecosystem is so small. Most data infra is written in Rust nowadays, I would say.
Nicolay Gerold: What wouldn't have been
possible if you hadn't written it in Rust?
Philippe Noël: Aside from Tantivy?
Nicolay Gerold: Yeah.
Philippe Noël: That's one really big one already. The second one, I would say, is just the speed at which we can develop things. It would have taken so much longer if we did them in C, and that would have been a big problem.
We also built this project called pg_analytics, where we bundle a library called DataFusion inside of Postgres; that would not have been possible without it. We also play around with DuckDB. You can do that in C pretty easily; it's probably easier in C, actually.
So that one is a counterexample. We're also thinking about new columnar file formats for our work, to replace Arrow, because Arrow doesn't have good random access patterns, basically. All of those are written in Rust. The only new file format written in C++ is Nimble, by Meta.
It's not that mature yet, yeah.
Nicolay Gerold: So you will end
up going with Lance, probably.
Philippe Noël: Lance or Vortex. Yeah, I don't know if I'm fully happy with what Lance does for us. It's very good, but it's very good for them. I don't know if it makes sense for us. It's possible we use it.
And in analytics we are borrowing a lot of the work from DataFusion. But I agree, it's very exciting what's happening there.
Nicolay Gerold: It's like you have the easy-entry ones, like Weaviate and Qdrant, the newer ones. Then you have Elasticsearch and OpenSearch somewhere in the middle, and then you have the really advanced one, Vespa. Basically in complexity, but also in what you can build. Where would you place ParadeDB on that spectrum? And feel free to add your own categories and how you differentiate from them as well.
Philippe Noël: Yeah, I think we're really trying to just be a competitor to Elastic. So feature-wise and complexity-wise, we're trying to be in the same category as Elastic, but we're trying to bring that with the ease of use that maybe the Qdrants and Weaviates have. For context, let me put it this way: most people use Elastic. Elastic is the most popular one, right? It's the most popular one for a reason. I have a theory that a lot of new search engines are not very different from Elastic. They're also NoSQL databases; they suffer from the same set of trade-offs. Sure, it might be rewritten in Rust or a different language, but they're not orders of magnitude different, right? ParadeDB is trying to be that. We're trying to build it in SQL. We're trying to build it in Postgres, which is the most popular, or at least the fastest-growing, SQL dialect. I think we can build something that's equally powerful but solves most of the trade-offs that Elastic has. So that's really where we're trying to place ourselves.
Nicolay Gerold: Yeah.
And can you maybe, just for anyone who doesn't know it yet, give a quick introduction to what ParadeDB is?
Philippe Noël: Sure, yeah. I'm one of the people building a project called ParadeDB. It's an open-source project. We're building a competitor to Elasticsearch inside Postgres. What this means is we're building full-text search and analytics, or aggregations, inside Postgres. All of this is built as an extension.
The benefit: typically, the canonical workload customers have is they use Postgres as their relational database. They're very happy with it. Obviously, Postgres is wonderful at many things, but it's not very good at being an analytical or search database. When customers need to do user-facing search or user-facing analytics in their products, they need to bring in another search database, usually Elastic, and start to move data from Postgres to Elastic.
That can cause a lot of issues.
You have clusters out of sync.
You have data that's not fresh.
The cost is much higher.
You need to denormalize the data,
which means higher data duplication.
You don't have transaction safety
on your Elasticsearch cluster.
ParadeDB solves all of these things by essentially bringing the feature set of Elastic inside Postgres directly.
And the way we do this under the
hood is we essentially implement
some of the technologies that make
Elastic powerful inside Postgres.
So we're building an inverted index, columnar storage, and a complex query builder, like you have in Elasticsearch, where you can string together really complex search query logic and analytics query logic inside Postgres, so that we can bring those feature sets over.
Nicolay Gerold: Yeah, and I think, especially in Postgres, part of what makes it so powerful is the query engine that actually optimizes your queries under the hood before they're executed. Are you also hooking directly into that component for the different search queries you're constructing?
Philippe Noël: We are.
Yeah, we are.
We work pretty deep inside Postgres.
Nicolay Gerold: Yeah, can you maybe double-click on how that works and what it actually enables you to do?
Philippe Noël: Sure, absolutely.
Within databases, you have two core pieces. You have the planner: the planner takes your query and essentially plans how the query will be executed.
And the executor will go and execute
that query against the storage
that you have in the database.
Postgres is really extensible,
and it gives you the ability to
hook in at both of these layers.
At the executor level, for example, what we do is intercept queries that have a specific operator. For anyone that's familiar with writing search queries in Postgres, you know that you can do that with the @@ operator. That's the built-in way to phrase full-text search queries in Postgres. We basically introduce our own, @@@, which has an extra at. When you use that, we hook internally into the Postgres executor to tell it: hey, when you see a query that matches this, we want it to be executed against our columnar storage and our inverted index that has BM25 scoring, for example, which is very important but nonexistent in Postgres. That gets rewired internally to our code paths within Postgres, and that's one of the ways we can do a very seamless integration that's significantly more powerful.
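To make that concrete, here is a minimal sketch of what this looks like in SQL. The table and column names are hypothetical, and the exact index options vary between ParadeDB versions, so treat this as the general shape rather than copy-paste DDL:

```sql
-- Hypothetical product table.
CREATE TABLE products (
    id          SERIAL PRIMARY KEY,
    description TEXT,
    rating      INTEGER
);

-- Create ParadeDB's BM25 index over the table (sketch; the
-- exact DDL and options are documented at docs.paradedb.com).
CREATE INDEX products_search_idx ON products
USING bm25 (id, description)
WITH (key_field = 'id');

-- @@ is Postgres's built-in full-text operator; the extra @
-- routes this query to ParadeDB's BM25 index instead.
SELECT id, description
FROM products
WHERE description @@@ 'keyboard'
LIMIT 10;
```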
Nicolay Gerold: Yeah, and maybe on the opposite end: where would you see the point where Postgres in itself starts to break as a search database, even with ParadeDB? In terms of write and read load, data size, the latency you're trying to reach?
Philippe Noël: Yeah, good question. I'm going to have a slightly different answer: I don't think any load has been a core problem for us so far. There's nothing Postgres technically can't do at very high scale if you scale it. Our biggest customer, Alibaba Cloud, is using ParadeDB inside a data warehouse of theirs. That load is truly massive, and they haven't seen any problems with it.
The area where I would say we're not well suited is when people are working with very large amounts of JSON-type documents. ParadeDB works wonderfully if you have a structured relational format, which is the bread and butter of the customers we service.
They store data in Postgres and
they want to keep it there, right?
If you're using MongoDB or you're
using some sort of NoSQL document
database and everything works with
JSON, you can do that in ParadeDB as
well, and it's going to work well.
We have customers that do it, but that's an area where using a NoSQL search engine might make more sense for you.
Nicolay Gerold: So basically it assumes that you put the text into text fields and not into more advanced JSON structures.
Philippe Noël: You can. We can tokenize any column or any data type in Postgres, but it's going to be easier to reason about if your data is structured in tables rather than in JSON.
Nicolay Gerold: Yeah. And what is the power that actually comes with building on top of Postgres? You mentioned a bunch already, but can you go deeper, especially on the transaction part? Why would I want transactions for my search queries as well, or for the queries running on my search application?
Philippe Noël: Yeah, good question.
One of Elastic's biggest drawbacks is it's known for not being very reliable. Clusters go missing, or the data is not fresh because you need to ETL it over, and so on. Having transactions just guarantees that your data is consistent, basically.
And that's very important.
So we work with a lot of companies in the financial sector, a lot of fintechs. With financial data, you can't have duplicated data.
You can't have missing data.
That's just so incredibly important.
Imagine you go to your bank account and you have a missing transaction, or you have a duplicated transaction; you're going to be very unhappy.
For these folks, it's really important that they have really good data integrity, and ParadeDB makes that possible, versus something like Elastic. We have had customers come to us, show us their fintech applications, and tell us that data is not present in their database anymore, because, say, you roll back a commit in your Postgres and suddenly that commit didn't get rolled back in your Elasticsearch cluster. That can be a really big pain point. So I would say that's one of the biggest ones.
That's typically the main
reason people come to us.
We remove the ETL, and we give you transaction safety, which means the data is fresher and has very strong integrity.
And then there are other, smaller benefits that ParadeDB has. It's about five times faster on a single node compared to Elasticsearch. You don't need to duplicate your data, so the cost is much lower. Those things play a nice part, but they're typically more nice-to-haves rather than the core reason people get excited about what we do.
Nicolay Gerold: Yeah, do you see yourself as part of the larger trend of combining new capabilities within Postgres, like for example Iceberg tables, or running analytics with DuckDB?
Philippe Noël: We do see ourselves as part of a trend of composable data systems, where people glue together more state-of-the-art libraries inside Postgres. We have done some work with DuckDB. But I do think one thing that's very important in Postgres is to do the work in the Postgres query engine. We have tried to cut corners before: we've tried to use DuckDB, we've tried to use DataFusion to accelerate analytical queries. There's a lot of overhead that comes when you start to jumble multiple query engines together, and we found that to be a big problem. It's really hard to give the data integrity that I was mentioning is so important. So I do think systems will consolidate, but I don't think they'll consolidate and use all the tools in the exact same way people think. A lot of the work you just have to build natively in Postgres.
Nicolay Gerold: So for you guys, does that mean you also plan to run the analytical queries, which are one part of your feature set, in Postgres over time, or are you already doing that?
Philippe Noël: We're already doing that, and we're going to be doing more of it. If you're familiar with Elastic, our core mission is to build an Elasticsearch competitor, right? We started by building the search capabilities, and what we found out is the analytics capabilities are very important too. They're a very core part of Elasticsearch as well, and you need both in order to deliver real value. If you use a search engine, oftentimes you'll return results and you'll see there are 100,000 results that exist on this e-commerce website, something like that, right?
That's an example of an analytics query. So we started looking into building analytics. We first looked at it by using DataFusion, and then by using DuckDB. Each time, the performance was really fast; you can get really quick performance really quickly, but it didn't integrate well within Postgres. It was very difficult to get the MVCC reads, the transactional guarantees, to be really good, because analytical engines typically don't really care about transactional safety. That's not an important part for them. But for Postgres, it is.
So what we found out is if we wanted to do that, we needed to build it ourselves. Today, in that executor hook workflow that I described, we also store data in columnar format in a fully Postgres-transactional way. Then we crunch the aggregate numbers by accelerating the way Postgres operates on that data. We're not using DuckDB, we're not using DataFusion. We're building similar designs to what they've done, but we're doing it directly in the Postgres planner.
Nicolay Gerold: Does this also mean that the Postgres that runs ParadeDB should run on particular hardware to take more advantage of the columnar format your data is in?
Philippe Noël: I would say yes and no. We recommend our customers use storage-optimized instances. On Amazon, they're the I-something, I4-something series; I don't know what they are on the other cloud providers. Those are quite similar to the instances that are typically recommended for regular Postgres boxes. The reason being, when you're doing database workloads, the most time-consuming piece of the hardware is reading the data from disk, and that's reasonably similar to a transactional workload.
Nicolay Gerold: Yeah, and just to catch people up: MVCC (multiversion concurrency control) basically means that you can have multiple versions of the same row in your database, so multiple transactions can interact with the same pieces of data. Each one sees a version based on the snapshot taken when the transaction started.
Philippe Noël: Correct.
Correct.
Yeah, it enables concurrent access, and it's one of the crown jewels of Postgres. There are things wrong with it, but it's one of the things that make Postgres so good. Postgres is known for being a really reliable database: your data is always there and it's always correct. And MVCC is how that's possible.
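A quick illustration of that behavior on any stock Postgres (hypothetical accounts table, default READ COMMITTED isolation):

```sql
-- Two concurrent sessions against a hypothetical accounts table.

-- Session A:
BEGIN;
UPDATE accounts SET balance = 90 WHERE id = 1;
-- (A has not committed yet.)

-- Session B, meanwhile:
BEGIN;
SELECT balance FROM accounts WHERE id = 1;
-- Returns the old value (say 100): B reads the row version from its
-- own snapshot and is not blocked by A's in-flight write.

-- Session A:
COMMIT;

-- Session B:
SELECT balance FROM accounts WHERE id = 1;
-- Now returns 90 under READ COMMITTED.
COMMIT;
```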
Nicolay Gerold: Yeah, and especially when I'm looking at Elastic, what a lot of people underappreciate is the analytics that come with it, because it also has a columnar storage format; it doesn't just have the inverted index. What are you planning there, and what are you already offering? You usually have click data when we're talking about search, you have personalization vectors, you have re-ranking, all of which is more or less linked to the analytics component. How does this work in ParadeDB, and what is the feature set you already have in production right now?
Philippe Noël: Yeah, so in ParadeDB we already have columnar storage, and we already have vectorized processing as well, which is how you can do really fast aggregation over that columnar data. It's not generalized yet; it's not generalized to all click data, for example. It's used specifically in the context of what we call faceted search, which is essentially analytics over full-text search results. That can mean bucketing results, right? You're on a search engine like Amazon, and you want to bucket results by star ratings: four stars, three stars, 4.5 stars, things like that. Or you want to count the total number of results there are. That's what we do today, specifically in the context of search. As we go, we do plan to generalize it to the kind of general aggregate operations you might do on ClickHouse or time-series databases.
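Since facets here are just SQL over indexed search results, a sketch of the Amazon-style example might look like this (hypothetical schema, reusing the @@@ operator from earlier):

```sql
-- Total hit count for a search.
SELECT COUNT(*) AS total_results
FROM products
WHERE description @@@ 'keyboard';

-- Facet: bucket the same result set by star rating.
SELECT rating, COUNT(*) AS num_results
FROM products
WHERE description @@@ 'keyboard'
GROUP BY rating
ORDER BY rating DESC;
```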
Nicolay Gerold: Nice.
What would you say is the killer use case for ParadeDB at the moment? Because with Elasticsearch, what most people don't know is they started out really heavy on log data, and still are.
Philippe Noël: When Elastic started, its killer use case was actually user-facing search, and very quickly they moved into observability. For us, what we do today, I would say, is transaction-safe, no-ETL user-facing search and analytics. We power search for financial technology companies, as I mentioned. We power analytics for productivity companies that you probably use; many of them will be releasing announcements in the next few months. Really, user-facing search and analytics: dashboarding, tables, filtering. We're big in sales automation as well. E-commerce is another big one, in addition to the ones we service. We may go towards the observability world eventually, but first we've got to build a really good foundational database before we start looking at that.
Nicolay Gerold: Yeah, and what would
you say are the things you want
to take with you from the other
databases and bring into ParadeDB?
So what are the things you actually
really like about certain other databases?
Philippe Noël: For Elastic, for example, I really like the feature richness of the queries, right? You can do so much, and I think it's important to enable users to do that. We had users request things like percolation, for example, which is reverse querying: you save a search query, and when new data comes in, it's automatically matched against that query, rather than the standard flow, which is to store the data and then run a search query over it. That's a big one.
We're borrowing a lot of inspiration from DataFusion and ClickHouse in terms of how they process analytical queries. What they've done is really state of the art, and this is some of the work we want to do on the analytics front inside of Postgres. And then we're also borrowing some inspiration from the folks at DuckDB and other data-lake-type products. I do believe interoperating with data lakes is going to be important over the next 5 to 10 years. So we're keeping our eyes peeled for work we can do there.
Nicolay Gerold: Yeah.
And what degree of interoperability do you mean by that? Is it actually running search on top of open table formats as well? Or rather offloading work, for example for the analytics?
Philippe Noël: For now, we do everything inside Postgres; that's what our customers want. We do have another project that uses DuckDB that can read data from open table formats on S3 or Google Cloud Storage or Hugging Face, things like that. That makes it really easy to load data into Postgres and then do search and analytics over it. That's one of the ways we interoperate. Eventually, we'll also want to be able to write data to object storage, so that you can offload colder data to a cheaper storage mechanism. I do think there will always be a place for hot data inside Postgres and some cold data in object storage, and we just want that transition to be very seamless for people.
Nicolay Gerold: What was the thinking behind using DuckDB for reading, rather than foreign data wrappers?
Philippe Noël: So we do both. It's built as a foreign data wrapper which internally uses DuckDB. That makes it just much easier, right? DuckDB abstracts away the entire workflow of interoperating with multiple file formats, multiple object storages, things like that. But it is a foreign data wrapper with DuckDB inside of it.
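The general Postgres FDW shape looks like the sketch below. The wrapper, handler, and option names here are illustrative, not the project's actual API; check the docs for the real syntax:

```sql
-- Register a (hypothetical) wrapper whose handler reads
-- Parquet via an embedded DuckDB.
CREATE FOREIGN DATA WRAPPER parquet_wrapper
    HANDLER parquet_fdw_handler;

CREATE SERVER object_store
    FOREIGN DATA WRAPPER parquet_wrapper;

-- Expose files in object storage as a regular-looking table.
CREATE FOREIGN TABLE sales_cold (
    id     BIGINT,
    amount NUMERIC
)
SERVER object_store
OPTIONS (files 's3://my-bucket/sales/*.parquet');

-- Queried like any other table; DuckDB does the reading underneath.
SELECT SUM(amount) FROM sales_cold;
```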
Nicolay Gerold: Interesting.
And the most underappreciated feature of DuckDB, for me, is the CSV parser.
Philippe Noël: Yeah, exactly.
Yeah.
For a lot of things, JSON and the like, it works quite well too.
Nicolay Gerold: Yeah, nice.
Maybe let's move into the case study you posted, the Alibaba Cloud one. Can you give a little bit of background on what they wanted to build, and how they ended up using ParadeDB for their specific problem?
Philippe Noël: Yeah. So Alibaba reached out to us in February of this year, so it's been a while now. They were building a Postgres data warehouse, and one of their customers wanted full-text search in that data warehouse. If you want to do full-text search in Postgres, there are basically three things you could do. They could use the regular tsvector built into Postgres; that's very limited, so they didn't want to do that. They could use Elastic, but then they wouldn't really be offering a singular product; they would need to offer two databases glued together, which is not great. Or they could use ParadeDB, or try to rebuild it. They reached out to us and decided to work with us and purchase the product rather than rebuild it, and then they integrated it within their data warehouse.
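For contrast, the built-in option mentioned here looks like this; it works, but it is limited compared to a dedicated engine (hypothetical table reused from earlier):

```sql
-- Stock Postgres full-text search with tsvector/tsquery.
SELECT id, description
FROM products
WHERE to_tsvector('english', description)
      @@ to_tsquery('english', 'keyboard');
-- Ranking is limited to ts_rank / ts_rank_cd; there is no
-- BM25 scoring, and tokenization options are comparatively thin.
```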
Nicolay Gerold: And what is the difference between the data warehouse context and general-purpose search?
Philippe Noël: That's a good question. I don't think it's that different, to be honest. There are differences in the nuances of how the data is stored, the type of data, how it gets searched over, things like that. But that's something we've mostly left to Alibaba to figure out themselves. They are modifying our work slightly in order to integrate it internally within their product. But from our side, it's just Postgres, right?
ParadeDB is building a Postgres
compatible product, and that
data warehouse is Postgres based.
So as long as they remain Postgres
compatible, a lot of the interoperation
of our work with their work is taken care
of by the core Postgres query engine.
Nicolay Gerold: Yeah, nice. I want to double-click a little more on the complexity behind it. You already mentioned you have two different core components: on the one hand you're hooking into the execution engine and the query formulation, but there's also the storage layer, how the data is laid out. Can you take it bottom-up? What changes did you make at the storage layer, and how does it differ for different types of data? Is, for example, text data laid out differently than the data for your analytics queries, so the interaction data, for example?
Philippe Noël: Yeah, good question.
What we do is create a new custom index in Postgres that we call the BM25 index. For context, for those that are not familiar, BM25 is a ranking algorithm for search results, and it's known to be the state of the art. It's what Elasticsearch uses; it's what the most powerful search engines use, perhaps even all of them, and Postgres doesn't support it. So we do that. Within that index, it is an inverted index, so it stores mappings between tokens and their locations, within the tables in our case, within the documents in the corpus. That data gets stored inside a standard Postgres index, and, I'll spare you the complexities behind it, you have a file format that gets stored on the blocks within the Postgres pages. All this to say, this is how we work at a high level.
And then we also have the analytics queries. Within our index, when you index different types of data or different fields, you can specify whether you want them to be indexed as what we call fast fields. Fast fields means they're indexed in columnar format. If you don't do that, for example with a text column, it's stored in the regular inverted-index format. But if you have a numeric field or an integer field that you may want to aggregate over, we recommend you store it as a fast field. It's still stored within the index, but in a columnar format.
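As a sketch, declaring which fields are fast fields happens at index-creation time. The option names below follow the general shape of ParadeDB's documented DDL but may differ by version:

```sql
-- 'description' goes into the inverted index; 'rating' is additionally
-- stored as a columnar "fast field" for aggregations.
CREATE INDEX items_search_idx ON products
USING bm25 (id, description, rating)
WITH (
    key_field      = 'id',
    text_fields    = '{"description": {}}',
    numeric_fields = '{"rating": {"fast": true}}'
);
```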
Nicolay Gerold: Yeah, and when I look especially at the comparison with Lucene: you have the segment info, and also field data, norm data, normalization data, often the TF-IDF parts. What are the components you took with you for the new index you created? What's the additional data you're storing for each index?
Philippe Noël: Yeah, we do the exact same. In fact, the search engine that we use, Tantivy, is very strongly inspired by Lucene. It's not exactly identical, but it's very strongly inspired, so we store the exact same things. All of the components you describe are required to calculate the BM25 score. On top of this, we store data related to Postgres; that's one difference compared to Lucene. As you were mentioning, we store data to make it compatible with the Postgres index access method. But we have the data stored in the tables, so I don't think we store any extra data. The extra data would be the transaction-safety-related information, but that's stored within the rows in the tables themselves, not in the index.
Nicolay Gerold: Yeah, and can you maybe walk us through it? You're setting up a new database, you're inserting documents. What's happening behind the scenes as new documents hit the database? What happens to the documents, how are they preprocessed and inserted?
Philippe Noël: Yeah, so Postgres has this API called an index access method. That's what lets developers like us define custom indexes in Postgres. When you define your index over a table, it has triggers in place to process new data as it gets modified or inserted. So whenever a new row gets inserted into a table on which you've created a ParadeDB BM25 index, that row automatically gets grabbed by the Postgres index. It's tokenized following the tokenization strategy you've defined; in our case, all of this is handled by the search engine, Tantivy, which does the tokenization into the inverted index, and all of that gets stored on disk. Then when you go and write a query, that query will be executed against the index, assuming you use the proper syntax. And if you modify, update, or delete the row, the same sort of process happens and updates the index accordingly. What this means is, if you make large updates, it might take some amount of time for the indexing to finish. It's quite quick, but it could be half a second, a few seconds, maybe even more, depending. If you're indexing an entire new table, it can take minutes or hours.
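In practice that means indexing rides along with ordinary DML; a minimal sketch with the hypothetical table from before:

```sql
-- The BM25 index picks this row up automatically; no ETL step.
INSERT INTO products (description, rating)
VALUES ('mechanical keyboard with blue switches', 5);

-- Searchable immediately, under the same transactional
-- guarantees as the rest of your Postgres data.
SELECT id, description
FROM products
WHERE description @@@ 'keyboard';
```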
Nicolay Gerold: How do you actually allow the user to tailor their search application for a specific use case by constructing really complex queries, while still maintaining the level of simplicity and the deterministic language that SQL brings with it?
Philippe Noël: Yeah, that's a very good question. That's one of the things I think we do quite well compared to Elastic. If you look at our docs, you can see this. We have our general full-text search API where, as I mentioned, we use the @@@ operator. So you could define a really simple search query: search a product catalog where the description mentions "blue", and you get all the blue items you might be selling, or "keyboard", and you get all the keyboards you're selling, something like that. You can write very simple queries with this, almost in pure PostgreSQL.
Now, we also have something we call the Query Builder API. Our Query Builder API, which is in our documentation as well, is a bit similar to the Elastic DSL. What this means is we expose the entire underlying search engine, exactly like Elastic does, so you can construct really complex queries. You can do booleans between results. You can do disjunction max for tie-breaking, for example. You have really complex access, and you can do it in SQL. But you can also do it in JSON, similar to how you would with Elastic. That's been pretty important for our customers. Some of them have queries generated programmatically by their application based on what the user clicks; it programmatically generates the query they want to run, and that's just easier to do in JSON. So we expose that as well.
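A sketch of what the query-builder style looks like in SQL. The function names follow the general shape of ParadeDB's documented API, but the exact signatures and the left-hand operand vary by version, so treat this as illustrative:

```sql
-- Boolean query: must match 'keyboard', optionally boosted by 'blue'.
SELECT id, description
FROM products
WHERE id @@@ paradedb.boolean(
    must   => ARRAY[paradedb.term('description', 'keyboard')],
    should => ARRAY[paradedb.term('description', 'blue')]
);
```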
Nicolay Gerold: Can you give me an example of a query that's constructed as the user clicks through the dashboard?
Philippe Noël: Sure. Let's say you're a sales automation company, a product that sends emails on behalf of customers. You might have a table where I'm a user of yours, and I can say: okay, I want companies based in Europe that have between 100 and 500 employees, that are in the manufacturing space, that are less than five years old. Let's say that's who I want to email, right? When you go and click these, behind the scenes it's going to generate a query that says: okay, take this part of the data set and not this part of the data set, and so on. So oftentimes those queries get generated programmatically, and it's easier to do that in JSON. We make that possible with ParadeDB as well.
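The generated query is then just SQL, mixing the full-text predicate with ordinary filters; something like this hypothetical example:

```sql
-- What the UI's clicks might compile down to (hypothetical schema).
SELECT company_id, name
FROM companies
WHERE description @@@ 'manufacturing'
  AND region = 'Europe'
  AND employees BETWEEN 100 AND 500
  AND founded_at > now() - INTERVAL '5 years';
```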
Nicolay Gerold: Yeah.
And when you see someone setting up ParadeDB for the first time, expecting to build a large search application, what would you tell them? How do they future-proof it for larger scale? What are the settings and setups they should consider from the get-go?
Philippe Noël: Yeah, good question. The good thing with Postgres is you don't really need to overthink it too much early on, because you can always modify these things later. Postgres scales vertically very well. For most people, we ship with a set of defaults that are pretty straightforward, and I don't think you need to modify too much. One thing we recommend is that you set the parallelization parameters of Postgres to match the number of CPUs on your machine. That has a really big impact on performance, specifically for analytics.
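These are stock Postgres settings; for example, on a 16-core box you might set something like this (values are illustrative):

```sql
-- max_worker_processes requires a server restart;
-- the other two take effect on reload.
ALTER SYSTEM SET max_worker_processes = 16;
ALTER SYSTEM SET max_parallel_workers = 16;
ALTER SYSTEM SET max_parallel_workers_per_gather = 16;
SELECT pg_reload_conf();
```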
We have a page on that in our docs. Then, from an infrastructure and architecture standpoint, we recommend our customers follow the same sharding and partitioning strategy in ParadeDB as they do with their Postgres.
It's quite common for companies with larger data volumes. When you use Postgres, what a company will do is start with a single Postgres database, right? A monolith they store all of their data in. Eventually, that's going to be too big; they have too much data to do that. So they're going to need to partition the data or shard it across multiple Postgres machines. When you do that, using something like Elasticsearch can become complicated, because you have two different ways to think about how the data is structured. But ParadeDB can use the same partitioning and sharding strategy that your transactional Postgres does, so that's what we recommend.
Any best practices customers are used to in Postgres, we recommend they do in ParadeDB as well. You can define schema migrations and database migrations in ParadeDB because it's Postgres, so it's actually quite easy to map that across your entire infra.
Nicolay Gerold: What is the difference between searching on a single node versus a sharded database?
Philippe Noël: Most of the customers that use us in a multi-node setup today route their queries based on where the data is stored. Let's say they have two nodes; they'll have half of the data on one node, half on the other. If you design your partitioning and your sharding properly, you're not going to need to do cross-machine queries, so that's pretty simple. Obviously, you might want something like Elastic, where you don't really need to think about it and everything just works. ParadeDB does not support distributed search in that fashion today.
We have actually found that our big customers don't need it. When they do need it, we will probably build it, and then I'll be able to tell you how we're going about it. We have a few ideas for how we want to do it, but for now, we haven't dived too deep into it.
Nicolay Gerold: Yeah, I would just expect that at some point I'll hit a query where the sharding strategy was, in the end, the wrong one.
Philippe Noël: Usually the companies that shard their Postgres, at least the ones we work with, have very smart engineers, and those people design it really well. So far we haven't had that need, but I am certain it will happen for sure.
Nicolay Gerold: Yeah, once the less smart engineers come in.
Philippe Noël: It's not that, but you know what I mean? Eventually it'll happen. But for now we haven't needed to. And that's one thing I do want to talk about: Postgres scales vertically very well. And because ParadeDB works with joins, while Elastic doesn't support joins, you don't need to search over as large a dataset.
We have, for example, customers that have 50-terabyte, 100-terabyte Elastic clusters, right? They come to us and say: hey, can ParadeDB work over 100 terabytes? Because that's how much data I have in my Elasticsearch. But in reality, the question you should be asking is: can ParadeDB work over the largest table that I put into Elasticsearch? Because if you have a hundred tables that are each one terabyte, ParadeDB is going to do very well, because it's just one terabyte at a time, and then you join the result sets at the end. In the case of Elastic, they don't support joins, so you have to search over a hundred terabytes no matter what. We don't have that problem, because we have the relational structure. So even with a single node you can get very far, because usually no one has a hundred-terabyte table in Postgres. People have a few terabytes at most.
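A sketch of the GitHub-style case discussed later in the outro: one BM25 index per table, with a plain SQL join stitching the results together (hypothetical schema):

```sql
-- Each table carries its own BM25 index; search them separately
-- and join, instead of searching one denormalized corpus.
SELECT r.name, i.title
FROM issues AS i
JOIN repos  AS r ON r.id = i.repo_id
WHERE i.title       @@@ 'segfault'
  AND r.description @@@ 'compiler';
```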
Nicolay Gerold: Yeah, and I think in the end it comes down to the Redshift study: the large majority of data sets aren't even that big, and even if they are, your queries will only hit a subset. So you can actually do a lot, especially when joins are possible as well, because you can split up the different text fields, for example, into multiple different tables.
Philippe Noël: Yeah, exactly. That's one of the really big advantages of ParadeDB. Oftentimes customers come in and they're like: oh, I have this big Elasticsearch cluster. And by the time they use ParadeDB, they're like: oh, actually, it works very well; I didn't need to think of it as such a big volume.
Nicolay Gerold: Yeah.
How do you actually think about query optimization for ParadeDB? I think everyone knows the EXPLAIN keyword in Postgres, which is your best friend once you know it. How do you think about it in ParadeDB?
Philippe Noël: It's the same. We have made sure that when you run full-text search queries, EXPLAIN and EXPLAIN ANALYZE also provide statistics on what happened in the search index and on the analytics side of the search index. This is what our customers use. We try really hard to conform to the way people use Postgres so that they don't need to change their workflow.
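So the workflow is the one Postgres users already know; the exact plan output for the search index varies by ParadeDB version:

```sql
EXPLAIN ANALYZE
SELECT id, description
FROM products
WHERE description @@@ 'keyboard';
-- The plan shows whether and how the BM25 index was used,
-- alongside the usual Postgres timing statistics.
```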
Nicolay Gerold: Is there any advantage to using Rust for that? Because Rust is very verbose in the error messages it throws. Is there any advantage in the combination of EXPLAIN and Rust?
Philippe Noël: That's a good question. I wish I had a better answer for you. I would need to ask one of our engineers who built this. My understanding is that the error messages we return are, as much as possible, the Postgres error messages that Rust allows us to glue in, because again, we're trying to be as Postgres-y as possible. So I don't think we do anything really specific here. I don't think there's anything meaningfully different, but we do have very thorough error messages, and they are Rust-based. So there's probably some stuff I'm not thinking of.
Nicolay Gerold: What would you say are the most common mistakes people make when they start out with ParadeDB and doing search on top of Postgres?
Philippe Noël: Good question. There are a few. People want to do analytics over fields they haven't specified in columnar format; that's a common one, and then they find the performance is not as fast as they expected. We now default integer fields to be indexed in columnar format anyway, unless you specify that you don't want that, because most people usually want it.
Another one is not configuring Postgres properly. If you want maximum performance on analytics, you do need to make sure Postgres can take advantage of the full hardware. We have instructions on how to do that, and inside ParadeDB it comes preconfigured, but sometimes when people set it up themselves, they make that mistake.
And then on the search front, it is not uncommon for people to misunderstand how the tokenization works and be surprised not to find results with specific tokenizers: understanding how n-gram tokenizers work versus whitespace tokenizers and things like that. We already have some work we're going to be releasing there to make it much harder for people to make that mistake. Sometimes people will write a query, it will return no results, and they'll be pretty confused, while in reality it's not that ParadeDB is misbehaving; the query just isn't doing quite what they thought it would.
Nicolay Gerold: Yeah.
What's on the horizon? What can you already tease? What's next for you guys, what do you have in the pipeline at the moment?
Philippe Noël: First, lots of big case studies are coming up. We've been quiet for a few months as we've been making the product even more enterprise-ready, but I'm excited to release those. From the product standpoint, we have a lot of exciting work coming around fast analytics, so I'm very excited for that to be released in Postgres; there are a lot of people waiting. Lots to be announced over the next few months. And then we have some integrations coming with some other folks in the Postgres world. We've been laying the groundwork to make that possible, and you'll be able to use ParadeDB in more places than you could before. I'm very excited to announce that in a bit.
Nicolay Gerold: Nice.
And when you look at the larger landscape, I think there are a lot of people implementing search, implementing RAG right now. What's your dream tech stack on top of ParadeDB for RAG, for search? What would you add to your stack if you had to build a full-fledged search application?
Philippe Noël: That's a good question. We work with a lot of customers that don't even use RAG that much. But we see pgvector and ParadeDB together: it's very common for people to build RAG products using basically ParadeDB with pgvector for the search stack, and then they'll use some ranking or re-ranking models on top of it, with any of the RAG tooling or RAG pipeline tools that exist. That's typically the standard one.
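A sketch of such a hybrid query, assuming a pgvector embedding column next to the indexed text (the <=> cosine-distance operator is pgvector's; real embeddings have far more than three dimensions):

```sql
-- Keyword-filter with BM25, then rank by vector similarity.
SELECT id, description
FROM products
WHERE description @@@ 'keyboard'
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector
LIMIT 10;
```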
We don't do a lot of work there. Whatever floats your boat, honestly. Postgres is integrated everywhere; it varies greatly.
Nicolay Gerold: Nice.
That's usually my place for the hot takes; I often get people to put some really interesting stuff out there. What's something you would actually love to see built in AI, in data, or in search, that you're not building yourselves?
Philippe Noël: Yeah, something I'm not building. Sure, of course. In AI, search, or data? Let me think.
For one, I think there's a lot that can still be done in Postgres on the data front around better analytical queries. Materialization, like view materialization, is a big one that I think can still be improved upon; I wish that would happen. On the search front, I honestly wish it was easier to deal with synonyms, with synonym parsing, in a lot of search engines. Right now it's being done through vectors in a lot of cases, but we've had a bunch of customers that wanted something different from that. We may build that eventually, to be honest, but I wish there were better handling of it.
And then lastly, another big one that I think will be important is more of the various data tools integrating with object storage. We've had to do a lot of work on that ourselves. There are tools like DuckDB that make this slightly easier, but it's still a lot of work. I think deeper integration between search and analytics engines and object storage would be very valuable for the community. And I know people are working on this, for what it's worth, especially in the Rust data ecosystem.
Nicolay Gerold: Nice.
And if people want to start
building the stuff we just talked
about, where would you point them?
Philippe Noël: Our docs, docs.paradedb.com. You can see everything there; we have a tutorial and everything you need to get started. We have our GitHub readme, so we're at paradedb/paradedb on GitHub. From there you can also find our Slack community, where you can join and ask for help and things like that. It's a pretty vibrant community. Those are the best places to go.
Nicolay Gerold: And if people want to follow along with you, where would you point them?
Philippe Noël: Same thing: give us a star on GitHub. We have a Twitter as well, and we have a Bluesky now that everyone is moving to Bluesky. You can find us at paradedb/paradedb on GitHub, and as ParadeDB basically everywhere else: x.com/ParadeDB, and ParadeDB on Bluesky.
Nicolay Gerold: So what can we take away when we're building search applications? I think if you are on Postgres and you want to build search, it's prudent to start with something like ParadeDB, which is an open-source search extension on Postgres, because your data is already there, and you can start experimenting without the additional cost of setting up ETL and an entire new database in Elastic, plus what you have to consider like hiring engineers that actually know how to work with Elastic. At the moment, database engineers and search engineers are two entirely different job types, but also skill sets. One always hinges on knowing how distributed systems work; database engineering, especially on Postgres, doesn't.
This also means that it's probably really hard, and thanks to David for pointing that out, to find people on Postgres who know how to do search engineering or relevance engineering, because it will be very different from Elasticsearch, which has its own explain, an entire skill set you build up over building a bunch of search applications on Elastic.
Where I think it's really great is financial applications: you have the transactions, you have the safety guarantees, you don't have the race conditions that can happen when you go distributed or go to Elastic, and you don't have to worry about ETL. So it can help in a lot of places. I tried it out.
I like the tool, I like the interface, but I had also used Tantivy, the Rust library, before and run it as an embedded database, so I was already familiar with it.
The explain works; I think it's not as great as in Elastic, where it gives you really a lot of information. But it can catch up. Whether it gets there, I'm not sure.
A massive advantage, in my opinion, is joins, especially if you have a lot of different searches. For example, when we talked about BM25, we had GitHub's case, where they have different types of data: code, blog posts, discussions, issues. Each could be its own table, and you can run a search on top of that, and if you want, you can run joins and join it together, like joining the repos on the issues and stuff like that.
It won't work at GitHub scale, Postgres won't be able to do that, but if you have a use case that's way smaller, it could be interesting. What was also interesting to me is that JSON documents will give you issues, which I would be interested in learning more about, especially why.
And what's maybe a good learning for people building data tools with a query engine: there is a cost when you use multiple different query engines, which is a bit of a trend at the moment, mixing and matching, especially with Ibis, the open-source project that gives you one interface into multiple execution engines. So you can use Ibis to call Polars, to call Postgres, but it comes with a cost.
One major limitation will be just scale: Postgres hits its limits at a certain point. But when you're running search on Postgres, I don't see any reason not to switch to ParadeDB, because it will be more efficient. And, yep, that's it.
I would love to know what
you think about this episode.
It's very different from what I usually do, really highlighting one tool and how it works. So I'd love to hear in the comments what you think and whether you want more like that. Otherwise, we will be continuing the series on search next week. So subscribe, and feel free to comment; it helps me out a lot. I would love to hear critique and feedback, positive or negative.
And yeah, see you next week.