How AI Is Built

Many companies use Elastic or OpenSearch but use only 10% of their capacity.

They have to build ETL pipelines.

Get data normalized.

Worry about race conditions.

All in all: at the moment, when you want to do search on top of your transactional data, you are forced to build a distributed system.
Not anymore.

ParadeDB is building an open-source PostgreSQL extension to enable search within your database.

Today, I am talking to Philippe Noël, the founder and CEO of ParadeDB.

We talk about how they built it, how they integrate into the Postgres query engine, and how you can build search on top of Postgres.

Key Insights:
Search is changing. We're moving from separate search clusters to search inside databases. Simpler architecture, stronger guarantees, lower costs up to a certain scale.

Most search engines force you to duplicate data. ParadeDB doesn't. You keep data normalized and join at query time. It hooks deep into Postgres's query planner. It doesn't just bolt on search - it lets Postgres optimize search queries alongside SQL ones.

Search indices can work with ACID. ParadeDB's BM25 index keeps Lucene-style components (term frequency, normalization) but adds Postgres metadata for transactions. Search + ACID is possible.

Two storage types matter: inverted indices for text, columnar "fast fields" for analytics. Pick the right one or queries get slow. Integers now default to columnar to prevent common mistakes.

Mixing query engines looks tempting but fails. The team tried using DuckDB and DataFusion inside Postgres. Both were fast but broke ACID compliance. They had to rebuild features natively.

00:00 Introduction to ParadeDB
00:53 Building ParadeDB with Rust
01:43 Integrating Search in Postgres
03:04 ParadeDB vs. Elastic
05:48 Technical Deep Dive: Postgres Integration
07:27 Challenges and Solutions
09:35 Transactional Safety and Performance
11:06 Composable Data Systems
15:26 Columnar Storage and Analytics
20:54 Case Study: Alibaba Cloud
21:57 Data Warehouse Context
23:24 Custom Indexing with BM25
24:01 Postgres Indexing Overview
24:17 Fast Fields and Columnar Format
24:52 Lucene Inspiration and Data Storage
26:06 Setting Up and Managing Indexes
27:43 Query Building and Complex Searches
30:21 Scaling and Sharding Strategies
35:27 Query Optimization and Common Mistakes
38:39 Future Developments and Integrations
39:24 Building a Full-Fledged Search Application
42:53 Challenges and Advantages of Using ParadeDB
46:43 Final Thoughts and Recommendations

What is How AI Is Built?

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.

Nicolay Gerold: Many companies use
Elastic or OpenSearch, but only

use 10 percent of the capacity.

And also, they have to build ETL pipelines, get data normalized, worry about race conditions. All in all, at the moment, when you want to do search but you already have existing data in a data warehouse or in a Postgres instance, you are forced to build a distributed system. But not anymore.

ParadeDB is building an open
source Postgres extension to

enable search within your database.

And today I'm talking to Philippe
Noël, the founder and CEO of ParadeDB.

And we talk about how they build
ParadeDB, how they integrate into the

Postgres query engine, and how you
can build search on top of Postgres.

Let's do it.

Philippe Noël: That's why we did Rust. The only two options if you're building Postgres extensions are C or Rust. Although now there's another company trying to make it happen in Zig, but at the time that was not possible.

And I don't think their project is mature
enough to really build like a state

of the art product on it yet anyway.

Zig is just beginning to be possible.

And the Zig ecosystem is so small.

Most of the data infra is written
in Rust nowadays, I would say.

Nicolay Gerold: What wouldn't have been
possible if you hadn't written it in Rust?

Philippe Noël: Aside from Tantivy?

Nicolay Gerold: Yeah.

Philippe Noël: That's one really big one already. The second one, I would say, is just the speed at which we can develop things; it would have taken so much longer if we did them in C. That would have been a big problem.

We also built this project, pg_analytics, where we bundle a library called DataFusion inside of Postgres; that would not have been possible without Rust.

We also played around with DuckDB. You can do that in C pretty easily; it's probably easier in C, actually. So that one is a counterexample.

We're also thinking about new columnar file formats as well for our work, to replace Arrow, because Arrow doesn't have good random access patterns, basically. All of those are written in Rust. The only new file format written in C is Nimble, by Meta. It's not that mature yet.

Nicolay Gerold: So you will end
up going with Lance, probably.

Philippe Noël: Lance or Vortex.

Yeah, I don't know if I'm fully happy with what Lance does for us.

It's very good, but
it's very good for them.

I don't know if it makes sense for us.

It's possible we use it.

And for analytics we are borrowing a lot of the work from DataFusion. But I agree, it's very exciting what's happening there.

Nicolay Gerold: It's like you have the easy-entry ones, like Weaviate and Qdrant, the newer ones.

Then you have Elasticsearch and OpenSearch
somewhere in the middle and then you

have the really advanced one, Vespa.

Basically on, in like complexity,
but also in what you can build.

Where would you place ParadeDB in that spectrum?

And feel free to add like
your own categories, how you

differentiate from them as well.

Philippe Noël: Yeah, I think we're really trying to just be a competitor to Elastic. So I would say feature-wise and complexity-wise, we're trying to be in the same category as Elastic, but we're trying to bring that to the ease of use that maybe the Qdrants and Weaviates have. For context, let me put it this way.

Most people use Elastic.

Elastic is the most popular one, right?

It's the most popular one for a reason.

I have a theory that a lot of
new search engines are not very

different from Elastic, right?

They're also a NoSQL database.

They suffer from the
same set of trade offs.

Like, sure, it might be rewritten in Rust or a different language, but, you know, they're not orders of magnitude different, right? ParadeDB is trying to be that. We're trying to build it in SQL. We're trying to build it in Postgres, which is the most popular, or the fastest growing, SQL dialect.

I think we can build something that's
equally powerful, but solves most

of the trade offs that Elastic has.

So that's really where we're
trying to place ourselves.

Nicolay Gerold: Yeah.

And can you maybe, just for anyone
who doesn't know it yet, give a

quick introduction what ParadeDB is?

Philippe Noël: Sure, yeah. I'm one of the people building a project called ParadeDB. It's an open-source project. We're building a competitor to Elasticsearch inside Postgres. So what this means is we're building full-text search and analytics, or aggregations, inside Postgres. All of this is built as an extension.

The benefit of it: typically, the canonical workload customers have is they use Postgres as their relational database.

They're very happy with it.

Obviously, Postgres is wonderful at
many things, but it's not very good at

being an analytical or search database.

When customers need to do user
facing search, user facing analytics

in their products, they need to
bring another search database,

like Elastic usually, and start to
move data from Postgres to Elastic.

That can cause a lot of issues.

You have clusters out of sync.

You have data that's not fresh.

The cost is much higher.

You need to denormalize the data,
which means higher data duplication.

You don't have transaction safety
on your Elasticsearch cluster.

ParadeDB solves all of these things by essentially bringing the feature set of Elastic inside Postgres directly.

And the way we do this under the
hood is we essentially implement

some of the technologies that make
Elastic powerful inside Postgres.

So we're also building an inverted index, columnar storage, and a complex query builder, like you have in Elasticsearch, where you can string together really complex search query logic and analytics query logic inside Postgres, so that we can bring those feature sets.

Nicolay Gerold: Yeah, and I think, especially in Postgres, part of what makes it so powerful is the query engine that actually optimizes your queries under the hood before they're executed.

Are you also directly hooking into that
component as well for the different

search queries you're constructing?

Philippe Noël: We are.

Yeah, we are.

We work pretty deep inside Postgres.

Nicolay Gerold: Yeah, can you maybe
double click on that how that works

and what it actually enables you to do?

Philippe Noël: Sure, absolutely.

Within databases, you have 2 core pieces.

You have the planner.

The planner takes your query and essentially plans how the query will be executed.

And the executor will go and execute
that query against the storage

that you have in the database.

Postgres is really extensible,
and it gives you the ability to

hook in at both of these layers.

At the executor level, for example, what we do is intercept queries that have a specific operator. For anyone that's familiar with writing search queries in Postgres, you know that you can do that with the @@ operator; that's the built-in way to phrase full-text search queries in Postgres. We basically introduce our own that has an extra at, @@@, and when you use that, we hook in internally to the Postgres executor to tell it: hey, when you see a query that matches this, we want it to be executed against our columnar storage, our inverted index that has BM25 scoring, for example, which is very important but doesn't exist in Postgres. That then gets rewired internally to our code paths within Postgres, and that's one of the ways we can do a very seamless integration that's significantly more powerful.
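
As a sketch of what that looks like from SQL, assuming a hypothetical products table; the bm25 index syntax and the paradedb.score function follow ParadeDB's public docs, but exact option names have shifted across releases:

```sql
-- A hypothetical table to search over.
CREATE TABLE products (
    id          serial PRIMARY KEY,
    description text,
    rating      int
);

-- ParadeDB's custom BM25 index (created via its index access method).
CREATE INDEX products_search_idx ON products
USING bm25 (id, description, rating)
WITH (key_field = 'id');

-- Built-in Postgres full-text search: the @@ operator over tsvector/tsquery.
SELECT * FROM products
WHERE to_tsvector('english', description) @@ to_tsquery('keyboard');

-- ParadeDB's operator with the extra at: intercepted by its executor hook
-- and executed against the BM25 inverted index instead.
SELECT id, description, paradedb.score(id) AS score
FROM products
WHERE description @@@ 'keyboard'
ORDER BY score DESC
LIMIT 10;
```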

Nicolay Gerold: Yeah, and maybe on the opposite end: where would you actually see the point where Postgres in itself starts to break as a search database, even with ParadeDB? So what write load, read load, data size, and latency are you trying to reach?

Philippe Noël: Yeah, good question.

So I'm going to have a
slightly different answer.

I don't think any load has been a core problem for us so far. There's nothing that Postgres technically can't do at very high scale, if you scale it. Our biggest customer, Alibaba Cloud, is using ParadeDB inside a data warehouse of theirs; that load is truly massive, and they haven't seen any problems with it.

The area where I would say we're not well
suited is when people are working with

very large amounts of JSON type documents.

ParadeDB works wonderfully if you have a
structured relational format, the bread

and butter of the customers we service.

They store data in Postgres and
they want to keep it there, right?

If you're using MongoDB or you're
using some sort of NoSQL document

database and everything works with
JSON, you can do that in ParadeDB as

well, and it's going to work well.

We have customers that do it, but that's
an area where using a NoSQL search

engine, might make more sense for you.

Nicolay Gerold: So basically it assumes that you put the text into text fields and not into any more advanced JSON structures.

Philippe Noël: You can, so we can
tokenize any type of column or any data

type in Postgres, but it's going to be
easier to reason around if your data

is structured in tables rather than
if your data is structured in JSONs.

Nicolay Gerold: Yeah.

And what is like the power that
actually comes with it when

you build on top of Postgres?

You mentioned a bunch already
but maybe can you go deeper,

especially in the transaction part?

Why would I want to have transactions for my search queries as well, or for the queries that are running on my search application?

Philippe Noël: Yeah, good question.

So, one of Elastic's biggest drawbacks is that it's known for not being very reliable. Clusters go missing, or the data is not fresh because you need to ETL it over, and so on.

Having transactions just guarantees that your data is consistent, basically.

And that's very important.

So we work with a lot of companies in the financial sector, fintechs, a lot of them. With financial data, you can't have duplicated data. You can't have missing data. That's just so incredibly important. Imagine you go to your bank account and you have a missing transaction, or you have a duplicated transaction: you're going to be very unhappy.

For these folks, it's really important that they have really good data integrity, and ParadeDB makes that possible. Versus something like Elastic, we have had customers come to us and show us their fintech applications and tell us that the data here is not present in their database anymore. Because, let's say you roll back a commit in your Postgres, and suddenly that commit didn't get rolled back in your Elasticsearch cluster; that can be a really big pain point.

So I would say that's one of the biggest ones.

That's typically the main
reason people come to us.

We remove the ETL.

And we give you transaction safety, which means the data is more fresh and has very strong integrity.

And then there are other, smaller benefits that ParadeDB has.

It's about five times faster on a
single node compared to Elasticsearch.

You don't need to duplicate
your data as well.

So the cost is much lower.

Those things play a nice part, but they're typically more nice-to-haves rather than the core reason people get excited about what we do.

Nicolay Gerold: Yeah, do you see yourself as part of the larger trend where you combine some new capabilities, like, for example, Iceberg tables, or running analytics with DuckDB within Postgres?

Philippe Noël: We do see ourselves as part of a trend of composable data systems, where people glue together more state-of-the-art libraries inside Postgres. We have done some work with DuckDB.

I do think 1 thing that's very
important in Postgres is to do the

work in the Postgres query engine.

We have tried to cut corners before: we've tried to use DuckDB, we've tried to use DataFusion to accelerate analytical queries.

There's a lot of overhead that
comes when you start to jumble

in multiple query engines.

And we found that to be a big problem.

Like it's really hard to give
the data integrity that I was

mentioning is so important before.

So I do think systems will consolidate, but I don't think they'll consolidate and use all the tools in the exact same way that people think. A lot of the work you just have to build natively in Postgres.

Nicolay Gerold: Does that mean, for you guys, that you also plan to run the analytical queries, which are one part of your feature set, in Postgres over time, or are you already doing that?

Philippe Noël: We're already doing that,
and we're going to be doing more of it.

At first, basically, if you're familiar with Elastic: our core mission is to build an Elasticsearch competitor, right?

We started by building
the search capabilities.

And what we found out is the analytics
capabilities are very important.

They're a very core part
of Elasticsearch as well.

And you need both in order
to deliver real value.

So if you use a search engine, oftentimes you'll return results and you'll see there are 100,000 results that exist on this e-commerce website, or something like that, right?

That's an example of an analytics query.

So we started looking
into building analytics.

When we did, we first looked at it by using DataFusion, and then we looked at it by using DuckDB.

And each time, the
performance was really fast.

You can get really quick performance
really quickly, but it didn't

integrate well within Postgres.

It was very difficult to get the MVCC reads, the transactional guarantees, to be really good, because analytical engines typically don't really care about transactional safety. That's not a really important part for them. But for Postgres, it is.

And so what we found out is if we wanted
to do that, we need to build it ourselves.

So today, in that executor hook workflow that I described, we also store data in columnar format, in a fully Postgres-transactional way.

And then we crunch the aggregate
numbers by accelerating the way

Postgres operates on that data.

And we're not using DuckDB, we're not using DataFusion. We're building similar designs to what they've done, but we're doing it directly in the Postgres planner.

Nicolay Gerold: Does this also mean that the Postgres that runs ParadeDB should be run on particular hardware, to take more advantage of the columnar format your data is in?

Philippe Noël: Yes, I
would say, yes and no.

We recommend our customers use
storage optimized instances.

So on Amazon, they're the I
something, I4 something series.

I don't know what they are
on the other cloud providers.

Those are quite similar to the
instances that are typically

recommended for regular Postgres boxes.

The reason being: when you're doing database workloads, the biggest time-consuming piece of the hardware is reading the data from the disk, and that's reasonably similar to transactional workloads.

Nicolay Gerold: Yeah, and just maybe
to catch people up MVCC basically means

that you can have multiple versions
of the same row in your database.

So multiple transactions basically can
interact with the same pieces of data.

And basically each one has a
version based on the timestamp

when the transaction started.

Philippe Noël: Correct.

Correct.

Yeah, it enables concurrent access, and it's one of the crown jewels. There are things wrong with it, but it's one of the things that make Postgres so good. Postgres is known for being a really reliable database: your data is always there and it's always correct. And MVCC is how that's possible.

Nicolay Gerold: Yeah, and especially when I'm looking at Elastic, what a lot of people underappreciate is the analytics that come with it, because it also has a columnar storage format. It doesn't just have the inverted index.

What are you planning in there
and what are you already offering?

Like you usually have click
data when we are talking about

search, you have personalization
vectors, you have re ranking.

All of which is basically more or less
linked with the analytics component.

How is this working in ParadeDB?

And what is the feature set you
already have in production right now?

Philippe Noël: Yeah, so we already have columnar storage, and for ParadeDB we already have vectorized processing as well, which is how you can do really fast aggregations over that columnar data. It's not generalized yet; it's not generalized to all click data, for example.

It's used specifically in the context
of what we call faceted search,

which is essentially analytics
over full text search results.

So that can mean bucketing results, right?

You're on a search engine like
your Amazon, and you want to bucket

results by star ratings, right?

Four star, three star, 4.5 stars, things like that.

Or you want to count the total number of
results that there are, things like that.

That's what we do today, specifically
in the context of search.

As we go, we do plan to generalize it to the general aggregate operations that you might be able to do on ClickHouse or time-series databases or things like that.
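
To make the faceting concrete: in SQL this is just an aggregate over a full-text filter, so the star-rating buckets above can be a plain GROUP BY (a sketch against the hypothetical products table from earlier):

```sql
-- Bucket matching results by star rating, with a total per bucket.
SELECT rating, COUNT(*) AS num_results
FROM products
WHERE description @@@ 'keyboard'
GROUP BY rating
ORDER BY rating DESC;
```

With rating stored as a fast field, the aggregation runs over the columnar data rather than the row store.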

Nicolay Gerold: Nice.

What do you say is at the moment
the killer use case for ParadeDB?

Because with Elasticsearch, what most people don't know: they started out really heavy, and still are, on log data.

Philippe Noël: When Elastic started, its killer use case was actually user-facing search. And very quickly, they moved into observability.

For us, what we do today, I would say, is transaction-safe, no-ETL, user-facing search and analytics.

So we power search for
financial technology companies.

As I mentioned, we power
analytics for productivity

companies that you probably use.

Many of them will be releasing
announcements about them

in the next few months.

Really user facing search and analytics,
dashboarding, tabling, filtering.

We're big in sales automation as well.

E-commerce is another big one, though a little too big compared to the ones we service.

We may go towards the observability
world eventually, but first we got to

build like a really good foundational
database before we start looking at that.

Nicolay Gerold: Yeah, and what would
you say are the things you want

to take with you from the other
databases and bring into ParadeDB?

So what are the things you actually
really like about certain other databases?

Philippe Noël: From Elastic, for example, I really like the feature richness of the queries, right? You can do so much. And I think it's important to enable users to do that. We had users that requested things like percolation, for example, which is like reverse querying: if you save a search query and you have new data that comes in, it automatically matches the results, rather than the standard search, which is: store the data, then do a search query over it. That's a big one.

We're borrowing a lot of inspiration from
Data Fusion and Clickhouse in terms of

how they're processing analytical queries.

What they've done is
really state of the art.

And this is some of the work we want to do on the analytics front inside of Postgres. And then we're also borrowing some inspiration from the folks at DuckDB and other data-lake-type products.

I do believe interoperating with
data lakes is going to be important,

over the next like 5 to 10 years.

So we're, keeping our eyes peeled
for work that we can do there.

Nicolay Gerold: Yeah.

And what degree of interoperability
do you mean by that?

Is it actually running search on top
of like open table formats as well?

Or rather offloading work, for
example, for the analytics.

Philippe Noël: For now, we do everything inside Postgres; that's what our customers want.

We do have another project that uses
DuckDB that can read data from OpenTable

formats on S3 or Google Cloud Storage
or Hugging Face or things like that.

So that makes it really easy to
load data into Postgres and then

do search and analytics over it.

That's one of the ways we interoperate.

Eventually, we'll also want to be able to write data to object storage, so that you can offload colder data to a cheaper storage mechanism.

I do think there will always be a place for hot data inside Postgres and then some cold data in object storage.

And we just want that transition
to be very seamless for people.

Nicolay Gerold: What was the thinking behind using DuckDB for reading, rather than foreign data wrappers?

Philippe Noël: So we do both.

We do both.

It's built as a foreign data wrapper, which internally uses DuckDB.

That makes it just much easier, right?

Because DuckDB abstracts away the
entire workflow of interoperating

multiple file formats, multiple
object storage, things like that.

But it is a foreign data wrapper
using DuckDB inside of it.
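
A hedged sketch of that shape, loosely following ParadeDB's pg_analytics project; the wrapper, server, and option names here are illustrative rather than authoritative:

```sql
-- A foreign table backed by Parquet files in object storage,
-- read through DuckDB under the hood.
CREATE FOREIGN DATA WRAPPER parquet_wrapper
    HANDLER parquet_fdw_handler
    VALIDATOR parquet_fdw_validator;

CREATE SERVER parquet_server
    FOREIGN DATA WRAPPER parquet_wrapper;

CREATE FOREIGN TABLE trips ()
    SERVER parquet_server
    OPTIONS (files 's3://my-bucket/trips/*.parquet');

-- Queried like any other Postgres table; handy for loading data in.
SELECT COUNT(*) FROM trips;
```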

Nicolay Gerold: Interesting.

And also, the most underappreciated feature of DuckDB, for me, is the CSV parser.

Philippe Noël: Yeah, exactly.

Yeah.

So for a lot of things, JSONs and the like, it works quite well.

Nicolay Gerold: Yeah, nice.

Maybe let's move into the case study you actually posted, the Alibaba Cloud one. Can you maybe give a little bit of background on what they actually wanted to build, and how they ended up using ParadeDB for their specific problem?

Philippe Noël: Yeah.

So Alibaba reached out to us in February of this year, so it's been a while now. They were building a Postgres data warehouse, and one of their customers wanted full-text search in that data warehouse. If you want to do full-text search in Postgres, there are basically three things you could do. They could use the regular tsvector built into Postgres; that's very limited, so they didn't want to do that. They could use Elastic, but then they wouldn't really be offering a singular product; they would need to offer two databases glued together, which is not great. Or they could use ParadeDB, or try to rebuild it. They reached out to us and decided to work with us and purchase the product rather than rebuild it. And then they integrated it within their data warehouse.

Nicolay Gerold: And what is the difference between the data warehouse context and general-purpose search?

Philippe Noël: So that's a good question.

I don't think it's that
different, to be honest.

There's differences in the nuances
of how, like the data is stored and

the type of data and how it gets
searched over and things like that.

But that's something we've mostly left
to Alibaba to figure out themselves.

Like, they are slightly modifying our work in order to integrate it internally within their product. But from our side, it's just Postgres, right?

ParadeDB is building a Postgres
compatible product, and that

data warehouse is Postgres based.

So as long as they remain Postgres
compatible, a lot of the interoperation

of our work with their work is taken care
of by the core Postgres query engine.

Nicolay Gerold: Yeah, nice.

I want to actually double click a little bit more on the complexity behind it. You already mentioned you have two different core components: on the one hand, you're hooking into the execution engine and the query formulation, but there's also the file system, how the data is stored. Can you maybe take it bottom-up? What changes did you actually make to the file system? And does it actually differ for different types of data? Is, for example, text data laid out differently than the data for your analytical queries, so basically the interaction data?

Philippe Noël: Yeah, good question.

So what we do is create a new custom index in Postgres that we call the BM25 index.

For context, for those that are not
familiar, BM 25 is a ranking algorithm

to rank search results, and it's
known to be the state of the art.

It's what Elasticsearch uses.

It's what most powerful search engines use, perhaps even all of them, and Postgres doesn't support it.

So we do that. Then, within that index: it is an inverted index, so it stores mappings, basically, between tokens and their location within the tables, in our case, within the documents in the corpus.

That data gets stored inside a
standard Postgres index and, I'll

spare you the complexities behind it.

You have a file format that gets stored on the blocks within the Postgres pages. All this to say: this is how we work at a high level.

And then we also have
the analytics queries.

So we have the ability within our
index to specify when you index

different types of data or different
fields, whether you want them to be

indexed as what we call fast fields.

Fast fields means they're
indexed in columnar format.

If you don't do that, if you have, for example, a text column, you can store it in a regular inverted-index-type format.

Or if you have a numeric field or an
integer field, for example, that you

may want to do aggregation over, then we
recommend you store it as a fast field.

And it's still going to be stored within the index, but it is going to be stored in a columnar format.
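
A sketch of how that choice shows up at index-definition time; the JSON field-configuration options follow ParadeDB's docs, but their exact shape has changed between versions, so treat this as illustrative:

```sql
-- description: tokenized into the inverted index for full-text search.
-- rating: stored as a columnar "fast field" for aggregations.
CREATE INDEX products_bm25_idx ON products
USING bm25 (id, description, rating)
WITH (
    key_field      = 'id',
    text_fields    = '{"description": {}}',
    numeric_fields = '{"rating": {"fast": true}}'
);
```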

Nicolay Gerold: Yeah, and when I'm looking especially at the comparison to Lucene: you have the segment info, you have field data, you have norm data (normalization data), and often you have the TF-IDF parts.

What are like the components
you actually took with you

for the new index you created?

What's the additional data
you're storing for each index?

Philippe Noël: Yeah, we do the exact same.

In fact, the search engine that
we use called Tantivy is very

strongly inspired by Lucene.

It's not exactly identical, but
it's very strongly inspired.

So we store the exact same things.

All of these components you describe are
required to calculate the BM25 score.

On top of this, we will store
data related to Postgres.

So that's one difference
compared to Lucene.

So as you were mentioning we will
store data to make it compatible with

the Postgres index access method.

But we have the data stored in the tables.

So I don't think we store any extra data.

Like, the extra data would be the transaction-safety-related information, but that's stored within the rows in the tables themselves, not in the index.

Nicolay Gerold: Yeah, and can you maybe walk us through it: you're setting up a new database, you're inserting documents. What's happening behind the scenes as new documents hit the database? What's happening to the documents, how are they preprocessed and inserted?

Philippe Noël: Yeah, so Postgres has this API called an index access method. That's what lets developers like us define custom indexes in Postgres.

When you define your index over a table,
it has triggers in place to process new

data as it gets modified or inserted.

So whenever you have a new row that
gets inserted in a table on which

you've created an index, a ParadeDB
BM25 Index, that row would automatically

get grabbed by the Postgres index.

It's going to be tokenized following the
tokenization strategy that you've defined.

In our case, all of this is handled by the search engine, Tantivy, which does the tokenization into the inverted index, and all of that is going to be stored on disk.

And then when you go and write a query, that query will be executed against the index, assuming you use the proper syntax.

And if you go and modify, you update
the row, you delete the row, the

sort of same process will happen and
will update the index accordingly.

What this means is, if you make large updates, it might take some amount of time for the indexing to finish. It's quite quick, but it could be half a second, a few seconds, depending, maybe even more. If you're indexing an entire new table, it can take minutes or hours.
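
So in practice the lifecycle looks like ordinary SQL, with no separate ingestion step (a sketch, reusing the hypothetical products table):

```sql
-- The INSERT itself triggers tokenization and indexing; if the
-- transaction rolls back, the row never becomes visible to search.
INSERT INTO products (description, rating)
VALUES ('ergonomic mechanical keyboard', 5);

-- Immediately searchable, no ETL lag:
SELECT id, description FROM products WHERE description @@@ 'keyboard';
```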

Nicolay Gerold: How do you actually allow users to tailor their search applications to their specific use case by constructing really complex queries, while still maintaining the level of simplicity and deterministic language that SQL brings with it?

Philippe Noël: Yeah, that's
a very good question.

That's one of the things that I think
we do quite well compared to Elastic.

So if you look at our
docs, you can see this.

We have our general full-text search API where, as I mentioned, we use the @@@ operator. So you could define a really simple search query: search a product catalog where the description mentions blue, right? And you get all the blue items that you might be selling. Or keyboard, and you get all the keyboards you're selling, or something like that. So you can write very simple queries with this, almost in pure PostgreSQL.

Now, we also have something that
we call the Query Builder API.

So our Query Builder API, which is
in our documentation as well, is

a bit similar to the Elastic DSL.

What this means is we expose the entire
underlying search engine, exactly

like what Elastic does, where you
can construct really complex queries.

You can do booleans between results.

You can do disjunction max to
do tie breaking, for example.

You have really complex access.

And you can do it in SQL.

But you can also do it in JSON
similar to how you would with Elastic.

That's been pretty
important for our customers.

Some of them will have the queries
be generated programmatically

by their application based
on what the user will click.

It will like programmatically generate
the query that they want to run.

And that's just easier to do in JSON.

So we expose that as well.
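
A hedged sketch of the two styles side by side; the paradedb.* builder functions below follow the documented Query Builder API, though exact signatures vary by version:

```sql
-- Simple form: a query string against one field.
SELECT * FROM products WHERE description @@@ 'blue keyboard';

-- Query-builder form: explicit boolean composition, Elastic-DSL style.
SELECT * FROM products
WHERE id @@@ paradedb.boolean(
    must   => ARRAY[paradedb.term(field => 'description', value => 'keyboard')],
    should => ARRAY[paradedb.term(field => 'description', value => 'blue')]
);
```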

Nicolay Gerold: Can you give me an example
of what query, for example, is constructed

as the user clicks through the dashboard?

Philippe Noël: Sure.

So let's say you're a sales automation company, right? You're a product that sends emails on behalf of customers. You might have a table where I'm a user of yours, and I can say: okay, I want companies based in Europe that are between 100 and 500 employees, that are in the manufacturing space, that are less than 5 years old.

Let's say that's who I
want to email, right?

When you go and you click these, behind the scenes it's going to generate a query to say: okay, it needs to take this part of the data set and not this part of the data set, and so on and so forth. So oftentimes those queries get generated programmatically, and it's easier to do that in JSON. We make that possible with ParadeDB as well.
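
A sketch of what such a generated query might reduce to, with an invented companies schema and illustrative builder calls (the range helper in particular is an assumption about the API):

```sql
SELECT * FROM companies
WHERE id @@@ paradedb.boolean(
    must => ARRAY[
        paradedb.term(field => 'region',   value => 'europe'),
        paradedb.term(field => 'industry', value => 'manufacturing'),
        -- 100 to 500 employees, inclusive
        paradedb.range(field => 'employees', range => int4range(100, 500, '[]')),
        -- founded less than 5 years ago
        paradedb.range(field => 'age_years', range => int4range(0, 5, '[)'))
    ]
);
```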

Nicolay Gerold: Yeah.

And when you look at the projects, and you see someone setting up ParadeDB for the first time, what would you tell them if they expect to build a large search application? How do they future-proof it for larger scale? What are the settings, what are the setups they should consider from the get-go?

Philippe Noël: Yeah, good question.

The good thing with Postgres is
you don't really need to overthink

it too much early on because you
can always modify these later.

Postgres scales vertically very well.

So for most people, we ship with a set of defaults that are pretty straightforward, and I don't think you need to modify too much.

One thing is we recommend that you set the
parallelization parameters of Postgres to

match the number of CPUs on your machine.

That has a really big impact on
performance specifically for analytics.
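
Concretely, that means the standard Postgres parallelism knobs; a hedged example for a 16-core machine (ParadeDB's tuning docs have the values they actually recommend):

```sql
-- Let the planner use all available cores for parallel scans.
-- Note: max_worker_processes requires a server restart; the others
-- take effect on reload.
ALTER SYSTEM SET max_worker_processes = 16;
ALTER SYSTEM SET max_parallel_workers = 16;
ALTER SYSTEM SET max_parallel_workers_per_gather = 16;
SELECT pg_reload_conf();
```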

We have a page on that in our docs. Then, from an infrastructure and architecture standpoint, we recommend our customers follow the same sharding and partitioning strategy in ParadeDB as they do with their Postgres.

It's quite common for larger companies, basically: when you use Postgres, what a company will do is start with a single Postgres database, right?

A monolith that they'll
store all of their data in.

Eventually, that's going to be too big.

They have too much data to do that.

So they're going to need to
partition the data or shard it

across multiple Postgres machines.

When you do that, then using something
like Elasticsearch can become complicated

because you have two different ways to
think about how the data is structured.

But ParadeDB can do the same partitioning and sharding strategy that your transactional Postgres does, so that's what we recommend. Any best practices that customers are used to doing in Postgres, we recommend they do in ParadeDB as well, and you can define schema migrations and database migrations in ParadeDB because it's Postgres. So it's actually quite easy to map that across your entire infra.

Nicolay Gerold: What is the
difference between searching on

single node versus a sharded database?

Philippe Noël: So most of the customers that use us in a multi-node setup today, what they do is route the queries based on where the data is stored.

So let's say they have 2 nodes.

They'll have half of the data on 1 node,
half of the data on the other node.

And if you design your partitioning and
you're sharding properly, you're not

going to need to do cross machine queries.

So that's going to be pretty simple.

Obviously, if you want something like Elastic, where you don't really need to think about it and everything is just going to work: ParadeDB does not support distributed search in that fashion today.

We have found actually that our
big customers don't need it.

But when they do need it, then we will
probably build it and then I'll be able

to tell you, the way we're going about it.

We have a few ideas for how we
want to do it, but yeah, for now,

we haven't dived too deep with it.

Nicolay Gerold: Yeah, I would just
expect at some point I will hit the

query where the sharding strategy
was in the end the wrong one.

Philippe Noël: Usually the companies that shard their Postgres, at least the ones that we work with, have very smart engineers, and those people design it really well.

So far we haven't had that need, but
I am certain it will happen for sure.

Yeah.

Nicolay Gerold: Yeah, once the
less smart engineers are coming in.

Philippe Noël: it's not that,
but you know what I mean?

Yeah, eventually it'll happen.

Yeah, but for now we haven't needed to.

And that's one thing that
I do want to talk about.

Postgres scales vertically very well.

And because ParadeDB works with
joins, while Elastic doesn't

support joins, you don't need to
search over as large a dataset.

So we have, for example, customers
that have 50 terabyte, 100

terabyte Elastic clusters, right?

And they come to us and they say, Hey
can ParadeDB work over 100 terabytes?

Because that's how much data
I have in my Elasticsearch.

But in reality, the question you should be asking is: can ParadeDB work over my largest table that I put into Elasticsearch? Because if you have a hundred tables that are each one terabyte, then ParadeDB is going to do very well, because it's just one terabyte at a time.

And then you join the result set at the end. In the case of Elastic, they don't support joins, so you have to search over a hundred terabytes no matter what, right?

We don't have that problem because
we have the relational structure.

So with even a single node, you can
get very far because usually no one has

a hundred terabyte table in Postgres.

People will have a few
terabytes maybe at most.
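
A sketch of that pattern with an invented two-table schema: search only the table that holds the text, then join the rest of the relational model onto the hits:

```sql
-- Full-text search runs over issues alone (one modest table),
-- not over a denormalized 100 TB corpus.
SELECT r.name AS repo, i.title
FROM issues i
JOIN repos  r ON r.id = i.repo_id
WHERE i.body @@@ 'segfault on startup'
LIMIT 20;
```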

Nicolay Gerold: Yeah, and I think in the end it will come down to the Redshift study: the large majority of datasets isn't even that big. And even then, your queries will only hit a subset. So you can actually do a lot, especially when joins are possible as well.

So you can split up the different
text fields, for example, into

multiple different tables.

Philippe Noël: Yeah, exactly.

Exactly.

Exactly.

That's one of the really big advantages of ParadeDB. So oftentimes you have customers come in and they're like: oh, I have this big Elasticsearch cluster. And by the time they use ParadeDB, they're like: oh, actually, it works very well; I didn't need to think about it as such a big volume.

Nicolay Gerold: Yeah.

How do you actually think about query optimization for ParadeDB? I think everyone knows the EXPLAIN keyword in Postgres, which is your best friend once you know it. How do you think about it in ParadeDB?

Philippe Noël: It's the same. We have made sure that when you run full-text search queries, EXPLAIN and EXPLAIN ANALYZE also provide you statistics on what happened in the search index, and on the analytics side of the search index. This is what our customers use. We try really hard to conform to the way people use Postgres so that they don't need to change their workflow.
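
So the familiar invocation carries over unchanged; the plan output then includes the search-index statistics he mentions (output shape not reproduced here):

```sql
EXPLAIN ANALYZE
SELECT * FROM products
WHERE description @@@ 'keyboard'
LIMIT 10;
```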

Nicolay Gerold: Is there any advantage to using Rust for that? Because I think Rust especially is very verbose in the error messages it throws. Is there any advantage in the combination of EXPLAIN and Rust?

Philippe Noël: That's a good question.

I wish I had a better
answer for you to that.

I would need to ask one of our
engineers who's built this.

My understanding is that the error messages we return are, as much as possible, the Postgres error messages that Rust allows us to glue in, because, again, we're trying to be as Postgres-y as possible. So I don't think we do anything really specific here, but, yeah, I'm not sure. I don't think there's anything meaningfully different, but we do have very thorough error messages, and they are Rust-based.

So there's probably some stuff
that I'm not thinking of.

Nicolay Gerold: What would you say are the most common mistakes people make when they start out with ParadeDB, doing search on top of Postgres?

Philippe Noël: Good question.

The most common mistakes.

There's a few.

People want to do analytics over fields
they haven't specified in columnar format.

That's a common one.

And then they find the performance
is not as fast as they expected.

We now default integer fields to be indexed in columnar format anyway, unless you specify that you don't want that, because most people usually want it.

Another 1 is not configuring
Postgres properly.

So if you want the maximum performance
on analytics, you do need to

make sure that Postgres can take
advantage of the full hardware.

We also have instructions
on how to do that.

And inside ParadeDB it comes preconfigured, but sometimes when people set it up themselves, they make that mistake.

And then on the search front, it is not uncommon for people to misunderstand how tokenization works and be surprised not to find results, depending on the specific tokenizers: understanding how ngram tokenizers work versus whitespace tokenizers and things like that. We already have some work that we're going to be releasing there to make it much harder for people to make that mistake. Sometimes people will write a query, it will return no results, and they'll be pretty confused. While in reality, it's not that ParadeDB is misbehaving; it's just that the query isn't quite doing what they thought it would.
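
A sketch of the kind of surprise he means, assuming a field indexed with a whitespace-style tokenizer (the index-time tokenizer config is illustrative):

```sql
-- With whole-word tokenization, only complete tokens match:
SELECT * FROM products WHERE description @@@ 'keyboard';  -- matches
SELECT * FROM products WHERE description @@@ 'key';       -- likely no results

-- An ngram tokenizer, configured at index time, e.g.
--   text_fields = '{"description": {"tokenizer":
--       {"type": "ngram", "min_gram": 3, "max_gram": 4, "prefix_only": false}}}'
-- indexes substrings instead, so 'key' would then match 'keyboard'.
```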

Nicolay Gerold: Yeah.

What would you say?

What's on the horizon?

What can you already teaser?

What's next for you guys?

What do you have in the
pipeline at the moment?

Philippe Noël: First, lots of big case studies are going to be coming up.

We've been quiet for a few months as we've been making the product even more enterprise-ready, but I'm excited to release that. From the product standpoint, we have a lot of exciting work coming around fast analytics, so I'm very excited for that to be released in Postgres.

There's a lot of people waiting.

Yeah, lots to be announced
here over the next few months.

And then we have some integrations
with some other folks in the

Postgres world that will be coming.

We've been laying the groundwork to make that possible, and you'll be able to use ParadeDB in more places than you could before.

And I'm very excited to,
to announce that in a bit.

Nicolay Gerold: Nice.

And when you look at the larger landscape, I think there are a lot of people implementing search, implementing RAG right now. What's your dream tech stack on top of ParadeDB for RAG, for search?

What would you add to your stack
when you have to build like a

full fledged search application?

Philippe Noël: That's a good question.

That's a good question.

We work with a lot of customers that don't even use RAG that much. But we ship pgvector in ParadeDB. It's very common for people to build RAG products using basically ParadeDB with pgvector for the search stack, and then they will use some ranking or re-ranking models on top of it, with any of the RAG tooling or RAG pipeline tools that exist. That's typically the very standard one.

We don't do a lot of work there.

Yeah, I don't know. Whatever floats your boat, honestly. Postgres is integrated everywhere; it varies greatly.
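
A hedged sketch of that standard stack in a single query: BM25 candidates from ParadeDB fused with pgvector similarity. The score blend is a naive illustration, not a documented recipe, and the 3-dimensional vectors just keep the example short:

```sql
-- Assumes: an embedding vector(3) column via pgvector,
-- plus the BM25 index on description from earlier.
WITH keyword_hits AS (
    SELECT id, paradedb.score(id) AS bm25
    FROM products
    WHERE description @@@ 'ergonomic keyboard'
    LIMIT 100
)
SELECT p.id, p.description,
       -- naive fusion: BM25 plus cosine similarity to the query embedding
       k.bm25 + (1 - (p.embedding <=> '[0.1, 0.2, 0.3]'::vector)) AS hybrid_score
FROM keyword_hits k
JOIN products p ON p.id = k.id
ORDER BY hybrid_score DESC
LIMIT 10;
```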

Nicolay Gerold: Nice.

That's usually my place for the hot takes; I often get people to put some really interesting stuff out there.

And what's something that you would actually love to see built in AI, in data, or in search, that you're not building, by the way?

Philippe Noël: Yeah, that I'm not building.

Sure, of course.

In AI, search or data?

Let me think.

Let me think.

For one, I think there's a lot that can still be done in Postgres on the data front, around better analytical queries; view materialization is a big one that I think can still be improved upon. I wish that would happen. On the search front, I wish, honestly, it was easier to deal with synonyms, with synonym parsing, in a lot of search engines.

So now it's being done through
vectors in a lot of cases, but we've

had a bunch of customers that have
wanted things different than that.

We may build that eventually, to be honest, but yeah, I wish there were better handling of that.

And then lastly, let me see.

Another big one that I think will be
important is more of the various data

tools integrating with object storage.

So we've had to do a lot
of work on that ourselves.

There's tools like DuckDB that
make this slightly easier to do,

but it's still a lot of work.

I think deeper integration between search and analytics engines and object storage would be very valuable for the community.

And I know people are working on
this for what it's worth, especially

in the Rust data ecosystem.

Nicolay Gerold: Nice.

And if people want to start
building the stuff we just talked

about, where would you point them?

Philippe Noël: Our docs, docs.paradedb.com. You can see everything there. We have a tutorial and we have everything you need to get started. We have our README, our GitHub README; we're at paradedb/paradedb on GitHub.

And from there you can also see our
slack community, where you can join

and ask for help and things like that.

It's a pretty vibrant community; those are the best places to go.

Nicolay Gerold: And if people
want to follow along with you?

Where would you point them?

Philippe Noël: Same thing: give us a star on GitHub. We have a Twitter as well. And we have a Bluesky, now that everyone is moving to Bluesky nowadays. So you can find us as ParadeDB basically everywhere: github.com/paradedb, x.com/paradedb, and ParadeDB on Bluesky.

Nicolay Gerold: So what can we take away
when we are building search applications?

I think if you are on Postgres and you want to build search, it's prudent to start with something like ParadeDB, which is an open-source search extension on Postgres, because your data is already there, and you can start experimenting without the additional cost of setting up ETL, setting up an entire new database in Elastic, and also what you have to consider: hiring engineers that actually know how to work with Elastic.

And I think at the moment, database engineers and search engineers are two entirely different job types, but also skill sets. One always hinges on knowing how distributed systems work, and database engineering, especially on Postgres, doesn't. And this also means that it's probably really hard (thanks to David for pointing that out) to find people on Postgres who know how to do search engineering or relevance engineering. Because it will be very different from Elasticsearch, which has its explain API, an entire skill set you build up over building a bunch of search applications on Elastic.

Where I think it's really great is financial applications: you have the transactions, you have the guarantees, you don't have the race conditions that can happen when you go distributed or go to Elastic, and you don't have to worry about ETL. So it can help in a lot of places. I tried it out. I like the tool, I like the interface, but I also used Tantivy before, the Rust library, and ran it as an embedded database, so I was already familiar with it.

The explain works; I think it's not as great as in Elastic, where it gives you really a lot of information. But it can catch up. Whether we get there, I'm not sure.

A massive advantage in
my opinion is joins.

Because, especially if you have a lot of different searches. For example, when we talked about BM25, we had GitHub's case, where they have different types of data: they have code, they have blog posts, they have discussions, they have issues. And each could be its own table, and you can run a search on top of that, and if you want, you can run joins and join it together: you can join the repos on the issues and stuff like that. It won't work at GitHub scale, Postgres won't be able to do that, but if you have a use case which is way smaller, it could be interesting. And what was interesting to me is that JSON documents will give you issues, which I would be interested in learning more about, the why.

And what's maybe a good learning for people building data tools, especially with query engines: there comes a cost when you use multiple different query engines. Mixing and matching is a bit of a trend at the moment, especially with Ibis, the open-source project, where you can actually have one interface into multiple execution engines. So you can use Ibis to call Polars, to call Postgres, but it comes with a cost.

And one major limitation will be just scale: Postgres hits its limits at a certain point. But when you're running search on Postgres, I don't see any reason not to switch to ParadeDB, because it will be more efficient.

And, yep, that's it.

I would love to know what
you think about this episode.

It's very different from what I usually do, really highlighting a tool and how it works.

So I'd love to hear in the comments
what you think and whether you

want to hear more like that.

Otherwise, we will be continuing
a series on search next week.

So subscribe, and also feel free to comment; it helps me out a lot.

I would love to hear critique,
feedback, positive or negative.

And yeah, see you next week.