How AI Is Built

#44 Semih Salihoglu on Graphs Aren't Just For Specialists Anymore | Search

Kuzu is an embedded graph database that implements Cypher as a library.

It can be easily integrated into various environments—from scripts and Android apps to serverless platforms.

Its design supports both ephemeral, in-memory graphs (ideal for temporary computations) and large-scale persistent graphs where traditional systems struggle with performance and scalability.
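
As a minimal sketch of what "Cypher as a library" looks like in practice, here is a small Python example using Kuzu's Python API; the ":memory:" mode and the schema shown are illustrative assumptions rather than details from the episode:

    import kuzu

    # Open an ephemeral, in-memory database: nothing is persisted to disk.
    db = kuzu.Database(":memory:")
    conn = kuzu.Connection(db)

    # Define a small schema and create a few records via Cypher.
    conn.execute("CREATE NODE TABLE User(name STRING, PRIMARY KEY(name))")
    conn.execute("CREATE REL TABLE Follows(FROM User TO User)")
    conn.execute("CREATE (:User {name: 'Nicolay'}), (:User {name: 'Semih'})")
    conn.execute(
        "MATCH (a:User {name: 'Nicolay'}), (b:User {name: 'Semih'}) "
        "CREATE (a)-[:Follows]->(b)"
    )

    # Query the graph; everything runs inside this one process, no server.
    result = conn.execute("MATCH (a:User)-[:Follows]->(b:User) RETURN a.name, b.name")
    while result.has_next():
        print(result.get_next())

This library-style embedding is what makes it possible to drop the database into a script, a mobile app, or a serverless function.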

Key Architectural Decisions:
  • Columnar Storage: Kuzu stores node and relationship properties in separate, contiguous columns. This design reduces I/O by allowing queries to scan only the needed columns, unlike row-based systems (e.g., Neo4j) that read full records even when only a subset of properties is required.
  • Efficient Join Indexing with CSR: The join index is maintained in a Compressed Sparse Row (CSR) format. By sorting and compressing relationship data, Kuzu ensures that a node's adjacent relationships are stored contiguously, minimizing random I/O and speeding up traversals.
  • Vectorized Query Processing: Instead of processing one tuple at a time, Kuzu processes blocks (vectors) of tuples. This block-based (vectorized) approach reduces function-call overhead and improves cache locality, boosting performance for analytic queries.
  • Factorization and ASP Join: For many-to-many queries that can generate enormous intermediate results, Kuzu uses factorization to represent data compactly. Its ASP join algorithm combines factorization, sequential scanning, and sideways information passing to avoid unnecessary full scans and materializations.
Kuzu is optimized for read-heavy, analytic workloads. While batched writes are efficient, the system is less tuned for high-frequency, small transactions. Upcoming features include:
  • A WebAssembly (Wasm) version for running in browsers.
  • Enhanced vector and full-text search indices.
  • Built-in graph data science algorithms for tasks like PageRank and centrality analysis.
Kuzu can be a powerful backend for AI applications in several ways:
  • Knowledge Graphs: Store and query complex relationships between entities to support natural language understanding, semantic search, and reasoning tasks.
  • Graph Data Science: Run built-in graph algorithms (like PageRank, centrality, or community detection) that help uncover patterns and insights, which can improve recommendation systems, fraud detection, and other AI-driven analyses.
  • Retrieval-Augmented Generation (RAG): Integrate with large language models by efficiently retrieving relevant, structured graph data. Kuzu's vector search capabilities and fast query processing make it well suited for augmenting AI responses with contextual information.
  • Graph Embeddings & ML Pipelines: Serve as the foundation for generating graph embeddings, which are used in downstream machine learning tasks—such as clustering, classification, or link prediction—to enhance model performance.
00:00 Introduction to Graph Databases
00:18 Introducing Kuzu: A Modern Graph Database
01:48 Use Cases and Applications of Kuzu
03:03 Kuzu's Research Origins and Scalability
06:18 Columnar Storage vs. Row-Oriented Storage
10:27 Query Processing Techniques in Kuzu
22:22 Compressed Sparse Row (CSR) Storage
27:25 Vectorization in Graph Databases
31:24 Optimizing Query Processors with Vectorization
33:25 Common Wisdom in Graph Databases
35:13 Introducing ASP Join in Kuzu
35:55 Factorization and Efficient Query Processing
39:49 Challenges and Solutions in Graph Databases
45:26 Write Path Optimization in Kuzu
54:10 Future Developments in Kuzu
57:51 Key Takeaways and Final Thoughts

What is How AI Is Built ?

Real engineers. Real deployments. Zero hype. We interview the top engineers who actually put AI in production. Learn what the best engineers have figured out through years of experience. Hosted by Nicolay Gerold, CEO of Aisbach and CTO at Proxdeal and Multiply Content.

Nicolay Gerold: Graph
databases struggle with scale.

Load a billion node graph, run a complex
query and watch performance tank.

And most systems crumble on
the large data sets or really

complex multi hop traversals.

Kuzu is an embedded graph database.

Which changes this and it brings
modern database techniques to graphs and

it can handle a lot of data with speed.

So you can build fast analytics,
embedded applications in Android

apps, in serverless functions,
or also with Wasm in the web.

You can do quick experiments.

Build temporary graphs in memory,
process the data on the fly, run

analysis without persistence.

And it works by taking a lot of what we've
learned over the years in databases

and applying it to graph databases.

So similar to DuckDB,
it uses a smart storage.

So it's column first instead of row first.

And through that, it can
do a lot of smart preprocessing
to filter down data.

So it actually limits the
amount of data being scanned.

It uses really fast joins
and modern processing.

And today we're going to be
really diving deep into how graph

databases typically worked and
how Kuzu improves on top of that.

And when you want to go really deep
into knowledge graphs, I think you

can take away a lot from this episode.

Let's do it.

Semih Salihoğlu: Kuzu is
an embedded graph database.

So basically what that means in
modern jargon is that we implement

Cypher and we are a library.

You can put us into your scripts, you can
put us into your Android applications.

You can put it into an AWS Lambda.

And you can put it on many
environments as a library instead

of setting up a server somewhere
and then connecting to the server.

So some of the use cases that we see are
related to the fact that we are embedded.

Some of the use cases that we see
are because we have an in memory

mode or a temporary, ephemeral graph
mode, meaning that we don't persist

your databases, you just build it
on the fly and then throw it away.

And then some of the use cases we see
are people building very large databases.

And they have frustrations with
other existing graph databases.

And they come to Kuzu because Kuzu is a
state of the art engine in this space.

So we basically essentially see if I
were to categorize the use cases based

on the sort of the features that we
have or the properties of Kuzu, I'd say

they would bundle in the embeddedness.

The fact that we allow people to build
ephemeral graphs and the fact that we

scale very well on, on large databases.

Originally Kuzu started as a research
project because we were going after

scalability and performance and some
of the usability decisions that we

made, such as being embeddable and
having an in-memory mode, have also

been advantageous in terms
of getting some use cases.

So let me give you an example from,
for example, this one use case

that uses us as an embedded system
and also uses this in memory mode.

This is something we wrote about.

This is a user called Bauplan,
and what Bauplan does, they are
basically a function-as-a-service company.

What that means is that they allow
you to build complex
serverless applications that
consist of many serverless

functions working together.

And what they're trying to do is
give you a good development interface

where you write your programs.

Maybe this is a data processing pipeline
that you're writing that consists

of, say, 20 different AWS serverless
Lambda functions that work together.

And you write it in Python.

As if you're writing essentially different
functions in Python, putting it together

and then they will compile it to the
serverless functions and run it for you.

And as you, as they do this
they will actually optimize and

also do some debugging whether
your program is correct or not.

So what happens is that when somebody
writes a program in Bauplan's Python
interface, with maybe like 20 Python
functions, each of which will
compile to serverless functions,
and they click on compile and run,
what Bauplan does is on the fly generate
this graph of the computation that will

happen, and it will do some optimizations
to minimize the data as there will be data

that will be transferred between these
functions, which is going to save the

user AWS communication cost and
serverless function running time,
and it will also do many sorts of checks
about the correctness of the
program, and they do this essentially

within a second or half a second.

They create a small graph and run
hundreds of Cypher queries within that time.

So this is a case where having an
embedded graph database, where
you are creating graphs on the fly
and throwing them away, is useful.

So this was something obviously
surprising because that's not

how we started Kuzu at all.

Kuzu went for the billion scale graphs,
but this was a case where essentially

we benefited from some usability
decisions that we made that are really

orthogonal to the scalability and
performance decisions that that we made.

Nicolay Gerold: Yeah, and I think
that already leads us a little bit

into the more technical aspects.

You already mentioned that one
set of Kuzu's use cases

is because people are unhappy
with the existing graph databases.

I want to move a little bit into
what are the different features that

actually make the performance of
Kuzu better than its competitors?

Semih Salihoğlu: Yeah.

So I can tell you a little
bit about kind of the origins

of kuzu and how kuzu started.

So Kuzu is a spinoff from a research
project that I had, that started

about four years ago. But prior to that,
about five, six years before that,

so maybe at this point eight or
nine years ago, I entered this space as a
researcher, this graph database space.

And when I entered the space, the first
thing I did was I did a user survey.

Um, so this is published
in VLDB and VLDB Journal.

And the questions we asked
people were basically, what
do you do with your graphs?
How large are your graphs?
We wanted to understand their
graphs and their use cases.

What type of applications
do you build on graphs?

And we wanted to understand what
are your main challenges with

existing technology, basically.

And this was prior to me starting, with
my research group, any kind of technical
project in this space.

We learned in that user survey, and
this was a user survey of about 90

people online, and then we talked
to maybe 15 people face to face.

What we learned is that a lot of
people had major scalability problems.

So there was this association
that you can't essentially manage

large scale graph databases.

Graphs are hard because graph queries are
hard and existing technology doesn't scale,

and it was very visible like scalability
was the top sort of problem people

consistently told us in that survey and
I think the actual highlight of the

paper was that scalability is an issue.

So my tenure case actually when I took
my position at University of Waterloo was

to study that problem that if we were to
develop a modern graph database engine.

How would we, what would its
architectural principles be?

That would make it scale
and make it perform.

That took about five, six years of research,
lots of articles that we wrote, and
several theses on the topic that I advised.

After all that work, I had a
clear sort of picture of what

that system would look like.

So some of its core principles.

So one of the principles
was that I was convinced that
the system should be columnar.
That is, the storage layer
of the system, the physical design of how

we are going to store the records should
be columnar instead of row oriented.

So in a system like Neo4j, for example,
the physical storage of records
is closer to row oriented.
It's even linked-list based; it's
an interesting, very
update-friendly, but not very
scan-friendly design.

Where you have essentially,
let's say, a node record.

And all the records will point
to their first edges and then the

first edge is going to point to its
second edge, so on and so forth.

But there's no kind of concern for
locality of all these records on disk.

Whereas a columnar design is
optimized for scan performance,
where you want to put together data
that will be scanned together.

And it allows also for
you to compress the data.

So it really optimizes for scan
performance and minimizing disk

IOs when you're doing large scans.

So the insight for why I thought the
graph database should be columnar

was that a lot of the queries that
people ask, first of all, scan

large chunks of the database because
graphs, essentially people do a lot

of traversal based queries, okay?

And but they usually are going to
read as they do these traversals,

as they find the patterns or the
path, paths that they're looking for,

they're going to read a few of the
properties of these edges and nodes.

So a columnar design is good for
that because it separates
different properties from each other.
So if nodes have, say,
20 properties, but a query only

accesses one of them, you get an
ability to essentially touch a very

small part of the node records.

So that was, for example, one sort of
storage level architectural decisions.

We had a paper on this
back in, I think, 2019
or 2020 in VLDB; it was
a thesis that I advised.

Another thing was that I worked a lot
on several core query processing
techniques, novel processing techniques
I should say, that are specifically
designed to address the scalability
challenges that happen during query
processing on graph patterns.

So a fundamental kind of property of
graph queries is that graphs

generally represent data where there's
a lot of many to many connections.

So classic examples are just social
networks or financial transaction

networks, where an account, for
example, might transfer a lot of

money to a lot of other accounts.

So that's a many to many relationship.

And those accounts may be transferring
money to many other accounts.

So if you're looking for
a path, there's this.

By the nature of the query and
the nature of the data, the search

space within which you're looking
for a pattern grows exponentially.

So this exponential growth is essentially
a major problem with scale, like

essentially querying over large graphs.

So just a simple math.

If on average each node has 1,000
connections and you do a three-hop query,
you're essentially looking at a search
space of 1,000 cubed: 1,000 times 1,000 times 1,000.

So it just grows.

That's just already a billion possible
sort of patterns within which you're

looking for some pattern that you're
searching for. So I advised a lot
of work on, we did a lot of
work on, addressing this problem.

How could you essentially
mitigate this growth?

Can you do some optimizations
during query processing so that, instead
of essentially enumerating a
space of, say, one billion in my example,
you can do something much smaller and
look for a pattern in a smaller space?

So there, one sort of very good
technique in the space is factorization.

Factorization is essentially a
compression technique, but it's
very interesting because
compression in databases is almost

always used at the storage level.

When you're storing your records, you
compress them so that you minimize IO.

But factorization is a technique.

It's a compression technique
that's designed to compress the

data, the intermediate data that
query processors generate, not

the ones that you store on disk.

So, for example, factorization
can take that two-hop, three-
hop query and then represent
it in a much more succinct way.

So I can get into the details of that,
but we did a lot of work about how do

we essentially increase the scalability
of these systems, query processing
scalability, if you will so that some

of these harder queries can be answered
in a much more performant manner.

So all those essentially
techniques that I had and some

of these I can see we can get.

deeper into.

So all of these techniques were results
of many years of research that we did.

And at the end of which we
said, okay, let's put all of

these together in a new system.

And Kuzu is that system that tries
to integrate all the state of the

art knowledge we know about how to
manage and query large scale graphs.

Nicolay Gerold: And to maybe really nail
home the difference between,
especially, row and column based,
can you maybe go into one example of
how a query would look if you do
it row based versus doing it column
based, to really nail the
point home of this escalating
complexity, so that I don't have so many
different options to query through?

Semih Salihoğlu: Yeah, so I
can give a very simple example.

So for example, and this is not going
to be graph specific, but just so

that I can clarify the difference
between a row oriented storage

and a column oriented storage.

So if you take, let's just
focus on relational systems.

If you take a system like Postgres,
Postgres is row oriented has

a row oriented storage design.

What that means is that if you've got a
table of, let's say, accounts, and
there are maybe 20 properties: owner, the
address of the owner, the balance, the date
it was opened, so on and so forth, right?

The way this will be stored on
disk is that there will be one record
that stores the 20 properties of
each row essentially next to each other.

Once that is finished, you'll have the
second record and then the third record.

In a column oriented system, what you'll
have is that you'll take the first

property, say maybe the name of the owner,

The name of the first record and the
second record and third record, all of

those will be stored next to each other.

And then in a separate part of the disk,
you'll store, say, maybe the balances of

each of these accounts, one after another.

So the balance of the first account will
be the first integer, the balance of

the second account, so on and so forth.

So you essentially split
each record into 20 pieces.

And then store essentially in 20
different contiguous sectors, each

property or each column of this record.

And if you've got a query that,
for example, looks for something

selective, give me all accounts that
were opened on a particular day.

Now, what you can do is you can
essentially scan only the property.

You don't need to scan each record.

And then essentially on disk, essentially,
you don't need to scan all of the database

and maybe the whole database is 100
gigabytes, but let's say only the date

properties when the accounts were opened
is maybe 500 megabytes only if you just

put all of them together, so you can only
scan the 500 megabytes and then find the

ones that were opened on a particular day.

And maybe there's two records.

And then you can go do 20 different IOs
to consolidate the rest of the properties.

So that's a case where you can,
instead of scanning 100 gigabytes

of data, to find those records,
scan only 500 megabytes of data.

And that's essentially
one of the main insights of
columnar databases in general, and that's
a topic at this point three decades
old, spanning many systems;
modern analytics-oriented systems use
column-oriented designs. And our insight
was that a lot of graph queries
are analytics oriented. So instead of
optimizing the system for updates,
we optimize the system for scans.
So that's essentially the justification
for why a system like Kuzu is columnar.
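
To make the row-versus-column contrast concrete, here is a small conceptual sketch in Python; it is not Kuzu's internals, and the property names are made up:

    # Row-oriented layout: each record keeps all of its properties together.
    rows = [
        {"owner": "Alice", "balance": 120.0, "opened": "2020-01-05"},  # ...plus 17 more properties
        {"owner": "Bob",   "balance": 80.0,  "opened": "2021-07-19"},
    ]

    # Column-oriented layout: each property is stored contiguously on its own.
    columns = {
        "owner":   ["Alice", "Bob"],
        "balance": [120.0, 80.0],
        "opened":  ["2020-01-05", "2021-07-19"],
    }

    # "Accounts opened on a particular day": the row layout conceptually reads
    # whole records, while the columnar layout only touches the 'opened' column
    # and then fetches the few matching records by position.
    matches = [i for i, day in enumerate(columns["opened"]) if day == "2021-07-19"]
    print([{"owner": columns["owner"][i], "balance": columns["balance"][i]} for i in matches])

On disk, that difference is the 100 gigabytes versus 500 megabytes of scanning in his example.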

Nicolay Gerold: Yeah, and how this
maybe as an example would look like if

you have a social media graph, I think
that this is the dominant example.

You basically have a query.

So basically I have a user, let's take me,
Nicolay Gerold, and I want to find
people who are followers of my followers.

So I first
have to identify the starting
node, which is me, basically.

So I find it, and then I basically
filter down, with the columnar
database, the nodes which have a
follow connection to Nicolay Gerold.

And through that I don't have
to do the row retrieval.
So basically I get the row where Nicolay
Gerold is in it, and then I basically
have to get all of the different edges,
follow them, and then
do another hop to follow them again.

But I can use the different,
separate columns to basically filter them
down already and decrease my search space.

Semih Salihoğlu: So this
is, I think, partly accurate, but let

me maybe, uh, separate two things.

So the columnar sort of record storage.

is orthogonal to I think another thing
you're mentioning in your example, which

is the ability to follow essentially
specifically your connections.

So that's a feature that graph databases
have, that relational databases do

not have, which is that when you in a
graph database say, here are my users and

here are other users that they follow.

Here are the follow
essentially relationships.

What graph databases, you,
universally, not just Kuzu, Neo4j,

but take any other graph database.

What they'll all do is that they'll
build a join index automatically.

You don't have to, as a user, tell
the system to build this index.

The system will automatically
build a join index.

So what that means is that they will build
an index so that for a particular user,

say Nikolai, you'll be able to find The
people Nikolai follows, and the people

that follow Nikolai with a single IO.

So they'll, that's essentially an
index structure, um, that is separately

built automatically in graph databases.

And that's, in some sense,
that's orthogonal to the columnar

design that I was talking about.

So in my example, I gave a columnar
design of node

records, but you can also have a
column design of these relationship

records that are being indexed.

So for example if your follow
relationships had many properties

on each follow relationship, maybe
you put many properties, like maybe

the date, the following started.

Or how many times they have seen each
other each other's posts or something.

Um, now, all of those properties
also need to be, so those, now

those are relationship records.

They also need to be stored somewhere.

And suppose, you can again make a choice
there as an architect of the database.

Do I want to store these properties of
these relationships in a row oriented

manner or in a column oriented manner?

And in your query, for example, your
query doesn't care about the date

the following started or how many times
essentially these people have interacted.

So in a column oriented design, you
can essentially avoid scanning any

of those and just scan Essentially
the fact that they're connected to

each other, which is usually another
kind of property of the relationship.

And completely avoid scanning any of
the properties of these relationships.

Whereas in a system like row oriented
system that forces you to scan the

entire records, relationship records.

You are forced to scan also all the
properties that come with the relationship

record, this follows record, that you may
not be needing for the query processing.

So columnar versus row oriented is
essentially a choice about how much

of the records do you want to read
when you're answering the query.

And columnar systems give you the
advantage that you can read as few

parts of the records as the query needs.

But obviously there's a catch.

The catch is that when you're
producing your final outputs,

you have to do extra work to go
and grab the rest of the records.

That, suppose you're, suppose you want
to find the users that Nikolai follows.

Right?

And maybe also the day
the following started.

So like you are trying to
return some of these properties.

So columnar systems, maybe to find
the final set of records does scan

little, but at the end they need to
go through an extra step to make sure

that if the final query output requires
other properties of the records, that

they're not maybe needed for the query
processing or to find the records.

But that are needed because the
user is returning them, right?

Now you need to do extra
work to put things together.

Whereas in a row oriented system,
because you've scanned the
entire record during query processing
anyway, you can just directly output

them, but overall, I think it's very
well understood that if you've got a

system, you're building a system that
is meant to be analytics oriented.

That's meant to optimize for analytics
oriented applications, meaning

applications that ask a lot of queries,
but maybe they do batch updates

instead of a ton of small updates.

They're not OLTP type of applications.

Columnar systems, they've got
a big advantage there.
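
A hedged illustration of the kind of query where this matters, written against the Python connection from the earlier sketch; the schema and the assumption that Follows carries a 'since' property are illustrative, not from the episode:

    # Two-hop "followers of my followers": the pattern only needs connectivity.
    # If Follows also carried properties (say, a 'since' date), a columnar
    # layout can skip those property columns entirely, while a row-oriented
    # relationship record would be read in full anyway.
    query = (
        "MATCH (me:User {name: 'Nicolay'})<-[:Follows]-(f:User)<-[:Follows]-(ff:User) "
        "RETURN DISTINCT ff.name"
    )
    result = conn.execute(query)
    while result.has_next():
        print(result.get_next())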

Nicolay Gerold: And in this
traversal, how do CSR, so compressed

sparse rows, come into play for the
traversal or for the adjacencies?

Semih Salihoğlu: Yeah.

So CSR is a specific physical design
for the join index that I mentioned.

So there are different ways
you can store the join index.

The join index basically allows you
to locate all the relationships
of a particular node very quickly.

You should, basically, you should
think of it as a, with a single IO,

with a single lookup, you can find
any of your nodes relationships

or at least start scanning them.

So again, in the storage of that
index, there are different designs.
You could obviously build a tree-based
classic index, B+ trees, and try to
have a join index that will also give
you quick access to sets of records. CSR
is a particular physical design that's
very popular for graphs in general.

In-memory graph analytics libraries
use CSR-based indices a lot.

Kuzu builds a disk-based CSR
index. CSR stands for compressed sparse
row based storage, and a good
way to understand what that is: say
you have a from column and a to column.

So that's the relationship, the essence
of the relationship that you're storing.
You've got from nodes and you've got to nodes.

And let's assume that these from and to
are integer IDs that identify the nodes.

So 1 to 2 means node 1
follows node 2.
2 to 5 means node 2 follows node 5.

And you got these records 1, 2, 2, 5, etc.

So that's your essentially
essence of the relationship.

Again, the relationships could have
other properties, but let's ignore that.

The join index does not
usually store those properties.

So you've got a from column and a to
column, and they are IDs of these nodes.
And if you sort it, say you
sorted these records by the from column,
you're gonna get something like (0,1), (0,5),
(0,7), (1,2), (1,5), and so on and so forth.

So the from column will be sorted.

And now if you compress,
now that the from column is.

sorted, you can do run length
encoding compression to it.

So run length encoding
compression basically says,

if essentially the from column values
are 0, 0, 0, as in (0,1), (0,5), (0,7),

That means there's three zeros,
so you just store zero and three

instead of storing zero three times,
and maybe if there's five ones

next, then you store one and five.

So if you did that sort of storage, that
is actually the essence of CSR storage.

CSR uh, storage is very
similar to essentially storing

all the to values consecutively.

And then, from the from column, you store
just essentially how many each node, how

many edges each node has, and you don't
necessarily have to store anything else.

So you essentially end up
compressing the from column a lot.

And if you're storing the backward
edges, Now you would have a to and

from relationship, and then you
would sort by to and compress the to.

So that's the essence
of compressed sparse row.

It's essentially, that's why
it's a columnar storage as

well, and Kuzu stores on disk
its join index in this
compressed format, the CSR format.

And that contrasts with, say, Neo4j.

Neo4j stores these records, links these
records with each other and doesn't

have the guarantee, for example, that the
relationships of a particular node,
the follows of, say, Nikolai, are
next to each other;
they're just linked to each other.
So you can, with one IO,
always go to the next one,
but obviously there's no locality
guarantee as the database grows.
Kuzu instead always guarantees
that it will store the
edges, the relationships in that
index, in this CSR format.
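
As a conceptual sketch of the CSR idea he walks through — the to values stored consecutively and only counts/offsets kept for the sorted from column — here is a small pure-Python version; it is not Kuzu's on-disk layout:

    # Edges as (from, to) pairs, already sorted by the from node ID.
    edges = [(0, 1), (0, 5), (0, 7), (1, 2), (1, 5), (2, 4)]
    num_nodes = 3

    # All the 'to' values stored consecutively.
    neighbors = [dst for _, dst in edges]

    # Instead of repeating the 'from' value per edge (the run-length idea),
    # offsets[i] records where node i's neighbors start.
    offsets = [0] * (num_nodes + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    for i in range(num_nodes):
        offsets[i + 1] += offsets[i]

    def out_neighbors(node_id):
        # One contiguous slice per node: a traversal scans it sequentially.
        return neighbors[offsets[node_id]:offsets[node_id + 1]]

    print(out_neighbors(0))  # [1, 5, 7]
    print(out_neighbors(1))  # [2, 5]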

Nicolay Gerold: Interesting.

So we've talked a lot about the
storage or memory layout of Kuzu.

What we've missed until now is the compute
part and especially the vectorization and

how it makes the queries more efficient.

Can you go a little bit into the
vectorization part and explain it,

how it works and how it contrasts
to regular graph databases?

Semih Salihoğlu: Yeah, so this is also,
again, vectorization is not a very

graph database specific technique.

It's a technique that originated
in columnar relational systems.

In essentially the late nineties and
early 2000s. It's very widely used

now for again, analytics oriented
systems in columnar systems.

This is a widely used
widely used technique.

And what it contrasts
with is the more traditional
database query processing
architecture, which is called tuple-
at-a-time processing. In basically all
databases, database management systems,
relational systems, graph databases,
or even RDF systems, any system that

implements a high level query language.

Ultimately, what happens in those
systems is that those high-level query
language clauses are compiled down
into an expression tree, a relational
algebra expression tree whose leaves
are scans of tuples and whose intermediate

nodes are some relational operations
like maybe joins, maybe projections,

maybe aggregations, so on and so forth.

Imagine essentially a very simple
expression tree.
Let's look at a very simple query:
so maybe, count
all the nodes
whose name is Nikolai.

So a very simple expression tree
for this, that there's going to be a

scan node that scans your nodes.

So maybe say user nodes.

And then.

A filter, this scan could also
be implementing the filters,

but let me just simplify it.

But let's assume there's
another operator called filter.

What it does is that it just checks if the
tuples that it receives have name Nikolai,

so it's going to run that expression.

And if it does, it's going to give it
to the next operator, which is count.

So there's three operators, a count
operator, an aggregation operator, a

filter operator, and a scan operator.

In a traditional tuple at a time
processing, what happens is that

the processing starts from the count
and count says to its child, which

is filter, give me your next tuple.

And then the filter calls its
next child, which is scan,

says, give me your next tuple.

And then they pass essentially
one tuple between each other

until you exhaust the scan.

Not a lot of
systems are built this way anymore,
certainly not a lot of analytics-
oriented systems.
It's got the advantage that it's
very simple, but it's got
the disadvantage that there's a lot
of essentially function calls that
happen between these operators.

They keep
calling getNextTuple; if you've
got a billion tuples you're scanning,
you're going to make,
maybe like in my example, three billion
getNextTuple calls.

Vectorization, this technique called
vectorization, I think a better name

for it is block based processing.

That appears in some in, in
some literature, it appears

as block based processing.

What block based processing does
is that it passes instead a block.

of tuples between operators.

Each time you say get next tuples,
you essentially get a block of tuples,

like 1024, let's say, or 2048 tuples.

So it amortizes this expression,
the cost of calling these get

next tuples function calls.

So what happens is that now you need to
rewrite your operators, though, because

your operators can no longer assume
that they're getting a single tuple.

They all need to be rewritten.

Every single primitive operator in
the system needs to be rewritten in

tight for loops, which is actually very
good for instruction locality as well.

So it's well understood now that this
type of vectorized processing is good.

If you want to essentially
build a query processor that is designed
to process lots of data, because
analytics-oriented systems make this
assumption that their query processors
are going to essentially compute on
a lot of data passing through these
operators, vectorization is essentially
a common-wisdom architectural principle
that minimizes these calls. Which
contrasts again with systems like
Neo4j; the last time I
looked at it, it was
this tuple-at-a-time processor.

So those older architectural
principles have these performance
disadvantages if you're building
an analytics-oriented system.
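
A rough sketch of the difference between tuple-at-a-time and block-based (vectorized) processing for the "count the nodes named Nikolai" example; this is conceptual Python, not how any particular engine implements it:

    data = ["Nikolai", "Semih", "Nikolai"] * 1000

    # Tuple-at-a-time: the per-tuple call overhead is paid for every single row.
    def count_tuple_at_a_time(rows):
        count = 0
        for row in rows:              # "give me your next tuple", repeatedly
            if row == "Nikolai":      # filter operator
                count += 1            # aggregation operator
        return count

    # Block-based: operators exchange vectors of, say, 1024 tuples, amortizing
    # the call overhead and keeping the inner loops tight.
    def count_vectorized(rows, block_size=1024):
        count = 0
        for start in range(0, len(rows), block_size):
            block = rows[start:start + block_size]          # scan emits a block
            passed = [r for r in block if r == "Nikolai"]   # filter over the block
            count += len(passed)                            # count over the block
        return count

    assert count_tuple_at_a_time(data) == count_vectorized(data) == 2000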

Nicolay Gerold: Yeah.

Semih Salihoğlu: But I want
to be clear, that's not really a graph

database specific optimization.

It's essentially an optimization
for any system that is being built
to have a read-optimized query processor.
If you want to build a
highly optimized query
processor that is designed to do a
lot of computation on a lot of data,
then vectorization is a good idea.

It doesn't have to be
graphs, relational or RDF.

It's just a good idea to optimize your
processor to work on a lot of data.

Nicolay Gerold: Yeah, and it's in the
end taking advantage of modern hardware.

The, especially, we see it in
a lot of different databases.

Because we have more powerful computers,
we don't need So many distributed

systems anymore, so we can go back
onto single node and take advantage

of something like vectorization.

Semih Salihoğlu: Yeah, exactly.

Yeah, I completely agree.

A lot of these common wisdom
techniques, if you put them together

essentially pushes the state of the
art, certainly in graph databases.

I think a lot of the graph databases
that we studied were not built with these

common wisdom architecture principles.

And some of these are again, not
even graph database specific.

They're just general, broad
data management principles.

And they were built on these older
architectural principles, so that's one

of the gaps that we saw was that there's
a need for a state of the art system that

sort of put some of these common wisdom
architectural principles, along with

some of the graph specific ones that my
research group did together, and Kuzu

aims to be such a state of the art system.

Nicolay Gerold: Yeah, and I think
that's a cool part in data and AI.

You basically put together a lot
of optimizations to arrive at the

state of the art and don't have one
single core idea, but you pick and

choose from different areas from
like systems engineering, networking.

More like the hardcore, like data
engineering side, put it together

to build like a system that's better
than what we have seen so far.

Semih Salihoğlu: I agree.

I think it's in the nature of a lot of
system oriented projects and products that

the whole is much bigger than the parts.

There's a lot of parts that you
have to do, but, a lot of them

don't require a lot of innovation.

You just need to follow, I think, the
modern state of the art principles,

and this happens a lot in databases.

Every decade or so, the field moves and
we understand better how to build
sort of modern fast engines, and then
the newer systems are able to adopt them,
and for the older systems it's just much harder
to adopt those principles.

And it's I think in the nature
of, in the nature of the field.

Nicolay Gerold: Yeah.

And we touched on a lot of
the technical aspects of Kuzu.

One thing I want to go into as well is
the ASP join, and not like the complete
technical details, but rather the core
ideas behind it and what it replaces
and how it improves on the existing join
algorithms we have in graph databases anyway.

Semih Salihoğlu: Yeah.

Good.

So I think to explain, so ASP is the
term we used in the Kuzu paper in the

CIDR Kuzu paper, which was published
before we even became a company and
spun off to commercialize Kuzu. In
the actual implementation,
I don't know what term we use
for that join operator, but yeah.

We essentially, it's a let's call it
ASP, I think we call it multi way join

or something, but I can get into the
principles of what we were trying to do,

but to explain that I need to explain
factorization a bit better, so I gave

a high level overview of factorization
is that it compresses intermediate

relations, uh, the intermediate sort
of tuples that are produced during

graph sort of pattern querying.

So the example that I gave was like a
few hop query about, maybe the accounts

that Nikolai's account transferred
to and maybe two hop accounts.

So let's maybe take
another simpler example.

Suppose the query is,
give me all the two-path
transfers that Nikolai could be
facilitating, meaning that Nikolai's

account. You've got
one account node, let's call it
account Nikolai.

And that node has some incoming transfers
and it makes some outgoing transfers and

somebody, let's say a fraud analyst, is
interested to see what sort of money
flows Nikolai could be facilitating, as
the middle person in these accounts.

Okay, so if you got essentially a
thousand incoming transfers and

a thousand outgoing transfers, so
the number of two-paths
that Nikolai's account is part of is
1,000 times 1,000.

There's a million possible different
combinations of two path money flows

that Nikolai could be facilitating.

Okay, so which means that there's
actually, if I ask you that query in

Cypher, match Maybe a node A that has
transferred money to node B that has

transferred money to node C, where
the middle node B is owned by Nikolai.

The output of that query will
contain a million tuples.

Okay, even though there's only
1,000 incoming transfers and 1,000
outgoing transfers, the
fact that these are many-to-many joins, this
many-to-many sort of pattern growth,
leads to a very large number of patterns.

In factorization, there's a different
way you could have stored that.
Take any relational system
and run an equivalent SQL query.
That system is going to, in its
query processor, generate
a million intermediate tuples
and materialize them in memory.
They're gonna flow through memory,
but it's gonna materialize them.
But in a factorized way, you
could have represented
those 1,000,000 tuples as a Cartesian
product of much smaller pieces.

You could have said that
intermediate relation is the 1,000 incoming
transfers,

Cartesian product that with Nikolai's
account and Cartesian product that those

two with a 1000 outgoing transfers.

Now you got essentially a much more
compressed representation that contains

only roughly sort of 2000 records.

It does represent a million records.

But in memory, you've only
represented it as 2, 000 records.

And suppose all you want to do is
count the number of two-
path flows you could be facilitating.

Now, I don't even have
to ever uncompress it.

I can just very quickly count
the number of tuples there.

It's 1, 000 times 1, 000.

I can say the correct
answer is 1, 000, 000.

So factorization is this idea.

It's that idea that when there
are many to many joins, you can

represent the intermediate results
relations that are generated in a

much more compressed way by using
Cartesian products of sets of records.

So that, that is the
idea of factorization.
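
As a rough sketch of that idea — representing a many-to-many intermediate result as a Cartesian product instead of materializing it — here is a conceptual Python version, not Kuzu's internal representation:

    incoming = [f"in_{i}" for i in range(1000)]    # 1,000 incoming transfers
    outgoing = [f"out_{j}" for j in range(1000)]   # 1,000 outgoing transfers

    # Flat representation: materialize every (incoming, middle, outgoing) tuple.
    flat = [(a, "nikolai_account", c) for a in incoming for c in outgoing]
    print(len(flat))        # 1,000,000 intermediate tuples

    # Factorized representation: keep the two sets plus the fact that they are
    # combined as a Cartesian product; ~2,000 records stand in for a million.
    factorized = (incoming, ["nikolai_account"], outgoing)

    # Aggregations like COUNT never need to decompress the product.
    count = len(factorized[0]) * len(factorized[1]) * len(factorized[2])
    print(count)            # 1,000,000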

This ASP join that we were trying to
do, uh, it's highly, very detailed.

It's gonna be difficult
to explain each component.

But it was essentially trying to get
good factorization structures, which
mean good Cartesian product structures.
One thing we wanted to avoid in Kuzu,
and a lot of graph database systems
do this, is that we wanted to avoid
non-sequential scans as much as possible.

So what do I mean by this?

So if you got essentially this two-
hop query, let's maybe take another query:
Nikolai's sort of two-path transfers,
where Nikolai is the originator.
A common query plan for this query,
now making node A Nikolai's
account, is you go to Nikolai's accounts,
and maybe there's a thousand of those.

And then for each one of those, you
go to a thousand other accounts.

And if you need to scan some properties
of these accounts, a common but
very bad query plan is one where
you essentially get non-sequentiality

when you go to Nikolai's accounts.

You're going to a thousand
different records, but there is no

guarantee that they will be located.

Consecutively on disk.

And when you go to their accounts,
you lose even more locality.

So a very bad way to process this is
that you go to Nikolai's account,
and then from that do another random IO
to go to 1,000 other accounts.
And then you go to Nikolai's second
account, do another 1,000 random
IOs, and so on and so forth.
Os and so on and so forth.

That's a very bad way to process this.

And our observation was that relational
systems never do processing like this.

What they do is they use hash joins,
which means they'll take essentially maybe

the accounts and the transfer tables.

And then they'll always scan
sequentially, and then any randomness

will be, will happen in that
intermediate hash table lookups.

So what we wanted to do was to also
ensure that we behave like a relational

systems hash joins in the sense
that we always scan sequentially.

When we do these sorts of graph pattern
evaluations. So, the ASP join's components:
I won't be able to explain the full ASP
join, because it's going
to get very low level
and I'd need maybe a board to explain it.

But essentially the problem we were
trying to solve is that we wanted to

get good factorization structures.

Plus, we wanted to ensure that
we always scan sequentially.

And what our realization was that
a lot of the times these conflict.

A lot of the times you can't get
good factorization structures

and get sequential scans.

So what we had to do was
apply this technique called

sideways information passing.

Where sometimes we store some intermediate
data in memory, figure out what needs to

be scanned, and pass that down to the scans
as information, and the scans are
going to make use of this information
about what the successful joins are,
so which records do I need to scan.

So it's very hard to, as I'm explaining,
as I'm realizing it's very hard to

explain something like ASP, ASP join,
but there's essentially this core

idea of trying to get factorization,
good factorization structures.

Trying to ensure that scans
are, um, scans are sequential.

And, sorry, the third piece that
we were trying to get is that graph
databases have the disadvantage of doing non-
sequential scans, but the advantage that they always
scan only the parts that they need.

So they don't, so for example, in
that example that I gave of two hop

joins, even though you're doing non
sequential scans, you're never scanning

any record that you never need to.

So you're doing them randomly maybe,
you're going to your first-degree

neighbors, but you're always scanning
something useful, and that's something

relational databases don't have.

Relational databases, when they join
things, they'll scan the entire tuples.

Even though you need maybe only Nikolai's,
1, 000 records, transfer records, if your

transfer table contains a billion records,
they'll still scan all billion records.

So, that was the third thing
we were trying to get, and I

should have clarified that.

And if you want to get that as
well, and make scans sequential,
what you have to do is build
these hash join type of plans where you

get, say, Nikolai's
record and you understand that you need
to only scan Nikolai's transfers, and
you're going to pass that Nikolai tuple's
information sideways to the transfer table
scan operator and tell the transfer table
scan operator: only give me records
that have this Nikolai's account ID,
maybe account seven; only scan me the
transfers where the from account
is seven. So you can avoid scanning
all the records, all 1 billion records;
you only scan those 1,000 records.

So maybe to
summarize it, we were essentially
trying to behave like relational
systems' hash joins, but have the
ability to (a) do factorization and
(b) only scan the parts of these
relationship records and node records

that really needed to be scanned, which
is an advantage of graph databases.

But the traditional processing
in graph databases was

something we didn't like much.

Because of the non sequential
scans that it leads to.

ASP join was our solution, and
it's, it is one of the core join

algorithms that is implemented in Kuzu.
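
The full ASP join is more involved than this, but here is a loose sketch of the sideways information passing piece — materialize which keys are actually needed on one side and hand that to the other side's sequential scan — in conceptual Python; this is an illustration, not the actual operator:

    # Accounts owned by Nikolai, found in an earlier step of the plan.
    nikolais_accounts = {7}

    # (from_account, to_account) transfer records; imagine a billion of these.
    transfers = [(3, 9), (7, 11), (7, 42), (8, 7), (7, 5)]

    def scan_transfers(needed_from_accounts):
        # A sequential scan over the table; the sideways-passed set lets the
        # scan drop useless rows immediately instead of emitting everything.
        for src, dst in transfers:
            if src in needed_from_accounts:
                yield (src, dst)

    # Hash-join-like plan: build side is the set above, probe side is the scan.
    print(list(scan_transfers(nikolais_accounts)))   # [(7, 11), (7, 42), (7, 5)]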

Nicolay Gerold: Yeah.

And I think one of the
downsides of columnar databases is

writes and especially also updates.

So usually in, in something like Parquet,
which is also columnar, you delete the

file and you basically create a new one.

And how long in Kuzu does the
write path actually take until it's also indexed?

And also a little bit on that is, should
you optimize for actually having more

of a read heavy workload with Kuzu and
for example, like batching writes and
minimizing concurrent writes with it?

Semih Salihoğlu: It's a good
it's a good observation.

I think it's like a fundamental sort
of design decision that you make when

you're starting to build your system.

Do I want a system that's
optimized to take in small writes
very fast, an OLTP-like engine?

Or do I want a system that is
optimized for analytic queries,

which means read heavy queries.

And we consciously made the latter.

We are a read oriented system.

So you are right that columnar systems,
because of certain optimizations that

they do, simply the fact, for
example, that you separate each record
into, in my example there were 20
properties, into 20 different columns,
like different parts of the disk,
means that if I insert a single tuple,
I need to write to 20 places, right?

If I had a row oriented system, I
could write that record consecutively

to a single place, which means you
only have to do one IO, for example.

There is this fundamental sort
of trade off that you make.

So there is no, you can't
get both of these, I think.

You have to make a conscious decision.

So it is fair to say that
columnar systems, and Kuzu is one of
them, are more performant if you batch
your writes and do your writes in batches

instead of doing a lot of sort of maybe,
I don't know, 10, 000 sort of simple

transactions, each one kind of adding
one relationship or one node record.

That's not
a case where you're
going to perform very strongly.

But there are certain optimizations
that we do to reduce

essentially the time that it takes to
commit transactions, write transactions.

This is particularly a
problem with the join index.

When you do CSR based join index, right?

The original design of CSR
is that everything is packed.

So suppose Nikolai's account
is 07, and 07 now has a new transfer
that will go into the join index. But
the way I described the CSR index,
there's no gaps for me
to add a new edge in that to
column, the second column, right?

So for example, what we do in Kuzu
is that we actually, that second

column that stores the actual
neighbor node IDs, if you will,

actually contains gaps: after node
zero's set of edges, we store some gaps;
after node one's set of neighbors,
we store some gaps, so that
we essentially amortize.

We make it more update friendly so
that if you get a single insertion,

Nikolai gets another transfer,
there is usually
going to be a gap there so we can directly

write it at the end instead of resorting
and reshuffling the entire column.

So we do several optimizations like this.

So if you're familiar, there's a popular
data structure called packed memory array.

So packed memory array is
essentially a way to sort
elements by leaving gaps.
And that kind of compares to
tree-based indices that also sort
elements or sort records.

So our columnar design for the
join index, the CSR based index is

essentially based on packed memory array.

So we amortize we try, we make
that columnar storage more

update friendly, for example.
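
A small conceptual sketch of the gapped CSR neighbor column he describes (packed-memory-array flavor), where a single new edge usually slots into a spare position instead of forcing a rewrite; this is not Kuzu's actual storage code:

    GAP = None
    REGION = 5   # slots reserved per node, including gaps

    # Neighbor IDs per source node, each region padded with spare slots.
    neighbors = [1, 5, 7, GAP, GAP,       # node 0's edges + gaps
                 2, 5, GAP, GAP, GAP]     # node 1's edges + gaps
    region_start = {0: 0, 1: REGION}

    def insert_edge(src, dst):
        start = region_start[src]
        for i in range(start, start + REGION):
            if neighbors[i] is GAP:       # free slot: write in place
                neighbors[i] = dst
                return
        # No gap left: a real system would rebalance/rewrite this region here.
        raise RuntimeError("region full, rebalance needed")

    insert_edge(0, 9)                     # node 0 gets a new transfer
    print(neighbors[:REGION])             # [1, 5, 7, 9, None]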

Nicolay Gerold: Yeah, to maybe close
it out, I think graph databases
are gaining traction at the moment,
but they're still very hard to adopt.

What would you say is missing to make
it actually easier for developers,

especially to adopt graph databases?

So what's like a missing link

Semih Salihoğlu: it's a
hard question to answer.

So part of the reason graph databases are

going to remain a niche sort of industry
is because relational systems, the

amount of education that exists on
relational systems is incomparably

larger, or there's a lot more
education on relational systems
than on other classes of
database management systems.

So every computer science curriculum
in the world teaches a database

course and in that database
course you're going to learn SQL.

So there's a lot more familiarity
with SQL, and I don't think

you can compete with that.

But there are essentially
directions, I think, that could
increase the popularity of graphs.
There's a couple of ways that could
increase the popularity of graphs.

One of them is this: property
graph databases, the model that Kuzu is
based on, which essentially follows
Neo4j and the class of systems based on
Neo4j, property graph database systems,
have the advantage that the data model,

Even though it's graph oriented, even
though it's object oriented, it's actually

very close to the relational model.

So basically you have node records.

You basically have tables of
records, but each table, you just indicate

if it's an entity or a relationship.

So that's like the primary
difference between the property

graph model and the relational model.

Especially Kuzu's version of property
graph model, where we force you to

predefine your schema, is that you
have tables of records, but you

just need to tell us, okay, this
account table, that's actually an

object, that's actually a node.

Whereas this transfer records,
that's actually a relationship.

So that's the difference.

You give this additional
information, and based on that

information, we optimize this.

But to come back to my point, graph
databases, the property graph databases,
have the advantage that because they're
so close to the relational model,
they're actually very easy to understand.
And if we build essentially features on
these systems that make it actually easier
to query directly off of relational
records.

So suppose you got, say, CSV
files and you got parquet files.

A lot of people dump their
records in this tabular format.

These formats are essentially
sets of records, right?

So sets of records identify certain
information in your data model, so

they could be identifying your accounts
and the information of your accounts.

If we built features around these
systems that even close that gap,
where, okay, these sets
of records that
are in CSV files or Parquet files,
I want to model them as nodes,
and then these other sets of
records, they're my edges.
Then you're even closer to
the relational model because
you're just
providing some kind of mapping.

You're not even doing a big mental
step of re-representing those Parquet
files in your graph database system.
It's more like you're doing a mapping,
saying, okay, the
graph database systems are able to
query tables, tabular formats, as well.

So I think that could help, if you had
features that allowed you
to do this. Say, a system like
DuckDB: DuckDB can scan and query
over raw Parquet and raw CSV files.

And you can build similar features
in graph databases as well.

That's something we're doing
prototyping on in Kuzu.

How do we build features that
allow you to query these
raw tabular records also
in the form of a graph?

I think that could also help
people; that could also simplify things.
The more you simplify essentially
querying records as a graph,
the more you can contribute to
popularizing graph databases, I think.
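
A hedged sketch of what that mapping can look like today with bulk loading, reusing the Python connection from the earlier sketches; the COPY statement, file names, and columns are assumptions for illustration, not details from the episode:

    # Tell the system which set of records is the nodes and which is the edges.
    conn.execute("CREATE NODE TABLE Account(id INT64, owner STRING, PRIMARY KEY(id))")
    conn.execute("CREATE REL TABLE Transfer(FROM Account TO Account, amount DOUBLE)")

    # The CSV (or Parquet) files are plain tabular sets of records.
    conn.execute('COPY Account FROM "accounts.csv"')
    conn.execute('COPY Transfer FROM "transfers.csv"')

    # From here on, the tabular data is queried as a graph.
    result = conn.execute(
        "MATCH (a:Account)-[t:Transfer]->(b:Account) "
        "WHERE a.owner = 'Nicolay' RETURN b.owner, t.amount"
    )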

Nicolay Gerold: Yeah, and what's next?

So what's on the horizon for Kuzu?

What are you building on?

What can you already teaser?

Semih Salihoğlu: Great, so we've got
several sort of major features

coming up in the first quarter of 2025.

One of them, that's going to
come very soon, is we're going
to have a Wasm version of Kuzu.

So you'll be able to
run Kuzu in the browser.

We already have a version of this.

We are essentially testing it.

It's very close to being released.

So there were a few people who asked
for this, something that you can run

in the, in, in your browser, a database
that runs in your browser, the graph

database that runs in your browser.

So that's coming up soon.

That was one of the environments
where you couldn't run Kuzu, but

hopefully we'll close that gap.

And then we had
not built a lot of indices
around the system, and we're

building two indices right now.

One is a vector index so you'll be able
to do nearest-neighbor queries over
floating-point vectors that you store in Kuzu.

You can already do that, but we
don't have a fast way of querying

or finding nearest neighbors.

And a lot of RAG and LLM based
applications that want to
experiment with retrieval techniques
require that feature.

So we're building that index and we're
building a full-text search index.

So there's several core retrieval
functionalities that Kuzu is getting.

And finally, the next thing that's
immediately coming up is we were

lacking a graph data science library,
like a package of off the shelf

graph algorithms you could run.

Say you got a graph and you want
to say, I want to run page rank or

I want to run connected components
or I want to run betweenness
centrality. To run these batch graph
algorithms, so far you'd have to move the data out.
We had integrations with some
graph libraries like NetworkX.

You could take your graph,
and in Python, not in maybe other APIs
we have, but in Python, you could,
with a single function call, convert
the query results into, say, a NetworkX
object and then in NetworkX run those
graph algorithms, but that has scalability

problems as well as performance problems.

What we are providing is now off the
shelf procedures that you can run

directly in Kuzu's Cypher dialect.

And run those algorithms and
compute those graph algorithms.

So those are the four things
that are immediately coming up.
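
For reference, a sketch of the current NetworkX route he describes (the one the built-in algorithms are meant to replace); it assumes the Python API from the earlier sketches plus the networkx package, and the table and column names are illustrative:

    import networkx as nx

    # Pull the edge list out of the database with a plain Cypher query.
    result = conn.execute("MATCH (a:Account)-[:Transfer]->(b:Account) RETURN a.id, b.id")
    g = nx.DiGraph()
    while result.has_next():
        src, dst = result.get_next()
        g.add_edge(src, dst)

    # Run an off-the-shelf batch algorithm outside the database; this copy-out
    # step is where the scalability and performance cost comes from.
    scores = nx.pagerank(g)
    print(sorted(scores.items(), key=lambda kv: -kv[1])[:5])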

Nicolay Gerold: Nice.

And if people want to start building
with Kuzu or want to learn more

about it, where would you point them?

Semih Salihoğlu: I think, come
to our documentation website, and

there's a getting started part.

We have a browser based command line
interface with a graph visualization.

That's a nice tool to use
to get started with Kuzu.

It's called Kuzu Explorer.

So I'd advise getting started
on our documentation page
and using Kuzu Explorer.

Nicolay Gerold: Nice.

And if people want to follow you,
where can they do that as well?

Semih Salihoğlu: We have a
LinkedIn and Twitter Kuzu page.

I think that's the best place
to follow Kuzu-related things.

I'm not particularly super
active on social media.

And we have an AI engineer, Prashant Rao.

Prashant is a lot more active than I am.

So Prashant usually breaks
all the news related to Kuzu.

And please sign up for our newsletters.

There's a newsletter on our website.

If you come to kuzudb.com
you can sign up for our
newsletter as well and follow

the developments around Kuzu.

Nicolay Gerold: So, what can we take away
when we are building our applications?

So, first of all, I think we
went way deeper into one specific

technology than I usually do
and really focused on one library.

But I think it can teach you
a lot when you're developing

libraries or new tools as well.

A lot of the stuff we are using
nowadays is a little bit older and
doesn't apply a lot of the techniques
and optimizations we could take advantage
of with modern hardware.

And we see, especially with Kuzu,
but also with DuckDB and Polars

that we can do so much on a single
node or embedded on users devices

because the hardware is so powerful.

We just need to optimize the stuff
to actually take advantage of that.

And here, like Kuzu does a lot, like
vectorization and stuff like that, where

you suddenly can run stuff where you
actually needed a really complex database

before, but now you can do it in memory.

And through that, you can
apply it to a lot of stuff.

So for example, if you're
using knowledge graphs with
your LLMs, you can use
Kuzu as a backbone. We had an
interview a few weeks ago with Robert Cork

from Ask News, they are basically building
knowledge graphs for the user specific

query on the fly and then query it.

And I think this is something
you can do here as well.

So in your application, you likely
are always querying on sub graphs.

So you basically construct a
subgraph, load it into Kuzu,

query it, and then you go on.

And for that, you basically, it's
way more efficient than having a

stand alone graph database, because
you don't deal with distributed

databases from the get go, where you
actually have to do really complex

joins on UUIDs and stuff like that.

What I think is really interesting is
the graph data science part, or
graph algorithms at large.

They are interesting for some
applications, but they are,

in my opinion, rarely used.

So it's often like recommendation
systems, really large scale recommendation

systems, fraud detection, or
influence analysis on social networks.

But for most of
us probably, the advanced query algorithms
aren't worth it, or they are so deep into
optimization that we will likely never
get there, because we can get a bigger
bang for our buck with simpler techniques.

For RAG, Kuzu
is suitable in my opinion. I think
in general for RAG, before you start
with something graph based,
first make sure that you actually optimize
the rest of your retrieval as well.

Because I think you
have a bigger bang for your buck
in vector and keyword search,
and in adding smart
metadata to the different documents,
than with putting a graph on top.

So before you add the graph,
then basically optimize these two

components and then add the graph,
but have a clear reason to do so,
a reason why you actually can't do the
same with simple metadata tags.

And yeah,

I think Kuzu gives us a lot of
different directions of how we

probably will build in the future.

So first it's embedded by design.

It's a library and not a server.

So it's a graph database
that runs locally.

So you can integrate it into scripts,
mobile apps, serverless environment.

I think we will see more and
more of that because our devices

become more and more powerful.

So on the server side, I can run more
on a single node.

And also I can push more
onto end user devices.

A second one is use columnar
storage, and this allows you to

actually have a way better scan
performance and also improves the IO

efficiency and also compressibility.

So basically you can
compress the data better.

And then you have the
vectorized query processing,
where with today's
CPUs, you can really take
advantage of the CPU cache more
by doing vectorized queries.

And lastly, I think like what they have
done with ASP join and vectorization,

it pays to have people who really can
optimize the shit out of an algorithm.

Um, whether it's actually
necessary for your own use cases.

That's something you have to know.

I think like for libraries, it's always
something interesting to build in,

but only if it actually makes sense
for you as well, but yeah, that's it.

Let me know what you think.

Uh, we went really deep into Kuzu.

I think the architecture decisions
behind Kuzu were really interesting.

And also the different use cases he
mentioned, what they are working
on, what they are thinking about,
and how Kuzu should be used. Let me

know in the comments what you think.

Otherwise I will catch you next week.

See ya.