Nicolay Gerold: Graph
databases struggle with scale.
Load a billion node graph, run a complex
query and watch performance tank.
And most systems crumble on
the large data sets or really
complex multi hop traversals.
Kuzu is an embedded graph database that changes this: it brings modern database techniques to graphs and can handle a lot of data with speed. So you can build fast analytics and embedded applications in Android apps, in serverless functions, or, with Wasm, in the web.
You can do quick experiments.
Build temporary graphs in memory,
process the data on the fly, run
analysis without persistence.
And it works by taking a lot of what we've learned over the years in databases and applying it to graph databases. So similar to DuckDB, it uses smart storage: it's column-first instead of row-first. And through that, it can do a lot of smart preprocessing to filter down data, so it actually limits the amount of data being scanned. It uses really fast joins and modern processing.
And today we're going to be
really diving deep into how graph
databases typically worked and
how Kuzu improves on top of that.
And when you want to go really deep
into knowledge graphs, I think you
can take away a lot from this episode.
Let's do it.
Semih Salihoğlu: Kuzu is an embedded graph database.
So basically what that means in
modern jargon is that we implement
Cypher and we are a library.
You can put us into your scripts, you can
put us into your Android applications.
You can put it into an AWS Lambda.
And you can put it on many
environments as a library instead
of setting up a server somewhere
and then connecting to the server.
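To make the embedded, serverless-friendly usage concrete, here is a minimal sketch of what using Kuzu as a library from Python might look like. The table names and data are invented for illustration, and API details can differ between Kuzu versions; treat it as a sketch, not official usage.

```python
import kuzu

# Open (or create) a local database directory; no server process involved.
# (Recent Kuzu versions also offer an in-memory mode for ephemeral graphs.)
db = kuzu.Database("./social_db")
conn = kuzu.Connection(db)

# Define a small schema and load a couple of records.
conn.execute("CREATE NODE TABLE User(name STRING, PRIMARY KEY (name))")
conn.execute("CREATE REL TABLE Follows(FROM User TO User)")
conn.execute("CREATE (:User {name: 'Alice'}), (:User {name: 'Bob'})")
conn.execute(
    "MATCH (a:User {name: 'Alice'}), (b:User {name: 'Bob'}) "
    "CREATE (a)-[:Follows]->(b)"
)

# Run a Cypher query and iterate over the result in-process.
result = conn.execute("MATCH (a:User)-[:Follows]->(b:User) RETURN a.name, b.name")
while result.has_next():
    print(result.get_next())
```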
So some of the use cases that we see are
related to the fact that we are embedded.
Some of the use cases that we see
are because we have an in memory
mode or a temporary, ephemeral graph
mode, meaning that we don't persist
your databases, you just build it
on the fly and then throw it away.
And then some of the use cases we see
are people building very large databases.
And they have frustrations with
other existing graph databases.
And they come to Kuzu because Kuzu is a
state of the art engine in this space.
So essentially, if I were to categorize the use cases based on the features or properties of Kuzu, I'd say they bundle into the embeddedness, the fact that we allow people to build ephemeral graphs, and the fact that we scale very well on large databases.
Originally Kuzu started as a research project because we were going after scalability and performance, and some of the usability decisions that we made, such as being embeddable and having an in-memory mode, have also been advantageous in terms of getting some use cases.
So let me give you an example from,
for example, this one use case
that uses us as an embedded system
and also uses this in memory mode.
This is something we wrote about.
This is a user called Bauplan. Bauplan is a function-as-a-service company. What that means is that they allow you to build complex serverless applications that consist of many serverless functions working together.
And what they're trying to do is
give you a good development interface
where you write your programs.
Maybe this is a data processing pipeline
that you're writing that consists
of, say, 20 different AWS serverless
Lambda functions that work together.
And you write it in Python.
As if you're writing essentially different
functions in Python, putting it together
and then they will compile it to the
serverless functions and run it for you.
And as you, as they do this
they will actually optimize and
also do some debugging whether
your program is correct or not.
So what happens is that somebody writes a program in Bauplan's Python interface, with maybe 20 Python functions, each of which will compile to a serverless function, and they click on compile and run. What Bauplan does is generate, on the fly, a graph of the computation that will happen, and it will do some optimizations to minimize the data that will be transferred between these functions, which is going to save the user AWS communication cost and serverless function running time, and it will also do many checks about the correctness of the program. And they do this essentially within a second or half a second: they create a small graph and run hundreds of queries against it. So this is a case where having an embedded graph database, where you are creating graphs on the fly and throwing them away, is useful.
So this was something obviously
surprising because that's not
how we started Kuzu at all.
Kuzu went for the billion scale graphs,
but this was a case where essentially
we benefited from some usability
decisions that we made that are really
orthogonal to the scalability and
performance decisions that we made.
Nicolay Gerold: Yeah, and I think that already leads us a little bit into the more technical aspects. You already mentioned that one set of Kuzu's use cases comes from people being unhappy with existing graph databases. I want to move a little bit into: what are the different features that actually make the performance of Kuzu better than its competitors?
Semih Salihoğlu: Yeah. So I can tell you a little bit about the origins of Kuzu and how Kuzu started. Kuzu is a spinoff from a research project that I had, which started about four years ago. But prior to that, about five, six years before that, so maybe eight, nine years ago at this point, I entered this space, the graph database space, as a researcher.
And when I entered the space, the first
thing I did was I did a user survey.
Um, so this is published
in VLDB and VLDB Journal.
And the questions we asked people were basically: what do you do with your graphs? How large are your graphs? We wanted to understand their graphs and their use cases. What type of applications do you build on graphs? And we wanted to understand what are your main challenges with existing technology, basically. And this was prior to me starting any kind of technical project in this space with my research group. What we learned in that user survey, which was a survey of about 90 people online, plus maybe 15 people we talked to face to face, is that a lot of people had major scalability problems. So there was this association that you essentially can't manage large-scale graph databases. Graphs are hard because graph queries are hard and existing technology doesn't scale. And it was very visible: scalability was the top problem people consistently told us about in that survey, and I think the actual highlight of the paper was that scalability is an issue.
So my tenure case, actually, when I took my position at the University of Waterloo, was to study that problem: if we were to develop a modern graph database engine, what would its architectural principles be that would make it scale and make it perform? After about five, six years of research, lots of articles that we wrote, and several theses on the topic that I advised, I had a clear picture of what that system would look like.
So, some of its core principles. One of the principles was that I was convinced the system should be columnar: the storage layer of the system, the physical design of how we are going to store the records, should be columnar instead of row oriented. In a system like Neo4j, for example, the physical storage of records is closer to row oriented. It's even linked-list based; it's an interesting, very update-friendly, but not very scan-friendly design.
Where you have essentially,
let's say, a node record.
And all the records will point
to their first edges and then the
first edge is going to point to its
second edge, so on and so forth.
But there's no kind of concern for
locality of all these records on disk.
Whereas a columnar design is optimized for scan performance: you want to put together the data that will be scanned together. And it also allows you to compress the data. So it really optimizes for scan performance and minimizes disk IOs when you're doing large scans.
So the insight for why I thought a graph database should be columnar was that a lot of the queries that people ask, first of all, scan large chunks of the database, because people essentially do a lot of traversal-based queries on graphs. But as they do these traversals, as they find the patterns or the paths that they're looking for, they're usually only going to read a few of the properties of these edges and nodes. So a columnar design is good for that, because it separates different properties from each other. So if nodes have, say, 20 properties, but a query only accesses one of them, you get the ability to essentially touch a very small part of the node records.
So that was, for example, one sort of storage-level architectural decision. We had a paper on this back in, I think, 2019 or 2020 in VLDB; it was a thesis that I advised.
Another thing was that I worked a lot on several core query processing techniques, novel processing techniques, I should say, that are specifically designed to address the scalability challenges that happen during query processing on graph patterns.
So a fundamental property of graph queries is that graphs generally represent data where there's a lot of many-to-many connections.
So classic examples are just social
networks or financial transaction
networks, where an account, for
example, might transfer a lot of
money to a lot of other accounts.
So that's a many to many relationship.
And those accounts may be transferring
money to many other accounts.
So if you're looking for a path, then by the nature of the query and the nature of the data, the search space within which you're looking for a pattern grows exponentially.
So this exponential growth is essentially
a major problem with scale, like
essentially querying over large graphs.
So just simple math: if on average each node has 1,000 connections and you do a three-hop query, you're essentially looking at a search space of 1,000 cubed, 1,000 times 1,000 times 1,000. So it just grows. That's already a billion possible patterns within which you're looking for the pattern you're searching for. So I advised a lot of work, we did a lot of work, on addressing this problem.
How could you essentially
mitigate this growth?
Can you do some optimizations during query processing so that, instead of enumerating a space of size, say, one billion as in my example, you can do something much smaller and look for the pattern in a smaller space?
So there, one sort of very good
technique in the space is factorization.
Factorization is essentially a
compression technique, but it's very interesting because compression in databases is almost always used at the storage level.
When you're storing your records, you
compress them so that you minimize IO.
But factorization is a technique.
It's a compression technique
that's designed to compress the
data, the intermediate data that
query processes generate, not
the ones that you store on disk.
So, for example, factorization can take the intermediate results of those two-hop, three-hop queries and represent them in a much more succinct way. I can get into the details of that, but we did a lot of work on how to essentially increase the scalability of these systems, query processing scalability, if you will, so that some of these harder queries can be answered in a much more performant manner.
So I had all those techniques, and some of these we can get deeper into. All of these techniques were results of many years of research that we did, at the end of which we said, okay, let's put all of these together in a new system. And Kuzu is that system that tries to integrate all the state-of-the-art knowledge we have about how to manage and query large-scale graphs.
Nicolay Gerold: And to maybe really nail home the difference, especially between row-based and column-based, can you maybe go through one example of how a query would look if you do it row-based versus column-based? And to really drive home the point of this escalating complexity, so that I don't have so many different options to query through.
Semih Salihoğlu: Yeah, so I
can give a very simple example.
So for example, and this is not going
to be graph specific, but just so
that I can clarify the difference
between a row oriented storage
and a column oriented storage.
So if you take, let's just
focus on relational systems.
If you take a system like Postgres, Postgres is row oriented; it has a row-oriented storage design.
What that means is that if you've got a table of, let's say, accounts, and there are maybe 20 properties, the owner, the address of the owner, the balance, the date it was opened, so on and so forth, the way this will be stored on disk is that there will be one record that stores the 20 properties of each row essentially next to each other.
Once that is finished, you'll have the
second record and then the third record.
In a column oriented system, what you'll
have is that you'll take the first
property, say maybe the name of the owner,
The name of the first record and the
second record and third record, all of
those will be stored next to each other.
And then in a separate part of the disk,
you'll store, say, maybe the balances of
each of these accounts, one after another.
So the balance of the first account will
be the first integer, the balance of
the second account, so on and so forth.
So you essentially split
each record into 20 pieces.
And then store essentially in 20
different contiguous sectors, each
property or each column of this record.
And if you've got a query that,
for example, looks for something
selective, give me all accounts that
were opened on a particular day.
Now, what you can do is you can essentially scan only that property. You don't need to scan each record, so you don't need to scan all of the database on disk. Maybe the whole database is 100 gigabytes, but let's say the date properties of when the accounts were opened add up to maybe only 500 megabytes if you put all of them together. So you can scan only the 500 megabytes and then find the ones that were opened on a particular day.
And maybe there's two records.
And then you can go do 20 different IOs
to consolidate the rest of the properties.
So that's a case where you can,
instead of scanning 100 gigabytes
of data, to find those records,
scan only 500 megabytes of data.
That's essentially one of the main insights of columnar databases in general, and it's a topic that's three decades old at this point and has spawned many systems; modern analytics-oriented systems use column-oriented designs. And our insight was that a lot of graph queries are analytics oriented. So instead of optimizing the system for updates, we optimize the system for scans. That's essentially the justification for why a system like Kuzu is columnar.
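As a rough illustration of the row-versus-column trade-off described above, here is a small, self-contained Python sketch (toy data, no real database) showing why a selective filter only needs to touch one column in a columnar layout:

```python
# Row-oriented: every record is stored (and scanned) as a whole.
rows = [
    {"owner": "Alice", "balance": 1200, "opened": "2020-01-15"},
    {"owner": "Bob",   "balance": 300,  "opened": "2021-06-01"},
    {"owner": "Carol", "balance": 950,  "opened": "2020-01-15"},
]
matches_row = [r for r in rows if r["opened"] == "2020-01-15"]  # touches every property of every record

# Column-oriented: each property lives in its own contiguous array.
owner   = ["Alice", "Bob", "Carol"]
balance = [1200, 300, 950]
opened  = ["2020-01-15", "2021-06-01", "2020-01-15"]

# The filter only needs to scan the 'opened' column...
hit_positions = [i for i, d in enumerate(opened) if d == "2020-01-15"]
# ...and only then fetches the other columns for the (few) matching positions.
matches_col = [(owner[i], balance[i], opened[i]) for i in hit_positions]
print(matches_row, matches_col)
```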
Nicolay Gerold: Yeah, and how would this look, for example, if you have a social media graph? I think that's the dominant example. You basically have a query: I have a user, let's take me, Nicolay Gerold, and I want to find people who are followers of my followers. So first I have to identify the starting node, which is me. I find it, and then with the column database I basically filter down to the nodes which have a follow connection to Nicolay Gerold. And through that I don't have to do the row retrieval, where I get the row that Nicolay Gerold is in, then get all of the different edges, follow them, and then do another hop to follow them again. Instead, I can use the separate columns to filter them down already and decrease my search space.
Semih Salihoğlu: So, this is, I think, partly accurate, but let me maybe separate two things.
So the columnar record storage is orthogonal to, I think, another thing you're mentioning in your example, which is the ability to follow specifically your connections.
So that's a feature that graph databases have, that relational databases do not have, which is that in a graph database you say: here are my users and here are the other users that they follow.
Here are the follow
essentially relationships.
What graph databases, you,
universally, not just Kuzu, Neo4j,
but take any other graph database.
What they'll all do is that they'll
build a join index automatically.
You don't have to, as a user, tell
the system to build this index.
The system will automatically
build a join index.
So what that means is that they will build
an index so that for a particular user,
say Nikolai, you'll be able to find The
people Nikolai follows, and the people
that follow Nikolai with a single IO.
So they'll, that's essentially an
index structure, um, that is separately
built automatically in graph databases.
And that's, in some sense,
that's orthogonal to the columnar
design that I was talking about.
So I, in my example, I gave a column
there essentially design of node
records, but you can also have a
column design of these relationship
records that are being indexed.
So for example if your follow
relationships had many properties
on each follow relationship, maybe
you put many properties, like maybe
the date, the following started.
Or how many times they have seen each
other's posts or something.
Um, now, all of those properties
also need to be, so those, now
those are relationship records.
They also need to be stored somewhere.
And suppose, you can again make a choice
there as an architect of the database.
Do I want to store these properties of
these relationships in a row oriented
manner or in a column oriented manner?
And your query, for example, doesn't care about the date the following started or how many times these people have interacted.
So in a column oriented design, you
can essentially avoid scanning any
of those and just scan Essentially
the fact that they're connected to
each other, which is usually another
kind of property of the relationship.
And completely avoid scanning any of
the properties of these relationships.
Whereas a row-oriented system forces you to scan the entire records, the relationship records.
You are forced to scan also all the
properties that come with the relationship
record, this follows record, that you may
not be needing for the query processing.
So columnar versus row oriented is
essentially a choice about how much
of the records do you want to read
when you're answering the query.
And columnar systems give you the advantage that you can read as few parts of the records as the query needs.
But obviously there's a catch.
The catch is that when you're
producing your final outputs,
you have to do extra work to go
and grab the rest of the records.
Suppose you want to find the users that Nikolai follows, and maybe also the day the following started.
So like you are trying to
return some of these properties.
So columnar systems may scan little to find the final set of records, but at the end they need to go through an extra step: if the final query output requires other properties of the records, properties that maybe weren't needed for the query processing or to find the records, but that are needed because the user is returning them, now you need to do extra work to put things together.
Whereas in a row oriented system,
because you've scanned the entire record during query processing anyway, you can just directly output
them, but overall, I think it's very
well understood that if you've got a
system, you're building a system that
is meant to be analytics oriented.
That's meant to optimize for analytics
oriented applications, meaning
applications that ask a lot of queries,
but maybe they do batch updates
instead of a ton of small updates.
They're not OLTP type of applications.
Columnar systems have a big advantage there.
Nicolay Gerold: And in this traversal, how do CSRs, compressed sparse rows, come into play for the traversal or for the adjacencies?
Semih Salihoğlu: Yeah.
So CSR is a specific physical design
for the join index that I mentioned.
So there are different ways
you can store the join index.
The join index basically allows you to locate all the relationships of a particular node very quickly. You should think of it as: with a single IO, with a single lookup, you can find any of your node's relationships, or at least start scanning them.
Again, in the storage of that index there are different designs. You could obviously build a tree-based classic index, B+ trees, and try to have a join index that will also give you quick access to sets of records. CSR is a particular physical design that's very popular for graphs in general; in-memory graph analytics libraries use CSR-based indices a lot.
Kuzu builds a disk-based CSR index, and CSR stands for compressed sparse row; it's essentially a columnar design. A good way to understand what it is: suppose you have a from column and a to column. That's the essence of the relationship that you're storing. You've got from nodes and you've got to nodes.
And let's assume that these from and to values are integer IDs that identify the nodes. So (1, 2) means node 1 follows node 2; (2, 5) means node 2 follows node 5. And you've got these records (1, 2), (2, 5), etc.
So that's your essentially
essence of the relationship.
Again, the relationships could have
other properties, but let's ignore that.
The join index does not
usually store those properties.
So you've got a from column and a to column, and they are the IDs of these nodes. And if you sort these records by the from column, you're going to get something like (0, 1), (0, 5), (0, 7), (1, 2), (1, 5), and so on and so forth. So the from column will be sorted. And now that the from column is sorted, you can apply run-length encoding compression to it. Run-length encoding basically says: if the from column values are like (0, 1), (0, 5), (0, 7), that means there are three zeros, so you just store zero and three instead of storing zero three times, and maybe if there are five ones next, then you store one and five. If you did that sort of storage, that is actually the essence of CSR storage.
CSR storage is very similar to essentially storing all the to's consecutively, and then, for the from column, you store just how many edges each node has; you don't necessarily have to store anything else. So you essentially end up compressing the from column a lot.
And if you're storing the backward
edges, Now you would have a to and
from relationship, and then you
would sort by to and compress the to.
So that's the essence of compressed sparse row. That's why it's essentially a columnar storage as well, and Kuzu stores its join index on disk in this compressed format, the CSR format.
And that contrasts with, say, Neo4j. Neo4j links these records with each other and doesn't have the guarantee, for example, that the relationships of a particular node, the follows of, say, Nikolai, are next to each other; they're just linked to each other. So with one IO you can always go to the next edge, but obviously there's no locality guarantee as the database grows. Kuzu instead always guarantees that it will store the edges, the relationships in that index, in this CSR format.
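To make the CSR idea concrete, here is a toy Python sketch of a compressed sparse row structure built from sorted (from, to) edge pairs. It is a conceptual illustration only, not Kuzu's actual on-disk layout:

```python
# Edges as (from, to) pairs, e.g. "node 0 follows node 1".
edges = [(0, 1), (0, 5), (0, 7), (1, 2), (1, 5), (2, 0)]
num_nodes = 3  # nodes 0, 1, 2 have outgoing edges here

# Sort by the 'from' column, then store all 'to' values consecutively.
edges.sort()
neighbors = [t for _, t in edges]            # [1, 5, 7, 2, 5, 0]

# Instead of repeating each 'from' value, store only how many edges each
# node has, as offsets into the neighbors array (the "compression" part).
offsets = [0] * (num_nodes + 1)
for f, _ in edges:
    offsets[f + 1] += 1
for i in range(num_nodes):
    offsets[i + 1] += offsets[i]             # [0, 3, 5, 6]

def out_neighbors(node):
    # One contiguous slice per node: a single lookup plus a sequential scan.
    return neighbors[offsets[node]:offsets[node + 1]]

print(out_neighbors(0))  # [1, 5, 7]
print(out_neighbors(1))  # [2, 5]
```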
Nicolay Gerold: Interesting.
So we've talked a lot about the
storage or memory layout of Kuzu.
What we've missed until now is the compute
part and especially the vectorization and
how it makes the queries more efficient.
Can you go a little bit into the
vectorization part and explain it,
how it works and how it contrasts
to regular graph databases?
Semih Salihoğlu: Yeah, so this is also, again, vectorization is not a very graph-database-specific technique. It's a technique that originated in columnar relational systems, essentially in the late 1990s and 2000s, and it's very widely used now for, again, analytics-oriented, columnar systems. This is a widely used technique.
And what it contrasts with is a more traditional database query processing architecture, which is called tuple-at-a-time processing. In basically all database management systems, relational systems, graph databases, or even RDF systems, any system that implements a high-level query language, what ultimately happens is that those high-level query language clauses are compiled down into an expression tree, a relational algebra expression tree whose leaves are scans of tuples and whose intermediate nodes are relational operations like maybe joins, maybe projections, maybe aggregations, so on and so forth.
Imagine essentially a very simple expression tree. Let's look at a very simple query: say, count all the nodes whose name is Nikolai. A very simple expression tree for this: there's going to be a scan node that scans your nodes, so maybe, say, user nodes. And then a filter; the scan could also implement the filter, but let me simplify it and assume there's another operator called filter.
What it does is that it just checks if the
tuples that it receives have name Nikolai,
so it's going to run that expression.
And if it does, it's going to give it
to the next operator, which is count.
So there's three operators, a count
operator, an aggregation operator, a
filter operator, and a scan operator.
In a traditional tuple at a time
processing, what happens is that
the processing starts from the count
and count says to its child, which
is filter, give me your next tuple.
And then the filter calls its
next child, which is scan,
says, give me your next tuple.
And then they pass essentially
one tuple between each other
until you exhaust the scan.
Now, not a lot of systems, I think, are built this way anymore, certainly not a lot of analytics-oriented systems. This very old architecture has the advantage that it's very simple, but it has the disadvantage that there's a lot of function calls happening between these operators. They keep calling getNextTuple: if you've got a billion tuples you're scanning, you're going to make, like in my example, maybe three billion getNextTuple calls.
Vectorization, this technique called
vectorization, I think a better name
for it is block based processing.
That appears in some in, in
some literature, it appears
as block based processing.
What block-based processing does is that it passes a block of tuples between operators instead. Each time you say get next tuples, you essentially get a block of tuples, like 1024, let's say, or 2048 tuples. So it amortizes the cost of calling these getNextTuple function calls.
So what happens is that now you need to
rewrite your operators, though, because
your operators can no longer assume
that they're getting a single tuple.
They all need to be rewritten.
Every single primitive operator in
the system needs to be rewritten in
tight for loops, which is actually very
good for instruction locality as well.
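A rough sketch of the difference between tuple-at-a-time and block-based ("vectorized") processing, using plain Python generators and an invented filter predicate; real engines work on typed column vectors, but the call-overhead amortization is the same idea:

```python
data = list(range(1_000_000))
BLOCK = 1024

# Tuple-at-a-time: one function call per tuple, per operator.
def scan_tuples():
    for value in data:
        yield value

def filter_tuples(child):
    for value in child:          # millions of per-tuple calls bubble up the plan
        if value % 7 == 0:
            yield value

count = sum(1 for _ in filter_tuples(scan_tuples()))

# Block-based ("vectorized"): operators exchange blocks of ~1024 tuples,
# and each operator runs a tight loop over the block it receives.
def scan_blocks():
    for i in range(0, len(data), BLOCK):
        yield data[i:i + BLOCK]

def filter_blocks(child):
    for block in child:          # one call per 1024 tuples amortizes the overhead
        yield [v for v in block if v % 7 == 0]

count_vec = sum(len(b) for b in filter_blocks(scan_blocks()))
assert count == count_vec
```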
So it's well understood now that this type of vectorized processing is good if you want to build a query processor that is designed to process lots of data. Because analytics-oriented systems make the assumption that their query processors are going to compute on a lot of data passing through these operators, vectorization is essentially a common-wisdom architectural principle that minimizes these calls. That again contrasts with systems like Neo4j, which, the last time I looked at it, was this tuple-at-a-time processor. So those are older architectural principles that have performance disadvantages if you're building an analytics-oriented system.
Nicolay Gerold: Yeah, and for me...
Semih Salihoğlu: But I want to be clear, that's not really a graph-database-specific optimization.
It's essentially an optimization for any system that is being built to have a read-optimized query processor. If you want to build a highly optimized query processor that is designed to do a lot of computation on a lot of data, then vectorization is a good idea.
It doesn't have to be
graphs, relational or RDF.
It's just a good idea to optimize your
processor to work on a lot of data.
Nicolay Gerold: Yeah, and it's in the
end taking advantage of modern hardware.
The, especially, we see it in
a lot of different databases.
Because we have more powerful computers,
we don't need so many distributed
systems anymore, so we can go back
onto single node and take advantage
of something like vectorization.
Semih Salihoğlu: Yeah, exactly.
Yeah, I completely agree.
A lot of these common wisdom
techniques, if you put them together
essentially pushes the state of the
art, certainly in graph databases.
I think a lot of the graph databases
that we studied were not built with these
common wisdom architecture principles.
And some of these are again, not
even graph database specific.
They're just general, broad
data management principles.
And they were built on these older architectural principles. So one of the gaps that we saw was that there's a need for a state-of-the-art system that puts some of these common-wisdom architectural principles together with some of the graph-specific ones that my research group developed, and Kuzu aims to be such a state-of-the-art system.
Nicolay Gerold: Yeah, and I think
that's a cool part in data and AI.
You basically put together a lot
of optimizations to arrive at the
state of the art and don't have one
single core idea, but you pick and
choose from different areas from
like systems engineering, networking.
More like the hardcore, like data
engineering side, put it together
to build like a system that's better
than what we have seen so far.
Semih Salihoğlu: I agree.
I think it's in the nature of a lot of
system oriented projects and products that
the whole is much bigger than the parts.
There's a lot of parts that you
have to do, but, a lot of them
don't require a lot of innovation.
You just need to follow, I think, the
modern state of the art principles,
and this happens a lot in databases.
Every decade or so, the field moves and we understand better how to build modern, fast engines, and then the newer systems are able to adopt them, while for the older systems it's just much harder to adopt those principles. And that's, I think, in the nature of the field.
Nicolay Gerold: Yeah.
And we touched on a lot of
the technical aspects of Kuzu.
One thing I want to go into as well is the ASP join, and not the complete technical details, but rather the core ideas behind it, what it replaces, and how it improves on the existing join algorithms we have in graph databases anyway.
Semih Salihoğlu: Yeah. Good. So, ASP is the term we used in the CIDR Kuzu paper, which was published before we even became a company and spun off to commercialize Kuzu. In the actual implementation I don't know what term we use for that join operator, I think we call it multi-way join or something, but I can get into the principles of what we were trying to do. To explain that I need to explain factorization a bit better. The high-level overview I gave of factorization is that it compresses intermediate relations, the intermediate tuples that query processors generate during graph pattern querying.
So the example that I gave was like a
few hop query about, maybe the accounts
that Nikolai's account transferred
to and maybe two hop accounts.
So let's maybe take another, simpler example. Suppose the query is: give me all the two-path transfers that Nikolai could be facilitating. Meaning you've got one account node, let's call it Nikolai's account. That node has some incoming transfers and it makes some outgoing transfers, and somebody, let's say a fraud analyst, is interested to see what sort of money flows Nikolai could be facilitating, where it's the middle account between the others.
Okay, so if you've got essentially a thousand incoming transfers and a thousand outgoing transfers, then the number of two-paths that Nikolai's account is part of is 1,000 times 1,000. There's a million possible different combinations of two-path money flows that Nikolai could be facilitating.
Which means that if I ask you that query in Cypher, match maybe a node A that has transferred money to node B that has transferred money to node C, where the middle node B is owned by Nikolai, the output of that query will contain a million tuples. Even though there are only 1,000 incoming transfers and 1,000 outgoing transfers, the fact that these are many-to-many joins, this many-to-many pattern growth, leads to a very large number of patterns.
In factorization, there's a different way you could have stored that. If you take any relational system and run an equivalent SQL query, that system is going to, in its query processor, really generate a million intermediate tuples and materialize them in memory. They'll flow through memory, but it's going to materialize them. But in a factorized way, you could have represented that 1,000,000 tuples as a Cartesian product of much smaller pieces. You could have said: my intermediate relation is the 1,000 incoming transfers, Cartesian product with Nikolai's account, Cartesian product with the 1,000 outgoing transfers. Now you've got a much more compressed representation that contains only roughly 2,000 records. It does represent a million records, but in memory you've only represented it as 2,000 records.
And suppose all you want to do is count the number of two-path flows Nikolai could be facilitating. Now I don't even have to ever uncompress it. I can just very quickly count the number of tuples there: it's 1,000 times 1,000, so I can say the correct answer is 1,000,000.
So factorization is this idea.
It's that idea that when there
are many to many joins, you can
represent the intermediate result relations that are generated in a
much more compressed way by using
Cartesian products of sets of records.
So that, that is the
idea of factorization.
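Here is a tiny Python sketch of that factorization idea with the numbers from the example. The account names are made up, and the point is only that an aggregate like COUNT can be answered from the compressed, Cartesian-product form without ever expanding the million tuples:

```python
# 1,000 incoming transfers into Nikolai's account, 1,000 outgoing transfers out of it.
incoming = [("acct_in_%d" % i, "nikolai") for i in range(1000)]
outgoing = [("nikolai", "acct_out_%d" % i) for i in range(1000)]

# Flat (unfactorized) intermediate result: materialize every 2-path explicitly.
# This list would hold 1,000,000 tuples:
# flat = [(a, "nikolai", c) for a, _ in incoming for _, c in outgoing]

# Factorized representation: keep the two sides separate and remember that the
# full result is their Cartesian product around the shared middle node.
factorized = {"middle": "nikolai", "left": incoming, "right": outgoing}  # ~2,000 records in memory

# An aggregate like COUNT never needs to expand the product.
count = len(factorized["left"]) * len(factorized["right"])
print(count)  # 1000000
```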
This ASP join that we designed is highly detailed; it's going to be difficult to explain each component. But it was essentially trying to get good factorization structures, which means good Cartesian product structures. And it was also trying to make sure of another thing: one thing we wanted to avoid in Kuzu, which a lot of graph database systems do, was non-sequential scans; we wanted to avoid those as much as possible.
So what do I mean by this? Let's maybe take another query: Nikolai's two-path transfers where Nikolai is the originator. A common query plan for this query, now making node A Nikolai's account, is: you go to Nikolai's accounts, and maybe there's a thousand of those, and then for each one of those, you go to a thousand other accounts.
And if you need to scan some properties of these accounts, a common, and very bad, query plan is one where you essentially get non-sequentiality when you go to Nikolai's accounts. You're going to a thousand different records, but there is no guarantee that they will be located consecutively on disk. And when you go to their accounts, you lose even more locality. So a very bad way to process this is that you go to Nikolai's first account, and then from that do another random IO to go to 1,000 other accounts, and then you go to Nikolai's second account, do another 1,000 random IOs, and so on and so forth. That's a very bad way to process this.
And our observation was that relational systems never do processing like this. What they do is use hash joins, which means they'll take, say, the accounts and the transfer tables, and they'll always scan sequentially, and any randomness will happen in the intermediate hash table lookups. So what we wanted to do was also ensure that we behave like a relational system's hash joins, in the sense that we always scan sequentially when we do these graph pattern evaluations. So those are the ASP join's components, and I won't be able to explain the full ASP join because it gets very low level and I'd maybe need a board to explain it.
But the problem we were trying to solve was essentially that we wanted to get good factorization structures, plus we wanted to ensure that we always scan sequentially. And our realization was that a lot of the time these conflict. A lot of the time you can't get good factorization structures and get sequential scans. So what we had to do was apply this technique called sideways information passing.
Sometimes we store some intermediate data in memory, figure out what needs to be scanned, and pass that down to the scans as information, and the scans make use of this information about which records will successfully join, so which records do I need to scan. As I'm explaining this, I'm realizing it's very hard to explain something like the ASP join, but there's essentially this core idea of trying to get good factorization structures.
Trying to ensure that scans are sequential. And, sorry, the third piece that we were trying to get: graph databases have the disadvantage of doing non-sequential scans, but they have the advantage that they always scan only the parts that they need. So, for example, in that example that I gave of two-hop joins, even though you're doing non-sequential scans, you're never scanning any record that you don't need. You're doing them randomly maybe, you're going to your first-degree neighbors, but you're always scanning something useful, and that's something relational databases don't have.
Relational databases, when they join things, will scan the entire tables. Even though you need maybe only Nikolai's 1,000 transfer records, if your transfer table contains a billion records, they'll still scan all billion records. So that was the third thing we were trying to get, and I should have clarified that. And if you want to get that as well, and make scans sequential, what you have to do is build these hash-join-type plans where you get, say, Nikolai's record, you understand that you need to scan only Nikolai's transfers, and you pass that information about Nikolai's tuple sideways to the transfer table scan operator and tell it: only scan the transfers where the from account is Nikolai's account ID, maybe account seven. So you can avoid scanning all 1 billion records and scan only those 1,000 records.
So maybe to summarize it: we were essentially trying to behave like relational systems' hash joins, but have the ability to (a) do factorization and (b) only scan the parts of these relationship records and node records that really need to be scanned, which is an advantage of graph databases. But the traditional processing in graph databases was something we didn't like much, because of the non-sequential scans that it leads to. The ASP join was our solution, and it is one of the core join algorithms implemented in Kuzu.
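The following is a much-simplified Python sketch of the sideways-information-passing idea, not the actual ASP join: the IDs found by one scan are passed down to the next scan so that it reads only the slices of a join-index-like structure it actually needs. All table contents and names here are invented:

```python
accounts  = [{"id": i, "owner": "nikolai" if i == 7 else "other_%d" % i} for i in range(1000)]
transfers = [(i % 1000, (i * 37) % 1000) for i in range(100_000)]      # (from, to) pairs

# A join-index-like structure: all transfers grouped (stored together) by 'from' account.
by_from = {}
for f, t in transfers:
    by_from.setdefault(f, []).append(t)

# Step 1: sequential scan of the accounts table; collect the matching account IDs.
nikolai_ids = [a["id"] for a in accounts if a["owner"] == "nikolai"]

# Step 2: pass those IDs sideways into the transfer-scan step, which then reads
# only the slices of the index it actually needs instead of the whole table.
def scan_transfers_for(account_ids):
    for acc in account_ids:
        for dest in by_from.get(acc, []):
            yield (acc, dest)

print(sum(1 for _ in scan_transfers_for(nikolai_ids)))   # only Nikolai's transfers are touched
```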
Nicolay Gerold: Yeah. And I think one of the downsides of columnar databases is writes, and especially also updates. Usually in something like Parquet, which is also columnar, you delete the file and basically create a new one. How long does the write path in Kuzu actually take until the data is also indexed? And a little bit on top of that: should you optimize for a read-heavy workload with Kuzu, and for example batch writes and minimize concurrent writes?
Semih Salihoğlu: It's a good observation. I think it's a fundamental design decision that you make when you're starting to build your system: do I want a system that's optimized to take in small writes very fast, an OLTP-like engine?
Or do I want a system that is
optimized for analytic queries,
which means read heavy queries.
And we consciously made the latter.
We are a read oriented system.
So you are right that columnar systems, because of certain choices that they make, simply the fact, for example, that you separate each record, in my example there were 20 properties, into 20 different columns, into different parts of the disk, means that if I insert a single tuple, I need to write to 20 places, right? If I had a row-oriented system, I could write that record consecutively to a single place, which means you only have to do one IO, for example.
There is this fundamental trade-off that you make. You can't get both of these, I think. You have to make a conscious decision.
So it is fair to say that columnar systems, and Kuzu is one of them, are more performant if you batch your writes and do your writes in batches instead of doing, I don't know, 10,000 simple transactions, each one adding one relationship or one node record. That's not a case where you're going to perform particularly strongly. But there are certain optimizations that we do to reduce the time that it takes to commit write transactions.
This is particularly a
problem with the join index.
When you do CSR based join index, right?
The original design of CSR
is that everything is packed.
So suppose Nikolai's account is node 7, and node 7 now has a new transfer that has to go into the join index. The way I described the CSR index, there are no gaps for me to add a new edge into that to column, the second column, right?
So what we do in Kuzu, for example, is that the second column, the one that stores the actual neighbor node IDs, if you will, actually contains gaps: after node zero's set of edges we store some gaps, after node one's set of neighbors we store some gaps, so that we essentially amortize the cost. We make it more update friendly, so that if you get a single insertion, Nikolai gets another transfer, there is usually going to be a gap there and we can directly write it at the end instead of re-sorting and reshuffling the entire column.
So we do several optimizations like this. If you're familiar with it, there's a popular data structure called the packed memory array. A packed memory array is essentially a way to keep elements sorted by leaving gaps, and it compares to tree-based indices that also keep records sorted. So our columnar design for the join index, the CSR-based index, is essentially based on the packed memory array. We amortize; we make that columnar storage more update friendly, for example.
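A toy sketch of that gapped, packed-memory-array-style idea: spare slots are left after each node's neighbor slice so that a single edge insertion usually lands in a gap instead of reshuffling the whole column. The sizes and layout here are invented, not Kuzu's real on-disk format:

```python
# Neighbor column for nodes 0..2, with gaps (None) reserved after each node's edges.
neighbors = [1, 5, 7, None, None,    # node 0's edges + gap
             2, 5, None, None,       # node 1's edges + gap
             0, None, None]          # node 2's edges + gap
starts   = [0, 5, 9]                 # where each node's slice begins
lengths  = [3, 2, 1]                 # how many edges each node currently has
capacity = [5, 4, 3]                 # slice sizes including the gaps

def insert_edge(src, dst):
    # If there is still a gap, the new edge is written in place: no reshuffling.
    if lengths[src] < capacity[src]:
        neighbors[starts[src] + lengths[src]] = dst
        lengths[src] += 1
    else:
        # Out of slack: a real system would rebalance or extend this region here.
        raise NotImplementedError("slice full; rebalance needed")

insert_edge(1, 42)                   # node 1 gets a new neighbor without moving node 2's data
print(neighbors)
```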
Nicolay Gerold: Yeah, to maybe close it out: I think graph databases are gaining traction at the moment, but they're still very hard to adopt. What would you say is missing to make it actually easier for developers to adopt graph databases? What's the missing link?
Semih Salihoğlu: It's a hard question to answer. Part of the reason graph databases are going to remain a somewhat niche industry is relational systems: the amount of education that exists on relational systems is incomparably larger than for other classes of database management systems.
So every computer science curriculum
in the world teaches a database
course and in that database
course you're going to learn SQL.
So there's a lot more familiarity
with SQL, and I don't think
you can compete with that.
There are essentially directions, I think, that could increase the popularity of graphs, a couple of ways to increase the popularity of graphs. One of them: property graph databases, the kind that Kuzu is based on, which essentially follows Neo4j and the class of systems based on Neo4j, have the advantage that the data model, even though it's graph oriented, even though it's object oriented, is actually very close to the relational model.
So basically you have node records. You basically have tables of records, but for each table you just indicate if it's an entity or a relationship.
So that's like the primary
difference between the property
graph model and the relational model.
Especially in Kuzu's version of the property graph model, where we force you to predefine your schema, you have tables of records, but you just need to tell us: okay, this account table, that's actually an object, that's actually a node, whereas these transfer records, that's actually a relationship. So that's the difference. You give this additional information, and based on that information, we optimize.
But to come back to my point: property graph databases have the advantage that, because they're so close to the relational model, they're actually very easy to understand. And we can build features on these systems that make it easier to query directly off of relational records.
So suppose you got, say, CSV
files and you got parquet files.
A lot of people dump their
records in this tabular format.
These formats are essentially
sets of records, right?
So sets of records identify certain
information in your data model, so
they could be identifying your accounts
and the information of your accounts.
If we built features around these systems that close the gap even further, where you say: okay, these sets of records that are in CSV files or Parquet files, I want to model them as nodes, and these other sets of records, they're my edges. Then you're even closer to the relational model, because you're just providing some kind of mapping. You're not even taking the big mental step of re-representing those Parquet files inside your graph database system.
It's more like you're doing a mapping, saying that the graph database system is able to query tabular formats as well. So I think that could help: if you had features that allowed you to do this, like a system like DuckDB, which can scan and query over raw Parquet and raw CSV files, you could build similar features in graph databases as well. That's something we're prototyping in Kuzu: how do we build features that allow you to query these raw tabular records also in the form of a graph? I think that could also simplify things for people. The more you simplify querying records as a graph, the more you can contribute to popularizing graph databases, I think.
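As a hedged sketch of what mapping tabular files onto a graph can already look like via bulk import (the file names and schema are invented, the exact syntax may vary by Kuzu version, and the direct query-over-raw-files features discussed here are described as prototypes):

```python
import kuzu

db = kuzu.Database("./finance_db")
conn = kuzu.Connection(db)

# Declare which tabular data are nodes and which are relationships...
conn.execute("CREATE NODE TABLE Account(id INT64, owner STRING, PRIMARY KEY (id))")
conn.execute("CREATE REL TABLE Transfer(FROM Account TO Account, amount DOUBLE)")

# ...then bulk-load the existing files into those tables.
conn.execute("COPY Account FROM 'accounts.parquet'")
conn.execute("COPY Transfer FROM 'transfers.csv'")

# From here on, the tabular records can be queried as a graph.
result = conn.execute(
    "MATCH (a:Account)-[t:Transfer]->(b:Account) "
    "WHERE a.owner = 'Nikolai' RETURN b.owner, t.amount"
)
while result.has_next():
    print(result.get_next())
```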
Nicolay Gerold: Yeah, and what's next?
So what's on the horizon for Kuzu?
What are you building on?
What can you already teaser?
Semih Salihoğlu: Great, so we've got several major features coming up in the first quarter of 2025. One of them, which is going to come very soon, is a Wasm version of Kuzu, so you'll be able to run Kuzu in the browser.
We already have a version of this.
We are essentially testing it.
It's very close to being released.
So there were a few people who asked for this, something that you can run in your browser, a graph database that runs in your browser.
So that's coming up soon.
That was one of the environments
where you couldn't run Kuzu, but
hopefully we'll close that gap.
And then, we had not built a lot of indices around the system, and we're building two indices right now. One is a vector index, so you'll be able to do nearest-neighbor queries over floating-point vectors that you store in Kuzu.
You can already do that, but we
don't have a fast way of querying
or finding nearest neighbors.
And a lot of RAG and LLM-based applications that want to experiment with retrieval techniques require that feature. So we're building that index, and we're also building a full-text search index. So there are several core retrieval functionalities that Kuzu is getting.
And finally, the next thing that's immediately coming up: we were lacking a graph data science library, a package of off-the-shelf graph algorithms you could run. Say you've got a graph and you want to run PageRank, or connected components, or betweenness centrality; to run these batch graph algorithms, you'd so far have to move your data. We had integrations with some graph libraries like NetworkX: in Python, not in the other APIs we have, but in Python, you could with a single function call convert the query results into, say, a NetworkX object and then run those graph algorithms in NetworkX, but that has scalability problems as well as performance problems. What we are providing now is off-the-shelf procedures that you can run directly in Kuzu's Cypher dialect to compute those graph algorithms.
So those are the four things
that are immediately coming up.
Nicolay Gerold: Nice.
And if people want to start building
with Kuzu or want to learn more
about it, where would you point them?
Semih Salihoğlu: I think, come
to our documentation website, and
there's a getting started part.
We have a browser based command line
interface with a graph visualization.
That's a nice tool to use
to get started with Kuzu.
It's called Kuzu Explorer.
So I'd advise getting started,
on in our documentation page
and using Kuzu Explorer,
Nicolay Gerold: Nice.
And if people want to follow you,
where can they do that as well?
Semih Salihoğlu: We have a LinkedIn and a Twitter Kuzu page. I think that's the best place to follow Kuzu-related things.
I'm not particularly super
active on social media.
And we have an AI engineer, Prashant Rao.
Prashant is a lot more active than I am.
So Prashant usually breaks
all the news related to Kuzu.
And please sign up for our newsletters.
There's a newsletter on our website.
If you come to kuzuDB.com you can sign up for our
newsletter as well and follow
the developments around Kuzu.
Nicolay Gerold: So, what can we take away
when we are building our applications?
So, first of all, I think we
went way deeper into one specific
technology than I usually do
and really focus on one library.
But I think it can teach you
a lot when you're developing
libraries or new tools as well.
A lot of the stuff we are using nowadays is a little bit older and doesn't apply a lot of the techniques and optimizations we could take advantage of with modern hardware.
And we see, especially with Kuzu,
but also with DuckDB and Polars
that we can do so much on a single
node or embedded on users devices
because the hardware is so powerful.
We just need to optimize our stuff to actually take advantage of that.
And here, like Kuzu does a lot, like
vectorization and stuff like that, where
you suddenly can run stuff where you
actually needed a really complex database
before, but now you can do it in memory.
And through that, you can
apply it to a lot of stuff.
So for example, if you're using knowledge graphs with your LLMs, you can use
Kuzu as a backbone, we had in the
interview a few weeks ago with Robert Cork
from Ask News, they are basically building
knowledge graphs for the user specific
query on the fly and then query it.
And I think this is something
you can do here as well.
So in your application, you likely
are always querying on sub graphs.
So you basically construct a
subgraph, load it into Kuzu,
query it, and then you go on.
And for that, it's way more efficient than having a standalone graph database, because you don't deal with distributed databases from the get-go, where you actually have to do really complex joins on UUIDs and stuff like that.
What I think is really interesting is the graph data science part, graph algorithms at large. They are interesting for some applications, but they are, in my opinion, rarely used. It's often recommendation systems, really large-scale recommendation systems, fraud detection, or influence analysis on social networks. For most of us, probably, the advanced graph algorithms aren't worth it, or they are so deep into optimization that we will likely never get there, because we can get a bigger bang for our buck with simpler techniques.
For RAG, Kuzu is suitable in my opinion. But I think in general for RAG, before you start with something graph-based, first make sure that you actually optimize the rest of your retrieval as well, because I think you get a bigger bang for your buck in vector and keyword search, and in adding smart metadata to the different documents, than by putting a graph on top. So before you add the graph, optimize these two components first, and then add the graph, but have a clear reason to do so, a reason why you actually can't do the same with simple metadata tags.
And yeah,
I think Kuzu gives us a lot of
different directions of how we
probably will build in the future.
So first, it's embedded by design. It's a library and not a server, so it's a graph database that runs locally. You can integrate it into scripts, mobile apps, serverless environments. I think we will see more and more of that because our devices become more and more powerful: on the server I can run more on a single node, and I can also push more onto end-user devices.
A second one is the use of columnar storage, which gives you way better scan performance and also improves IO efficiency and compressibility, so you can basically compress the data better. And then you have the vectorized query processing, where with today's CPUs you can really take better advantage of the CPU cache by doing vectorized queries.
And lastly, I think what they have done with the ASP join and vectorization shows it pays to have people who really can optimize the shit out of an algorithm. Whether that's actually necessary for your own use cases is something you have to know.
I think like for libraries, it's always
something interesting to build in,
but only if it actually makes sense
for you as well, but yeah, that's it.
Let me know what you think.
Uh, we went really deep into Kuzu.
I think the architecture decisions behind Kuzu were really interesting, and also the different use cases he mentioned, what they are working on, what they are thinking about, and how Kuzu should be used. Let me know in the comments what you think.
Otherwise I will catch you next week.
See ya.