Nicolay Gerold: Knowledge graphs are all the hype now, but ontologies, taxonomies, and knowledge graphs are not new inventions. Most of them have been studied for decades. Google popularized the knowledge graph in 2012 by building on decades of semantic web research, and despite the current buzz, these concepts build on established work.
And when we look at all of them, we can think of metadata as the foundation. It tells you what data means, where it came from, and how it connects.
But why is it all the hype now?
AI needs context.
LLMs are powerful, but blind.
They need clear definitions, trusted
sources, and connected concepts.
And metadata provides this backbone.
And knowledge graphs, taxonomies,
and ontologies use the metadata
to make it more usable by the LLM.
So your first steps are typically to start with the facts: something like your table names, column types, data flows, access patterns.
Then you add meaning.
So you add business terms, common
definitions, key relationships, usage
rules, and then you connect everything.
You link the technical
to the business terms.
You map data flows to processes. You tie metrics to their sources. And you build from the ground up.
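To make that concrete, here is a minimal Python sketch of those three layers: technical facts, business terms, and the links between them. Every table name, term, and link below is purely illustrative, not from any real system mentioned in the conversation.

```python
# Three layers of metadata: facts, meaning, connections (illustrative names only).

technical_facts = {
    "tables": {
        "cust_pr": {"columns": ["cust_id", "email", "active"], "source": "snowflake"},
        "orders": {"columns": ["order_id", "cust_id", "total"], "source": "snowflake"},
    },
    "lineage": [("fivetran.crm_export", "cust_pr"), ("cust_pr", "dbt.dim_customer")],
}

business_terms = {
    "Customer": "A person or company that has placed at least one order.",
    "Churn Rate": "Share of customers who stopped ordering in the last 90 days.",
}

# The "connect everything" step: link technical assets to business terms.
links = [
    ("cust_pr", "is_data_about", "Customer"),
    ("dbt.dim_customer", "is_data_about", "Customer"),
    ("Churn Rate", "is_calculated_from", "orders"),
]

for subject, relation, obj in links:
    print(f"{subject} --{relation}--> {obj}")
```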
So metadata isn't new, but it's really important for AI systems, data governance, business intelligence, and also cross-team collaboration in larger enterprises.
And today we are talking to Juan Sequeda, who has been building data catalogs, ontologies, and knowledge graphs for decades.
And we will be talking about how they
are building knowledge graphs and how
you could apply it to your own use cases.
Let's do it.
What do you think is the difference
between, like, all the different
buzzwords that are being thrown around?
A data catalog, knowledge
graphs, ontologies, taxonomies,
and how do they work together?
Juan Sequeda: Okay, they're buzzwords now because they're becoming popular. But ontology is not a buzzword; all those things you said, ontologies, taxonomies, those are things that have been worked on and defined for decades and decades. If not, you can trace the history of this back to the Greeks, right? Just organizing knowledge is not a new thing, right? So that's the one thing I would put out there. This is not a new buzz thing people are trying to go do.
Knowledge graphs, again, I remind people: the term, the word knowledge graph, became popularized when Google made a blog post in 2012 where they introduced their concept of a knowledge graph, which is based on all the work that, at that time, came out of the semantic web research community. So then we can talk about the semantic web and stuff. But you have all these people working on things, and the Semantic Web community was building on so many other technologies and approaches from the past, right? You always build on the shoulders of giants. So knowledge graphs were popularized by Google around this.
And now with AI and LLMs, I'd argue that the work that our lab did, over two years ago now, we were the first ones, to the best of my knowledge, to say: hey, we believe LLMs are going to be used to chat with your data, specifically chat with your structured data, and you need semantics, you need ontologies, you need a knowledge graph. The research that we put out was showing, hey, if you put a knowledge graph layer on top of your relational databases and you do question answering over the knowledge graph instead of directly over SQL, the accuracy increased like 3x.
That was our original research, and that spawned more work. People had already been talking about knowledge graphs and LLMs, yeah, those should go together, but we were, to the best of my knowledge, the first ones to actually put some numbers on it: how well does it actually improve, and is it actually worth that improvement? And then you can start seeing things about graph RAG coming up. But so that's my long story there.
But the quick thing is: we're hearing it more, but it's not a buzzword that people have just made up recently for this stuff.
Nicolay Gerold: And, because it seems like you're connecting a lot of different data sources, you're cataloging: what does everything mean? What are the different definitions of the different data types? What are the ontologies and taxonomies that we are using across the data?
Is there a certain scale you first
have to reach in your maturity
and in your AI and data journey
before it really becomes relevant?
Or should you start with it from the
get go thinking about what are the
connections between the different
data sources and how should I define
the different aspects of my data?
Juan Sequeda: So it should all start with
the problem you're trying to go solve.
That's it.
Yeah, I know we have a technical audience, but this is something I even push for the technical audience. If you are just trying to play around with technology, then yeah, we can have that discussion and say why this technology is better. But if we're having the conversation about, I want to apply this technology in the enterprise, then the question is: what is the problem you're trying to go solve? Because if the problem you're trying to go solve is, I just need to answer all these questions, then just put it in a freaking relational database, create an OLAP fact and dimension model, and you're done. You don't need that stuff, right?
So then I'll rephrase the question: what are the situations where things get more complicated, when you start saying, oh wait, I need to start rethinking all of this? The first scenario: I think we live in the normal paradigm, what I call the application-centric world, where we build systems to satisfy a particular application in a use case. And that means that we build a silo for this stuff, and it works perfectly fine. You can use whatever system you want, and I build that application, and when I have another problem, I build another application.
So at the center of the
world is the application.
Now, what happens is that's how we start building these silos. And then people start realizing, oh wait, I have these other questions, what do I need to do? I actually need data that's from this application and that application. And what do we end up doing? Building another application. And they keep building these things.
So you have all these silos of things.
At some point you'll say,
this doesn't scale anymore.
This is wasting my time.
But some other people are like this
is just how we work and it's fine.
We're gonna live in these silos.
And that's, I think, how all
organizations have done that.
But then you start running into the problem: I don't even know what data I have, or I get multiple answers depending on how I ask these things.
So if you're starting to have those issues in your organization: I have a new thing I need to go build and it takes a lot of time, and I'm like, why does it take so much time? If what I'm asking for already exists with that other stuff over there, why are you telling me I have to wait so long? Or I'm trying to go find data around these things, and there's just too much of this stuff.
I ask a question and I get so many different perspectives. If you're starting to hit those types of problems and it's actually costing you, you are losing money because of that, then, okay, you start realizing that you have an issue of what I'm going to call metadata management. You have this knowledge about what your stuff means and how it's organized, it's all spread across the place, and that's starting to generate pain for you.
So I think the next step that you should start considering is going into all this metadata management. It comes into the world of data catalogs and stuff, where you say, hey, let's actually organize the mess that we have, let's figure out what we have and then figure out how we start organizing it. And from that point you keep increasing the maturity of how much you want to go do.
So all this is to say: thinking about this from a tech perspective, I always say that your first knowledge graph should be of your metadata. You're starting to pull in and integrate the metadata of your systems.
So I know we're just jumping to knowledge graphs and we haven't even defined what a knowledge graph is. I'll keep it very simple, my honest, no-BS answer here. Let's actually break down what the words knowledge and graph mean. Knowledge is the meaning of things, and this is the data model, the ontology, the schema; that is what we're making explicit. So it's just drawing out or representing the concepts about a particular domain. My domain is e-commerce. I'm talking about customers, orders, order lines, products, right? That's my domain.
And then the data is the graph part. And why graph? Because at the end of the day you're connecting all these things together, and I actually think that graphs are a very flexible data structure. You can take tabular data and translate it to a graph; you can take tree-style data, JSON, XML, and put it into a graph; when you do entity extraction over text you're pulling out nodes, and when you do relation extraction you're pulling out edges. It just became that lowest common denominator to integrate all types of sources.
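A small sketch of that "lowest common denominator" idea: a relational row and a tree-shaped JSON document both end up as nodes and edges in the same graph. The sample records, identifiers, and relationship names below are made up for illustration.

```python
# Tabular and tree-shaped data converging into one graph (illustrative data).
import networkx as nx

g = nx.MultiDiGraph()

# A row from a relational table: customers(id, name)
row = {"id": "cust-1", "name": "Alice"}
g.add_node(row["id"], type="Customer", name=row["name"])

# A tree-shaped JSON document: an order with nested order lines
order_doc = {
    "order_id": "order-9",
    "customer": "cust-1",
    "lines": [{"product": "prod-42", "qty": 2}],
}
g.add_node(order_doc["order_id"], type="Order")
g.add_edge(order_doc["order_id"], order_doc["customer"], key="placed_by")
for line in order_doc["lines"]:
    g.add_node(line["product"], type="Product")
    g.add_edge(order_doc["order_id"], line["product"], key="contains", qty=line["qty"])

print(list(g.edges(keys=True)))
```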
Now, what happens if I start thinking, I need to go create a schema and an ontology about a particular domain? You're probably going to have a lot of discussions with people. Oh, but I think about it this way. I think about it that way. I'm like, you know what? That's true. So let's not tackle that problem. Let's tackle an easy problem where we're probably not going to have a lot of discussions, because we're going to have to agree. And this is where metadata comes in.
A database. What is the ontology, the semantic layer, the schema of a database? You have a database, the database has schemas, schemas have tables, columns are part of tables, and those columns have primary keys and foreign keys. That's your model right there, right? I want to bring in a dashboard; there's a definition of what a Tableau dashboard is, and so forth. Then you start relating dashboards to tables. And so you start creating that model that represents all the metadata from your sources, and you just start bringing that in and creating that knowledge graph of your metadata.
And that's what a data catalog is.
The data catalog is really an application
over your integrated metadata coming
from so many different sources.
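Here is a minimal sketch of that metadata model as triples: a database with a schema, a table with a primary-key column, and a dashboard that uses the table. The `CAT` namespace and every asset name are illustrative assumptions, not a real catalog's vocabulary.

```python
# Database metadata modeled as a small RDF graph (illustrative names only).
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

CAT = Namespace("http://example.com/catalog/")
g = Graph()

g.add((CAT.snowflake_prod, RDF.type, CAT.Database))
g.add((CAT.sales_schema, RDF.type, CAT.Schema))
g.add((CAT.sales_schema, CAT.inDatabase, CAT.snowflake_prod))

g.add((CAT.orders_table, RDF.type, CAT.Table))
g.add((CAT.orders_table, CAT.inSchema, CAT.sales_schema))
g.add((CAT.order_id_col, RDF.type, CAT.Column))
g.add((CAT.order_id_col, CAT.partOf, CAT.orders_table))
g.add((CAT.order_id_col, CAT.isPrimaryKey, Literal(True)))

g.add((CAT.revenue_dashboard, RDF.type, CAT.Dashboard))
g.add((CAT.revenue_dashboard, CAT.usesTable, CAT.orders_table))

# A data catalog is then an application querying this integrated metadata.
for table in g.subjects(RDF.type, CAT.Table):
    print("table:", table)
```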
Long story here to say: you start realizing that you have a problem of managing a lot of metadata because you don't know what stuff means. Therefore, your first application of a knowledge graph should be metadata management, which is a data catalog.
Nicolay Gerold: Yeah, and I am working on a problem right now where, even with just one backend and one frontend, you already have different lingos: the consumer-facing one, but also the one you're using in your backend, which you then have to unite. That is often very challenging, because what you're doing in the backend is often abstracted away from the frontend engineers, who are using a different, more customer-facing lingo. Can you maybe go into a practical example of how that data catalog would look for a use case you can share?
Juan Sequeda: Yeah, and this is where I'm going to say that I think a lot of it is a social aspect too, because you go talk to people and figure this out. So let's start. Think about this from a top-down perspective, right? Sorry, a bottom-up perspective.
I first want to bring in all my technical metadata, just the facts of what we observe, the facts of life, right? Let's keep it a super simple example. I want to understand the relationships between my Tableau dashboards, my ETLs, which are going to be transformations done in dbt, and I probably have Fivetran or some other ETL tool, and I have something like Snowflake. Keep it super simple like that. So then I'm going to extract all that metadata from these different places. That's the first thing.
So imagine that. Think about it. I started off saying you have all these applications and these silos. The database itself is a silo. The dbt transformations themselves are a silo. The work that you're doing in your dashboards is in a silo. I'm going to start connecting that all together. So in my database, in Snowflake, I have my tables, my columns, my views, all that stuff in there. And in dbt I start making transformations; I have SQL queries and code that I want to be able to connect.
And what am I doing? I'm transforming things inside of that database. So there is this one table over here that got transformed into another table, where I may have changed a name or something, and I'm trying to keep track of all that. By the way, that's the lineage, and that's part of the graph. There's a table here, that table goes into this transform, which generates this other table over there.
And then, in another silo, you realize I have this system that is actually doing a bunch of logic too. There's a dashboard that uses a particular table, but that table is actually inside Snowflake, and that table was transformed by some dbt job from another Snowflake table, and that table, by the way, came from some Fivetran or EL process that dumped it from some other source over there. So we pull that all together; all catalogs do this. But why see this as a knowledge graph?
One, if you only care about dealing with those four or five sources of data, then you probably don't need a knowledge graph. But if you need to deal with a new source that's coming along, and then another new source, and so forth, I think that is where you want to have that flexibility to extend and add more things.
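A sketch of that cross-silo lineage as a directed graph, with impact analysis as simple reachability. Every asset name in the chain (source, Fivetran sync, Snowflake tables, dbt model, Tableau dashboard) is a made-up example.

```python
# Cross-silo lineage as a directed graph, plus impact analysis (illustrative names).
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edge("crm_source", "fivetran.crm_sync")                   # EL load
lineage.add_edge("fivetran.crm_sync", "snowflake.raw_customers")
lineage.add_edge("snowflake.raw_customers", "dbt.dim_customer")       # dbt transform
lineage.add_edge("dbt.dim_customer", "snowflake.dim_customer")
lineage.add_edge("snowflake.dim_customer", "tableau.revenue_dashboard")

# Impact analysis: if raw_customers changes, what is downstream of it?
impacted = nx.descendants(lineage, "snowflake.raw_customers")
print(sorted(impacted))

# Lineage: where does the dashboard's data come from?
upstream = nx.ancestors(lineage, "tableau.revenue_dashboard")
print(sorted(upstream))
```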
Now, everything that I just talked about are just the facts that exist. I just pulled that in. Nobody can deny this; that's what these systems are telling me. That's the technical metadata.
Now we're going to talk about the business metadata. And this is where people say, okay, how should we start thinking about business metadata? Let's start thinking about it as: what are my business terms? What are the words that we use? Let's start creating a list of words, your business glossaries, and all these things. You can then start saying, hey, is this a term? Oh, this is a metric. So I can start saying more things: a metric has a calculation around it. Oh, this term is related to another term, and that relationship is hierarchical; I've created a taxonomy. Oh, this term is a relationship between these two things, and the relationship name is "was placed by": an order was placed by a customer. So we start adding more and more information around these things.
And you can say, hey, a customer can purchase many orders. Okay, there's a cardinality to keep track of there. Oh, an order can only be purchased by one customer, right? That's a different cardinality. So that's more and more knowledge that I'm adding and keeping track of in that business glossary. So you start from a business glossary, and you can add more and more semantics to it.
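A minimal sketch of a business glossary growing into a small ontology: terms with definitions, a hierarchical (taxonomy) link, named relationships, and cardinalities. The terms, definitions, and cardinality notation are illustrative, not from any specific tool.

```python
# Business glossary terms with relationships and cardinality (illustrative).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Term:
    name: str
    definition: str
    broader: Optional[str] = None        # hierarchical / taxonomy link

@dataclass
class Relationship:
    name: str
    source: str
    target: str
    cardinality: str                     # e.g. "1..1", "0..*"

glossary = {
    "Customer": Term("Customer", "A party that has placed at least one order."),
    "Order": Term("Order", "A confirmed purchase made by a customer."),
    "Online Order": Term("Online Order", "An order placed through the web shop.",
                         broader="Order"),
}

relationships = [
    # "An order was placed by exactly one customer."
    Relationship("was placed by", source="Order", target="Customer", cardinality="1..1"),
    # "A customer can place many orders."
    Relationship("places", source="Customer", target="Order", cardinality="0..*"),
]

for r in relationships:
    print(f"{r.source} --{r.name} ({r.cardinality})--> {r.target}")
```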
Now the question is, how do you come
up with these business glossary terms?
Of course, everybody wants the magic wand, and this is where AI is definitely helping, either to identify what these things are or to actually create descriptions and so forth.
So now I have my layer of business metadata and my layer of technical metadata. What we also need is a layer of, let's call it relationships, saying, hey, this table called CUST_PR, whatever, right? Oh, that means customer, or that is data related to customers. So I say: I have a concept called customer, I have a table here called CUST_whatever, and I want to make a relationship between these.
The question is, how do you create these relationships? I want these relationships to be automatically created; of course AI will help do that. But then this is where governance comes in, because I'm adding an edge between these two things that the AI suggested, and you can say, I want somebody to come in and approve it. Is that the correct thing or not? So that's how you can start building these things little by little, and this is what gets built out underneath, inside of a graph.
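A sketch of that governance step: an AI-suggested edge between a physical table and a business concept that a human must approve before it becomes part of the graph. The table and concept names, the confidence scores, and the review function are invented for illustration.

```python
# Human-in-the-loop approval of AI-suggested metadata edges (illustrative).
from dataclasses import dataclass
from typing import Optional

@dataclass
class SuggestedEdge:
    source: str                     # technical asset, e.g. a table
    relation: str
    target: str                     # business concept
    confidence: float               # whatever score the suggester produced
    approved: Optional[bool] = None  # None = pending human review

suggestions = [
    SuggestedEdge("CUST_PR", "is_data_about", "Customer", confidence=0.91),
    SuggestedEdge("CUST_PR", "is_data_about", "Supplier", confidence=0.32),
]

def review(edge: SuggestedEdge, approve: bool) -> None:
    """A human steward accepts or rejects the AI's proposal."""
    edge.approved = approve

review(suggestions[0], approve=True)
review(suggestions[1], approve=False)

graph_edges = [(s.source, s.relation, s.target) for s in suggestions if s.approved]
print(graph_edges)   # only approved edges make it into the knowledge graph
```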
Nicolay Gerold: Yeah. And when a company wants to start going from the application perspective to the knowledge graph, data catalog perspective, putting the data at the center, how do you see that transition working? You have a jumbled mess of different data sources. Where do I even begin?
Juan Sequeda: So first of all, this is the social aspect, right? It's, okay, what is the most important data that we need, that we know we can reuse? And this is actually the main reason why I tell people you should start investing in semantics and knowledge. If you create a data set, a data product, whatever you want to call these things now, that's another big buzzword, data product.
The value of that is that it should be reused. Why do we create all these silos? Because I need data to answer this question, and you realize, wait, I could probably use that data, but then you go look at it and you're like, oh, I've got these questions, I don't know who to go talk to, what does that mean? You know what, let me just do it myself. And then I do it myself and I solve my problem. And then somebody comes along and says, I have something very similar, I could be using that. Oh, that person doesn't work here anymore. I don't know. I don't want to touch it. I'm going to start again.
We rebuild over and over again because we're not reusing the data that exists, and we don't reuse the data because we don't understand what that data is about. So the purpose of actually investing in semantics and knowledge is that we want to reuse data, extend data, compose data, combine it.
Okay. So what is the value of me reusing data? This goes to economies of scale. I want to make sure that one plus one is greater than two. If I built this data for one particular use case and it's well described, somebody else comes along who has a completely different use case, which I had no idea about, and they're able to use that data. Look, I made a unit of work to address problem A, and that unit of work is now able to address problems B, C, D, E, and I only focused on one. So I think that's the economies of scale.
So this is a mindset, a mind shift that we need to start having. And I think that mind shift needs to come from a technical perspective, hopefully the technical people listening saying, I want to make sure that this work that I'm doing can be reused over and over again. But I also want it from an executive, top-down perspective, saying, I want to make sure that what we're investing in, we can squeeze so much juice out of. So this is really creating a culture of people thinking, I want to not only solve the problem, but I want to make sure that the work that I'm doing is reusable. And for that it's a culture; it's creating incentives around these things. So again, all this is just a social aspect. There's no technical thing to do here.
Nicolay Gerold: Yeah, I don't like the term data products either. I think data asset is something I'd rather talk about, because you're building something that can be built upon in the future. What are some of the things, especially when it comes to data catalogs and knowledge graphs, that you have seen become possible after they were created that were impossible before?
Juan Sequeda: Yeah, so I wouldn't say they were impossible. It's that the cost of doing it became so much lower and it got faster. We enabled serendipity around this. So I'm not arguing about the impossible, right? Everything's possible. It's just software; we'd have to go write things, right? You can do anything in assembly code. You can still do that, right? So let's look at, for example, metadata examples that use the graph very explicitly.
So one of the things I really love doing, when you look at metadata as a graph: the common things that you do on a graph are data lineage, right, and impact analysis. Oh, if I change this, what is it going to impact, right? Or this number doesn't look right, where did it come from? Those are features you can code in. Everybody has that type of stuff, right? So that's just the bread and butter.
But if you start leveraging the graph, things that I like to go do are, I want to find bottlenecks in the graph. In the graph of the metadata, what would a bottleneck imply? It means that either a lot of things go into it and just a couple of things go out of it, or a few things go into it and a lot of things go out of it. So that implies that it's a very important node. That's a big table, or a transformation that's being used a lot, that a lot of things depend on. And then you can say, hey, who's responsible for that? Where's the documentation? How do I prioritize where I should focus my work? Finding your bottleneck nodes, your high-complexity nodes, right?
That's a particular example that we're seeing a lot of people in our organization do. So you can just start applying a bunch of classic graph analytics around these things: finding communities, finding similarities, right? These are clusters: we've got a lot of tables and data around this same type of stuff, what's going on around these? So there's a bunch of graph analytics you can apply to that. That's one particular example.
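A sketch of the "find the bottlenecks" idea: treat the metadata/lineage graph as a plain directed graph and rank nodes by betweenness centrality and degree. The asset names are made up; the centrality measure is one reasonable choice, not the only one.

```python
# Ranking "bottleneck" nodes in a metadata graph (illustrative assets).
import networkx as nx

g = nx.DiGraph()
edges = [
    ("raw_orders", "stg_orders"), ("raw_customers", "stg_customers"),
    ("stg_orders", "dim_customer"), ("stg_customers", "dim_customer"),
    ("dim_customer", "revenue_dashboard"), ("dim_customer", "churn_model"),
    ("dim_customer", "marketing_export"),
]
g.add_edges_from(edges)

# A node many paths flow through is a good candidate for documentation,
# ownership, and quality checks.
centrality = nx.betweenness_centrality(g)
top = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:3]
for node, score in top:
    print(f"{node}: betweenness={score:.2f}, in={g.in_degree(node)}, out={g.out_degree(node)}")
```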
Now let's shift to a completely different example, tied to the metadata. I think the value of the knowledge graph, and specifically, again, of knowledge and graph: if I look at the graph part, it's the flexibility and extensibility; and the knowledge part is that I can start extending my schema, my ontology, without having to be dependent on a vendor. I think this is the value of it.
So we talked about Snowflake and dbt and an ETL or EL tool moving all that stuff. But let's say I want to start bringing in all my employees, because I want to also add into my graph the structure of my employees and what teams there are, because I don't know who should be responsible for this particular table, but I do at least know which department should be. So if something does happen, I want to know who I should go talk to. So bringing people in around these things, that's something we're starting to see; people want to extend this.
We're bringing in our machine learning models. Okay, but this machine learning model takes as input this data over here, has these features, and is being used for these applications. So you start connecting; I'm adding more and more. We're starting to see people integrate and add business processes, saying, okay, I want to understand how we do pricing for our product.
Okay, so to do pricing for a product, we need this data. Then we have this decision model, and the decision model is, if this, then that, blah, blah, blah; okay, there's a decision that happens, and part of that decision is you have to go talk to Bob, or Alice over here. So if you think about it as a graph, this data comes into the decision, people go into this decision, the output is a decision that may go into another decision, and all of that stuff happens.
So we talk about data lineage; now I'm talking about business lineage. I want to understand how this business works and how we define pricing around that stuff. And if we're going to do a pricing change, I want to understand what all the pieces are that are a part of it, and where the bottlenecks are. Oh, Bob is involved in so many different projects around this. So we start combining more and more of these things. I can give a couple more examples, but let's stop there and throw it back to you. Does that make sense? Is that interesting?
Nicolay Gerold: Yeah, I love that stuff, the practical hands-on examples. Knowledge graphs are talked about a lot, but most examples just take an LLM to extract entities and relations, throw it into some form of storage, and then query over it.
Juan Sequeda: Text, right? So let me talk about this for a second. Why are people hearing about knowledge graphs more today? Because they're hearing it in the context of graph RAG, in the context of, I want to be able to extract more context out of my documents, right? Because the current way you're extracting context from your documents, you're just doing embeddings and sticking them in a vector database, and you're lacking the relationships and the main concepts and stuff. So you want to do that; that's the first step. And I think it's already been shown that's better; a lot of our early work pushed for that.
But you're using that for a particular application, and everything I'm talking about here is: really step out and start thinking about the context of your organization. You want to build the brain of your organization, how things work, and that doesn't happen overnight, right? You have to start with the use cases and start building that stuff out, because guess what? That brain, that context, can be used for something like graph RAG, and it can be used for other types of decisions that you want to make, for any other applications, not just for graph RAG.
So it goes back to, really, why are we doing this? What is the problem you're trying to go solve? And I think you realize that those problems, independently, are really hard. But if I start integrating data and I start treating knowledge as that first-class citizen, it helps me solve all those problems. So maybe what I should be doing is investing in that strong foundation first. And that strong foundation starts with metadata. That, I'll tell you, is something I learned the hard way. I started doing things all over the place and realized, oh shoot, I should have started a little more bottom-up with the metadata, layering semantics little by little, by use cases. I ended up trying to boil the ocean early on in my career.
Nicolay Gerold: Metadata starts to bite you once things go wrong, and then you actually realize you need it. I think people realize, with graph RAG, that you need good context engineering for LLMs. What's missing is a little bit that you need good context engineering for your entire organization as well, so that you actually understand: what is the data, where is it coming from, how do I actually use it, what could I want it for?
Juan Sequeda: Note that not everybody needs to make that investment. If you're a small startup right now, just hit the ground running, go do things. I think it's a level of maturity. You start realizing that it's biting you back because you're trying to solve these big business situations, not even just answering a simple question, right? There are these big problems, like, we need to improve the operational efficiency of how we are stacking boxes inside the crates that we're shipping our products in, because we keep having problems over time, right? And that's how you realize it, like with supply chain itself.
Actually, one of my co-authors is Ora Lassila. He's one of the original authors of the Semantic Web vision paper from 2001. He's actually the guy who designed the first version of RDF back in '98. And he's at Amazon; he's a principal graph technologist at Amazon Neptune. He gave the keynote at the Semantic Web Conference last year, and he was able to publicly talk about how Amazon's supply chain is all now modeled in a knowledge graph, in RDF, that runs on Neptune.
And it's about all these things. I couldn't get into the details, but it was built for this particular use case, and then all these other use cases were able to start reusing it. That's the economies of scale.
Nicolay Gerold: Yeah, and there's also a certain degree of dynamism, I hope I used that correctly, you have to consider, because in a startup especially, your main dataset is probably still in constant flux, it's changing all the time. So tracking metadata around it and how it relates to things is not as useful.
Juan Sequeda: There's no ROI on that. Okay, I tracked it. What am I going to do? I'm going to change it anyway. So this is where you have to be very, again, honest, no BS, because you don't want to do this then. I think this is mainly for when you start getting into the issues of scale in enterprises. That's what we're seeing.
Nicolay Gerold: Yeah. And similarly, feature stores, for example, are in a very related domain: data lineage, cataloging the different data. How do you actually see your data catalog and knowledge graph relating to feature stores? Do they work together, or does one make the other obsolete?
Juan Sequeda: That is a fantastic question, which I am not prepared to answer. I'm telling you, I actually attended the MLOps conference, which was here in Austin last year, and I was just wandering around, listening to talks, and I'm like, there is a whole different world which I am not in, that bubble. And they say different words, but they're very related to what I'm seeing in the data and metadata world. And they're not talking to each other.
Because at the end of the day, the data world does all their work and the data lands in the lake or wherever, and then the ML/AI world, specifically ML, comes in, takes that, and does all this stuff with it. But then they're like, we're building features. Wait, that's how you're transforming data, right? You have observability and ops; we have all that over here too, right? You have systems that are tracking where all this stuff is. Oh, that's our catalog and metadata. You have guardrails that you're doing. That's also governance over here, right? So there are a lot of parallels.
Now, I'm not implying that there should just be one tool for everything. Maybe there could be, I don't know, but there are just parallels. I'm like, hey, let's go hang out at the same bar, let's go talk, and let's learn from each other.
Nicolay Gerold: I think
that's the issue of AI.
I think AI as a field likes to
reinvent a lot of stuff, which has
already been invented a while ago
by different fields, and ends up
with a similar solution in the end.
Juan Sequeda: And this is not just AI, it's so many different fields. But let's stick specifically with AI. Back in the eighties, when the focus of a lot of AI was on symbolic reasoning, you had the AI world doing things and you had the database world doing these things too. The same thing happened, right? They used different words. And it's just common human nature, because we live in our own bubbles and we think about things, and at the end of the day it's validating. It validates: hey, we're all talking about it, right?
I think this is why, at least, I personally consider myself somebody who likes to build bridges. So I like to hop around and observe things. Maybe you observe something and nobody pays attention, and that's okay, but maybe you do find these things where, hey, one plus one is greater than two, so let's work together.
Nicolay Gerold: Yeah, and what are the types of questions you actually ask to figure out the data catalog and the knowledge graph you have to construct for a certain use case or a certain business? How do you actually run through it? What types of data do I have? What types of use cases do I have to support?
Juan Sequeda: This is a great question. So if we look at it from the governance space, what we're seeing in the market right now is that, from a use case perspective, there end up being, I think, three buckets of use cases. One is around search and discovery: I don't know what data I have, I just search for data. And the reason I want to search for data is because we're trying to be data-driven and do things with data.
But what is the actual use case? I need to push people more: yeah, I get it, this is a problem everybody has, they can't find their data. But what are you trying to do with your data? You really want to get into more detail, but in general, they're trying to find data so they can answer questions to solve a problem that's going to make them money or save them money. That's one part, and that's where search and discovery, data marketplaces, and data products all fall into that area right there.
By the way, I agree with you. This past weekend was Data Day Texas here in Austin, and yesterday we had an entire session for an hour, like 20 people, talking about what a data product is. At the end, I think Joe Reis was in the room, and he's like, look, do you think the plumbers are at their conference saying, hey, what is a wrench? No, people would think we're insane talking about this stuff. So I agree that being so pedantic is not helping us. And I think it's a very technical, computer-person thing that we want to have things so well defined. I'm like, okay, so we define it. So what are we going to do? How's that going to change our lives? Anyways, okay.
So that's one thing, the search and discovery. The other application is around governance. I like to see governance in two aspects: think about it as the protective one and the enablement one.
The analogy is, why do we have brakes in a car? Usually people say, so we can slow down. I'm like, yeah, that's the protective view: I want to make sure we don't have accidents. But another way of interpreting brakes in a car is that I want to be able to drive very fast, safely. So then the governance is, okay, who has access to the data? Where is our PII? Who's responsible for this? That's the governance aspect of things.
And the third one I call more the technical, data engineering side. This is, I want to have my lineage, where's my data coming from, the contracts, the quality, all those types of aspects right there. So those are the three main buckets, the three main use cases. And now I forgot the original premise of your question. Okay, but now that I've described those three main use cases, ask the question again.
Nicolay Gerold: I nearly forgot the question as well. I was going in the direction of: what questions are you asking when you start building the knowledge graph and figuring out the data catalog that you want to build?
Juan Sequeda: So the first question is, which use cases do you care about? Are you more about search and discovery, data marketplaces, data products? Do you care more about governance? Do you care more about the data engineering type of stuff? That's the first thing. And then after that, we want to ask people: how do you measure success? What is your goal? How do you know that you are being successful? Is it adoption? Is it usage? If I'm doing search and discovery, my goal is that I want more people finding data, and I want them to be able to find more of it.
And by the way, that can also apply to governance, because these are not isolated, right? People say, I'm going to only do one; no, you end up doing all of them, but which one is the one that's driving you? So for search and discovery it's, I want more people to go find data and stuff like that. Governance can be a very regulatory thing: we just got fined for something, and our goal is that we never get fined again.
Okay, perfect. I tell people, if that's your real goal, you've got to be cautious about that, because if you really never want to get fined, just make sure nobody touches the data and then you're done, right? So if people come in from a very regulatory point of view, I really tie that to: you also want people to use the data. So let's make sure that you're keeping the data safe, but also making it an enablement. Therefore there should also be adoption, a safe adoption where nothing bad happens.
If you're coming from a data engineering perspective, you may be saying, yeah, we get so many tickets around this stuff and people are complaining. So then your goal is to see that those requests come down, or that the time it takes to resolve them gets faster, but at the same time you want to make sure that usage goes up. So I think an overarching theme that we realize is that people should say, I want the adoption of data; I want more people solving more problems using data. So that's how we want to set the stage with folks.
And then I think this is where it's really valuable to bring in the product mindset, the product approach, and not to get into data products, just in general: who are your consumers? Who's going to be using this? What are the problems they're going to try to solve? Finding data is not the problem they're trying to solve; they're trying to reduce churn. So they need to have customer churn data. Okay, so then let's figure out what the customer data is for these things. And then, yeah, our concern is that people complain that the numbers don't match around all these things.
Okay, so on what topics? That's how you start realizing what things they need to start investing in, what other types of sources we need to bring in, which ones we should be focusing on, and which ones we should not be focusing on. And the honest thing is that for a lot of these use cases, you're just using the very foundational bread-and-butter stuff of a catalog, where you don't even need to go into the advanced capabilities of a knowledge graph.
What we do see later on is that everything I just described is like the foundations. You need to eat your vegetables and go to the gym a little bit, right? And when they start doing that, they're like, okay, we came here with that problem, we've got that under control now, what do we do next? And obviously AI is one of the drivers of what to do next. People want to do AI, and that's when we're seeing: okay, you want to do AI, you need to have your context, so you need to start investing more in your metadata and semantics. And even though the work that we were doing early on was, I want to be able to chat with my data, the thing that we're doing now is, I want to chat with my metadata to ask questions and build these apps. So that's where we're seeing the maturity heading.
If I summarize this right now: there are the true foundations that come in on those three aspects of search and discovery, governance, and data engineering or data ops. Then from there, the next step people are coming into is that they want to start adding AI to things. And the first level is conversations with your metadata, with the context: hey, what does this mean? How do we define this? Actually saying, I'm trying to solve this problem, what should I use to solve it? Oh, you should be thinking about this data, and by the way, you should probably talk to this person or these teams.
And then after that, they get into more advanced AI. Oh, I'm building AI agents and stuff, and I want my agent to orchestrate over that context, because I want to chat with my data and combine structured data with unstructured data. By the way, for combining structured data with unstructured data, knowledge graphs are playing a big role there. So they realize, if I want to go do that, I probably need to start on my foundation. And all the work that you're doing early on to do the basics, I need to find my data, you're building that foundation about semantics, you're building that knowledge graph little by little. At that original moment you're probably not taking advantage of it, but you're still building that stuff, such that when you're ready to take advantage of it for more advanced AI stuff, it's there for you.
Nicolay Gerold: Yeah. And when you have the use case, for example, you want to reduce churn or you have data faults you want to fix, how do you determine what's the minimum amount of coverage across my data that helps me fix that?
Juan Sequeda: So this goes into the whole methodologies and processes around this. Again, talk to people; this is a big part of it. The approach I always propose, I call it the iron thread. An analogy, a parallel to this: one of the technologists I admire and read a lot is Gregor Hohpe, who was at AWS recently, a big name. He has a book called The Software Architect Elevator, for enterprise architecture. His analogy is that the best enterprise architects can ride the elevator down to the engine room but also all the way up to the penthouse, right? So you're navigating that.
So in an iron thread approach, you say, look, let's take one particular question, a very small set of questions, around things that you can tie directly all the way up. Who's asking this? Why are they asking it? What is the problem it's causing? How are they solving this problem today? Why is it painful? What happens if they don't solve this problem? Get all of that. And then you start driving all of that all the way down to the bottom and figure out what you need.
You go figure out the politics, the people that you need to deal with to do this. Oh, what does customer mean around this stuff? Yeah, there are two, three, four people for that; let's go figure out what it is. Where is this in the data? Where was the code written for that stuff? And then you just have to go all the way down until you hit rock bottom and say, okay, this is the ultimate source. You've created that thread around that stuff, and now you're able to solve that particular set of questions.
And then what you realize is, once I've done that entire thread, what other questions or problems can I be solving with the work that I did on that? And you realize, oh, I can probably solve a couple of other things that I was not even thinking about. That's the serendipity, right? That's the one plus one is greater than two. Now you have a process. And from a social perspective, you've started to meet people, you've started to gain trust with those people, you now technically know what's working and what's missing, and you're already realizing, hey, that would be a good candidate for the next thing. And then you do it again. It's a thread, right? So you add another thread, and another thread. And then you build that muscle, how to do that. Then you can tell other people, hey, here's how you should do this, and let them start building those threads, and you start bringing that stuff together. So that's the approach that I suggest.
Nicolay Gerold: And we basically walk down the ladder, or the elevator, and we're at the bottom. And you start constructing the knowledge graphs. How do you do it? What do you use for the extraction? Or is the first knowledge graph mostly human-made?
Juan Sequeda: Yeah. So in this conversation we're having up to now, which I didn't even know where it was going to head, I love this, we've been having all this conversation about data catalogs and metadata, right? From that perspective, in the enterprise, that's built automatically, because you're just pulling in the facts. You're creating collectors or drivers; tools like ours do that for you. And we already have an ontology that represents what tables and columns and dashboards are and how they relate. So that first technical metadata is built completely automatically for you.
Now the business terminology one, that's where sometimes the first level of things is going to happen manually. What we typically see is that people already have some sort of business glossary somewhere, some spreadsheets, it's in a Confluence doc, whatever. So they can easily carve it into a spreadsheet, then into some CSV file, and then we take that and load it. So it's manual in the sense that somebody already created it, but turning that thing into the graph, that's done automatically, right? You can just transform that.
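A sketch of that step: an existing glossary spreadsheet, exported as CSV, turned into graph nodes and edges automatically. The column names, terms, and the inline CSV text are assumptions standing in for whatever export a team actually has.

```python
# Loading an existing business glossary (CSV export) into graph nodes (illustrative).
import csv
import io

# Stand-in for a file such as glossary.csv exported from Confluence or Sheets.
csv_text = """term,definition,related_to
Customer,A party that has placed at least one order,Order
Order,A confirmed purchase made by a customer,Customer
Churn Rate,Share of customers lost in the last 90 days,Customer
"""

nodes, edges = {}, []
for row in csv.DictReader(io.StringIO(csv_text)):
    nodes[row["term"]] = {"type": "BusinessTerm", "definition": row["definition"]}
    if row["related_to"]:
        edges.append((row["term"], "related_to", row["related_to"]))

print(len(nodes), "terms loaded")
print(edges)
```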
Then comes making those relationships. Those relationships can be created automatically, but this is where the human comes into the loop, right? You want to make sure: did that relationship make sense or not? Or, what we see a lot is people making those relationships manually. But that knowledge graph right there is a metadata knowledge graph, right? It's saying, this concept customer is related to this table, these tables are in these databases, and these are the dashboards that use them. So that's the metadata knowledge graph.
Now, if we take another particular use case, let's talk about the stuff that's happening in graph RAG, where people are basically extracting things from text. That's usually, almost always, done automatically, right? Because you're basically doing NLP: entity extraction, relationship extraction, and you go build these things. But what I see a lot is that that knowledge graph is really more graph and less knowledge. Because you're pulling in, okay, here's this thing, Bob and Alice and so forth, right?
Okay.
Then somehow it's a it's person,
but maybe you define one person.
Then they define this one as a
customer, but then you start having
all these different, the knowledge
part is not really well aligned because
you did everything automatically.
You didn't add any governance around that.
You didn't have, nobody's going in and
man, you imagine what this stuff means,
or how do I know that it generated two
entities for Bob, but was it the same
Bob or they're different Bobs, right?
So that's the process that
happens when it's fully automated.
When it comes to text, yeah, you, I
think you use everybody does every,
you have to do this fully automated.
But I think there lacks right now,
a notion of how do I do some sort of
governance or just understanding, did
that extract the correctly or not, and
I think that's something that we're
in new territory, which is like how to
understand that semantics, that governance
layer on top of the knowledge graph that
is extracted automatically from text.
So.
that part is fully automated,
but there may have mistakes.
But I think right now, the
situation that we're in, we're not
doing much of the manual stuff.
I think we need to go do that.
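A sketch of that governance gap: the extractor produced two "Bob" nodes and a steward (or a rule) has to decide whether they are the same entity. The matching heuristic below is deliberately naive and purely illustrative, not a real entity-resolution method.

```python
# Flagging possible duplicate entities from automatic text extraction (illustrative).
extracted_entities = [
    {"id": "e1", "label": "Bob", "type": "Person", "source_doc": "email_001.txt"},
    {"id": "e2", "label": "Bob Smith", "type": "Customer", "source_doc": "crm_note_17.txt"},
    {"id": "e3", "label": "Alice", "type": "Person", "source_doc": "email_001.txt"},
]

def candidate_merges(entities):
    """Flag pairs whose first names match; a human steward reviews each pair."""
    pairs = []
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            if a["label"].split()[0].lower() == b["label"].split()[0].lower():
                pairs.append((a["id"], b["id"]))
    return pairs

for left, right in candidate_merges(extracted_entities):
    print(f"review: are {left} and {right} the same real-world entity?")
```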
Then we look at stuff like, I want to build my knowledge graph of my data that's inside a relational database. In that first scenario I was talking about the metadata, extracting the table names, the column names, and all that stuff. In this third one, I'm literally talking about what's inside the data. I want to create a knowledge graph that says Bob made this purchase on this date for these products, and all these products are related, and so forth. You can see behind me, I have a book right here, which is about designing and building enterprise knowledge graphs from relational databases.
What I see: I call this the era before LLMs and after LLMs. Before LLMs, it was always very manual, because you want to understand what this stuff means. So you're like, hey, what is a customer? Okay, I can create a node called customer, and there's a table called customer. But does that mean that every single row in the table customer is actually a customer? Maybe the customers are only the ones that are active, so I have to know that it's only where active is 1, or A, or whatever, right? How do you find all these things out? I think now LLMs are a way of helping us create these mappings, these transforms; we're able to semi-automate that. But for that, you actually always need some human oversight and governance, if you're using this from an enterprise perspective, where they're going to require that governance aspect.
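A sketch of the kind of relational-to-graph mapping an LLM might draft and a human then reviews: only rows where active equals 1 become Customer nodes. The table, columns, and filter are illustrative; the point is that the filter encodes semantics someone has to confirm.

```python
# A reviewable mapping from a relational table to graph nodes (illustrative).
mapping = {
    "source_table": "customer",
    "row_filter": lambda row: row["active"] == 1,   # the semantics a human must confirm
    "node_type": "Customer",
    "node_id_column": "cust_id",
    "properties": ["name", "email"],
}

rows = [
    {"cust_id": 1, "name": "Alice", "email": "alice@example.com", "active": 1},
    {"cust_id": 2, "name": "Bob", "email": "bob@example.com", "active": 0},
]

nodes = [
    {"id": f'{mapping["node_type"]}/{row[mapping["node_id_column"]]}',
     "type": mapping["node_type"],
     **{p: row[p] for p in mapping["properties"]}}
    for row in rows if mapping["row_filter"](row)
]
print(nodes)   # only the active customer is materialized as a node
```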
Nicolay Gerold: I think those are two very different areas. The one knowledge graph is very curated and probably way smaller. And with LLMs, we see these really open knowledge graphs, which are infinitely expanding. I'm really torn on the second part, whether it actually provides any value and whether I actually want it, or whether I want to constrain the entities and relationships I extract to the specific use case I'm handling.
Juan Sequeda: I agree with you, but we need humanity to basically figure this out on its own. We need people to go off and try things, let them smash into the wall, and see what works and what doesn't, because I'm with you, but maybe people doing this find things that we can make very useful. My hypothesis is that it depends on the use case, and my bias is towards enterprise use cases, where you need to have accuracy, you need to have explainability, you need to have governance around this stuff, you need to generate trust. You need to start investing in that semantics to constrain these things and to have oversight around that stuff. But other use cases, you may not need to.
But I think there are other situations to even consider. For example, let's look at Wikipedia. Wikipedia scales. There are humans involved, and there are incentives for them to do that. Why are Wikipedians incentivized to do that? I actually don't know; I'm very curious about that. So it's, how do we get that? Because the argument against doing things in a constrained way is that it won't scale. I'm like, yeah, that's true, so how do we scale? I think LLMs are helping us scale. A lot of the research that we're working on right now is how to extract what's in people's heads and scale that with LLMs, and I think we're starting to see that. But people have been building Wikipedia for a long time, and Wikidata and stuff. How is that working? I find that interesting. I want to learn more about that.
Nicolay Gerold: Yeah, I'm really
Juan Sequeda: People have done a lot of research around that stuff. I just need to find time to read the papers that have been written on this already.
Nicolay Gerold: I'm really torn on it, because I think LLMs are also really fuzzy. They've probably been trained on 100 definitions of a customer. And which one are you using in that case? I'm not sure.
Juan Sequeda: And I think this is why, if you think of an architecture around this: I think LLMs, AI, will touch everything, and it's not just LLMs, there are also normal machine learning models, right? So you have AI touching everything. On top, think about it as having the problems that you're solving, for which you create applications, AI agents, and so forth. And at the bottom you have the data, which you're going to use to answer those questions. That can be structured data, tables, columns, right? Text, all that stuff. How do you get from the applications to the data, right?
We've been working on this when we create data warehouses and stuff, but what we really want to be able to do, I call this, is build that enterprise brain, that context. The analogy here, think about it, is the library. I'm organizing my library. And when you have a question, you go to the library; once upon a time, people would go to libraries, believe it or not, right? So if you go to a library and you know what you're looking for, you can just walk in, you can find the signs, there's self-service to help you find the book. You find the book: this is what I need, I check out, I leave, I'm done.
But sometimes I'm trying to solve a problem which I don't know how to solve, or I tried and couldn't find it. So you go to the reference desk, to the librarian, and you say, hey, I'm trying to solve this problem, I need this. And the librarian asks you questions and tries to understand what your information need really is, what you're trying to do. Oh, hold on, okay, you may need this and this and that, right? And then you go do that. And by the way, I can't answer your question of where it is, but this may give you more things to think about and to re-evaluate what you're really trying to do.
So what is the analogy of that? I think that library is that enterprise context, that brain, I'm calling it: your governed data and knowledge. And then we need to have that reference desk, that librarian; you don't always have to go to the librarian, but think about that librarian as your orchestration. It's, I'm here, you have questions, I can tell you where to go. Oh, you want to know what a customer is? Boom, here are five definitions of customer. You want to know how many customers we have? Here are five different answers based on these five different definitions, and if you have more questions, you can talk to these people. That's what that reference desk is. So I think that's the evolution of things we're heading towards.
Nicolay Gerold: In your mind, what makes something a good entity or a good relationship that should be included in the knowledge graph? And what is just a property?
Juan Sequeda: Okay.
So I have a rule of thumb, and by the way, this is an art and a science, we talk about these things a lot, right?
It really depends on what you're optimizing for.
If you're optimizing for the query performance of something in particular, then maybe you merge everything and everything is a property.
So it really depends on what you're optimizing for.
That's the first thing.
My answer to your question is: it depends, and I always hate that answer, but it really is.
Okay.
But if you put optimizations, or premature optimizations, aside, the rule of thumb I give is: if I need to say more things about something, then that something should be an entity.
And that implies that it has a unique way of identifying it, because I need to point to it.
I need to say: that thing, and that thing is one-two-three, or ABC, or foobar, whatever.
It has a name, it has a unique identity.
Go do that.
And then what sometimes happens is you don't realize that you're categorizing things, and it starts getting bigger and bigger.
Then other people want to point to that thing and they don't have a way to point to it.
And that's when you realize: I probably need to normalize it and extract it out.
So, an example: a person has an address.
Okay, I could just create a concept called person with attribute properties: name, email, street address, city, and that's fine.
But what if I want to reuse that address somewhere else?
I want to say it's a home address, or it's a work address.
I'm starting to say things about that address.
Okay, then let me extract that part and make it an address.
I say a person has a work address, and so forth.
And I can point to that.
I can reuse that address for more things.
So that's the rule of thumb.
But I also say don't be
pedantic around these things.
Just start small.
Iterate around this.
And the other thing is, you are probably not modeling a domain that has never been done before.
So look at what has been done, that's why you have reference models, and take inspiration from them.
And the one I'll recommend everybody look at is schema.org.
Schema.org is the ontology, the schema, that all the big web giants, Google, Yahoo, Microsoft, got together almost 15 years ago to start building; it started small and got bigger and bigger.
And it takes time; it's done in completely centralized ways.
Use that, for example, as a reference, as an inspiration.
In finance, a favorite one is the Financial Industry Business Ontology.
Every domain has a bunch of stuff.
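To make the entity-versus-property rule of thumb concrete, here is a minimal sketch in Python with rdflib, using schema.org terms as Juan suggests. The identifiers (ex:alice, ex:addr1, ex:acme) and the values are made up for illustration; they are not taken from the conversation.

```python
# A minimal sketch of the entity-vs-property rule of thumb, using rdflib and
# schema.org terms. The IRIs (ex:alice, ex:addr1, ex:acme) and the literal
# values are made up for illustration.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
SDO = Namespace("https://schema.org/")

g = Graph()
g.bind("schema", SDO)

# Variant 1: the address is just a literal property hanging off the person.
g.add((EX.alice, RDF.type, SDO.Person))
g.add((EX.alice, SDO.name, Literal("Alice")))
g.add((EX.alice, SDO.address, Literal("1 Main St, Austin")))

# Variant 2: we need to say more things about the address, so it becomes an
# entity with its own identifier that other things can point to and reuse.
g.add((EX.addr1, RDF.type, SDO.PostalAddress))
g.add((EX.addr1, SDO.streetAddress, Literal("1 Main St")))
g.add((EX.addr1, SDO.addressLocality, Literal("Austin")))
g.add((EX.alice, SDO.address, EX.addr1))  # the person points to the address entity
g.add((EX.acme, RDF.type, SDO.Organization))
g.add((EX.acme, SDO.address, EX.addr1))   # another entity reuses the same address

print(g.serialize(format="turtle"))
```

The second variant is what the rule of thumb calls for: once you need to say more about the address, or point to it from elsewhere, it gets its own identifier.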
Nicolay Gerold: Yeah, at the moment, when I'm looking at the different knowledge graph tools, especially when it comes to storage and querying, I don't see many people actually using a lot of graph analytics.
I think a lot of it is really over-engineered for 90 percent of use cases.
I would love to know: what do you actually recommend for storage when people build their knowledge graphs?
Juan Sequeda: So, for some context, I think the knowledge graph database market has always been slow because people end up not paying attention to it.
And I'm hoping now there's going to be more and more interest in it.
That's my reasoning for why we don't have enough tools, or better tools, to go do things.
And honestly, there are not that many, right?
If you think about it from a graph database perspective, Neo4j is obviously the big popular one, and then you have the cloud ones.
AWS has Neptune.
Google launched a recent one called Spanner Graph.
Nicolay Gerold: Yeah.
Juan Sequeda: Microsoft has always had something called Cosmos DB, but they're building something else out, I know.
So they're all in this.
There are other ones: in Europe, there's a company called Ontotext, that's GraphDB.
They merged recently with PoolParty, a company that does taxonomies.
There's another one called Stardog.
They're all really small.
And there are a lot of open source ones now.
Kuzu, I think, is one, from an academic buddy of mine, and the folks at DuckDB have put in a graph layer, things like that.
There's also TigerGraph.
So there are all these things coming out there.
Some of them are property graphs and Cypher, some of them are RDF and SPARQL, and some will do both.
An open source one to go look at is Apache Jena.
All right.
That's an RDF one, and you can have different storage layers.
We're public about this at data.world: everything we use is RDF, and the storage layer we have is something called RDF HDT, Header Dictionary Triples; it's basically like the Parquet format for RDF.
So we've separated storage and compute, and part of our compute engine is Apache Jena, which we've extended to support this, and that works for us at web scale.
data.world is, I think, one of the largest, if not the largest, open data communities, in addition to being an enterprise data catalog.
We have 2 million users, we have half a million open data sets, and for every data set people upload, whether it's a CSV or a Parquet file, we have an RDF form of it, and it's all queryable.
So we've been able to show that all of this scales, at web scale.
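As an illustration of the point that data lifted into RDF stays queryable as a graph, here is a minimal rdflib sketch that loads a hypothetical Turtle file and runs a SPARQL query over it. This is not data.world's actual engine or storage format (they describe using Apache Jena over RDF HDT); it only shows the query side of the idea.

```python
# A minimal sketch of querying RDF with SPARQL via rdflib. The file name and
# its contents are hypothetical; the point is only that data lifted into RDF
# stays queryable as a graph, whatever the storage layer underneath.
from rdflib import Graph

g = Graph()
g.parse("customers.ttl", format="turtle")  # any RDF serialization works here

query = """
PREFIX schema: <https://schema.org/>
SELECT ?name ?city WHERE {
    ?person a schema:Person ;
            schema:name ?name ;
            schema:address ?addr .
    ?addr schema:addressLocality ?city .
}
"""

for row in g.query(query):
    print(row.name, row.city)
```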
Nicolay Gerold: Do you see enterprise
knowledge graph usage evolving?
What are the trends you're
actually excited about and what
do you think we should throw away?
Juan Sequeda: The number one thing that we're seeing is metadata.
I'm going to bet that your audience doesn't pay attention to, or probably even hates, the analysts like Gartner and such, but hey, they do have a pulse on the market, and it's very clear that, in how they even present the market, metadata and all this stuff is the foundation of it.
That's what I'm really excited about: people realizing that metadata is a graph.
A lot of the analysts are saying: hey, metadata is a graph, I'm just connecting things, right?
And then you want to bring in more things and connect more and more things to it, right?
So realizing that your first knowledge graph application is a metadata application, that is a trend I'm now seeing more and more.
So that's number one.
Of course, with AI and LLMs, people are now starting to think more about ontologies and taxonomies to add extra context.
Ontology was always a "don't say the O word" kind of thing, we would say, but now people are starting to say these things, and they think they're cool because they're saying this new cool thing.
I'm like, I've been working on this stuff for 20 years, and people have worked on this stuff for 50, 60, 70 years, even more.
So I think that's the next trend: knowledge as the source of the context that these LLMs lack.
And those two things are really one and the same.
The metadata knowledge graph of your enterprise, as you start building it out, is the context that can be used for everything in your enterprise, including all the AI and LLM applications that you're building.
So that's the number one trend.
Nicolay Gerold: Yeah, what is
actually something that you would
love to see built in the space
that you aren't building yourself?
Juan Sequeda: I hate that I don't have an immediate answer for you.
It makes me feel bad.
It makes me realize I'm in my own bubble, because I'm working on the stuff that I think really needs to be done, but let's see.
I don't know exactly what would be built to accomplish this, but one of the things people say is: why do I need a graph? I can put everything in a table and do this, right?
So there's that mindset of: I have a hammer and everything for me is a nail, so everything's a relational database, I don't need that.
So I would like things to be much easier, so you can say: it doesn't matter.
It's all just the database; it could be relational, it could be tabular, almost like these multi-model databases that have existed.
But it needs to have a way to bridge that gap of: why do I need to go do the graph?
It's already there for you; I want it to already be there, merged.
So I think there's something there for databases to be done.
Imagine you have a way where I can query this in SQL, I can query this in SPARQL, I can query this in Cypher.
It's all the same data, whatever application you want to go build.
And that, in reality, would mean that semantics is a first-class citizen in how you're storing, modeling, and querying your data.
So I think that's something, because right now we just see silos: here's a relational SQL database, here's a graph database that only does Cypher, here's another graph database.
I think there's an opportunity to merge all this stuff right there.
Nicolay Gerold: So someone has to build SGLang on a different level.
Juan Sequeda: Oh, I think, for starters, for example, Neptune and all these other graph databases now have the one-graph idea, where it doesn't matter if it's one model or the other.
And with the ISO standard for SQL, there's a version now called SQL/PGQ, which is an extension of SQL that does graphs.
And they also have GQL, the Graph Query Language.
So things are heading in that direction.
I just want to start seeing more of these implementations.
Nicolay Gerold: Yeah.
What's next for you and data.world?
What are you building right now that you can already tease?
Juan Sequeda: So what I've been teasing here is: we've been extracting metadata from all these technical systems.
Now I want to extract metadata from what's in our heads.
I want to extract our tacit knowledge.
I want to be able to start connecting these things together.
So that's one thing, and the other one is really showing more examples and pushing out all the really cool graph analytics things you can do over your metadata, and how that actually plugs back into your graph.
It's: hey, we observed and learned this thing, so this is an action you should go take.
And so you start adding more tasks and things into your graph and connecting them to people.
So those are the two aspects that our lab is working on right now.
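As a rough sketch of the kind of graph analytics over metadata Juan describes, here is a toy lineage example with networkx. The asset names and edges are invented; this is an illustration of the idea, not data.world's implementation.

```python
# A toy lineage graph over metadata, analyzed with networkx. Asset names and
# edges are invented; the idea is impact and provenance analysis over the
# metadata graph, not any particular product's implementation.
import networkx as nx

lineage = nx.DiGraph()
# Edges point from a source asset to the asset derived from it.
lineage.add_edges_from([
    ("raw.orders", "staging.orders_clean"),
    ("staging.orders_clean", "marts.revenue_daily"),
    ("raw.customers", "marts.revenue_daily"),
    ("marts.revenue_daily", "dashboard.exec_kpis"),
])

# Impact analysis: if raw.orders changes, everything downstream is affected.
print(sorted(nx.descendants(lineage, "raw.orders")))
# ['dashboard.exec_kpis', 'marts.revenue_daily', 'staging.orders_clean']

# Provenance: what does the executive dashboard ultimately depend on?
print(sorted(nx.ancestors(lineage, "dashboard.exec_kpis")))
# ['marts.revenue_daily', 'raw.customers', 'raw.orders', 'staging.orders_clean']
```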
Nicolay Gerold: Yeah, if people want to learn more about knowledge graphs, data.world, data catalogs, or you, where would you point them?
Juan Sequeda: You can find me on LinkedIn, I'm just Juan Sequeda.
To learn more about data.world, just go to data.world.
If you want to learn more about knowledge graphs, I have a book called Designing and Building Enterprise Knowledge Graphs; actually, just ping me, just find me, and I'll send you a copy.
I just want people to go read this stuff.
My colleague who I work with in our lab, Dean Allemang, has a book called Semantic Web for the Working Ontologist.
So if you're interested in learning how ontologies work and are built, that's the textbook for that.
And then, sorry for one more shameless plug: on our podcast, you can listen to Catalog & Cocktails, the honest, no-BS data podcast.
Thank you.
Nicolay Gerold: So what can we take away when we want to build with knowledge graphs, or we want to build knowledge graphs?
If you look at them from a high level, knowledge graphs connect data points meaningfully.
They map relationships between concepts and make hidden connections actually visible.
So it's like an organization's brain in the end.
It's not just storing the data, but also understanding how things relate, which is often very use-case-, domain-, or even company-specific.
And data catalogs, on the
other hand, track what data
you have and what it means.
So it's like a library index of data
assets in the end, but also all the
relationships that different data
assets have to each other and the
attributes they have, like access
controls and things like that.
And when do you need them? First of all, you should always start with a problem.
You don't build these systems just because they are trendy.
You build them because you have a problem that is actually worth solving with them.
So you need a data catalog when people waste time searching for data, teams get different answers to the same questions, or you're losing money from poor data management.
And you need knowledge graphs, on the other hand, when you have to connect data across many sources and the relationships between the data are very important, so they carry a lot of meaning and a lot of context.
And you also need the context for the AI
system so it can make better decisions.
So when you build your first systems,
or you want to go into the direction
of knowledge graphs, you shouldn't
start with knowledge graphs,
ontologies, taxonomies from the get
go, you should start with metadata.
So you should think about: okay, how can I add metadata to the existing data I already have?
Add it on top of your current system, add it as attributes in your vector database or in your Postgres database, add it as a column, and use it to filter.
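A minimal sketch of the "add metadata as a column and use it to filter" idea, with Python's built-in sqlite3 standing in for Postgres or a vector store's metadata filter; the table, columns, and values are hypothetical.

```python
# A minimal sketch of "add metadata as a column and use it to filter", with
# Python's built-in sqlite3 standing in for Postgres or a vector store's
# metadata filter. Table, columns, and values are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        content TEXT,
        source TEXT,       -- metadata: where the data came from
        owner TEXT,        -- metadata: who is responsible for it
        updated_at TEXT    -- metadata: freshness, for "is this outdated?"
    )
""")
conn.executemany(
    "INSERT INTO documents (content, source, owner, updated_at) VALUES (?, ?, ?, ?)",
    [
        ("Q3 revenue summary", "finance_warehouse", "finance-team", "2024-10-01"),
        ("Old pricing sheet", "shared_drive", "sales-team", "2021-02-15"),
    ],
)

# Filter on the metadata: only trusted sources that are reasonably fresh.
rows = conn.execute(
    "SELECT content FROM documents WHERE source = ? AND updated_at >= ?",
    ("finance_warehouse", "2024-01-01"),
).fetchall()
print(rows)  # [('Q3 revenue summary',)]
```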
And then over time, you can start to think about building full knowledge graphs.
You can then basically grow them very naturally, and add business terms and definitions for the different terms later.
I think what was pretty interesting is that he talked about the three key use cases he's seeing, the first being search and discovery.
So basically you have to find relevant data quickly, understand what data means, and know who owns what.
And I think this is the main use case we see for knowledge graphs right now, beyond specializing on basically enabling search or building a data catalog over all data assets.
So it's on a higher level, but if you're building a knowledge graph for search, for example, I think the same applies: you want to find relevant data quickly for the specific search query that's coming in.
And the second one is basically governance.
So you track data access, monitor sensitive data, and ensure compliance.
On a use-case-specific level, I think you can apply the same ideas.
We talked last week with Danielle Davis about temporal RAG and adding a time component to your data.
And I think this is the governance component you can use a knowledge graph for in general as well: when was the data created, when was it updated, and when might it be outdated?
The same goes for monitoring sensitive data.
You can add relationships and rules to the knowledge graph so that only certain users can access it.
And then you basically have the technical operations part, which I think doesn't really map to the regular knowledge graphs we would use in search; it's more about data lineage, tracking data quality, and managing dependencies.
I think this is not as mappable onto different use cases, but it's the one data.world is working on.
To summarize it again: to make it work, build incrementally.
So pick one problem and map the path: how can you go from a business need to a technical solution?
Then create a single thread end to end and expand carefully from there.
It isn't about size, it's about solving the problem you have at hand.
So start small and make it useful fast.
Don't over-engineer, start simple, focus on actual problems.
Balance automation with human oversight, and remember: if you're a small startup, or if you're very early in your AI journey, you probably don't need a knowledge graph.
Yes, that's it.
We will have one more episode on
knowledge graphs next week, and
then we have, I think, one or two
more search episodes before we
actually move on to the next season.
So, let me know what you think
of the episode, and otherwise,
I will catch you next week.