How AI Is Built

Today’s guest is Mór Kapronczay. Mór is the Head of ML at Superlinked. Superlinked is a compute framework for information retrieval and feature engineering systems that turns anything into embeddings.
When most people think about embeddings, they think about Ada and OpenAI.
You just take your text and throw it in there.
But that’s too crude.
OpenAI embeddings are trained on the internet.
But your data set (most likely) is not the internet.
You have different nuances.
And you have more than just text.
So why not use it?
Some highlights:
  1. Text Embeddings Are Not a Magic Bullet
➡️ Pouring everything into a text embedding model won't yield magical results
➡️ Language is lossy - it's a poor compression method for complex information
  2. Embedding Numerical Data
➡️ Direct number embeddings don't work well for vector search
➡️ Consider projecting number ranges onto a quarter circle
➡️ Apply logarithmic transforms for skewed distributions
  3. Multi-Modal Embeddings
➡️ Create separate vector parts for different data aspects
➡️ Normalize individual parts
➡️ Weight vector parts based on importance
A multi-vector approach helps you understand the contribution of each modality or embedding. It also makes it easier to tune your retrieval system without fine-tuning your embedding models: you tune your vector database the way you would a search database (like Elastic).
Mór Kapronczay
Nicolay Gerold:
00:00 Introduction to Embeddings
00:30 Beyond Text: Expanding Embedding Capabilities
02:09 Challenges and Innovations in Embedding Techniques
03:49 Unified Representations and Vector Computers
05:54 Embedding Complex Data Types
07:21 Recommender Systems and Interaction Data
08:59 Combining and Weighing Embeddings
14:58 Handling Numerical and Categorical Data
20:35 Optimizing Embedding Efficiency
22:46 Dynamic Weighting and Evaluation
24:35 Exploring AB Testing with Embeddings
25:08 Joint vs Separate Embedding Spaces
27:30 Understanding Embedding Dimensions
29:59 Libraries and Frameworks for Embeddings
32:08 Challenges in Embedding Models
33:03 Vector Database Connectors
34:09 Balancing Production and Updates
36:50 Future of Vector Search and Modalities
39:36 Building with Embeddings: Tips and Tricks
42:26 Concluding Thoughts and Next Steps

What is How AI Is Built?

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.

Nicolay Gerold: Today we are
continuing our series on embeddings.

When most people think about embeddings,
they think about Ada and open AI.

Just take your text and throw it in there.

But that's often too crude. OpenAI's
embeddings are trained on the

internet, but your dataset most
likely is not the internet.

You have different nuances, a different
focus, and also a different use case.

And also you have more than
just text, so why not use it?

Especially when text-only embeddings tend
to fail. Today, we are talking to Mór.

Mór is the ML lead at Superlinked, and we
will dive into how they embed everything.

Categories, numbers,
locations, texts, images.

And how you can combine them.

To create more robust and relevant,
but also more explainable, search.

Let's do it.

Mór Kapronczay: One very, like,
hindering limitation is only using

text. As you said, for example, if you
just simply go to OpenAI and make it

embed the numbers from zero to a hundred,
and you plot the cosine similarities

between 50 and all the numbers, what you
would expect to get is a triangle-shaped

graph. But in reality, what you get is
something that has the trend that you

would expect, but has extreme noise.

So, like, 40 would be more similar
to 50 than 49, because 40 has

a zero in it, that kind of stuff.
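
A minimal sketch of that experiment, assuming the OpenAI Python client and the text-embedding-3-small model (the exact model Mór used is not stated):

```python
# Embed the strings "0".."100" and compare each to "50".
# Assumes OPENAI_API_KEY is set; the model name is an assumption.
import numpy as np
from openai import OpenAI

client = OpenAI()
numbers = [str(i) for i in range(101)]
resp = client.embeddings.create(model="text-embedding-3-small", input=numbers)
vectors = np.array([d.embedding for d in resp.data])

# Cosine similarity of every number to "50".
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sims = vectors @ vectors[50]

# Ideally this falls off smoothly on both sides of 50; in practice the curve
# is noisy, e.g. "40" can score higher than "49".
for i in (40, 49, 51, 60):
    print(i, round(float(sims[i]), 3))
```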

So yeah, I think this is extremely
limiting and people are, I think, yet

to see that, or I still experience
people expecting to just pour everything

into a text embedding model and
expecting it to do magic and work well.

But, like, having different data embedded
in a more data-specific way is

definitely something that I think we
need more education on as a community.

Nicolay Gerold: Do you think, if we get
smarter with how we represent different

data as text, we can work around the
loss, the limitations we have currently?

Mór Kapronczay: Yeah.

So on the one hand, I think thinking
in text will always have the

limitation of thinking in the next
token prediction and that approach

in itself has its limitation.

I don't know if you've seen the example
where people ask GPT-4 if some number

is a prime and the model always answers
with yes or no, that's the first token.

And if that's right or wrong,
that's, that has a, like a big risk.

And if it's wrong, it will try to
like argue for the wrong answer

because now the token is predicted.

So you can only predict the next token.

So there's that, I think, which is

A very significant problem
that arises from text.

And there are also different accounts
out there that there is a limit to, I'm

definitely not an expert on this matter,
I just read some paper abstracts talking

about this but they say that there is
a limit to how much language can be an

efficient medium to convey intelligence.

So is language the way to
artificial intelligence?

There is, like, a heated debate right now
whether that's the way to achieve that.

Nicolay Gerold: I think language is very
lossy, it's a bad compression in the end.

It's really hard to express
even what I want to have built.

Mór Kapronczay: Yeah, you always
end up like writing meta algorithms

or dummy algorithms because yeah.

Nicolay Gerold: How would you actually
define unified representations of data?

Mór Kapronczay: Oh, yeah, that's
a very, really good question.

The way we like to look at it is let's
imagine that you're a company doing

a lot of stuff that you would want
to be data-driven, and the

unified representation is a good enough
representation that would power all of

these use cases that you would want to do.

In our case, for example, we believe that
analytics, semantic search, recommender

systems, or personalized RAG can be driven
based on the same vectors that you create

on top of your data.

For example, you could as an organization
have the same user vectors that would

drive the personalized RAG system,
that would drive what content you

recommend next to the user, and so on.

I believe what you should strive for, in
terms of unified, as an actor in the

space, is unified in the sense that it
would power all your use cases.

When I joined, more than two years ago, we
were building a personalization engine.

So we were wanting to be the engine
behind personalized experiences online.

So updating vectors, user
vectors, updating content vectors.

And we could imagine this happening in
like social networks, media outlets,

any, basically any online experience
that has some personalized content

aspect to it.

And when we were building out that
product at some point we realized that

actually what we are building here,
the approach is is much more broad.

So actually what we are building
here is a vector computer.

That's when we pivoted and started to
call ourselves the vector computer,

which is essentially a middle layer
between data sources, the vector DB,

and the applications that are trying
to use all these vectors.

So, like, maintaining the data
transformation into embeddings, storing

them in a vector database, and then,
through the query functionality, powering

all the use cases that would somehow
benefit from these vectors.

Nicolay Gerold: Maybe let's
start from the inverse.

Is there anything that cannot be turned
into embeddings, or that isn't sensible to?

Mór Kapronczay: Wow.

Nice.

I haven't really thought
about this before like this.

I know one thing.

Location is tricky.

Location is tricky to
be turned into vectors.

Because you could create these
spherical coordinates; that way you

could turn GPS coordinates into points
on a unit sphere.

What's a circle in 3D? A sphere, right?

So, points on a unit sphere.
But the problem with

that is that it's very hard or nearly
impossible to scale the distances.
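
A small illustration of the point, not Superlinked's code: project latitude and longitude onto the unit sphere and compare dot products. The coordinates below are arbitrary examples.

```python
import numpy as np

def latlon_to_unit_sphere(lat_deg, lon_deg):
    # Convert GPS coordinates to a point (x, y, z) on the unit sphere.
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])

budapest = latlon_to_unit_sphere(47.4979, 19.0402)
suburb = latlon_to_unit_sphere(47.55, 19.10)        # a few km away
sydney = latlon_to_unit_sphere(-33.8688, 151.2093)

# Relative to the whole globe, points in the same city are nearly identical,
# so meaningful intra-city distances get crushed.
print(np.dot(budapest, suburb))   # ~1.0, practically indistinguishable
print(np.dot(budapest, sydney))   # much lower, dominates the scale
```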

So compared to the whole hemisphere
of the earth, people in the same

city, or 200 kilometers away, would
be extremely close to each other,

because the whole representation
spans the whole globe.

And it's really hard to create
a representation where you could

scale that distance away so that you
could create meaningful distances

inside the city, for example, that
would be like useful for something.

Like a global location representation
that you can scale is something that I've

tried to create, but failed miserably.

So that's one example
I think I can mention.

Nicolay Gerold: And now the other
side, especially for recommendation

systems, because I think there are
more, there is more data that is

actually turned into embeddings.

What are the different
things you actually embed

in a recommendation system?

Mór Kapronczay: I think the most
important distinction with recommender

system data is whether it's
content data or whether it's like

behavioral data or interaction data.

In terms of content, I think
recommender system is not very

different from other use cases.

You would, you could have text, you
could have timestamps, categorical

data whatever is there in any
other semantic search use cases.

And I think where it gets
really interesting is the

interaction data or
behavioral data. For that,

I think everybody knows collaborative
filtering or like in general matrix

factorization methods, which are really
interesting and really useful in this

vector based recommender system case,
because you could basically have a

vector part that refers to the content
of the item that you are trying to sell,

like the embedding of the image of the
dress, the embedding of the description,

the embedding of the category, whether
it's a skirt or a T-shirt or whatever.

And you would have a different vector
part that would somehow embed the

consumption patterns of the users.

These vector parts would be closer
to, closer between dresses that are

frequently bought together, for example.

So this way you could model in the same
vector, both aspects of the product.

Yeah.

Nicolay Gerold: And how do you actually
combine the different embeddings and

especially in recommendation systems?

So I have, for example, the embeddings
of the images, the product description,

but also metadata, different
categories, maybe even some numerical

values, the sparse vectors of
the different texts and categories.

How can I combine the different vectors?

Mór Kapronczay: I think when
you think about this, you should

definitely start from the distance
function that you're using.

We specialize in dot product similarity,
which is a more general case of

cosine similarity, and essentially
it just works like adding up the

multiplication of the corresponding
vector index items, basically.

So like SUMPRODUCT in Excel,
for example. And if you go

from this direction, one important
thing to realize is that you can

basically just times-two a vector and
the similarity will be times two.

So you can pretty easily weight stuff.

And also,

if you have different vector parts
concatenated together, if you times-two

one vector part, then that vector part's
contribution to the similarity will be 2x.

So these are really simple
mechanics that you can make use

of in a multi modal vector case.

So stemming from this, what we do
is we create the individual vector

parts that refer to each aspect of
the data, each a different embedding.

We always normalize them and
we can recycle what's a useful

normalization in that case.

So these, and then we have these
normalized vector parts we can

then concatenate, and then we
can also do reweighting on top.

So like I have three normalized
vector parts, but I do realize now

that I really want the fresher items.

So I would give more weight to
the recency based vector part,

and then it would contribute more
to the similarity between items.

So that's basically the approach.
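
A minimal NumPy sketch of that normalize-concatenate-reweight mechanic. The part names, dimensions, and weights are made up for illustration; this is not Superlinked's implementation.

```python
import numpy as np

def normalize(v):
    # Unit-normalize one vector part so every part starts on an equal footing.
    return v / np.linalg.norm(v)

def combine(parts, weights):
    # Normalize each part, scale it by its weight, then concatenate.
    return np.concatenate([w * normalize(p) for p, w in zip(parts, weights)])

# Hypothetical aspects of one item and one query (dimensions are arbitrary).
rng = np.random.default_rng(0)
item_parts = [rng.random(384), rng.random(512), rng.random(2)]    # text, image, recency
query_parts = [rng.random(384), rng.random(512), rng.random(2)]

equal = combine(item_parts, [1.0, 1.0, 1.0]) @ combine(query_parts, [1.0, 1.0, 1.0])
fresher = combine(item_parts, [1.0, 1.0, 2.0]) @ combine(query_parts, [1.0, 1.0, 1.0])

# Because dot product sums over indices, doubling the recency part on one side
# doubles exactly that part's contribution to the overall similarity.
print(round(float(equal), 3), round(float(fresher), 3))
```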

Nicolay Gerold: And how do you decide
which parts of the data should have

their own embeddings and when to add
new embeddings from different modalities

or different parts of the data?

Mór Kapronczay: In that question, I would
always start from the query patterns.

I would always think that what would
our users want to query from this data?

And if we know that users prefer
popular content in their feed, then we

know we should embed the like counts
and the comment counts on posts.

Or.

If we know that this is a news site,
so we want recent news articles to

be on top of the page, we know we
need to embed the creation timestamps

and give more weight to them.

That's an algorithm that we came up
with on a company retreat in Montenegro.

It's essentially very similar to
positional encoding, or definitely

gets the idea from positional encoding.

What we do is we project a time
frame on a quarter circle, and where

now, or like the point of query is
a definitive point, like the point

where cosine is one and sine is zero.

And so if you go back in time,
the cosine similarity will be

proportional to the time passed.

And the good thing about this is this:
the hard thing about

time is that it passes, right?

So you create an embedding and in the next
second, you would have to create a new

embedding if you don't do something smart.

And what you can do is at embedding
time, you can decide that, okay, I will

want to re embed my data a day from now.

So let's create embeddings
that are valid for another day.

And that's the trick to set
the point in the quarter circle

to be like one day from now.

Calculate back the time frame
from that time and embed the

timestamps based on that time frame
projected on the quarter circle.

So you would have valid recency
embeddings for another day.
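
A hedged sketch of that quarter-circle recency trick as described; the validity window and the overall time range are assumed parameters, not Superlinked's exact code.

```python
import time
import numpy as np

def recency_embedding(timestamp, anchor_time, period_seconds):
    # Map "how far in the past" onto a quarter circle: the anchor sits at
    # angle 0 (cos=1, sin=0); the oldest point of the window sits at 90 degrees.
    age = np.clip((anchor_time - timestamp) / period_seconds, 0.0, 1.0)
    theta = age * np.pi / 2
    return np.array([np.cos(theta), np.sin(theta)])

now = time.time()
day = 24 * 3600
# Set the anchor one day in the future, so these vectors stay valid for
# another day without re-embedding (the trick described above).
anchor = now + day
query_anchor = recency_embedding(anchor, anchor, period_seconds=30 * day)

for age_days in (0, 1, 7, 29):
    item = recency_embedding(now - age_days * day, anchor, period_seconds=30 * day)
    # The dot product with the anchor falls off smoothly as items get older.
    print(age_days, round(float(item @ query_anchor), 3))
```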

Nicolay Gerold: What speaks against

pre-filtering or post-filtering the
results, or also using a reranker to

basically re-rank based on the recency,

and just using the scores?

Mór Kapronczay: Yeah that's
one of my favorite topics.

I feel that people are very lenient
towards re ranking because if you

think about it, vector search is like
blazingly fast, like extremely fast

and building any layer on top that kind
of re ranks the results from the, from

vector search is like a magnitude slower.

So it will increase your
system's latency a lot.

So I would think based on this, that
I need pretty strong justifications.

So that I should re-rank,
so that I should pre-filter.

And I think I have yet
to see these reasons.

So I think most of the aspects
that people are re-ranking for

can be expressed in vectors and can
be pushed into the vector search.

Actually, there is one, one thing that's
not really compatible with vector search.

And that's that's when, at recommender
systems, there's always this

notion of make the results diverse.

Now, vector search results inherently
will not be diverse because they will

be like similar to the query, right?

Of course, there are
techniques to do that, but.

I think that's the only occasion I've
encountered in my career that there is

an aspect that
we cannot push into vector search and

express through vectors.

Nicolay Gerold: And if I
would want to transfer it.

So for example, if I want
to add numerical quantities.

So for example, I'm doing a company
search and I want to embed, for example,

the revenue they have, how would
you actually go about creating a new

embedding for this new type of data.

Mór Kapronczay: Yeah, so in this
case, this is a numerical data,

so I would use our number space.

The number space is, embedding numbers
is I think the most varying based

on the task you are doing, right?

Because if you have like a regression
model, you would just put the

number in the model and it works.

Or maybe you do a log transform if
it's like a power-law distribution,

but not much of a fuss.

It can handle it.

But for vector search, you
can't really use the number

itself in the vector because.

Then you would have a one D
vector that you cannot scale.

You cannot do anything with it.

It's just a monolith.

You cannot touch it.

That's very not beneficial
for that purpose.

So for numbers, we are doing something
quite similar to what I

mentioned for the recency: we would
have a range, like a min and max.

And we would embed the range
projected to the quarter circle.

And what you can also do under the
hood is do some logarithmic transform.

If you have skewed, power-law-distributed
data, you can apply, like, a log10.

So it would smooth out this
inconsistency and make the

space more, like, log-scaled.
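
The same quarter-circle idea for plain numbers, with an optional log10 transform for skewed data; the revenue range below is an invented example.

```python
import numpy as np

def number_embedding(x, lo, hi, log_scale=False):
    # Project a number in [lo, hi] onto a quarter circle so that vector search
    # can express "closer in value" as "higher dot product".
    if log_scale:                      # tame skewed, power-law-like data
        x, lo, hi = np.log10(x), np.log10(lo), np.log10(hi)
    t = np.clip((x - lo) / (hi - lo), 0.0, 1.0)
    theta = t * np.pi / 2
    return np.array([np.cos(theta), np.sin(theta)])

# Company revenue between 10k and 10B, log-scaled: an order of magnitude is one
# "step" in the space, instead of small companies all collapsing together.
a = number_embedding(1e6, 1e4, 1e10, log_scale=True)
b = number_embedding(1e7, 1e4, 1e10, log_scale=True)
c = number_embedding(1e9, 1e4, 1e10, log_scale=True)
print(round(float(a @ b), 3), round(float(a @ c), 3))  # a is closer to b than to c
```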

Nicolay Gerold: Yeah.

And we have a case where we are in the
financial space and we use a post-filter,

for example, based on the revenue.

And the issue is, if you have

a minimum amount of
items you want to return,

if you're using a filter, it's very hard,
because the filter could throw away

more items

than you would want
or than you planned to return.

So you basically have to do a second
pass through the entire retrieval system.

And if you could embed that

in the vector space,

you could basically

circumvent it.

Mór Kapronczay: Exactly.

That's a very important problem.

We often encounter that.

Using a binary filter and expressing
some more elaborate preference for

some value is a very different thing.

If you can only have a binary filter,
then where would you put the filter?

You can't express differences
between items that pass the filter.

So you lose a lot of expressive
power with these filters.

Instead of smoothly blending
these different aspects

together through vector search.

Nicolay Gerold: The
categorical and ordinal data.

How do you go about embedding those?

Mór Kapronczay: Yeah, so when you
think about categorical data, I

think, so there is the aspect, there
is the case where you have labels.

You can do magic like target encoding.

That's something I encountered
to be, and please don't be mad at

me, to be like the lazy solution.

Because most of the times when you
want to do target encoding, you

could actually do something smarter.

But you are just lazy, so you would
just impute the target variables.

I'm happy to debate this later, if
anybody wants to talk about this.

I know it's, it pushes all
the buttons for everybody.

So yeah, if you put this aside and let's
imagine this is an unsupervised case

and you don't have labels, I think you
have two major directions you can go.

One is just essentially text embed
the category name, if it's a telling

name, like if you have, let's say,
e-commerce products, you have

skirts or you have t shirts.

It makes sense to embed those with a text
embedding model because then you would

have categories that are semantically
closer or further away from each other.

So that makes sense.

But let's look at another case
when you have categories like

A1 or like B3, you wouldn't
semantically embed those categories.

I would say that the best approach
here is like the naïve approach,

which doesn't expect any kind
of relation in the data, which is

like one hot encoding, essentially.

Because that way, all the categories
are orthogonal to each other.

There is no differentiation
in their relationships.

So you could use that to
express relations that are only

dependent on the category itself.

There's no other notion in the embedding.
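
A tiny sketch of that naïve, one-hot style category space; the category labels are made up.

```python
import numpy as np

def one_hot(category, vocabulary):
    # Every category gets its own orthogonal axis: identical categories match
    # perfectly, different categories contribute nothing to the similarity.
    v = np.zeros(len(vocabulary))
    v[vocabulary.index(category)] = 1.0
    return v

vocab = ["A1", "B3", "C2"]            # non-telling labels: no semantics to embed
print(one_hot("A1", vocab) @ one_hot("A1", vocab))  # 1.0
print(one_hot("A1", vocab) @ one_hot("B3", vocab))  # 0.0
```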

Nicolay Gerold: What
about the ordinal data?

Because I think ordinal data for
me is one of the most challenging,

because usually there is meaning in
the different categories or ranks.

Mór Kapronczay: Yeah, I would probably
just go and number-embed it.

So I would just use our number embedding,
because this way you can, you still

have kind of the ordinality, so
the ranks, you would still have them.

The only difference is you would have
like integer instead of floats, right?

So you would not have
items between categories.

But that's, I think, exactly
what the number space would do.

If you have 10 categories that
are ordinal, on the range you

would have 10 points,

like 10 different representations of
sines and cosines on the quarter circle.

Nicolay Gerold: If I now
have my item representation,

so I have a text, maybe an image,
maybe some additional metadata,

which I can embed as categories
or as numbers or whatever,

what impact does the embedding
size of these different embeddings

have on the overall system, on
the weighting and the retrieval

at large?

Mór Kapronczay: Yeah, that's a
also very interesting question.

So one lucky thing is that dot
product scales extremely well, right?

So if you create like a 10x bigger
vector, the dot product will take,

like, 1.1x the time.

So luckily long vectors are not

a big issue in terms of vector search.

What can be an issue is memory, right?

For example, if you think about Redis,
which is an in memory database if you

have too many too long vectors, you
can hit some memory barriers there.

So that can be problematic.

I think there is we see
all this quantization thing

happening in embeddings.

I think that's something that we should
keep a close eye on and use those.

In our systems as well.

So don't store float 64
numbers all the time.

If float 16 or float eight is enough.

And now you have like a quarter

of the memory usage.

Yeah, I think these are important things.
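
A quick back-of-the-envelope on the memory side; float8 support depends on your stack, so the sketch stops at float16.

```python
import numpy as np

n_vectors, dims = 1_000_000, 768
for dtype in (np.float64, np.float32, np.float16):
    # Bytes per vector = dims * itemsize; scale up to the whole index.
    gb = n_vectors * dims * np.dtype(dtype).itemsize / 1e9
    print(dtype.__name__, f"{gb:.2f} GB")
# float16 needs a quarter of the memory of float64 for the same index, which
# matters most for in-memory stores like Redis.
```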

And in terms of the semantics, so
the weighting and everything what you

have to keep an eye on is the relative
weights of the aspects themselves.

So you can easily switch some embedding
off by setting it to zero weight.

You can easily pile it up
by increasing the weight.

So that's, I think that's the main
thing I love about our approach.

And that's a lesson learned the hard way
in the previous company, or the previous

product that we built: fast
experimentation with these weights is
experimentation with these weights is

probably key to like good performance.

And if you have to always re embed
your data set, whenever you change

some parameter, that's painful.

So you would want to push all the levers
that you would want to tinker with

further down the line to the query level.

That way you don't have to touch your
knowledge base and you could try a number

of configurations because you would,
because that's the good thing about

dot product that it's symmetric, right?

So you can alter your query vector.

It will be the same if you alter
your knowledge base vector.

So you can just make your changes on
the query vector and iterate fast.
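
A small sketch of why weighting the query side is enough, reusing the concatenated-vector-parts idea from earlier; dimensions and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
item_text, item_img = rng.random(4), rng.random(4)
query_text, query_img = rng.random(4), rng.random(4)

def concat(parts):
    return np.concatenate([p / np.linalg.norm(p) for p in parts])

item = concat([item_text, item_img])                  # stored once, never touched
weights = np.concatenate([np.full(4, 2.0), np.full(4, 0.5)])

# Weighting the stored vector ...
weighted_item_score = (item * weights) @ concat([query_text, query_img])
# ... gives the same similarity as weighting the query, because the dot
# product is symmetric. So you can experiment with weights at query time
# without re-embedding the knowledge base.
weighted_query_score = item @ (concat([query_text, query_img]) * weights)
print(np.isclose(weighted_item_score, weighted_query_score))  # True
```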

Nicolay Gerold: Are you already playing
around with dynamically weighting based on

the query? So for example, if I have a very
visually driven or formulated query, you

put more weight onto the image embeddings,
as opposed to the text embeddings.

Mór Kapronczay: Absolutely.

Absolutely.

So that's definitely something the natural
language querying feature is doing, tries

to like, tries to find the intent behind
this natural language that the user types

in, and then sets weights based on this.

And another approach is that if
you are lucky enough to have a

uniform set of weights that
applies for all your users, you can

as a developer also preset those.

But I think the more interesting
case is when every user has a

different weight preference.

And in that sense, I think one, one
step we did is the natural language

query so that the underlying instructor
model can detect all these intents

and set the weights accordingly.

And also if you have labels, you're
lucky enough to have labels yourself.

You can also train a model with with
even like user specific weights that

would set them dynamically for you.

Nicolay Gerold: How do you
actually evaluate whether a new

embedding adds additional signal
or information to the retrieval?

Mór Kapronczay: Yeah.

This again boils down to
whether you have labels or not.

If you have labels, you can basically
calculate, like, IR metrics, and you would

have a proper eval and you could say that
it lifted MRR or it lifted DCG or maybe

I said the acronym wrong, but whatever.
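
If you do have labels, the eval can be a few lines of standard IR metrics. A hedged MRR sketch (NDCG follows the same shape), with invented document IDs:

```python
import numpy as np

def mrr(ranked_ids_per_query, relevant_ids_per_query):
    # Mean reciprocal rank: 1 / position of the first relevant result, averaged.
    scores = []
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        rr = 0.0
        for pos, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr = 1.0 / pos
                break
        scores.append(rr)
    return float(np.mean(scores))

# Compare a text-only setup against a multimodal one on the same labelled queries.
baseline = mrr([["d3", "d1", "d7"]], [{"d1"}])     # first relevant at rank 2 -> 0.5
multimodal = mrr([["d1", "d3", "d7"]], [{"d1"}])   # first relevant at rank 1 -> 1.0
print(baseline, multimodal)
```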

And yeah, if you don't have labels though,
that's a more interesting question.

What I see us and our customers do is
AB test, try one with the embeddings and

without for a limited time and observe
the results or what you can do in some

cases, even is eyeballing the results,
try your most common queries and see if

you like the results more than before.

That's probably a first step you do.

To even justify going on an A B
test or running some eval script?

Awesome

Nicolay Gerold: How do you see going
into a joint embedding space versus

going into separate embedding spaces
for the different data representation?

So for example, using PaliGemma
for embedding images and texts

into the same embedding space.

Versus using separate embedding
models for image and text.

Mór Kapronczay: that's something I'm,
I think the most interested in right

now is these is these multi modal
embedding spaces where you can shoot

in with some text and get back some
images and it's the same vector space.

So what, I think what's, what should
be noted, is that when you have

a, an embedding, embedding space
with concatenated vector parts

referring to the different aspects,
you have a similar vector space.

What's different is you
have more dimensions.

So it's a larger vector
space that is for sure.

So that has all these like
memory implications, but on the

other side, what it brings is
that you can weight the different

aspects of your data differently.

So, for example, let's
consider the same use case.

I'm searching with text
and want to get back images.

I can do this in both cases.

In the multi vector embedding case,
I would set the weights accordingly

that I don't have an image now, I
only have text, and would search in

the vector space this way.

So what you get for more dimensions
or more storage needed, what

you get is explainability.

So you can always explain what
part of the query triggered what

part of the vector that comes back.

So what's the contribution of each
data modality to the query result

and get the weighting option, right?

So you can easily even tune it based on
the results you get from explainability.

So I think that's the trade off here.

And yeah, everybody should decide
which is more important for them.

There's also maybe one more
aspect here: the first one is a

very, like, non-trivial task, right?

So you have to train a pretty complicated
multi headed encoder model, create

some very elaborate representations
while on the other case, you can just

take models off the shelf and do it

quite in an unsupervised way on your end.

So you don't need anything.

You just run it on your data.

Nicolay Gerold: How do you actually see the
different output sizes of the different

embedding models? Do you have any
experience with which sizes

are completely unnecessary? Because
nowadays, it seems like we're going up

to like 10,000, 20,000 for some models.

Mór Kapronczay: Yeah.

Yeah.

Yeah, I think it's interesting because
I think part of the reason these sizes

are going up is that we want general
purpose embeddings to perform extremely

well in very niche tasks as well.

And the way we're trying to achieve that
is, like, bloating up the embedding size.

So maybe a general purpose model will
be used for question answering and

machine translation and everything.

It will be good for you.

So I'm not in the position to judge
whether this adventure

will go well or not, but what I do
sometimes is I try to tinker with

embeddings by putting them into some
dimensionality reduction model to

see if there is some, like leftover
information compression that the makers

of the embedding model failed to do.

And I never, ever found any.

So like these spaces are pretty
tightly compressed and nicely created.

I would say that maybe what we need is
like more specialist models and like

appreciation for specialist models more.

Like this model is for search query
relevance and it's extremely good in that.

And it's 512 dimensions.

I think it makes sense if
you think about it, right?

Normally what's, what justifies
having that many dimensions is

having to reflect that many aspects.

And if you can use all your aspect
budget on a specific task, in theory,

you could more easily succeed.

And maybe different tasks
need different things.

It's quite obvious that semantic
similarity and like search query

relevance are not the same thing.

So I would want the general
embedding to put question to

question, meaning the
same thing, close to each other.

While I would want a search query
relevance specialized embedding

to have the answer close to
the question, right?

So different purposes might
need different models.

Nicolay Gerold: I think you guys are at a
very special position that you're working

with so many different embedding models.

I would actually love to know,

what are the libraries
and frameworks that you guys

are using, or you guys grab
the most, when it comes to embeddings?

Mór Kapronczay: Yeah.

So I have a love hate relationship
with sentence transformers.

I like it, but it's also sometimes
feels like a, like an academic project.

So like from version two to three, there
was some pretty major performance

downgrade, which you can hack yourself
around, but it's just not nice.

So there's this.

But also it's really handy.

It decides your poolings for you.

So I like it.

I like, I also like it, but in an ideal
world, we soon might not use it.

And in general, just working with
vectors and arrays, I find NumPy

obviously to be an awesome choice.

It is very fast.

It is very, it has very handy
functions to do all these.

All these vector operations that we
do and yeah, otherwise we are trying

to keep the code as plain as possible.

So we have some, like, old-school
architects that are

always asking the question: is
the use of this package justified?

Or should we pull in another, should
we depend on another not entirely

professionally maintained library?

Should we not?

So I think we are really trying
to keep it simple and also

keep it the fastest possible.

So for example, like using NumPy
was justified based on the fact

that NumPy is far superior to
operating with lists, for example.

Right now, what we are experimenting
with, I can probably tell, is putting

basically some PyTorch Lightning based

adapter model on top of our embeddings.

So if you have labels, you can, in
theory, fine tune your embeddings

with this adapter layer on top.

And then you could use that.
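
A rough sketch of that adapter idea in plain PyTorch rather than PyTorch Lightning, and not Superlinked's actual code: freeze the precomputed embeddings and train a small layer plus a task head on your labels.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    # A small trainable layer sitting on top of frozen, precomputed embeddings.
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return self.linear(x)

dim, n_classes = 384, 5
adapter = Adapter(dim)
head = nn.Linear(dim, n_classes)                  # downstream task, e.g. classification
opt = torch.optim.Adam(list(adapter.parameters()) + list(head.parameters()), lr=1e-3)

embeddings = torch.randn(256, dim)                # frozen embeddings from any model
labels = torch.randint(0, n_classes, (256,))      # your task labels

for _ in range(10):                               # tiny training loop for illustration
    opt.zero_grad()
    loss = nn.functional.cross_entropy(head(adapter(embeddings)), labels)
    loss.backward()
    opt.step()
```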

What is really strange, again, is
that when I experiment with only a

text embedding and I have some labels
on top and I will want to train an

adapter that would make the text
embeddings better for that specific

like classification problem or whatnot.

I have very rarely been able to
come up with a uniform, like weight

distribution that makes the embedding
better if it's general purpose.

So again, I feel that the guys
creating these embedding models are

doing a good job because you can't
just really train an adapter on top.

If you see Llama Index had this
example where they introduced the

adapter model, like on their own
example, they could create like half
a percent or 1 percent increase in

performance of the downstream model.

It's really hard to make these embeddings
better with these adapter models.

We are creating connectors
to vector databases.

We are churning them out as we speak.

We have Redis and Mongo ready.

We have a small in memory implementation
to play around with in the notebook.

And the others are created
based on client demand.

There is a way to influence that actually.

I think on our GitHub, you can vote on
a, on an issue if you want a specific

vector database connector to be developed.

So yeah I think your arguments are valid,
like it's not extremely efficient if you

have sparse vectors, that's true, but in
some cases, the main thing you

want from this application that has a VDB
behind it is to be like fast, like you

want someone loading the page to be fast.

If that doesn't happen in like at
most half a second, you lost the

person, like, attention span over.

Tab closed, done deal.

So yeah.

Sometimes you might want to battle
with some inefficiency if at the end

you would have like very low latencies
that are required in this case.

Nicolay Gerold: How do you actually
balance like being in production but

still doing the updates with new embedding
models or new parts of the embeddings?

Mór Kapronczay: yeah.

Like one, I think that's one of
the, one of the hottest questions

in like maintaining a real
time up to date search index.

I think even Amazon in their
managed solution promises to

be up to date in two hours.

So that's, I think, a considerable
amount if you think about, like, places

where in two hours the whole world
changes, like in Twitter, I don't know.

So yeah, I think our general approach
is have small approximate changes

when doing online, the online system.

So let's imagine you have a system
that's online, that's running,

and there is events coming in that
would alter the vector embeddings.

What you would do is.

Smaller approximate changes that are
easy to calculate and easy to do.

Here an important note is whether
the VDB will automatically

add a change to the index.

That's something that
Redis does instantly.

So you change some vector and
it's instantly part of the index.

Not all vector databases do that.

So that's something that you
have to be mindful of when deciding

what VDB to use under Superlinked.

But yeah, so that, that is one aspect.

And the other is: do regular heavy
lifting through some batch workload,

like PySpark or something like that, which
would do the precise heavy-lifting

calculations, switch the vectors, and
give the stage back to the

online system for doing the small,

daily, so like through the day,
intraday updates that happen to keep

everything up to date.

And that's, if it's just a text
embedding model, I think it's

not particularly complicated.

The model doesn't really
change through time.

But if you have a graph
embedding model, depending on

user interaction data with products,

now that gets more interesting.

Like with what frequency
would you retrain the model?

Cause, like, a new product comes in,

you can just re-embed it.

That's easy.

But if you would want to change the vector
of an existing item, that can be tricky

so you need a good strategy for it.

Nicolay Gerold: I think this is really
interesting. Especially, I didn't know

that Redis is updated instantly.
I think you always have to use a

combination of something like a Redis
cache plus your main storage engine,

so that you don't
really have a race condition.

Mór Kapronczay: Yeah.

Yeah.

Yeah.

And if you have infinite money and
infinite memory, you just use a very big

Nicolay Gerold: Nice.

What do you see in the future
with adding new modalities?

Mór Kapronczay: Yeah, I think, so I think
the piece that we need to find out here,

and I'm not really an expert in this, but
we have to create a good representation

for vector search for this data.

Because if you think about it, if we would
have a good representation for like audio

data or sensory data for vector search,
we could just create vectors and we could

observe like a previous malfunction.

Use the data before the malfunction,
search in the vector space, and if

you see a similarly behaving sensor,
we can do predictive maintenance.

So if we have these representations ready,
I think this kind of boils down to the

classical machine learning use cases.

We can do them through vector search,
I think, and, like, really

a lot more efficiently, with these
low latencies and easy pipelines of

just searching in the vector space
and returning the closest vectors.

Nicolay Gerold: What would you
say is missing from the space?

Mór Kapronczay: Yeah, I think
that's a, that's an important piece.

So like having a hugging face for
all kinds of data, if you think

about it can I just take from the
shelf a model that embeds like

sensory time series of temperature
measurements it's not that trivial.

If we had that for all data modalities,
I think building extremely interesting

and useful stuff would be a magnitude
easier, that's for sure.

And also there's cost, I think maybe
to a lesser extent than before.

But still running these big models is
expensive or calling APIs is expensive

if you have a considerable load.

So that could go down even more.

That would be nice.

Nicolay Gerold: And what's next
for you guys like what's on the

horizon what you can already teaser

Mór Kapronczay: Yeah.

I think in terms of the product, we are
working on some evals to, on the one hand,

give tools to people wanting to evaluate

different embeddings and
different setups on their system.

On the one hand, give tools to them,
and on the other, create benchmark

performances on well known datasets
to have a way to compare using just

the text embedding model to a more
detailed, more fine grained multimodal

vector space for the same problem.

That's definitely an
area of focus for us now.

And and I can't really say,
much about these, but I think

there are really interesting
partnerships cooking right now.

So you might expect us to find,
you might expect to find us in your

favorite tool, powering search for you.

Nicolay Gerold: Nice and if people
want to start building the stuff we

just talked about where should they go?

Mór Kapronczay: Yeah.

Interesting.

So what I can say is, I think
I benefited a lot from,

like, general-purpose machine learning.

Because I always say when you think
about normalizing vectors, it's

like when you standardize your
data for clustering, for example.

You do the same thing.

You start from the distance function
and you realize you would need certain

distributions of your data in order to
have a, like an equality of opportunity

between variables, so to speak.

And so this kind of, like, relationship,
this kind of understanding of

the space, I think, very much

comes from machine learning, or,
like, traditional machine learning studies.

So I would definitely start there.

And in general, I think I would
urge people who want to work with

embeddings to like, learn machine
learning and understand the difference

between training and test data,
like the bias variance trade off.

So yeah, I think that's
the most important thing.

And you know what makes it extremely
hard these days to learn this?

It's the extreme amount of noise.

Everybody claiming that this is now
the model that changes everything.

And probably it's not, and like most of
the cases they are not changing anything.

And you can easily get lost in
reading like 10 papers a day and still

know very little about the space.

This noise filtering, this is, I think...
like, having people sending out these

newsletters where they try to boil down
this information for people, I think

that's a very useful thing they do.

So I would probably follow some people
and subscribe to their newsletters.

One more thing I would like to mention
is that we are like babysitting an

open source project called VectorHub,
where people can upload their

pieces on vector search, and we have
them edited, have them published.

So that's something where
I can vouch for the content,

because I'm involved in the editing.

So that's something that's useful.

And also maybe one more thing I would
like to mention is the vector database

comparison table that we are also
babysitting as an open source project.

That's also useful if you look for a
vector database, but get lost in all

the non up to date documentations.

That you can see a lot of different
aspects of these databases and

compare them and choose the
one that you would like to use.

Nicolay Gerold: If people want
to follow along with you and

Superlinked, where can they go?

Mór Kapronczay: Definitely follow
our CEO and co founder, Daniel.

I think his page is like the pure
gold for following Superlinked.

And of course we are on all the
popular places like LinkedIn, Twitter.

And yeah, but in the first place I
would suggest you follow Daniel on

LinkedIn and check VectorHub regularly.

So, what can you take away?

What can you use

for your applications? I think,
first of all, you can embed

way more than you would actually think.

I think we've heard that also from
Aamir Shakir in the re-ranking

episode, where they played around
with adding metadata into the text

to actually embed the information,
like, for example, times and

locations, and use those in the
re-ranker, which worked well too.

So it makes sense that you can embed
different types of information.

I think also his point that with
sparse embeddings, the assumption

that the different categories are
orthogonal to each other

often doesn't hold.

I think this is very use case
specific, and you have to

figure it out on your use case,

so whether your set of categories
is completely unrelated or not.

And then you can decide,
like, how do you handle them?

And if they are not orthogonal to each
other, or not unrelated to each other,

you might be better off using something
else or one of the other techniques.

I think embedding time dimensions
and different numbers is probably the

thing that's most interesting to me.

I already mentioned in the episode
that we have a use case right now

where we basically have a set of
results we actually need in the end.

But we have a bunch of filter conditions,
and we basically have a recommendation

system, but with a fixed number of results
which should come out in the end,

which is a target, and we filter out.

It would be really nice, for me,
to have that in the embedding already,

so we can just take the top 100 results.

How our system is
set up at the moment:

we basically do the retrieval
and then apply, like, a post-filter

after the retrieval.

And this,

for one,

requires us to return more
results than we actually need.

But at the same time, it can happen
that the post-filter throws away

more results than we actually

want.

So we have to do another pass
through the retrieval, and then

deduplicate them basically, and
then do, basically, a new fusion,

because we then have to rank the two
types of, or two passes of, retrieval

anew,

which is very laborious.

Also, like,

the performance of embeddings
on numbers, I think this was

something to be expected:

since LLMs cannot really
handle numbers that well,

why would we expect
regular embeddings to

handle them well?

The explainability of
embeddings is something

where I agree: in some kind of way, you
have more explainability than if you

use a singular embedding, but also, you
don't really know where it comes from.

You know, that the high similarity
might be through the categories,

but why do you have high similarity?

That's still for you to figure out.

And especially if you have something
like image embeddings and text

embeddings, and you have a search, you
probably won't be able to figure that out.

But also, I think, like, adding or
combining multiple different embeddings

might be really interesting
for some systems,

especially, like, in product search, where
you actually have a visual component.

You can use images, but
you have a lot of metadata.

And I really want to try
that stuff out actually,

especially with embedding numbers,
locations, and stuff like that.

Yeah.

Let me know what you think.

I'm really curious what people think

about whether the approach would work.

I will try it out.

If you're interested in what my results
are for my search applications, just

give me a ping and I might do a follow-up,
so another video on that.

Otherwise also let me know
what you think of the episode.

If you want to hear more of
that: we have a bunch of

additional episodes on embeddings
coming out in the next two weeks.

Subscribe, give me a like,
and if you have any topics you

would love to hear more about,

as said, hit me up in the comments.

Shoot me a quick message on
LinkedIn, on Twitter, wherever.

And I will most likely respond.

Yeah.

I will see you next
week, and have a good one.