How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along, as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.
Nicolay Gerold: Uni-modal embeddings take some form of input, like text, images, or audio, and convert it into a vector. That's probably the kind you're used to the most and what most applications use nowadays. They, for example, only work text to text. Multimodal embeddings learn a joint embedding space, so you can actually search images with text. The assumption is that they can learn a richer representation of the world. If we take, for example, text-to-image embedding models like CLIP, they can learn concepts that are very hard, or even impossible, to put into words. But we can go even further: LiDAR, radar, ultrasonic. We could even add modalities we as humans cannot process. We can also add human behaviors, events, emotions, or humor and try to embed those.
But how do we actually train such models? I think only a few people have even fine-tuned multimodal embedding models, and even fewer have trained them from scratch, because that's mostly reserved for the big labs. And today we are trying to answer exactly that. We are continuing our series on embeddings and on search. So welcome back to How AI is Built. Today we are talking to Michael Guenther, a machine learning scientist at Jina AI working on multimodal embeddings. We will especially be looking at Jina CLIP, a model that broke most of the benchmarks for text-to-image embedding models when it came out. We will be talking about how they trained it, how they constructed the dataset, how you can construct a dataset for your specific use case, and how you can use it to fine-tune your own models. But we will also be talking about the applications: how it is already used in production and what use cases they have seen.
Michael Günther: Re-rankers are mostly used for text embeddings, right? They're not so common for CLIP-like models. But for text embedding models, for text retrieval models, it's really the case that on one hand re-rankers have much better accuracy. You put in the query and the document together, which gives the model a much better way to find a good retrieval score, and it's easier to adjust the relevancy based on the query. That also makes them work much better out of distribution, because embedding models generally work much worse on out-of-distribution tasks than re-rankers. The only thing that helps embeddings get better on out-of-distribution tasks is instruction-based embeddings, of course, but this doesn't get as close as re-rankers can on out-of-distribution tasks.
Nicolay Gerold: Yeah.
What do you mean exactly by
instruction based embeddings?
Michael Günther: So instruction-based embeddings are basically embeddings which are trained in a way that you add, for each text you want to embed, a prefix, like a prompt in a large language model, which describes the task it should be embedded for. For example, for retrieval, it of course makes a difference whether you have a question answering task, where the model should find the answer to the question, or a duplicate detection task, where the task is to embed a text so that it is similar to a duplicate. And then you also have typical benchmarks, for example for fact checking, where you have a fact given and you want to retrieve evidence for the fact, passages which say the fact is true, but you also want to retrieve things which are a complete negation of the fact, to basically falsify it. So those are all kinds of different objectives.
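To make the prefix idea concrete, here is a minimal sketch, assuming an E5-style model loaded through the sentence-transformers library. The model name and the "query:"/"passage:" prefixes follow E5's documented convention and are only an illustration of the technique, not the setup discussed in the episode.

```python
# Minimal sketch of instruction-based embedding: a task prefix is prepended
# to every text before encoding, telling the model which notion of
# similarity to optimize for. Model and prefixes follow E5's convention.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")

query = "query: how do I reset my router?"
docs = [
    "passage: Unplug the router for ten seconds, then plug it back in.",
    "passage: Routers forward packets between networks.",
]

q_emb = model.encode(query, normalize_embeddings=True)
d_emb = model.encode(docs, normalize_embeddings=True)
print(d_emb @ q_emb)  # cosine similarities; higher means more relevant for this task
```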
Nicolay Gerold: Do you think we will ever hit the universal embedding? So, one embedding model to rule them all?
Michael Günther: I think one thing is that, as I said, you have very different objectives. You would probably need a model where you can really precisely describe what you actually want the embedding model to do. And then another problem is, if you're in a company and you have very specific data and the embedding model hasn't seen anything like that before, it will probably also not be perfect. Even large language models are universal, but people still fine-tune them on domain-specific data if they want to employ them for a very specific use case in their company.
Nicolay Gerold: Yeah.
Maybe connected to that: I already see that especially smaller companies, and especially startups, are not doing the traditional AI or ML workflow where they think about the architecture, set up an entire model, train it on a task, and investigate it, but rather follow the paradigm of taking a pre-trained model and fine-tuning it on the task. Do you actually expect that engineers in the future will still train entire models from scratch? Or do you expect that for most tasks out there we will have a few pre-trained models, and it's just more efficient, in terms of performance and also compute, to take them and fine-tune them?
Michael Günther: Yeah, I think in general, pre-training allows a model to learn concepts which are very general and can probably be transferred to many tasks. So there are lots of applications where pre-training actually helps. Also for embedding models, you can definitely see that if you fine-tune a pre-trained embedding model, which is already contrastively trained to produce good embeddings, it will perform much better than if you try to train an embedding model from scratch, even from a pre-trained transformer model, if your data collection isn't massive. However, there are of course still applications where, for example, your text is not really natural language, and then taking a pre-trained model probably doesn't help you in any sense. And there are also some applications where you could use an existing pre-trained big model and fine-tune it, but you need more efficiency. So maybe it's better to train a very small, I don't know, random forest classifier or whatever. Even for embedding models, there are probably some techniques which are much more efficient. Maybe for your application it makes more sense to have a very small model you can run on a CPU and do lots of inferences in a very short time.
Nicolay Gerold: Yep.
Do you still, in your research,
try out different architectures?
For example, LSTMs, RNNs, or CNNs
for generating embeddings as well,
and training them on a specific task?
Yeah.
Michael Günther: We're doing relatively applied research, I would say. It's always risky, and in most cases you're not very successful if you try to invent everything from scratch. So what we usually do is try to see what is already working and make smaller modifications where we think we have a relatively high probability of succeeding. I think if you're doing research in academia, it's a bit easier to try out something completely new.
Nicolay Gerold: Yeah. What would you say, on a conceptual level, is the main difference between multimodal and single-modality embedding models?
Michael Günther: Yeah, the obvious difference is of course that a single-modality model can only accept one type of modality as input, and the multimodal model can accept multiple types. Since the input modalities are fundamentally different to process, images fundamentally differently than text for example, most of those models have some component that transforms each modality into an input you can give to the network. Vision transformers, for example, take the image and split it into multiple small image patches which they can then process, while text models have a tokenizer which tokenizes the text sequence and then maps each of the tokens to a specific input embedding representation. So those are different techniques, and then you usually have an encoder component for each modality. However, how the inputs are combined by the model can be very different depending on the type of multimodal model.
Nicolay Gerold: Can you
double click on that?
What are the different ways you could
combine the different modalities?
Michael Günther: For text-image models there are, I would say, three rough types. You can probably find more, but these are the ones you most commonly see. One is the CLIP-like models, where you basically have a vision transformer model and a text transformer model. The text transformer model is responsible for encoding the text and the vision transformer model is responsible for encoding the image. Both have some projection layer that maps the output into a shared vector space, basically mapping it into embeddings of the same dimensionality. Those are two independent towers, which are however trained simultaneously to map an image and a related text value onto similar embedding representations.
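As a rough sketch of what such a two-tower setup looks like in code, here is a minimal PyTorch version. The backbones, dimensions, and module names are assumptions for illustration, not Jina CLIP's actual architecture.

```python
# Minimal sketch of a CLIP-like dual encoder with a shared embedding space.
# The vision/text backbones are assumed to return pooled vectors; all
# dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, vision_backbone, text_backbone,
                 vis_dim=768, txt_dim=768, embed_dim=512):
        super().__init__()
        self.vision = vision_backbone              # e.g. a ViT returning [batch, vis_dim]
        self.text = text_backbone                  # e.g. a BERT-style encoder returning [batch, txt_dim]
        self.vis_proj = nn.Linear(vis_dim, embed_dim)  # projection into the shared space
        self.txt_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, images, input_ids, attention_mask):
        img = F.normalize(self.vis_proj(self.vision(images)), dim=-1)
        txt = F.normalize(self.txt_proj(self.text(input_ids, attention_mask)), dim=-1)
        # cosine similarity matrix between all images and all texts in the batch
        return img @ txt.T
```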
The second way is vision language models, for example the PaliGemma model, which has also been used for the recent ColPali model. There you also first have a vision transformer, which, as I said, usually splits the images into image patches and processes them as tokens like a transformer model for text, and at the end it generates one patch embedding per patch. Those patch embeddings are then used by another transformer model, which works more like a text transformer model: it gets all the embeddings of the image patches as a prefix, then some prefix text, and can then generate further text based on it. What the ColPali authors basically invented is to take such a model, feed in the embeddings of the image patches, pass them through the transformer model, and get multiple embeddings out, one for each of the vision patches. Then you have a ColBERT-type retrieval matching, where the query is also tokenized into embeddings, you match each of these token embeddings against each of the image patch embeddings, determine the maximum similarity, and average these maximum similarities to get a final similarity score. This is specifically good for more complex document matching. Okay, this was a bit long about the vision language models.
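The ColBERT-style late-interaction scoring he just described is easy to write down. A minimal sketch, assuming you already have per-token query embeddings and per-patch document embeddings as normalized tensors:

```python
# Late-interaction (ColBERT / ColPali style) scoring sketch.
# q_emb: [num_query_tokens, dim], p_emb: [num_image_patches, dim], both L2-normalized.
import torch

def maxsim_score(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    sim = q_emb @ p_emb.T                   # [num_query_tokens, num_image_patches] cosine similarities
    max_per_token = sim.max(dim=1).values   # best-matching patch for each query token
    # average as described in the episode; some implementations sum instead,
    # which only rescales the score
    return max_per_token.mean()
```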
I think the third type is that you just have some kind of vision encoder and a text encoder, which produce embeddings for image patches and for tokens, and then you have another transformer component which gets both as input and produces an output. An interesting model that recently came out and uses this architecture is the MagicLens model from Google. What they basically wanted is a model for image-to-image matching. But if you do image-to-image matching, it's often not so clearly defined what the objective is. Just as instructions for text embedding models are used to define a retrieval objective, they wanted this for images: you have an image and you say, I don't just want an image which is somehow similar, I want something which shows the same thing as a sketch, or something like that. Then you can take these instructions, add the image, and encode both into a joint representation.
Nicolay Gerold: Yeah. Especially from an applied side, how do you actually decide which of these three types might be most suitable? Are there some heuristics or guides from your experience, or do you in the end always have to test the different options?
Michael Günther: Yeah, one thing you can definitely do, if the paper or some documentation about a model is available, is look at what kind of benchmarks it is evaluated on. Then you can look at which benchmark is most similar to your problem and take a model which performs well on this benchmark. That's a general recommendation. If you don't have a benchmark which is somehow similar to your problem, then of course you can also take a look at the architecture and say: okay, if I just have a generic text-to-image search problem, maybe I just use a normal CLIP model, but if I have something more complex, then maybe I use a model with a different architecture which can better model this problem.
Nicolay Gerold: When I have a retrieval task, what is the trade-off in terms of performance between using multiple single-modality embedding models, running multiple retrieval systems at the same time and then re-ranking, versus using one joint model with a joint embedding space for retrieval? Are there any research papers that come to mind that actually try to compare those?
Michael Günther: I don't know, I haven't seen too many research papers comparing this, or at least I don't have any in mind. But the case you're asking about is: you have a text query and you want to search for some item which has an image and a text property, right?
Nicolay Gerold: Yeah,
Michael Günther: Yeah, something like this. So what you can do is either use a text embedding model plus a text-to-image embedding model: you use the text embedding to retrieve items by just their text, use a CLIP-like model to retrieve images that are similar to the text query, and then somehow combine the results. Or you can work with just your CLIP-like model. For example, you can use the text encoder and the image encoder of the CLIP model to produce two embeddings, one for the image part and one for the text part of an item, and average them. That can also work decently well, especially with the Jina CLIP model, which we have trained to also produce high-quality text embeddings and which can also be used for text-to-text retrieval. On some product search tests we have seen that using the averaged embeddings actually works quite well. However, one common problem is that you might have some items which only have a text property or only an image property, or one of them is very low quality and you cannot use it. Then one problem you observe is that the scale of the similarity values you retrieve for an item which has only a text or only an image is very different from the scale you get for an averaged embedding. So in those cases, I think sometimes using two embeddings which each capture a single modality is probably the better choice.
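A minimal sketch of the averaging approach, using sentence-transformers' generic CLIP wrapper purely for illustration; the model name, file paths, and re-normalization step are assumptions, and the same idea applies to any jointly trained text-image model such as Jina CLIP.

```python
# Sketch: represent a product by the average of its text embedding and its
# image embedding from the same CLIP-like model, then search with a text query.
from PIL import Image
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # illustrative CLIP-style model

def embed_item(description: str, image_path: str) -> np.ndarray:
    txt = model.encode(description, normalize_embeddings=True)
    img = model.encode(Image.open(image_path), normalize_embeddings=True)
    vec = (txt + img) / 2.0
    return vec / np.linalg.norm(vec)          # re-normalize the averaged vector

query_vec = model.encode("red running shoes", normalize_embeddings=True)
item_vec = embed_item("Lightweight red running shoe", "shoe.jpg")  # hypothetical file
print(float(query_vec @ item_vec))            # cosine similarity of query vs. averaged item
```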
Nicolay Gerold: Yeah. And especially when I'm looking at new modalities, I think the ImageBind paper, for example, included depth as a separate modality. The first question for me would be: what makes a modality? And then, based on the modalities I include, how can I actually figure out how to best combine the modalities I'm adding to the model?
Michael Günther: Yeah, I think what makes up a modality is maybe already a bit of a philosophical question, I don't know. If you want to add a new modality to a model, I think one simple thing you could do is, if you already have a model which can translate one modality into another, use that first. For example, if you have an audio sequence, you can just use something like Whisper to convert it into text and then use a text transformer. Maybe this will already work, right? It might not work if you have audio which is not transcribable speech, for example. But that could be a simple option I would sometimes try first, before training your own new model. Another thing is that if you want to train a new modality into a model, one very difficult aspect is always to find sufficiently good training data you can actually use to train these models. You probably need datasets which relate the two modalities in a way that captures the notion of similarity you want to capture, and you need that kind of data in a sufficiently big amount. And then, if you design a model, you have to come up with a good method to encode this data into a representation which can be processed efficiently by neural networks, and so on.
Nicolay Gerold: So basically the first challenge would be to figure out a representation, or train a model to represent, the new modality. The second would be to figure out a way, or create a dataset, that can bring the new modality and the existing modalities into a joint embedding space, so I can actually calculate the similarity between them. Can you maybe, using the example of Jina CLIP, walk us through how the process actually looks for creating a dataset for training a multimodal model, and what the different fields in such a dataset are?
Michael Günther: Yeah, for training Jina CLIP we of course utilized a lot of already existing datasets. There are the LAION-400M and LAION-5B datasets, which are widely used to train CLIP models, and we just reused this data. Then, since our focus was to create a model which is also better at text-to-text search, we relied on our collection of text training data from our Jina text embedding models. Collecting those datasets was a relatively long process. Basically, we tried to get a very diverse set of text pairs from different domains so that the model becomes very general. We also did a lot of filtering. For example, we filter out duplicate text pairs, and we filter out things which are not in the target language: if you want to train an English model, you don't want data from other languages. And we also did some more complex filtering methods like consistency filtering, where you take an existing embedding model and try to filter out text pairs where the texts are unrelated. It's similar for the LAION dataset. To construct it, they also used an existing CLIP model. LAION consists of images and their captions, so that you can train a CLIP model to encode a caption and its image into similar embedding representations, and to encode the captions of different images into dissimilar representations. They used a CLIP model to check whether the text was actually related to the image, and if the similarity is below a threshold, they filter the pair out. That's the kind of filtering they do.
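A minimal sketch of that kind of similarity-threshold filtering, assuming a generic CLIP-style scorer; the threshold value, model choice, and helper names are made up for illustration and are not the ones used for Jina CLIP or LAION.

```python
# Sketch of CLIP-score / consistency filtering: drop pairs whose two sides an
# existing model does not consider related.
from sentence_transformers import SentenceTransformer, util

scorer = SentenceTransformer("clip-ViT-B-32")  # illustrative scoring model
THRESHOLD = 0.28                               # hypothetical cut-off in the spirit of LAION's filtering

def keep_pair(caption: str, image) -> bool:
    cap_emb = scorer.encode(caption, normalize_embeddings=True)
    img_emb = scorer.encode(image, normalize_embeddings=True)   # image is a PIL.Image
    return float(util.cos_sim(cap_emb, img_emb)) >= THRESHOLD

# filtered = [(cap, img) for cap, img in pairs if keep_pair(cap, img)]
```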
Yeah, and for the Jina CLIP model, should I already get into the training, or...
Nicolay Gerold: Yeah.
Michael Günther: Okay. So maybe a short background story on the training. What we first tried was to take the encoder of our Jina text embedding models and adjust existing vision transformers by continued training, so that they encode images into similar embeddings as the text, based on an image captioning dataset. We wanted to keep the text encoder frozen, so that we don't change anything on the text encoder and the model stays compatible with our existing Jina text embedding models. However, it turned out this was not working at all, and we later found out there are actually some papers which show that if you freeze the text encoder, you cannot efficiently train a CLIP model. So then we tried to train the model without freezing anything, training the text encoder part as well, and we thought: okay, we utilize a kind of multi-task training method. Train the model on pairs of texts and on pairs of texts and images together, so training on both objectives at the same time. This was working quite well, but we also realized that the model did not have as good text encoding capabilities. We found out this was because the image captions you usually train such a model on are very short, so the model cannot learn too much from them. So we needed a dataset with longer captions and more descriptive descriptions of the actual images. We then found an existing dataset with synthetic descriptions of images, which are partly generated by a large language model and partly by an image captioning model trained on such synthetic data. That was basically how we collected the training data.
Nicolay Gerold: And did you also
compare it to a triplet loss where you
actually also have negative examples
for the images or the text in the end?
Michael Günther: Yeah, good question. Text embedding models are usually trained in two stages. After the normal pre-training of the text transformer model, you usually train on text pairs, which you can easily get in very large amounts, one billion pairs or something like that. You train on them in a contrastive way, so the model is trained to encode the two text values which are part of a pair to be similar, and from the other pairs inside the batch you take all the right-hand sides which are not part of your pair and say the similarity score should be low. Usually after that you train the model on smaller, high-quality datasets where you actually have, for example for retrieval, queries with their specific relevant documents, and also some annotated documents which are not relevant to this text value. Those you can use as additional negatives in your loss. If your high-quality dataset doesn't have annotated negatives, you can also mine them, for example by taking an existing embedding model, looking at what it returns as relevant text values, and keeping those which might not actually be relevant because a re-ranker model ranks them lower. When doing the training of our Jina CLIP model, we also wanted to do the same to improve the text retrieval capabilities of the model. So we basically ended up with three stages. In the first stage, we train our text model on image captions and images on one hand and on text pairs on the other hand. In the second stage, we continue training on these text pairs, but include longer image captions from another dataset plus the actual images. And in the third stage, we continue training on long captions and images, and for the text part we then train the model on our triplet datasets where we have these additional hard negatives.
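As a rough illustration of the in-batch-negatives objective described above, here is a minimal sketch of an InfoNCE-style contrastive loss. The temperature and tensor shapes are illustrative, not Jina's actual training configuration; hard negatives could be appended as extra rows of the right-hand side.

```python
# Sketch of a contrastive loss with in-batch negatives (InfoNCE style), as
# commonly used for the pair-training stage described above.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(left_emb, right_emb, temperature=0.05):
    # left_emb, right_emb: [batch, dim]; row i of each side forms a positive pair
    left = F.normalize(left_emb, dim=-1)
    right = F.normalize(right_emb, dim=-1)
    logits = left @ right.T / temperature                 # [batch, batch] similarity matrix
    targets = torch.arange(left.size(0), device=left.device)
    # every other right-hand side in the batch acts as a negative
    return F.cross_entropy(logits, targets)
```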
Nicolay Gerold: Yeah. And especially on the evaluation side: when you're looking at a paper, at a new model or an existing model, what do you look at first when you try to assess it? What are the different things you look at, and which evaluations and methodologies do you use to actually assess the quality?
Michael Günther: Yeah, of course you look at the benchmarks and how it compares to other models, but what you should also take a look at is how the models are trained. Is the training data related to the benchmark task? The performance of the model might come from the fact that they used training data which is specific to this task, so they might have scored better on the specific benchmark because of that. And if you actually want to use a model, it of course also makes sense to choose one where the training data has been similar to what you want to use it for. Yeah.
Nicolay Gerold: And especially
looking at the training process,
why didn't you fine tune the image
embedder as well with pairs of images?
Michael Günther: I think our model is basically not specifically trained for image-to-image search, I would say.
Nicolay Gerold: Yeah.
Michael Günther: Okay. Otherwise we maybe would have done this, but yeah, maybe it would be an interesting idea to also do it and see if you can then get a model which is better for image search.
Nicolay Gerold: Yeah.
So it likely would be the case that if I create an additional dataset and continue the pre-training of the image embedder, I would get better performance on image-to-image search as well.
Michael Günther: Yeah, it could be.
Nicolay Gerold: Could it be that the different types of searches interfere with each other? So when I'm trying to specialize the model on text-to-image search, I'm better off doing your process, and when I try to specialize on image-to-image search, I'm better off doing the other process, and they interfere with each other if I try to do both?
Michael Günther: Yeah, I think in general, if you fine-tune a model, it's very likely that it loses some of its generalization properties; it might then generalize less well to other tasks, of course. That's something you probably have to be aware of. So I guess, yeah, if you then train the model only on image-to-image pairs, it might lose its text-to-image capabilities.
Nicolay Gerold: Yeah, and we're looking more at how to take your model and adapt it to a new task. So if I, for example, had an e-commerce setting and I wanted to adapt the model to be more suited to the query types you often have on e-commerce sites, which are usually very short queries with only a few words, how would you actually go about creating a dataset to fine-tune Jina CLIP for that specific task?
Michael Günther: Yeah. In the ideal case, if you are an e-commerce company, you maybe have some click data, some query log or something you could use. Then you basically try to relate the queries directly to the descriptions of the images and the descriptions of the products you have. If you don't have this, or not in large amounts, or you don't want to use it because it's customer data and you don't want to train on it, then what you can also do is take a specific model which can generate queries for your products. For example, if you have good product descriptions, maybe you can just use a large language model to produce some queries about these documents, and then you can use this as your dataset. There's also a method called generative pseudo-labeling. They actually did this previously with a very simple model which is trained to generate questions, and they used it only for text retrieval purposes: they take the document collection, construct questions about these documents, and then they could show that when fine-tuning on this data, the model performs much better on the existing document collection. So I think you can probably do this for your e-commerce application as well.
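A minimal sketch of that synthetic-query idea, assuming an OpenAI-compatible chat endpoint is available; the model name, prompt wording, and output parsing are assumptions for illustration only, not the generative pseudo-labeling setup from the paper.

```python
# Sketch: generate synthetic e-commerce queries from product descriptions,
# in the spirit of generative pseudo-labeling. Each (query, description, image)
# triple then becomes a positive training pair for fine-tuning.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def queries_for(description: str, n: int = 3) -> list[str]:
    prompt = (
        f"Write {n} short e-commerce search queries (2-4 words each), one per line, "
        f"that a shopper might type to find this product:\n\n{description}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [q.strip("- ").strip() for q in lines if q.strip()]
```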
Nicolay Gerold: Yeah, I'm pulling up the research paper right now. So what it basically does is generate synthetic queries for the passages in the target domain. I think that's really interesting, especially if you also add the paper on getting one billion personas from the internet: you basically create user personas and then use those on top to create different types of queries. This could be really interesting and also speed up the dataset creation greatly. Nice. And in this case, would you actually only fine-tune the text embedder, or would you fine-tune the entire model?
Michael Günther: That's a good question. If you use the model for text-to-text and text-to-image retrieval together, I can imagine that having a training process which resembles the last stage of our training more closely probably leads to better results than just training the text encoder, because then the model might lose a bit of its connection to the image modality.
Nicolay Gerold: Yeah.
And also it would save you a lot of work. Nice. What are, for Jina CLIP, some of the interesting or even weird use cases you have seen so far, or that you have tested internally?
Michael Günther: Yeah. Since the model is relatively new, we haven't talked too much with users yet, but one application that comes up very often is e-commerce product recommendation and search, because this is a domain where you have a lot of data on the internet about products. Accordingly, the model is trained on similar data, and therefore it performs well, probably much better than on a domain where you cannot find much data on the internet. And as I said, I did a small experiment where I tried out this averaging: I averaged the embedding of the description of a product and the embedding of the image of the product, and the resulting representation was actually quite good. It's interesting that this works so well. However, as I said, it also has some limitations, especially if some products are missing a description or an image.
Nicolay Gerold: Yeah.
What would you say other engineers can actually take away, because most of them won't train their own embedding model or go through such a large-scale training process? What are the main learnings you would give to engineers who don't go through the massive experience you have?
Michael Günther: I think one thing is definitely to be aware of the scale of the similarity values you get. One common misconception, which we also saw with the text models, is that the similarity values might not reflect the typical notion of similarity you have in mind. For example, many text embedding models always produce very high similarity values, even when the two text values are not similar at all. But if you compare the similarity values to other similarity values of texts you have embedded together, then it makes more sense. It's also important to know that the similarity values between embeddings of images, between embeddings of texts, and between images and texts are on very different scales. So that's important to consider.
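To see the scale issue concretely, here is a minimal sketch, again using sentence-transformers' CLIP wrapper and hypothetical image files purely for illustration: it compares text-text and text-image similarities for the same query, which typically land in different bands.

```python
# Sketch: raw cosine similarities from text-text and text-image comparisons
# live on different scales, so rank within each score type (or normalize per
# query) rather than applying one absolute threshold across modalities.
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")

query = model.encode("a red running shoe", normalize_embeddings=True)
texts = model.encode(
    ["Lightweight red running shoe", "Cast iron frying pan"],
    normalize_embeddings=True,
)
images = model.encode(
    [Image.open("shoe.jpg"), Image.open("pan.jpg")],  # hypothetical files
    normalize_embeddings=True,
)

print("text-text:", texts @ query)    # typically a high, narrow band of values
print("text-image:", images @ query)  # typically a lower, different band
```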
Another thing: whenever you try to optimize some model, it's definitely important to have some way to quantitatively evaluate how good the model is before your optimization and afterwards. That's something you should definitely set up before you put a lot of effort into optimizing something.
And if you consider fine-tuning, some general hints: for training embedding models, and also for fine-tuning them, large batch sizes are actually important. In order to get large batch sizes, you also have to enable some performance optimizations. For example, our Jina CLIP model has flash attention support, so you can use this to reduce the memory the model consumes during training on the GPU, and you can also use activation checkpointing, which saves you a lot of memory.
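A minimal sketch of those memory levers for squeezing larger batches onto one GPU when fine-tuning. The model id, the flash-attention flag, and whether this particular checkpoint supports them in your transformers version are assumptions to verify against the model card; treat this as a sketch, not documented usage.

```python
# Sketch: reduce training memory so larger batch sizes fit on the GPU.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-clip-v1",                    # hypothetical checkpoint choice
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,               # mixed precision also cuts activation memory
    attn_implementation="flash_attention_2",  # only if supported by checkpoint and hardware
)
model.gradient_checkpointing_enable()  # activation checkpointing: recompute activations
                                       # in the backward pass instead of storing them
```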
Then maybe one more thing: be aware of the limitations of the model. For example, if you use a vision model or a multimodal model which takes images, you have to be aware of the resolution the model works at. The model scales images down, and if your images are very big and you want to recognize something which is very small in the images, the model might not be able to solve your task because of this downscaling. In a similar way, if you have very long texts but you truncate them into very small chunks before you encode them, then maybe the information you want to search for is not encoded at all. So those are maybe some hints.
Nicolay Gerold: On the scale of similarities: is there a type of distribution you're typically looking for? Are you looking for a normal distribution, or rather a power law skewed toward one of the extremes?
Michael Günther: Yeah, it varies. One thing you notice is that many embedding models are trained in a way that they map all the text values onto a cone in the vector space, and if you then calculate the cosine similarity, they all have a very high cosine similarity. It seems like for some tasks the models perform better if the embeddings are all on this cone: it's easier for the model to model a certain type of similarity, to model some contrastive objective. But yeah, some people are of course a bit confused that all the similarity values are above 0.5 or so if you take an E5 model, for example.
Nicolay Gerold: In the end, that makes thresholds less usable when you have a distribution that is biased toward the higher side. Nice. What would you actually say is missing from the space?
Michael Günther: Yeah. I think one thing that might be missing, especially for multimodal tasks and these complex retrieval tasks, is that there are not so many benchmarks you can actually use. For example, we tried to find benchmarks for a search task where you have a text query and products consisting of text and images. You find some things in the e-commerce space which you can more or less use as benchmarks, like a dataset which contains products from Amazon with very fine-grained category descriptions you can maybe use as queries. But in general, I think it would be very nice if there were a few more benchmarks, and also if people who use these models and notice that something is not working perfectly would try to create small datasets others can evaluate on, so researchers can try to find models which work well on these tasks. I think that would be something very useful. And also, especially in the last months, a lot of vision language models and embedding models which support multiple modalities have come out. It would be very interesting to see what kind of applications you can solve with all these models.
Nicolay Gerold: Nice.
What's next, especially for Jina? What's on the horizon that you can already tease, or what are you working on?
Michael Günther: We are actually working on a new version of our Jina ColBERT model which supports more languages. At the moment the model only supports English, which is of course a big limitation, so we are trying to change that. Our team is also working on a new version of our text embedding model. If you look at the leaderboards, you can see that the Jina text embeddings have meanwhile fallen behind all the other models, so we will probably soon publish a new model which is also multilingual and has better performance. And then what we probably want to focus on further is more complex multimodal search applications and building models for those, maybe also taking a look at ColPali and its capabilities and limitations, seeing what this model is maybe not so good at and what we can contribute.
Nicolay Gerold: What is one underappreciated or niche technology that you think deserves more attention?
Michael Günther: One thing where quite some research has been done, but which I haven't seen become very popular or put into products by many people, is encoding more semi-structured data, data with a bit more structure. So tables, and maybe also models which can encode multimodal data that comes in some structure. I think that would be an interesting direction. And maybe also interesting: some people say they want to use CLIP models for tasks where the images are not typical photos but, for example, plans or images which represent some other type of image data. I think it would also be interesting to see how CLIP models actually perform on these different kinds of images and what can be done to improve the performance on those domains.
Nicolay Gerold: Nice.
And if people want to follow you, follow
along with Jina, where can they do that?
Michael Günther: Yeah, of course we have a blog on our website where we regularly publish articles, probably interesting articles about new things in the space of search with AI models. All our models are published on our Hugging Face page. We of course also have our search foundation products, different API-based products to build your search applications. And me and several of my colleagues are active on Twitter and LinkedIn and post something from time to time.
Nicolay Gerold: So what can we take away? First of all, I think his thoughts on the universal embedding model are spot on. I also think we are unlikely to get a universal embedding model in the sense of high performance everywhere. I think we will have a universal embedding model which is good enough for a lot of tasks, which some people might argue we already have with, for example, OpenAI embeddings. But I would say it isn't really universal, because if you try it for search it works quite well, but if you go beyond that, for example to anomaly detection or recommendations, it won't work well. For a model to qualify as universal, I would want it to have a baseline of performance that is good enough for many different use cases, so I can basically build a prototype around it. But universal embedding models that outperform, or even perform on par with, the state of the art: I don't expect we will see that anytime soon. The objectives and domain-specific needs are so different that you will always be better off training your own model and constructing your own datasets, and also trying to get more efficient. Performance is one thing in terms of, for example, accuracy, but performance in terms of speed or throughput is another, and universal models will most likely be very large, which makes them unsuitable for a lot of applications.
Then, the three main types of text-image models. CLIP-like models, with two different encoders which are joined at the end into a joint embedding space. Vision language models, which basically prefix an image before they start to generate text. And then hybrid models. I'm not exactly sure what the clear definition of a hybrid model is; I think he mentioned the MagicLens model as a good example, so I have to read into it, but they basically allow for more complex interactions between the modalities. I think it's basically separate encoders, but with additional transformer components integrated. But yeah, the three different types are very interesting.
And also his guidance on which embedding model to use: look into the paper, look into what benchmarks they have run, but maybe also look into what the lab is researching, especially if it's an industrial lab. You might find an industrial lab which put out a benchmark and which is working in an area, or on a use case, that's actually very close to yours. So try to find an embedding model which performs well on a benchmark that is similar to your task, or from an industry lab which is working in a space that's very similar to yours. And also evaluate the architecture based on the task complexity.
The most interesting insight I would highlight is that freezing the text encoder can hinder CLIP model training. I would love to know why, but if you go into training a CLIP model, this is something to remember. He also highlighted that the original CLIP training data used very short captions, which shows how important it is to look into the research paper and also into the datasets they used, and whether they actually mirror the data you will be seeing. Because if you have very long captions or descriptions by users, you're unlikely to perform well with a CLIP model out of the box: the data is so different that it's basically out of domain. And another one is large batch sizes: you should train embedding models with a large batch size.
Yes, I think that's it from me for the main takeaways. I found it really interesting. If you have any questions on multimodal embedding models, or would like to see some fine-tuning scripts, I'm working on a few right now, just let me know in the comments or give me a ping on LinkedIn. Otherwise, we will be continuing with more on embeddings next week. So subscribe, like.