How AI Is Built

Today we are talking to Michael Günther, a senior machine learning scientist at Jina AI, about his work on Jina CLIP.
Some key points:
  • Uni-modal embeddings convert a single type of input (text, images, audio) into vectors
  • Multimodal embeddings learn a joint embedding space that can handle multiple types of input, enabling cross-modal search (e.g., searching images with text)
  • Multimodal models can potentially learn richer representations of the world, including concepts that are difficult or impossible to put into words
Types of Text-Image Models
  1. CLIP-like Models
    • Separate vision and text transformer models
    • Each tower maps inputs to a shared vector space
    • Optimized for efficient retrieval
  2. Vision-Language Models
    • Process image patches as tokens
    • Use transformer architecture to combine image and text information
    • Better suited for complex document matching
  3. Hybrid Models
    • Combine separate encoders with additional transformer components
    • Allow for more complex interactions between modalities
    • Example: Google's Magic Lens model
Training Insights from Jina CLIP
  1. Key Learnings
    • Freezing the text encoder during training can significantly hinder performance
    • Short image captions limit the model's ability to learn rich text representations
    • Large batch sizes are crucial for training embedding models effectively
  2. Training Process
    • Three-stage training approach: 
      • Stage 1: Training on image captions and text pairs
      • Stage 2: Adding longer image captions
      • Stage 3: Including triplet data with hard negatives
Practical Considerations
  • Similarity Scales
    • Different modalities can produce different similarity value scales
    • Important to consider when combining multiple embedding types
    • Can affect threshold-based filtering
  • Model Selection
    • Evaluate models based on relevant benchmarks
    • Consider the domain similarity between training data and intended use case
    • Assess computational requirements and efficiency needs
Future Directions
  1. Areas for Development
    • More comprehensive benchmarks for multimodal tasks
    • Better support for semi-structured data
    • Improved handling of non-photographic images
  2. Upcoming Developments at Jina AI
    • Multilingual support for Jina ColBERT
    • New version of text embedding models
    • Focus on complex multimodal search applications
Practical Applications
  • E-commerce
    • Product search and recommendations
    • Combined text-image embeddings for better results
    • Synthetic data generation for fine-tuning
  • Fine-tuning Strategies
    • Using click data and query logs
    • Generative pseudo-labeling for creating training data
    • Domain-specific adaptations
Key Takeaways for Engineers
  1. Be aware of similarity value scales and their implications
  2. Establish quantitative evaluation metrics before optimization
  3. Consider model limitations (e.g., image resolution, text length)
  4. Use performance optimizations like flash attention and activation checkpointing
  5. Universal embedding models might not be optimal for specific use cases
Michael Günther
Nicolay Gerold:
00:00 Introduction to Uni-modal and Multimodal Embeddings
00:16 Exploring Multimodal Embeddings and Their Applications
01:06 Training Multimodal Embedding Models
02:21 Challenges and Solutions in Embedding Models
07:29 Advanced Techniques and Future Directions
29:19 Understanding Model Interference in Search Specialization
30:17 Fine-Tuning Jina CLIP for E-Commerce
32:18 Synthetic Data Generation and Pseudo-Labeling
33:36 Challenges and Learnings in Embedding Models
40:52 Future Directions and Takeaways

What is How AI Is Built?

How AI is Built dives into the different building blocks necessary to develop AI applications: how they work, how you can get started, and how you can master them. Build on the breakthroughs of others. Follow along as Nicolay learns from the best data engineers, ML engineers, solution architects, and tech founders.

Nicolay Gerold: Uni-modal embeddings take some form of input, like text, images, or audio, and convert it into a vector. That is probably the kind you're most used to and what most applications use nowadays; they often only work text to text. Multimodal embeddings learn a joint embedding space, so you can actually search images with text. The assumption is that they can learn a richer representation of the world. Text-to-image embedding models like CLIP, for example, can learn concepts that are very hard or even impossible to put into words. But we can go even further: LIDAR, radar, ultrasonic. We could even add modalities we as humans cannot process. We can also add human behaviors, events, emotions, or humor and try to embed those.

But how do we actually train such models? I think only a few people have even fine-tuned multimodal embedding models, and fewer have trained them from scratch, because that's mostly reserved for the big labs. Today we are trying to answer exactly that, as we continue our series on embeddings and on search.

So welcome back to How AI Is Built. Today we are talking to Michael Günther, who is a machine learning scientist at Jina AI working on multimodal embeddings. We will especially be looking at Jina CLIP, a model that broke most of the benchmarks for text-to-image embedding models when it came out. We will be talking about how they trained it, how they constructed the dataset, how you can construct a dataset for your specific use case, and how you can use it to fine-tune your own models. But we will also be talking about the applications: how it is already used in production and what use cases they have seen.

Michael Günther: Re-rankers are mostly used for text embeddings, right? They are not so common for CLIP-like models. But for text embedding models, for text retrieval models, it really is the case that, on the one hand, re-rankers have much better accuracy. You basically put in the query and the document together, which gives the model a much better way to find a good retrieval score, and it is easier to adjust the relevancy based on the query. That also makes them work much better out of distribution, because embedding models generally work much worse on out-of-distribution tasks than re-rankers. The only thing that helps embeddings get better on out-of-distribution tasks is instruction-based embeddings, of course, but this doesn't get as close to what re-rankers can do out of distribution.
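
As a rough sketch of the retrieve-then-rerank pattern described here: a bi-encoder retrieves candidates from embeddings computed independently, and a cross-encoder then scores query and document together. The model ids and the sentence-transformers usage are assumptions for illustration, not something specified in the episode.

# Sketch only: assumes sentence-transformers is installed and these public model ids are available.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "how do I fine-tune a multimodal embedding model?"
docs = [
    "Jina CLIP is a text-image embedding model trained in three stages.",
    "Activation checkpointing trades compute for memory during training.",
    "Re-rankers score query and document jointly with a cross-encoder.",
]

# Stage 1: bi-encoder retrieval (query and documents are embedded independently).
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")              # assumed model id
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)
query_emb = bi_encoder.encode([query], convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_emb, top_k=3)[0]

# Stage 2: cross-encoder re-ranking (query and document go into the model together).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # assumed model id
pairs = [(query, docs[hit["corpus_id"]]) for hit in hits]
scores = reranker.predict(pairs)

for hit, score in sorted(zip(hits, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {docs[hit['corpus_id']]}")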

Nicolay Gerold: Yeah.

What do you mean exactly by instruction-based embeddings?

Michael Günther: Instruction-based embeddings are embeddings that are trained in a way that you add a prefix to each text you want to embed, like a prompt for a large language model, which describes the task it should be embedded for. For retrieval, for example, it of course makes a difference whether you have a question-answering task, where the model should find the answer to the question, or, say, a duplicate-detection task, where the text should be embedded so that it is similar to its duplicate. And then you also have typical benchmarks, for example for fact checking, where you are given a fact and you want to retrieve evidence which says the fact is true, but you also want to retrieve things which are a complete negation of the fact, to basically falsify it. So these are all different objectives, which you...
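
A minimal sketch of what instruction-based embeddings look like in practice, assuming an E5-style model where the task is signalled with a "query: " or "passage: " prefix. This illustrates the prefix idea in general, not the specific instruction format of any Jina model.

# Sketch: E5-style task prefixes in front of each text before encoding.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-base-v2")   # assumed prefix-based embedding model

# The prefix tells the model what the embedding will be used for.
query = "query: who developed the theory of general relativity?"
passages = [
    "passage: Albert Einstein published the theory of general relativity in 1915.",
    "passage: General relativity is a geometric theory of gravitation.",
]

q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(util.cos_sim(q_emb, p_emb))   # similarities shaped by the retrieval-style prefixes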

Nicolay Gerold: Do you think we will ever hit the universal embedding? So, one embedding model to rule them all?

Michael Günther: I think one thing is that, as I said, you have very different objectives. You would probably need a model where you can really precisely describe what you actually want the embedding model to do beforehand. And then one problem is that if you are in a company and you have very specific data, and the embedding model hasn't seen anything like that before, then it will probably also not be perfect. Even large language models are universal, but people still fine-tune them on domain-specific data if they want to deploy them on a very specific use case in their company.

Nicolay Gerold: Yeah.

Maybe connected to that: I already see, especially for smaller companies, and especially startups, that many of them are not doing traditional AI or ML, where they actually think about the architecture, set up an entire model, train it on a task, and investigate it. Instead they follow the paradigm of taking a pre-trained model and fine-tuning it on the task. Do you actually expect that engineers in the future will still train entire models from scratch? Or do you expect that, for most tasks out there, we will have a few pre-trained models and it is just more efficient, in terms of performance and also compute, to take them and fine-tune them?

Michael Günther: I think, generally, pre-training allows a model to learn concepts which are very general and can probably be transferred to many tasks. So there are lots of applications where pre-training actually helps. For embedding models you can also definitely see that if you fine-tune a pre-trained embedding model, which is already contrastively trained to produce good embeddings, it will perform much better than if you try to train an embedding model from scratch, even from a pre-trained transformer model, if your data collection isn't massive.

However, there are of course still applications where, for example, your text is not really natural language; then taking a pre-trained model probably doesn't help you in any sense. And there are also applications where you could use an existing pre-trained big model and fine-tune it, but you need more efficiency. So maybe it is better to train a very small, I don't know, random forest classifier or whatever. Even for embedding models there are probably some techniques which are much more efficient. Maybe for your application it makes more sense to have a very small model you can run on a CPU and do lots of inferences in a very short time.

Nicolay Gerold: Yep.

Do you still, in your research, try out different architectures, for example LSTMs, RNNs, or CNNs, for generating embeddings as well, and train them on a specific task?

Michael Günther: We are doing relatively applied research, I would say. It is always risky, and in most cases you are not very successful if you try to invent everything from scratch. So what we usually do is look at what is already working and make smaller modifications where we think we have a relatively high probability of succeeding. I think if you are doing research in academia, it is a bit easier to try out something completely new.

Nicolay Gerold: Yeah.

What would you say, on a conceptual level, is the main difference between multimodal and single-modality embedding models?

Michael Günther: The obvious difference is, of course, that a single-modality model can only accept one type of modality as input, while a multimodal model can accept multiple types. And since the input modalities are fundamentally different (images, for example, are fundamentally different to process than text), most of these models have some component that transforms each modality into an input you can give to the network. Vision transformers, for example, take the image and split it into multiple small image patches they can process, while text models have a tokenizer which tokenizes the text sequence and then maps each of the tokens to a specific input embedding representation. Those are different techniques, and then you usually have an encoder component for each modality. However, how exactly the inputs are combined by a model can be very different depending on the type of multimodal model.
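
To make the two input paths concrete, here is a minimal PyTorch sketch of what Michael describes: a ViT-style patch embedding that turns an image into a sequence of patch vectors, and a token embedding lookup that does the same for text ids. Sizes and layer choices are illustrative assumptions, not Jina CLIP's actual configuration.

# Sketch: how images and text become sequences of input embeddings for a transformer.
import torch
import torch.nn as nn

d_model, patch_size, vocab_size = 512, 16, 30522

# Image path: split a 224x224 RGB image into 16x16 patches and project each patch
# to a d_model-dimensional vector (this is what a ViT patch-embedding layer does).
patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
image = torch.randn(1, 3, 224, 224)
image_tokens = patch_embed(image).flatten(2).transpose(1, 2)   # (1, 196, 512)

# Text path: a tokenizer maps text to ids; an embedding table maps ids to vectors.
token_embed = nn.Embedding(vocab_size, d_model)
token_ids = torch.tensor([[101, 7592, 2088, 102]])             # toy ids, not a real tokenizer
text_tokens = token_embed(token_ids)                           # (1, 4, 512)

# Both are now sequences of d_model-dimensional vectors an encoder can consume.
print(image_tokens.shape, text_tokens.shape)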

Nicolay Gerold: Can you double-click on that? What are the different ways you could combine the different modalities?

Michael Günther: For text-image models there are, I would say, three rough types; you can probably find more, but these are the ones you most commonly see.

One is the CLIP-like models, where you basically have a vision transformer model and a text transformer model. The text transformer is responsible for encoding the text and the vision transformer is responsible for encoding the image. Both have a projection layer that maps the output onto a shared vector space, basically mapping it into embeddings of the same dimensionality. Those are two independent towers, which are, however, trained simultaneously to map an image and a related text onto similar embedding representations.

The second type are vision-language models, for example the PaliGemma model, which has also been used for the recent ColPali model. There you also first have a vision transformer, which, as I said, usually splits the image into patches and processes them as tokens, like a transformer model does for text, and generates one patch embedding per patch at the end. Those patch embeddings are then used by another transformer model, which works more like a text transformer: it first gets all the embeddings of the image patches as a prefix, then gets some prefix text, and can then generate further text based on it. What ColPali basically invented is to take such a model, pass the embeddings of the image patches through the transformer, and get multiple embeddings, one for each vision patch. Then you have a ColBERT-style retrieval matching, where the query is also tokenized into embeddings, you match each of these token embeddings against each of the image patch embeddings, determine the maximum similarity, and average these maximum similarities to get a final similarity score. This is specifically good for more complex document matching. Okay, that was a bit long about the vision-language models.

I think a third type is that you just have some kind of vision encoder and a text encoder, which produce embeddings for image patches and for tokens, and then another transformer component which gets both as input and produces an output. An interesting model that recently came out and uses this architecture is the MagicLens model from Google. What they basically wanted is a model for image-to-image matching. But if you do image-to-image matching, it is often not so clearly defined what the objective is, just as instructions for text embedding models are used to define a retrieval objective. So they wanted to have this for images too: you have an image, and then you say, I don't want just an image which is somehow similar, I want something which shows the same thing as a sketch, or something like that. And then you can take these instructions, add the image, and encode both into a joint representation.
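
A small sketch of the ColBERT-style late-interaction scoring described for ColPali: each query token embedding is compared against every image patch embedding, the maximum similarity per query token is kept, and those maxima are averaged into one score. Random tensors stand in for real model outputs.

# Sketch: MaxSim late-interaction scoring between query tokens and image patches.
import torch
import torch.nn.functional as F

def maxsim_score(query_tokens: torch.Tensor, patch_embs: torch.Tensor) -> torch.Tensor:
    """query_tokens: (num_query_tokens, dim); patch_embs: (num_patches, dim)."""
    q = F.normalize(query_tokens, dim=-1)
    p = F.normalize(patch_embs, dim=-1)
    sim = q @ p.T                              # cosine similarity of every token/patch pair
    best_per_token = sim.max(dim=-1).values    # best-matching patch for each query token
    return best_per_token.mean()               # average over query tokens, as described here

# Toy example: 8 query token embeddings against two "documents" of 196 patches each.
query = torch.randn(8, 128)
doc_a, doc_b = torch.randn(196, 128), torch.randn(196, 128)
print(torch.stack([maxsim_score(query, d) for d in (doc_a, doc_b)]))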

Nicolay Gerold: Yeah. And especially from an applied side, how do you actually decide which of these three types might be most suitable? Are there some heuristics or guides from your experience, or do you in the end always have to test the different options?

Michael Günther: One thing you can definitely do, if you have the papers or some documentation about a model available, is look at what kind of benchmarks they evaluate on. Then you can check which benchmark is most similar to your problem and take a model which performs well on that benchmark. That is the general recommendation. If you don't have a benchmark which is somehow similar to your problem, then of course you can also look at the architecture and say: okay, if I just have a generic text-to-image search problem, maybe I just use a normal CLIP model, but if I have something more complex, then maybe I use a model with a different architecture which can better model this problem.

Nicolay Gerold: When I have a retrieval task, what is the trade-off in terms of performance between using multiple single-modality embedding models, running multiple retrieval systems at the same time and then re-ranking, versus using one joint model with a joint embedding space for retrieval? Are there any research papers that come to mind that actually try to compare those?

Michael Günther: I haven't seen too many research papers comparing this, or at least none come to mind right now. But the case you are asking about is that you have a text query and you want to search for some item which has an image and a text property, right?

Nicolay Gerold: Yeah.

Michael Günther: Yeah, something like this. So what you can do is either use a text embedding model and a text-to-image embedding model: you use the text embeddings to retrieve similar items based on just the text, you use a CLIP-like model to retrieve images which are similar to the text, and then you somehow combine the results. Or, with your CLIP-like model, you can use the text encoder and the image encoder to produce two embeddings, one for the image part and one for the text part of an item, and average them. That can also work decently well, especially with the Jina CLIP model, which we have trained to also produce high-quality text embeddings and which can therefore also be used for text-to-text retrieval. On some product search tests we have seen that using the averaged embeddings actually works quite well. However, one common problem is that you might have some items which only have a text property or only an image property, or one of them is very low quality and you cannot use it. Then one problem you observe is that the scale of the similarity values you retrieve from an item which has only a text or only an image is very different from the scale you get from an averaged embedding. In those cases, I think using two embeddings which each capture a single modality is probably the better choice.
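
A sketch of the averaging trick mentioned above for items that have both a text and an image field. The vectors are placeholders; in practice both would come from the text and image towers of the same CLIP-like model so that they live in one space.

# Sketch: combine a product's text and image embeddings into a single item vector.
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Placeholder vectors standing in for CLIP-like text-tower and image-tower outputs.
text_emb = l2_normalize(np.random.randn(768))
image_emb = l2_normalize(np.random.randn(768))

item_emb = l2_normalize((text_emb + image_emb) / 2)   # averaged item representation

query_emb = l2_normalize(np.random.randn(768))        # a text query from the same text tower
print("query vs item:", float(query_emb @ item_emb))

# Caveat from the episode: items with only one modality produce similarity scores
# on a different scale than scores computed against averaged embeddings.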

Nicolay Gerold: Yeah. And what about new modalities? The ImageBind paper, for example, included depth as a separate modality. I think the first question for me would be: what makes a modality? And then, based on the modalities I include, how can I actually figure out how to best combine the modalities I am adding to the model?

Michael Günther: I think what makes up a modality is maybe already a bit of a philosophical question, I don't know. But if you want to add a new modality to a model, I think one simple way is this: if you already have some model which can translate from one modality into another, for example an audio sequence, you can just use something like Whisper to convert it into text and then use a text transformer. Maybe that already works, right? It might not work if you have audio which is not transcribable speech, for example. But that could be a simple option I would sometimes just try out first, before training your own new model.

Another thing is that if you want to train a new modality into a model, one very difficult aspect is always to find sufficiently good training data you can actually use to train these models. You probably need data sets which relate the two modalities in a way that captures the notion of similarity you want to capture, and you need that kind of data in a sufficiently big amount. And then, if you design a model, you have to come up with a good method to encode this data into a representation which can be processed efficiently by neural networks, and so on.
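
The "translate the modality first" shortcut can be sketched with standard Hugging Face tooling: transcribe the audio with Whisper, then embed the transcript with an ordinary text model. The model ids and the audio file path are assumptions, and as noted above this only works when the audio contains transcribable speech.

# Sketch: handle an audio input by converting it to text before embedding.
from transformers import pipeline
from sentence_transformers import SentenceTransformer

# Step 1: speech to text.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")  # assumed model id
transcript = asr("customer_call.wav")["text"]            # placeholder audio file

# Step 2: text to embedding with a normal text embedding model.
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed model id
audio_as_text_emb = text_encoder.encode(transcript, normalize_embeddings=True)
print(audio_as_text_emb.shape)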

Nicolay Gerold: So basically the first challenge would be to figure out a representation, or train a model to represent the new modality. The second would be to figure out a way, or create a data set, that brings the new modality and the existing modalities into a joint embedding space, so I can actually calculate the similarity between them. Can you maybe, using the example of Jina CLIP, give us an example of what the process of creating a dataset for training a multimodal model actually looks like, and what the different fields in such a dataset are?

Michael Günther: To train Jina CLIP we of course utilized a lot of already existing data sets. There are the LAION-400M and LAION-5B data sets, which are widely used to train CLIP models, and we just reused this data. Then, since our focus was to create a model that is also better for text-to-text search, we relied on the collection of text training data we have from our Jina text embedding models. Collecting those data sets was a relatively long process. Basically, we tried to get a very diverse set of text pairs from different domains so that the model becomes very general. We also did a lot of filtering. For example, we filter out duplicate text pairs, and we filter out things that are not in the target language: if you want to train an English model, you don't want data from other languages. And we also did some more complex filtering methods like consistency filtering, where you take an existing embedding model and try to filter out text pairs where the texts are unrelated.

It is similar for the LAION data set. LAION consists of images and their captions, so that you can train a CLIP model to encode a caption and its image into similar embedding representations, and to encode the captions of different images into dissimilar representations. To construct it, they also used an existing CLIP model to check whether the text was actually related to the image, and if the score was below a threshold, they filtered it out. That is the kind of filtering they did. For the Jina CLIP model, we noticed that... should I already get into the training, or...?
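
A minimal sketch of the consistency filtering described above: embed both sides of every text pair with an existing embedding model and drop pairs whose similarity falls below a cutoff. The model id and the 0.5 threshold are illustrative assumptions.

# Sketch: drop text pairs whose two sides look unrelated to an existing embedding model.
from sentence_transformers import SentenceTransformer, util

pairs = [
    ("how to reset my router", "steps to restore a router to factory settings"),
    ("how to reset my router", "banana bread recipe with walnuts"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")     # assumed "existing" embedding model
left = model.encode([a for a, _ in pairs], convert_to_tensor=True, normalize_embeddings=True)
right = model.encode([b for _, b in pairs], convert_to_tensor=True, normalize_embeddings=True)

similarities = util.cos_sim(left, right).diagonal()  # similarity of each pair's two sides
THRESHOLD = 0.5                                      # illustrative cutoff
kept = [pair for pair, sim in zip(pairs, similarities) if sim >= THRESHOLD]
print(kept)   # the unrelated pair should be filtered out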

Nicolay Gerold: Yeah.

Michael Günther: Okay. So, maybe as a short background story: what we at first wanted to do was to take the encoder of our Jina text embedding models and adjust existing vision transformers, by continued training, to encode images into embeddings similar to the text embeddings, based on an image captioning data set. We wanted to keep the text encoder frozen so that we don't change anything on the text encoder and the model stays compatible with our existing Jina text embedding models. However, it turned out this was not working at all, and we later found out that there are actually papers which show that if you freeze the text encoder, you cannot efficiently train a CLIP model.

So then we tried to train the model without freezing anything, so we also train the text encoder part, and we thought: okay, we basically use a kind of multi-task training method, training the model on pairs of texts and on pairs of texts and images together, so both objectives at the same time. This was working quite well, but we also realized that the model did not have as good text encoding capabilities. We found out that this was because the image captions you usually train such a model on are very short, so the model cannot learn too much about text from them. Basically, we needed a data set with longer captions and more descriptive descriptions of the actual images. And then we found an existing data set with synthetic descriptions of images, which are partly generated by a large language model and partly by an image captioning model that was trained on this synthetic data. So that was basically how we collected the training data.

Nicolay Gerold: And did you also compare it to a triplet loss, where you also have negative examples for the images or the texts?

Michael Günther: Yeah, good question. Text embedding models are usually trained in two stages. After the normal pre-training of the text transformer, you usually train on text pairs, which you can easily get in very large amounts; you can easily get a billion pairs or something like that. You train in a contrastive way, so that the model encodes the two text values which are part of a pair to be similar, and for the other pairs inside the batch, you take all the right-hand sides which are not part of your pair and say the similarity score should be low. Usually after that, you train the model on smaller, high-quality data sets, for example retrieval data sets with queries, their specific relevant documents, and also annotated documents which are not relevant to the query, and you use those as additional negatives in your loss. If your high-quality data set doesn't have annotated negatives, you can also mine them, for example by taking an existing embedding model, looking at what it returns as relevant text values, and keeping those which a re-ranker model ranks lower, because they might not actually be relevant.

When training our original Jina CLIP model, we wanted to do the same to improve the text retrieval capabilities of the model. So we ended up with three stages. In the first stage we train the text model on image captions and images on one hand and on text pairs on the other hand. In the second stage we continue training on these text pairs, but include longer image captions from another data set plus the actual images. And in the third stage we continue training on long captions and images, and for the text part we then train the model on our triplet data sets, where we have these additional hard negatives.
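
A compact PyTorch sketch of the contrastive recipe just described: every other right-hand side in the batch acts as an in-batch negative, and annotated or mined hard negatives are appended as extra columns of the logit matrix. This illustrates the general pattern, not Jina's actual training code.

# Sketch: InfoNCE-style loss with in-batch negatives plus hard negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(q, d_pos, d_hard, temperature: float = 0.05) -> torch.Tensor:
    """q, d_pos: (batch, dim) paired embeddings; d_hard: (batch, num_hard, dim)."""
    q = F.normalize(q, dim=-1)
    d_pos = F.normalize(d_pos, dim=-1)
    d_hard = F.normalize(d_hard, dim=-1)

    in_batch = q @ d_pos.T                              # (batch, batch); the diagonal holds positives
    hard = torch.einsum("bd,bnd->bn", q, d_hard)        # (batch, num_hard); mined hard negatives
    logits = torch.cat([in_batch, hard], dim=1) / temperature

    labels = torch.arange(q.size(0))                    # row i's positive sits in column i
    return F.cross_entropy(logits, labels)

# Toy shapes: a batch of 32 pairs, 4 hard negatives per query, 768-dimensional embeddings.
loss = contrastive_loss(torch.randn(32, 768), torch.randn(32, 768), torch.randn(32, 4, 768))
print(loss.item())

This also shows why large batch sizes matter for this kind of training: the bigger the batch, the more in-batch negatives each query sees.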

Nicolay Gerold: Yeah. And especially on the evaluation side: when you are looking at a paper, at a new model or an existing model, what do you look at first when you try to assess it? What are the different evaluations and methodologies you use to assess the quality?

Michael Günther: I think you look, of course, at the benchmarks and how the model compares to other models. But what you should also take a look at is how the models are trained: is the training data related to this task? The performance of a model might come from the fact that they used training data which is specific to the task, so they might have gotten a better score on that specific benchmark because of it. And if you actually want to use a model, it also makes sense to choose one where the training data is similar to what you actually want to use it for.

Nicolay Gerold: And looking at the training process, why didn't you fine-tune the image embedder as well, with pairs of images?

Michael Günther: Our model is basically not specifically trained for image-to-image search, I would say.

Nicolay Gerold: Yeah.

Michael Günther: Otherwise we maybe would have done this. But yeah, it would be an interesting idea to also do this and see if you can then get a model which is better for image search.

Nicolay Gerold: Yeah. So it is likely that if I create an additional data set and continue the pre-training of the image embedder, I would get better performance on image-to-image search as well.

Michael Günther: Yeah, it could be.

Nicolay Gerold: Can it be that the different types of searches interfere with each other? So when I am trying to specialize the model on text-to-image search, I am better off doing your process, and when I try to specialize on image-to-image search, I am better off doing the other process, and they interfere with each other if I try to do both.

Michael Günther: Yeah, I think in general, whenever you fine-tune a model, it is very likely that it loses some of its generalization properties; it might generalize less well to other tasks, of course. You probably have to be aware of that. So I guess, yeah, when you train the model only on image-to-image pairs, it might lose its text-to-image capabilities.

Nicolay Gerold: Yeah. And we are looking more at how to take your model and adapt it to a new task. So if, for example, I had an e-commerce setting and I want to adapt the model to be more suited to the query types you often have on e-commerce sites, which are usually very short queries with only a few words, how would you actually go about creating a data set to fine-tune Jina CLIP for that specific task?

Michael Günther: Yeah. In the ideal case, if you are an e-commerce company, you maybe have some click data, a query log, or something like that you could use, and then you basically try to relate the queries directly to the descriptions of the images and the descriptions of the products you have. If you don't have this, or not in large amounts, or you don't want to use it because it is customer data and you don't want to train on it, then what you can also do is take a specific model which can generate queries for your products. For example, if you have good product descriptions, maybe you can just use a large language model to produce some queries about these documents, and then you can use this as a data set. There is also a method called generative pseudo-labeling. They actually did this with a very simple model which is trained to generate questions, and they used it only for text retrieval purposes, but basically they take the document collection, construct questions about these documents, and then they could show that when fine-tuning on this data, the model performs much better on the existing document collection. So I think you can probably do this for your e-commerce application as well.
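
A sketch of the query-generation step behind generative pseudo-labeling: a small seq2seq model generates synthetic queries for each in-domain document, and the resulting (query, document) pairs become fine-tuning data. The checkpoint name is an assumption taken from the doc2query/GPL line of work.

# Sketch: generate synthetic queries for in-domain documents (pseudo-labeling style).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "BeIR/query-gen-msmarco-t5-base-v1"       # assumed query-generation checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

product_descriptions = [
    "Waterproof trail running shoes with a carbon plate and 8 mm drop.",
    "Stainless steel pour-over kettle, 0.9 l, with built-in thermometer.",
]

training_pairs = []
for doc in product_descriptions:
    inputs = tokenizer(doc, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True,
                             top_p=0.95, num_return_sequences=3)
    for query in tokenizer.batch_decode(outputs, skip_special_tokens=True):
        training_pairs.append((query, doc))          # synthetic (query, document) pair

for query, doc in training_pairs[:3]:
    print(f"{query!r} -> {doc[:40]}...")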

Nicolay Gerold: Yeah, I am pulling up the research paper right now. So what it basically does is generate synthetic queries for the passages in the target domain. I think that is really interesting, especially if you also add the paper on getting one billion personas from the internet: you create user personas and then use those on top to create different types of queries. This could be really interesting and also speed up the dataset creation greatly. Nice. And in this case, would you actually only fine-tune the text embedder, or would you fine-tune the entire model?

Michael Günther: That is a good question. I can imagine that a training process which resembles the last stage of our training more closely probably leads to better results if you use the model for text-to-text and text-to-image retrieval together, rather than just training the text encoder, because then the model might lose a bit of its connection to the image modality.

Nicolay Gerold: Yeah, and it would also save you a lot of work. Nice. What are some of the interesting, or also weird, use cases for Jina CLIP that you have seen so far, or that you have tested internally?

Michael Günther: Since the model is relatively new, we haven't talked too much with users yet, but one application which often comes up is e-commerce, so product recommendation and search. This is also a domain where you can really see that there is a lot of data on the internet about products, so the model is trained on similar data and therefore performs well, probably much better than on a domain where you cannot find much data on the internet. As I said, I did a small experiment where I tried this averaging: I averaged the embedding of the description of a product and the embedding of the image of the product, and the resulting representation was actually quite good. It is also interesting that this works well. However, as I said, it also has some limitations, especially if some products are missing a description or an image.

Nicolay Gerold: Yeah. What can other engineers take away from this? Most of them won't train their own embedding model or go through such a large-scale training process. What are the main learnings you would give to engineers who don't go through the massive experience you have?

Michael Günther: I think one thing is definitely to be aware of the scale of the similarity values you have. One common misconception, which we also saw for the text models, is that the similarity values might not reflect the notion of similarity you have in mind. For example, many text embedding models always produce very high similarity values, even though the two text values are not similar at all. But if you compare those similarity values to the similarity values of other texts you have embedded, then it makes more sense. It is also important to know that the similarity values between embeddings of images, between embeddings of texts, and between images and texts are on very different scales. That is important to consider.

Another thing: whenever you try to optimize a model, it is definitely important to have some way to quantitatively evaluate how good the model is before your optimization and afterwards. That is something you should do before you put a lot of effort into optimizing something.

If you consider fine-tuning, a general hint is that for training embedding models, and also for fine-tuning them, large batch sizes are actually important. And in order to get large batch sizes, you also have to enable some performance optimizations. For example, our Jina CLIP model has flash attention support, which you can use to reduce the memory the model consumes during training on the GPU, and you can also use activation checkpointing, which saves you a lot of memory.

Maybe one more thing is to be aware of the limitations of the model. For example, if you use a vision model or a multimodal model that takes images, you have to be aware of the model's input resolution. The model scales images down, and if your images are very big and you want to recognize something which is very small in the images, the model might not be able to solve your task because of this downscaling of the resolution. In a similar way, if you have very long texts but truncate them into very small chunks before you encode them, then maybe the information you want to search for is not encoded at all. So those are maybe some hints.
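
The two memory levers mentioned for reaching larger batch sizes can be sketched with the standard Hugging Face switches. Whether a given checkpoint accepts these exact flags depends on the model implementation, so treat the model id and arguments as assumptions rather than a verified recipe.

# Sketch: enable flash attention and activation (gradient) checkpointing to fit bigger batches.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-clip-v1",                     # assumed checkpoint; requires trust_remote_code
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,                # lower-precision weights also reduce memory
    attn_implementation="flash_attention_2",   # only if flash-attn is installed and supported
)

# Activation checkpointing: recompute activations in the backward pass instead of storing them.
model.gradient_checkpointing_enable()

# The memory saved here goes into a larger contrastive batch, which is what matters
# most for in-batch-negative training.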

Nicolay Gerold: On the scale of similarities: is there a type of distribution you are typically looking for, like a normal distribution, or rather a power law towards one of the extremes?

Michael Günther: One thing you notice is that many embedding models are trained in a way that they map all the text values onto a cone in the vector space, and if you then calculate the cosine similarity, they all have a very high cosine similarity. It seems like for some tasks the models perform better if the embeddings are all on this cone; it is easier for the model to model a certain type of similarity, some contrastive objective. But yeah, some people are of course a bit confused that all the similarity values are above 0.5 or so if you take an E5 model, for example.
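
To illustrate why fixed thresholds become fragile when all scores sit high on the cone: a small sketch comparing an absolute cutoff with a per-query relative criterion. The score values are made up to mimic the "everything above 0.5" behavior of E5-style models.

# Sketch: absolute cosine thresholds versus ranking-relative filtering.
import numpy as np

# Made-up scores from a model whose cosine similarities are always high;
# assume only the first two results are truly relevant.
scores = np.array([0.86, 0.84, 0.73, 0.72, 0.71])

# Naive absolute threshold: a "reasonable looking" cutoff lets everything through.
print("absolute >= 0.7:", int((scores >= 0.7).sum()), "hits")

# Relative criterion: keep results within a margin of the best score for this query.
margin = 0.05
print("within margin of top:", int((scores >= scores.max() - margin).sum()), "hits")

# Or standardize per query so a threshold means "unusually similar for this query".
z_scores = (scores - scores.mean()) / scores.std()
print("z-scores:", np.round(z_scores, 2))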

Nicolay Gerold: In the end that makes thresholds less usable, when you have a distribution that is biased to the higher side. Nice. What would you actually say is missing from the space?

Michael Günther: I think one thing the space might be missing, especially for multimodal tasks and these complex retrieval tasks, is that there are not so many benchmarks you can actually use. For example, we tried to find benchmarks for a search task where you have a text query and products consisting of text and images. You can find some things in the e-commerce space which you can more or less use as benchmarks; there is a data set which contains products from Amazon with very fine-grained category descriptions you can maybe use as queries. But in general, I think it would be very nice if there were a few more benchmarks, and also if people who use these models and notice that something is not working perfectly would try to create small data sets that others can evaluate on, so researchers can try to find models which work well on these tasks. I think that would be something very useful.

Also, especially in the last months, a lot of vision-language models and embedding models which support multiple modalities have come out, and it would be very interesting to see what kinds of applications you can solve with all these models.

Nicolay Gerold: Nice. What is next, especially for Jina? What is on the horizon that you can already tease, or what are you working on?

Michael Günther: We are actually working on a new version of our Jina ColBERT model which supports more languages. At the moment the model only supports English, which is, of course, a big limitation, so we are trying to change that. Our team is also working on a new version of our text embedding models. If you look at the leaderboards, you can see that by now the Jina text embeddings are quite far apart from all the other models, and we will probably soon publish a new model which is also multilingual and has better performance. And then what we want to focus on further is to look into more complex multimodal search applications and try to build models for this, maybe also take a look at ColPali and its capabilities and limitations, and see what this model is maybe not so good at and what we can contribute.

Nicolay Gerold: What is one underappreciated or niche technology that you think deserves more attention?

Michael Günther: One thing where quite some research has been done, but which I haven't seen become popular, and where not many people are putting it into products, is the encoding of semi-structured data, data with a bit more structure, such as tables, and maybe also models which can encode multimodal data that comes in some structure. I think that would be an interesting direction. And another thing: some people say they want to use CLIP models for tasks where the images are not typical photos but, for example, plans or other images which represent a different type of image data. I think it would also be interesting to see how CLIP models actually perform on these different kinds of images and what can be done to improve the performance in those domains.

Nicolay Gerold: Nice. And if people want to follow you, or follow along with Jina, where can they do that?

Michael Günther: Yeah, of course we have a blog on our website where we regularly publish articles, probably interesting articles about new things in the space of search with AI models. All our models are published on our Hugging Face page. We of course also have our search foundation, where you have different API-based products to build your search applications. And me and several of my colleagues are active on Twitter and LinkedIn and post something from time to time.

Nicolay Gerold: So what can we take away? First of all, I think his thoughts on the universal embedding model are spot on. I also think that we are unlikely to really get a universal embedding model in the sense of high performance across the board. I think we will have a universal embedding model which is good enough for a lot of tasks, which some people might argue we have right now with, for example, OpenAI embeddings. But I would say it isn't really universal, because if you try it for search it works quite well, but if you go beyond that, for example for anomaly detection or recommendations, it won't work well. For an embedding model to qualify as universal, I would want it to have a baseline of performance that is good enough for many different use cases, so I can basically build a prototype around it. But universal embedding models that outperform, or perform on par with, the state of the art on specific tasks? I don't expect that we will see those anytime soon. The objectives and domain-specific needs are so different that you will always be better off training your own model and constructing your own datasets, and also trying to get more efficient. Performance is one thing in terms of, for example, accuracy, but performance in terms of speed or throughput is another, and the universal models are most likely very large, which makes them unsuitable for a lot of applications.

Then there are the three main types of text-image models. CLIP-like models, with two different encoders which are joined at the end into a joint embedding space. Vision-language models, which basically prefix an image before they start to generate text. And then hybrid models. I am not exactly sure what the clear definition of a hybrid model is; I think he mentioned the MagicLens model as a good example, and I have to read into it, but they basically allow for more complex interactions between the modalities. So I think it is basically separate encoders, but with additional transformer components integrated. But yeah, I think the three different types are very interesting.

And also his guidance on what embedding model to use: look into the paper, look into what benchmarks they have run, but maybe also look into what the lab is researching, especially if it is an industrial lab. You might find an industrial lab which put out a benchmark and which is working in an area or on a use case that is very close to yours. So try to find an embedding model which outperforms on a benchmark that is similar to your task, or one from an industry lab which is working in a space that is very similar to yours. And also evaluate the architecture based on the task complexity.

The most interesting insight I would highlight is that freezing the text encoder can hinder CLIP model training. I would love to know why, but if you go into training a CLIP model, this is something to remember. It also highlights something about the original CLIP model: the training data they used has very short captions, which shows how important it is to look into the research paper and into the data sets they used, and whether they actually mirror the data you will be seeing. Because if you have very long captions or descriptions by users, you are unlikely to perform well with a CLIP model out of the box, because the data is so different that it is basically out of domain. And another one is large batch sizes: you should train an embedding model with a large batch size.

I think that's it from me for the main takeaways. I found it really interesting. If you have any questions on multimodal embedding models, or would like to see some fine-tuning scripts (I am working on a few right now), just let me know in the comments or give me a ping on LinkedIn. Otherwise, we will be continuing with more on embeddings next week. So subscribe, like.