Nicolay Gerold: Most LLMs you use today already use synthetic data in their training process. It's not just a thing of the future. The large labs already use the larger models they have, like GPT-4o, to generate training data for smaller ones, like GPT-4o mini.
This lets you build faster and
cheaper models that in most cases
are specialized on a certain task.
And this is called distillation.
But the vision for synthetic data is much bigger: you want to enable people to train specialized AI systems without having a lot of training data.
And today we are talking to Adrien Morisot, an ML engineer at Cohere. We talk about how Cohere uses synthetic data to train their models, their learnings, how you can use synthetic data in your own training, and especially how you can take what they do at large scale down to a smaller scale, with smaller models or for specific use cases.
We are slightly diverging from our search focus, but I wanted to do a deeper dive into synthetic data after our episode with Saahil. In search, you can use synthetic data in a lot of places: you can generate hard negatives, you can generate training samples for classifiers or re-rankers, and much more.
And before we start with the podcast, I would love to know whether you have actually used synthetic data before and whether you found any use in it. Also, leave a like or a review on the platform you're listening on, like YouTube or Spotify. It helps a lot in continuing the podcast and making it bigger and better.
And that's enough from me.
Here's my conversation with
Adrien Morisot on synthetic data.
Adrien Morisot: So there are kind of two arcs to my synthetic data journey. The first arc was maybe 2021-ish, when we had just generative models. And I was trying to get them to create data for themselves, or create data for other embedding models or search models or classifiers. And at the time the models were really bad. I tried quite hard, tried fine-tuning them, but it just didn't work. Nothing worked. Training on the model outputs would always make things worse. And so I gave up.
And then I did other stuff for a while.
And then, relatively recently, we got models that kind of step-functioned up to the point where they were good enough to start being usable for training themselves in small ways. And yeah, it was an accidental rediscovery. We had a very pressing customer ask that we really had to fix. The default way to fix these kinds of model issues is to run annotation jobs with humans: you say, okay, here's the task, and people create some data to fix this issue. But the customer ask was so pressing that we didn't have time to run a normal, full-fledged annotation job with human beings. So we thought, okay, we have to accelerate this somehow. And the path to that was using the model itself, having the model do as much of the work as possible, somewhat guided by humans, but still mostly model work. It was a very intense month of work, but we ended up developing lots of useful techniques to figure out how to get the model to improve itself on a variety of axes. And then we realized, okay, this is very promising. Let's turn this into a full-fledged team and a more coordinated effort. And we've been doing that since then.
Nicolay Gerold: Nice.
And how do you think synthetic
data fits into the evolution of
AI or like software at large?
Adrien Morisot: Yeah, I think it's actually very fundamental. Maybe that's because I'm clearly biased, but I think it's the beginning of a grand arc in the evolution of AI. One analogy I've been thinking about a lot recently is how the beginning of computing was very hands-on. You tell the machine exactly what to do: okay, there's a binary number in this register, a binary number in that register, we're going to have them flow through this particular circuit, and at the end the result will be the subtraction of these two binary numbers. And you can program this stuff in assembly code. It works, and it's the closest thing to the metal that there is, but it's also incredibly slow, incredibly difficult, incredibly painful. It's not something that you do for fun. Then you fast forward through the history of computing, you build all these abstractions around it, and at the end you have Python, which is very close to English: if this, do this; you define a function, give it a pleasant name, and it just adds two numbers or multiplies them. So you have this software 1.0 evolution from something incredibly painful to something quite pleasant and easy to use.
And I think there's this second arc of programming, which is not software 1.0. It's about telling a neural network what to do. You have a big neural network, you have a bunch of inputs and a bunch of outputs, and you say: okay, for these inputs, map them to these outputs. The original way to train these neural networks was very painful. You harvest data, usually manually, and it's stressful and difficult and painful. You have to find tricks, data augmentation tricks, and none of this is particularly pleasant or easy. In the past you couldn't use a bigger, smarter model to create data. You still had to either get humans to do it, or find clever ways to scrape the data that you needed. But now the biggest and most competent models are powerful enough that you can just tell them: okay, I need data in this specific format, help me map this type of input to this type of output, and give me a bunch of examples of these inputs and correspondingly these outputs. And then I will use this data to train the neural network on, and it works reasonably well now. It still feels painful. There are still pitfalls. It's probably still in the C world rather than assembly. But the path to having the Python of being able to program neural networks is becoming more and more clear. There's more and more of an ecosystem around creating synthetic data, and it will only get better, and the models will continue improving too, so that will help a lot. So yeah, I think it's quite fundamental: if you have models that are capable of creating data for themselves, it feels very powerful.
Nicolay Gerold: Yeah.
And I think it's really interesting, because software 3.0 is touching it from different levels. On the one side you have the readable code, which is interpreted or compiled into a machine program, but at the same time you have the more data-driven side, neural nets and stuff like that, which is turned into a statistically based machine program. Now you have synthetic data generation, which takes over the second part, but models can also generate code very well, which takes over the first part. So it's double trouble in the end.
Adrien Morisot: Yeah, it's quite an interesting synthesis, yeah.
Lots of strange things coming together.
The future will be very exciting.
Nicolay Gerold: Yeah, and where does synthetic data come in? What was the struggle so far, where we couldn't just curate the data sets, or get them from the internet or from existing documents?
Adrien Morisot: Yeah, so for a long time you could always be clever about your data harvesting or data gathering. There are many examples. The internet is a huge place, so you can find lots and lots of random stuff in it. And so if you needed a data set, odds are you could twist and cleverly reformat parts of the internet to suit your needs.
Maybe one example of this: let's say you want to train a model that's good at recognizing that two questions are similar to each other. Then you can think very hard and ask: on the internet, are there instances of questions that are similar to each other, labelled as such, that I can figure out how to scrape? And if you think hard enough, you realise: oh, okay, on forums there's often a little tag that says duplicate. If you go on Stack Overflow, it'll sometimes say, oh, this is a duplicate of this question. Or on Quora you'll find, oh, this is a duplicate of this question. And so if you just scrape those websites, you can gather a bunch of question pairs, and cleverly come up with large quantities of data around questions that are not phrased exactly the same but have similar semantics.
A lot of machine learning for a long time was about this. You want some data in a specific format, and you go on the internet and try to harvest it that way. This is not trivial; it's a day's work, two days' work. It depends on how good your scraping capabilities are. Most of the time you'll be able to find something on the internet that loosely matches what you want.
But sometimes it's just impossible.
And the reason synthetic data is useful is that you can just ask the model. You can say: okay, hello, I would like two questions with similar semantics but different phrasings. And the model will just say, sure, how's this? It gives you one, and then you iterate on your prompt to figure out what is reasonable and what is not, until you find something that satisfies you, and then you generate large quantities of this. So you're doing everything in English rather than in Beautiful Soup and requests.
Nicolay Gerold: And I'm really curious, what does this actually look like? What are you feeding into the model to actually get the good output that you get in the end? You can generate, for example, questions. What do you feed in? Do you feed in a user persona, similar questions? What are the different inputs you use?
Adrien Morisot: Yeah, so the user persona you're referring to, is that from that one billion personas paper? Is that the reference? Yeah. For the audience, it was this very cool paper.
So maybe to zoom back out a bit.
If you just naively ask the model, give
me Two similarly two, two questions with
similar semantics but different phrasings.
And you do it once, the model
will say, sure, here's here's one.
And then you're like, great, happy,
let me get a thousand of these.
So you inject this pro, or like
you, you regenerate an answer
to this prompt a thousand times.
That doesn't work because
the model has no kind of.
recollection of what it said
across different chat calls.
And so you don't really know like
you, you are, you're asking for a
thousand different ones, you might
get, 15 different ones because
the model might just repeat.
And Ac this is pervasive across
all synthetic data generation.
And so the way the clever way around this
is to inject more noise into the prompt
such that you get a diverse output.
And so one cool technique that we've that
we'd actually discovered internally long
before this personas paper was published
is, okay you go over the internet and
then there's there's the internet is huge
it's crazy there's all sorts of different
stuff in it and then you ask the model,
okay here's a page from the internet.
Tell me something about the person
who would've generated this page.
Give me get very specific about it.
And so maybe it's a cooking blog.
And then they have some kind
of detail about their family.
It is like a Filipino dish or something.
And so the mall's okay, this is
a, this is Filipino chef who loves
their family, blah, blah, blah.
And so that's one persona.
And then you do this in parallel
across, large parts of the,
or like a billion webpage.
And then you get okay, you have the
Filipino nurse, and then you have the,
Haitian sprinter, and then you have the
the French swimmer or something, and then
you've accumulated very precise personas.
And then when you, and then the where
synth stuff comes into play is rather than
asking just can you give me, two questions
with similar semantics, but different.
You have can you give me two
questions with similar semantics
and different phrasings?
By the way, you are, a Filipino
chef and you, this is your
very precise personality.
Use this to create the questions
and then the questions will be very
particular to that, Filipino nurse's
life or that French, swimmers life.
And then.
This kind of averages out
to a very diverse data set,
also versus the Internet.
So that's a key synthetic data
trick, is to inject as much diversity
using the Internet as possible.
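To make the trick concrete, here is a minimal sketch of that persona loop, assuming a hypothetical ask_llm(prompt) helper that wraps whatever chat API you use; the prompts and the extract_persona step are illustrative, not Cohere's actual pipeline.

```python
# Sketch of persona-based diversity injection (ask_llm is a placeholder).
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def extract_persona(web_page_text: str) -> str:
    # Step 1: turn a random web page into a very specific persona description.
    return ask_llm(
        "Here is a page from the internet:\n\n"
        f"{web_page_text}\n\n"
        "Describe, in two sentences and very specifically, the person "
        "who most likely wrote this page."
    )

def generate_question_pair(persona: str) -> str:
    # Step 2: condition the data-generation prompt on that persona so
    # repeated calls stop collapsing onto the same handful of outputs.
    return ask_llm(
        f"You are the following person: {persona}\n\n"
        "Write two questions that have the same meaning but very different "
        "phrasings, drawn from your daily life. "
        "Return them as 'Q1: ...' and 'Q2: ...'."
    )

# Usage (once ask_llm is wired to a real client):
# pages = [...]  # raw text of randomly sampled web pages
# pairs = [generate_question_pair(extract_persona(p)) for p in pages]
```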
Nicolay Gerold: What are other options you guys have explored to generate that diversity, but within the constraints of the actual task you want to solve?
Adrien Morisot: Yeah, so we've explored a lot. It's a central problem. One of them, which dates back a long time, people were using this on top of GPT-3, was just injecting a random number into the prompt. So everything else is the same, the prompt is the same, but at the top of the prompt you have a random number. This works okay; it helps a little bit, but it's not great. The personas and using the diversity of the internet are definitely better than using the diversity of a random number generator.
I think we've tried other stuff, but the most successful approach so far has been using the innate diversity of the internet to create richer and more diverse synthetic data. There are some other, narrower techniques if you want to guide the model more specifically towards generating maybe ten different types of things. Let's say you want either A or B or C or D, or criteria 1 through 10. What I often do is put those in a Python list, randomly subsample three, shuffle them, and then just have those in the prompt. That gives you 10-choose-3 possible prompts, and if you just work your way through those, the model will be decently diverse.
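For reference, a minimal sketch of that subsample-and-shuffle trick, with placeholder criteria names:

```python
import random

# The 10 criteria / output types you want the generator to cover
# (placeholder names; swap in your own).
criteria = [f"criterion {i}" for i in range(1, 11)]

def build_prompt(base_prompt: str) -> str:
    # Randomly pick 3 of the 10 criteria and shuffle their order, so each
    # generation call sees a different slice of the task space.
    picked = random.sample(criteria, k=3)
    random.shuffle(picked)
    bullet_list = "\n".join(f"- {c}" for c in picked)
    return f"{base_prompt}\n\nFocus on the following aspects:\n{bullet_list}"

prompts = [build_prompt("Generate one training example for task X.") for _ in range(1000)]
```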
But I think the finding has mostly been that the internet is diverse enough for most purposes.
Yes.
Nicolay Gerold: And for people not at your guys' scale, who are building mostly more task-specific data sets, how would you actually go about it? If you're building financial models, how would you go about controlling the synthetic data to stay within the domain, but still be diverse enough to cover a broad basis?
Adrien Morisot: Yeah, so maybe another example that we haven't touched on yet is few-shotting. Usually few-shotting is dangerous because the point of few-shotting is that you narrow the diversity of your output; you tailor it much more precisely. Maybe in this case that would be a good thing, right? Because if you're trying to generate either analyses of financial models, or to create financial models in the first place, then it's probably important to have a very precise structure for your output. And so if you add examples of the desired output in the prompt, maybe two, three, four of them, then you'll be golden.
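As a rough illustration (the financial examples here are invented placeholders, not real outputs), a few-shot prompt for pinning down output structure could be assembled like this:

```python
# Sketch: few-shot prompt that pins the output to a precise structure.
few_shot_examples = [
    ("Summarize Q3 revenue drivers for ACME Corp.",
     "Revenue: +12% YoY. Drivers: (1) pricing, (2) new regions. Risks: FX exposure."),
    ("Summarize Q2 margin trends for ACME Corp.",
     "Gross margin: 41% (-2pp). Drivers: (1) input costs. Risks: continued inflation."),
]

def build_few_shot_prompt(new_request: str) -> str:
    shots = "\n\n".join(f"Request: {q}\nAnalysis: {a}" for q, a in few_shot_examples)
    return (
        "Follow the exact structure of the examples below.\n\n"
        f"{shots}\n\nRequest: {new_request}\nAnalysis:"
    )
```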
Nicolay Gerold: And how much manual work is actually still involved? How much do you actually go over the data set after you've generated it, to check whether it's of high quality?
Adrien Morisot: Yeah, a lot. And that's the reason we're still in C land and not Python. So I talked earlier about software 1.0, where you write code to tell the machine what to do. In software 2.0, which is training neural networks to do something, the way you program the network is with the data. And so in the same way that you have to reread your code obsessively, add tests, and get PR reviews, where maybe one or two people look over the code, step through it, and try to understand deeply what's going on, the same is true for software 2.0, except what you have to do is not read code, but look at data. You look at examples of input-output pairs and check: okay, this makes sense, this makes sense. Would you trust a PR review if it stated at the top that it only read 1 percent of the lines of code and that they looked good? Probably not, right? It's the same here: for some stuff you have to read over the outputs obsessively. Make sure that the inputs make sense, that they're close to the distribution of inputs you're expecting, that your outputs make sense, that they're exactly what you want with no confusion. So you have to be pretty obsessive about just bathing in the data at all times, swimming in it, going, okay, this is interesting, this is neat. The data has to be both large in quantity and high in quality, which often means diverse inputs and diverse outputs. That's the way you get a performant model.
Nicolay Gerold: Are you automating this? So are you evaluating all the different inputs and outputs you generate and trying to get something like the badge on GitHub: test coverage 98%?
Adrien Morisot: Yeah, so there are automated equivalents of that, right? There's measuring diversity. There are some simple, smooth-brain ones, like, okay, how many n-grams overlap across all your data points. We have some clustering techniques too: maybe the n-grams don't overlap, but the data points are all very semantically similar. So we cluster them using our embeddings and then look at the overlap. That's one flavor of automated metric.
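A small sketch of those two diversity checks, assuming a hypothetical embed(texts) helper that returns one embedding vector per text from whatever embedding model you use:

```python
import itertools
import numpy as np

def ngram_set(text: str, n: int = 3) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def avg_ngram_overlap(samples: list[str], n: int = 3) -> float:
    # Pairwise Jaccard overlap of n-grams; high values mean the set is repetitive.
    overlaps = []
    for a, b in itertools.combinations(samples, 2):
        sa, sb = ngram_set(a, n), ngram_set(b, n)
        if sa and sb:
            overlaps.append(len(sa & sb) / len(sa | sb))
    return float(np.mean(overlaps)) if overlaps else 0.0

def near_duplicate_rate(samples: list[str], embed, threshold: float = 0.95) -> float:
    # embed: hypothetical helper mapping list[str] -> (N, d) array of embeddings.
    vecs = np.asarray(embed(samples), dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    upper = sims[np.triu_indices(len(samples), k=1)]
    return float((upper > threshold).mean())
```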
And there are more, but a very fundamental thing is still to just manually read the data. Another thing: you can look at interesting axes. If you take the extrema of different axes, you'll often find interesting things. If you take the longest data point, it might be weird. If you take the very shortest data point, the shortest five or the shortest ten, they might be weird. If you take the ones with the highest likelihood according to the model, or the lowest likelihood according to the model, those might be interesting too.
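A tiny sketch of that extrema inspection, with placeholder samples; the likelihood axis would come from your own model's scores:

```python
# Sketch: surface the extreme data points for manual review.
samples = ["example one", "a much longer generated example ...", "short"]  # your data

def extremes(items, key, k=5):
    ranked = sorted(items, key=key)
    return ranked[:k], ranked[-k:]  # bottom-k and top-k along this axis

shortest, longest = extremes(samples, key=lambda s: len(s.split()))
# If you also have per-sample log-likelihoods from your model, rank on that axis too:
# least_likely, most_likely = extremes(list(zip(samples, loglikes)), key=lambda p: p[1])
```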
And yeah, various automated measures. If you have code, how many tests pass for something. If you're generating tests, whether something compiles or not. Things like that.
Nicolay Gerold: And why do you think now is the right time to actually start fine-tuning models with synthetic data?
Adrien Morisot: Oh, because we've tried it and it works now. Whereas before, we tried it and it did not work. Now is the time in the sense that one thing all the big AI labs do is have very large quantities of human labelers, human annotators. At Cohere, we have an internal annotator army plus vendors; that's hundreds of human beings labeling data. And most companies in the world cannot do this. It's just very unreasonable to hire a bunch of human beings to label data. It's very painful. It requires much more expertise than you'd naively expect. It's a whole muscle to build, running a human annotation job. It's very difficult, especially if you want to do it well. But I think synthetic data has now gotten to a point where you can start doing some of the tasks that an annotator army would do, and do it for GPU cost rather than human cost, which is one or two orders of magnitude cheaper.
Nicolay Gerold: Yeah. Because when you're running human annotations, you often use multiple annotators on the same type of data and look at how well they correlate. And what you often find is that there's quite a drastic difference, even if you're doing more of a classification task and stuff like that. Do you actually try to use multiple different models to generate the synthetic data and then cross-correlate?
Adrien Morisot: Yeah, so we mostly use our own models, just because we have them and they're good. Recently Nvidia released Nemotron, which is targeted specifically at this, or at least the way they marketed it was: you can use this fully open-source model to create large quantities of synthetic data. So we might start using that; there's someone working on converting it to our infra right now. And yeah, you'd intuitively think this would make sense to increase diversity, right? Because if a core challenge of synthetic data is lack of diversity, then you'd expect using lots of different models to help. Another way to inject diversity, which is much more Cohere-specific, is just training the models in different ways. You have different recipes for training models, and those recipes tend to produce quite different behaviors. So you can get diversity that way, by putting the same prompt into three differently trained models and harvesting different outputs. That works too.
Yeah.
Nicolay Gerold: And I think you can flip the next thing in two different directions. So, behavioral cloning: for one, how do you actually avoid behavioral cloning with synthetic data, where you're just generating more of the same stuff? But also, how do you actually use synthetic data to overcome behavioral cloning in training models, because you often lack enough training data to get robust or diverse enough coverage?
Adrien Morisot: Wait, wait — could you expand on what you mean by behavioral cloning?
Nicolay Gerold: So behavioral cloning basically means that in your training data set — because often, especially in AI, there are a lot of different ways to answer the same question — you do not have all of the different ways, or even most of them. So the model basically overfits on what it has seen and isn't able to do stuff it hasn't seen before.
Adrien Morisot: Yeah, maybe some attempt at an answer would be a warning: synthetic data is not a panacea. If you create synthetic data and it's wrong or bad and you put it back into the model, the model will definitely get worse. So you have to make sure that the outputs of the model are correct before you feed them back into it. If the outputs of the model are correct and you have some guarantees of this, then it's probably fine: if you show the model examples of okay data, in a diverse way, the model will probably get better. Maybe one more concrete example: let's say you want your model to get better at a very precise flavor of math. Say you want it to be really good at calculus. If you ask it casually for the integral between zero and two pi of sine x over x or something, it just spits something out, and you have no idea whether or not this is correct. If you feed it back in, the model will conclude: okay, this is definitely the correct derivation, in the future I will do this. But if you have some way of verifying the output — maybe with a symbolic solver or something, you extract the math, turn it into code with something like SciPy, and make sure that the model's answer indeed matches up with reality — then you can put that answer back into the model.
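As a hedged sketch of that verification step, here is one way to check a model's numeric answer for that exact integral against a trusted numerical solver before keeping the sample (the parsed model_answer is a placeholder):

```python
import numpy as np
from scipy.integrate import quad

# Verify a model-generated numeric answer against a trusted solver
# before feeding the derivation back in as training data.
def integrand(x):
    # sin(x)/x, written via np.sinc to avoid the 0/0 at x = 0
    return np.sinc(x / np.pi)

reference, _err = quad(integrand, 0.0, 2.0 * np.pi)

model_answer = 1.418  # hypothetical number parsed out of the model's derivation
keep_sample = abs(model_answer - reference) < 1e-2
print(reference, keep_sample)
```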
So yeah, there's no free lunch. If you're not confident that the model can do the task, you need to do work to filter out the bad data points, either using humans or using code or some combination. There's no crazy free lunch.
Nicolay Gerold: Yeah. And because you touched on it before — the diversity of synthetic data, which is basically n-grams, cluster analysis, semantic similarity — how do you go in the opposite direction? If you want to train a model, or do a domain adaptation of a model, how do you actually try to ensure that the outputs and inputs stick to the domain and to how the words are used in that domain?
Adrien Morisot: Yeah, maybe make the question more pointed — pick a domain.
Nicolay Gerold: So what I've done most often is in finance. And the lingo in finance is a lot different from the lingo in most of the data or websites the models were trained on, and the model just doesn't have the right probabilities; the log probs aren't fine-tuned on that domain.
Adrien Morisot: Yeah, so maybe one thing is to just fill up your context with correct documents. Either the model is capable, zero-shot, of just speaking like a financier or a finance person, or, if it's not, you can probably help it along. Since context lengths are so big, you can pick your favorite piece of finance writing that has all the correct lingo you want, stick it in the prompt, and add some part of the prompt indicating that you want the model to behave, or speak, with the same style as the text above. Maybe you add a glossary of common finance terms that the model might not know.
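A rough sketch of that kind of context-stuffing prompt; the reference text and glossary entries below are placeholders you would swap for your own domain material:

```python
# Sketch: steer the model toward domain lingo with reference text and a glossary.
reference_text = "<paste an excerpt of a 10-K or other domain document here>"
glossary = {
    "EBITDA": "earnings before interest, taxes, depreciation and amortization",
    "basis point": "one hundredth of one percent",
}

glossary_block = "\n".join(f"- {term}: {meaning}" for term, meaning in glossary.items())
prompt = (
    "Reference text (match its style and terminology):\n"
    f"{reference_text}\n\n"
    "Glossary of terms you must use correctly:\n"
    f"{glossary_block}\n\n"
    "Task: write an analysis of the attached quarterly figures in the same "
    "style as the reference text."
)
```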
But that feels closer to assembly and early C, in the sense that I would expect all models over time to get much, much better at modeling finance lingo. Within a few years, I'd be shocked if the models didn't perfectly understand all of the finance lingo, just because there has to be so much of it on the internet, and all the big LLM companies are doing a better and better job of scraping the internet and figuring out what should be read, what's important. And obviously finance is such a huge, obvious use case for LLMs that I'd expect models to get much better at that, at least understanding the log probs and the lingo of finance.
Nicolay Gerold: Yeah.
Adrien Morisot: Are you finding that the models are not capable of dealing with finance lingo properly?
Nicolay Gerold: Yeah, especially, I think, because most of the financial use cases we do are really heavy on information extraction from financial documents. And if you look especially at annual reports and stuff like that, it's very specific, and there is so much boilerplate, which you usually can't just ignore. This boilerplate, in my opinion, really fucks up the log probs, especially because it's always identical, or mostly identical. And then the more important bits are very different depending on the industry you're in, but also on what accounting rules you have to comply with. And even the different accounting firms have different styles, like whether you use the law ID, why we're in Bergen, and stuff like that.
Adrien Morisot: I'm familiar.
My girlfriend works at KPMG.
Yeah.
Nicolay Gerold: And it's quite an interesting challenge, but I think it's very similar to legal in the end. Each country in legal is different because you have your own laws and rules. And it has the same challenge: so much boilerplate in contracts which you should just ignore for the models, because it's pretty much irrelevant.
Adrien Morisot: Yeah. Until that one clause.
Nicolay Gerold: Until that one clause. Yeah.
And can you walk us a little bit through what your method or way of working looks like for improving the synthetic data set? Not even touching the model, but really looking at inputs and outputs, fine-tuning the model, evaluating it, and then iterating.
Adrien Morisot: Yeah, it probably starts with evals. Let's say that you care about a specific issue, like a model pathology. Then you want to have a reliable way of measuring that pathology, ideally in an automated way; otherwise things take much, much longer and are much more expensive to fix. For a long time, model evaluation used terrible techniques like BLEU scores, which are terrible, or just simple classification evals, which don't work very well. Now models have gotten good enough that they can pull out information reliably from their own generations. So you can compose an evaluation suite: for a given prompt, make sure that this property is maintained, and this property, and this property. Then you can fully automate your entire eval process. Once you have that going, you can create synthetic data, or run an annotation job, and fine-tune the model. The fine-tuning recipe is pretty standard.
Usually there are only really two options. There's the behavioral cloning stuff, where you tell the model: okay, for this input, this is exactly the type of output that you want. Or DPO, or some sort of preference loss, where you have a good completion and a bad completion, and you're telling the model: okay, this is a good one, this is a bad one, make sure that you do more of this and less of that in the future.
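For orientation, the two data shapes he describes typically look something like this (field names are illustrative; match whatever your training stack expects):

```python
# Sketch of the two standard fine-tuning data shapes.
sft_example = {
    "prompt": "Summarize the clause below in one sentence.\n\n<clause text>",
    "completion": "The supplier must give 30 days' written notice before termination.",
}

preference_example = {
    "prompt": "Summarize the clause below in one sentence.\n\n<clause text>",
    "chosen": "The supplier must give 30 days' written notice before termination.",
    "rejected": "This clause is about termination and notices and other things.",
}
```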
Depending on your setup, you run training, and then you see if the metric moves or not. If the metric doesn't move but you're confident your data is good, then you go look at the specific outputs for your specific test cases, and there are ways to figure out why your model is still misbehaving. Maybe the task is too hard. That's possible, right? If you're trying to distill very complicated concepts into a smaller model, even if your data is perfect, it won't work, because the model has no capacity. It's like asking a three-year-old to be very good at quantum mechanics; it's probably not going to happen. So if you try to teach a one-billion-parameter model something too complicated, it will just fail. But otherwise, hopefully the metrics just go up.
Nicolay Gerold: Can you give me an example of the properties you're evaluating on?
Adrien Morisot: Yeah, there's often a suite of them. We're quite enterprise-heavy, so we work a lot with enterprise customers, and the stuff they tend to want is just clean syntax, often length constraints. So one thing is: you have a GUI, it's very rich, there's lots of stuff going on, and you want the LLM output to go in this little box, this very precise box in this very precise GUI. And the designer said, okay, you have a box of this size, and a box of this size fits 150 to 200 words. If you go under, it looks ugly, and if you go over, it looks terrible. So it's very important that you stick to that. That's one type of constraint, and there are more.
There's repetitiveness, there's not hallucinating — you can check for that too. There are different flavors of criteria; if you've worked with LLMs, you'd recognize them. Basically, if you have a constraint in your prompt — write it between 100 and 200 words, or don't be repetitive, or make sure you format it in Markdown — you want all of those constraints to be adhered to very precisely, since you bothered to put them in your prompt. So you can just pull a constraint out and ask the model: hey, is this Markdown properly formatted? Actually, for that one you could probably just run a piece of code to check it. Or you can ask: is this text repetitive? The models are pretty good at detecting that. Is this text referring to the right thing? Is this in the right tense? Stuff like that.
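A couple of minimal, code-only checks of the kind he mentions, as a sketch (the thresholds and the crude markdown heuristic are assumptions):

```python
# Sketch: cheap, hard-coded checks for constraints that don't need an LLM judge.
def within_length(text: str, lo: int = 150, hi: int = 200) -> bool:
    # Enforce the "box of this size fits 150 to 200 words" constraint.
    return lo <= len(text.split()) <= hi

def looks_like_markdown_list(text: str) -> bool:
    # Crude structural check: at least two markdown bullet lines.
    bullets = [line for line in text.splitlines() if line.lstrip().startswith(("-", "*"))]
    return len(bullets) >= 2
```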
Nicolay Gerold: Yeah, are you also
using LLMs as a judge in there?
Adrien Morisot: Oh, yeah.
Nicolay Gerold: How do you actually
evaluate whether the LLM as a
judge is well correlated with
human judgments or actual results?
Adrien Morisot: Yeah. A different team at Cohere that I used to work on, the RAG team, looked at this. It seems to be the industry standard now to use LLMs as a judge quite often, just because humans are so expensive and not necessarily that reliable. And basically, the path to getting this to work well is to break down the task enough. If you just naively ask, hey, is this answer good, is it adhering to the prompt, then you're not going to get a particularly compelling response. But if you go through manually and ask just one thing of the LLM — hey, is this text repetitive? — and then you sanity check it by writing one repetitive thing, and if it says it's repetitive, great; and then you copy-paste Barack Obama's Wikipedia page and ask, hey, is this repetitive, and it says no, then great. You do some simple sanity checks to make sure that your prompt is working. But otherwise, we should trust the models to do quite simple things. Assessing whether a piece of text is in the past tense or the present tense is a simple task; measuring whether something is repetitive or not is a simple task. And so if you have lots of constraints and you break them out into individual prompts, so the model has all of its intellect to focus on one prompt, then it usually works quite well.
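A minimal sketch of that decomposition into binary judge prompts, again assuming a hypothetical ask_llm helper wrapping your chat API; the two checks shown are just examples:

```python
# Sketch: break a compound rubric into one binary LLM-judge question per constraint.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

CHECKS = {
    "repetitive": "Is the following text noticeably repetitive? Answer only YES or NO.",
    "past_tense": "Is the following text written in the past tense? Answer only YES or NO.",
}

def judge(text: str) -> dict[str, bool]:
    results = {}
    for name, question in CHECKS.items():
        answer = ask_llm(f"{question}\n\nText:\n{text}")
        results[name] = answer.strip().upper().startswith("YES")
    return results
```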
Nicolay Gerold: Have you actually evaluated what works best — whether to output a score on a rubric, like zero to ten, whether to output classes, like bad, okay, good, or even just binary classifications?
Adrien Morisot: Yeah, yes. It's not super clear. It's not super clear, and we ended up kind of half fine-tuning for a specific way that we've used since. So it's not obvious which is the most correct. It probably just depends on the setup, but we use a mix of all of that: true/false, or zero to ten. Personally, I never use zero to ten. It feels a bit sketchy. How do you tell between a four and a five and a six? It's not too meaningful. I prefer the binary ones.
Nicolay Gerold: Yep. I think if you do a scale, do it the way Prometheus does it. In that paper, they basically have you define the entire rubric and what the different scores should mean, which I think is probably the best way if you use scores. I prefer binary evaluations as well, because especially if you have some more hard-coded checks, most of those are binary too. If you're assessing whether the code runs, it's true or false; if you're assessing whether the markdown is formatted properly, it's probably also true or false.
Adrien Morisot: Yeah. And everything becomes simpler when it's binary, yeah. But recently someone did a scale out of five: add one point if this criterion holds, add another point if this one, add another point if this one. And that works okay. I think we're still in the early days of this technique, and the models are still evolving so fast that we probably should never be too obsessive about any particular technique. We should just be flexible.
Nicolay Gerold: Yeah. What do you think people, regular AI people, should take away? Because nearly no one is at Cohere's scale; most work on smaller, domain-specific language models or build wrappers. How can they use the stuff we just talked about, or what you're researching, in their AI systems?
Adrien Morisot: Yes, I think there's just a lot to be done. I really do think of this as the beginning of a new kind of paradigm in computing, where you can move from terrible programming of neural networks, which is so painful, so uncomfortable, to something that's closer to how humans think: you can basically program the neural network purely using English and a Python for loop. And so, tangibly, what I would do is get the models to first evaluate stuff. If you have a system where right now a human is looking over it or sampling from some outputs, just have the model start evaluating stuff synthetically, giving back quantities, measuring how well you're doing. Maybe this is not super tangible, but it's a big class of problems. The models are good enough to evaluate stuff now — according to your criteria, of course.
And then, if you're the one training neural networks at your company, maybe you have some custom search thing, or some custom small generative models that you're fine-tuning. The way to program those things effectively is probably no longer going through the internet and scraping and finding data, or human annotation. It's probably just politely asking the model to create large quantities of data for you, plugging them into your system, fine-tuning, and then seeing if that works better. And it probably should. As the models get better and better, it should get simpler and simpler to get this working well.
And the benefits of this are: right now you have these very big, very expensive models, and they're slow. If you have a specific thing that you want to do, and you want to do that and nothing else, then you can achieve very large cost savings. It's a weird analogy, but it's like you have this Python function and it's fine, but it's slow, and you want to distill it. You want the fastest possible program without the Python memory overheads and all the things that make Python slow and clunky. The way to do that is to create synthetic data using a very big, very powerful model and distill it down into a tiny one that does the thing you want really well, really fast, and really cheap.
Nicolay Gerold: Yeah.
And especially, I think, with LLMs, creating data sets for classification now is so easy and so cheap. It's crazy.
Adrien Morisot: Agree.
Agree.
Yeah.
Nicolay Gerold: What are your favorite tools? We talked about how you would love to see more small classifiers, that they're a bit of an underappreciated technology nowadays, because people just slap an LLM with some structured generation on top of everything. What are your favorite libraries, your favorite model types you use nowadays? Are you still on T5 and BERT?
Adrien Morisot: No. It depends what for. But yeah, I basically use the whole Cohere stack to do all sorts of stuff. We, weirdly, train classifiers, and we have generative models and embedding models and re-rankers. That's basically all you need to do lots of very useful things. And I think my favorite library is one that hasn't been built yet. I really want someone to build it. You can see these nascent trends towards creating synthetic data to train neural networks, but one thing that hasn't been built yet is a whole ecosystem of tooling to make that happen. When you write code, you get code review, and that happens in GitLab or GitHub. When you create data for models, that happens in GCS buckets or AWS buckets, and it's opaque. The tooling around data is very poor. Internally we have something, and it's fine, but there is not yet a GitHub for data, and there really should be, because it's so critical. It's clear that whoever builds this will be rich and famous.
Nicolay Gerold: Have you checked out LakeFS? Do you know it?
Adrien Morisot: No, I don't. Maybe I should.
Nicolay Gerold: It's a library, and they are basically building what you said. They have built a metadata engine to manage data versions, and they frame themselves exactly as you mentioned: Git for—
Adrien Morisot: For data. Okay. What's it called again?
Nicolay Gerold: LakeFS.
Adrien Morisot: LakeFS. Is the FS for file—
Nicolay Gerold: Yeah.
Yeah.
And they're basically building data lakes on object storage, especially. To me, they seem more like a table format in the end as well, using that for versioning of the different files. Nice.
And if people want to start building the stuff we just talked about, where would you point them? Where should they go?
Adrien Morisot: They should use all the Cohere tools, because they're great. And yeah, we've also released the weights for some of our models, which I think most people haven't realized. That's cool.
Nicolay Gerold: Yeah.
And if people want to follow along with you, where can they do that? Where would you point them?
Adrien Morisot: Great question. I recently got back on Twitter, but I have a kind of tortured relationship with that platform. I'm at Kale Divergence: K-A-L-E Divergence.
Nicolay Gerold: So what can we take away when we want to use synthetic data in our training processes? I think, first of all, quality is key, which isn't a surprise. But even though synthetic data seems like it's completely automated, it really isn't. They are building a lot of custom filters to keep the quality high, but also run human checks on the samples to ensure that the training data is good, because bad synthetic data is likely worse than having no data at all.
And there are a lot of different methods they use: pulling out the key information from the different samples and analyzing whether it lines up with what you are training for, checking semantic clusters, and running automated checks for diversity, which can be done through semantic similarity, but also n-gram overlap and other more traditional NLP metrics, and also looking at edge cases. For example, you can look at the length of the different texts you're generating and check: do I have any outliers, like really short or really long texts? Then manually inspect them and filter them out if they don't have passing quality. For that, you basically try to automate as much as possible, and then flag the potential bad examples, which you can't filter out automatically, for human review.
In the case of the big labs, they for sure have additional classifiers they have trained. So likely a few hundred people just label examples as good synthetic data samples or not, and then they train a classifier on that to classify the rest. Through that, they already have a filtering algorithm. Since you are likely not working at their scale, you have to do it by hand, but it also means that you have a lot fewer examples to check.
The second part is to implement a quality control flow. So use the synthetic data to train a new model, but actually check whether it improved on your test data, which should come from the real world. Make sure that your test data is still real-world data that you have actually seen in production, because only through that can you be sure that you're not overfitting on a certain set of cases.
On the testing approaches, maybe to go into that again, I think they had a bunch of interesting mentions. For example: use code for basic checks, like markdown formatting or valid JSON formatting; use AI judges to judge simpler things, like whether the text is repeating itself or whether it is well structured (not in terms of markdown formatting, but text structure); and break evaluation into a lot of simple yes-or-no questions rather than having the model grade something complex on a scale from one to five. And never trust the AI judgment without any sanity checks. You should always add an additional layer on top of the AI judge so you can actually control it as well. You can imagine it like a Swiss cheese model: you have a lot of different layers, and when one layer uses AI, you have a layer on top of it, which is more of a heuristic, something hardcoded, or a human review, which validates the results of the AI, even when you're using the AI for evaluation itself.
For domain adaptation, a few obvious pieces of advice: fill the context window with a lot of domain-specific examples, documentation, and texts; add glossaries with definitions for terms that are used in other fields as well but have a different meaning in your field; and match writing styles to the target domain. What are you actually seeing? How do people write? I came across a case where people used the search engine with really abbreviated shortcuts, and these abbreviations were different for each company. So you actually had to do a domain adaptation to help the model understand what these abbreviations mean and which abbreviations are actually similar. And also, check whether the outputs stick to the domain rules. Often you can define a few rules which they have to comply with, so you can use those as well to check the data.
And yeah, that's it. I think it's very interesting, and I'm really excited to see whether we reach the paradigm where we can use synthetic data generation to automatically train a new model on a certain task, deploy the model, and use it. From my experience at the moment, I don't think it's good enough yet that you can rely on purely synthetic data. You always need at least a few real samples, something like 50 percent real-world and 50 percent synthetically generated, to ground the synthetic examples in reality. But I'm really looking forward to seeing the evolution of the field, because it could speed up the entire ML and AI life cycle by a lot.
Otherwise, let me know what you think and whether you want to try synthetic data. I would love to hear the use cases where you actually want to use it, or where you think it could be of use to you at the moment. And otherwise, share it with your friends, leave a comment, leave a review, even if it's a bad one. I always want to know what we can improve. It helps out a lot in making it better for you as well.
And otherwise I will catch you next week, when we will continue our series on search. And because it's Christmas, I already wish you a Merry Christmas, and I will catch you after Christmas. Have a good one.