Nicolay Gerold: RAG isn't just one thing.
It's two.
It's retrieval and it's generation.
It's search and it's prompting.
And we need to treat those
two components separately.
So we can optimize each part on its own.
And when something breaks, we actually
know where the mistake is coming from.
And usually, you want to first
worry about the retrieval before you
start to worry about the prompting.
And how to get the retriever right is what
we talked about in this entire series.
We want to be methodical: create test sets, use simple search metrics like precision at K to measure search quality, collect user feedback, and continuously improve our search system, optimizing it for what our users are actually looking for.
So we want to make sure we are actually
finding the right context before we are
optimizing how we present it to the model.
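As a small illustration of that kind of check, a precision-at-K computation over a hand-labeled test set might look like the sketch below; the document IDs and relevance judgments are placeholders for whatever your own test set and retriever produce.

```python
# Minimal precision@K check against a hand-labeled test set (hypothetical data).
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

# One query from a test set: the IDs the search system returned,
# and the IDs a human judged relevant for that query.
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1"]
relevant = {"doc_2", "doc_4", "doc_5"}
print(precision_at_k(retrieved, relevant, k=5))  # 0.4
```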
And once we have our retrieval
system working, we can start to
engineer the model's context.
And the second part, the generation and how to present the context to the model, is what we will be talking about today.
Today we are talking to John Berryman.
John has worked at a lot of places.
He started at Eventbrite Search,
then at GitHub Code Search,
then went into GitHub Copilot.
So he has a lot of experience, first in search, and now more on the Retrieval Augmented Generation or AI-driven side of search.
And he recently also wrote a book
on Prompt Engineering for LLMs.
And we will be going deep on how to actually present the context to LLMs for RAG, but also in general when you're tackling a problem with LLMs. So we'll be talking about how to format it, for example using Markdown, or structuring finance data like SEC reports so it's a familiar format to the LLM, but also ranking information by importance, or adding content until the quality plateaus and then stopping.
So more isn't always better.
And we'll be looking at a lot of different mental models you can apply to your problem, how to think about prompting and presenting information to the LLMs, and also concrete techniques that you can implement in your system.
Let's do it.
Why did you move from search
into prompt engineering?
What gave you like the nudge to move on?
John Berryman: Okay.
So search was just a coincidence. Anyway, I was an aerospace engineer early on, and I loved the math and I loved the software, but I wasn't geeking out like everyone else around me about the satellites. So I got into search just to get into software.
It was a pretty good fit, but all along
the way, I was clawing my way back towards
more math, more data science, more ML.
So a few hops later, I went to Eventbrite,
did their search, went to GitHub,
did their search for code search.
And then that project was wrapping up
and I got the opportunity to jump to
data science, which was where I always
wanted to be in ML and stuff like that,
but this hop was data science. And I found out that data science was mostly PowerPoints and SQL queries, which is not quite what I thought I was signing up for.
And then a spot opened up on the Copilot team.
And I had no idea at that point,
really, what a large language model was.
I had no idea how impactful
it was going to be.
And I was like, yeah, this
is something different.
Feels like ML.
And I did it without any
extra criteria than that.
And yeah, it was a pretty good fit.
It turns out search fits in
pretty well in that domain too.
So I get to lean back on it occasionally.
Nicolay Gerold: What do you think are the processes and practices from search that you use most often in working with LLMs?
John Berryman: If we're talking about RAG, everyone talks about RAG as if it's a thing. RAG's not a thing. RAG is two things. And if you keep them separated, it's a lot easier to cognitively manage the problem. It's search, and it's taking search results and sticking them into a prompt.
And so the search problem is not a
particularly special search problem.
It's just search.
So you get to use as much of search
background as you like for that.
Optimize search and then
later optimize the prompt.
Nicolay Gerold: Yeah.
And when it comes directly to working with LLMs, do you think the systematic approaches from search lend themselves to prompt engineering as well?
John Berryman: Yeah.
Gotcha.
Sure, absolutely.
So any way that you could evaluate a search problem, you can evaluate the search problem that's hiding inside the RAG problem. Because in a RAG problem, you're either finding the search results yourself or you're having the model choose to make a search. But from that point inward, it's just a search problem. There's nothing particularly special about it in the context of RAG.
And once the data all comes back,
then you're optimizing how it's
presented to the models so that the
model can understand it and all the
typical prompt engineering stuff.
But it still breaks down into two different problems. It's not RAG. It's search and it's prompt engineering.
Nicolay Gerold: Yeah.
And when it comes to prompt engineering, can you maybe touch on the Little Red Riding Hood principle? This was probably my favorite metaphor in the entire book.
John Berryman: Yes, I'll have to give full credit to Albert Ziegler, my co-author, on this one. He's really good with these little metaphors.
But so if you're thinking about Little Red Riding Hood, what's the story? The little girl is supposed to go to her grandmother's house. The mother says, you need to stay on the path. If you get off the path, you're going to get in trouble.
And so the metaphor is that whenever you're dealing with prompt engineering, if you can stay on the path of whatever the models have seen in training, if you can make your prompts mimic whatever has been in the training data, then it's going to be a lot easier for the models to understand and anticipate what's happening and to do the right thing. For the original models, the pure completion models, the way this would come out, let me just use an example.
If I just type in a question like, hey, can you help me figure out how to solve this math problem, then a likely response to that is going to be another question. It's going to be like, hey, can you help me solve that math problem? Because in the training data for these original models, before they were instruct tuned and before they had reinforcement learning with human feedback, you didn't see this Q and A stuff.
But you could actually format the document so that it looked like something it's seen in training. At the top it could literally say, this is an FAQ for this company or whatever problem space you're in. At the beginning of the question, you put "Q:" and ask your question. And at the start of the response, you put "A:". This is a pattern that the models have seen over and over in their training, and they would typically respond in kind.
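A tiny illustration of the kind of pseudo-document he's describing for a pure completion model; the FAQ content here is made up:

```python
# Sketch of an FAQ-style pseudo-document for a pure completion model.
# The model is likely to continue in kind because this shape is common
# in its training data. (Illustrative text only.)
prompt = (
    "This is an FAQ for Acme Widgets customer support.\n"
    "\n"
    "Q: How do I reset my password?\n"
    "A: Click 'Forgot password' on the login page and follow the emailed link.\n"
    "\n"
    "Q: Can I change the email address on my account?\n"
    "A:"  # the completion model continues from here, in the same format
)
```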
Now, as time has moved forward, that's proven such a useful paradigm that these frontier models have been trained with a new type of training set. They've trained them so that with instruct models you have the Q and A type interface, and with the chat models you have something much more robust: this is the user talking, this is the assistant talking.
So I've had some people say that the Little Red Riding Hood idea was maybe most useful early on, when we had just the completion models. But it's actually still really important, because even though the models now lock you into here's the user asking the question, here's the assistant with a follow-up, whenever you present information to the model it's still good to use motifs and patterns found on the internet, found in the training data, right?
Things in Markdown: models have seen tons of Markdown. If you have an idea to present to the model, like financial information, use the nomenclature, use the formatting of SEC reports, something that the model has seen over and over again. And that will mean there's much less you have to do as a prompt engineer to explain all these extra details to the model just so it can start processing. Put it in a place where it's used to working.
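As a small, hypothetical illustration of that idea: present financial facts in a Markdown layout that echoes familiar filing conventions rather than dumping raw key-value pairs. The company and figures below are invented.

```python
# Hypothetical sketch: formatting financial figures in a familiar Markdown layout
# before putting them into the prompt.
record = {"company": "ExampleCo", "period": "Q3 2024",
          "revenue": "1.2B", "net_income": "210M", "eps_diluted": "0.87"}

context = f"""## {record['company']} Results of Operations ({record['period']})

| Metric      | Value |
|-------------|-------|
| Revenue     | ${record['revenue']} |
| Net income  | ${record['net_income']} |
| Diluted EPS | ${record['eps_diluted']} |
"""
print(context)
```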
Nicolay Gerold: Yeah.
And I think the same goes especially in coding. I feel like you can steer the quality of the output a little bit by using certain best practices. So if you put docstrings at the top of a file, you bias towards better answers, in my opinion, because well written code bases have that same structure: they are well documented, well commented, and so on.
Do you have any techniques that you use to actually figure out which parts of my prompt fit into the Little Red Riding Hood principle and which do not, or do you rather go by gut feel?
John Berryman: A little bit of both. I think early on gut feel is actually going to be very important, and I suspect this will come up a few times during our conversation. Everyone says, oh, vibe testing, you can't be just vibe testing. But right out of the gate, you better darn well be doing a decent share of vibe testing. You need to write the input, assemble the prompt, see what the output looks like, and start jotting down, cataloging as a human with pencil and paper, the things that are going wrong.
And over time you build up an intuition for how these things can fail. You build up an intuition for how you can test them. And so you create all these tasks for yourself, but slowly, as a pattern emerges, you start firing yourself from these tasks and backing off and having LLM as judge and stuff like that.
But there are other things that you can do, and again, this is something you should have Albert on your show to explain, because he did more of the research on this. He found that you can look at things like the probabilities of the tokens and figure out how comfortable the model is with a particular portion of the prompt.
So for example, he did a lot of neat work with few-shot prompts, where you stick in a bunch of examples to prime the model to provide a certain, similar output. And he did some really interesting work where the obvious question is, how many examples do you need before it's not worth continuing to dump in more and more examples? He found out that you can basically look at the average of the probabilities, not the logits; he's actually taking the exponent. So you look at the average of the probabilities. It's not quite the perplexity, it's a little more naive than that.
But as you start dumping more and more examples into the prompt, you can look at the probabilities of the example tokens as you're putting them in there. It's a very noisy sample, but if you average across multiple iterations of this, you'll see that initially the model appears surprised, perplexed, as you start adding in these few-shot examples, but eventually it levels off and you're no longer gaining any more interesting information out of the examples.
Another related thing is looking at the initial probabilities when the model first begins its completion. If the average of the first, I don't know, five or ten tokens is low, then that's indicative that the model is going to make something up. Not necessarily hallucinate, but the model is not as certain starting out, and it's not as likely to have a good answer. That's another thing he was able to find out.
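A rough sketch of that probability check, assuming an OpenAI-style chat API that can return per-token logprobs; the model name and the exact response fields may differ across providers and SDK versions.

```python
# Average per-token probability of a completion, as a rough confidence signal.
import math
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize: ..."}],
    logprobs=True,
    max_tokens=50,
)

token_logprobs = [t.logprob for t in resp.choices[0].logprobs.content]
probs = [math.exp(lp) for lp in token_logprobs]

# Low average probability over the first few tokens hints the model is
# uncertain about how to even start its answer.
head = probs[:10]
print("avg prob of first tokens:", sum(head) / len(head))

# The same averaging, applied to the tokens of each successive few-shot example
# (scored as part of the prompt), is what lets you see the "surprise" level off.
```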
Nicolay Gerold: Yeah.
So this basically means that there are a lot of possible tokens which could have been taken. You don't have one token with a very high likelihood, but probably multiple with a medium or low one. And I would expect that in a few-shot prompt, the model would basically be surprised, have a low probability, for the average words when I give it the inputs, and for the outputs it should also have low probabilities in the beginning, but it should level off. So basically it's learning from the outputs of the few shots what the pattern is and how to generate it, so it isn't surprised anymore. It has learned how to generate the pattern, but the inputs still bring a little bit of surprise every time, so they actually add something new to the few shots as well.
John Berryman: It's funny.
There's actually a good metaphor.
I feel my own brain doing this
as you're asking me questions.
I'm like, how the crap am
I going to answer this?
And then as I talk, I'll like
figure out, okay, I've got
a good narrative going here.
I think I'll stick with it.
Nicolay Gerold: Yeah.
And I think when it comes to prompt engineering, what's most complex for me is that you're always adding stuff, because that's the easy default, but it's really more about balance. You have to add stuff and you have to remove stuff, and this is where the measurement comes in. But how do you actually decide when it's worth adding additional rules, or when it's actually necessary to remove stuff, when there is too much context or too little context?
John Berryman: Yeah. We dealt a lot with this in the middle of our book. How should I tackle this problem? There's a bunch of stuff that you add to a prompt, and maybe, to build a framework for our conversation here: there are static pieces and there are dynamic pieces of your prompt. The static pieces are the things that are not going to change at all from one instantiation of the prompt to the next. Code completion would be a great example of this. With code completion you always have the same structure, the same boilerplate that allocates positions in the prompt for whatever the code is going to be. But the dynamic stuff is the things that are going to change every time.
If you're looking at a new file, then obviously the bulk of the prompt is going to change, as you change position in the file, as you open new tabs that we're gathering information from. And so a big part of the game, especially early on, but it still carries on today, is how do you pack as much useful information into the prompt without overfilling it.
Now, initially this was absolutely critical. It was game over if you overfilled the prompt. It's not game over now if you overfill the prompt, because you've got a hundred thousand tokens. When we first started with Copilot, I think the models were as low as 2K tokens. And so within 2K tokens you've got to fit the context useful for making your completion, but you've also got to fit the completion as well.
And the big game that you play here is figuring out, of all the content that you gather together, all this stuff that might be relevant to this prompt, how to score it: this is tier one relevance, I basically have to include it; this is tier two, it's really nice to have; this is tier three, I'll fit it in if I get some room, but if it doesn't land in there, it's going to be okay. There's no one-size-fits-all approach to this, but when you're really concerned about overfilling the prompt, you've got to build an engine to look at all of these chunks, consider their sizes, consider their priorities. And then usually what you're doing is you organize them from highest to lowest priority, and you take as many of the high priority things as you can until you fill up your prompt.
Now it's even trickier than that, because when you're shoving things in there, you have to make sure they fit with the boilerplate. If you've got a child that's high priority, you've got to make sure that its parent is in there too. Some things are order dependent. And so it's a real juggling act to figure out how to fit all these pieces in there.
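A minimal sketch of that kind of packing engine; the token counting and the snippets are placeholders, and a real system would use the model's actual tokenizer and handle the ordering and parent-child constraints he mentions.

```python
# Tiered context packing: score candidate snippets, sort by priority,
# and add them until the token budget is spent.
def rough_token_count(text: str) -> int:
    return len(text) // 4  # crude approximation; use a real tokenizer in practice

def pack_context(snippets, budget_tokens):
    """snippets: list of (priority, text); lower priority number = more important."""
    packed, used = [], 0
    for priority, text in sorted(snippets, key=lambda s: s[0]):
        cost = rough_token_count(text)
        if used + cost <= budget_tokens:
            packed.append(text)
            used += cost
        # lower-tier items that don't fit are simply dropped
    return "\n\n".join(packed)

snippets = [
    (1, "Current file contents ..."),      # tier 1: must include
    (2, "Signatures from open tabs ..."),  # tier 2: really nice to have
    (3, "Older related snippets ..."),     # tier 3: only if room remains
]
print(pack_context(snippets, budget_tokens=1500))
```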
Now, fortunately, today we don't have as much of a problem with that because the prompts are so very long. You're usually not going to hit the hard wall where stuff just crashes. But you still have to be cognizant of it, because the more information you shove into the prompt, the longer the latency and the higher the cost. And we also know for sure that you can't just shove arbitrary stuff into these prompts, because they do get distracted. Attention is a good metaphor: they will lose their attention and not be able to satisfy the goal as easily if you've just got every possible thing inside the prompt.
Nicolay Gerold: And is it the same for completion and chat models, in your opinion, when it comes to prompting and filling prompt templates? And do you think regular completion models still have a place?
John Berryman: That's a really good question. When chat models first came out, I was like, ah, they're already fixing us into a paradigm and closing down the options. And so I was pretty bearish on chat completions for the first month, not that long, but a little bit at first, because think about what they're doing. With completion models, what you're doing is taking the first part of a document and having the model figure out what the rest of the document looks like. So in principle, you can pretend to be any type of document you want.
Just to demonstrate its flexibility, you could say at the top of the pseudo-document, this is an FAQ, we're going to talk about this stuff, and it's Q and A, and then you've got an instruct model. Or you could say, this is a transcript between a support agent for this tech company and a user, and here's the dialogue, and you've got a model that basically became chat. But with the chat models they said, nope, that's all you're getting.
And so initially I was grumpy about that, but it became obvious so quickly that the chat models were going to be just super flexible. So initially at Copilot, at GitHub, the completions were the completion model. I left there in May, so I don't know what's changed since then, but I've been using Cursor, and I get the feeling that everybody, even for completion-type experiences, is moving away from pure completion models and moving towards chat models. And surely that will continue, because the frontier model providers are almost not offering completion models anymore anyway. So I think at this point they're very generic. They are the way that I think we should go.
Nicolay Gerold: I think I rarely see the completion models, because most use cases you implement are similar to what other people are doing as well. So what you're doing, or what you want to do, is in the instruction tuning data set, and you're better off using a chat completion model. But I think there are some cases where using a completion model might be valuable. These are more the classical generative AI cases where you actually have to generate many options to get a useful one out of them; the outputs are way more raw, and the likelihood of getting a real outlier, something really interesting, is very high.
I think Linus Lee from Notion had a really cool demo on writing, where he fine-tuned a completion model on his own writing, and it got way closer to his way of writing than a chat completion model did, because with a chat model you would have to train over the entirety of what it has seen in its instruction data set so far.
John Berryman: It is interesting, right? You do lose some stuff when you go to chat. For example, if I really do want it to make the bottom of this document, and I've got the top of the document, then one of the early frustrations we had with the chat models is that you'd have to say, now please complete this. And then it would say, okay, here's the completion of this. And then you're trying to figure out how to pick out the completion, and you've got this interim text that you don't know what it's doing to the probabilities. It's probably messing things up.
It's probably messing it up.
And then you've also got really cool
things that you used to be able to do
with models that you can't do with,
you can't do with open AI anymore.
You can do it with the anthropic models.
But it was called inception
another good little metaphor
where basically you start talking.
On behalf of the assistant.
So if you can, if you own the whole
completion document, the pseudo document,
you can say, this is a transcript and you
can start the answer for the assistant.
And for example you'd say if you
wanted your answer to be only yes
or no, and you don't want this.
Extra commentary, then you
automatically start the assistant
response as if I had to say it in one
word, yes or no, it would be quote.
And then you know exactly
how to pick out the word.
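A hedged sketch of that inception trick using response prefilling with the Anthropic messages API: the prompt ends with a partial assistant turn and the model continues from it. The model name and the yes/no question are placeholders.

```python
# "Inception": start speaking on behalf of the assistant to constrain the reply.
import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=5,
    messages=[
        {"role": "user", "content": "Is this function thread-safe? ..."},
        # Prefill: the final assistant message forces a one-word yes/no shape
        # that is trivial to parse.
        {"role": "assistant",
         "content": 'If I had to say it in one word, yes or no, it would be: "'},
    ],
)
answer = resp.content[0].text.strip().strip('"').rstrip('.')
print(answer)  # expected: "Yes" or "No"
```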
But again, it's like the reinforcement learning with human feedback stuff: these models have gotten really well aligned with humans at this point. And structured output has become very dependable; it was a pain in the butt initially. So I really see it being less and less of a problem as the models get smarter and more aligned with what we typically want.
Nicolay Gerold: Yeah, I'm still always wondering whether we are converging to the average user or whether we will still get something interesting out of it just due to sheer probabilities. Maybe moving a little bit more onto the corporate side: when you're looking at companies especially, how would you go about looking at use cases and determining whether a use case is a good fit for an LLM?
John Berryman: So point me in the right direction on this. Do you have any kind of examples in mind, or should I completely go off the deep end?
Nicolay Gerold: I want to get you in the deep end. I think that's the more interesting part.
John Berryman: So it's generative. There's a lot of cool things that you can do with these models that you couldn't do with tabular ML. I still think, and there's probably an interesting asterisk on this in a second, that tabular ML is going to have its place for a long time. If you've got a very specific problem where you want to predict or classify or find an estimated value, something like that, then oftentimes you're still going to want to use a classical model. They're going to be cheaper, smaller, embeddable. But it's not always the case.
If you're working with pure numerical features, then it's probably still the case. But if you're doing something like, here's a document and I want you to classify this document, that leans much more towards an LLM. It's a very generative type of task. If you're summarizing, if you're providing content towards users, like customer support, stuff like that is very LLM. And I could probably keep iterating through the individual things that feel like a good fit for an LLM.
But if we push it too far, I think we get into tasks that LLMs are not quite cut out for yet, where basically you are assuming the LLM is going to be exacting and super duper accurate in its responses. I've seen a couple of companies at this point where it's, we're going to do code generation, that's a decent example, and we're going to say, here is the specification for the code, and we want it to just provide the correct answer. Stuff like that is going to be super challenging to get right immediately. You're going to have to do a lot of breaking it down into workflows and tasks and steps, and a lot of these still need the human interaction. So I think there's something to be said for LLMs, but if you've got an LLM task, always make sure at this point to still pull the user into it, because you can't assume it will always be correct.
Nicolay Gerold: And you already mentioned the two different types of applications, workflows and assistants. Can you maybe quickly break down the difference between the two, and when you would actually opt for one over the other?
John Berryman: Both are very useful in very different regimes. So to define them first: an assistant is going to be anything where the human is front and center in the loop with the avatar, the assistant. And one thing I was just talking about with these models is that I don't really trust them yet to go off on their own and do whatever they want to. So you really have to have the user in the loop. And the way that you do that with assistants is the user is talking to the assistant, and when the assistant says, I'm going to try this, how's this, the user can provide the correcting force to always keep it on track. So that's how you achieve useful work with an assistant: by the human correcting it and giving the assistant the tools it needs to reach out and do interesting things.
Now with workflows, you still need users in the loop. Workflows are typically something that you would have the model go and do on its own. But if you say, here's a giant task, solve it for me, come back, that's just not going to work. Instead, the way you keep the user in the loop, a different way of keeping the user's hand on the steering wheel, is at the beginning. You have a designer for this workflow, and they say, okay, I'm going to try it and see if it works, and it won't, because it's too big of a task. So then a human looks at the workflow and figures out: if I were to do this, how would I break it up into steps, and how would I make sure that these steps are as fail-safe as possible, something that ideally is not going to break? And then when there are failure modes, how can I become aware of them and give some feedback, or maybe even pass it up to a real human and say, I've gotten myself into a situation, can you figure it out? So those are, I think, the two main paradigms for how these things are going to work for a lot of applications.
And where's one better than the other? The assistants are obviously better when you have users that want to interact with these things, who really want to keep their hands on the keyboard and direct these models to achieve something faster than they would have achieved it by themselves. Code generation, Cursor, that's a great example. Support interactions are another great example: you're directing the assistant to look up information for you. Maybe, and I haven't personally used these yet, something in the future that does trip planning and can not only look up information but actually take actions for you in the real world. That's all good stuff for assistants.
Workflows are for big blobs of work that humans just don't want to do anymore. It would be things like: scrape the web, and for each one of these web assets that comes back, perform a set of steps to glean structured information out of it, because at the end of the day I want to process all this stuff and have a table full of information for each website, or something like that. Those are the types of things that I think workflows are going to be useful for: streaming or batch processing that involves a lot of natural language expertise.
Nicolay Gerold: What are
your thoughts on agents?
John Berryman: You put air quotes around that, so you're priming this conversation subliminally. No one knows what they're talking about with agents. There's no commonly accepted definition. If you ask anyone what an agent is, they make up something on the spot, as I shall do for you now.
I think a lot of the time, when you hear the word agent, it's conflated with tool usage. And I think that's a reasonable part of it: you give a model the ability to reach out into the world and do a thing. But tool usage, in my opinion, doesn't imply agency, because I've used tools for emitting structured content. That's gotten a little easier now that a lot of frontier models have structured outputs, but one of the ways you used to be able to do it, and you can certainly still do it, is to say: read this and call this function to provide a structured output. That's not an agent. That's just question and answer.
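A rough illustration of that structured-output-via-function-call pattern; the function name, schema, invoice example, and model are all made up for the sketch, and this is only one of several ways providers expose structured output today.

```python
# Structured extraction by forcing a single "tool call" whose arguments
# are the structure we want back.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "record_invoice",
        "description": "Record the fields extracted from an invoice.",
        "parameters": {
            "type": "object",
            "properties": {
                "vendor": {"type": "string"},
                "total": {"type": "number"},
                "due_date": {"type": "string"},
            },
            "required": ["vendor", "total", "due_date"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Invoice text: ..."}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "record_invoice"}},
)
print(json.loads(resp.choices[0].message.tool_calls[0].function.arguments))
```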
So when I think of agency, to go back to our previous conversation, I think of it as being on this spectrum from assistants to workflows. Even though it feels like two different things, I feel like it is a spectrum. On one end of the spectrum, you have an assistant, which is acting as an agent on your behalf. It has its own volition; that's an important part of an agent. It can make its own decisions, chart its own course to a certain extent. But it's bounded by the user nudging it back in the right direction, saying, no, please don't empty my bank account, we're going to watch the way you're using your tools, and that keeps it on course.
On the other end of the spectrum are the workflows, where the agent is deciding. The agent being the LLM, but also basically the DAG around it, the instructions about the steps. The combination of the LLM and the instructions about what to do next is charting a course through some sort of space to figure out an answer. It's running into a problem, coming back, and doing stuff. So both of these are roaming around under their own volition, to some extent, to solve a problem.
And they do merge occasionally, because you can have an assistant where you're just chatting with a thing and it has some basic tools, but you can design them, if you wanted to, so that when it gets a certain type of problem it says, ah, I've got a workflow for that. And the assistant can bop over into, I'm going to do this workflow, it's got a set of steps, it's much more likely to succeed, and then come back from the workflow. On the other end of the spectrum, sometimes these workflows, if they're too complicated, you could allow them the capability to reach back out to a user and say, I've gotten to this point of the workflow, you told me to tell you when I got here, and here I am, so work with me. And then you can do all sorts of mixing and matching in between.
You might have a task inside your workflow that is really two agents working together. It gets a little bit risky to do that. But you can have the user proxy and the assistant, and the user proxy tells the assistant everything that it thinks I would have told it, so I don't have to be the one standing there telling it, until the assistant is like, ah, crap, I'm in trouble, I need the real version of me to pop back in. I don't think agency is really well defined, but I think it has to do with allowing these little creatures to investigate the world, almost always holding hands with a human so that they stay on the right track.
Nicolay Gerold: Yeah, I would
place agents above that.
I think agent is something we
should be striving for, not
something we have right now.
Because if it's an assistant, or it's well defined in a workflow, for me this is not really agentic. Also, if I'm using an LLM to control the flow of a graph, so basically using it in if conditions for the control flow, deciding where it should go next or making tiny decisions, I don't think this is agentic either, because I could do the same with a rule based system, with heuristics, or with a classifier. For me, that's not really agency. I think we should really define agents as they are in reinforcement learning: something that inspects the state, determines the next action, executes it, and then repeats the process. And over time it learns the policy that determines when it is best to take a certain action or not.
John Berryman: So you have a hard definition of agency. Whereas I say agents are probably still holding hands with the human, you explicitly strike that: it's only agency if they can navigate an unseen world without any humans around them, learn from their landscape, and interact with it.
Nicolay Gerold: Then we basically have a term that's actually usable, rather than using agent as a catch-all term for anything that's built with LLMs and uses tools.
John Berryman: Makes sense.
Nicolay Gerold: Yeah.
We've talked about workflows already. In your book, you also go into breaking down problems, and you had two different approaches for that: breaking it down horizontally and breaking it down vertically. Can you maybe go into the different approaches, what they mean, and also how you know which to use?
John Berryman: Sure.
Whenever you set about setting up a workflow, probably the best thing to do at the very start is to give the entire thing to the model in one fell swoop. Just say: here's the task I have, here's all the content that you might need, solve this for me and get back to me. And then it'll come back to you with something that looks like a Picasso. It's probably not going to make sense if the workflow is sufficiently complicated. That's when it's time to start breaking stuff down.
So the first way I would usually think of breaking things down, I'd say, is horizontally, which is basically the steps. You've got a workflow, and you can look at this initial run of your agent and figure out where it went off the rails, and you say, okay, it's getting confused here; now that I as a human have had time to think about it more, I think we need these five steps. Each one of these steps, and this is somewhat of an ideal, everything I say right now is going to be somewhat of an ideal, a format for thinking about it, each one of these five steps is going to be well contained. It's going to have an input and an output that are well defined, like we have the schema for them, and it's not going to talk with the next step in the process. And if we are able to do that, which is a bit of an asterisk, if we are actually able to do that, then it becomes much easier to reason about.
Because we don't have to ask, why did this fail, where it could be for any of a bazillion reasons. Rather than going on a random search through this giant space, we've gone through a guided search, and each one of these substeps is effectively a waypoint. We can say: did we make it through that? Did we make it through that? We can write evaluations for each one of these little things. Instead of having just the one end result, we might find out that there are error states or something that we really have to account for, and we can start creating this graph of instructions: if you've run into this type of error state, go here. And the workflow can effectively become a bunch of steps.
Now, that doesn't always work either. And so a different way of thinking about it is to slice it vertically. Scaling can sometimes be an issue with this too; this is still a way of thinking through it, and practicalities might still run you over when you start scaling vertically. But if you have this one task, solve the problem, and there are really five sub-problems within it, then instead of expecting this one task to do all five different verticals well, you say, okay, maybe I'm actually dealing with five different things. I'll have some sort of task to figure out which state I'm in, and then I'll traverse down this set of steps because I'm in this special case.
This is all very vague.
Let me see if I can ground it
a little bit with an example
from a previous customer.
I was working with a startup that was doing SOX compliance. It sounds terribly boring from the outside, because you have these companies that are looking at controls, these sets of rules that they have signed themselves up to follow so that they provably are not engaging in fraud and stuff like that. And then auditors can come in and say, this is the rule you set up for yourself, and I can see from this evidence that you've met these rules. So that doesn't sound necessarily very interesting, but actually once we got into the problem of setting up workflows to do this, it was really neat. Because the problem does break down into steps at first, and then vertically.
The initial part of the problem is you've got to look at the control document, which is like the rule set. It has a bunch of attributes that must be met. And once you absorb the meaning from this, you can take that meaning to the next step and say, okay, here are documents that have some evidence that might pertain to this. So you have another step that's gleaning the important stuff out of those, summarizing, something like that. Then you can say, all right, now that we know what we're trying to find, and we know about these documents that probably have pertinent evidence, pull out the evidence. Maybe you have some tools that go into the documents and actually grab cells from a spreadsheet, or find places in the PDF that you can draw a line around.
And then, since we have all the evidence, assemble an evidence report, which takes all this stuff and puts it back together in a way that the auditors can look at later. So something that was completely impossible to do in just one hit, once you started thinking through it, you can break it down into a handful of steps that make it more reasonable. The steps can be fairly well isolated; you can optimize every one of them in isolation. But that's horizontal.
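As a very schematic sketch of what that horizontal decomposition can look like in code: each step has a defined input and output, so it can be evaluated and debugged in isolation. The step names just mirror the example above and are hypothetical; the bodies are stubs where model calls would go.

```python
# Horizontal decomposition: one well-contained function per step.
from dataclasses import dataclass

@dataclass
class ControlSummary:
    attributes: list[str]

def absorb_control(control_doc: str) -> ControlSummary:
    """Step 1: pull out the attributes the control says must be met."""
    ...  # call the model here

def summarize_evidence(docs: list[str], control: ControlSummary) -> list[str]:
    """Step 2: glean the parts of each evidence document relevant to the control."""
    ...

def extract_evidence(summaries: list[str]) -> list[dict]:
    """Step 3: pull concrete cells / PDF regions that substantiate each attribute."""
    ...

def assemble_report(evidence: list[dict]) -> str:
    """Step 4: assemble the evidence report for the auditors."""
    ...

def run_workflow(control_doc: str, evidence_docs: list[str]) -> str:
    control = absorb_control(control_doc)
    summaries = summarize_evidence(evidence_docs, control)
    evidence = extract_evidence(summaries)
    return assemble_report(evidence)
```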
A lot of times, and I don't think they've completely solved the problem yet, you also need to figure out: this worked really well for this very common type of audit, which is 20 percent of the things we see, but there are other types of audits where, if we happen to know we're in one of them, we need to specialize. We'll ask more surgical questions because we know how to deal with it. Or, this is a type of document: I've gotten to this step, and it could be a PDF with evidence, or an Excel spreadsheet, or a photo of an applicant, or something like that. For each one of these things, maybe we give it different tools; we specialize it in some way. So it still becomes tricky, because if we're breaking things down both horizontally and vertically, it becomes difficult to scale at some point. But at least thinking it through this way makes your problem something you can approach, rather than I'm just going to throw an LLM at it and hope it works.
Nicolay Gerold: Yeah.
I think what would be really interesting is how you think through integrating LLMs into existing applications, especially from the perspective of GitHub, because I think we see more and more patterns emerging. We have the autocomplete, which is basically just a lightweight suggestion that the user can accept. Then you have the user-triggered interaction, which is either chat or even the edit in Cursor: you basically give it a command, it executes it, and you can then review it. But you also have automatic context gathering, which is just suggested in a sidebar. How do you actually think through the different ways you can integrate an assistant into the workflow of a user in the end?
John Berryman: It's a very broad question. Are we thinking from a technological perspective or a product perspective?
Nicolay Gerold: Rather a user interface perspective. So if you tackle a certain task for a user, how do you think through how to best use an LLM to serve them?
John Berryman: It really depends on whatever the original product was. I think you have some products that have been begging for a long time for technology like this. For example, anything that has customer support in it: at this point, I would love to jump on the website and know that I have instant contact with an LLM that has full access to the documentation. So you have a lot of easy, low hanging fruit like that. But then in other things, I don't know, like code editing, for example, that's been a really interesting eye opener. We've slowly learned over time how you need to integrate these things.
People have been writing code forever in their IDE, but the first step of, oh, it makes sense to have something that completes the code for you, turned out to be surprisingly comfortable. I don't think people realized that it was going to be a fairly comfortable user experience until we actually started working with it. And some people originally were like, this is distracting, I don't like it at all, but I suspect they're probably using code completion at this point. And then as ChatGPT emerged, people started figuring out more and more use cases for these things in an assistant-type flow.
And you can think about which workflows out there need a pairing session, brainstorming. Again, with code editing that's come up a lot, but I think you'll start to see it in other domains like writing text, copywriting. You're starting to see this in the frontier models with OpenAI's Canvas and with Anthropic's Artifacts, where they're starting to encourage people to say, just come in here and we're going to brainstorm, pair with you, work on that thing on the right side of the screen. And, your question is very general, I don't think we're going to be at the end of all the possibilities anytime soon.
One of the easiest places, and it's not a user experience thing, but one of the easiest places to start integrating an LLM into your work is the backend. As your company is getting familiar with how these things work, start offloading backend tasks: summarization, pulling structured content out of stuff. I think there are a lot of behind-the-scenes uses for these things, and we're just going to keep learning more and more as we go, I think.
Nicolay Gerold: Yeah.
And I'm really curious, what are, to date, your favorite tools where you would say, this is what I expect from tools that integrate with LLMs?
John Berryman: Probably an obvious answer, and one that I should be ashamed to say coming from my background working at Copilot, but I really love Cursor. It's really good. It feels like one unified experience. They make it easy for me as a user to understand what the model sees. Maybe it's even easier to just personify this: I feel more like I am pairing with a person. I can see, by the way, the information I've entered.
I can see what lines and, more than that, what files they're looking at. I can see the thought process when they make a recommendation for me, for something to change. It's easy for me to say, I will accept that one, I will deny that one. When it's making the change, and this isn't LLM stuff, this is just great user experience, I can see the red-green diff on the screen and immediately know, I approve this, this makes sense. It's just a lot easier to navigate. It feels a lot like I'm working with a human. It's a really cool experience.
Nicolay Gerold: And what would
you say is missing from the
space or what is something that
you would love to see built?
John Berryman: I haven't given
any thought to that recently.
I don't know.
I don't know.
I don't know.
We're gonna have to edit this part
out because all I'm going to be able
to say for a moment is I don't know.
Can we, let's come back
to that one at the end.
Let my subconscious stew
on that one a little bit.
Nicolay Gerold: Yeah.
And I think one of my favorite questions, by the way, is always: LLMs and agents are the big trends, but what do you think is a niche, underappreciated technology at the moment that more people should use in data and AI?
John Berryman: LLMs are my bread and butter, and I guess search is my other bread and butter, so I'm nailing the things that people are typically thinking about right now. But I've been thinking about something for a while, and I don't know if it's possibly true. I'm going to answer your question with a slightly different question, a slightly different approach.
I wonder if it is possible for these LLMs to be used in place of traditional tabular AI models. This is different from something I said earlier in the conversation, and I did put an asterisk beside it. If you have a tabular data set, a bunch of numbers, a bunch of features, then it makes sense to stick that into any of the traditional classification or estimation algorithms, something like that. But what if you're really trying to get out not just a number, but answers about a space, like the horse racing stuff that we were talking about earlier?
Then I wonder if you could take these models and train them with all the conversations about the questions that you want to ask. I know what the input features would be for figuring out which horse is going to win a horse race. I know what the input features would be to figure out the likelihood that mating these two horses together leads to the best outcome. And you can think of all sorts of others: I know what the input features would be to figure out how well this particular horse would perform at a certain racetrack in certain weather. Rather than having one model predict a very specific thing here, and a different model predict a very specific thing there, what if you could train these large language models to generalize over all of that?
Train it with your training set and your holdout where you know the truth, and you can use that information to have conversations about what you think. These things have shown they can model statistics well enough to model human speech with very high accuracy. There's surely enough flexibility in the transformer model to model the information I'm talking about, if only we have enough data to feed into it. I've torn the original question to shreds, but that is something I would like to research: how well these models can generalize and be used in place of traditional tabular AI for certain categories.
Nicolay Gerold: And I think it uses LLMs in a context they weren't trained on, but it uses them in a way where they are actually useful, because it would be really inefficient to train a model on every single task you might want to run a hypothesis on. So you basically use them as an engine to come up with a hypothesis and test it. And because you can come up with basically infinite different combinations, there is value in using an LLM, because you can do way more once they are robust enough to actually give a useful output.
It's a really interesting idea.
And if people want to start building with LLMs, what are the resources you would recommend to them?
John Berryman: Honestly, I wouldn't worry about tooling nearly as much as just getting your hands dirty with the ideas. The best thing to get started building with LLMs is to have access to one of the frontier models: grab an API key, grab a Jupyter notebook, and just start trying things. Typically when I'm prototyping something, I'll have a Jupyter notebook open, and I'll have a file in the background where I'm building up a library so that the notebook doesn't get too messy. And I'll just build whatever I think is interesting, try it out, see how it works. That will get you from 0 to 1 really quickly.
For a lot of the recent blog posts I've written, I've been building little prototypes, and it's all super quick stuff built using that pattern.
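In practice that first notebook cell can be as small as the sketch below, against an OpenAI-style client; the model name and the prompt are placeholders.

```python
# Minimal prototyping loop: one helper, call it, look at the output.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("Give me three test cases for a function that parses ISO dates."))
```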
And then after you do that, you start reaching for the more sophisticated things. You start getting into, I need to evaluate these things. I've seen 10 examples of it working pretty well, but how well does that generalize? If I run this a thousand times, is it really going to have the same feel? At that point, I think you start reaching for tools. As you productionize stuff, you absolutely have to log the prompts and the completions and all the metadata around that, and start getting that evaluative feedback loop going. But initially, just get your hands dirty.
Nicolay Gerold: Yeah.
And if people want to follow along with
you and your work, where can they do that?
John Berryman: I recently put together a website, Arcturus Labs, arcturus-labs.com, that's A-R-C-T-U-R-U-S dash labs dot com. That's my consulting site. I've got a blog there, which only has three posts right now, but I hope to be adding a lot more content over time. I hope to keep the content pretty fun: it's going to be a lot of prototypes, hands-on stuff, things that you as the reader can use. I've actually got little embedded prototypes that you can play with. And besides that, you can always find me in the book that I just wrote. I hope Nicolay will stick a little URL for that somewhere. It's all about the information you need to build applications with large language models.
Nicolay Gerold: So what can we take
away when we are building applications?
I think, first of all, what we
touched upon in the very beginning,
that RAG is not just one thing,
but two separate components.
And if you keep them separated,
it's a lot easier to actually
cognitively manage them.
And we have well established practices and processes to optimize search systems; the practices for LLMs aren't as advanced yet. I think that's another reason to keep them separate: you can really focus on the search component, optimize it, and once you have nailed that down, you can play around with the LLM much more, because you know the errors you're encountering are unlikely to come from the context you're feeding in, which you just retrieved. And the prompting, since it's downstream, is very difficult to optimize when your retrieval has a very low or mediocre success rate.
And in the prompting, you then can
basically try all the different
techniques we just talked about.
So try different formats and different
ways to represent the information.
And I think this often comes down
to the nature of the information
you're presenting, but also the
task you want the model to do.
I think one mental model I would like to have is: for similar tasks, in which type of documents might the correct answer appear, and how are those documents formatted? For financial information, that's often something like SEC documents, so you format your context like that. And also, how do you want to construct the context? If you have information that's chronological, you probably want something like a timeline, so the ordering actually reflects that.
This is still a very open area, and you probably won't get one piece of advice that solves everything. Rather, you want to test a lot, try different formats, and see what works best. Always vibe test it first, then establish quantitative tests through LLM as a judge or heuristics to actually see whether there is a quantitative difference.
For me, it's often an escalation. I start with vibe checking because it's way faster. With the quantitative metrics, at some point you have a quite large test suite that you want to run, so it gets more and more expensive. So I want to already have some confidence that the change I'm making to the prompt is actually meaningful. So I'm first vibe checking and then running my test suite.
And in that, since you keep the
components separate, you basically
also have two separate test suites.
One basically for the retriever quality
and one for your LLM and your generation.
And this often makes it
easier to pinpoint issues.
And especially for the LLM generation
part, I actually often include
wrong retrievals in my test set
just to see how well my model
handles mistakes or wrong contexts.
I think what John talked about in the
Little Red Riding Hood principle is very
interesting, that you actually want to
mimic whatever has been
in the training data.
And since training data is web data to a large degree, think about how this information is represented on the web. So study common formats, documentation sites, Q&A forums, technical books, technical documentation, and basically note down for your specific use case how the information is structured in the wild.
But also, how could I rephrase the task so it's similar to a common task that is given to LLMs? They are always trained on Q&A, but a lot of people also do simple writing tasks or throw in a webpage and ask a few questions. So often you can rephrase or reframe the task in a way that's more suited to an LLM, because it has been trained a lot on it.
And also, match the conventions that are present in the domain. For finance, structure it like SEC filings or analyst reports. For code, use docstrings, comments, and good patterns when you're feeding in code. For medicine, follow the medical standards; in medicine you often have different lingo or different classifications that are used in different fields, and models are very sensitive to that, because they pick up on these subtle details. For legal, depending on the task, you can mimic, for example, case briefs, legal memos, or really proper legalese like you would see in legal textbooks.
And test the format variations and see what works best. I think vibe testing is something that has become a meme and then has become frowned upon, now that a lot of people are really pushing evals. For me, escalate the effort with your confidence. First start simple, do vibe checks. When you see it doesn't work, just disregard it; you don't have to run your tests. When it doesn't perform any better than your current prompt, or it performs worse, I often think it's not worth pursuing. When you find something that already shows you, hey, this has way better performance on a few examples, then you can decide to actually run your test suites.
I think one more important point, which John mentioned, is to document the failure modes. This is often really connected to your logging and monitoring. And it's actually hard to nail down when the model fails, since a lot of what we are doing is really fuzzy. It's text we are putting out and delivering to the user, so we have to infer from the user's context and what they do with the information we are presenting whether it failed or not. Try to figure out a way to actually identify failures in generation, and then, once you have a decent chunk of failures, try to figure out categories or commonalities in the inputs or the outputs, which you can, for example, put a classifier on top of to see whether any of them is occurring, or put guardrails in place to target specific failure modes.
And lastly, one really interesting approach he has is categorizing the information he's feeding into the model in different tiers. Tier one is the critical context that must be included, tier two is the helpful context, and then you have the nice-to-have context. It's maybe less talked about right now because the hype is all about long context models, but with that frame of mind, I would classify everything you're putting into the context into these three tiers and then actually evaluate: when you leave tier two and tier three away, is the performance the same, or even better, because the model is more focused? And just the frame of mind of classifying the different pieces of context you're feeding in into different tiers or levels of importance already gives you hints about what you could do with this information.
And also, when you decide to actually include all of it, this lets you leave parts away when your context would be too long to feed into the model. But that would also mean you're feeding over 32,000 or even 100,000 tokens into it.
For me, I never got the best results by just cramming the context window, but rather by optimizing the retrieval, focusing more on that, and then going for really high quality context. And yeah, I think that's most of what I would take away from the episode when I'm building stuff.
Let me know below what you think
of the episode and whether you
want to have more book deep dives.
Especially for the upcoming season on MLOps, I've read through a bunch of different books to catch up a little more with the state of the art, what people are saying, and what the best practices are.
I'm also very open to doing book deep dives and then interviewing the authors afterwards.
Let me know whether you would be
interested in stuff like that.
Otherwise, if you liked the episode and you're still listening to the 10 minute outro, leave a review, it helps out a lot. Leave a comment if you have any feedback, also the critical or negative kind, it helps me improve.
If you have any suggestions on topics or guests, also let me know.
Just send me a request on LinkedIn,
Twitter, BlueSky, whatever.
Very happy to hear from you guys and
otherwise I will catch you next week.