Free Form AI explores the landscape of machine learning and artificial intelligence with topics ranging from cutting-edge implementations to the philosophies of product development. Whether you're an engineer, researcher, or enthusiast, we cover career growth, real-world applications, and the evolving AI ecosystem in an open, free-flowing format. Join us for practical takeaways to navigate the ever-changing world of AI.
Michael Berk (00:00)
Welcome to another episode of Freeform AI. My name is Michael Berk and I do data engineering and machine learning at Databricks. I'm joined by my co-host. His name is...
Ben (00:08)
Ben Wilson, I build tools that allow people to build better agents at Databricks.
Michael Berk (00:13)
Yes, and that is actually the first topical description of what you do that I think you've ever given, and we're very excited about it. So today we're going to be talking about exactly that. There was a post, sort of clickbaity, that Ben and I were chatting about earlier that said prompt engineering is dead and the future is auto-optimization. And I'm very curious about your take, Ben. I think I agree with this sentiment
somewhat, but I think manual prompt tuning still has a role to play in the coming years. So taking a step back, an LLM is basically just a text-input, text-output machine. And the way that you influence behavior, there's a variety of ways, but the most common one is changing a prompt. And when you say prompt engineering, you're typically referring to the system prompt. The system prompt is the high-level instructions for this
LLM call, effectively. So if you say talk like Shakespeare, it will hopefully start using Shakespearean language. If you say you're an expert at extracting medical information from a website, it'll hopefully be better at that. You can also define the structure of the output, sort of the logic or reasoning steps that it should take. And there's sort of a continuous scale from having a really, really long system prompt that does a bunch of complex tasks in one call
versus agents, which delegate and break up different components of a task and have each component be its own LLM call. So many LLM calls versus one giant LLM call. That's sort of the continuum. I think that's enough of my talking. Ben, I'm curious what your take is, just a very high-level take, on this debate between prompt engineering and auto-optimization.
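A minimal sketch of the system-prompt idea Michael describes, assuming the OpenAI Python SDK; the model name, prompt text, and input are illustrative placeholders, not anything discussed on the show:

```python
# Sketch: one LLM call steered by a system prompt (assumes the openai>=1.x Python SDK).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an expert at extracting medical information from web pages. "
    'Respond only with JSON of the form {"condition": str, "medications": [str]}.'
)

page_text = "Patient presents with hypertension, currently on lisinopril."  # placeholder input

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # high-level instructions
        {"role": "user", "content": page_text},        # the actual input
    ],
)
print(response.choices[0].message.content)
```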
Ben (02:02)
Yeah, I think the big change that's happened in this industry that unlocks the ability to do more advanced things is this concept of agents. An agent, to carry on from your explanation of LLMs, is basically giving an LLM the ability to just execute stuff, call what we call tools, and it can go and fetch information.
If anybody's been playing around with any of the chat interfaces, like GPT-5 on ChatGPT, or you're talking to Opus on Anthropic, you're interacting with an agent, something that has the ability to search the web and fetch data. Or if you're using one of the coding CLI tools like OpenAI Codex or Gemini CLI, or my personal favorite, Claude Code,
you're basically interacting with a system that has the backing of an LLM. For instance, for Claude Code, you're either talking to Sonnet or Opus. Opus is much better; it's a more advanced reasoning model. And what these things are able to do in the CLI is basically run Linux commands
to say, hey, I'm in this repository and I need to be able to fetch code to get contextual awareness of what this user is actually asking me to do. So you're saying, I need this feature built in this repository, and you provide instructions like, here's where I want this code to be and here's the systems that it interacts with. Even if you explicitly tell it to do something, it's still going to go
on and do its own things, particularly if you enable it to do that. And in my experience, it can find things that I would miss, just due to its sheer ability to ingest a large amount of information in a short period of time. It's not doing that in a single context window, though. It's breaking up the tasks that it's going to be doing into sort of a decision train and saying,
okay, the first thing that I need to do is understand this concept. It'll pull that data. It'll summarize it in its own format to make judgment decisions for its reasoning logic. And it'll continue iterating before it'll say, okay, now I understand this, so I'm going to write some code. It'll either ask for human prompting or you can allow it to do that automatically. And I think one of your
initial takes that you mentioned, that manual prompt engineering is definitely something that's not going to go away: I a hundred percent agree. It can't go away, and we can talk more about why I feel that way. But in my interactions with agents, I allow them to do certain things autonomously, but I never let them do everything autonomously. I've done tests where I've allowed that,
and usually the outputs are, they're not garbage, they're just not what I'm looking for. And if I allow it to do too much without intervention and sort of re-guiding it to how I want this thing to be, it can build, in my opinion, sometimes true abominations. Or it's like, okay, you're overreaching your authority. You're overstepping your bounds. You're touching stuff that you shouldn't touch.
You need some sort of guidance. But the beauty of these things is they can churn out so much content so quickly. And that's why I love these things. Like, I hate typing. Not hate it exactly, but I don't enjoy, particularly in the prototyping phase, typing out thousands of lines of Python
or Scala or Java, only to find that what I'm trying to test out isn't the right direction, and then I have to delete all that and redo it or try to edit it and make it work. I've got better uses of my time than just hammering on a keyboard.
Michael Berk (06:15)
So, long story short, yes, there have been a lot of advancements in agentic flows. They tend to be pretty good at writing code for simple things, but they still need a lot of human in the loop. And so my question to you is where we sort of draw this line. So in traditional prompt engineering, you have basically a string. It's like a body of text. And you play around with it. You say, all right,
give me a JSON format. Give me a JSON format. Please give me a JSON format or else I'll break your fingers. Whatever it might be. And then you run it through the LLM query and look at the output. And you could do this in a supervised fashion, where you have a truth set and compare predicted to observed, or you can do it by just manually reviewing. Those are sort of the two flavors of manual prompt engineering, supervised or unsupervised.
It seems to be pretty good at injecting context that is in someone's brain. So if a human works at a company and they know that task X should always follow this structure or this format, they can put that down into words and the LLM will typically follow it. But the next logical step is, all right, well, putting your entire knowledge into a system prompt is sometimes time consuming. It's also challenging.
And so the holy grail is to have the LLM learn via examples or via instructions, just like if you were onboarding to a new role, they would teach you; have the LLM actually learn via those examples. The initial thing that sparked this discussion is that prompt engineering is sort of out and auto-optimization is in. Prompt engineering definitely has a place, but Ben, I know that you have been working on some
Ben (07:48)
Mm-hmm.
Michael Berk (08:02)
explorations, at least, for auto-optimization-based tooling. Can you say a little bit about that?
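A minimal sketch of the supervised flavor Michael describes, comparing a prompt's outputs against a small truth set; the call_llm helper and the data are hypothetical stand-ins, not anything from the show:

```python
# Sketch: manual prompt engineering against a tiny labeled truth set.
import json

def call_llm(system_prompt: str, user_input: str) -> str:
    # Hypothetical helper: wire this to the chat model provider of your choice.
    raise NotImplementedError

SYSTEM_PROMPT = 'Extract the medication name from the text. Respond as JSON: {"medication": str}.'

truth_set = [
    {"input": "Patient was prescribed 20mg of Lipitor daily.", "expected": "Lipitor"},
    {"input": "Continue ibuprofen as needed for pain.", "expected": "ibuprofen"},
]

correct = 0
for example in truth_set:
    raw = call_llm(SYSTEM_PROMPT, example["input"])
    try:
        predicted = json.loads(raw)["medication"]
    except (json.JSONDecodeError, KeyError):
        predicted = None  # the prompt failed to produce the requested JSON shape
    correct += int(predicted == example["expected"])

print(f"accuracy: {correct / len(truth_set):.0%}")  # tweak the system prompt, rerun, compare
```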
Ben (08:08)
Yeah, we've tested out stuff using some of these CLI tools. We use them every day for everything that we do at Databricks, and most people in our echelon of R&D companies that are building software are using these things every day. We don't give it a couple of sentences
to build a feature and say, go do this and file a PR and we'll just merge it. Bad things happen when you do that, and that's not good engineering practice. What we do, though, is work with these assistants in such a way that we can guide them. I think I talked a couple of episodes ago about how I do it, where I'm like, hey, we're going to do a design first. We're going to create a markdown file that is something that maybe isn't
super useful for humans to read because it could be verbose in areas that we don't need it to be verbose in, but it's also very truncated in other parts. Like it could be ambiguous for us to understand, but there's almost like a shorthand notation that these agents can use that they understand when they see this very brief sentence or this bullet point, how to interpret that properly.
And we've done testing within our team to basically say, well, for those instruction sets for this design, can we have it go through its first iteration of a loop of building example code without actually generating the code in the repository? Just saying, hey, write a test that validates the behavior and also write a script that would implement this behavior.
Put it in the repo root, name it whatever, and then make that test pass after I accept that the test is accurate. And as you're doing that, update your instruction set, not your CLAUDE.md file, but that project design for this PR. Optimize that before we start actually writing code. And that loop, that human-in-the-loop optimization, where it's basically optimizing its own instruction set for the task that we're working on.
It does a really, really good job of getting that design to be pretty solid, almost to the point where once we're good with that, we can actually say, okay, now write the PR. And it might generate a 2,000-line PR with tests and everything, and the feature is actually implemented correctly, because we spent half an hour or an hour going through it with it, making sure that the design is what we want and it
captures that context properly. So why not extend that to the next logical layer, which is: can it do not just implementation of framework code, but applied engineering, and have it use its ability to track data and evaluations of quality in such a way that it can empirically get better over time?
Michael Berk (11:08)
Okay.
Ben (11:08)
And that's
kind of my take on the optimization of a prompt. And what we're talking about is the system prompt, the thing that is the instruction set for your application that's interfacing with an agent or even with a chat LLM.
You can create test data just like you would with traditional ML. That's like, here's my validation set of ground truth. Now, in the world of LLMs and GenAI, your ground truth is something that requires human feedback, but it can also be supplemented by LLMs themselves. So basically the dev loop is:
you write code for an agent. You're ready to test that in a development environment. So you start interacting with the agent, you ask it a bunch of questions, and that generates traces, like span information for each stage within your agent application. Those get logged. In our case, we're logging them to MLflow. And MLflow has APIs that allow you to store those in perpetuity as well as
the ability to add feedback to them. Whether that feedback comes from an LLM judge or from a human, the best-practice process is to run it through LLM judges first. Perhaps set a bunch of expectations like, hey, with this input, I'm expecting that the tone is X and that it
accurately answers the question and mentions these topics in its answer. You can put anything in an expectation, whatever is actually related to the functionality of that agent and how it should respond to that input. And then you get outputs from it, which is: I'm executing this input against my application and here's what the robot told me
as an answer. So now you have inputs, outputs, and expectations associated with each other as a data set. You can then evaluate that, send it to an LLM and say, based on what you see here, this was the input question, this is what I got as a response, and here's what my expectations were for the response. Basically, rate this. Tell me how good this was. And it'll evaluate that and it'll provide its own feedback attached to that data,
and that gets stored in MLflow as well. Then you can have a human go in and look at that evaluation that the LLM gave and basically give it thumbs up, thumbs down, or reasons why its own adjudication of the quality of that was a little bit off, because it's not going to be perfect, because that judge in and of itself has its own prompt. Now, historically, those prompts have been static, right? They're just like...
here's a named score that I have that does these things and has an instruction set for the LLM to follow. Maybe that's not super optimized. Maybe it needs a little tweaking. So you can go in and create your own score and spend that manual loop of like, okay, I need to make sure that this catches this nuance and I'm going to add some more instructions here. Maybe it got confused on this line. I'm going to remove that. So you would go and edit that and then
run evaluation again until you get better results. There's nothing stopping you from automating that: using that LLM adjudication and the human feedback as well, looking at how aligned those are, and then having an agent actually optimize that judge prompt so that there is better alignment between the automated system and a human.
When you have that and you have pretty good alignment, you can then take one of these agentic tool systems like Claude Code or Codex or Gemini CLI, and you can just connect it to the source code for your agent, the thing that you're actually testing, and have it basically improve itself. Say, hey, look at the source code, look at the results from the evaluation, figure out
some recommendations of what we should change in the code, then go make a couple of those changes and run an evaluation. Now that you have this sacrosanct data set that has been vetted, you can make some code changes, test it, and see: did it improve? And when you're looking at that measurement of improvement, you're no longer looking at human alignment, because you already got that. You know that the judge is good,
it's accurately measuring the quality of your application. Now its task is just to make sure that the outputs match the expectations.
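A rough sketch of the loop Ben outlines. The @mlflow.trace decorator is a real MLflow tracing API; the judge, the human labels, and the eval set are hypothetical placeholders standing in for whatever judge prompt and review UI you actually use:

```python
# Sketch: trace an agent call, compare LLM-judge verdicts to human feedback, measure alignment.
import mlflow

@mlflow.trace  # MLflow records a trace with span information for this call
def my_agent(question: str) -> str:
    # ... your agent logic (tool calls, retrieval, LLM calls) goes here ...
    return "placeholder answer"

def llm_judge(question: str, answer: str, expectation: str) -> bool:
    # Stand-in: a real judge would be another LLM call with its own (tunable) prompt.
    return expectation.lower() in answer.lower()

# Pretend thumbs up/down labels collected from a human review UI (hypothetical).
human_labels = {"How do I reset my password?": False}

eval_set = [
    {"question": "How do I reset my password?", "expectation": "mentions the account settings page"},
]

agreements = []
for row in eval_set:
    answer = my_agent(row["question"])
    judge_verdict = llm_judge(row["question"], answer, row["expectation"])
    human_verdict = human_labels[row["question"]]
    agreements.append(judge_verdict == human_verdict)

print(f"judge/human alignment: {sum(agreements) / len(agreements):.0%}")
# Only once alignment is high would you let an agent iterate on the app using the judge alone.
```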
Michael Berk (15:39)
All right, may I begin my saucy takes? So first, before that, just setting some context: this is a concept that's been around since the concept of agents, basically, this eval loop. There are common patterns; this one, I forget exactly what it's called, and it has different names, but it's basically a supervisor-based flow where it will keep iterating and improving based on an objective criterion until success is met, and then it will exit.
That can be human in the loop. Ben was talking a lot about the eval prompt itself. That's really, really important to think through. But the key hopeful innovation is that you can plug in these really robust CLI-based code optimizers, and they can test a broader range of hypotheses. So instead of just a system prompt with an LLM call, you can leverage that coding agent or agentic system to not just do one code generation; it can actually run the code.
It can leverage tools to go get additional context. That's where this MCP concept came in. And it also fits really, really nicely into a framework like MLflow, which traditionally manages the model lifecycle. So that's sort of where this vision is headed. Question number one, which I don't actually believe in, but Ben, you said something very scary: that the
LLM will actually change its system prompts, even for the evals. So the truth set, the truthsayer that says this is good, this is bad, you're going to allow changing that system prompt. What if you set it loose and it is told to make paper clips and it then kills the universe?
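The supervisor-style loop Michael is describing, reduced to a sketch; evaluate and propose_improvement are hypothetical hooks for your judge and your optimizing agent, and the threshold and iteration cap are arbitrary:

```python
# Sketch: iterate on a system prompt until an objective success criterion is met, then exit.

def evaluate(system_prompt: str) -> float:
    """Hypothetical: run the eval set through the app with this prompt, return a 0-1 score."""
    raise NotImplementedError

def propose_improvement(system_prompt: str, score: float) -> str:
    """Hypothetical: ask an optimizing agent to rewrite the prompt given the eval results."""
    raise NotImplementedError

def optimize(system_prompt: str, target: float = 0.9, max_iters: int = 10) -> str:
    for _ in range(max_iters):
        score = evaluate(system_prompt)
        if score >= target:              # objective success criterion met: exit the loop
            return system_prompt
        system_prompt = propose_improvement(system_prompt, score)
        # Optional human-in-the-loop checkpoint: review the proposed prompt before continuing.
    return system_prompt                 # best effort after the iteration cap
```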
Ben (17:18)
We've already tested stuff like that, and that's what I was talking about with framework changes. If you just give it vague instructions and you're not involved in its iterative loop of optimization, like creating new versions of code and creating its own tests, and if you're not looking at that stuff, you're just going to create garbage, like absolute slop. And
you'll maybe file a PR and then the people on your team will look at it and be like, what were you thinking? Did you just let this run wild with an LLM and have it just generate this? Why would you do this? So the key is keeping the human there.
The way that we think about it is: don't think that, because of the seemingly magical capabilities of these things, these LLMs and agents, because they seem superhuman, that they're intelligent. They're not. Leverage them for what they are really good at, which is following discrete instructions and being able to rapidly change direction to build what you actually want.
So if you approach it that way, as, hey, this is my typing assistant. It's really good at typing, and it's really good at understanding the context of how to type what I want, the way I want it. It can respond to feedback fairly quickly, and it has the ability to execute code and validate that things are working correctly. It enables you to just accelerate your development process.
But you still need to be there. You have to be in the loop checking everything. And the way that we think about it is: how would we behave with an intern? Four years ago, you would have some interns come in, maybe a sophomore in college, an undergrad, and you're going to give them a task
that's not going to be like, build this whole new backend API for us. You're not going to give them something like that. Cause they're going to be like, what the hell? How do I do this? And I have all these questions and I'm now panicking because you're expecting me to do this in two weeks. You wouldn't do that. You would say, here's this one part of this backend implementation we need to do. Can you go and like just build that? And you would expect that there's going to be questions along the way. They're going to be filing.
a number of PRs that you're going to have to review very carefully and make comments and give feedback on. It's no different than interacting with a coding agent, where basically, with each iteration, when it comes to a termination point, think of that as them just filing the PR. Now you have to go in and look at what they created and tell them what needs to change. So something that used to take
maybe two weeks of time, I can now do in an hour. And I don't need to waste somebody else's time; I can just do that. It's basically that intern that's just churning out code in a much more compressed time. The other benefit is we can give these same tools to our interns too, so they can actually work on the smaller-scale projects, learn along the way, get exposure to, you know,
architectural styles and development practices from what is effectively a pretty good coder, which these agents are. They can see stuff, learn through example, figure out patterns that work, and understand what the future of this profession is going to be, without having to just go figure it out, ask a bunch of people a bunch of questions, and take forever to get something shipped.
So it's a win-win. And if you approach improving your applied engineering work, like building an agent using these tools, if you approach it in that way of saying, I'm supervising this thing in its development and making it better: all of the practitioners that we've talked to, the people in industry right now that are doing this, they're shipping agents to production right now.
almost universally across the board, their feedback has been, I don't want to waste time building a super accurate judge. I don't want to waste time in evaluating the results of thousands of traces, which traces are huge, like the chain of information that's stored about every interaction with these things. I don't want to have to go through and sift through that to find out, well,
Based on my eval, I have these top 10 problems. Which ones are most prevalent? How am I going to fix all these things? What should I try first? Because of the complexity of these systems, they're very different than traditional ML. And the people working on them are different as well. You know, software engineers are expecting determinism in everything that we do, and now you're
basically working on a project that is inherently non-deterministic. How do you address that? It's really challenging and really time consuming. So if you can leverage these tools to help out with that in an interactive environment, you're going to ship stuff faster and of higher quality.
Michael Berk (22:34)
Cool. Agreed on everything. One of my actual takes though is that I think that...
Creating a supervised problem is actually sometimes very challenging, especially when you don't quite know what good looks like. So, question to you on the R&D team: what percent of the valuable problems can easily be condensed into a supervised problem with a clear success criterion? And for those that cannot be condensed into a supervised problem, how would the dev flow look when it's more exploratory?
Ben (22:55)
So when you're at a point where you have a version of your code or your project that you would be like, hey, I'm ready to move this to staging and get feedback on this, that's now what I would consider.
Michael Berk (23:22)
Got it.
Ben (23:24)
It's kind of ready for supervision of like...
Michael Berk (23:26)
So this is a
staging-to-production type of tool. It's not an idea-to-staging-to-production or idea-to-staging tool.
Ben (23:33)
It's the boundary between dev and staging where you would use something like this. You have an implementation, you think it's pretty good, you've tested it a little bit, you've interacted with it and you're like, yeah, this seems right. But a lot of these applications that you're building are highly specialized to some part of a business or some actual use case that you, the individual building this thing, probably don't have the knowledge set to properly evaluate,
like the quality of this. The customers that are going to be interacting with this, you can't really jump into their shoes and simulate how they're going to interact with this thing. So what you could do is get a corpus of questions that people would be asking it. You could go and fetch them, ask them, what would you ask this agent? For real, in your day to day, what kind of questions would you ask?
You don't need to gather all of the outputs from them. It's great if you could do that, but based on people we've talked to, most people can't get that data, because it's super time consuming for somebody to be like, well, I would have a question on this. And then you're like, well, how would you answer it? And they're like, well, it would take me a couple of hours to go and look all this information up, and then I could give the actual real answer. I don't have time to do that. Sorry. So what you can do is just get the questions
and then run them through your tool and see what the actual outputs are. But then at that point, you're like, I don't know if this is right or not. Well, if you just have that base prompt for the judge, you could use that. But because it's static and you don't really know how to adjust it in order to get it to evaluate whether the answer is right or wrong, it's like an NP-hard problem at that point.
You're almost at a crossroads, like, well, what the heck do I do? Where do I go next? I need subject matter experts to evaluate this. So the goal is to get that process to be a little bit simpler, so that people who actually know the domain can go in, in a very simple UI, and mark these things as, yep, that's good, or nope, that's bad, and here's why, just a brief sentence explaining why it's junk. And then at that point,
after you've made your fixes and gotten it to be as good as possible, you're ready to actually stage this and get it into a beta mode. And at that point, you can move from that interactive mode of, hey, I don't really know if my judge is good, I don't really know if my app is good. You now are at a point, when you're ready to go to staging, where you're like,
No, the judges are good. I can trust them. So that part can be automated. I can just create this as like a static thing that's evaluating this. And when I open it up to beta testing and have users actually interact with this, I can score all of those responses and then go back into dev optimization where I'm saying, all right, now I can have one of these coding agents analyze those results from real world interactions and tell me how to make this thing better.
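A small sketch of the questions-only evaluation set Ben describes: inputs gathered from real users, outputs generated by the app under test, an LLM judge's verdict, and a slot for the subject matter expert's thumbs up or down plus a one-line reason. All names and data here are hypothetical:

```python
# Sketch: build an eval set from user questions alone; SMEs fill in verdicts later.
from dataclasses import dataclass

@dataclass
class EvalRecord:
    question: str                       # collected from real users ("what would you ask this agent?")
    answer: str = ""                    # produced by the app under test
    judge_verdict: bool | None = None   # from the (still unvetted) LLM judge
    sme_verdict: bool | None = None     # thumbs up/down from a subject matter expert
    sme_reason: str = ""                # one brief sentence on why it's good or junk

def run_app(question: str) -> str:
    raise NotImplementedError("your agent goes here")

def judge(question: str, answer: str) -> bool:
    raise NotImplementedError("your LLM judge prompt goes here")

questions = [
    "What's our refund policy for enterprise contracts?",
    "Which warehouse handles EU shipments?",
]

records = [EvalRecord(q) for q in questions]
for r in records:
    r.answer = run_app(r.question)
    r.judge_verdict = judge(r.question, r.answer)
# SMEs then review `records` in a simple UI, setting sme_verdict and sme_reason;
# once judge_verdict agrees with sme_verdict often enough, the judge can run unattended.
```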
Michael Berk (26:38)
So that's really interesting. I was hoping for something more Holy Grail-esque, but if you can define the problem to be constrained enough for this sort of iterative loop, that makes a lot of sense. But Ben, as you alluded to, some of the most time consuming and mentally taxing pieces of a project are going and collecting the data. You said, like, seeing what users would ask against a chat agent or whatever else it might be.
For a lot of the applied projects that at least we work on in the Databricks fields, we have to do a lot of discovery. We have to leverage our human brains and subject matter expertise, collaborate, and then figure out what the actual problem is and then what success looks like. So you're telling me that is not going away whatsoever.
Ben (27:23)
No, that's the thing, people that are working on agentic systems right now, they're kind of approaching it as they would a software development project where your source of truth is a unit test or an integration test. You know what you're expecting out of your software system and you need to know enough about what you're trying to validate in order to write that test, but it either passes or it fails.
And it's pretty straightforward; it's all just boolean stuff. But if you take people that come from a data science background, you're like, well, I'm building this solution, this project to solve this problem, for this prediction or something. When you get the results out of that, you don't really know, because you usually don't have domain-expert knowledge of what you're trying to build. You're just the person who's building a model that's generating this data.
You typically have to go and talk to subject matter experts and be like, hey, can you look at the outputs here? I know that based on validation data it's mathematically as optimized as it can be, but is this actually useful data? Can you act on this data? Does this solve your problem? So coming from that background, that's been our lives, right? That's how you work on projects. You need to get those people involved and to look at
the output of your model. So that's the big shift right now with some of these practitioners that are working on these projects. They're like, man, I need to talk to all these people and get feedback, and this takes so long. And meanwhile, the people on the data science side are like, yeah, welcome to our world. This is why our projects take so long, because you have to do that validation loop, which means people have to be involved.
Michael Berk (29:05)
Got it. So that makes a ton of sense, taking something from dev to staging. I think that has a very nice constrained success criterion as a feature. But just double-clicking into this, I think one of the most valuable things an LLM can provide in these code generation tools is that it allows you to rapidly test hypotheses. So let's say you're building
an ML model and you want to try a gradient boosting algorithm and then random forests. And then let's throw a linear regression in there because, you know, why not. And you have all these different optimization strategies, all these feature engineering strategies. What are your thoughts on this aspect of the dev cycle, the exploratory phase? Can we have something like this tool go and build a bunch of prototypes and collect feedback, iterate, and then return to
a recommended approach.
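The kind of rapid hypothesis test Michael mentions, sketched with scikit-learn; the synthetic dataset and the model list are placeholders for whatever you'd actually be exploring:

```python
# Sketch: quickly compare a few model families before committing to one direction.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=0.3, random_state=0)  # placeholder data

candidates = {
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    "random_forest": RandomForestRegressor(random_state=0),
    "linear_regression": LinearRegression(),  # because, you know, why not
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")  # 5-fold cross-validation
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```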
Ben (29:58)
I mean, that's basically how we develop stuff nowadays when we're doing greenfield work. So our dev process now: we're doing some major new initiative that could be dozens of backend APIs, entire REST interfaces, database tables we need to create and design,
something that is fairly complex and is going to take months to build. Historically, we would build a hacked-together prototype, and because of our knowledge of these systems, we would just cut corners in that prototype. We're like, I'm not going to worry about serialization right now. I'm not going to worry about deserialization. I'm not going to worry about wire protocols. I'm not going to do all that stuff for this prototype. I'm just going to make it work in the simplest way possible.
And historically that still took, you know, up to a week, maybe two weeks, just to build that thing out, if it's an entirely new greenfield thing, like, we have this idea, we want to just see if it works, if we can make it work. Now we can instead write a design, which we would have done anyway, right? This is what I want this thing to do. Here are the must-haves, the could-haves.
And here's the design philosophy. Like here's the APIs that I want and the interfaces that I want. I can take that doc and I can basically tell one of these coding agents, like, Hey, let's create a branch and it's just going to be a prototype. So it doesn't have to be pretty, but make this work and then let it run for a couple of hours. And then you come back and you look at what it's created. You ask it to change.
couple hundred things. You're like, I don't like what you did here or hey, this actually doesn't work the way that we want it to. Can you change these things? Let it run for another couple hours. By the end of the day, you probably have a working prototype. Does that mean that that prototype code is going to be anything like what is going to be in production? Hell no. But it built something that we can test and we can play around with. We can hit the APIs.
Maybe we built some very minimalistic UI component to go with that. And it wrote all of that like React code.
That's a good reference point for us to understand what it is that we're supposed to be building when we go into the building phase. But we know that that prototype is going to be missing tons of stuff, or it might not interact with all of the parts of our framework that it needs to, so the implementation is going to have to fundamentally change. And you also wouldn't file that prototype as a PR, because it's massive.
Everybody knows that humans are really bad at this: when you get over about a thousand lines on a PR, the review quality goes down, because it's just too much to keep in your head at one time. So you can take that prototype to chunk up your work and say, well, this project is probably going to be 75 PRs. So on PR one, let's create the primitives. Let's start with the building blocks. PRs one through 30 are just going to be, build these
discrete elements so that we can get them reviewed and merged into a feature branch. And then we can start manipulating things, making sure that they talk to one another properly, so that we can build the higher-level APIs on top of that and things still work. Throughout that process, you're still making adjustments. You're fixing bugs. You're making sure that things feel the way that you want them to. We're using the agents for that process too, but it's a more constrained space and the context is much smaller,
so that it can be sort of more laser focused on what we're trying to do with that particular work stream.
Michael Berk (33:29)
Right. So yeah, that makes sense. There are sort of two modes of development. The first is bringing an idea into a production-ish ready state. And then prior to that, sort of this open exploratory analysis of what is possible? What does the problem even look like? How should we define it? What are the potential solutions? That's another area that the LLM-agentic flow can really, really help, especially with a
Ben (33:47)
Mm-hmm.
Michael Berk (33:55)
dynamic loss or a dynamic system prompt that evaluates success. So that's super cool. Question for you though, Ben: this is really changing your day to day. You become more of a code reviewer and less of a code monkey. How are you liking that? Because there's... yeah. Okay.
Ben (34:11)
I love it. I love it.
And everybody that I work with right now, we're all doing it. Everybody's got... what did I hear yesterday? Because we asked this question of each other at our team meeting two days ago. I just brought it up. I was like, hey, for the group: how many terminals do you have open for different providers right now?
And everybody had a different answer. One person was like, well, I've got eight tmux sessions open. And then somebody else was like, well, I've got four open at any given time, but I use this tool instead of this other one. And I'm like, well, I've got four iTerm terminals open. Three of them are busy right now. One is for debugging customer questions and I'm not using it right now. So
we're all heavily using this stuff, but everybody's using it in slightly different ways. And we share ideas with each other. Like, oh, I'm going to share my Claude config for doing this repo work. And you go and look at it and you're like, oh, I see these four things in here. I don't have that in mine. I'm going to copy that over and see if it changes the dev environment. And then after a couple of tests, you're like, these are awesome. Yeah,
I'm gonna delete these two lines from mine, because that was actually annoying me.
Michael Berk (35:32)
Okay, cool. How's context switching going though?
Ben (35:35)
It sort of rewires your brain. When you first start doing it, it almost feels overwhelming. But what helps me is keeping a document on my computer that's sort of a summary state of what I'm working on, and also tabs associated with it. Like in Google Chrome, I'll have tabs that are colored and related to the different work streams that I'm working on. And
that helps me organize the fact that, hey, I'm actually going to be doing development work on four different things all at the same time that are not related to each other. So I have the actual terminal windows named so that I know which one to go to.
I used to be really bad at context switching. It used to stress me out, because you're the one who has to focus entirely on not just what the heck am I doing, but how am I doing it. And if you're working on completely different work streams, that's a huge cognitive load. If you're building framework code, it's like, well, I'm working on this backend API right now, and then there's this other thing where I've got to update the UI. Switching from Python to React
in the span of a minute is jarring. So now I don't have to worry about that. You're operating at a higher level, a more efficient sort of parallel operation, because somebody else is basically doing the actual typing.
Michael Berk (37:01)
Yeah. Anytime I hear the words context switching, it reminds me of my first internship ever. My manager, who's probably one of the smartest people I've ever met, told me that his PhD project at Stanford was too easy. So he would basically have a timer set every five minutes to make him have to context switch to something else, so that he could practice context switching.
And when he was telling me this, he was standing with one foot on a balance beam while writing code, while talking to me. And it was just like, all right, some brains are built different. But yeah, it's been interesting for me to hold multiple contexts in my head. It's a pretty different structure. And in the field, you definitely do have to context switch a ton. Sometimes you have up to like seven or eight different customer projects running in parallel. But for this,
it's an interesting balance (I'm just on one customer right now) between being very focused on the overall problem statement, then thinking about, conceptually, the Jira epics, the workstreams contributing to that overall goal, and then sub-tickets, and then bugs in each ticket. So it's this hierarchical structure that definitely changes how your brain works. And it's not the most graceful transition always, but...
You can always just slow down if you need to.
Ben (38:24)
Yeah, I don't know if I would be able to do that today anymore. I used to have to do that when I was like in the field, like, I've got four customer calls today. They're all talking about completely different things. I would have to have like basically tabs in a Google sheet where I was writing notes down and 15 minutes before the call, I would have to like review that and be like, what's the state of this right now? Okay, I'm going to read these like
200 different cells worth of content information to remember what the context was and they're like, okay, now I'm prepared for this meeting.
Now I don't need to do that. I can keep in my head these four different work streams and what the state of them is right now. It's like, okay, work stream one, I'm on PR seven of 14 and I'm working on the REST interface. Okay, I know what's going on and I have the historical context for that particular work stream. It's just in code.
So I kind of know in my head where we are and where we need to go next. But for those larger projects, I'll also have, for PRs one through 15, little markdown design docs in the repo root that keep track of that, as well as one master file. I'm telling Claude to keep that up to date every time we're
getting ready to cut a new PR, or I tell it, okay, PR number six just merged. I'm going to merge the changes of that into the PR we're working on right now. Go run all the tests, make sure we didn't break anything, and then continue what you're doing. Or we're like, hey, we're ready to start PR eight. I'm going to cut the branch, like create a new branch for this, and here's what we're working on. Go read this file and start developing.
Michael Berk (40:09)
Okay, Ben, I have one final question for you. We've talked a lot about how code agents and all this fancy tooling and development are helping the power users such as yourself, and theoretically me, who are shipping a bunch of code.
99% of the workforce doesn't operate on that level. They won't have four iTerms open and be running their own Claude configs. So,
given you're working on this so closely, I'm curious about your outlook on when there will be something generally available to the public that can be as simple as: define a problem, and the agent goes and solves it. And to add a little bit more flavor, referencing a prior job again, my manager at the time was like, the most valuable thing you can do for me is take a monkey off my back. And I was like, what the hell does that mean? And he was like,
Ben (40:58)
Ha
Michael Berk (41:00)
Basically, I have a bunch of monkeys on my back, metaphorically. And if you like offer to take a monkey and like feed it, give it water, whatever, like entertain it for a bit, then give it back to me. That's not nearly as valuable as you going and taking that monkey and just throwing it away. And what that means conceptually is like you go, you own a project, you deliver it end to end. And I, the manager never ever have to think about it again.
With agentic systems, we are not in that phase. Right now we still have to have each step reviewed. And overall, the problem statement has to be defined, and then you have to do a final eval of the success and think about how it fits within the broader organization. Arguably, you're making more monkeys, because you can work on more of them in parallel. So they are lighter monkeys, yes. So Ben, I'm curious, when are we going to reach the stage
Ben (41:43)
They're lighter monkeys.
Michael Berk (41:50)
where, with a simple UI, you can have an LLM take a monkey off your back?
Ben (41:55)
Dude, you're talking about AGI. You need something that is just as capable as you, but faster, potentially smarter, with the ability to fetch more context than your brain can hold. We're years off, like many, many, many years, in my opinion.
It's not just, let's give the existing LLMs more tools and bigger context windows. I think you need a different architecture for these things. Like, what is the next evolution past transformers? Maybe then we'll get closer, but for right now we're not even remotely close. Not to a point where I would be able to be like,
Michael Berk (42:31)
Well then let's die.
Ben (42:34)
hey, I can write three paragraphs of text, like a Heilmeier design: here's what I want, here's what I want it to do, here's what it shouldn't do, and here's the repo that I want you to do that in. Now go do that, and have it file those 22 PRs, grand total like 90,000 lines of code committed, where our review of that is basically a bug bash. We're just validating, did it build what we wanted it to?
Let's check it out. Let's use it as a user would. Yeah, I think we're really, really far off from that.
Michael Berk (43:05)
Let's reframe the question then, because if the monkey is small enough, it can do that. If, say, the monkey is rewrite this email for me, all right, it can maybe do that. Yeah. So in the next year or two, for actual industry applications that you've seen via working with products and customers, what are the monkeys that will be taken off of people's backs?
Ben (43:13)
we're already there. Like, yeah.
Stuff that's already being taken off: customer service. Like when you're looking to issue an RMA for a shipment. You bought something, it's broken, it arrived all messed up at your doorstep. If you're talking to Amazon right now, you're talking to a bot that is pretty darn good. It pulls your order history. It pulls what this
exact product was. You don't even need to tell it, here's the SKU and here's the order ID number. It just looks it up. It knows who you are because you're logged in, and you're like, hey, I had a damaged product that arrived recently. It'll go and fetch your order history and it'll be like, was it this? And you're like, yeah, it was that. And then it'll be like, well, what's damaged with it? And you're like, well,
the entire thing just doesn't work. It's broken in half. It doesn't need to go and phone home and be like, well, I need to get a human to authorize this. It'll just ship another one to you in seconds, and it'll be there tomorrow. And then it'll tell you what to do with the old thing. If it's large, they'll be like, when we get the delivery, we'll have this picked up, just please put it out by your front door. If it's small and it's low cost, they'll be like,
keep it or throw it away, recycle it.
Michael Berk (44:49)
But restating the question, what's the frontier? Like, yeah, I agree that that's pretty much done. What is the next year of hopefully somewhat reasoning-based tasks? Will it write SQL for me? Will it solve a problem with SQL? Will it augment the business analytics process? Will it augment the consulting process? Where do you think the next job descriptions are that will be severely augmented by it?
Ben (45:16)
I like how you said augmented and not replaced. So I think
people that currently do business analytics are going to have a huge boon to their productivity, where they're not having to search for needles in haystacks. They just ask the question in plain language, or they're there to do advanced analytics and answer really hard problems, with these agents to help them out with banging out thousand-line SQL queries and complex, you know,
munging of data into something that can be visualized and tell a compelling story. So they can work on more problems faster, without that drudgery of, my SQL syntax is messed up on line 347, and then having to go and fix that and rerun the query. They can have these agents just bang that stuff out, and then they can ask questions about it, like, well, what do you think about the data here, and provide additional real-world context to the agent
so that it pulls the data that they want, that they need, in order to solve the problem. Software engineering is getting a huge boon. Productivity for an individual, by my estimation, code output for a lot of people on the team, is 8x what it was before these coding agents. So we're able to just push out a lot of commits of code that's not
absolutely terrible, because we have that human-in-the-loop thing. We're doing internal reviews ourselves before submitting. And then fixes for things, like bug fixes in code, that's largely automated now. We just tell one of these agents, hey, we just got this report, go diagnose what the root cause of this is and then file a fix. And stuff like that, where we're just correcting a design flaw?
We're already there. It's pretty simple for these things to do that. Building new stuff, that's hard. And we need to be heavily involved in that. But yeah, I imagine there's a lot of industries where...
the fundamental nature of their job is going to change quite a bit in the next five years, where every day they're going to be interacting with assistants all day long. You may not trust them, you may not like doing that right now, but anybody I've talked to who's actually started using these things that are designed for helping them out, they're just like, yeah, these are awesome. All it is, is
a monkey extractor.
Michael Berk (47:39)
Yeah, hopefully we can get to the gorillas pretty soon instead of the little baby guys, the capuchins. OK, cool. Any closing thoughts before I summarize?
Ben (47:46)
Mm-hmm.
No. I mean, just stay tuned for some of the stuff that we're building. We were shocked at what some of our prototypes could do.
Michael Berk (47:55)
Yeah.
Yeah. And I'll be really excited when it's a simple interface where you can sort of give it a simple description, because all this system and prompt engineering is really valuable for power users, but I think democratizing this would be really, really exciting. So I'm curious when that happens. But yeah, really cool episode. We tried to talk about prompt engineering versus auto-optimization, and of course got off topic, but
Ben (48:16)
Hmm.
Michael Berk (48:27)
At a high level, some key takeaways are humans still very much need to be in the loop for a lot of tasks. And the way that you can think about how much a human needs to be in the loop is this continuum of task complexity. So LLMs don't think like a human exactly. They're very good at certain things. They really, really struggle with other things. And you just need to think about where GenAI is in terms of advancement. And this will obviously get better over time.
determine what you can actually delegate versus what you have to be heavily involved in. And the one-liner for that is: if it's something new or it's sort of an exploratory project, you probably want to be heavily involved. And if it's a constrained problem that involves following instructions and you can write those instructions down, the LLM is probably going to do a bit better. Keep the requests simple, clear, and concise. And then, as we just alluded to, idea-to-MVP is a very different flow than dev-to-staging,
and they need to be treated differently. And then finally, try to start small. You can often increase scale, but if you give an LLM a do-my-day-job type of problem, it probably won't do very well.
Ben (49:31)
Yep. It definitely won't do it very well.
Michael Berk (49:32)
Anything else?
Depends upon your day job, but yes.
Ben (49:37)
Yeah.
Maybe another episode will cover what industries are definitely not going to be affected by this advent. There are some that just are not.
Michael Berk (49:43)
Wait, yeah, I'm super down.
Ben (49:45)
There will be no interaction between the robots and humans. And the robots will, it'll be decades if not centuries before we hit our end.
Michael Berk (49:53)
Car manufacturing, like a factory. Right. But no GenAI in it. Or actually, is there?
Ben (49:54)
I mean, car manufacturing is largely automated right now.
yeah.
Michael Berk (50:01)
Okay. Nevermind then. Well, a plumber, true.
Ben (50:04)
Plumber, electrician, a bricklayer,
landscape artist. There's tons of stuff.
Michael Berk (50:08)
All right,
for next time.
Cool. Well, until that next time, it's been Michael Berk and my co-host, and have a good day, everyone.
Ben (50:14)
Ben Wilson.
We'll catch you next time.