Free Form AI explores the landscape of machine learning and artificial intelligence with topics ranging from cutting-edge implementations to the philosophies of product development. Whether you're an engineer, researcher, or enthusiast, we cover career growth, real-world applications, and the evolving AI ecosystem in an open, free-flowing format. Join us for practical takeaways to navigate the ever-changing world of AI.
Michael Berk (00:00)
Welcome back to another episode of Freeform AI. My name is Michael Berk. I do data engineering and machine learning at Databricks. I'm joined by my co-host. His name is...
Ben (00:09)
Ben Wilson, I investigate CVEs and patch them at Databricks.
Michael Berk (00:13)
Yeah, seven in the last ten days.
Cool. Today we're gonna be doing a little bit of a different format. So we're gonna be starting a YouTube channel so that we can gain enough following to monetize so that I can make this tax deductible. And for this YouTube channel, we were thinking that it would be actually really great to be more transparent about what development looks like at Databricks. I switched
to a new role about two weeks ago and I'm more internal facing, which is a really nice change of pace. And Ben has been a mentor for how to write non-crappy code. And we were going to just have a working session where you guys get to see what it's like to have a live code review and also just discuss strategies around building repos and writing code. Sound about right?
Cool. So.
Ben (01:06)
I'm gonna
add a preface to that. I don't think any self-aware software engineer would ever say that they're an expert on anything. So just wanted to preface that. I am self-aware and most certainly don't make all the best calls at all the best times, but we do processes like this. We usually do it remotely on my team.
You'll file a PR, people will make comments, ask for changes. They see things from a different perspective, because you put blinders on when you write your own code to a certain degree. And that can even be further complicated if you're using GenAI to assist you, just because it can generate so much stuff so quickly. You can kind of look at it and be like, yeah, it looks pretty good. And then another person who doesn't have that context comes in and is really looking at it. Like, did you think of this? Or...
We might not want to do that. Let's try this instead. Or, hey, I think you're off base on what you're doing here, there's a simpler way to do this. So what you and I usually do is the old school live review, just asking questions. And sometimes I'll ask you leading questions, like, what are you trying to do here? Or, oh, you definitely don't want to do this thing. And a lot of that comes from personal experience. Anything that I say that's like that.
It's because I've screwed it up and I had to learn the hard way.
Michael Berk (02:27)
Cool. So let me set the stage for what this initiative is. The new role that I took about two or three weeks ago is called reusable IP. Conceptually, Databricks field goes and builds lots of solutions and they figure out challenging things. And in theory, there's a large corpus of information out there and tribal knowledge that should be made reusable. If one person figures out something complex, no one else in the field in theory should have to figure that out as well. So.
We have six work streams and one of the work streams is the repo that we're going to be going through today. And it's called components. Components are small modular pieces of code or utilities that typically can be reused between deployments, but might not be super easy to write for the first time, especially if you don't have a lot of context. Ben, do you understand? Cool. So.
Ben (03:17)
Yes.
Michael Berk (03:20)
For full transparency, this repo is private, but we're going to open source it eventually. So you'll be able to see this at some point, but this is live. This is my actual work desktop. And we're going to go through a review that Ben has not seen yet. And yeah, it should be interesting. It's a new format for sure. All right.
Ben (03:44)
Yeah.
Michael Berk (03:46)
On the screen, as you guys can see,
we have some key directories. So just starting top level, and then I'll let you drive with some questions or I can continue to drive. I use Claude religiously. I have settings that have basically an allow list, and I have several commands which I'm playing with. One that we chatted about last week was a setup command, where basically it will create a new directory with dependent repos,
because as we build components, we want to know what the best practice is. And often Googling around is not the most efficient way to find what a source implementation actually is. And so if you just clone the repos into a set of reference repos, or into a reference repo directory, often Claude Code is able to parse that a little bit better than actually Googling docs and a few other things like that. So that's the .claude directory, and we can definitely revisit it. Let's hold off on GitHub workflows.
The actual implementation here is a components directory. And within components, we have just one implementation, and that's model serving. Basically, for this implementation, if you deploy a model serving endpoint, so it's a VM that exists in the Databricks environment, and you want to query multiple things at once or concurrently, that's not super hard, I guess, but it's also not super trivial. It's sort of like an intermediate technical question. And there's
ways to do it well and ways to do it poorly. And so there's a big README of what the hell is threading, what is multiprocessing, how does all this stuff work, how does it work within model serving, and then additional gotchas and FAQs, including two implementations. One is a quick start. So this is just for knowledge transfer purposes; if you want the most lightweight template, you can leverage this. And the next is a best practice, which is more of a lift and shift with complex examples,
for instance, authenticating against Databricks resources and that type of thing. So finishing up the tour: requirements, pretty straightforward, it's just a list of TXT files with PyPI references. Integration tests is the thing that I would like to chat about first, and then tests as well. So pausing there. And this is just code that shouldn't even exist in the main branch.
Look at me. So Ben, I can definitely drive, starting off with CI and integration tests, but is there anything you wanted to call out before then?
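A minimal sketch of the kind of concurrent-querying pattern this component covers, assuming a hypothetical serving endpoint URL and token rather than the repo's actual helper code:

```python
import concurrent.futures

import requests

# Illustrative assumptions: a Databricks model serving endpoint URL and a token
# pulled from wherever you keep credentials. The real component handles auth,
# retries, and error handling more carefully.
ENDPOINT_URL = "https://<workspace-host>/serving-endpoints/<endpoint-name>/invocations"
TOKEN = "..."  # e.g. os.environ["DATABRICKS_TOKEN"]


def query_endpoint(record: dict) -> dict:
    """Send one scoring request to the serving endpoint."""
    resp = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"dataframe_records": [record]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


records = [{"text": f"example {i}"} for i in range(20)]

# Threads work well here because the workload is I/O bound (HTTP calls).
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(query_endpoint, records))
```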
Ben (06:22)
Yeah, so requirements, great to have those defined. Can you go into the requirements real quick? Yeah, dev requirements. OK, so explicit pins. That's good to create a reproducible environment for whatever. These two I would classify as test requirements
instead of dev requirements, dev would be like, I need these external libraries in order to make CI work or local development work. And your mileage may vary about how you want to approach that. Whether it's explicit pinning, you know that this is the exact set of libraries to install in a local environment to guarantee that you're going to get a consistent dev experience. You're not going to have like weird stuff happening from library collisions.
But if you're pushing something out that other people are going to use, what we choose to do on the MLflow side is explicit next-major-version pinning. So we wouldn't do it for Ruff. Ruff, it's very, very important that you explicitly pin a version there. Same thing with pytest. But if you had other things where it's like, I want to make sure that
people are using 1.x of a package, but not 2.x. Usually that 2.x hasn't even been released yet. So we'll add that less-than-2 cap to the installation. And that makes sure that we're not guaranteeing in our library that we support a new breaking change that a major library is releasing. That protects you from
just maintenance headaches of like, hey, I have something released out in the wild that people are using, and pandas just upgraded to version three and everything is broken. Nothing works when they install your library. If that's a dependency in your pyproject.toml, you just broke your library a hundred percent. Nothing will work, because it'll install the latest version when installing your stuff. So that's a little tip.
Michael Berk (08:31)
Yeah.
And package versioning follows basically a three-part namespace, conceptually, of major, minor, and patch. Minor releases should not break APIs, but as Ben was referencing, as soon as there's a major version change, there could be breaking changes to the APIs. So that's why you would want to pin the major version and hopefully allow the minor version to evolve naturally.
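To make the two pinning styles concrete, a hedged sketch; the package names and versions are illustrative, not the repo's actual requirements files:

```text
# dev-requirements.txt: exact pins for a reproducible dev/CI environment
ruff==0.6.9
pytest==8.3.3

# requirements.txt: upper-bound pins for something other people will install,
# allowing new minors/patches but blocking the next (possibly breaking) major
pandas>=1.5,<2
mlflow>=2.9,<3
```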
Ben (09:03)
And even though everybody who's serious about open source development generally wants to, and makes a best effort to, adhere to semantic versioning rules, we're all human and we make mistakes. We may break an API accidentally, which will then trigger us to release a patch release that fixes that.
It's not intentional if something breaks, but it can break. And explicit pinning avoids that, but it also makes it so that you're kind of out of date with the rest of the world. So somebody may have a requirement of using a particular version of NumPy or something, or they're installing a much more recent version and they need functionality in that library to use it. They now can't install your library because you have an explicit pin.
you may pin to 1.18.1 or something, and then they need something that's in 1.4. And oops, now they can't actually install your package, so they move on. And they're like, well, this is broken. Or they just file an issue, and they're like, hey, can you fix this? I need this functionality. So that's why most people don't do that. You try to maintain forward compatibility as much as possible. And that'll get into our discussion probably later about CI.
Michael Berk (10:27)
Yeah, later as in right now. So one of the key things that is challenging with the Databricks documentation is there are no tests, at least no automated tests. Basically, if there's an issue that a customer or someone in the field or anyone else reports, the docs team will go and fix that. But testing all docs, and this includes open source MLflow, is pretty challenging. So there typically aren't test runners that
Ben (10:29)
Hahaha
Michael Berk (10:54)
make sure demo notebooks work or make sure code snippets work. So I was taking a different approach, and Ben, I'm very curious about your take. So this example right here with the quick start: we have the classic MLflow model-from-code pattern that defines your model conceptually. So if I have XGBoost, let's say, it's going to exist in this Python file. And then in MLflow, you can log it and then you can load it.
And then you can run it, pretty straightforward. I want to know, in theory, that this is still up to date. So I have, well, actually this serves two purposes. The first purpose is to know that the notebook is correct: there are no syntax errors, the end-to-end flow actually works. And the second is we could run this in a scheduled fashion to ensure that it stays up to date, because with, let's say, early APIs that are potentially not stable,
and this happens quite a bit, as those APIs evolve, the documentation becomes out of date, and we want to flag that in an automated way. So what I did is I set up a quote-unquote CI system.
I'm going to rename this; right now it's under tests, which is just not great. But conceptually, the way that Databricks manages resources is there are two options. The first is API-based tooling, where you wrap an API. It can be via a Python SDK, it can be via Terraform or whatever it might be, but there are setup and teardown API methods. Not all resources can be governed by APIs. Some need Python setup and teardown scripts.
So in this example, a Delta table, that's not really governable via an API call, and that would make sense. There are a lot of data transactions; you probably wouldn't design an API to handle a data table's creation and lifecycle. So in this, I have some animal facts. An axolotl can regenerate entire limbs, its heart, and even parts of its brain without scarring. And then I create a DataFrame and write it,
and then have some very nice Claude Code generated checkmark output that says it either was created or already exists. And then the teardown simply drops the table. So this can then be used in a parameterized Databricks asset bundle where we have, maybe this is in a separate PR, as you can see it's very work in progress,
but we have set up and tear down master scripts.
which I am failing to find, so they must be in a separate PR. But conceptually we have a sort of parameterized workflow that will, if true for a given parameter, set up your Delta table. If true for the vector search parameter, set up a vector search index. If true for sync tables, it'll set up a sync table. That will basically run prior to the integration tests. And then from there, we simply have a Databricks asset bundle that will run our notebook of interest. And this should be
very simple. It's just install dependencies, press play on log model and test inference, and then we're good to go. So pausing there: Ben, I'm curious about this overall design, code smells, feedback, anything standing out to you, other than the crappy nature of the current repo?
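A rough sketch of what a Python setup/teardown pair like this might look like, assuming it runs somewhere a Spark session is available; the table name and data are illustrative, not the repo's actual scripts:

```python
# setup/teardown sketch for a Delta table used by integration tests
TABLE_NAME = "main.integration_tests.animal_facts"  # hypothetical catalog.schema.table


def setup(spark):
    df = spark.createDataFrame(
        [("axolotl", "Can regenerate limbs, its heart, and parts of its brain.")],
        schema="animal STRING, fact STRING",
    )
    df.write.mode("overwrite").saveAsTable(TABLE_NAME)
    print(f"✅ Created or replaced {TABLE_NAME}")


def teardown(spark):
    spark.sql(f"DROP TABLE IF EXISTS {TABLE_NAME}")
    print(f"✅ Dropped {TABLE_NAME}")
```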
Ben (14:09)
Mm-hmm.
That's...
I mean, it's not that different from our integration testing setup that we do on our side. So MLflow, of course, works with Databricks, and we have a special test Databricks workspace that is a prod environment. So we have an account with Databricks that's not like an internal account; it's like one a customer would have. And we do basically the same thing.
We create an asset bundle, we run a bunch of notebooks, and we make sure that our current master build, our master branch or main branch, whatever you want to call it, does a bunch of activities: stuff like logging a model, setting a bunch of params and metrics, doing hyperparameter tuning, and a bunch of core functionality. We want to make sure that we can register models and load them and deploy them to model serving and run inference against them, so that
as much of our product surface as possible can be vetted against a production workspace. So we get signal while we're doing development, before we do a release, that we broke something. So if those CI check runs fail, whoever's on call has got to go and look into it: what did we just merge that broke this, and fix it. We don't put those as part of
CI for merging code though. And the reason we don't is because of the volume of merges that happen. We might merge 30 PRs in a day. And if we're kicking off end to end integration tests 30 times a day, it's just super expensive. And we don't have a CD component. That would be super dangerous if we did.
Michael Berk (16:00)
Cut it.
Ben (16:14)
So it's not like every time there's a PR that merges, we release a new micro version of MLflow. That would be obnoxious, and we would probably ship a lot of regressions if we did that. So instead we run this suite nightly, and our release pattern for minor releases is generally once a month,
sometimes a little bit shorter, sometimes a little bit longer, but that's what our goal is. So, depending on when a PR merges and integration tests kick off, we always have at least seven days to fix something. And it's because we use a release candidate that allows us to kind of lock in a branch version that's being cut, so that we're not pushing stuff right before we release the full version on PyPI. It just protects us.
Michael Berk (17:08)
Right.
That makes sense. In CI, actually, you're right, I don't have this running. I should be more clear with naming. Right now CI is just linting. And then I have a GitHub workflow that runs end-to-end integration tests, and I will probably schedule this at some point. But I am hitting a real Databricks workspace. It's a temporary workspace that can be provisioned by field engineers. And that seems to be working. But question for you: what the hell is a release?
like semi-genuinely.
Ben (17:40)
So a release is the concept of: you have your main working branch, which you probably call main, as evidenced in the upper left-hand corner. Back in the day, older repos would call it master. We don't use that anymore; it has negative connotations. So you have this main branch. Yeah, exactly. So you have this main branch that is
Michael Berk (17:59)
Actually, interesting.
Ben (18:07)
kind of the working state of your repository, and you're pushing code to it. You're fixing bugs, you're adding new features, you're adjusting things, whatever you need to do in the life of a repository. When you're ready to cut a release, you effectively freeze the state of your main branch and you create a new branch that exists only in your upstream repo. So in this case, we're looking at GitHub.
So we would create a new GitHub branch that's not in the fork of this repo, but in the actual legit repo, this actual repo. And this branch, most people will title it something like branch-<release number>. We do minor releases in MLflow, and that's pretty standard in Python. So our branch for the initial first release might be branch-0.1.
And that state of the branch is what is going to be used as a deployable artifact to a shared repository of packages. So in the case of Python, that's PyPI. In the case of our Java SDK, that is Maven. In the case of R, that is CRAN. In the case of our TypeScript library, that is npm's
master repository. So for each of these versions, the code that was in master when you locked it applies to that branch for that initial minor release. If you have a bug that shipped, or a regression, or something that just went badly, you don't cut a new branch for that from main. That's risky. You could introduce new features in a micro, like a patch, release.
You definitely don't want to do that. That violates semantic versioning. A patch release is meant to only fix bugs or regressions that are based on that minor release. So what you do in that case is you have some fixes that go into main, because all your fixes go there first. There are sometimes where you don't do that, but it's pretty rare. You always want the fix to go to main.
And hopefully the fix is applicable to the branch that you're working off of, so that you can take your fix and do something called cherry-picking, which means you open a new PR on your local fork of this repo. And you should always have a fork of your own repo. Even if you're the repo owner, you should never push directly to the main fork or the main branch. Otherwise you don't have the ability to
Michael Berk (20:51)
Why?
Ben (20:58)
differentiate permissions between yourself and a CI system, or how repository management is supposed to work. I know a lot of people do it. I've done it in the past where I'm like, I'm the only one contributing to this thing, to this particular repo. I don't need to fork my own repo. I can just push directly to it.
Michael Berk (21:05)
Got it.
Ben (21:23)
But if there are multiple people working on it, or if it's something that's going to go out to the public, I always create a fork. It's just cleaner, it looks nicer, you can see the commit histories, and it's definitely recommended. But for this patch release situation, let's say we're patching something that was released a month ago, and we've had a hundred commits to the main branch since then. And this one bug,
Michael Berk (21:33)
Mm-hmm.
Ben (21:52)
we know we have to patch it in the main branch, but we don't want anything else that has been in that commit history. So we create a new branch in our fork from the upstream release branch. We would say git checkout upstream/ and then the name of the release branch, so branch-0.1. And from that branch,
once we're there, we create a branch from that branch in our fork. And then we just type git cherry-pick followed by the hash of the commit that actually got merged to the main branch. That'll take that commit and attempt to bring all of its changes into that branch. And hopefully there are no merge conflicts or weird things.
That's what you have to check. So after you do that, you should run CI locally and make sure that things work. Depending on how complex your repo is, running all of CI locally could be a non-starter. In the case of MLflow, that is what we have. There's no way you're running the full test suite locally; your computer will choke and die, and it'll take hours to run.
So what you then do is just file that PR from your fork in GitHub, and hopefully CI is set up to run everything, all your tests. Once everything is passing, you should feel confident. And you may have added a test as part of that fix; you should have one that verifies the fix works. That way, when your fix plus the test that verifies it merges into that branch, you can be confident that you fixed the issue
Michael Berk (23:42)
Right.
Ben (23:42)
in this
particular version of the code, not just in the main branch. And then once everything's all good, you merge that PR into that branch. You don't merge that PR into the main branch. So branch-0.1 has your fix, and then you can do your release pipeline from there, whatever that might be. It might be semi-automated, it might be fully automated, depending on how much time you want to spend on that. But you need a lot of testing to make sure that it's good to go,
or it's manual verification. Like, can I install this thing? Does it work? Can I run my tests locally against the installed package, and not in editable mode on my main branch? And once you verify all that, release it.
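A condensed sketch of the cherry-pick flow described here, assuming an upstream remote pointing at the main repo, your fork as origin, a release branch named branch-0.1, and a placeholder commit hash:

```bash
# The fix lands in main first via a normal PR. Then, to patch the release branch:
git fetch upstream
git checkout -b fix-for-0.1 upstream/branch-0.1   # branch off the frozen release branch
git cherry-pick <commit-hash-from-main>           # pull in just that one fix
# resolve any conflicts, run whatever tests you reasonably can locally
git push origin fix-for-0.1
# open a PR targeting upstream/branch-0.1 (not main), let CI run, then merge
```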
Michael Berk (24:26)
How do you, so then there's a separate process to bring the 1.0 release or I guess 1.1 release into main.
Ben (24:35)
No, it always stays separate. So that's why you merge your fix first to main and then you pick from main into the release branch. So there's consistency for just that particular like fix in both so that you don't have to wonder like what the state is of things and you don't have to do something super dangerous, which I've seen people do before.
Michael Berk (24:41)
I see, gotcha.
Got it.
Mm-hmm.
Ben (25:02)
They're like, I'm doing a patch release and fixing this thing. And then you look at the changelog and you're like, dude, there are like 80 features in this thing when you should have just had one. Now, most open source repositories, you'll never see stuff like that. You'll see a cherry-pick being done; professional engineers know how to do this. But if you're looking for tips on this, and you're working on some internal repo for use by your team or for use at your company,
Michael Berk (25:10)
Mm-hmm.
Ben (25:30)
maybe this is helpful to listen to, to hear how this is actually done in practice, because it'll save you a ton of headache. The last thing you want is you're releasing a patch fix and you've got the fix in there, but then you accidentally include like four new features, and those have regressions. So now you have to release another patch release to fix the things that you accidentally included, and now it's whack-a-mole. And sometimes you can see that in open source repos,
where it's like, man, they're on patch release 11 on this minor release, and they release minor releases every month? What's going on? It's probably something like that. Or it's just a long running bug detection cycle. Like some packages aren't super heavily used, or it takes a while for people to migrate to the latest. And then you can see multiple patches being released. Sometimes that happens with us. Sometimes you see like,
Michael Berk (26:07)
Yeah.
Ben (26:26)
version dot-three on a micro release and you're like, are they incompetent? It's like, no, well, maybe, but it's more that we don't find out about this particular esoteric, weird edge case bug until weeks later, when somebody just tried to use it and we're like, we had no idea. We don't have test coverage for that, because that's really weird. It's like they're not using the API
in the way that we thought people would use it. So now we gotta go and fix that.
Michael Berk (26:59)
Right. Question though: I am not deploying to PyPI. The intended use is reference documentation. Unfortunately, potentially copying and pasting here and there, or a git clone into your personal repo. But each of these is sort of a modular component. And as we scale it out, as you'll see in this very janky PR that's still work in progress, we will add additional components within the components directory,
such as the sync table component that has a readme of, right, what the hell is a sync table? How do I do stuff? And then the utilities associated with that. So this is not intended to be like pip installable. How would you think about versioning? Does it now not matter? And just main should always be robust?
Ben (27:47)
This actually goes into how we do management of stuff like our runtime versions. So Databricks runtime is effectively when you look at the numbers of them, like, I'm using 17.3 or I'm using 16.2 and there's like this LTS thing. What those actually are doing is exactly what I just explained with branches being cut. It's a separate repo.
than where we actually do development, which makes it even harder to reconcile. But effectively what you're doing is you're taking a copy of the code in your main repository for that. But it's not all of the code. It's the code that's required to build that runtime image. And you're applying semantic versioning principles to those copies. So you'll have a
a branch that is like the 17.x branch, and that's for active development in the DBR 17 versions. So when you make a change, you can select whether that gets automatically cherry-picked. We do have a CD system that basically handles this for us, thankfully. Otherwise, for every change, we would have to be filing dozens of PRs. So we can say,
I need to target these particular branches of the runtime release. And the system will automatically take our, our picks and verify that they can be safely merged and then apply those to all of the different runtimes that we select. And you have to get a bunch of approvals for that, but it's a pretty darn good system for applying that in a clean way. There's also bulk transfer stuff as well.
Michael Berk (29:35)
So just to make
sure I understand, you guys have a sub-feature within the entire Monorepo of Databricks. And within that Monorepo, there are different runtime versions. And so you want that feature to apply to all of those versions and make sure it works for each version individually.
Ben (29:50)
Yeah.
Yeah, and we handle it cleanly by having a separate repo to handle all of that. And in that repo, you can select a particular branch version, and the state of that repo is heavily different between versions. So you could go in and select, like, branch 10.1, or whatever the 10 release's semantic version got up to, and then do a diff of that and watch your web browser die, compared
Michael Berk (30:00)
Mmm.
Got it.
Mm-hmm.
Ben (30:25)
to like
19.x, which is under active development right now. And you would see that probably 70% of lines changed. It's that crazy. So the diff would be in the many millions of lines changed. So yeah, we handle that through that separate repo, so that one repo can basically
go through an orchestration system for that continuous deployment of pushing those hot fixes, or warm fixes, or general bulk transfers of low-level bug fixes, into these different runtimes.
Michael Berk (31:05)
But going back to my question, it sounds like I could achieve the same state where main is a very clean, git-clonable implementation, and maybe my fork is more of a working version. And that makes sense, to just have a fork in general. But the use case, and I'm fine to change this, but right now we're not designing it so that it's pip installable whatsoever. There are no utilities that someone would use, no APIs that someone would use. So just keep...
Ben (31:32)
Right. So the thing
you need to think about is where does this run and what contract do you have with your users that this will be runnable in perpetuity? So there's two ways to approach that. One is this only runs on the latest, like the latest LTS. At that point, whenever a new LTS is getting ready to be cut, like
because all of this stuff is running on a Databricks runtime. And that's where your integration tests are running. So if that's the case or the state that you want to be in, then you tell your users, hey, this repo only works on latest. So if your customers are on an older version, no guarantees. Go figure it out yourself. If that's what you want to do, cool. Then do that.
You'll probably have a bunch of fixes to be made every time you're leading up to an LTS release. Where this repo will be a year from now, it could be that one file changes, like one particular API changed, or the behavior changed on how you handle auth credentials or something. Or it could be that every file changes, so you have this massive fix PR that's like, okay, I'm moving from
18.whatever LTS to 19.whatever LTS, and I have this massive PR that changes every file. It's 6,000 lines of code in this fix PR, and then it's now compatible with that. And you tell your users, all the examples in here are now vetted against 19.x. So there you go.
The other approach is to version, semantic version, minor versions of compatibility to the runtime for each individual component that you have. So you would do, like, components/17/model_serving, and that's the one that works. We do that in Databricks for some things; it's effectively what it is. But
Michael Berk (33:37)
Got it.
Ben (33:49)
an external repo for something where your audience is not interacting with a platform, but instead they're interacting with source code, it might make sense to keep it all in one model repo like that. It all depends on how you want to architect it. Like where do you want to create that hierarchy? Do you want it at the base level? Maybe it depends on how many files you have and do they interact with one another? Is it something where
everything is guaranteed to work at a particular version together, then do that higher up. So maybe before components, you have this runtime version selector, and people just drive into that, and each of those directories is the code that works in that environment. Pros and cons; it depends on how you want to do it.
If you're doing it that way, every time a new runtime becomes available, you would basically take that directory within your repo, copy everything that's in there into a new directory named for the new runtime version, and paste. And then make fixes to that copy. And then CI, your integration tests, would potentially be driving to different versions.
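As a sketch, the per-runtime layout being described might look something like this; the directory and version names are hypothetical:

```text
components/
  15.4-LTS/
    model_serving/
    sync_tables/
  16.4-LTS/        # copied from 15.4-LTS when the new runtime lands, then fixed up
    model_serving/
    sync_tables/
```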
Michael Berk (35:09)
Right.
Ben (35:12)
And you would probably want to think about drawing some line of, how long am I going to support this? Do I want to support a runtime version that released four years ago? Maybe? I don't know. Or do I want to just say that I'm only going to maintain the last three?
because every additional version that you're adding is going to add a bunch of time and complexity and costs to running validation checks and also maintenance burden.
Michael Berk (35:43)
Okay, that makes sense. Those were the main things that I had open questions on that I can remember. Ben, what are your thoughts? Like, what else would you review in this repo?
Ben (35:57)
If I was doing a PR review of this, from like an initial state of a repo, I would definitely go into that components model serving directory and look into, okay, what is he trying to do here? So then I would go into that concurrent custom pyfunc. And maybe it's a bad practice that I have, but I don't generally read READMEs, because I don't want them to influence how I'm reading the code.
Michael Berk (36:24)
Mm-hmm.
Ben (36:24)
I want to read
it as if I have no idea what's going on. I just want to see what my brain notices. So then my first thing here is to look at the user interface. So, best practice.
And then I would look at, I would probably just drill down from the top, like, okay, log model and test inference. What does that look like? Like, okay, it's a Jupyter notebook or Databricks notebook. Once I see that at the top, I would probably click back to the Python script if there is one. That's just personal preference. I don't like reading notebook source code.
Michael Berk (37:04)
I see what
you're saying.
Ben (37:06)
I hate seeing all the magic commands and stuff. It's just distracting to me. But this type of thing is more of, when I'm looking at a notebook, that's an end user's interface to something, and it's usually very verbose. There's lots of comments, lots of notes and stuff. I see that as a review the same way that I would review docs.
Michael Berk (37:11)
This is a really
Ben (37:32)
So before I review docs, I need to know how the code actually works before I can properly evaluate documentation. So then I would go to that other directory, the model-from-code one. Like, okay, it's another notebook. And then I would probably ask, is there a Python script that is here, or is it just notebooks? But considering we're, yeah, we're looking at notebooks.
Michael Berk (37:58)
It's just notebooks.
Ben (38:01)
Yeah, I would look for simple code smells that might inform how critical I'm going to be in looking at the code. So, code smell, it's actually a term that I hate, the way it's used in most cases. But to me, what a code smell is, is somebody writing code in a way that
leads me to believe that they might not be paying attention to minute details. If we look at the very top, sometimes there are smells on just the imports for me. Like, okay, so what do we got here on line 17, contextvars? Okay, we're gonna be dealing with a context manager, which is cool. That's usually a good thing to see, if there's something that we need to clean up state about.
And we don't want to have some super nasty code that's very verbose about cleaning things up. So when I see that, I'm like, okay, this is somebody who knows what they're doing. import uuid, cool, that's pretty standard. Line 19, that sets off an alarm bell for me that there are going to be some considerations to think about with thread safety, of anything that's operating in there, if there's a ThreadPoolExecutor being used.
I now have to start thinking: are there references to some global sentinel objects? Is there something mutable that we're accessing from within a thread? So I'm prepping myself for the questions I'm going to ask myself while I'm reading through the code, about additional code smells or things that feel off to me. And then there's a from typing import Any, that looks good. And actually the first nit that I see,
that you're probably going to roll your eyes at, is that line 18 should be below line 20,
Michael Berk (40:04)
This is it.
Ben (40:04)
and
line 19 should be above line 17.
Michael Berk (40:07)
Actually, this is Ruff formatted, and it sorts like A, B, C, D, E, F. But I agree, alphabetizing, yeah, alphabetizing imports is helpful.
Ben (40:12)
Oh yeah, yeah, yeah, yeah. That's right. You have that rule set.
Yeah, I think what we have in our rule set for MLflow is, I think there's a space between those. So it's always built-ins up top, in alphabetical order, and then the from statements have a space between them and the plain imports of a module. Yeah, you're right, you're right. So if I left a nit like that, you would dunk back on me and be like, dude, this is Ruff. And I'd be like, yep, you're right. And then you have import mlflow, pandas, the
Michael Berk (40:41)
Yeah.
Ben (40:49)
Postgres connector, and some Databricks imports: WorkspaceClient, VectorSearchClient. Cool. If this was not in a notebook, I would probably start thinking about, okay, why are connectors imported in the same location as user code would be, or as things that would be dictating the operation of an application or something?
So typically we would have connectors, things that interface with platform-type stuff, in like a utils module, in order to keep the code a little bit cleaner and reusable. Because if you're using a WorkspaceClient, which is a connection to the Databricks backend for your particular account, there's no way you're only doing that once in your code. And you would want to have some sort of factory
Michael Berk (41:27)
Mm-hmm.
Ben (41:47)
or constructor that just does that for you, and only define that once in a function. And then you can just import that and use it wherever you need to. Same thing with VectorSearchClient. But then when you're looking at pandas, MLflow, those two would be something where, OK, this is what a user would be actually running. So you're wrapping this up as an example of how a user would interface with stuff. So that's what's in my mind.
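A minimal sketch of that factory idea, assuming the Databricks SDK's WorkspaceClient and its default auth resolution; the function name is made up for illustration:

```python
from functools import lru_cache

from databricks.sdk import WorkspaceClient


@lru_cache(maxsize=1)
def get_workspace_client() -> WorkspaceClient:
    """Single place that constructs (and reuses) the workspace connection."""
    # Relies on the SDK's default auth resolution (env vars, config profiles, etc.).
    return WorkspaceClient()
```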
And then we go down and we're like, okay, we're creating a Python model. It's an MLflow PythonModel abstraction. We have an AI-generated docstring, which is not correct; the __init__ arguments should be in there. Well, you don't have an __init__ statement, so I guess that's fine. But a one-line docstring usually tees me up to be like, okay, AI wrote this,
because humans don't typically do that. And then you have a
definition of a public method that doesn't have a docstring, which is a little weird. But you're creating a subclass of something else, so it's not critical. It could be beneficial, though, since this is a tutorial for people, to have a docstring there that explains: what is going on here? What is this load_context thing? You could just provide a link to the MLflow docs to explain what this is, and
allow users to grok why this exists. It's because we do side loading to construct an object when loading it; there are a lot of things that go into that. But everything else, model config looks good. You're loading that, and that's set when logging the model. And you're setting some instance attributes based on things that are in that config. Looks good.
If there's no table name, raise a ValueError.
So self.synced.
Yeah.
the falsiness of this. I guess that's correct,
Michael Berk (43:57)
That's a good point
actually.
Ben (43:59)
So if self.sync_table_name equals false, then line 44.
Michael Berk (44:05)
or pass. It's not None.
Ben (44:05)
You want it to be like...
Yeah, because an empty string would also be invalid, right? So the ValueError that you're raising isn't accurate, because it could be, hey, I provided it as open-close quotes. It's not a valid use of the API. So you could make the ValueError a little bit more targeted. Same thing with the database instance name.
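To make that concrete, a hedged sketch of what the validation inside load_context might look like; the attribute and config key names are guesses for illustration, not the repo's exact source:

```python
def load_context(self, context):
    config = context.model_config or {}
    self.sync_table_name = config.get("sync_table_name")
    self.database_instance_name = config.get("database_instance_name")

    # `if not ...` rejects both None and empty strings, so the error message
    # should say "missing or empty", not just "missing".
    if not self.sync_table_name:
        raise ValueError("model_config['sync_table_name'] must be a non-empty string.")
    if not self.database_instance_name:
        raise ValueError("model_config['database_instance_name'] must be a non-empty string.")
```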
Michael Berk (44:21)
Mm-hmm.
Ben (44:37)
but that's nothing really bad here. Getting an index. What happens if the index isn't found?
Michael Berk (44:46)
That'll just throw whatever the vector search client error is
Ben (44:50)
Got it, yep. So that's the right way to approach it. And I see this not just in GenAI-generated code, but also from some people that file PRs against MLflow, like open source contributors. They just don't know better. And sometimes we do it too, and somebody else will catch us on a PR and be like, dude, what are you doing? Where they'll wrap this in like a try-catch
and try to catch the exception and then re-raise the error. There are places where we do that. And it's not for reasons that people might think that we do it. We'll do stuff like that because the underlying exception is so confusing. It's like meant for another software engineer. So if we just surface that to our users, the user's like, what? I don't know how to debug this. Like, what went wrong?
Michael Berk (45:38)
Mm-hmm.
Ben (45:44)
Or if it's something that's buried so far deep in the call chain that the stack trace becomes so complex to read through, it's way easier for us to just eat that exception, take the message that comes from it, and raise our own message that explains the context of what went wrong and how to fix it, if there is a fix for it.
Sometimes it's like, hey, read more about this in the docs at this location, raised from the base exception. And that gives you a nice clean stack trace, and it gives you the full stack trace, instead of just eating the exception and then raising a new one. You always want to raise from the base exception, so that your message to users is at the very top and they can just read it really quick and be like, okay, I get this. So yeah, but if the message is super clear,
Michael Berk (46:31)
Right.
Ben (46:40)
And like the vector search client: if the index isn't found, it's completely unambiguous what went wrong, and their message is really good. Don't reinvent the wheel; don't wrap their exception. It's way cleaner to see it thrown from that location.
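For the cases where wrapping does make sense, a short sketch of the raise-from pattern being described; the client, exception type, and message are illustrative:

```python
def load_index(client, index_name: str):
    try:
        return client.get_index(index_name)
    except Exception as e:
        # Chaining with `from e` keeps the original stack trace underneath,
        # with the clearer, user-facing message at the very top.
        raise RuntimeError(
            f"Could not load vector search index '{index_name}'. "
            "Check that the index exists and that you have access to it. "
            "See the project README for setup steps."
        ) from e
```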
Michael Berk (46:56)
Right.
Ben (46:57)
And then we're defining self.workspace_client. So I would probably move line 56 to the very top. The reason being, when looking at anything that interacts with external systems, the name of the game is short-circuit. So if, in the execution of
Michael Berk (47:08)
That's really good.
Ben (47:25)
what you're trying to do, if a condition is not met for processing the entire thing, like, hey, I can't create this, then nothing before that check that does any sort of action should ever be done. So it should be the first thing that you do. Don't even bother trying to load anything, don't set any object properties, you don't need to do any of that.
Michael Berk (47:43)
Hmm.
Ben (47:51)
If it's just a bare check like, do I have a connection to the service? Fail that first, right away. Depending on what you might be doing before that point, it could be milliseconds or microseconds to do it. But if you don't get into that habit of thinking, how do I short-circuit this as fast as possible, you won't be thinking of that when you write code that has
a five-minute processing time for doing something, and only then checks, oh, now I need to create a client to do this operation. Users will be sitting there waiting for five minutes thinking that things are all working well. They're like, okay, I'm getting progress. And then it throws an exception. That's so frustrating for users. So I always think about stuff like that, and push it up to the top as much as possible.
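A minimal sketch of that fail-fast ordering, with hypothetical helper names; the point is only that cheap validation and a connectivity check come before any expensive work:

```python
def sync_documents(table_name: str, batch: list[dict]) -> int:
    # 1. Cheap input validation first: fail in microseconds, not minutes.
    if not table_name:
        raise ValueError("table_name must be a non-empty string.")
    if not batch:
        return 0

    # 2. Confirm we can even reach the service before doing heavy processing.
    client = get_workspace_client()
    client.current_user.me()  # raises immediately if auth/connectivity is broken

    # 3. Only now do the expensive part. expensive_transform and write_to_table
    # are stand-ins for the real work.
    processed = [expensive_transform(record) for record in batch]
    write_to_table(client, table_name, processed)
    return len(processed)
```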
Michael Berk (48:37)
Right.
In the last few minutes, can you go a little bit more high level? All the syntax stuff is great and the structure is great. But going a bit faster, let's say you don't have an hour to review in depth, what would be the things that you quickly gloss over and look at?
Ben (49:03)
Like proper usage of APIs, stuff like how much is in a particular method or function. Can I read it really quickly? Do things seem like they're defined in a way that's maintainable? I'm not a pedantic code reviewer; I don't lose my mind over style stuff that much.
I think more about functionality. Like, how is this going to run when a user hits it? Does the API make sense? Is it overly complicated, this interface that you're providing? So there, okay, if you go back up to line 131, if we just look at this interface, this is a private method. Fine. Where's our predict?
model input? What's the type of the model input? So I would always have a type there. Let's say
Michael Berk (50:02)
reason I
dropped that is because MLflow will try to infer the signature based on it.
Ben (50:07)
and you don't want that.
Michael Berk (50:08)
The signature inference was incorrect. Well, not incorrect, it was just acting weird, and I wanted to specify a signature. I forget the rationale now that I'm saying it out loud.
Ben (50:19)
Got it. Yeah, we do that to simplify signature inference so that it skips having to run a prediction, like an execution, because that can be expensive. So I would put a nota bene for something like that. And then I would specify it as the Any type,
Michael Berk (50:25)
Mm.
Mm-hmm.
Yeah, okay. I think I did that here.
Ben (50:40)
and then
just let MLflow handle signature validation. But in most public APIs, when I'm looking at that interface, I look at what is actually specified as that type. And outside of the world of Pydantic, we're not really using those types for anything but letting another software engineer who's reviewing the code, or a user, know: hey, this is the interface that you want to use. It's not used
at runtime, Python ignores it, it's just for humans. Except Pydantic; Pydantic will enforce that stuff, which is how we do signature validation: when you specify that type, we actually create a Pydantic model out of it and then validate that against your input. So when I look at the public interface, I check to see, is this complicated
as a user uses it? This isn't a good example, because this is based on you subclassing something that's widely used everywhere. But if you're looking at, yeah, there's another PR where you might have some utils or something.
Michael Berk (51:35)
Hmm.
Ben (51:50)
What do we got here?
Yeah, like the interface here, clean.
nothing to really comment on there, because it's straightforward. What you see is what you get with those. I wouldn't even need to read the docstring, because your arguments are named so well that it's like, yeah, this makes sense to me. And then we get into, okay, that one's a little bit bigger. There's some pedantry that goes on with some people who come from different languages. They're like, no function should have more than
Michael Berk (52:05)
Mm-hmm.
Ben (52:24)
five arguments, anything more than that is too much. And then you see major Python libraries that have 30 arguments, but they're all named really cleanly, and you understand what they're used for, and most of them are optional. It's like, yeah, it's a lot, but it kind of makes sense.
Michael Berk (52:31)
Mm-hmm.
Mm-hmm.
Ben (52:44)
So yeah, you have create a sync table that requires a client. If you go back up to like what that signature looks like.
databases, the RDBMS database, table names, primary keys.
Michael Berk (52:57)
And this is one-to-one naming
with the Databricks clients. I think that database instance name and logical database name are horrible, especially because there's a UC version and a Postgres version. So I just stuck with Databricks naming convention and then made it clear in the docstring.
Ben (53:15)
Yeah, I don't see anything here where I'm like, this sucks. It's pretty good. I would probably change the constant that you have there.
Michael Berk (53:20)
Cool. Let me see if there's anything.
The constant?
Ben (53:28)
depending
on how, like, you have an uppercase string that's defined as a default there. We typically don't do that. We'll allow users to pass in the full enumerated constant value, or any combination of casing, so that if they want to just type lowercase continuous, that's allowed. We'll handle that in an enumerator,
Michael Berk (53:43)
Hmm. This is a call out.
Ben (53:55)
and that enumerator is what's used throughout the code base. So that string that you have here should be defined in one location and one location only.
Michael Berk (54:03)
Yeah, this is
an enum that can come from the Databricks SDK as well. So that's a really good call out.
Ben (54:09)
Yeah. But you don't want to have hard-coded strings in code. The reason being, when you need to change that API, or you need to add a new enumerated value to something, like, we now support this other thing, you now have to find all of those strings in your code. And the probability, for a large enough code base, that you're going to miss one is pretty high.
Congratulations, you just created a bug that you now have to go and fix later on, and maybe do a patch release for. But with a constant, you change the string in its definition in one place and one place only, and everything else inherits from it. Clean code, happy developer, bug free. It's just easier.
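A sketch of the single-definition, case-insensitive enum being described; the name and values here are illustrative, not necessarily the Databricks SDK's own:

```python
from enum import Enum


class SchedulingPolicy(str, Enum):
    """Single source of truth for these string values across the code base."""

    CONTINUOUS = "CONTINUOUS"
    TRIGGERED = "TRIGGERED"

    @classmethod
    def parse(cls, value) -> "SchedulingPolicy":
        # Accept the enum itself or any casing of the string ("continuous", "Continuous", ...).
        if isinstance(value, cls):
            return value
        normalized = str(value).upper()
        for member in cls:
            if member.value == normalized:
                return member
        valid = ", ".join(m.value for m in cls)
        raise ValueError(f"Invalid scheduling policy {value!r}. Expected one of: {valid}")


policy = SchedulingPolicy.parse("continuous")  # -> SchedulingPolicy.CONTINUOUS
```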
Michael Berk (54:55)
Cool, let me see if there's anything more.
Ben (54:57)
The only other thing is unit tests.
So you've got all of these functions that are defined; maybe they're user facing, maybe they're internal, I'm not sure which. There should be unit tests for each of these to verify that, for the input or series of inputs that you could pass into these things, you're processing the output in the way that you expect.
Michael Berk (55:00)
True.
Ben (55:26)
The only exception to that is if you're writing a wrapper function around another function from a platform service. At that point, if you're creating an interface in your code whose only purpose is to just call another API, you don't really need to test that, because there's no logic going on. But if you're doing anything where you're parsing, like here on line 280, you're doing something to principals,
it might be good to just have a really quick, super simple, dumb test that mocks that call to Postgres and returns the state of the principals on that returned object.
And you'd also want to make sure that you have a return value from this, so that it is testable, and so that users, if they're hitting this, get some confirmation back that the API actually succeeded. Because if the return type is None, you don't really know, did this work? In order to validate, you now have to click through the UI or call another API independently to verify: did the table get created with the right schema?
Michael Berk (56:22)
Got it.
Mm-hmm.
Ben (56:38)
But if you're returning the state of something that succeeded, then you now have composability for users.
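A hedged sketch of that kind of unit test; grant_permissions here is a toy stand-in for the reviewed helper, not the repo's real API:

```python
from unittest import mock


def grant_permissions(cursor, table: str, principals: list[str]) -> list[str]:
    """Toy stand-in: issue one GRANT per principal and return what was applied."""
    applied = []
    for principal in principals:
        cursor.execute(f'GRANT SELECT ON {table} TO "{principal}"')
        applied.append(principal)
    return applied


def test_grant_permissions_issues_one_grant_per_principal():
    fake_cursor = mock.MagicMock()  # no real Postgres connection needed

    result = grant_permissions(fake_cursor, "animal_facts", ["alice@example.com", "bob@example.com"])

    assert fake_cursor.execute.call_count == 2
    # Returning the applied state gives callers (and tests) something to compose on.
    assert result == ["alice@example.com", "bob@example.com"]
```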
Michael Berk (56:46)
Okay, super helpful.
Ben (56:47)
Except for deletes.
Deletes are the one thing where you should just return None. Do you know why?
Michael Berk (56:57)
Should just return None? No, I do not.
Ben (57:00)
Why would it delete?
What are your expectations? This is my question for you, and I would leave this as a PR comment. What are your expectations if a user calls delete on a table, and they're not sure if it worked, so they call delete again? What should happen?
Michael Berk (57:23)
Well, what does happen is it shows table not found, like it throws.
Ben (57:29)
Exactly. What is the typical expectation for most APIs when you call delete multiple times on something?
Michael Berk (57:38)
The first time, well, for Databricks, the first time it returns like a true message. And then the second time it'll throw.
Ben (57:45)
Which can be frustrating for users, right? So we don't do that with our APIs. We introduce this thing called idempotency, which means if I call delete, and then in my code, for some reason, the way I've written it, it's calling that again, it's just silent. It might throw internally, because you get a database error like
row not found or record not found, index not in database, we eat that exception, but specifically just that exception to make it so that this operation is idempotent. You can call it multiple times. It doesn't break your code. It doesn't break your flow. It just is a no op. So what we'll do is as part of that delete statement, we'll be like, check to see if the record exists and if so drop it. But if the record doesn't exist, just return.
And that way users can be fairly confident that the record has been deleted. Because if it fails, we actually make darn sure: on the initial attempt, if it finds a record and then it conducts the delete, if any error happens there, we surface that immediately, like, could not delete because of this. But we only trigger that if the row actually exists. That's a way to make it idempotent.
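A small sketch of the idempotent delete pattern, assuming a DB-API style cursor; the table and column names are illustrative:

```python
def delete_record(cursor, table: str, record_id: str) -> None:
    """Idempotent delete: calling it twice is a harmless no-op, not an error."""
    cursor.execute(f"SELECT 1 FROM {table} WHERE id = %s", (record_id,))
    if cursor.fetchone() is None:
        return  # already gone; nothing to do

    # The record exists, so any failure past this point is real and should surface loudly.
    cursor.execute(f"DELETE FROM {table} WHERE id = %s", (record_id,))
```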
Michael Berk (59:05)
got it.
Okay. Well, it seems like a lot of this feedback is sort of syntactical, plus a few strategy changes. But do you have any other advice for building something like this? Because often applied AI people will be building a lot of code, and again, field engineering at Databricks would like to be a little bit less redundant with their implementations and share these implementations between people.
So if you're on a team, you want to share these utilities with your teammates. If you're in Databricks FieldEng, hopefully everyone uses it. Do you have thoughts or guidance or advice on how to structure this type of utility?
Ben (59:44)
I would put a lot of comments in that stuff, in the code, because people are going to be reading the code themselves, like, hey, I want to use this snippet. I would create, effectively, documentation blocks in the code that are not inline embedded, per se. You can have single-line comments explaining what each piece is doing. But then, before that, have a tutorial that explains each of the components in this thing,
why it's done this way, why not this other way, so that people don't end up with those comments embedded in their source code as walls of text. And then I would also make it very apparent that this has been tested, and show where the test exists that validates this, to kind of remind people: hey, when you're writing your own version, or you're changing this in some way,
Michael Berk (1:00:13)
Mm-hmm.
Ben (1:00:37)
here's an example test that validates that this works correctly. So they can have both of those: here's the source implementation, and here's also the source test. Maybe I need to mutate it, I need to add 30 lines of code here, or change this around in some way to make it work for my use case. If they have the test there, when they run it with their changes, the test is going to fail. So now it's up to them to fix the test
Michael Berk (1:00:50)
Right.
Ben (1:01:06)
so that the behavior is what they expect. And now they have a more robust implementation.
Michael Berk (1:01:11)
That's a really good point, yeah. Testing in Databricks is a horror story, so I've been trying to push that off. If you're developing locally and then pushing, as should be done, then it's great. But if you're doing notebook-based development, it doesn't play nice. So that's a good point.
Ben (1:01:24)
Mm-hmm.
Yeah, I mean, you're
looking at, at that point when you're testing a notebook, that's what we talked about at the beginning, right? That integration suite with asset bundles. That's what we do. The only way to really do it is, we have an end-to-end use case and we have dozens of these notebooks. Like, this one's for PyTorch, so it trains and does tuning and then deploys it and all this crazy stuff. We can only really know,
Michael Berk (1:01:39)
Mm-hmm.
Ben (1:02:00)
does this work end to end, by running it. And then you see, okay, it touches all these systems, we're good to go, exit code zero, we're awesome. And if it fails, you do get that debugging capability within that notebook, because you can go and open that notebook up and see, cell eight had this issue, here's my stack trace, where are my logs for this cluster? Oh, that's weird. Now I know how to fix this. Let me go and fix it.
But when you're doing anything like these utils, or things that people will be embedding into another person's repo, you've got to have pytest for that stuff.
I mean, you should have those unit test examples and explain to people: when you're developing this stuff, yeah, you can use notebooks to write code and create an application, but there's a reason why IDEs exist. There's a reason why people write code in the way that they do, the way software engineers do it. And it's because we know how stupid we are
and how bad we are at our jobs. That's why these things exist. We make mistakes all the time and unit tests help protect us from ourselves.
Michael Berk (1:03:11)
Yeah. All right. Well, I think that's an episode. I did not summarize because I was busy sharing my screen, but just from memory, a few things that stood out: integration testing for end-to-end implementations is really something you should consider. And there are different levels of testing. There can be unit tests, mocked unit tests,
tests against live resources that are sort of integration based and then end-to-end tests. And this is all a continuum. So there's also levels in between them. Let's see what else. Yeah, there are a few others, but I think that was actually one of the main things that I was thinking of. Cool.
Ben (1:03:54)
Composability, testability, big, important. And then, how do you create a reference repo that is effectively versioned against its environment of execution?
Michael Berk (1:03:56)
Mm-hmm.
Yeah.
And also thinking very critically about your users and their journey. If they're going to be reading code, make it more readable. If they're going to be just pip installing, maybe test a little bit more thoroughly for a specific implementation. Docstrings are important. So think about the high-ROI implementations that would make your code more usable for your users.
Ben (1:04:27)
Yeah, docstrings are one of those things where nobody realizes how powerful they are until they see a user who is not operating in an interactive environment with a web browser open right next to them. A lot of people do that, right? They're like, I'll just look this up in the docs real quick. But if you've ever seen somebody using Vim for writing Python code,
Michael Berk (1:04:49)
Mm-hmm.
Ben (1:04:50)
like vi or Emacs or something, they just have a terminal window open. That's it; their screen is just a terminal window. So if they want to know how an API works, they just type help, open paren, the name of the function, and then it prints out the docstring. And if that docstring doesn't match what they can see as the interface, they will lose their mind. They'll be like, what the hell Mickey Mouse bullshit is this? This is not useful.
Michael Berk (1:05:16)
Yeah.
Ben (1:05:19)
Or if the docstring has no argument explanations whatsoever, you now have to break out your magic ball and be like, I think this is what they mean by this. If there are no types defined, they're playing a guessing game. You're like, I think this expects a string. Try it out. Nope. Maybe it's a list of strings. Nope. Is it a dictionary? You have to try all this stuff, and you get frustrated and annoyed.
So docstrings are super important for anything that's public facing. And if it's not public facing, prefix it with an underscore. That lets those vi and Emacs people know that, okay, this is an internal thing. And you could have docstrings for that for other developers, or you could just leave it as a simple comment: this is what this is for.
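A quick sketch of the kind of docstring that holds up under help() in a bare terminal; the function and its arguments are illustrative:

```python
def create_sync_table(table_name: str, primary_keys: list[str], policy: str = "TRIGGERED") -> dict:
    """Create a synced table and return its state.

    Args:
        table_name: Fully qualified name, e.g. "catalog.schema.table".
        primary_keys: Column names used as the primary key of the synced table.
        policy: Scheduling policy, "TRIGGERED" or "CONTINUOUS" (case-insensitive).

    Returns:
        A dict describing the created table, so callers can compose on the result.
    """
    ...


help(create_sync_table)  # prints the signature and the docstring above
```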
Michael Berk (1:06:08)
Heard. All right. Until next time, it's been Michael Berk and my co-host, and have a good day, everyone.
Ben (1:06:13)
Ben Wilson.
We'll catch you next time.