Arvid Kahl talks about starting and bootstrapping businesses, how to build an audience, and how to build in public.
Hey, it's Arvid, and this is The Bootstrapped Founder. I recently realized something while building PodScan, my podcast database system that does a lot of background data extraction and AI analysis for my users: I think I've stumbled upon a couple of AI integration best practices over these years that a lot of people might not be fully aware of or just have never experienced. So today I want to dive into the concepts that I've found not just useful but essential for maintaining and operating this mission-critical data handling with LLMs, AI platforms, and AI tooling. A quick word from our sponsor here, paddle.com. I use Paddle as my merchant of record for all my software projects, the ones that involve AI and the ones that don't.
Arvid:They take care of all taxes and currencies, they track declined transactions, and they update credit cards in the background so that I don't have to. It's really cool. And they allow me to focus on dealing with my competitors and my customers instead of banks and financial regulators. So if you'd rather just build your product, well, then check out paddle.com as your payment provider and merchant of record. So I was reminded of these AI practices that I've established by a tweet that I read from Greg Eisenberg.
Arvid:He said something along the lines of: keeping up 100% with all the new AI tools and the models and their capabilities and the benchmarks and all that is pretty much impossible at this point, and something that works today might fail tomorrow. And that's very true. It's an observation I've made myself.
Arvid:It's probably my biggest learning in building PodScan, because I've built a couple of SaaS products in the past, but AI is the new thing. And I think I realized that I've turned all of this into not just a process, that's also true, but an implementation style. So that's what I'm going to share with you here today. I'm going to share what I've built and how I've built it. So whenever I use an AI call, be that to a local model that I have installed on a GPU-enabled server somewhere or a cloud model on OpenAI, Anthropic, Gemini, whatever it might be, and there are so many, I'll get to the diversity here.
Arvid:I always have a migration pattern implemented in the code. So I extract all of my API calls into services. Right? That's just how I generally structure my code. I want these services to internally handle all the connection stuff, all the prompt massaging, prompt construction.
Arvid:And in addition to the specific prompt that I want for each API call, the service will format it in a way that makes it easy for the endpoints to consume. And all of these services always operate in what I would call a state of permanent migratability. That means they can always use the latest version and the latest model that I have chosen for this particular purpose, or at the very least, they can use the version of the prompt and the model that I used previously to that. And I realized that I needed this when I started implementing GPT-5 a couple of months ago, after having used GPT-4.1 for a long while. In fact, I started experimenting with this, like, the day the API became available, and it was quite horrible.
Arvid:These experiments were like, oh, wow, this is very different from what it just was. Now I need to take care of a lot of things. A lot of API changes happened, and a lot of the nuanced parts of the prompt that worked for me in 4.1, well, they didn't work as well as they did before.
Arvid:My prompt was crafted for 4.1, which it ran on for at least half a year, but it just was not reusable for GPT-5. So then I thought, how can I deal with this? My prompt for GPT-4.1 had a JSON format expectation. I told the OpenAI system, and it's part of the OpenAI API where you can say, I want JSON to come back, and you're guaranteed to get JSON. It's called JSON mode.
Arvid:And everything in the prompt was aimed at then creating that JSON structure more effectively. And when I migrated that particular call to GPT-5, it would still create JSON data, but it would drop certain keys. And it was not really as reliable as it used to be before. And it was strange. I tried to figure out the reason, and it turned out that GPT-5 was more aimed at being compliant with structured JSON schemas, which is different.
Arvid:Right? Instead of just saying, this is gonna be JSON, whatever you think it should be, you say, this is the exact JSON structure, the data structure that you're gonna output. And you define the keys and all the potential values in a standardized format. And then you explain in your prompt how these are filled. And it's fundamentally similar to what it was before, because you still say, hey, this is JSON and I want it structured like this.
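To make that concrete, here is roughly what the two styles can look like with the OpenAI Python SDK on the Chat Completions API. This is only a sketch: the schema, field names, and prompts are invented for illustration and aren't PodScan's actual code.

```python
from openai import OpenAI

client = OpenAI()
transcript = "Full episode transcript goes here."

# Old style: JSON mode. You only promise "valid JSON comes back"; the prompt
# itself has to describe which keys you expect.
legacy = client.chat.completions.create(
    model="gpt-4.1",
    response_format={"type": "json_object"},
    messages=[{"role": "user",
               "content": "Return the guest name and topics as JSON.\n\n" + transcript}],
)

# New style: a structured JSON schema. Keys, types, and required fields are
# defined up front, and the output is constrained to exactly that shape.
structured = client.chat.completions.create(
    model="gpt-5",
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "episode_analysis",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "guest_name": {"type": "string"},
                    "topics": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["guest_name", "topics"],
                "additionalProperties": False,
            },
        },
    },
    messages=[{"role": "user",
               "content": "Analyze this transcript.\n\n" + transcript}],
)
```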
Arvid:But it's technically different, because the way you define what it's going to look like is different, and the parameters you give to the call are different. So that needed to change already. Right? And since I didn't wanna deploy new experiments with a new model and just hope they produce the same data, because you have to run this in production on production data, I came up with this internal migration strategy. So for either a certain time period, or while I'm testing on my local and my staging environments, I could say, well, use the old prompt with the old model and the new prompt with the new model at the same time, or one after the other.
Arvid:Maybe even a completely different data structure, a completely different API call, because OpenAI also wants people to move away from their chat-style API, which I was using back then, to the response-style API. These are specifics; it doesn't really matter. But what I would do is I would run both. For each piece of data that I wanted to analyze, I would run the old one and the new one, and I would log both results, see where the major differences were, do a diff between the JSON files, because they were structured similarly, so I could see what was missing, what was gone, what was new.
Arvid:And if there were any differences, the system would tell me, show me the diff, and make it accessible for debugging. And in the request that it was in, it would respond with the new data to whatever functional procedure was calling it. And for those servers running this dual approach, obviously, this incurred twice the cost, but I was able to see what the old model did, what the new model did, and how the differences between prompts would impact the quality, predictability, and reliability of this particular data. So for debugging, or for just migrating over, it was really, really useful. And I found this very, very useful for almost anything you do with AI tool calls, whether you're using an LLM to get some kind of response or data.
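As a rough sketch of how such a dual-run service can be structured (in Python here; `new_call` and `legacy_call` stand in for whatever wraps your actual API requests, so none of this is PodScan's real code):

```python
import json
import logging
from difflib import unified_diff
from typing import Callable

logger = logging.getLogger("ai_migration")

def analyze_with_migration(
    transcript: str,
    new_call: Callable[[str], dict],      # e.g. the new model with the structured-schema prompt
    legacy_call: Callable[[str], dict],   # e.g. the old model with the old JSON-mode prompt
    dual_run: bool = True,                # flip this off once the new prompt is trusted
) -> dict:
    """Run the new prompt/model, optionally alongside the legacy one, and log any diff."""
    new_result = new_call(transcript)
    if dual_run:
        old_result = legacy_call(transcript)
        diff = "\n".join(unified_diff(
            json.dumps(old_result, indent=2, sort_keys=True).splitlines(),
            json.dumps(new_result, indent=2, sort_keys=True).splitlines(),
            fromfile="legacy", tofile="new", lineterm="",
        ))
        if diff:
            # surface the disagreement for debugging, but still return the new data
            logger.warning("Legacy and new model disagree:\n%s", diff)
    return new_result
```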
Arvid:Now, unless it's a purely conversational use case, where you have, like, an AI chat with your customers or something like that, where it wouldn't matter much, if you're doing something like background data analysis, number crunching, semantic analysis, something like that, it really helps to have this migration pattern where, for testing purposes at the very least, you can run both versions at the same time. You see good results. You can really analyze it. And if you find that the new results are not as good as you thought, well, you can always roll back and just use the legacy version for now, and then tinker with your new prompts and work on your testing and staging systems to get the new version ready. Because you know you will have to migrate eventually; these API systems will be deprecated at some point anyway, which is the whole reason why I moved from 4.1 to 5. In the video on YouTube, Sam Altman and all the other wonderful nerds over at OpenAI were talking about GPT-5 and how they were gonna be deprecating all the old endpoints because there were just so many models around.
Arvid:And look at it now. There's, like, what, twelve, fifteen different versions of GPT-5. But I digress. It's just that these old things will eventually go away, and you need to be able to migrate to the newer ones. And for that, you need to have a really strong path, because AI systems are fickle.
Arvid:They are nondeterministic. So you need to see what the differences are and adjust, which, funny enough, is also a good use case for AI, because you can take the old data structure and the old prompt, the new data structure and the new prompt, get the results for the old one, get the results for the new one, throw all of this into an AI system and tell it, well, why did this happen? And how can I reformat the new prompt to get results that look more like the old results? Of course, this is all kind of witchcraft when it comes to, you know, what exactly that prompt then suggests to you as a solution; that may well be hallucinated too. But it is a good way of trying to figure out how you can get to the kind of data, the structure of data, that previously worked well for you.
Arvid:It's always quite helpful to diff the JSON data that comes back, just to see what's missing, what's different. Often, that'll inform how you reshape your prompts. You might even have Claude or OpenAI Codex do this for you automatically with a command or something. And this migration pattern has made it into every single service that I have that communicates with outside machine learning models. It's been a godsend, really, just to see what the differences are, how things react differently on new models, and how nuanced your prompts have to be to get reliable data or comparable data to begin with.
Arvid:Because once you've set up a prompt and it works, you really wanna use it for as long as possible, because the other things in your application build on top of it. And the migration pattern helps with that, in establishing it and in maintaining it. It also helps that if the new service suddenly experiences a degradation, you can, with the flip of a switch, go back to the old one and get your data as reliably as you did before. It's kind of a derisking strategy too, because sometimes things go down. Right?
Arvid:Sometimes, all of a sudden, demand on a service goes up so much that things fall over, and you just want to flip back to the old one that nobody uses, and you're going to be fine until the new thing is restored. With this pattern, that's a config change, and that is quite substantial. Now, that was all about migrating between services. Let's talk about handling differences that are not in the model but somewhere else. And I've realized something that OpenAI have kind of, not necessarily hidden, but made not very apparent in the documentation: some of these cloud services offer what we might call service tiers.
Arvid:And these are different priorities for how important your API call is and how quickly it gets responded to, and you pay accordingly. So if you're a developer who reads documentation thoroughly, you probably spotted this quite early in using OpenAI and these kinds of platforms, but it took me a couple of months to figure it out, which should tell you everything about how quickly and how thoroughly I read documentation. For OpenAI, I figured out at some point that every standard request you send is billed in a default tier. That's the price you find on the website. What is it currently for GPT-5?
Arvid:It's like 40 cents per 1,000,000 output tokens for the GPT-5 Nano model. That's the price on the page. But if you look a little bit closer, you can also find that they have something like batch requests, which are significantly cheaper. They are usually half the price or something like this. But batching isn't always useful for people doing, like, on-the-spot analysis, because you don't know when your batches will be full on your input side and when they'll be processed fully on the output side.
Arvid:So for anything that's kind of semi-synchronous in nature, where you need to send the request and then expect to work with the results shortly after, batching is not the most useful approach. But what I failed to realize is that between this default tier that everybody pays and batching that might not be useful, there's another tier, at least for OpenAI. It's called Flex. And that tier is effectively billed at half the price, half the price, with the caveat that it might be slower, might take 50% longer, and at certain points might not be available at all. But it's still the exact same model with the exact same data quality.
Arvid:It's just a little bit slower, but significantly cheaper. And once I figured this out, I knew I was gonna use the Flex tier for all my back-end stuff, for all this stuff that runs in the background, that happens behind the scenes. Because for PodScan, that stuff is just extracting who's the guest on the show, what are they saying, summarizing the full episode. The fast part of PodScan is transcribing and making content fully searchable.
Arvid:I try to be as fast as possible there. The not-as-fast part is doing all this extra background analysis that eventually makes it into the system. And if I can save 50% of my cost on this, clearly, I can do twice as much. And that means twice as much value for my customers at the exact same price that I would pay, just a little bit slower. And it's background stuff anyway.
Arvid:So I implemented this into all my extraction and inference background jobs. And since OpenAI has implemented auto-retries for this in their own SDK, I didn't even need additional logic for that. What I did implement in addition was a fallback. So if your request to the Flex tier runs into a 429 error, which rarely happens and means that the system is kind of overwhelmed or you're rate limited in some way, well, I would then set the service tier for that particular job to the next highest one, which is standard. And then it would retry and actually get the data.
Arvid:So it tries a couple of times with Flex to save 50%. And if that doesn't work, it eventually sets it to default priority and you pay the full price. You pay twice as much, but you get it done in those situations where there's overly strong demand on the Flex tier. And I've implemented this at scale, and it led to an immediate 50% price reduction, because the Flex tier is quite available most of the time. And I'm not, like, spamming it anyway.
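A minimal sketch of that fallback with the OpenAI Python SDK and the Responses API; the model name and the summarization prompt are just placeholders, not PodScan's actual job code.

```python
from openai import OpenAI, RateLimitError

client = OpenAI(timeout=900.0)  # Flex requests can be slow, so allow a generous timeout

def summarize_episode(transcript: str) -> str:
    """Try the cheaper Flex tier first and fall back to the default tier on a 429."""
    for tier in ("flex", "default"):
        try:
            response = client.responses.create(
                model="gpt-5-nano",
                service_tier=tier,
                input=transcript + "\n\nSummarize this episode in one short paragraph.",
            )
            return response.output_text
        except RateLimitError:
            # Flex capacity is exhausted or we're rate limited: retry on the standard tier
            continue
    raise RuntimeError("Both service tiers were rate limited")
```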
Arvid:I may have, what might it be, like 20,000 requests a day. So it's not that much, I guess, compared to other people using this, but still, it was really cool. And because I have so many input tokens and so few output tokens, because I throw in a whole podcast transcript, hours' worth of text, just to get a couple of JSON elements and summarized data back, so a lot in and a little out, the cached tokens on that tier are also half price, and that's quite significant. It then allowed me to immediately double the amount of extraction and inference I do, which increases the overall quality of the platform for effectively free. Right?
Arvid:And that's apparent not just to my paying users, that things are better and more thoroughly analyzed, but also to my prospects. And this ultimately leads to a higher conversion. Just a little, but still, it matters. And there's more data in the background for people to draw on. So in your API calls to any kind of AI system, check each platform that you use and look for something like service tiers, and consider how long you can allow something to process.
Arvid:You might even wanna think about batching. Batch pricing costs just as much as Flex processing on the OpenAI side. And if you can't batch and need something kind of asynchronous but somewhat synchronous, go for the Flex tier. If you can batch, batch away. It's nice anyway to do stuff in the background and have all of this handled whenever it gets done.
Arvid:You don't need to then build complicated retry logic and all of that to keep your workflows going. Just consider that you send it off, it comes back at some point, and that's it. The clear indicator here is that if you can batch enough the infrastructure for it, that's great. Flex pricing only exists on a couple models though, so you need to check that out. It's g p t five five point one at this point at least, o three, o four models.
Arvid:Models like Codex and g p t five Pro and four o and real time audio might not have flex pricing as easily available, just check that side. It depends on the model. So you need to figure out this on a per platform, per model basis. But if those tiers exist and you're not using them, well, that's just financial negligence as a founder. Right?
Arvid:It's very simple to set the service tier. It's just literally one property on the API call. It's very simple to default back to the standard if something is not working. And if you have stuff that really needs to go through, to go through even if there is a lot of congestion, there is another service tier and that's priority, which costs double, but you will get results even quicker. You have to understand that priority might also not exist without the certain models there, but that's really what Greg Eisenberg was getting at with his tweet.
Arvid:There's so many models, so many different ways of using them that you really have to be flexible in implementation and how you optimize for this because things change all the time. And this brings me to something a bit more well known, I guess, but something I found really makes a difference if you have a lot of data to analyze, prompt optimization. In this case, front loading the thing that repeats itself in your prompt. Front load your system prompt. If you analyze the same bunch of data multiple times, start your prompts with that data and make it verbatim the same every time you use a piece of data.
Arvid:For me, that's a full transcript. I start my system prompt. You're like a transcript analysis expert. You do this and that and all that, and then I fully paste the transcript. And then at the end of the transcript, I say the specific thing that needs to be done.
Arvid:Sometimes it's looking for a particular kind of guest or finding sponsors or asking a customer's question on it, like, is this about the thing that I really care about for my business? That my customers can set this up in Potscan. They can ask a specific question of every transcript out there. But they always front load the actual data because those will be the cash tokens. Those will cost you 10% of what tokens that are different between prompts might cost.
Arvid:And I know my data case is special because PodScan has these massive transcripts that many questions are asked on, but if you're doing anything meaningful where you throw data into an AI system for analysis and do the same data multiple times for different kinds of analysis, put that data that will repeat itself first and put the stuff that doesn't repeat itself after. So to make the order absolutely clear, system prompt describing what the thing is, then system instructions that are always the same, like you will extract data in this format and so on, then the data that's potentially duplicated between prompts, and then the specific instructions for each prompt. So don't go with, like, you're doing this, you're going to extract this kind of format and here's the data, and then the data is all different. And then you say something at the end that repeats itself. You don't want that either.
Arvid:So that's the order you want to do, and that will cost optimize the prompts. It's also very helpful if you have multiple prompts you want to check. You can take those and feed them as data into Cloud Code conversations or ChatGPT conversations with the express instruction to analyze these different prompts, the instances of prompts. Let's say you generate, like, five prompts in your own prompt generating system and ask these tools if they can be optimized. And you will get insights from the AI that you can use.
Arvid:Right? They might say the same thing that I do. I'll pull this up, put this down, and move this around or say this differently. I wouldn't take this advice verbatim because if you ask an AI about prompt optimization, it's still just predicting tokens. So just trying to guess what's right.
Arvid:Just because it says something might be more useful doesn't mean it is. But if you have a lot of prompts and let the AI check 10 of them at the same time and tell you what you could do better, you will get meaningful insight from that. And finally, since we're talking about using external platforms that cost on a per token basis, even if we're saving a lot of money by doing certain things, build a lot of rate limiting into these systems. You will need to rate limit the customer facing actions that trigger AI interactions, and you need to rate limit the AI interactions that can be sent from your back end server. You don't want a race condition that restarts the same process over and over again, triggering the same OpenAI call repeatedly, like a thousand times a second.
Arvid:Even if using cache tokens, this will cost you before you catch it. Make sure that if something is out of order, like if there's 10 times the normal AI tokens being used in any given hour, you're at least made aware of it and you have the abilities to stop it. I have in my application a full on circuit breaker for all AI features. If you have something like Laravel, which has a command you can run on your server, maybe you have an admin interface where you can flip a switch on or off, you might want to implement Full Circuit Breaker for all AI tools or for specific parts of the application. Just turn off all AI and then whenever an AI call would be made, the service checks if the circuit breaker is on or off.
Arvid:And if it's on, well, then it stops and doesn't send the request. For Podscan, users can do something like self onboarding and click generate a cool default config button. Right? And I have that on and off because if people find a way to abuse it, find a way to automate this and send hundreds of dollars worth of cost my way, well, I wanna be able to switch it off with a toggle. I wanna see that it happens, and I wanna have a toggle.
Arvid:So a feature toggle in your back end, not just front end, but in the actual back end, right where it happens, is what I would suggest. And do this wherever you think you might need it. And then once you build this into the locations where you think you might need it, have an AI scan your code and tell you where else you need to put it because you might forget. The AI will scan through all your files for you, Cloud Code or Codex or something. Right?
Arvid:It's always a good idea to have an agentic coding tool just for this purpose. Where are the potential AI call abuse situations? And can we have a feature toggle there that prevents this from being used? If you, as an administrator, turn it off, that's a prompt, and it will build this for you. I think that's a very important feature because bootstrappers, solo founders, we don't have the means to spend a thousand dollars on a software bug in our system.
Arvid:So I highly recommend this for anything, which also means that any AI usage has to go through your back end system. Right? You cannot implement something, and you should never. That is client side and doing an AI call to any platform with your token. It's generally not a good idea to begin with, but do not do this.
Arvid:Always funnel it through your back end system so you can reliably turn these features on and off and get alerted when usage is high. I have several of these rate limits and all the features, front end user rate limits, back end rate limits, feature toggles, alerts for when a feature gets abused on a per account system, on a per subscriber type system, even per IP as well. I have a lot of that tracking and I get notifications. And I really think these things need to be built in. Have I a feeling that tools and frameworks will implement this in the near future as a kind of a baseline anyway, but for now, right now, we need to implement this ourselves as founders because we don't wanna lose money.
Arvid:Yeah. The AI landscape is such a weird place. It's changing constantly. The models change. APIs change.
Arvid:Pricing changes. I think your job is not to keep up with everything. I think that's impossible. But your job is to build systems that can adapt when things inevitably shift. Migration patterns, good idea.
Arvid:Right? Have the ability to go back and forth between implementations. Service tier optimizations, pay only what you need to, even if it's a little bit slower. Prompt caching, rate limiting, circuit breakers. I think these aren't just nice to have.
Arvid:These are the foundation of running AI in production without losing your shirt. Build for change. That's the real best practice. And that's it for today. Thank you so much for listening to The Bootswap Founder.
Arvid:You can find me on Twitter at avidka, a r v I d k a h l. Everything I just talked about, migration patterns, service tiers, rate limiting, that's all running under the hood at PodScan, my podcast monitoring platform. If you need to know when someone mentions your brand or name across 4,000,000 podcasts, that's what it does. And for founders still searching for the next thing, I set up ideas.podscan.fm, where AI pulls validated problems straight from expert conversations. Sometimes the best ideas are just listening to what people are already asking for.
Arvid:So thanks so much for listening. Have a wonderful day and bye bye.