Screaming in the Cloud

In this episode, we chat with Randall Hunt, the VP of Technology at Caylent, about the world of generative AI and how it's changing industries. Randall talks about his journey from being an AWS critic to leading tech projects at Caylent. He shares cool insights into the latest tech innovations, the challenges and opportunities in AI, and his vision for the future. Randall also explains how AI is used in healthcare, finance, and more, and gives advice for those interested in tech. 

Show Highlights: 

(00:00) - Introduction

(00:28) - Randall talks about his job at Caylent and the projects he's working on

(01:35) - Randall explains his honest and evolving perspective on Amazon Bedrock after working with it hands-on

(03:35) - Randall breaks down the components and improvements of AWS Bedrock

(06:08) - Improvements in AWS Bedrock's preview announcements and API functionality

(08:05) - Randall's predictions on the future of generative AI models and their cost efficiency

(10:00) - Randall shares practical use cases using distilled models and older GPUs

(12:12) - Corey shares his experience with GPT-4 and the importance of prompt engineering

(17:21) - Bedrock console features for comparing and contrasting AI models

(21:02) - Enterprise applications of generative AI and building reliable AI infrastructures

(28:13) - Randall and Corey delve into the costs of training large AI models

(36:37) - Randall talks about real-world applications of Bedrock in industries like HVAC management

(39:40) - Closing thoughts and where to connect with Randall

About Randall Hunt: 

Randall Hunt, VP of Cloud Strategy and Solutions at Caylent, is a technology leader, investor, and hands-on-keyboard coder based in Los Angeles, CA. Previously, Randall led software and developer relations teams at Facebook, SpaceX, AWS, MongoDB, and NASA. Randall spends most of his time listening to customers, building demos, writing blog posts, and mentoring junior engineers. Python and C++ are his favorite programming languages, but he begrudgingly admits that JavaScript rules the world. Outside of work, Randall loves to read science fiction, advise startups, travel, and ski. Randall is the coder in the boardroom.

Links referenced: 

Randall Hunt on LinkedIn: https://www.linkedin.com/in/ranman/
Caylent: https://caylent.com/
Caylent on LinkedIn: https://www.linkedin.com/company/caylent/

* Sponsor 

Prowler: https://prowler.com

What is Screaming in the Cloud?

Screaming in the Cloud with Corey Quinn features conversations with domain experts in the world of Cloud Computing. Topics discussed include AWS, GCP, Azure, Oracle Cloud, and the "why" behind how businesses are coming to think about the Cloud.

Randall: This hyper-focus on per-token cost is kind of like missing the forest for the trees, because that is only one part.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. It's been a hot second since I got to catch up with Randall Hunt, now VP of Technology at Caylent. Randall, what have you been doing these days? I haven't seen you in a month of Sundays.

Randall: Uh, well, I'm still working at Caylent and we are still building cool stuff.

That's my new motto: we build cool stuff. And yeah, a lot of gen AI coming out from a lot of different customers. People are getting really interested in applying it. So that's what I'm doing these days.

Corey: Prowler works wherever you do. It's an open source powerhouse for AWS, Azure, GCP, and Kubernetes.

From security assessments to compliance audits, Prowler delivers with no hidden corners, just transparent, customizable security, trusted by engineers and loved by developers. Prowler lets you start securing your cloud with just a few clicks, with beautiful, customizable dashboards and visualizations. No unnecessary forms, no fuss, because, honestly, who has time for that?

Visit prowler.com to get your first security scan in minutes. Some of the stuff that you have been saying on Twitter, yes, it's still called Twitter, uh, has raised an eyebrow, because back when we first met, you were about as critical of AWS as I am. And what made this a little strange was that, at the time you worked there, you were one of those people that, I guess, could best be described as unflinchingly honest.

Sometimes this works to your detriment, but it's one of the things I admire the most about you. And then you started saying nice things about Amazon Bedrock in public recently. So my default conclusion is, oh, clearly you've been bought and paid for and have thus become a complete and total shill, which, sure, might fly in the face of everything I thought I believed about you, but simple solutions are probably the best.

Before I just start making that my default assessment, is that accurate? Or is there perhaps something else going on?

Randall: No, uh, so I think if you look at the way I was talking about Bedrock back in April of '23, you could see I was still as unflinchingly honest as ever. Uh, although I guess I've grown up a little bit over the years, and I try to be a little less, I don't know, inflammatory in my opinions.

So I was like, hey, this isn't real, this is vaporware, this doesn't work. Since then, we've had the opportunity, and me personally, I've had the opportunity, hands on keyboard, to work with over 50 customers in deploying real-world, non-experiment, non-proof-of-concept, production solutions built on Bedrock.

And I have to say, the service continues to evolve, it continues to get better. There are things that I still think need to be fixed in it, but it is a reliable, good AWS service that I can recommend now. I see your head exploding.

Corey: Yeah, I hear you. Um, the problem is, let me, let me back up here with my experience of Bedrock.

Well, before we get into Bedrock, let's talk about, in a general sense: I am not a doomer when it comes to AI. I think it is legitimately useful. I think it can do great things. I think people have lost their minds in some respects when it comes to the unmitigated boosterism, but there's value there.

This is not blockchain. We are talking about something that legitimately adds value. Now, in my experience, Bedrock is a framework for a bunch of different models, a few of them great, some of them not. Some pre-announced, like Amazon's Titan (the actual Titan, not the embeddings model), and I believe that has never been released to see the light of day, for good reason.

But it always seems to me that Bedrock starts and stops with being an interface to other models. And I believe you can now host your own models, unless I'm misremembering a release and/or a cloud company that was doing that.

Randall: So there's a lot of different components of Bedrock. You know how you think of SageMaker as like the family for traditional machine learning services and you've got SageMaker

Corey: jumpstarted.

Well, I used to, until they wound up robbing me eight months after I used it for $260 in SageMaker Canvas charges. And now I think of it as basically a service run by thieves. And I look forward to getting back into it just as soon as I'm made whole. Two and a half years later, I'm starting to doubt that that'll happen.

But, but yes, I used to think of SageMaker that way.

Randall: I agree with you on the Canvas side. I'm pretty equally frustrated. Like, I have to administer it for all of Caylent. So I am our AWS administrator, and I have to manage all of our costs. So I very much empathize with that component of it. But I do think,

with the evolution of SageMaker, you have to remember that it goes all the way back to, what was it, '17 that it launched, or was it '16? I think I wrote the launch blog post, so I should really remember this, but I've forgotten.

Corey: Pretty sure it was '17. It came out, and my first question was, what are we going to do with all this Sage?

Randall: You know, it's got, like, three different generations of stuff on top of it, and figuring out the cost inside of SageMaker, which generation it belongs to, is it part of Studio? Is it part of Canvas? It is not fun. So I totally empathize with that part. However, SageMaker has many good components to it.

So, that part of things: you think of SageMaker as a family of services. Think of Bedrock as a similar family of services. You have things like Guardrails, and these started out as very rudimentary. You would do a regular expression search for certain words. You would, in natural language, define certain questions that you didn't want the models to respond to. It

is much better now. So you can tune in things like toxicity. You can tune in things like, oh, you know, you're an HR bot, but don't answer any questions about finance or tax, things like that. This works now, whereas previously it was more of a preview feature that didn't really work. Then, the number of models that are available is going to continue to grow.

You can't actually bring your own model yet. So what you can do is you can bring weights.

Corey: That's what it was. Sorry, it all starts to run together on some level, and, to be frank, it's hard to figure out what's real and what's not, given how much AWS has been running its corporate mouth about AI for the last year and a half.

They've been pre-announcing things by sarcastic amounts. They have, and it's always tricky: is that a pre-announcement, or is that a thing customers can use today? I don't know. I do know that I know nothing about what's up and coming, because that team doesn't talk to me, because in the early days I was apparently a little too honest.

Randall: So I think one of the things that they have improved on, and I lost my mind over this back in 2023, is they've stopped saying it's in preview when it's really coming soon. That was the biggest thing that drove me crazy: they would say something was in preview, and all of my customers would come to me and be like, Randall, I want to get into this preview.

And then I would go to AWS and I would say, AWS, we want to get into this preview so that we can advise our customers, and all this. And then it turns out it was really a coming soon.

Corey: Yeah, it's in private preview for a limited subset of customers whose corporate name must rhyme with Schmanthropic.

It's like that, that is not the same thing.

Randall: But they've gotten better about that. So they say coming soon now. I don't know if you've seen some of the more recent announcements, where it doesn't say preview or anything like that. It says coming soon, which is so much more helpful in helping our customers understand what's real and what's on the way.

What can you use in your account today? That sort of thing. Getting back to Bedrock, it is a very solid API. I think it's well designed. Like, if you do InvokeModelWithResponseStream, right, it's going to return, at the end of everything, a little JSON payload that has, you know, Bedrock metrics.

And you can also get these in CloudWatch, but it's very useful to have them per inference, instead of having to go and average them out from the aggregate results. You can get, per inference, things like time to first token, which is very useful when you're building something that's customer-facing where you want streaming responses.

You can get inter-token latency, which is also important for that. And you can get total inference time. Now, total inference time is less useful in a streaming use case. I mean, it's still a valuable metric, but all of that stuff is returned through the API.
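(For listeners who want to see this in practice, here is a minimal boto3 sketch of reading those per-inference metrics off the end of a streaming Bedrock response. The `amazon-bedrock-invocationMetrics` field names come from AWS's public documentation rather than the episode, so verify them against the current docs before relying on this.)

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
})

response = bedrock.invoke_model_with_response_stream(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=body,
)

# The stream yields content chunks; the final chunk also carries the
# per-inference metrics Randall describes (time to first token, etc.).
for event in response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    metrics = chunk.get("amazon-bedrock-invocationMetrics")
    if metrics:
        print("first byte latency (ms):", metrics["firstByteLatency"])
        print("total latency (ms):", metrics["invocationLatency"])
        print("input tokens:", metrics["inputTokenCount"])
        print("output tokens:", metrics["outputTokenCount"])
```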

There are some models that don't return that, and I think that's just because they're kind of older legacy models. Titan is a real model. I've used it, but to your point, not the best, but that's fine. I think AWS is probably working on some stuff, and I hope they are. You know, I'd love to see them release a model, but I also think we're going to get into this area.

Here's my prediction, right? And I know we're getting a little bit off the topic of Bedrock with this, but my prediction regarding generative AI is that we will have a few foundation models that are very powerful, large models, and then we're going to have many frequently changing distilled models, so smaller models.

We did this for a customer recently where, you know, the per-token cost in production of using a large language model like Claude 3 Sonnet or Claude 3 Opus was going to be way too high. It just wasn't going to work given the threshold that they were operating at. What we did is we made Claude 3 Opus the decision maker, deciding which tool to use for which job.

And then we used something called DistilBERT, which is just a distilled version of BERT that you fine-tune, which we did in SageMaker, on their particular dataset for that particular task. We used DistilBERT as the secondary tool. So then we could process this massive amount of data, I think it was 50,000 requests per minute or something,

with these DistilBERT models running on, um, some G4s and G5s. We tried Inferentia as well, and we've had really good success with Inferentia on some of the Llama models with some other customers. But the G4s and G5s, like, I don't know if I want to say this, because the spot market for them is really good right now.

But, um, maybe we can cut that out.

Corey: It's not the latest and greatest NVIDIA GPU, at which point everyone just loses their collective mind.

Randall: We've been saving customers a lot of money by staying on some of the older GPUs, and they perform really well with these distilled models.
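(A rough sketch of the routing pattern Randall is describing: a large model as the decision maker, and a distilled model absorbing the high-volume work on older GPUs. The model IDs, routing prompt, and classifier checkpoint are illustrative stand-ins, not Caylent's actual implementation.)

```python
import json
import boto3
from transformers import pipeline

bedrock = boto3.client("bedrock-runtime")

# Stand-in for a DistilBERT fine-tuned on the customer's own dataset;
# a distilled model like this runs comfortably on older G4/G5-class GPUs.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def route(request: str) -> str:
    """Ask the large model which tool should handle this request."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 10,
        "messages": [{
            "role": "user",
            "content": f"Reply with CLASSIFY or ESCALATE only. Request: {request}",
        }],
    })
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-opus-20240229-v1:0", body=body
    )
    return json.loads(resp["body"].read())["content"][0]["text"].strip()

def handle(request: str):
    # The cheap distilled model takes the bulk of the traffic;
    # only requests the router flags go back to the expensive model.
    if route(request) == "CLASSIFY":
        return classifier(request)
    return "escalate to the large model"
```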

Corey: Oh, yes. I, uh, I've been doing some embedded work on Raspberry Pi doing inference running two models simultaneously for a project that will be released in the fullness of time.

And, yeah, there are some challenges in resource-constrained environments. GPUs help.

Randall: The other trick, I guess, maybe I'll give away all my tricks here: Local Zones. So, you can get very decent prices on GPUs in Local Zones. So if end-to-end user latency is important to you, check out the prices there.

Corey: But my initial understanding of Bedrock, back when it first came out, was that it was a wrapper around other models that you had.

And you say now it is a great API. The problem I recall was that every model required a payload specified in a subtly or grossly different way. So it almost felt like, what is this thing? Are they trying to just spackle over everything into a trench coat? What is the deal here?

Randall: It's a wrapper.

So you still have to customize the payload a little bit, but the good thing about the payloads is that they're basically all trending towards this messages API. So Claude 1 and 2 had this API where you would say Human and Assistant, and you would just kind of go back and forth in turns, and that was the entire prompt.

There's this new one, which is the messages payload, and that structure is much more amenable to moving between different models. Now, that brings us to the topic of prompt engineering. And prompt engineering is still quite different depending on which model you use. So you can't take a prompt that you designed for Llama 3 and move it into Claude 3, for instance.

It just, they're not compatible. That said, there's a ton of tooling that's out there these days. And, you know, there's LangChain, there's Griptape, and I think all of those are good things to sort of learn. But if anyone is listening to this and wanting to get into it, the best way to learn is to actually just remove all of the SDKs that make this a lot easier and just write the prompts raw yourself until you understand it.

And we did that for our re:Invent Session Navigator that was powered by Claude 2.1. It's just, like, no SDKs except for Boto3. And then, I think I used whatever SDK to talk to Postgres. Like, those were the only SDKs we used, and you can see what all of these tools, tools like LlamaIndex and LangChain, are doing under the hood.

And once you understand it, you realize a lot of this is just syntactic sugar. It doesn't add a ton of overhead. It's just niceties on top of things.
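(To make the "write the prompts raw" point concrete, here is roughly what the two payload shapes look like side by side. The legacy format follows Anthropic's old completions API and the messages format follows the newer convention; field names are from public docs, not the episode.)

```python
# Legacy Claude 1/2 style: one string of alternating Human/Assistant turns.
legacy_payload = {
    "prompt": "\n\nHuman: What does RAG stand for?\n\nAssistant:",
    "max_tokens_to_sample": 300,
}

# Messages style: structured turns, far easier to port between model families.
messages_payload = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 300,
    "messages": [
        {"role": "user", "content": "What does RAG stand for?"},
    ],
}
```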

Corey: I've been using GPT-4 for a while now as part of my newsletter production workflow. Now, instead of seeing an empty field, which has the word empty in it, historically, because back when I built my newsletter ingest pipeline, you could not have empty fields in DynamoDB.

So, easy enough, cool, I'll put the word empty in place and then just have a linter that validates the word empty, by itself, does not appear in any of the, uh, things, so it means I haven't forgotten anything, and we're good to go. Now, it replaces it with auto-generated text and sets a flag, so that I still have something for a linter to fire off of. And I very rarely will use anything it says directly.

But it gets me thinking, and it's better than staring at an empty screen. But it took me two months to get that prompt dialed in. I am curious as to how well it would work on other models, but that has been terrific just from an unblocking-me perspective. It's probably time for me to look at it a bit more.

But, in this particular case, I viewed, at least until this conversation, GPT-4 as being best of breed, and I don't care about cost, because it costs less than $7 a month for this component. And as far as any latency issues go, well, I'd like it done by Thursday night every week. Which means that, okay, for everything except the last day of it, I could theoretically, if cost did become an issue, wind up using the batch API that OpenAI has, and pay half price and get the response within 24 hours.

So my use case is strictly best case. It took me months to get the prompt right so it would come out with the right tone of voice rather than what it thought that I did.
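(The half-price batch path Corey mentions looks roughly like this with OpenAI's Batch API: upload a JSONL file of requests, then create a batch with a 24-hour completion window. A sketch under the current openai Python client; the file contents are illustrative.)

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl holds one chat.completions request per line, each with
# a custom_id so results can be matched back up afterwards.
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch",
)

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # batched requests are priced at roughly half
)
print(batch.id, batch.status)
```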

Randall: So you have a couple options. If cost really doesn't matter, you should really compare GPT-4 to Claude 3 Opus. And if you're looking to use it without any other sort of AWS tooling, you can just access the Anthropic SDK directly.

Corey: And that is, I guess, my big question: is there differentiated value in using Bedrock for something like this, as opposed to just paying Anthropic directly? There is, because I did debate strongly: do I go directly with OpenAI, or do I go with Azure's API? The pricing is identical, and despite all the grief I give Azure, rightly so, for its

lack of attention to security, this is all stuff that's designed to see the public anyway. If they wind up training on its own responses and AWS blog posts, I assume they already are. So it doesn't really matter to me from that perspective.

Randall: So the pricing is identical, the per token pricing and everything.

The advantage of running within Bedrock is you can get all those metrics that I was talking about. You can log everything in CloudWatch. You can get all the traditional AWS infrastructure you're used to. That's one benefit. The other benefit is, and this is less useful for your use case, by the way. So this is more useful for industry use cases.

You get stable, predictable performance. Have you ever hit the OpenAI API and it's been like, LOL, you're out of tokens, 429, back off? Like, I'm not going to give you anything more.

Corey: No, for a lot of stuff I use, it's ad hoc. I use, uh, Chat-Jippity, which just sort of sits there and spins and acts like it's doing something before giving a weird error.

And sometimes it's in-flight Wi-Fi not behaving, but other times it's great.

Randall: There's instability in a number of the consumer-facing APIs that you can get around in the AWS APIs. So, say you want to go and purchase provisioned throughput. That was another huge gripe I had with Bedrock: the provisioned throughput was antithetical to the way that AWS did usage-based pricing.

You know, you used to have to commit to one month or six months. You commit to an hour now, which is much more reasonable from a, you know, I want to build something and I know it's going to run in batch, so I'm going to purchase two model units of provisioned throughput for one hour. That works super well.

And we've had customers do that for batch workloads, and it's very dependable. You get precise performance. It's dialed in. You know exactly what you need. You know, if you're using the on-demand APIs, you can get 429 backoffs all the time. And originally, when Bedrock first came out, the SDKs, this is funny, the SDKs, particularly the Python SDK, did not correctly parse the 429 backoff, because throttledException came back with a lowercase t and it was trying to do a case-sensitive match on ThrottledException. But that's fixed now, so it'll properly do the 429 backoffs and everything. But those are the advantages, really: you can get the predictable performance that you're looking for.

It's much more suitable for kind of production workloads.
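(If you are on the on-demand APIs rather than provisioned throughput, the usual move for the 429s Randall mentions is to let botocore handle the backoff. A minimal sketch; the retry modes are standard botocore configuration.)

```python
import boto3
from botocore.config import Config

# "adaptive" adds client-side rate limiting on top of exponential backoff,
# which helps smooth over ThrottlingException / 429 responses.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})

bedrock = boto3.client("bedrock-runtime", config=retry_config)
# Calls through this client now retry automatically on throttling; for hard
# latency guarantees you would still reach for provisioned throughput.
```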

Corey: Something I saw recently reminded me of a polished version of Bedrock, and I wish I could remember what it was. It was some website, I forget if it was a locally hosted thing or some service, where you would bring your own API keys for a variety of different AI services, and then it was effectively a drop-in replacement for ChatGibity.

You could swap between models, race them against each other, and also just have a great user experience. Like, the initial sell was: instead of $20 a month, pay for your actual API usage, which, except for the very chattiest of you, is probably not $20. And okay, great, I'm not that cheap.

I didn't go down that path, but if I can swap out models and do side-by-sides, that starts to change the story.

Randall: You can actually do that in the Bedrock console now. So if you go to the Bedrock console, you can compare and contrast models, and you can see what the cost is. Actually, we built something like that before it existed in the Bedrock console. We call it the Bedrock Battleground, and you can pull in any of these models.

I think the one you're thinking of is probably the Vercel AI SDK, which is also very, very nice. We actually have submitted some code and pull requests to make Bedrock work better in that SDK, you know, adding in models like Mistral and streaming support. But yeah, I mean, I'm totally fine with that approach.

But if you need to do it within AWS, it's right in the console now.
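(A toy version of that side-by-side comparison: the same prompt against two models in a loop. This sketch assumes Bedrock's Converse API, which normalizes payloads across models; availability varies by model and region, and with plain InvokeModel you would build each model's payload by hand instead.)

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

prompt = "Explain vector embeddings in two sentences."
model_ids = [
    "anthropic.claude-3-sonnet-20240229-v1:0",
    "mistral.mistral-7b-instruct-v0:2",
]

for model_id in model_ids:
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 200},
    )
    text = resp["output"]["message"]["content"][0]["text"]
    usage = resp["usage"]  # token counts, handy for cost comparisons
    print(f"--- {model_id} ({usage['outputTokens']} output tokens) ---")
    print(text)
```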

Corey: The reason I would avoid using Bedrock directly for something like this is a perfect example of AWS's long-tail challenges catching up with them. Very often, I will use the iOS app for ChatGibity and can pick up where I left off or look at historical things.

I'm also not sure if any of these systems do this, please correct me if I'm wrong, but the magic part to me is that I can be asking it about anything that I want, and then it's, oh yeah, generate an image of whatever thing I want. Like, one of the recent ones I did for a conference talk was a picture of a data center aisle with a giraffe standing in it.

And it was great, because there's never going to be stock photography of a zookeeper doing that. But the fact that it's multimodal means I don't have to wind up constructing a separate prompt for DALL-E. I mean, the magic thing that it does for that is it constructs the prompt on the back end for you, and you get much better results as a direct result.

I don't have to think about what model I wind up telling it to use. As an added benefit, because it has persistent settings, things it should know about you that become part of its system prompt, what I like is that I can say, oh yeah, unless otherwise specified, all images should be in a 16-by-9 aspect ratio, because then it just becomes a slide if it works out well.

Randall: I think you're still thinking about it from the consumer perspective, which is valid. You know, ChatGPT is a very polished product, and it's simple, you know, a simple interface with an incredible amount of complexity underneath. And I think what Bedrock is providing, among other things, and it does have image generation models, by the way, Titan Image Generator and Stability, is the same thing that AWS has always been particularly good at: building blocks.

So it's letting people build capabilities like ChatGPT into their own products. And even going beyond that, there's a ton of use cases beyond the chat interface that I think we're going to see Bedrock applied to. One of the things that we did for a customer is we built a resumable kind of data science environment.

So think about pandas DataFrames that exist within Jupyter notebooks. Now imagine you have a ChatGPT or something that can go and talk to this DataFrame, and it can send plots, and those are all kept in Fargate containers. You know, we save the notebook, we persist it to S3. And then if a user wants to bring that session up again, we restore it.

We bring that session back to life, and we go and we resume, you know, the Python execution. And we say, hey, this plot that you made, and Claude 3, by the way, supports multimodal, so you can put images in, and you can say, hey, look at this plot, and then change the X axis so that it gets rid of these outliers that I don't care about.

Uh, and it'll redo that and it'll actually write the matplotlib code or the plotly code in this case, but whatever, and it'll go and redo it. And that is something that I think is genuinely valuable and not just a typical chat use case.

Corey: Tired of big black boxes when it comes to cloud security? I mean, I used to work at a big black rock and decided I was tired of having traditional jobs, so now I do this instead.

But with Prowler, you're not just using a tool, you're joining a movement. A movement that stands for open, flexible, and transparent cloud security across AWS, Azure, GCP, and Kubernetes. Prowler is your go to everything, from compliance frameworks like CIS and NIST, to real time incident response and hardening.

It's security that scales with your needs. So if you're tired of opaque, complicated security solutions, it's time to try Prowler. No gatekeepers, just open security. Dive deeper at prowler.com. I want to push back on one of the first things you said in that response, specifically separating out the consumer from the professional use case.

The way that I've been able to dial these things in, and it has worked super well for me, is I start with treating ChatGPT as a spectacular user experience for what I will ultimately, if I need it to be, make repeatable. Like, I don't need infinite varieties of giraffes in data center photographs, because neither giraffes nor cloud repatriation are real, which was sort of the point of the slide, but that was a one-off, and it's great. For the one-off approach, though, I did iterate on a lot of that using ChatGPT first, because, is this even possible? Because once I start getting consistent results in the way that I want them with a prompt, then I can start deconstructing how to do it programmatically and systematize it.

But for the initial exploration, the fact that there's a polished interface for it is streets ahead, and that's something AWS has never seemed to quite wrap their head around. You still can't use the Bedrock console on an iPhone, because the entire AWS console does not work on a phone. The next piece beyond that, then, is if it's that easy and straightforward to build and play around with something to see if it's possible, then what can change down the road?

The closest they've come to this so far has been PartyRock, which is too good for this world, and I'm still surprised it came out of AWS, given how straightforward and friendly it is.

Randall: So I think we are talking about two different use cases, right? Like, I'm talking about the enterprise, or even startup, use of generative AI, in which case Bedrock is absolutely the way that I would go right now.

And you're talking about the individual end consumer usage of generative AI, which I agree.

Corey: True. None of the stuff I've done yet has been with an eye towards scaling. You're right. This is, I'm not building a business around any of this. It is in service of existing businesses.

Randall: Listen, AWS builds backends really well.

As for interfaces and frontends, I mean, there's a lot to be done. I've actually been pretty pleased with some of the changes that have happened in the console. I know people don't like it when the console changes, but there used to be these little bugs, like not having endings on the table borders in the DynamoDB console.

That infuriated me, I don't know why. It was just such a simple thing to fix. And I worked at AWS at the time, and it took me two years to get a commit in to fix that console.

Corey: That was the entire reason you took the job. Once it was done, it was time for you to leave because you've done what you set out to do.

Randall: This is actually a fun piece of history. You know, the AWS console started out in Google Web Toolkit, so it's GWT. Does anyone remember that? I don't think people remember this: you wrote your entire front end in Java, and it would be translated into, like, AJAX and HTML on the back end. That's how all of the original consoles were written.

Corey: 2009, 2010 was my first exposure to AWS as a whole. Was that replaced by then?

Randall: No, I think many of the backends were still, or sorry, many of the consoles were still GWT back then. Come to find out, surprise, surprise, a few still are today. Kidding, I hope, I hope that's a joke. I don't think any are, I mean, I don't use SimpleDB, so it could still be.

But I think almost all the consoles moved to Angular after that, because there was a pretty easy upgrade path between GWT and Angular. And then a lot of people started experimenting with React, and then there was this kind of really polished internal UI toolkit that let you basically pick the right toolkit for your service, the right front end framework for your service.

And I think they've continued to iterate on that. And I do think that's the right approach. I wish there was a little more consistency in the consoles. And I wish there was a little bit more of an eye towards the power-user experience. So a lot of times, console teams that are new, like new services that launch, they don't think about what it means when, oh, I have 2,000 SAML users here, and that's not going to auto-populate when I do a console-side search of users; it needs to be a back-end search. All these little things. But I think that's because AWS works in this semi-siloed fashion where the service team does a lot of the work on their own console.

The only like truly centralized team that I'm aware of at AWS is their security team. And then everything else is sort of, okay, I got to get one front end developer, I got to get one product manager, I need four back end developers, and I'm going to need one streaming person. So I think that's just an artifact of how they work.

Corey: Yeah, that is, it is their superpower and also the thing that they struggle with the most, because individual teams going in different directions will get you to solve an awful lot of problems, but also means that there are certain entire classes of problem that you won't be able to address in a meaningful way.

User experience is very often one of those. Billing, I would argue, might be another, but that's a personal pet peeve. Uh, on the topic of billing, you've also been, the polite way is talking, the impolite way is banging on, a lot about unit economics when it comes to generative AI.

As you might imagine, this is of intense interest for me.

Randall: What's your take? So everyone wants the highest quality tokens as quickly as possible, as cheaply as possible. Like, if you are an enterprise user or a large-scale user of generative AI, the unit economics of this go beyond tokens. And I think if people just keep designing to lower the per-token cost, you know, there are models, there are architectures that may not require tokenization, that we might want to use one day.

And this hyper-focus on per-token cost is kind of like missing the forest for the trees, because that is only one part of the scale and the cost that you have to deal with. You have to think about embedding models. So that's actually one place where I've been pleasantly surprised and impressed: AWS released the Titan V2 embeddings, which support normalization.

Um, you know, they're fairly new, so we don't have hard numbers on these yet, but we've had really good initial experiments. And I have, like, all the data on it if you want.

Corey: A dramatic reading of an Excel spreadsheet. Those are the riveting podcast episodes.

Randall: But if you want me to send you, like, a graph afterwards, I can show you where we saw the good capabilities, and I can show you kind of the trade-off with the 256 vector size, which brings us back to the unit economics, right?

Like, the original Titan embeddings, I think they had, like, a 4K vector output. Now, if you put that into pgvector, which is the vector extension within Postgres, and you try to query it, well, guess what? You just blew up your RAM. Like, keeping that whole index in memory is very expensive, and these cosine similarity searches are very expensive.

Now, back then, pgvector only supported what's called IVFFlat, which was just an inverted file index. Next, what they did is, Supabase, AWS, and this one individual open source contributor all worked together to get what's called HNSW, or Hierarchical Navigable Small World indexes, into Postgres. And all of a sudden, Postgres is beating Pinecone and everyone else on, you know, price performance for vectors.

Now, the downside is that Postgres doesn't scale to, like, a hundred million vectors, because as soon as you get into sharding and things, like, vectors don't shard well, you have to pick a different shard key, all this other good stuff. That is a whole other side of the unit economics: what is your vector storage medium, or your document storage medium?

And what is your cost of retrieval? And then what is your cost of context? Because the Claude 3 models, for example, they have, you know, 200K of context. And they have darn good recall within that entire context, but that's a lot of tokens that you have to spend in order to put all that context in. So part of the unit economics of this is, hey, how good is my retrieval at giving me the thing that I'm looking for, so that I can enrich the context of the inference that I'm trying to make?

And measuring that is three levels of abstraction away from tokens. You know, you have to have a human in the loop say, this is what we thought was a quality answer, and the context was quality too, and it was able to correctly infer what I needed it to infer.
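(Tying a few of those pieces together, here is a sketch of generating a small, normalized Titan V2 embedding, plus the pgvector side that would store and search it with an HNSW index. The request fields follow AWS's published Titan V2 format and the SQL follows pgvector's documented syntax; the 256-dimension choice is just the trade-off Randall describes, not a recommendation.)

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> list[float]:
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({
            "inputText": text,
            "dimensions": 256,   # smaller vectors: smaller index, less RAM
            "normalize": True,   # unit length, so cosine reduces to dot product
        }),
    )
    return json.loads(resp["body"].read())["embedding"]

# On the Postgres side (pgvector 0.5+ for HNSW):
#   CREATE EXTENSION IF NOT EXISTS vector;
#   CREATE TABLE docs (id bigserial PRIMARY KEY, body text, emb vector(256));
#   CREATE INDEX ON docs USING hnsw (emb vector_cosine_ops);
#   SELECT id, body FROM docs ORDER BY emb <=> '[...]'::vector LIMIT 5;
```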

Corey: I think people have lost sight of just how horrifyingly expensive it is to get these models up and running.

Uh, there was a James Hamilton talk at the start of the year at CIDR where he mentioned that an internal Amazon LLM training run had recently cost 65 million dollars in real cost. And that, like, honestly, the biggest surprise was that Amazon spent anything like that on anything without having a massive fight over frugality.

So that just shows how hard they're hustling around these things. But it's, I think, why we're seeing every company, even when the product isn't fully baked yet, rushing to monetize upfront, which I appreciate. I don't like everything being subsidized by VCs until suddenly one day there's a horrifying discovery.

Um, I love GitHub Copilot. It normally would cost 20 bucks a month and I'd pay it in a heartbeat except for the fact as an open source maintainer I get it for free. It's worth every penny I don't pay, uh, just because of how effective it is at weird random languages with which I'm not familiar or things like it.

It is just a game changer for me in a number of different ways. It's great stuff, and I like the fact that we're monetizing, but they have to, because of how expensive this stuff is.

Randall: The other thing to think about there is, when you price something right at, like, $20 per user per month or $19 per user per month, there are power users who are definitely going to go above what that costs. So that's, I think, part of the economic balancing act: how do I structure this offering in my product, whether it's a SaaS product, whether it's a B2B product, or even a consumer-facing product, such that I am going to provide more value and impact than it will cost me to deliver, and I will make this margin?

And those are the most interesting conversations that I get to have with customers. First, I love the implementation; I love getting hands on keyboard and building cool things for people. But then we move one level up from that and we're like, hey, this is a technical deliverable, but did it solve our stated business goal?

Did we actually accomplish the thing that we set out to do? And building in the mechanisms for that, and making sure we can measure the margin and know that it's genuinely impacting things and moving the needle, that takes time. Like, that's more than a quarter-over-quarter view, because it takes time for people to learn about the product and to adopt it.

And people have to be willing to make some bets in this space. And that's scary for some enterprises that are not used to making bets. There was one other thing that I wanted to mention there about the cost of training, which is: the transformer architecture, the generative pre-trained transformer architecture, has essentially quadratic, or even exponential, training costs.

So as you grow the size of the transformer network, as you increase the number of parameters, as you change the depth of these encoders, you are increasing the cost to train. Then you have the reward modeling, which is the human-in-the-loop part. You have all this other stuff that you have to do, which, again, increases the cost to train.

There are alternative architectures out there. And I think the future is not necessarily purely transformer-based. I think the future is going to be some combination of, like, state space models and transformers. And, you know, we're going to go back to the RNNs that we used to use. And, you know, what kind of ticks me off is, I don't know if you remember, back in 2017, Sunil and I did this video walking through the transformer architecture on SageMaker.

And even we didn't get it back then that it was going to be this, you know, massive thing unlocking emergent behavior. And I think it was only people like Ilya and Andrej Karpathy who realized, like, actually, you know, if we just keep making this thing bigger, we get emergent behavior. And it is suddenly not just a stochastic parrot; the act of predicting the next token has suddenly given us this emergent behavior and this access to this massive latent space of knowledge that has been encoded in the model and can, in real time, in exchange for compute, be turned into a valuable output.

Very far from AGI still, in my opinion. But I think you could brute-force it: if you used the transformer architecture and you just threw, you know, trillions of parameters at it and trillions and trillions of tokens, you could probably brute-force AGI. I think it is much more likely we will have an architectural shift away from transformers, or transformers will become one part, and we will use something akin to SSMs or another architecture alongside that.

And you've already seen promising results, uh, there from different models.

Corey: These are early days. And I also suspect, on some level, the existing pattern has started to hit a point of diminishing returns. When you look at the cost to train these models from generation to generation, at some point it's like, okay, what's next? When Sam Altman was traveling around trying to raise seven trillion dollars, it's, okay,

I appreciate that that is the logical next step on this; I am predicting some challenges with raising that kind of scratch. So there have to be different approaches to it. I think inference has got to be both on-device and less expensive, for sustainability reasons, for economic reasons. And, for example, something I think would be a terrific evolution of this would be a personalized assistant that sits there and watches what someone does throughout the day, like the conversations they have with their families, the things they look for on the internet as they do their banking, as they do their job, as they have briefings or client meetings, and so on and so forth.

And there's zero chance in the universe I'm going to trust that level of always-on recording data to anything that is not on-device, under my control, that I can trust, and that I can hit with a hammer if I have to. So to do that, you need an awful lot of advancements. That feels far-future, but the future's never here until suddenly it is.

Randall: I don't think it's that far away. We've already gotten Llama 3 running on an iPhone 15, uh, the 8-billion-parameter model, and I think we got 24 tokens per second or something. And admittedly, that was a quantized version of the model, but I mean, that's what everyone does for this sort of hardware. I think we're not as far from that future as you might think, and I love the idea of agents, and that's another good feature of Bedrock. And I know we've gotten far from the topic of Bedrock at this point, but I know we're coming to an end here, and my goal with this, Corey, is to convince you to give Bedrock a shot: go in, try it again, explore it, and, like, update your opinion. Because I agree with you.

When it first was announced, there was a lot of hullabaloo about something that wasn't really there yet. But we have these real things that we are building on it now, and it is really exciting. Like, I just love seeing these products come to life. There's one customer we have, BrainBox. They built this HVAC management system that is powered by generative AI.

And it can do a lot of the data science stuff that I was talking about before, where it's like, Oh, resume a Jupyter Notebook and, uh, in a Fargate container and, and show me this plot. Also, you know, look up the manual for this HVAC thing and tell me what parts I'm going to need before I drive out there.

And it's all a natural language interface, and it's really helping the built environment be better at decarbonization. And those are the sorts of impacts that I'm excited about making. And I think building that on top of OpenAI, it would have been possible. Like we could have done it on top of OpenAI, but getting it to integrate with Fargate and all of these other services would have been more challenging.

It would have introduced more vendors. It would have been this overall very weird, kind of complex architecture where we're balancing different models against each other, whereas with Bedrock, it's, you know, one API call, and we're still able to use multiple models. But it's all within this Boto3 ecosystem or this TypeScript ecosystem.

And we're able to kind of immediately, when a new model is added to Bedrock, like, we started out on Claude 2.1, or maybe it was Claude 2, I don't remember, immediately, we were able to switch to Claude 3 Sonnet when it came out and get better results. So that's the other advantage of Bedrock: because this stuff moves so quickly, as soon as a model is available in Bedrock, I can, without having to change or introduce new SDKs or anything,

start using that model. I got way off whatever point I was originally trying to make. I got excited about it.

Corey: Now, the point you started off with is that you were urging me to give Bedrock another try. The entire AWS apparatus, sales and marketing, has not convinced me to do that, but I strongly suspect you just may have done that.

Uh, for someone who's not a salesperson, you are disturbingly effective at selling ideas.

Randall: Listen, there are other SDKs out there, there are other offerings out there, and many of them are good. Thank you. But like, Bedrock is one that I'm bullish on. I think if they continue to move at this pace and they continue to improve, you know, it's gonna be a very powerful force to be reckoned with.

There's a lot more they need to do, though. Like, I email the product manager all the time, and I'm very sorry, sorry if you're listening to this; I'm sorry for blowing up your inbox. There are all these little things that I want fixed in it, but the fact is, they fix them, and they fix them within days.

Getting that sort of responsiveness and excitement from a team is just really powerful. And you don't always get that with AWS products. Sometimes teams are disengaged, unfortunately.

Corey: Sometimes teams are surprised to discover that's one of their products, but that's a separate problem. Okay, fair enough. I will give it a try, and then we will talk again about this, on this show, about what I have learned and how it has come across.

I have no doubt that you're right. I'll be surprised; you are many things, but not a liar. What do you think about CDK these days? I haven't done a lot with it lately, just because I've honestly been going down a Terraform well. Honestly, the big reason behind that is that sometimes I want to build infrastructure where I'm not the sole person on the planet capable of understanding how the hell it works, and my code is not exactly indexed for reliability and readability.

So there's a question around finding people who are conversant with a thing, and Terraform's everywhere.

Randall: I'm a little worried about the IBM acquisition, to be honest. I don't know how all of that is going to play out.

Corey: Suddenly, someone who's not a direct HashiCorp competitor is going to care about OpenTofu, so that might, that has the potential to be interesting.

Randall: But I don't know if you remember, you used to not be the biggest fan of CDK, and then you and I had a Twitter DM conversation, and then I think you started liking it, and

Corey: Oh, then I became a cultist and gave a talk about it at an event, dressed in a cultist robe. Yes, costuming is critically important.

Randall: I'm hoping I can convince you on the Bedrock side too. Uh, I don't think it's cult-worthy yet, but you know, it could get there.

Corey: We'll find out. Thank you so much once again for your time. I appreciate, uh, your willingness to explain complex concepts to me using simple words. Something I should probably ask:

People want to learn more. Where's the best place for them to find you?

Randall: Oh, you should go to caylent.com. We post a bunch of blog posts. We've got all kinds of stuff about LLMOps and performance, and we post all our results. Uh, back when Bedrock went GA, I wrote a whole post on everything you need to know about Bedrock.

Some of that stuff's out of date now, but we keep a lot of things up to date too. Uh, and if you need any help, if you find all of this daunting, all of this knowledge, all of this kind of content around generative AI really difficult to stay apace of, feel free to just set up a meeting with me, uh, through Twitter or with Caylent, and, you know, we do this every day.

Like this is, this is our jam. We want to build more cool stuff.

Corey: And we will, of course, put a link to that in the show notes for you. Thanks again for taking the time. I appreciate it.

Randall: Always good to chat with you, buddy.

Corey: Randall Hunt, VP of Technology at Caylent and accidental, uh, Bedrock-convincing go-to-market marketer.

I'm cloud economist Corey Quinn. And this is Screaming in the Cloud. If you enjoyed this podcast, please leave a 5 star review on your podcast platform of choice. Whereas if you hated this podcast, please leave a 5 star review on your podcast platform of choice, along with an insulting comment, which is going to be hard because it's in the AWS console and it won't work on a phone.