Behind The Bots

Chris Estreich joins Behind the Bots to discuss Anycall.ai, an AI-powered phone system that can mimic human conversations. Estreich explains how Anycall allows users to easily configure AI agents with custom personalities and voices to handle phone calls and conversations. Using state-of-the-art natural language models like ChatGPT, Anycall aims to provide seamless, natural conversations through automated speech recognition, text-to-speech, and real-time transcription. Estreich demonstrates Anycall's capabilities through a call with "Vito," an AI Italian pizza shop owner. Chris outlines Anycall's potential for customer service, taking orders, booking appointments, and more. Looking ahead, Estreich sees huge potential as AI models become more powerful, available offline, and localized to devices. Although still early, Anycall shows the future of AI voice technology and using it to create customizable, conversational interfaces. Estreich provides an inside look at building with AI today and keys to innovating as models rapidly advance.

ANYCALL.AI

https://anycall.ai/
https://crunchbase.com/person/chris-estreich


FRY-AI.COM

https://www.fry-ai.com/subscribe
https://twitter.com/lazukars
https://twitter.com/thefryai

Creators & Guests

Host
Ryan Lazuka
The lighthearted Artificial Intelligence journalist, building the easiest-to-read AI email newsletter and daily Twitter threads about AI.

What is Behind The Bots?

Join us as we delve into the fascinating world of Artificial Intelligence (AI) by interviewing the brightest minds and exploring cutting-edge projects. From innovative ideas to groundbreaking individuals, we're here to uncover the latest developments and thought-provoking discussions in the AI space.

Chris Estreich: I guess I started building products in college. It was funny, like my freshman year dorm, I had no idea what I wanted to do when I got to college. And freshman year, the first floor of our dorm was entirely computer science people. So I just kind of followed them to the classes they were going to and then ended up doing that. So basically became an engineer by trade.

Where'd you go to school? And this was Stanford in the early 2000s. And there was kind of a shift to startups around that time. Originally, most of my classmates were going to work for like Microsoft, which was sort of the big place you would go work, or Amazon. And it was just starting to be Google around that time. This was like 2005. And then I noticed that people were starting to go to these much smaller companies with much younger people.

It was kind of interesting. And so I also followed that trend and ended up working at a startup called Oodle, which was like a classifieds startup. And then eventually, through hubris, got the idea that maybe I could start my own company. And so around 2008, I started a company with a good buddy of mine from college. Originally it was this thing called Blippy, which-

Ryan Lazuka: A guy on Netflix, have you heard about Blippi?

Hunter Kallay: I've got a daughter, so that's all she watches, but that's funny.

Chris Estreich: Yeah, I have a daughter too. She hasn't gotten into that, but it's more Cocomelon than Blippi. So Blippy was, we were trying to build a social network oriented around a feed at the time, like the feed mechanic was just starting to emerge as a really great sort of consumer interface for a bunch of things. And we thought we could build a feed of your purchase history, and we would scrape your credit card transactions using the same type of methodologies that Mint and other companies like that were using.

And it would sort of publish passively the stuff you were buying. And the idea was to create social interaction around that. And for whatever reason, that idea caught fire within like the VC community in Palo Alto.

So I'm still based around Palo Alto. And there was kind of a bidding war to invest in this crazy idea with these young kids who had no idea what they were doing. And we ended up raising like way too much money for a product that had no traction and it was just all kind of hype to a certain extent. And so we spent about a year trying to build that.

It didn't really go anywhere. Some of the integrations that we had built, in particular with Etsy, were kind of interesting in terms of the data they were producing. And so we decided to build an e-commerce product to try to elevate Etsy sellers.

And then we spent about a year on that. And that had some modest traction, which eventually led us to build a sort of marketplace where these Etsy sellers could sell in a real-time, live auction environment type of product. And that ended up taking off and getting to pretty massive scale. And so we spent about 10 years on that business. And there were a bunch of trends that we were riding, but we effectively became like a version of Wish or Temu, where there were a lot of unbranded Chinese products selling in a sort of massive consumer marketplace. And then we'd spent so long on that business that my co-founder and I decided to leave. It was kind of a slog, a tough business. Which sort of leads to today, where there's this explosion of AI technologies and tools and frameworks. My co-founder and I sort of linked back up after taking a year or two break and started exploring what types of consumer applications or SaaS businesses you could build with this.

I wouldn't say that we're machine learning practitioners. I think we're just more excited about the types of products that you couldn't build a year ago that you can now build, and ways that you might bring that technology to people that are not necessarily early adopters who are going to figure out how to run some of these models locally, or could even navigate ChatGPT. Like my mother-in-law, she's reading about AI stuff in the New York Times but doesn't really know how to dabble with it. But when I gave her a phone number to dial and engage with this sort of automated agent, it kind of blew her mind. And that sort of old school UX to this sort of crazy new technology was, I think, an interesting paradigm to try to build products with. So that sort of leads us to today, where we don't really know exactly how this stuff is gonna play out, but we love tinkering with it.

Hunter Kallay: And so your project is called anycall.ai, the current one you're working on. Can you just tell us a little bit about that and what that does?

Chris Estreich: Yeah, so the idea is that we're almost reaching a point where you can service any phone call using a series of models sort of in line. We're getting to the point where text-to-speech, like some of the stuff ElevenLabs is doing, is almost too good. Like you can clone a voice and you can get AI to speak with accents and intonations and things like that, so that it's almost indistinguishable from real humans. And so you pair that with some of the automated speech recognition technology that is near perfect for English and some other largely spoken languages.

And you package all that together. There's a couple of other models, like voice activity detection and stuff like that, to understand when there's a pause in the conversation or when someone stops speaking and you should try to process the data. But you sort of line all these things up in a stack where everything is running and streaming in real time, and you can kind of simulate a person responding to a telephone conversation. There's a little bit of latency, but minus that latency, it's sort of indistinguishable from a human. And when you run the inputs and the outputs through ChatGPT, in particular, we're using GPT-4 Turbo, which is sort of a significant step up, not that much slower than GPT-3.5 Turbo, but definitely not as slow as GPT-4. Then the capabilities that you can build into these voice agents are like incredible.

And for particular types of tasks, the experience is actually faster and better than it might be with a real human. And so we just kind of wired this up over the course of one or two days and tried playing with a few cases. Like the case we really liked was this guy that runs a Neapolitan style pizza shop that has the persona of Vito Corleone from The Godfather, and we gave him a little bit of a personality and had him do some basic tasks in terms of taking orders for pizza. And the results were just mind blowing. And we thought, okay, we're not exactly sure what the killer use case for this is, but I think there's something here. So let's fire up a self-serve platform for building these types of agents and just see where it takes us.
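To make the stack Chris describes concrete, here's a minimal single-turn sketch: transcribe the caller's audio, run it through a chat model with a persona prompt, then synthesize the reply. It assumes the OpenAI Python SDK for Whisper transcription and the chat call; the text-to-speech step is a stub, and the Vito prompt, file name, and model are illustrative rather than Anycall's actual configuration.

```python
# A minimal, single-turn sketch of the ASR -> LLM -> TTS stack described above.
# Assumes the OpenAI Python SDK (v1.x) with OPENAI_API_KEY set; the persona prompt,
# file name, and model are illustrative, and TTS is left as a stub.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are Vito, the owner of a Neapolitan pizza shop in Brooklyn. "
    "Stay in character, keep replies short, and take pizza orders."
)

def transcribe(audio_path: str) -> str:
    """Automated speech recognition via Whisper."""
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

def respond(history: list, user_text: str) -> str:
    """One LLM turn; the conversation so far is carried in `history`."""
    history.append({"role": "user", "content": user_text})
    completion = client.chat.completions.create(
        model="gpt-4-turbo-preview",  # assumption: any chat-capable model slots in here
        messages=[{"role": "system", "content": SYSTEM_PROMPT}] + history,
    )
    reply = completion.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

def synthesize(text: str) -> bytes:
    """Stub for text-to-speech (e.g. ElevenLabs); returns raw audio to play back."""
    raise NotImplementedError("plug in your TTS provider here")

if __name__ == "__main__":
    history = []
    caller_text = transcribe("caller_turn.wav")  # hypothetical recorded chunk of the caller
    print("Caller:", caller_text)
    print("Agent:", respond(history, caller_text))
```

In production the same loop runs on small streamed chunks rather than whole files, which is the part Chris walks through in more detail later in the conversation.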

Hunter Kallay: On your website, anycall.ai, you have, you can play around with Vito, which is kind of cool. And before we got on, Ryan and I were talking to him, I'll just give people an example of what it's like here.

Ryan Lazuka: And this is live right here. It's awesome.

AnyCall Demo: Oh no, this is Vito speaking from the Don's Dough over here on Avenue J in Brooklyn. How many pies can I get started for you today?

Hunter Kallay: I'll take two pizzas.

AnyCall Demo: Great, two pizzas coming right up. What size would you like for these? We've got small, medium and large.

Hunter Kallay: I'll take the biggest pizza you got.

AnyCall Demo: Ah, I like your style. Two large pizzas it is. Now what toppings would you like on these? I'll get some. Just so you know, we stick to the six here.

Oh, all right. I'll have some. Could you tell me which toppings you would like?

Just go ahead and list them for me. Of course. What toppings would you like on your large pizzas? I'll have some.

I'll name them and I'll make sure they're added to your order. Awesome sausage. Sausage, got it. That's a choice you can't refuse. Will it be sausage on both of the pizzas? Perfect, two large pizzas with sausage. Now what time would you like to pick them up? We can have them ready no earlier than 30 minutes from now.

Hunter Kallay: I need them quicker than 30 minutes.

AnyCall Demo: I'm sorry, but the earliest we can prepare your order for pickup would be in 30 minutes from now. Our pies, they need a little time to be perfect, just like a fine wine. Can we make it work for you in 30 minutes?

Hunter Kallay: So that's just a little bit of Vito. And it's just a demo version, obviously. It's just kind of funny, but it sounds pretty cool.

Chris Estreich: Yeah, I was watching the conversation through the admin. You can already see some of the limitations, in terms of it got interrupted by the feedback from the audio and your speakers and those sorts of things, but generally pretty good, and the stuff is only gonna get better. But yeah, that was fun.

Ryan Lazuka: If you guys wanna test it out, you can test it out on the website. It's really cool. The first time we tested it, it was very good. The latency wasn't bad at all. And on this example, it picked up some of the feedback, like Chris was saying, from our computers and through the podcast, but it just had a lot of personality to it. It was talking about the dough and how it was made. We wanted the pizza done in 20 minutes.

It's like, well, you gotta let the dough rise, and just going into how the process of pizza making goes. So it was really cool that it had that much personality behind it. That was the cool part about it.

Hunter Kallay: And the cool thing I think is you did one for Santa Claus around Christmas time, and you could have your kids talk to Santa Claus. You would call and tell them some information about your kids, and then Santa Claus would pick up the phone. It was so cool. Actually, the person that was in the room with me thought that I was talking to an actual person. And I'm like, yeah, this is just an AI. It was so realistic.

It was super cool. But can you just tell us like, it seems like you're moving towards, like you can actually design the personality of the AI entity you're calling and then it can learn from you in some ways as well.

Chris Estreich: Yeah, I mean, there's a couple of interesting things. Just to be clear, the heavy lifting is the LLM in terms of figuring out the personality. And this is so typical of a lot of these AI applications now, that to a certain extent they're kind of wrappers around the capabilities of these GPT models. And so for us, the innovation is more figuring out how to reduce the latency and process the audio in the optimal way to clean up all these sort of interruptions and things like you just experienced. So in that sense, it's more of a systems design problem. But these LLMs are just incredible for being able to creatively come up with the personality that you want, which makes it a really good choice for this type of application, assuming that you want a personality. We're talking to some folks who sort of want no personality because it's a business application, but I think it's way more fun if you are dabbling with some of the creativity that these LLMs can produce. And then in terms of learning over time, there's definitely some magic involved in fine tuning for specific use cases, in which case you can actually choose a less capable model that's fine tuned for your particular case and really reduce the latency, because you can run an open source model on the edge and not have to pay the penalty of hitting OpenAI's API. But it's so early days on this stuff that I think a year from now you might see a world where there's effectively zero latency, the cost to do this stuff is really low, and it's almost indistinguishable from a human, which in its own right is sort of a scary scenario, if you can imagine people calling you and harassing you and upselling you and it's all AI and you can't really tell.

Ryan Lazuka: Even if you're watching YouTube videos, a lot of them, I'd say probably 20 or 30% of the ones I'm watching right now for educational reasons, are AI voices. And you can tell they're AI voices because it'll be the same voice, like the same ElevenLabs AI voice, on multiple different videos.

But by the same token, people get scared of interacting with AI, but a lot of times those videos are better than ones done by humans. So does it really matter in the long run? I don't know. One of the things that you mentioned, Chris, that I think might be a thing is, there's a lot of companies out there that are creating AI voices for call centers, for answering customer service calls. But maybe, and I just thought of this, maybe these AI customer service agents are gonna be programmed to have their own unique personalities, because it makes it more interesting, rather than just calling and getting the same AI person every single time. Maybe you'll have an AI personality on the call center that was born in India, then moved to the United States, and has a personality because they're into math or something, I don't know. Maybe that's the ticket: we're gonna have all these personalized AI interactions rather than just some scrubbed-down version that gets right to the point.

Now some customers or some clients might not want that, they might want a scrubbed-down, dumbed-down AI bot answering calls, but it sounds like a lot of people would really rather talk to an AI bot that has personality. What are your thoughts on that?

Chris Estreich: Yeah, I mean, that's how I feel about it. Certainly it's probably a very good thing that you can get this stuff to match the culture of what you're trying to build, the company you're trying to build. But for us, at the peak, we had thousands of customer support emails coming in and we had a massive team.

And it was very dry, like transactional sorts of emails being sent back and forth. And there's a possibility for so much better. I can imagine a scenario where your customer service prompts sort of evolve over the course of the conversation to progressively get more expressive or funnier, like evolve the personality based on the type of speech coming in from the caller. So if you can tell they're pissed, then you can dial up how serious the personality is. But if it makes sense to be more whimsical, then you can progressively move in that direction, through prompts evolving based on the prior input. And I think something like that could be really, really interesting and sort of fun, and probably a net benefit for the world of customer support, where it's sort of a bad experience on both sides. You've got customers who are pissed, and then you have poor service agents who are bearing the brunt of that.

Ryan Lazuka: Yeah, it's almost like when you meet someone in the real world and you like them because you like their personality, right? And you sort of just click and hit things off right away. It's almost like that could happen with a call center call as well. Like you're just gonna like this person because the AI is gonna be trained to know what you like based on your voice tone and just what you say over the phone to them, which is crazy.

Chris Estreich: Yeah, and that also brings up this, I think a lot of folks we talk to, they wanna leverage this as a cost saving mechanism, where it's like, okay, I can spend less money on support agents because there's this heavy lifting the AI can do. But I really like kind of flipping that and being like, well, what if it can provide such a better experience that you're upselling people, or there's more of a positive outcome to the call, as opposed to just a cost cutting mechanism. And I think that's an interesting vector to go down.

Ryan Lazuka: For sure. It goes back to this topic that keeps coming up: people are scared of AI, but I think maybe one day people are gonna prefer to talk to someone at a call center that is an AI agent, because it's just gonna be so much easier to talk with them and they're gonna know your personality a lot better, and you're gonna have a better conversation with them than with a human. Because that human might be pissed off, they might have answered 20 other calls and not be in a good mood, but this AI bot is always gonna know exactly what you want based on your voice tone.

And they're always gonna be happy no matter how many calls they've taken. So right now there's a waitlist to get on anycall.ai. What are you guys looking at in terms of launching the product to the end user?

Chris Estreich: Yeah, so right now we've plucked a few people from the waitlist that have use cases we think are particularly interesting. And so there are things around customer support that I think are pretty obvious, but there's also some things that came up that I didn't anticipate. Like one that's come up a couple of times is property management type stuff, where let's say you're a company that manages a hundred properties, and things tend to break at a pretty high rate when you have that many.

The number of calls that these companies will get to service like basic things is like pretty overwhelming and trying to deal with that call volume seems to be a real challenge. And this technology is like perfect for that, I think. And I think you can sort of seed it with some specific data for that, like the types of buildings or the types of problems that might frequently pop up and apply some retrieval augmented generation techniques to really like try to solve those things in real time and then kind of fall back to a human.

And so basically we're trying to nail a few use cases out of the park. And in doing that, we'll build some capabilities that I think will be broadly applicable to anyone who wants to use this platform as a self-serve thing. And once we feel good about where that is, then I think we'll sort of open it up as like a self-serve product and really let people go to town and use their creativity and build whatever they want with it.

And I think we'll do a pretty generous free plan so that we get the benefit of seeing what people can do with this, and go from there. No, I mean, on that I think we move pretty fast, so I would imagine we're talking a couple of weeks to let this process play out. But yeah, pretty excited about it.
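The retrieval-augmented generation idea Chris mentions for the property management case roughly amounts to: before each reply, look up the notes most relevant to the caller's problem and fold them into the prompt, with a fallback to a human when nothing matches. Here's a small sketch with a toy keyword retriever standing in for a real vector store; the knowledge-base entries and wording are invented for illustration.

```python
# Toy sketch of retrieval-augmented generation for the property management case:
# look up the notes most relevant to the caller's problem and fold them into the
# prompt, with an explicit instruction to escalate when nothing matches. The notes
# are invented; a real system would use a vector store, not word overlap.
KNOWLEDGE_BASE = [
    "Building A (12 Elm St): the boiler resets from the panel in the basement, left of the laundry room.",
    "Building B (48 Oak Ave): after-hours plumbing emergencies go to the on-call line, not the office.",
    "All units: a dead outlet is usually the tripped GFCI in the kitchen or bathroom.",
]

def retrieve(query: str, k: int = 2) -> list:
    """Return the k notes sharing the most words with the caller's request."""
    q = set(query.lower().split())
    ranked = sorted(KNOWLEDGE_BASE, key=lambda doc: len(q & set(doc.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(caller_request: str) -> str:
    context = "\n".join(f"- {doc}" for doc in retrieve(caller_request))
    return (
        "You answer maintenance calls for a property manager. Use only the notes below; "
        "if they don't cover the issue, offer to hand the caller to a human.\n"
        f"Notes:\n{context}\n\nCaller: {caller_request}\nAgent:"
    )

print(build_prompt("the outlet in my kitchen has no power but the breaker looks fine"))
```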

Ryan Lazuka: That's awesome. Well, yeah, I mean, the cool thing is you've had a bunch of experience already launching companies, so you know how to at least point this newest venture in the right direction. Talk me through the platform.

Hunter Kallay: How does it work from the consumer side? Let's say you release it and I go on to anycall.ai. What does it look like, and can you walk me through kind of what I can do on there?

Chris Estreich: Yeah, so the way we're trying to build it is, it's sort of a no code platform where there's a series of wizards that let you configure the agent to your liking. And so you can do things like choose the voice, represent the personality, describe the sort of outcome or task that the agent's trying to perform. And really what this does is it collects all that and puts it in a prompt that we've optimized for this particular task. That's sort of the most basic version of this. Now, we could expose the raw prompt to people and let them play with it the way you could with something like ChatGPT, but I think that to a certain extent that doesn't yield the best results. We feel like we've converged on a prompt that works really well for this particular case. And it has to adhere to a certain set of properties. Like you don't want to stuff too much into it or it starts losing some of the context as the conversation flows.

And so that's sort of where it stands today. But we are working through a version of it where it's more like a drag and drop type editor, where you can string multiple prompts together. You can define triggers that will call out to particular integrations if you want to add more capability to it. The prompting ability right now is, I think, good enough to where we don't need that really power-user type version of it. But it's something that we think we'll need eventually.

And so we're sort of tinkering with it. But at a basic level, imagine a Typeform-like wizard where you're answering basic questions. And then at the end of it, you've got this agent configured to your liking and you can go back and tweak things. But it'll immediately dump you into a WebRTC-like interface, like something like Zoom, where you're actually chatting with the agent you built in real time.

It doesn't use any phone systems. It's all WebRTC over the network. And you get a sense of what the capabilities are of the agent you just created, and then you go back and edit it, and you've got this sort of iteration loop happening that way. And at the end of that, hopefully you have something that you like, and then you can figure out if you want to fire up a new phone number to deploy it, or you can use call forwarding on your existing number to switch over to it if you want to do that. So that's sort of the basic flow at this point.
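A guess at what the wizard-to-prompt step might look like under the hood: the answers collected by the no-code flow become a config object that gets folded into one compact system prompt. The field names and wording below are illustrative assumptions, not Anycall's actual schema.

```python
# A guess at the wizard-to-prompt step: the no-code answers are collected into a
# config object and folded into one compact system prompt. Field names and wording
# are illustrative assumptions, not Anycall's actual schema.
from dataclasses import dataclass

@dataclass
class AgentConfig:
    name: str
    voice: str          # which TTS voice to use
    personality: str    # free-form description from the wizard
    task: str           # the outcome the agent should drive toward
    constraints: str    # e.g. hours, menu, escalation rules

def to_system_prompt(cfg: AgentConfig) -> str:
    # Kept deliberately short: stuffing too much into the prompt is what makes the
    # model lose the thread as the conversation grows.
    return (
        f"You are {cfg.name}, answering phone calls. Personality: {cfg.personality}. "
        f"Your goal on every call: {cfg.task}. Constraints: {cfg.constraints}. "
        "Speak in short, natural sentences suitable for text-to-speech."
    )

vito = AgentConfig(
    name="Vito",
    voice="warm-italian-male",
    personality="a gruff but charming Neapolitan pizza shop owner",
    task="take a complete pizza order (size, toppings, pickup time)",
    constraints="pickup is never earlier than 30 minutes from now",
)
print(to_system_prompt(vito))
```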

Hunter Kallay: And you said you have multiple languages too. How many languages are you capable of right now?

Chris Estreich: Yeah, it's really bounded by how well ChatGPT will do with a given language. And so I think they claim something like 60 languages are pretty high quality. And then the other constraint is the speech recognition models, how well do they do with multiple languages? And automated speech recognition has been around for so long that it's largely a solved problem. It's really about latency more than anything else, but there are multilingual ASR models that are open source and fast.

And so that's really not a constraint. And then for text to speech, I mean, there are so many text to speech models. I think ElevenLabs to me is the state of the art in terms of latency and expressiveness. And I think they support most of the languages that ChatGPT works well with.

So we're talking between 30 and 60 languages of very, very high quality, which in itself is a game changer. If you just want to have your customer support line work with Spanish or French, you can immediately do that with almost no configuration.

Ryan Lazuka: So at the end of this configuration, one of the things that will come up is you can have a phone number, right? So people can call that phone number right away and talk to the agent you just created. Now, for the customer service example, how does it work in terms of privacy? Can the end user look at that phone call, and how does that work? Is there a transcription of the call, or are there rules and regulations around that, that that cannot happen?

Chris Estreich: Yeah, so there's a couple of things we could do. So one, there's sort of an outcome from the call. In the Vito case, it would look like a structured pizza order that you would then plug into the system for processing pizza. So we've got the structured data. We've got the summary of the call.

We've got the transcription and we have the raw recording. And so there's sort of a spectrum of what we can expose to the end user there. In terms of, I mean, the short answer is we haven't really decided how to handle that quite yet. All that stuff is available, but not really exposed.

It's mostly just the structured data at this point. In terms of the phone privacy, there's a few considerations. Like, one, does it make sense to disclose that the call is being answered by AI? I think probably it does, but I think we wanna do that in a way that it doesn't sort of ruin the illusion.

I think like, so that's sort of open-ended. And then in terms of whether you can record the call, I think it depends on the state of the caller or the state of the answerer. And so we haven't really quite figured that out, but I think like in the case where the calls are being handled sufficiently well and performing their functions, you may not need to expose the recording.

I think the recording is good to have for... generating training data to improve the quality going forward. But I think in certain cases, it may not make sense to expose that. In other cases, if it's all sort of internal communications, then it does. So there's definitely some issues there to figure out.

And you know, those are the types of compliance things that I think you worry about later; for now we worry about building the product in a way that solves a problem. That's just basically how we've operated.
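One plausible way to produce the structured outcome Chris describes, like the Vito pizza order, is to hand the finished transcript back to the LLM and ask for JSON in a fixed shape. The schema and model choice below are assumptions for illustration, not Anycall's actual fields.

```python
# One plausible way to get the structured outcome of a call: hand the finished
# transcript back to the LLM and ask for JSON in a fixed shape. The schema and
# model name are assumptions for illustration, not Anycall's actual fields.
import json
from openai import OpenAI

client = OpenAI()

ORDER_SHAPE = '{"items": [{"size": "...", "toppings": ["..."]}], "pickup_minutes_from_now": 0}'

def extract_order(transcript: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4-turbo-preview",              # assumption; any JSON-capable chat model works
        response_format={"type": "json_object"},  # ask the API for strict JSON back
        messages=[
            {"role": "system",
             "content": f"Extract the pizza order from this call transcript as JSON shaped like: {ORDER_SHAPE}"},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(completion.choices[0].message.content)

transcript = (
    "Caller: I'll take two pizzas, the biggest you've got, with sausage.\n"
    "Agent: Two large pizzas with sausage, ready for pickup in 30 minutes."
)
print(extract_order(transcript))
```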

Ryan Lazuka: Sort of like OpenAI's model in a way. They just sort of, you know, got all their data and then worried about things later, you know, how they got it.

Chris Estreich: Yeah, worry about copyright later. I guess.

Ryan Lazuka: We interviewed Replicant, and all their product is solely for customer service, from what I remember. But one of the things they said is they do reveal on every call that it's AI. But the way they did it was, when the AI answers the call, it says, hey, this is your smart automated call center, or something to that effect. Whatever language they used, you wouldn't know that it was AI, so it didn't put anybody off. Because if you just said, hey, this is the AI robot taking your call right now, a lot of people might just hang up. But the way they put it, I forget the term, but it's in our podcast with them, it made it sound very friendly, and like, oh, who cares? Let's keep talking.

Chris Estreich: Yeah, that's interesting. I think like the, for me, and I might be weird in this, but like I'm almost more likely to talk to AI than like a human. It's like, it's sort of like, if I'm gonna, if I can text you or call you, I'm gonna text you, right? Yep. Similar type of, I think, yeah.

Ryan Lazuka: No, I agree. I think a lot of people are gonna start to fall under that. They're gonna start to, they're gonna start to realize that it might take time. Like when, you know, any new technology comes out, like email or the internet, but people are gonna realize it's better to talk to AI than it is to talk to a human. I really think that's gonna happen.

Chris Estreich: And it's kind of crazy that, I mean, I don't know if you guys are so entrenched in this world, maybe you have experienced a lot of cases where you're interacting with AI during the course of normal things, like a chatbot or over the phone, but it really doesn't feel like it's penetrated my life in a way that you might expect. Because, you know, I'm an early adopter on all of these things and I see so much of it. And I'm like, okay, the capabilities are there. Why is this not widespread, infecting every area of my life? It's just not quite there yet.

Ryan Lazuka: What are your thoughts on why that is? Do you think it's just slow development speed from getting things to market or what's, what are your thoughts?

Chris Estreich: Well, I think the companies that have the distribution and reach to really make it feel like it's infecting your life, it's so early that they're kind of slow to deploy these things, would be my guess. So the classic example is, you saw the Rabbit, the Rabbit device?

Hunter Kallay: Yeah, the little device. The Rabbit device yesterday. And the feedback I saw to that was like, okay, this is silly. These models are gonna be deployed to your smartphone sort of imminently, as soon as Apple feels like they have it dialed in in a way that makes sense for them. And whether they're using someone else's model, probably not, they're probably developing their own. They'll deploy it to your phone. The capabilities will be amazing. And this Rabbit thing will be just lost to the history of failed launches.

Chris Estreich: And so I think they're taking their time with it. And as soon as that happens, then it feels like, okay, LLMs are gonna be a part of everyone's life almost daily. And so I think for phone systems and stuff that you interact with like Shopify stores, I think that it's just a matter of time, but the players with the massive distribution are taking their time with it.

Hunter Kallay: Particularly with AI voice, seems like that's a big industry that you're taking part of. Where do you see the future of AI voice going? We've talked to some people who talk about voice actors and how it's kind of elevating the game for voice actors. Where do you see this application in the next year or two or five, 10 years down the road?

Chris Estreich: Well, there's the doom and gloom way to look at it, where it's replacing humans to a certain extent. But there's also all these cases where, on your iPhone, you have this voice cloning thing, and I think presumably the reason why they built it is, if you lost your voice for whatever reason and you need to be able to synthesize it, then you can do that. And so in a sense, it's a really empowering feature, and that's one way to look at it.

Another thing you mentioned is people on YouTube who maybe don't wanna get in front of a camera, or they hate the sound of their voice and don't wanna speak, and they can use speech synthesis technology to create content where otherwise they wouldn't be able to. I think that's a net win for the world, and that's the way I tend to look at these things. And so in terms of voice acting and stuff like that, I think the types of stuff getting created will get better and better over time, and things that might otherwise not have existed will exist. There are certainly trade-offs involved there, but I tend to look at it more optimistically.

And certainly there are some financial motivations. I mean, I think you saw it with the Screen Actors Guild and writers' strike stuff, where it's like, okay, this is sort of a scary scenario, to be on the precipice of replacing all these very skilled people. But thinking for myself, I use Copilot in Visual Studio Code, and I can tell it to write particular pieces of code that I'm too lazy to write or that are pretty trivial, and it does a fantastic job. I feel like that frees me up to work on more creative things, sort of on the edge of what AI can't do. And in a sense, the productivity increase I'm getting is pushing me in that direction. And I think overall I'm gonna build better, more complex, more interesting things as a result.

And I think there's a way to frame, you know, the question you're asking similarly. You know, I'm not necessarily afraid of being replaced as like a product builder quite yet. I think it's just gonna improve my output and my capability.

Hunter Kallay: Yeah, I thought about this quite a bit. And I think AI is really raising the bar in almost every single industry. You can't be average anymore because AI can do average in about every single industry. So you have to find ways to leverage AI in whatever industry you're in and kind of arise above that to do something even greater, something that requires even more human creativity. But that doesn't necessarily mean that there's more work on the part of the individual because now you have this amazing tool that can come alongside you, take care of that sort of mundane stuff that you were hinting at, stuff that takes a lot of time.

And now you can focus your energy on building on top of that. It's more creative stuff, more innovative stuff. And I think it's going to really raise the level of humanity in general in almost every single industry. So it's pretty exciting, I think.

Chris Estreich: Yeah, and I think there's a parallel to company building, especially in the AI era, where most of the technology is going to be commoditized or developed by a very small set of elite ML practitioners, like the researchers working at OpenAI, and you're not really going to compete with them on building foundational models or things like that. But you have these fantastic use cases for the technology they have built. So how do you innovate in that world? And I think it comes down to, there need to be business model innovations or marketing innovations or innovations in other areas that let you differentiate.

That's not necessarily like hand-to-hand combat on the tech stack or something like that. And I think that's been a tough transition for me to think through. Like when I built my first company, it was far less competitive. There were far fewer companies being built, and it just feels so noisy. Like every random idea I have, especially in AI, I just do a quick search and I find 10 other companies doing that thing. And so that just means that the way innovation plays out is going to look a little bit different than sort of where it historically has been.

Ryan Lazuka: Yeah, it's almost like now, if you're building a startup, you might have a great idea, you might have 10 great ideas, but you really have to think through what the landscape's going to look like a year from now. For a lot of the AI startups out there right now, ChatGPT could come out with something the next day and just blow your business model out of the water, and your business for the most part. Just like the GPT Store that came out recently, I think it came out today: you'll be able to make custom chatbots via OpenAI on their store. And that might put a lot of businesses out there out of business, like something that summarizes YouTube videos or something like that.

There might be a website that does that right now via AI, but there could be a free tool on the GPT Store that does that for you. And it's hard, you've got to think through all that stuff before you launch a product now.

Chris Estreich: Because like you said, there's a lot of young entrepreneurs building interesting things in AI that are going to learn the lesson of platform risk pretty quickly, that maybe they haven't experienced before.

Ryan Lazuka: quickly, that maybe they haven't experienced before. Right, even if you have a tool that does, can't be replicated on a GPT store, you're basing your business model, I'm not saying you in particular, but most of the companies out there are basing their business model on opening AI's API, and they have full control of that. So that's just something else to keep in mind that's a risk that you have to keep in mind and be prepared for as well.

The good thing is that there's a lot of other APIs out there that you can call for LLMs, and there's a lot of open source ones that are coming to the market as well. It makes me think at night sometimes, like I feel like we're going down this road where there's going to be five, 10 big companies that control everything in the United States. I hope it doesn't get like that, but that sure seems like where we're headed.

Chris Estreich: Yeah, that's why I'm a huge fan of Meta's approach to this stuff. I feel like these things need to be open, the models need to be open, the weights need to be open. I think it's a scary scenario where there are just a handful of closed source things that are sort of best in class that everyone uses. And there's a lot of sort of sinister stuff that could be lying in these models, where we either don't understand quite how they work, or there's some sort of, not necessarily malicious, but some decision making governed by KPIs that creeps into the development of these models. So I'm just rooting from the sidelines for the Metas of the world, or the Mistrals, to build these very capable open source models.

Ryan Lazuka: I mean, what you're hitting at, Chris, is that even with ChatGPT, if you go on there right now, you can tell that it's influenced by the bigwigs of the companies, the decision makers, in how it's gonna work. Like right now you can't go on ChatGPT, and they did this recently from my understanding, you can't go on there and say, make me a picture of Taylor Swift riding a horse via DALL-E 3. It won't let you do that anymore.

So any kind of big name celebrity or big name politician, they won't let you create an image of them anymore. So there's all these restrictions set in place by ChatGPT and OpenAI, whereas the open source models don't have that. I mean, they do if you wanna implement those restrictions, but it's a lot more refreshing to go on Mistral and ask to be roasted and it'll swear at you, whereas with ChatGPT, if you ask it to roast you and swear at you, it'll just say, I cannot do that. So just from those small examples, you don't wanna have one big company out there that has full control of all these chatbots, because that would be a bad thing.

Hunter Kallay: I wanted to ask you about copyright. That's a big concern right now as far as material goes, but specifically with voices. Who owns the voices that are created by some of these AI voice generators? And who should own them?

Chris Estreich: It's sort of an interesting question. The voice cloning capabilities make it really easy to take someone's likeness and profit from it in certain ways. And I think it's sort of TBD what the current state is; there's a bunch of gray area built into it.

So I feel more comfortable talking about where I believe it should go. And the attitude I tend to have is, if we are too restrictive on copyright, then there's a real opportunity cost to losing potential innovation. So I tend to fall on the side of, let's ignore copyright for now while we figure out how the landscape is going to evolve. And if it turns out that some group is disproportionately hurt in a way that seems avoidable, and it makes sense, then maybe you can legislate that after the fact. But if you're on the wrong side of a trend and you feel like this kind of thing is directly impacting you, without really understanding the causality, I think it's easy to loudly complain about the state of the world and how the gray areas around copyright and this stuff are disproportionately hurting you.

And it's hard to know what's actually true or not. So yeah, this came up with music. There are certain artists that tend to see themselves as benefiting from people cloning their music, because it brings attention to them as an artist. And then there are people that feel the exact opposite, like this is completely cannibalizing my sales or whatever. So there's definitely different ways to think about it. And in the absence of real compelling data that suggests we know what's happening with this stuff, I tend to fall on the side of, let's ignore copyright right now. I think it's better that OpenAI did, and it yielded these capabilities that are, in my mind, unequivocally good for the world, that might not otherwise have existed with the quality that they currently do. And, you know, now they're big enough to where they can fight legal battles against the New York Times or whoever else.

Ryan Lazuka: You mentioned a little bit earlier when we first got on the podcast about your tech stack. But can you go into a little bit more detail? Feel free to share as much or as little as you want about that as well.

Chris Estreich: Yeah, I mean, I think the most interesting part is that it's almost real-time audio processing. So as soon as a large enough chunk of audio data comes into our system, there are a bunch of operations that get performed on it.

And you can think about it as pipelining this audio data through a series of models to process it and spit it back on the other end as quickly as possible. So you get an audio chunk, and then you decide, does this audio chunk represent a pause? Is there a long enough silence baked into this that we can begin the process of transcribing it through a speech recognition system? And so there's what's called a voice activity detection model, which has basically been trained on: is this human speech, yes or no? And if no, what's the duration of non-human speech?

And if that's long enough to represent a pause that suggests we should process the audio chunk and transcribe it into text, then we run that through a model called Whisper. It's basically an open source model that OpenAI built; it was originally written in Python and has been ported to C++ so it runs much faster.

So it's running on a local machine here. It'll take that chunk, transcribe it into text, and then it will run that through our prompt in the LLM. And then as soon as it starts streaming tokens back in response to that text, it takes those chunks and sends them to ElevenLabs, which again has a streaming API. So you take that streaming API and then spit it back. In the case of a phone, it's a Twilio app, so it takes that streaming data and echoes it back on the other side.

In the case of the web application, it uses WebRTC, so it's pushing data back on that socket. And so you get very small incremental chunks of processing happening.

And the reason why you do that is because if you buffered it up, then there would be long delays and it would sort of break the illusion. And so building that sort of pipeline of like processing the smallest amount of data to actually successfully make the application work was sort of like the, not really an innovation, but like that was sort of the hard part of this. And then obviously like we didn't develop our own speech recognition or text to speech or, you know, like that stuff is sort of out of the domain for what we could have done in a couple of weeks. But yeah, that's effectively how it works. And there are some other interesting things that can be done that we're not currently doing. So there are models that are capable of understanding when it's one speaker versus another. So you can imagine a conversation happening between multiple people or someone talking to you in the background while you're on the phone, well, you can actually filter that out by running it through that model in parallel to know that, okay, you're speaking, so I will run that speech through the pipeline, but I will ignore all the other speech coming in. And so there's things like that that are actually possible in pretty high quality now.
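Putting the pieces Chris just walked through into one skeleton: buffer small audio frames, use voice activity detection to spot a long-enough pause, transcribe the buffered turn, stream tokens from the LLM, and hand each piece to streaming text-to-speech as soon as it exists. The VAD, Whisper, TTS, and transport hooks are left as stubs, since those engines (whisper.cpp, ElevenLabs, Twilio or WebRTC) each have their own APIs; the model name and thresholds are assumptions, not Anycall's values.

```python
# Skeleton of the chunked pipeline described above: buffer small audio frames, use
# voice activity detection to spot a long-enough pause, transcribe the buffered turn,
# stream tokens from the LLM, and push each piece to streaming TTS as it arrives.
# The VAD, Whisper, TTS, and transport hooks are stubs; thresholds and model name
# are assumptions for illustration.
from openai import OpenAI

client = OpenAI()
PAUSE_MS = 600   # silence long enough to treat the caller's turn as finished
CHUNK_MS = 20    # duration of each audio frame coming off the wire

def is_speech(frame: bytes) -> bool: ...        # stub: voice activity detection model
def transcribe(audio: bytes) -> str: ...        # stub: whisper.cpp over the buffered turn
def tts_stream(text_piece: str) -> bytes: ...   # stub: streaming text-to-speech
def send_audio(audio: bytes) -> None: ...       # stub: push onto the Twilio or WebRTC leg

def handle_call(frames, history, system_prompt):
    """frames is an iterator of small audio chunks from the phone or browser."""
    buffer, silence_ms, heard_speech = b"", 0, False
    for frame in frames:
        buffer += frame
        if is_speech(frame):
            silence_ms, heard_speech = 0, True
        else:
            silence_ms += CHUNK_MS
        if not heard_speech or silence_ms < PAUSE_MS:
            continue                            # keep buffering until speech followed by a real pause
        user_text = transcribe(buffer)          # ASR runs only once the pause is detected
        buffer, silence_ms, heard_speech = b"", 0, False
        history.append({"role": "user", "content": user_text})
        stream = client.chat.completions.create(
            model="gpt-4-turbo-preview",        # assumption; the conversation cites GPT-4 Turbo
            messages=[{"role": "system", "content": system_prompt}] + history,
            stream=True,                        # tokens arrive incrementally
        )
        reply = ""
        for event in stream:
            token = event.choices[0].delta.content or ""
            if token:
                reply += token
                send_audio(tts_stream(token))   # in practice you'd batch tokens into phrases first
        history.append({"role": "assistant", "content": reply})
```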

Ryan Lazuka: I think one thing that separates your tool from others that we've used before is, it feels like the latency is really low and good. Like it feels very natural. And there's still so much more improvement to be made too. So that's really cool that you guys got it down pretty good for the end user right now.

One problem that keeps coming up is, what do you do when someone's calling your app and the caller interrupts the chatbot? That seems like it happens all the time in natural conversation when humans are talking with each other. But how do you mitigate that, or make it feel natural, when you're talking to an AI entity and the end user interrupts the AI? Does the AI just forget what it was gonna say? Do you force the AI to forget what it was just about to say and listen to what the user's saying in that moment? Or how does that actually work? Because that seems like a hard thing to get right.

Chris Estreich: The most annoying thing about that actually, well, first of all, we can run in non-interruption mode. And when you run in non-interruption mode and just assume that you're not attempting to interrupt, like the conversation flows much more seamlessly.

However, we really wanted to support interruptions. And so there's a couple of issues. One is there's a certain amount of data that's already on the wire, that is gonna go back to your phone, and when you interrupt it, we can't undo that, the data's already been sent. And so it comes down to the latency of the systems that are moving those bits on the wire: if that's really fast, the interruption feels seamless, but if your network is a little bit slow, or any subsystem along the way is experiencing unusually high latency, then the interruption happens but you keep getting output from the agent that you don't expect. And that is a tricky problem to deal with. The best way to deal with it is to remove as many third parties from the data flow as possible. So for example, when you're using WebRTC versus Twilio, it doesn't have to go through Twilio's systems at all.

And so it feels a little bit more snappy and responsive in that case. But since we're not going to replicate Twilio's infrastructure, there's really no other way to deal with that for now. So anyways, that's one problem that's difficult. And the other problem that's difficult to deal with is figuring out when you should interrupt. There's a lot of external noises that will trigger interrupts that you shouldn't necessarily interrupt for. And so what we do is we make it a little bit more difficult to interrupt than it could be. We notice when there's the start of some audio segment that may represent you interrupting, and we kind of let that run to some sort of natural pause. And if it feels like you're saying, hold on, then as soon as the pause comes after the hold on, we've already processed that and we know that we want to interrupt. So we kill all the existing processing that's happening in the background and sort of start over. But in terms of what the prompt receives, the prompt receives the full transcript of everything that's happening; if you cut the agent off mid sentence, it'll see that cutoff and process it like a real transcript where an interruption happens. And so in terms of the response that OpenAI is generating, that is actually pretty good.

And so that works well. It's just sort of like deciding when to interrupt versus not interrupt is a tricky problem. And then sort of preventing this excess of data on the wire from flowing through the system.

So those are the two challenges. And I think like we can do a much better job, but the difference between a system that you can interrupt and not I personally found, I much prefer the one that you can interrupt, especially when you're working with a prompt that tends to be verbose and you just wanna like go on to the next question or whatever.
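A rough sketch of the interruption handling Chris describes, using asyncio: while the agent's reply is being played out, incoming caller speech only counts as a barge-in once it has gone on long enough, at which point the in-flight playback is cancelled and the cutoff is recorded in the transcript so the LLM sees the interruption on its next turn. The threshold and interfaces are invented for illustration; they aren't Anycall's actual values.

```python
# Rough sketch of barge-in handling: caller speech only counts as an interruption
# once it persists past a threshold; then in-flight playback is cancelled and the
# cutoff is recorded in the transcript. Threshold and interfaces are invented.
import asyncio

MIN_INTERRUPT_MS = 400   # caller speech shorter than this is treated as stray noise

async def speak(spoken: list, reply_pieces, play_chunk):
    """Play the reply piece by piece so a cancellation leaves a clean cutoff point."""
    for piece in reply_pieces:
        spoken.append(piece)        # remember what actually went out on the wire
        await play_chunk(piece)     # audio already sent can't be recalled; we only stop queuing more

async def wait_for_barge_in(speech_ms_frames) -> bool:
    """True once the caller has kept talking long enough to count as a real interruption."""
    heard = 0
    async for ms in speech_ms_frames:    # e.g. VAD reporting speech duration per frame, 0 on silence
        heard = heard + ms if ms else 0  # reset on silence so coughs and noise don't trigger it
        if heard >= MIN_INTERRUPT_MS:
            return True
    return False

async def agent_turn(reply_pieces, play_chunk, speech_ms_frames, transcript):
    spoken = []
    speak_task = asyncio.create_task(speak(spoken, reply_pieces, play_chunk))
    barge_task = asyncio.create_task(wait_for_barge_in(speech_ms_frames))
    done, _ = await asyncio.wait({speak_task, barge_task}, return_when=asyncio.FIRST_COMPLETED)
    if barge_task in done and barge_task.result() and not speak_task.done():
        speak_task.cancel()                                     # kill the rest of the in-flight reply
        transcript.append(("agent", " ".join(spoken) + " --"))  # the cutoff stays visible to the LLM
    else:
        barge_task.cancel()
        await speak_task                                        # let the reply finish normally
        transcript.append(("agent", " ".join(spoken)))
```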

Ryan Lazuka: Right, and if you get interruptions even decently right on the AI end, it feels so much more natural talking to it, rather than some AI bot that just forces you to listen to whatever it wants to tell you and you can't talk back to it, you know?

Chris Estreich: Yeah, exactly. But I mean, right now the thing that happens more than I'd like is you interrupt it, and then you interject something and pause, and then start talking again. But that second pause, if there's too much of a pause, the agent will try to respond. And so if you don't get the pauses right, then you end up both talking again at the same time.

I think that happened, Hunter, with your call with Vito once or twice. And it's just annoying, but I think we can tune that to where it feels a little more natural. But you know what, that actually happens in the real world too.

Ryan Lazuka: Exactly, exactly.

Hunter Kallay: People talk over each other all the time. It does happen in the real world all the time. Especially on these podcasts when Ryan and I try to talk at the same time, that happens all the time. I mean, it works really well. And I think that over time, it will continue to get better for sure. And I'm excited to see what happens for sure.

Chris Estreich: Yeah. Well, if you break down the latency, the biggest latency is the time to generate the first token from OpenAI. And I think the worst case from your call, I'm looking at it, it took OpenAI 2.4 seconds to respond with a first token to one of your inputs. On average, it's actually less than a second. And when it ends up being closer to the low hundreds of milliseconds, a couple hundred milliseconds, it starts to feel snappy. But not surprisingly, the variance of latencies is pretty high. I think there's just a lot of usage of OpenAI. And who knows how scalable their infrastructure is.

I think they're just kind of like trying to meet demand. But if you were to move to a model, a less capable model that ran locally, that was fine tuned for your particular use case, then that latency would be gone. And it would almost be like, I think it would be around 400 milliseconds to respond to your pauses, which would be about as good as what a human is gonna do. And so when we get to that point, then this stuff is really, I think gonna be, you know, like when there's Llama 3 that comes out, that's like as good as GPT-4 that you can run locally with pretty strong hardware, then yeah, this thing is really gonna be impressive.
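A quick way to see the latency split Chris mentions is to time a streaming chat call and note when the first token arrives versus when the reply finishes, since time-to-first-token is the gap the caller actually hears. This assumes the OpenAI Python SDK; the model name is a guess.

```python
# Quick way to see the latency split described above: time a streaming chat call and
# record when the first token arrives versus when the reply finishes.
import time
from openai import OpenAI

client = OpenAI()

def time_to_first_token(prompt: str, model: str = "gpt-4-turbo-preview"):
    start = time.perf_counter()
    first = None
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for event in stream:
        if first is None and event.choices and event.choices[0].delta.content:
            first = time.perf_counter() - start   # latency until the agent could start speaking
    return first, time.perf_counter() - start

ttft, total = time_to_first_token("Say hello as Vito the pizza shop owner.")
print(f"first token after {ttft:.2f}s, full reply after {total:.2f}s")
```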

Ryan Lazuka: So one day all of our computers and phones will have some kind of local Llama on them, and it'll handle that data for us instead of calling out to an API at OpenAI.

Chris Estreich: And then you could just bake the logic onto the phone. But I mean, in this case, my server processing the data won't have to call out to OpenAI at all. And that will be a huge win for this particular application.

Ryan Lazuka: Well, we'll start wrapping things up here, but thanks again for coming on, Chris. We really appreciate it. If anybody wants to check this out, again, it's anycall.ai. It's one of the best conversations I've had with any kind of AI. It's really fun to play around with their demo, so go check them out. It feels very natural. Is there anything else you wanna promote here, Chris, other than your company?

Chris Estreich: No, no. Thanks for taking the time to chat. It was a lot of fun.

Hunter Kallay: Yeah, it was super fun. And then be sure to subscribe to Ryan's and my weekday newsletter at fry-ai.com, where we do the top three stories of the day along with some tools and community interaction. And we have a mystery link too.

So that sets us apart. You can click that link every single day and it'll take you to something crazy, cool or interesting in the world. I should go click that right now.

Yeah, go click the mystery link right now. And then on Sundays, we do deep dives into very cool developments and developers like this one, anycall.ai with Chris. And then be sure to subscribe to our YouTube channel, Behind the Bots, so you can see all our cool interviews. Thanks for watching.