Behind The Bots

Josh Meyer, co-founder of the synthetic voice startup Coqui.ai, joins the Behind The Bots podcast to discuss the potential of AI-generated voices. Coqui provides an intuitive platform for easily creating high-quality synthetic voices. 

In this episode, Josh covers:

- How Coqui helps creative teams like video game studios produce realistic dialogue and voice acting at scale. 
- Coqui's "prompt-to-voice" tool that allows users to describe the type of voice they want to generate.
- The challenges involved in making conversational AI sound natural, including aspects like interruptions in dialogue.
- How the synthetic voice landscape is evolving with more specialized vertical applications.
- Coqui's vision for the future, including making open access generative models available to everyone.
- Advice on how to get started with Coqui through their website, GitHub repo, and Discord community.

If you're interested in the growing potential of AI-generated voices, make sure to check out this insightful conversation.

COQUI.AI

https://www.coqui.ai
https://github.com/coqui-ai
https://discord.gg/CzxHHp8mtZ
https://twitter.com/coqui_ai


FRY-AI.COM

https://www.fry-ai.com/subscribe
https://twitter.com/lazukars
https://twitter.com/thefryai

Creators & Guests

Host
Ryan Lazuka
The lighthearted artificial intelligence journalist. Building the easiest-to-read AI email newsletter and daily Twitter threads about AI.

What is Behind The Bots?

Join us as we delve into the fascinating world of Artificial Intelligence (AI) by interviewing the brightest minds and exploring cutting-edge projects. From innovative ideas to groundbreaking individuals, we're here to uncover the latest developments and thought-provoking discussions in the AI space.

I pitch it as the garage band of voiceover. So we've developed what we call prompt to voice. We help you kind of craft performances. The bar you have to pass to be a voice actor has just been raised a lot. You're getting beat out by AI any day of the week. You can say, oh, talk like a cowboy, or talk like you're in a horror movie, or talk like however you can imagine. It's not just Alexa coming out, this monotone voice.

So yeah, in a nutshell, my
background personally, I came out of academia. I

did a PhD in speech and language technology,
mostly focusing on speech recognition. So speech

to text. And I finished that a couple of years ago.
So that was kind of the beginning, well, I guess

maybe middle of the boom of the neural networks and
deep networks in particular. And yeah, so I was

doing academic research specifically on making
speech recognition work for so-called low

resource languages, or basically just languages
where you don't have a ton of data sitting around

and you have to hack away a little bit because you
can't just throw hundreds of thousands of hours of

audio at these models when you only have a little
bit. So towards the end of my research, I joined up

with the folks at Mozilla. They were working on
this open source project: there was the open source code for training these kinds of speech recognition models and also deploying them, that's the code side of things. And there was also a data side of things, the Common Voice project, which is, as

far as I know, one of the most popular kind of
crowdsourcing projects specifically made for

gathering data for training machine learning
models. And I joined the team and my kind of angle

was make the technologies work for as many
languages as possible. But when I joined, it was

working really well, both the data collection and
the speech recognition was working really well

for English. And I pushed and helped to get it
working for more and more languages. And fast

forward, I was collaborating with that team, the
machine learning group at Mozilla for a couple of

years when we kind of hit this point where the code
we were working on and the community we were

working with was getting a lot of traction. Our
speech recognition models, we were working on the DeepSpeech project in particular, were getting a
lot of traction. People were using it in

commercial applications and we were getting
pinged every single day, like, let me pay you to do

this, let me pay you to do that. And so we decided to
basically spin out of Mozilla, get some venture

capital and try to take the project and the
community to basically to the next level to

actually get it some resources because we were a
pretty small team. We were like four or five people

at any given time, the actual core team with a
larger community around that, but to actually get

more salary to people working on this, it takes
some money. So for the last, I think, two and a half,

three years, basically we've been working on
Coqui, and at Mozilla our focus was mostly speech

to text, but when we were starting to split out, the
person who was working on speech synthesis was

making some breakthroughs that were getting a lot
of traction. And we actually had this demo that

went viral at number one on Hacker News for a while
too, for voice cloning in particular. So clone a

voice from a small bit of audio and then synthesize
that voice speaking multiple languages. We were,

I think, as far as I know, the first ones who had that
working like in kind of a production setting. And

then basically we stopped working on speech
recognition, unfortunately. I mean, when you

have a finite number of humans working on
something, splitting between speech

recognition and speech synthesis which are both
difficult, it's kind of not tenable for a business

model. So we went with speech synthesis and
haven't looked back. Happy to dig in, there's lots

of stuff, I'm happy to dig into whatever is
interesting to you guys. Well, it's cool. So

you've been working on this for a while because
artificial intelligence, machine learning, all

this stuff has sort of come into the limelight this last year, popularity-wise, for the general

public, but you guys have been grinding away in the
background and gotten to the point where your

work's appreciated, at least, it's gotta be
pretty cool to see that. Yeah, honestly, it's been

weird, especially, it's been weird, but it's also
been great to see how much enthusiasm is around it.

Especially for the generative AI, like we've been
working with technically generative models for,

I mean, I've been working with them for over a
decade. We never called them generative. I mean,

for deep learning, you'd usually classify models as: is this a classification model or is this a generative model? But generative as a thing was never out there, hypey, on its own. But
it's nice because now people are looking at these

technologies in a different way, even though
they've been around for a while, like speech

recognition has been around for, I don't know,
since the 70s or the 60s maybe. It's been around for

a while. But only in the last year with these
breakthroughs and the kind of diffusion models

have we gotten to the point of not just human-like
speech, but like convincing, entertaining, fun

human-like speech. So it's been a whirlwind,
yeah. Yeah, I bet it's crazy. The first time I heard

about, I mean, one of your competitors is ElevenLabs.
And I heard about them before I heard about you

guys, but I remember using them for the first time
and your products are similar in quality, I think,

but it blew me away how real the voices are. Like if
someone out there listening to this podcast

hasn't used Coqui, how do you say it? I'm gonna screw it up every time I say it. It's Coqui. Coqui. Coqui, okay. Coqui, it's coqui.ai. If you haven't
checked their site out, check it out because it'll

blow you away. It's really cool. You type in any
kind of text you want and someone will read it like

you're telling your parents to read it or your
friend to read it out for you, but it's a

computer-generated voice. But if you wanna go in a
little bit about the website and how it works,

Josh, that'd be great as well. We've got a few kind
of products. One of them is this Coqui Studio. It's an application for creators and creative teams. So I
pitch it as the garage band of voiceover, but with

AI infused. So that's one kind of, probably the
most common way people interact with us is through

that. It's at coqui.ai. We also have an API. So if
developers want to integrate us into their

applications, whether that's speech synthesis
or voice cloning, that's something they can do

from the API as well. Basically the core technology, the core models that we make available in the creator application, we also make available with an API. And also we have this open

source, like we've been talking about this open
source side of the project, which is the models and

the code that we are developing. That's also
available on places like GitHub and Hugging Face.
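For the code-inclined, a minimal sketch of what that open source side looks like in practice, cloning a voice from a short reference clip and synthesizing it in another language with the coqui-ai/TTS Python package; the model name, file paths, and exact call details here are illustrative and may differ between releases.

# A rough, illustrative example, not an official Coqui snippet.
from TTS.api import TTS

# Load the open-access XTTS model (model name is illustrative; check the repo for current IDs).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice in reference.wav and have it speak French text.
tts.tts_to_file(
    text="Bonjour, ceci est une voix clonée.",
    speaker_wav="reference.wav",   # short clip of the voice to clone (placeholder path)
    language="fr",
    file_path="cloned_output.wav",
)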

So starting kind of with the website, Coqui Studio. Yeah, in a nutshell, it's like garage band

for voiceover. And it allows you to organize your
projects. Let's say you're working on a movie or a

video game. Those are kind of the two big
applications. A lot of our customers are coming

from the video game industry in particular. But
when you organize, let's say a movie, you have

scenes. So we also have scenes in that structure.
And you have dialogue and different characters

speaking to each other. And so we give you tools to
organize projects and scenes within projects and

then lines of dialogue. And also we give you some
tools to adjust the dialogue. So it's not just like

you give us text and we give you audio, and if you don't like it, you just shrug your shoulders. With the older models, that's how it worked: you give us text, we give you audio, and if you don't like it, that's your problem, right? So for creative

individuals, that's not good enough, obviously.
So we've built in these tools to help you kind of

craft performances. Yeah, for the models and also
the languages that we make available, that's kind

of a big part of what we're working on is rolling out
to more and more languages. I think right now as

we're speaking, there's seven languages
available in Coqui Studio. These are kind of seven of

the more spoken languages. And we've got 13 total
that are working. Our open source is kind of like

the bleeding edge of what we're working on. So
there's 13 languages available if you go to

Huggingface or GitHub and we're polishing those
up and then putting them into Coqui Studio. But yeah,

we try to make the core models we're working on
available whether or not you know how to code and

also whether or not you've got a big budget. So
people who want to use our models from GitHub and

Hugging Face, we release them as open access
as opposed to open source. There's a strict

definition of open source that our models do not
meet. We created a new license, the Coqui Public Model License, because, and this might be a rabbit hole, but open source licensing for machine learning models isn't, I think, very, how do you say? It doesn't work very well, basically. But

yeah, that's kind of the high level of the three things that we do: the open source, the API and then Coqui

Studio. So it sounds like one of the
differentiators between you and your

competitors is like you have more advanced tools.
If you want to do fine-grained things with your tool

it's a little bit easier to do it than maybe someone
else. So when you're generating speech, you're

doing it for a certain application for a certain
end product. It might be something that's real

time and you don't want a human to review it. Like
you just have, let's say you've got like millions

of audiobooks and your goal is to make those
accessible as fast as possible to people with

vision impairments. Like it's just not possible
to take the Library of Congress or, like,

Project Gutenberg or something and synthesize
that and then have a human listen to make sure

there's no mistakes. It's not possible, right? So
you want something that's robust and you can just

throw text at it and you're gonna be confident that
it's gonna come out mostly okay. That's one

application that's a certain kind of model and
that requires a certain kind of API. On the extreme

opposite end of that, where we're working is in the
creative applications that need, that require a

lot of human in the loop. Like this is a, we're
creating tools for humans with the Coqui Studio.

So specifically dialogue. If you're making a
movie or a video game and you're working with like a

voice actor, you're not gonna stop when you have
like a mediocre reading of a line of dialogue,

you're gonna work until you get that line of
dialogue perfect. Or else you're just not gonna

use it. You might use subtitles instead, right?
And traditionally that's what's in the gaming

studios because it takes so long to make really
good dialogue. You need really good voice actors.

You need really good creative directors. You need
really good sound engineers and you need sound

booth time. Like video game studios, they use so
much sound booth time that they will build into

their studios like where the engineers are
sitting, coding up the video game. They'll have a

back room, which is just a sound booth, a really
nice sound booth. And they use it all the time. It's

a very creatively intensive process and it's very rigorous. And instead of shipping bad audio,

they'll just ship subtitles. That's why in lots of video games you have 10 characters who get voiced

and then you have whatever, 300 other characters
who get subtitles. And so we're trying to make it

possible to have that same level of creative
control, but much more efficient so that you can

get really good, excellent voiceover for all of
those other unvoiced characters. Yeah, that's

awesome. I mean, like you said, a video game could have 300 or 1,000 characters, and to be able to have a dialogue, a written-out dialogue, and have Coqui do that for you is pretty cool, pretty amazing.

Instead of hiring a voice actor, which it sounds
like it's gonna happen one day, like maybe

there'll be no voice actors, it's possible. Yeah,
I think, so I thought about this a lot. I personally

don't think that really, really talented voice
actors are gonna lose their jobs anytime soon.

However, I think that the bar, the barrier to
entry, like the bar you have to pass to be a voice

actor has just been raised a lot. Like you can't be a
subpar voice actor anymore because you're

getting beat out by AI any day of the week. And this
is kind of a fundamental split, like with the voice

technology, what happened in the last year? So
since like 2016, we've had generative voice that is really human-like, such that if you did a perception test, is this human or is this not, you would fail it, from like 2016. However, what's changed? Why is 2022 the year of generative AI? For voice in

particular, what happened is we went from
human-like to entertaining humans. So like for

the last five years, we've had human-like voices,
but they're just really boring humans. And it was

something that you could use to convey
information like, Siri tell me the time,

whatever, and Siri maybe sounds human, but it
sounds like a really boring human who hasn't had

enough coffee, right? And that's not good enough
for movies and video games. Now it's different. I

mean, you have AI voices that can shout at you, like
convincing us that the voice is human isn't

enough. The voice has to convince us of something
else, like has to make us afraid, make us sad, make

us excited. And that's where it's getting really
cool this last year. And a lot of cool things you can

do is, so for example, what Josh is saying right now, like you can say, oh, talk like a cowboy, or

talk like you're in a horror movie, or talk like
however you can imagine. So it'll have that kind of

emotion behind it, whereas it's not just Alexa
coming out, this monotone voice telling you

something. So it's really cool to play around
with, and it's fun too. It's not, you can use it for

video games, but you can also use it for just to play
around with, for entertainment purposes. You can

use it for YouTube. Like so one use case it sounds
like is, you could do faceless YouTube videos, and

have them in seven different languages
instantly, off of one script. So things like that,

where you just write one script down of what you
want the YouTube video to be, and it will output

that in seven different languages, and whatever
style you wanna come across in the voice as well. So

if you wanna have an accent, a Mexican accent, or a
French accent, it's possible that that can

happen. So there's so many, like you alluded to
earlier Josh, like it's a rabbit hole with all of

this technology. So this tool is a rabbit hole as
well, cause you can use it for so many different

things, which is really, really fun to play around
with, for entertainment purposes, but also

awesome for real world purposes, like for
businesses as well. So it's really cool. So Josh,

can you just walk me through, let's say somebody
says, I wanna try out this Coqui, seems really

cool. What does it look like when you, I mean, Ryan
and I visited your website, but for the average

person who's never been to your website before, how can they use the Coqui Studio? What does it look like? Kinda walk us through that process. So
basically you can sign up, and you get a pretty

short free trial period. We used to have it longer,
but that was getting a little bit of abuse. We have a

shorter free trial, but so basically you sign in,
and you've got this kind of homepage with your

projects, and they look like big folders, and you
open a project, and you've got lines of dialogue.

Each line has controls on it that you can adjust to change how that performance happens. In any given kind of scene, you have at the bottom this timeline, which is kind of, I guess, the most

garage band looking part of it, and you can review
your whole dialogue with all of the characters by

playing through that timeline. So the typical
workflow is you, this is the way I do it at least. So

let's say I have a dialogue that I want between two
people, let's say Fred and Nancy. I go in, and I know

the text ahead of time, let's say. I put all of the
text in the lines, say I've got 10 lines of

dialogue, so I've got all 10 lines in, and then I
start reviewing, and then tweaking. And the

tweaking process is, I'd say, the most time
intensive. Like to go from zero to first draft is

like instantaneous, but then when you really
wanna polish something, make it good, that's

where the creative kind of director mode comes in.
And then when you're done with that, you can hit

export, and then you've got a WAV file that you can do what you want with. You can drag it over into

iMovie, you can put it into Unreal or Unity, but
that's the typical workflow. And when you say

tweak it, do you mean tweak the text or tweak the
voices or pretty much everything? Both. Okay.

Okay, this is actually a good point. Something
where we, Coqui, are pretty clearly differentiated

versus other voice providers is we give you tools
for tweaking the performance of a single line, but

we also give you, this is the differentiating
part, we also give you tools to create new voices.

So the kind of typical way, in 2023, that you might go create a new voice, a synthetic voice, is to make a

clone of somebody. That's really fun, that's
really cool, but if you're actually gonna use this

in like a commercial application, you could get in
trouble very easily. Like I can't just go, I can't

just go have Morgan Freeman voice my car
commercial. Oh man. As much as that, I know a lot of

people would like to do that, but if you're like a
serious company using this in production, you're

just begging for a lawsuit. We've developed what
we call prompt to voice, which is basically you

describe in text the kind of voice you want. And if
the model works well, that's the voice you get. So

if I say I want a 30 year old man with a New York accent
who sounds like he smokes too many cigarettes, I

can type that text in, hit generate, and I should
hopefully get a voice that sounds like that. It's a

product, as far as I know, we are the only ones who
have this product working in production. It's

something that we've been working on for a while. I
think we announced it in March, and it's something

we're constantly developing to get better and
better, but it's a really fun tool. It's kind of the voice creation side of things. So you can do that.
You can also, we have this kind of, I guess a guided

version of that, which is we give you kind of like a
word cloud and you hit buttons of the words, the

qualities of the voice you want. And then if it
works well, you should be getting a voice that

sounds like that. So, and we guide you through,
this is actually something that we rolled out not

too long ago, maybe a month and a half ago, two
months ago. So like you start saying, I want, you

know, a male who's a teenager, who's got an Australian accent, and then you start hitting

these kind of attitude attributes or voice
attributes, like happy go lucky. You can click

that. And then based on what you click, it starts
giving you other traits that go well with that.

It's a pretty fun feature. So, yeah, so there's the
tools for adjusting an individual line of

dialogue in the whole project, but then there's
also these tools for creating voices. And

practically what our users do is they create a
bunch of voices and they have a kind of voice bank.

And then they use those for different characters
and different projects. And we very much see, as we move forward, Coqui Studio becoming this kind of team collaboration application. So in any kind of game

development studio, you've got the writers,
you've got these like narrative designers,

you've got the audio people, you've got the
creative directors and everybody who's going to

want to come in and do something with the dialogue.
And so that's what we're moving toward, these kind

of team collaboration tools. Yeah, awesome. In
layman's terms, or maybe you can get as technical

as you want with this, but how does it actually work
under the hood? Like when you're typing in a

prompt, I want to sound like someone with an
Australian accent, happy-go-lucky. How does that

work? Like are you calling on some kind of LLM to do
that? Do you make your own? Like how does it work to

get on the hood to sound like that? Everything we do
is all the kind of core models that we're working on

and we're developing in-house. And this
particular aspect is something that we have not

made public yet. So this is kind of one of our secret
sauce things that I can't go into depth about. But I

can say at a high level, yeah, there's kind of the
newest LLMs and diffusion stuff going on under the

hood. So right now, what do you use to tweak? Do you
use those kind of prompts, like those word cloud

prompts? But you're working on the customized
prompts? Am I getting that right? For an

individual line of dialogue? Yeah. Yeah. So right
now, for an individual line of dialogue, we're

moving towards that. That's something that I want
to reform, change. Just like when you're working

with a voice actor, like read it a little more sad,
put more stress here and kind of natural language

direction. That's where we're moving towards.
But right now, it's kind of this drop down set of

kind of style emotions. So like angry, sad, happy,
very much a first step in that direction. But for

directing a certain line of dialogue, that's what
we have. But for creating the voices, it's more

free form right now. One interesting thing that
sort of comes from this is say if you create your own

voice and it's very unique, you have a prompt
that's very detailed, right? Do you sort of own

that? Who owns the rights to that? Maybe that's not
even sorted out yet, but that's something

interesting that's going to be coming down the
road in the future legally, I think. I totally

agree. I think that is a very hard question. If I
have created a voice that doesn't exist in the

world and I've created that using my creativity
plus somebody else's tool. So if I'm a user, I'm

using my creativity to craft this prompt, but then
I'm also using this base model from Coqui, and you put them both together to make the voice, right? I
don't know. I mean, I think that this is something

where there's been some legal precedent in
the image generation space. But I don't know

because voices, I think are a little bit of a
different thing because it's not a static thing. Like a

voice doesn't exist outside of speech. It has to,
the voice is a generator in itself. So it's like

when you use Coqui to create a voice, you're using
this voice generator to create a speech

generator, right? I think at least for voices in
particular, as far as I know, this is still

copyright IP kind of gray area. But I think it's
going to be solidifying very soon. At least I think

it's going to be solidifying very soon when it
comes to voice clones. I think that that's

something that people are being too loosey goosey
with right now. But when it comes to creating a new

voice, because there's so few technologies that
can do this, and as far as I know, we're the only ones that can do it at like a production level. I don't
know. It's a really interesting landscape, I

agree. It's going to be fascinating. We've spoken
with a bunch of different projects and music is one

that comes up a lot. So what I see happening is tools
like yours being used to create songs, right? So

who owns the song? If some virtual AI star appears
out of nowhere, they're just fabricated, right?

It's not a person, it's just a made up avatar. And
they're singing a song based on music that's

created by a tool like yours. Who owns the music?
Like we don't know. I think it should probably be

the person who created it, but it's going to be a very
interesting landscape. If you have some

superstar made up Taylor Swift that doesn't
really exist in real life, like who has the rights

to that? It's going to be a wild world to see how that
works out. I agree. With that being said, where do

you see your project going in the future? Maybe
five years, 10 years? What does that long term

vision look like? I think about the landscape, how
the landscape is going to change and what our place

is going to be in that landscape, right?
A bit of contextualization here. When I think about the

landscape, how it evolves, I look towards the text
generation landscape and also the image

generation landscape because just kind of
traditionally in machine learning research,

text and image are always the most advanced.
Usually it's text first and then those algorithms

and breakthroughs, they spill over into image and
then from image they spill over into voice and

audio kind of generally speaking. I think
something that's really interesting happened

also just in the last few months when it comes to, I
mean, like Meta's accidental release of the LLaMA weights and kind of these language models
becoming commoditized. It's very predictable.

The trajectory was there that all of these models
are going to become commodities over time. It's

just a matter of how much time. What you see
happening, especially right now with the image

space, because there's less open source, I think, for the image space, and there's more of these kind of clearly defined
commercially viable contenders. So you've got

Midjourney, you've got Stability, you've got Adobe, I mean, you've got OpenAI, but then you've got Microsoft and Google, right? You've got all
these players in the space. I think that

verticalizing is really important. You're going
to see more vertical specific generative models

just because, like, let's say Stable Diffusion works really well for, I don't know, 3D art. It doesn't mean it's going to work well for me in my use case because I'm a photographer, for example. And gathering the right data to train the
models is a really intense process and it takes a

lot of resources and then training the model
itself is a lot of resources and time. So I think

we're going to see more kind of verticalization of
the models themselves and also companies working

on models themselves. And what that means for us in
our space is, we've already gone

after video games and the needs of video game
developers because the kind of speech that they

need and the kind of control that they need is very
different from if you're working with

audiobooks, like I was talking about, if you're a
library, it's just different. So right now I see us

building more and more precise tools for people
who are working on dialogue, whether that's in

video games or movies or just kind of creative work, or
tools that allow the creator to really tweak and

create as opposed to chug through lots and lots of
content. That's what I see us kind of working on and

Coqui Studio, I think, is something we spend a lot of
time on to make it really smooth and hopefully user

friendly. And that's something that we're going
to be pushing towards. But also, dialogue in

particular is its own kind of special beast. It's
not narration. It's not audiobooks. Just working

on dialogue and getting that really good is going
to be a lot of effort from us. So like right now in

Coqui, can you put in like a script with multiple
different people and then like say, I want this guy

to speak like this, this person to speak like that.
Or is it separate like just one at a time right now?

That's so funny because you're literally like, I
think it's next week. I'm hoping it's next week.

We're rolling out exactly this. Awesome. You set
me up for that one. I didn't hack into your code. I

didn't know. It's been one of our most requested
workflow features: here, I have a script, I want to

drag and drop and then have my project filled out.
And that's because we have the prompt to voice,

that's something that we can do from a pure textual
script where you describe the voice of each

character. We can go straight from a script to
like a table read basically. First pass in

seconds. And I'm super psyched to have that there.
And that's an example of like something that

people working in dialogue really need. But
somebody working like say the BBC, the BBC wants to

have all their articles read by a synthetic voice.
That's really good narration style,

informative. That's a different thing. It's a different need. But, I mean, it sounds like what you're working on for the dialogue is on a higher level, right? It's almost like you guys can do both, but you just want to focus on a niche that's sort of not met, in terms of... Yes. I'd say partially, because there's
different kinds of problems. I'll give you one

example, a very concrete example. Yeah, I'm
probably really oversimplifying it. With

dialogue, if you take a standard speech
synthesis tool and let's say you've got dialogue

between two people, you can make really great like
monologues for each of them. But then if you put

them together, the exact same audio just sounds
weird. It sounds off because the prosody of two

people talking, when you say something, how I reply and how I do my intonation, everything is

informed by what you said. And so that's a
particular problem that we're working on that you

don't have in narration. But also in narration,
when you're narrating over long stretches of

text, that's also not easy because you've got
information from previous parts of the text that

also inform the intonation and prosody of later
parts of the text. And the model should be able to

account for that. And that's what a lot of our competition, like let's say ElevenLabs, is spending time working on: this kind of long-form audio
generation, which is a different beast than

dialogue. Interesting. Yeah, another thing: we interviewed RealChar a while ago and they had

some, they do cool things like you can call up Elon
Musk on the phone and speak with them live. It's

just an AI generated version of it. People when
they're speaking with each other interrupt each

other. And AI is not really good at that. So how does
it work if two AI characters are talking to each other and one interrupts the other one? Like how does that tone work in their voice? How do they talk and speak back? It's like those little subtleties that are the hard thing to work out. And his product was, I mean, it's a brand new company, but he gave us a demo
and it was like, wow, that's really cool. Like,

because he let us see how it worked between two
people, two different AI characters talking with

each other and they're interrupting each other.
And it was cool to see how it worked, you know,

because it seemed like that's very hard to do. That
is very hard to do. I mean, there's somebody in our community who's working on something kind of similar. It's like AI-based conversational

interactions. And one issue that they were
talking about is the AI doesn't know how to just

respond a little bit or maybe not respond at all.
Like when they hear the pause in the speech, it

might just go off on a long reply when a human would
just say, you know... Human conversation is a

very hard thing. Yeah. Well, I guess the good news
is in your case, programmers can kind of put that

stuff in the script, at least when they're
entering it in for like dialogue, once you release

that a little bit, like they can add their own
pauses in the script or however they want to

do that. So there's an older way of doing that, it's kind of, it's called SSML, speech synthesis markup language. Ours is all natural language input. So you can add pauses, but the pauses look like, you know, extra periods, like you put in, you know, three periods.
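As a rough illustration of that contrast (not an exact spec of either system), the older SSML style marks pauses with explicit tags, while the natural-language style described here just leans on punctuation:

# Illustrative only; the exact pause handling in Coqui Studio is paraphrased from the conversation.
ssml_style = (
    '<speak>'
    'Wait for it <break time="700ms"/> and then the reveal.'
    '</speak>'
)

# Natural-language input: the pause is just extra punctuation in plain text.
natural_style = "Wait for it... and then the reveal."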

Yeah. Yeah. But yes, in principle, that's what we're... well, that level of control, this is an interesting thing: there's some people
who really want that level of control. But then

there's other people who want it to be like when
you're working with a voice actor, you don't

necessarily say which words need more pause or
which ones need more intonation. You just say like

you give the actor more higher-level direction, like read it with more anger and angst, right? So we actually have, in Coqui Studio, we give the user the

option here to use a model that lets you really do
fine tuning. It's called our V1 model. And then we

have another model which predicts how to give a
performance based on the text. So it knows like if I

have a text that says my cat died yesterday, it's a
sad thing. It's not going to give you a happy go

lucky interpretation of it, you know, right? It
knows that that's not what a human would say. So

it sounds like it's eventually going to get smart enough that it'll do all this stuff itself. But right now you can do it both ways in Coqui. I think it has to do with user preference. I

mean, there's some people who want to have that
really down in the weeds, like fine tuning. And

then there's some people who don't, and they're like different creative processes. So, yeah, we expose tools for manipulation
that are down to like the phoneme level, which is

smaller than a syllable. There's
people who want that level of control and there's

people who want a higher level of control, and I think it has less to do with the

capabilities of the technology and more to do with
how people like to manage the creative process.

Like a director with the voiceover actors, like humans, sort of. Yeah. Resembling that in a way. Yeah, I mean, different
directors do different things. Creative

directors work differently. And, for example, we have tools that let you manipulate at the utterance, the word, the syllable, and then down to the phoneme, which is the smallest unit of speech.

And I know people who are using our product to go
down to the phoneme and adjust at that level. Like

they want that level of control and they love it for
that. All right. I probably wouldn't have... But what's your team look like right
now? Like how many people are working on the

project and what's maybe your role? Can you tell us
a little bit about what you do? Yeah. So we got 11

people right now. There's, I'd say most are doing
kind of the machine learning model development

side of things. We've got some people working on
Coqui Studio, the web application and the design

of that. And my role is mostly sales and
partnerships kind of, how do you say? Outward

facing things. A little bit of marketing too. At a
startup, you kind of do a lot of random stuff. It

seems like. But right now, mostly what I spend my time on is sales, enterprise sales. And what does the pricing

model look like right now for Coqui? Yeah. So
there's a split. There's two ways. Either you can

go to coqui.ai/pricing and you see the kind of
standard pricing, which is usage based. It

basically translates into dollars per hours of
usage or hours of speech that's been generated.

This is something that people who use us, if
they're coming from the Stable Diffusion and Midjourney world, this makes total sense to
them. But a lot of other folks, it's a new kind of

model. Like why would I pay for something that I've
generated and I don't like it? I'm not going to use

it. I'm going to generate something again. But we
pay for GPUs. So we have to charge for that. Well,

it's like fractions of a penny too, I'm sure. Yeah.
But then for enterprise customers, we've just

started doing kind of on-premise deployment,
which is I think really an interesting option for a

lot of folks, which means you don't have to use our
API, it's much faster, lower latency. And you don't have to worry about data going places you don't want it to go. And that's pricing that happens in the conversation. We like to ask everyone
this. I mean, what are your thoughts on AI in the near

future? Like where do you see things going? I mean,
that's such a broad question. But what's

something that you see coming out of this in the
next few years? I see more and more the importance

of open source in all of this. I mean, all of the...
There's this kind of academic tradition that started

with all this machine learning, deep learning,
and there's this kind of desire to share ideas. And

that's why people publish papers. And that's why
people get a PhD like Hunter here. And that kind of

idea sharing is critical for deep learning and
machine learning, AI advancement. Right? So I see

more and more of these models becoming these kind
of community projects that become accessible to

everybody. And in terms of AGI, which is the one
that everybody wants to talk about, I don't see it

coming anytime soon. We have really awesome
language models that can say really cool, fun

stuff. And they can be also very useful. Don't get
me wrong. Like I mean, I... ChatGPT and the open

source equivalents can be super useful. Yeah, I
don't think... I don't think AGI is around the

corner. But I do think there's a trend toward all these models getting smaller,

faster, and the ability to run them locally, and
the ability to use them either open source or open

access. And this is something I think is really
cool. We're going to start seeing a lot more AI

enabled software and technology out in the world.
Because if you think about like... Just for a voice

application, like a conversational voice
application, there's at least three things you

need. You need speech recognition. You've got
some large language model, some brain. That's

understanding what's being said and then
planning what's going to be said. And then you've

got speech synthesis. Two of those models have already been released and made really, really efficient: LLaMA for the large language model and Whisper for speech recognition. And XTTS, which is

what we released recently for speech synthesis,
is kind of the third piece of the puzzle there. And

there's already people working on making it much,
much more efficient. And so you've got these three

models that you can... I think not in the distant
future put on a Raspberry Pi and have local

conversations. And it's going to be really cool.
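To make that concrete, here is a rough sketch of that three-piece local pipeline, wired together with openai-whisper, llama-cpp-python, and the coqui-ai/TTS package; the package choices, model names, file paths, and prompt format are assumptions for illustration rather than the exact setup being described.

# A rough sketch, not a tested reference implementation.
import whisper                   # speech to text
from llama_cpp import Llama      # local large language model (a LLaMA/Mistral-style GGUF file)
from TTS.api import TTS          # Coqui speech synthesis (XTTS)

asr = whisper.load_model("base")
llm = Llama(model_path="models/mistral-7b-instruct.gguf")   # placeholder path
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")  # illustrative model name

def voice_turn(input_wav, output_wav, speaker_wav):
    # 1. Speech recognition: what did the user say?
    heard = asr.transcribe(input_wav)["text"].strip()
    # 2. The "brain": plan a reply with the local LLM (prompt format is illustrative).
    completion = llm("User: " + heard + "\nAssistant:", max_tokens=128, stop=["User:"])
    reply = completion["choices"][0]["text"].strip()
    # 3. Speech synthesis: speak the reply in the reference voice.
    tts.tts_to_file(text=reply, speaker_wav=speaker_wav, language="en",
                    file_path=output_wav)
    return reply

# Example: one conversational turn from a recorded question.
# print(voice_turn("question.wav", "answer.wav", "reference_voice.wav"))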
Yeah, that's awesome. I just... I'm sure you're

familiar with it, but that Mistral,
M-I-S-T-R-A-L, the LLM, did you see that one, that open source one? That's pretty wild. Oh yeah. But all these new LLMs coming out, like you said.

I'll say really quick, so we collaborate
pretty closely with the folks at Hugging Face. And

one of the team members at Hugging Face put
together a cool demo where you're talking to

Mistral. I think it's called Talk with Mistral.
It's on Hugging Face. It uses Whisper, Mistral and

Coqui. And it's pretty fun. That's awesome. I'll
have to check it out. I checked out their Hugging

Face spaces for the chatbot, Mistral. And it's
pretty wild because it's not censored. Yes,

it's fresh. It's kind of a... it's fun to play with
because it doesn't hold back. Is there anything

you want to promote, other than your website?
Now's the time to do it if you have anything you want

to share with the audience. I'd say... so we've got a
great community on Discord. Look for Coqui on Discord. Check out the GitHub repo, it's github.com/coqui-ai/TTS. Check out the model that

we recently released on Hugging Face. It's XTTS.
If you just go to the Hugging Face spaces and type in

XTTS, you'll find it. And yeah, the website is coqui.ai if you don't want to do code and all that

stuff. But the Discord community is really fun. We
got a lot of different people all the way from

random indie hackers all the way to the PhDs. And
we've got everybody in between. It's a fun crowd.

Things are moving fast. People are doing lots
of different experiments. It surprises me. Yeah,

thanks for having me on and I appreciate you taking
the time and looking forward to seeing how this

comes out. Same here. So head out to coqui.ai, that's c-o-q-u-i.ai. And we'll also put the link in the description as well. And then follow Ryan's and my newsletter, Fry-AI. The french fries are just for fun because Ryan's daughter loves french fries. So fry-ai.com. Go subscribe, it's completely free

to subscribe. We have news coming out every
single weekday and then deep dives into very cool

projects and developers, like this one with Coqui, coming out on Sunday mornings as well.