Bible Translation Innovation Podcast

Speech-to-speech translation is in early experimentation, with initial indications of how it could shape Scripture accessibility for oral-first communities. In this episode of the Bible Translation Innovation Podcast, Joel Mathews and Chris “Klappy” Klapp, alongside Communications Strategist Isabella Scarinzi, explore the current state of this emerging approach and its potential role in oral Bible translation. They explain how speech-to-speech systems work, why most approaches rely on multi-step processes, and where these systems fall short—especially in low-resource languages. The conversation surfaces practical challenges while also pointing to ongoing research and emerging approaches being tested across the ecosystem.

Have questions? Send us an email at lab@eten.bible and we'll answer them on the show!

Subscribe to hear more conversations, updates, and experiments on new methodologies and technologies advancing Scripture accessibility worldwide.

What is Bible Translation Innovation Podcast?

A podcast from the ETEN Innovation Lab exploring acceleration in Bible translation. Tune in for experimentation, updates, and conversations about new methods and technologies advancing Scripture accessibility.

Isabella Scarinzi 0:07
Welcome back to the Bible Translation Innovation Podcast, a show brought to you by the ETEN Innovation Lab. Today we're joined by Joel and Klappy, as usual, and we're going to explore a topic that may sound a bit futuristic, but we're already seeing the first experiments taking place in the BT movement: speech-to-speech translation. This means speaking in one language and then having it instantly translated into another as audio. You've heard us talk about oral Bible translation before, because many communities around the world that still lack access to Scripture are primarily oral cultures. So this opens up challenges, but also some very exciting opportunities for acceleration. In this episode, we're going to take a closer look at where speech-to-speech translation technology currently stands and what it could mean for the future of Bible translation. So, Joel and Klappy, my first question: Why do we want to spend time experimenting with speech-to-speech translation? Are there any contexts where that technology is already in use?

Joel Matthews 1:19
Thank you, Isabella, and it's good to be back for another episode. The topic of speech-to-speech translation is indeed intriguing, and the reality is that a lot of the world is already using this technology for the bigger languages. But for the Bible translation use case, it becomes all the more important for oral Bible translation. As we continue to work in newer languages for Bible translation, more and more languages are oral only and have never been written, and there are multiple ways to do Bible translation for such a community. You can imagine developing an orthography for such a language, teaching it to the community, and helping them read their own language, which they have probably never done before. That is still a viable approach. However, it does take quite a bit of effort and time, as you can imagine. On the other hand, there are communities that don't ever want to have to read their language, and only want to listen to it and speak it the way they've been practicing it every day in their homes. They also prefer just listening to the Bible and never having to read it. So oral Bible translation, in the sense of supporting an oral-only modality, is where speech-to-speech translation comes in. What we mean by speech to speech is this: you may already be familiar with machine translation for text, where you go from a source language in text to a target language in the same modality of text. Say you're going from English to German, which in this case are both very high-resource languages, but for the sake of the example, you can imagine how you go from an English text to a German text translation through a machine. There are many services and models out there that do this for you with pretty high accuracy today. If you want to go from English audio to German audio, there are solutions for that today, but 
they are fewer, and the challenges are different. So we need to delve a little deeper into why this becomes more challenging when we are trying to operate only at the speech level. Ultimately, our goal is to support the local use case of the believers who are using their language in their homes and not reading it. For them, to really receive the Bible in their heart language is to listen to it and not necessarily have to read it.

Klappy 4:48
Yeah, when I hear of speech-to-speech translation, especially in the Bible translation context, I first kind of recoiled, like, oh man, we're nowhere near being able to have useful drafting tools for this. A lot of that, I realized, was rooted in my historical context of trying to develop this technology over a decade ago, and we've come so far. I'll talk about that more later. But the primary use case is oral Bible translation: you have an oral culture, and they're speaking a language of wider communication and their target language. Just like Joel was mentioning, with what we're familiar with in text drafting of Scripture, we call that draft zero, this initial draft that is known to be machine translation, and nobody expects that to be published anywhere, right? It's just a draft. So we want to take that mental model to oral Bible translation and ask how we do this from audio to audio, speech to speech. If we listen to anything on the radio or a podcast, what we hear is different from what we read, right? 
So speech to speech is way more than just taking text in spoken form, and there are other things we need to consider. When we do oral Bible translation, most of the time they're not just reading text. It's not somebody reading the verses, like when we listen to our audio Bible. We're a very literate culture, so we're used to reading, and it's not a big deal for us to listen to our written word. But for an oral culture, that style of writing is just foreign to them; it just doesn't sound like their language. So orality and speech to speech really encompass a different style, different tones, a different, I hate to say writing style, but if you read a speech versus an essay, they read completely differently. So think about that mental model. In the oral Bible translation context for drafting, we need to think about orality as a form as much as we are looking at accurately representing the words from one language to another. There are speech patterns and other things that will not be encoded well in the text, or that may need an additional adaptive layer. Now, the other major use case is exciting to me, and I use it every day, so it was kind of funny that I overlooked it when we were first talking about recording this episode, and that is really just in major languages. A lot of these groups will be working together, whether in preparing the team, training the team, or doing an evaluation, and sometimes the major languages that they speak might not be exactly in common. So we might be able to use this speech-to-speech technology in major trade languages to help them translate between each other, right? 
So even if AI translation for speech to speech doesn't quite work well for the target language just yet, maybe we can use it to help bridge the gateway, or, sorry, the trade languages. As I speak now, I go to a church that is primarily Hispanic, and so all the sermons... well, we have an English service now, but sometimes when we're running late, we go to the Spanish service, and I have to listen with my AirPods using the translation mode. So I'm well familiar with the state-of-the-art, Apple-polished tool, right? This is an app that Apple has approved, and it's still not perfect. It is English to Spanish, I mean, this is the best use case, and it works really well, but there are limitations there, and we can talk more about the limitations in a bit. But yeah, it's just really exciting to see, given that just a couple of years ago this was a pipe dream.

Isabella Scarinzi 9:52
Okay, so it sounds like speech to speech could solve a lot of our problems, especially when we're thinking about the All Access Goal languages: languages that are at risk or stalled or maybe not started yet. We would love it if we could just snap our fingers and tomorrow start using speech to speech for all of those languages, but I guess that's not a reality quite yet. So, Klappy, you mentioned that there are current limitations for speech-to-speech translation technology. What are those within our Bible translation context, and how are we actively trying to solve those limitations?

Klappy 10:30
Yeah, I mean, just to continue where I left off: when I'm using it at church, if there is any background noise, like as soon as somebody starts playing the piano or the keyboard or the guitar in the background of our pastor speaking, you could almost hang it up. It just doesn't work. With the best technology out there, with the best voice cancellation, I get this audible voice in my head saying, you know, "Please move your iPhone closer to the source of the speech." I'm like, "Oh no." And that happens pretty regularly. So there are those types of limitations of background noise, and we know from our field visits and working with our partners that background noise is a major issue. The good news is that the technology is getting better. I've seen a big difference in the past six months, from when Apple first released it to now. But the other limitations are actually more practical for our use case, and maybe, Joel, I could tee this up for you to dive into the technicalities of why this is a problem. If we think low-resource languages are a problem with text, audio is an even scarcer resource. We're just now getting into this, and working with some of our partners, we're actively trying to overcome it. We can talk about this later, but we are collecting as much audio data as we can to help with this problem, and there's not nearly enough. We need audio data. So the limitation is just a lack of data. If we have limitations in English and Spanish, just imagine the limitations we would have in a minor language. 
Now, the good news: I was working on this technology over 12 years ago. We had some phone-to-phone conversation in major languages back in my Soviet days, and you could have a phone conversation and actually talk. Joel will talk about how that technology works, but back then we didn't have the AI that we're graced with now, and it was so lacking in major languages. It's amazing to see this technology come to life today.

Joel Matthews 13:06
Yeah, Klappy, that's what the hope is, that we can also bring some of these smaller languages to the point where the bigger languages are now. It does seem like they're 10 or 12 years back, like you're saying, in terms of their data and polish. That may not be strictly true, given the new technology, but it does feel like that. If you delve a little deeper into the technical aspects of this, the common way of doing speech-to-speech translation is what we'd call a cascaded approach. You go from speech to a transcription of the speech in the same language, and then from the transcription, which is text, you go through machine translation to the target language, which is also in text. So now you have target-language text, and then you go from target-language text to speech. So it's got three parts: speech to text, text-to-text translation, and then text to speech. We usually fall back on the text modality for the actual translation. That's one of the big reasons why we want to go to text: there are more machine translation models in text, and that technology is a lot more mature, I should say. And, as you can imagine, errors in any one of these stages can compound in the subsequent stages, and it also means that we need three different sorts of capabilities to do this one task. You go from automatic speech recognition, which is speech to text in the source language. That may be easier to do, because we assume the source language is a language of wider communication in the context of Bible translation, so it's a bigger language, and those are a little easier to train, from what I understand. Then you have to have a model that is trained for source text to target text, which again requires you to have target text data, and then you go from target text to audio. All of this needs data, which goes back to your point, Klappy, and for a language that has never been written, this, as you can imagine, poses a whole new challenge. So what do we do for a language that does not have an orthography, to use such an approach? One way to circumvent this is to use a proxy script. Even though the language has never been written formally, what if we could use the script of the source language itself to render the target language? That works a little, and it's just an internal representation, so it's never shown to the user. Basically, the machine has to figure out how to read the target language in the source-language script. A funny example of that is actually my own mother tongue. I grew up speaking Malayalam, which is a language of Kerala in southern India. I grew up just speaking it. I never learned it at school, so I'm fluent in Malayalam but unable to actually read it properly. 
I read broken Malayalam. So for people like me, a lot of our songbooks in Malayalam have a Latin script for Malayalam, and you almost have to learn how to read this new script, because there are sounds in Malayalam that are just not represented in the Latin script. So, like, a "ra" sound is "zh", and you have to make that mental mapping in your own head: okay, that's what they mean. So it's not great, but this is the sort of stuff that happens internally when you're talking about a cascaded approach with a proxy script. You try to map it to something close enough and let the machine figure out what that means automatically. Now, what we are trying to do more recently, and this is a parallel effort that is much more in the research phase, is this: what if we could circumvent this whole cascaded approach and go directly from input audio to output audio, from source to target? That would be great, but it is actually an unsolved problem. There are good strides being made towards solving it. Meta's SeamlessM4T model, released not recently but a couple of years back, actually did do this. They created a model that can do all sorts of speech to text, text to speech, and speech to speech in one unified model, which does solve this problem to a large extent. 
However, it's very hard to replicate the results for a low-resource language, and the model, so far from what we've tried, is not as easy to use for our practical use cases on the field. There is a lot more work that needs to be done when the data is a lot scarcer. Klappy, you mentioned how we are collecting data on the field. The level at which we collect data is still very small compared to what some of these larger languages already have as their audio data sets. Having hundreds or thousands of hours of audio would be great, but we can probably manage a few hundred hours of audio at most from the data collection we would do, because we have a small team, a small community, and a limited amount of time and resources that we need to maximize. Those are the big limitations and challenges that we have there right now.
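
As a rough illustration of the cascaded approach Joel describes, here is a minimal Python sketch. It is not any team's actual system: the three stage functions are hypothetical stand-ins that only wrap their input in a label, and the per-stage accuracy figures are invented for the example. The end-to-end estimate also assumes the stages fail independently of one another, which is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[str], str]
    accuracy: float  # assumed chance that this stage's output is usable


def cascade(stages: List[Stage], source_audio: str) -> str:
    """Feed the output of each stage into the next one."""
    data = source_audio
    for stage in stages:
        data = stage.run(data)
    return data


def end_to_end_accuracy(stages: List[Stage]) -> float:
    """Multiply stage accuracies (assumes errors are independent)."""
    result = 1.0
    for stage in stages:
        result *= stage.accuracy
    return result


# Hypothetical stand-ins for real models; each just labels its input.
def speech_to_text(audio: str) -> str:
    return f"transcript({audio})"


def translate_text(text: str) -> str:
    return f"translation({text})"


def text_to_speech(text: str) -> str:
    return f"audio({text})"


# Invented accuracy numbers, used only to show how errors compound.
PIPELINE = [
    Stage("speech to text", speech_to_text, 0.90),
    Stage("text-to-text translation", translate_text, 0.90),
    Stage("text to speech", text_to_speech, 0.90),
]

if __name__ == "__main__":
    print(cascade(PIPELINE, "source.wav"))
    print(round(end_to_end_accuracy(PIPELINE), 3))
```

With three stages that are each assumed to be 90% reliable, the combined figure drops to roughly 73%, which is one way to picture why errors in a cascade compound.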

Klappy 19:50
Yeah, as you were explaining that, especially your personal experience with Malayalam, I'll go back to my own personal experience. In my own church, we have our home groups, and once again, I'm one of the few people that only speak English. There's usually somebody who's bilingual and helps translate, but if we're spending three hours together in a tight community of people, people are just talking and talking, and so I'm using my AirPods again in translation mode. And here, in a best-case scenario, we have all the data, right? English and Spanish: there's probably more data than for any other language pair. I'm sure there may be another language pair that has more, but my understanding is that it's the pinnacle. So we have all this data, and it is hilarious how many things just mistranslate because of the context. So another missing piece is that it's not always going to know the context that you're in at that moment. Most translation systems translate one sentence at a time, maybe a paragraph at a time, but they don't keep your conversation history going, so it kind of meanders and feels like thrashing: why'd they say that? There are these little moments where you have to retranslate the translation in your head: okay, I think what they probably said was this, in the flow of the conversation. So I'm doing my own mental mapping for those translations, and once again, that's in the best-case scenario. 
So, as we look at our target context, the Bible translation context, the good news is that we're actually curating all this data we're collecting in a biblical context. The default context that we're going to be building for will be biblical content, so it'll actually be disproportionately weighted towards our use case. So in some ways, for the big problem we have in English, I'm praying, at least from my experience in building machine translation systems over the decades, that if you build one for that use case, you're actually going to be in the domain pretty well. The context will be fairly locked in, so that's a good thing in our case. But I think the other concern that we need to be aware of is speed. In these contexts, once again in the best-case scenario, I'm three to 10 seconds behind the rest of the room. With the fastest state-of-the-art technology in my ears, I'm 10 seconds behind in conversation most of the time. Those speed issues can compound as we look at technology that we would be hand-rolling, or working with and having to set up for our partners. So if we're looking at real-time communication between people, in my ideal scenario of helping these gateway, or, sorry, these languages of wider communication collaborate together, we're going to have to consider that lag, and, as Joel mentioned, the cascaded approach just compounds things. I'm anxiously waiting for the next paper that drops that says we've solved the large language model problem of having voice as a primary input and a primary output. We're getting closer to that. We're seeing glimpses of that. I'm able to have conversational models. We can get into some solutions later, maybe, but I'm using ElevenLabs for conversational agents, and it's exciting to see what is happening, and it feels like it's voice as the primary input and output. 
So ElevenLabs has got a good use case of having a conversation in these major languages, but we're not quite at the point where we have open-source libraries with scripts that we can run on buckets of the oral-only audio files that we've collected. We're not quite there yet. At the rate things are going, I'm hopeful. I was hopeful that by the end of last year we'd have that technology already. I had Ed Weaver from Spoken, I think it was three years ago, call me up and say, "Klappy, I don't know if I'm crazy, but if I just started recording a whole bunch of audio, could we do something with that one day?" He's like, "Can we actually make an LLM speak, you know, in and out?" And this was before anybody really had the papers on this being possible. And I was like, "Ed, you are crazy, but in a good way. Let's go ahead and start recording now. One day we'll be able to do something with it, and you'll be ready, and hopefully this data will be useful." So I think that's something we need to be thinking about: where is the future heading in this area, and how do we best prepare for it?

Joel Matthews 25:17
Yeah, that's actually a great point, Klappy, to encourage data collection of audio generally. One of the limitations of data that has been collected in the past was that it was kind of unnatural audio. It was in settings where people are given a script, it's kind of formalized, and they're asked to read something out or say something out loud. It was almost like a spotlight was on them, and they say it differently than they would talk to their spouse or their child or their friend. Having conversational audio has a lot of value as a data set, but collecting it can also be challenging. One of the challenges of audio data collection is that it ultimately encodes your voice in the raw form, and that means it's considered biometric information. So if you are recording such content with the intention of sharing, you have to get explicit rights to share it widely from the people who are being recorded, and there are also laws that you need to follow per country. It can get a bit complicated if you want to do this at a larger scale, but generally, having something written in terms of people letting their voice be used for training, and then collecting such conversational data, is high value.

Klappy 27:04
Well, yeah, just think about that. Yes, there's that personal privacy issue and safety issue and all that, but bridging that into collecting the audio, think about how difficult it is for us to set up our environment for this podcast, right? To get the room quiet, to make sure there's no other noise, to make sure there are no other voices on our microphones. It's so hard for us here, and we're expecting people on the other side of the planet just to be shipped an audio recorder and go record conversational audio. You know, that's not clean audio, right? And that's a big concern that everybody's been raising: well, it's not going to be clean, so don't even bother until you can figure out how to make it right. But there will never be ideal conditions, right? So, you know, I told Ed, just start now. Let's learn as we go along.

Joel Matthews 28:04
That's good.

Isabella Scarinzi 28:07
So, we've talked a lot about the concept of speech to speech: why we want to explore this, some limitations, and what we're doing to try to solve those. Now, my question is: how do we do this effectively? If a Bible translation team is considering experimenting with speech-to-speech tools today, how would they get started? What are some of the projects currently out there that they could participate in?

Joel Matthews 28:38
Yeah, so we know from the lab that there are a few partners we're working actively with to experiment with some of the speech-to-speech techniques. Faith Comes By Hearing and SIL are partnering with the lab and running an experiment along these lines. It's still early, and I'm sure we'll share more updates from their work as we have them. Similarly, we're working with Notre Dame University, or University of Notre Dame, and another large research institution in the United States, with their research labs, to focus more on direct speech to speech, not the cascaded approach, for translation. That is still very early in the academic research phase: we have PhD students who will be looking into this, and professors will be writing papers on it. But we are really focusing on making all of that practically applicable, so it's applied research. As soon as we have even a small breakthrough, how can we directly apply it? We're also providing these research labs with data from the field, so that they are experimenting on things that are actually of value for us. So we're really excited about having such partners to work with. Klappy, you've worked with Marcia. Maybe you can share about her work.

Klappy 30:24
Yeah, you know, I love it when there's either an underdog story or somebody who's out of domain coming in and trying something new, right? If you know me or work with me, you know I love to build proofs of concept and do things in unorthodox ways, and that reminds me of Marcia. Marcia is not an AI technologist, she's a linguist. Her field of study and her entire career have been in linguistics, and she's been involved directly on the field, doing Bible translation in oral cultures for a long time, so this is her world. When she started researching and better understanding how AI works, she saw that when ChatGPT launched their GPTs, you could basically curate your own little sandbox of ChatGPT with a specific system prompt and guide it for narrow use cases. It almost feels like you're launching your own chat app, right? So she started doing that with no experience in AI, and it started shocking everybody: somebody with no AI experience is launching useful AI tools. So that's Marcia, right? 
So she started looking at how people are doing this speech to speech. Going back to what I was talking about earlier, think of the losses that you have along the way. Going from in-person body language and speech to speech alone, you lose the body language and that communication. Going from speech to text, you're missing even more encoded information and understanding, and then that comes back out in the other language. All those layers that we were talking about in the cascaded approach have some loss. So she had this wild idea involving phonemes. Not to get too technical, and Joel, you could probably give a better definition of phonemes than I would, but if you think about phonics, when you learned how to read, those phonemes were the basic building blocks: how do those sounds make those letter and word combinations together? It's overly simplified, so forgive me. It should be a good mental model, but it may not be the best technical definition. But we know, in the example you gave earlier, Joel, of romanized Malayalam, you have a lot of loss because it can't encode your Malayalam sounds in a romanized form. It just can't do it. So Marcia had this idea: well, if we're dealing with oral cultures and they have no interest, at least for now, in seeing a visual representation, then humans don't need to be able to read this, right? 
So what if we found a way to encode more information than just the phonemes, the phonics, the sounds that we normally encode, and we focused on something called acoustemes, like the acoustic sounds? I'm sorry, Marcia, if I butcher this, but there's more information encoded than just the phonemes. Acoustemes are a bit broader: they have tonality, they have some more speech patterns encoded into them, and the representation is only machine readable, and it only needs to be machine readable. We already know you can train LLMs and other models on arbitrary tokenized elements, right? So you're teaching it from tokens the same way you would with any other, quote unquote, real text language, but it's using a different representation that only the machine needs to read; it doesn't need to be human readable. She's using that as a proxy for the oral audio speech tokens. So it's a very similar cascaded approach, but it encodes a different level of information. To me, it's quite exciting to see somebody who's not a technologist come in with a great idea, use AI to help build stuff, find a few partners who can help plug the gaps in their understanding of the technology, and it's just really cool to see how far she's come. So we have some partners helping us validate her ideas. She's got some internal tests that have proven promising, and we're helping her validate those externally to see if they show value for some of our other projects.
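
To make the idea of machine-readable sound tokens a little more concrete, here is a small Python sketch of one generic technique: mapping each audio feature frame to its nearest entry in a codebook and emitting the entry's index as a token. This is not Marcia's method, which the conversation does not describe in detail. The codebook, the two-number "frames", and the `<u0>` token spelling are all made up for illustration, and whether real acoustemes are derived this way is an assumption.

```python
from typing import List, Sequence

Frame = Sequence[float]


def nearest_unit(frame: Frame, codebook: List[Frame]) -> int:
    """Return the index of the codebook entry closest to the frame."""
    def squared_distance(a: Frame, b: Frame) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)),
               key=lambda i: squared_distance(frame, codebook[i]))


def frames_to_tokens(frames: List[Frame], codebook: List[Frame]) -> List[int]:
    """Quantize each frame, collapsing runs of the same unit."""
    tokens: List[int] = []
    for frame in frames:
        unit = nearest_unit(frame, codebook)
        if not tokens or tokens[-1] != unit:
            tokens.append(unit)
    return tokens


def tokens_to_text(tokens: List[int]) -> str:
    """Spell unit ids as strings a text model could be trained on."""
    return " ".join(f"<u{t}>" for t in tokens)


# Toy codebook with three "sound units" (hypothetical values).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Toy feature frames standing in for a short stretch of audio.
FRAMES = [(0.1, 0.0), (0.0, 0.1), (0.9, 0.1), (1.1, -0.1), (0.1, 0.8)]

if __name__ == "__main__":
    print(tokens_to_text(frames_to_tokens(FRAMES, CODEBOOK)))
```

Collapsing repeated units keeps the token sequence short but throws away duration, so a representation meant to carry tonality and speech patterns would likely need richer features than this toy example uses.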

Joel Matthews 35:10
Yeah, it's really cool how YWAM has been getting so involved with oral Bible translation and the work that they're doing.

Isabella Scarinzi 35:20
So, if someone wants to participate with us in these experiments, how would we connect them? How can they get involved?

Joel Matthews 35:31
Yeah, that's a great question. As of now, we would love to hear from people who are interested in piloting some of these experimental techniques in oral Bible translation communities, or just oral communities who would be a good fit for such a test. We would love for them to reach out, and we can maybe share a few rough examples of models or tools for them to try and give us feedback. The other, of course, is what Klappy mentioned: data collection. It'll be really, really valuable to collect any sort of data that is classified and in some way organized as a data set for minority languages that do not have a Bible yet.

Klappy 36:37
Yeah, having more people help us collect data is exciting, right? I know our audience is somewhat specific and narrow in our domain, but from what I understand, there are people more on the field side, there are technologists, and then there are leaders of organizations that listen to our podcast. So I'm excited to just invite people to help us dream. Given where we've come from in just a couple of years, we need to be dreaming together about what the future holds and preparing for where the technology is headed, not just what's possible today. If we were only building on what's possible today, we wouldn't have the data that we're playing with now. Like Joel mentioned, in some of the partnerships you're already working with, you have data because Ed was dreaming a couple of years ago, right? And Marcia, with what she's doing, she was dreaming, and so we're seeing these advancements happen. So, number one, dream with us. Secondly, collect data. A third thing, and I feel like we say this every episode: be willing to experiment with us, right? If you're actually on the field, it's not just helping us collect data; as these technologies are brought to the table, we need early adopters to kick the tires to see how useful they are. Even though we may not have something field-ready just yet, or, sorry, we're working on some things that are almost there, as we start doing more of these field tests for the different parts, we'd love to have more people helping us try it out, test it out, and give us the feedback we need. And as our peers, fellow technologists, are listening to us, I think it's exciting to know what the different parts are that people are playing, right? 
Some people are specializing in ASR, or the text-to-text translation, or the text to speech. We all play different technical parts, and I think it's exciting to see the combination of our roles as we work together.

Isabella Scarinzi 39:08
Well, thanks for another great episode, guys. Our focus in 2026 is on the All Access Goal languages at risk, and what we are talking about here are early experiments, like Klappy said, but they do make us excited as we focus on opportunities to meet these goals. So please reach out if you are interested in piloting a speech-to-speech project, or if you're interested in working on audio data collection for low-resource languages. Also, we're going to start a Bible translation innovation question time on this podcast. We announced this last time as well, but if you have specific questions you'd like us to address on the show, you can send them to lab@eten.bible, and we will answer your questions in a future episode. Don't forget to subscribe to this show, send us your questions, and we will see you next time.

Theophany Media 40:01
The Bible Translation Innovation Podcast is brought to you by the ETEN Innovation Lab. This episode is edited and produced by Jake Doberins with Theophany Media. Your hosts were Joel Matthew and Christopher Klapp, with facilitation by Isabella Scarinzi. Please subscribe on your favorite podcast platform, and we'll be with you again next month.

Transcribed by https://otter.ai