Using A.I. to Mask Foreign Accents on Voice Calls

Ganna Tymko, CEO, SaySo

Have you ever had difficulty understanding what someone was saying because of their accent?

Do you have an accent that others don’t understand, or hold biases against?

As a speaker of many foreign languages, I’ve always found it to be a very fascinating process. Ask any foreigner in your country about their relationship with the local language, and you’ll realise how the topic is packed with various emotions.

But as it turns out, AI can help us communicate perfectly with each other in spite of thick accents.

My guest Ganna Tymko is the CEO of SaySo, which uses AI (specifically deep learning) to transform heavily accented speech into standard English (i.e., an accent from Britain, America, etc.), in REAL TIME, as you speak!

Just imagine how it would transform cross-cultural and cross-border communications! We dive into the technology and business aspects:

00:23 – Why Aman is getting desensitized to accents

04:53 – Why we struggle with accents

13:59 – AI projects: research vs. design

23:13 – How SaySo’s tech works

33:23 – How our mouth muscles get trained to produce accents

41:48 – Taking accent transformation tech to the market

47:06 – When training ML models for humans, use human metrics

51:05 – Closing

(Ethics Policy: These opinions are 100% my own as an independent observer and educator. I don’t own stock in guests’ companies or their competitors, nor do I get paid by them in any form for any reason at the time of publishing, unless specifically stated. Episodes are also not intended to be an automatic endorsement of any company or its products and services.)

Raw Transcript:

[00:00:16] Aman: All right. All right. Hello, everyone. Welcome back to a whole new episode of the age of AI podcast.

[00:00:23] Aman: And today I’m gonna start with a little bit of, uh, self-flattery. You know, one thing that I’ve always been, um, proud of about hosting the show is that I’ve always tried to bring to you a very global outlook.

[00:00:40] Aman: Right? My friend, my guests are from many different countries, you know, Europe, Asia, and north America. You name it. And because I have guests from all over the world, I also get to hear so many different accents of English. And over time, I’ve become a little [00:01:00] desensitized to it. You know, I notice people’s accents, but in general I can pretty much focus mainly on the words they’re saying.

[00:01:06] Aman: And I don’t really, like, I basically tune out the accents because I’ve been traveling and I’ve been working with people like that for a long time. But I’ve recently been reminded of the fact that most people are actually not like that. This is not true for most people, including myself when I first started out. Uh, when I was in, I think when I was in school, if I ever heard an American talk or a British person talk, I would have no idea what they were saying.

[00:01:33] Aman: Even though I understood English, I could read English really well. And the same goes vice versa. You know, if you hear someone with a very thick accent, which is different from yours, it not only makes communication difficult, but it also evokes many biases and stereotypes.

[00:01:52] Aman: Right? Like you hear Americans go gaga over the British accent, like, oh my God, so sophisticated, sounds so fancy. You know, I love the [00:02:00] British accent. Uh, and of course everybody makes fun of the Indian accent, but we won’t, we won’t talk about that too much today. But, uh, this is a really big problem.

[00:02:11] Aman: Of course, as you already know, because you clicked on this episode, here comes machine learning, AI to the rescue. And, uh, with that, I would like to introduce to you my new guest, Ganna Tymko. Is that, is that the right pronunciation of your name?

[00:02:26] Ganna: Yeah, this is perfectly good. Uh, my name is Ganna Tymko. I’m a CEO and founder of, uh, SaySo.

[00:02:34] Aman: She’s the founder of SaySo, which is a really wonderful company. And SaySo is building a system which can plug into any voice input software, whether it’s like a video conferencing tool or something else. And it can in real time transform your accent. right. And so in this episode, although the, the pitch is pretty [00:03:00] simple, you get it, you understand that it’s a big problem, and you can understand, you know, what the product does just by just one simple sentence, right?

[00:03:07] Aman: Today you get a mini master class of sorts in AI product management, because that’s the thing that is truly unique, which is how do you, how do you actually go about building a product which has so many sensitive cultural layers to it. Right. Uh, which is, it’s not just like you’re solving a little technical problem.

[00:03:27] Aman: It’s actually solving a very human problem using the most technical means possible. And so once again, thank you so much, Ganna.

[00:03:35] Aman: My first question to you would be, which is your favorite accent in English.

[00:03:42] Ganna: This is such an interesting question, which is my favorite accent. I, I haven’t thought of this one. It’s it’s, uh, I think probably. Actually my favorite accent is Polish accent.

[00:03:58] Aman: Okay. Polish accent. All [00:04:00] right.

[00:04:00] Ganna: because I know Polish is my first language and I’m, I’m actually always very sentimental. When I hear somebody talking, uh, with Polish English accent, it reminds me of home and it reminds me very much about my early days of my career, where I started working, um, in international settings.

[00:04:20] Ganna: And. You know, like lots of people started over also quite young in their careers, not so proficient. And we were all like, go, you know, just scrambling to get by with our, yeah. It’s it brings nice memories.

[00:04:35] Aman: Nice. Nice. We did have a, I think the last guest that we had was a Polish person, so, you know, yeah. Check out the podcast if you want to listen to more Polish English accents.

[00:04:45] Ganna: Okay. I will. It’s a, it’s a, it’s a streak. You are a streak of like getting Polish, uh, presenters there, guests.

[00:04:53] Aman: Yeah, for sure that that was always the intention, you know? , okay. So let’s talk about the product management [00:05:00] journey. So when you first had this idea that, okay, this is a problem. Like people don’t, you know, like accents or they don’t get accents, let’s start there. What happened?

[00:05:10] Ganna: Hmm. So I, I wanna say a couple words about SaySo, and then I’m gonna tell you about my journey, how, uh, how SaySo came about and how the accent problem came about. So at, at SaySo, our goal is to make delightful speech experiences, uh, and, and we want to help people communicate clearly with each other and with technology, so they can communicate clearly and understand each other with ease. And then the first beachhead product that we are building, that we are just about to release, is real-time accent transformation. And I started this, um, actually quite, uh, interestingly, because I had a personal problem in my life. I was working for one of the very large technical companies here in Silicon Valley. And I [00:06:00] was, uh, engaging with international teams, uh, across the globe. Uh, a lot of my colleagues, uh, were from somewhere else, not from the US, uh, and you know, people had difficulty understanding me. And I had a really hard time understanding my colleagues. I, I have a particularly bad ear for Australian English and New Zealand English. Like I, I really struggle to make sense of it. And then, you know, I thought, hey, we all speak the same language. Those people are native speakers. I, I spent 20 years learning English.

[00:06:36] Ganna: But we cannot understand each other. Like we, we need to solve this problem. Uh, and I, and I went on like going to the design and going to product management before I started building anything. I went on to understand what is people’s situation with accent. So I asked, I started asking my colleagues, Hey, How do you experience accent?

[00:06:59] Ganna: Is it a problem [00:07:00] for you, how it impacts your professional personal situation? And if you were to think about a technical solution to that, what would this solution be? And I had those very open-ended conversations. I was not pointing people towards any direction. Um, what I heard from people is that they had a very difficult situation with accent. Accentism, unfortunately, is alive and well.

[00:07:23] Ganna: Uh, there are strong biases. Uh, some people told me they feel like they’re missing out on opportunities of promotion or, or, um, professional advancement. Some people told me that, uh, when they try to apply for a job, they, they talk to a headhunter, not even a hiring manager, and they get dismissed just because of the way they speak.

[00:07:43] Ganna: They don’t get a chance to talk to somebody. And I, and then, and from native speakers working on international teams, I heard a very, uh, also strong problem statement because they, they told me that, uh, some people told me that with international teams, they, they spend a lot of time [00:08:00] actually avoiding live conversations, uh, that they would use Slack messages and they would use like emails to communicate with each other so they can read and clearly understand, and then on team meetings,

[00:08:12] Ganna: so, uh, people would have to go through notes and they would try to capture meeting notes and they spend extra time reading through this. And some things would not be captured very well because speech recognition is not very good with accents. So they would still have big question marks. They would need to follow up, and people are really dreading it.

[00:08:30] Ganna: Uh, so I heard this like really big problem statement. And then for people who are non-native speakers, they were thinking of changing their own speech. They told me, they gave me a couple design requirements basically. So they told me that the solution, if there were a solution, technologically, it would have to be

[00:08:47] Ganna: something that is a real-time solution that they can use during calls. There is no lag or delay. Also, they said, like, if it’s a conferencing video call, they want to have their lips synced with their voice. So [00:09:00] they don’t want it to look strange. People told me that accent is quite an important part of their identity and personality, and they don’t want to sound all standardized.

[00:09:10] Ganna: Uh, they want to just be more clear and intelligible, so they want to preserve some of their identity. They wanted to have the ability to dial this up or dial it down because, um, you know, they wanted to have, uh, a say on how, how their speech has been changed. And they also told me they very much don’t like to sound like Siri or Alexa, so they, they didn’t want to sound robotic.

[00:09:33] Ganna: So they, they cared about human conversation. And also people told me that they want to have some ability to signal to their counterpart, to the people they’re talking with, that they’re using some, some technology to alter their speech. So when they meet face to face, there is no shock that their speech is very different on, on the call from

[00:09:56] Ganna: speech in a live situation. So I, I, I took all of [00:10:00] those things as, as design requirements and I actually ran with it and, uh, that was kind of the origin of, of design requirements for our product. And actually we’re meeting all of those requirements right now. Yeah.

[00:10:13] Aman: It’s, it’s very interesting. Uh, what you said, that at least two of those requirements are somewhat like a tug of war, right? Uh, one of them is that you do want to sound intelligible, uh, and, uh, you do want to, you do have a problem with accent, so you want to, you know, fix it or whatever, like change it or whatever you call it.

[00:10:34] Aman: Right. Um, and on the other hand, you not only want to retain some of your identity, right. Which is a way of saying that you don’t want to have a very thick foreign accent, uh, as well as you also, I think it’s kind of an extension of the same requirement, which is. You want people to feel that you are using a technique?

[00:10:59] Aman: So [00:11:00] it’s like, you want people to notice that you’re being a little bit robotic, but not too much. Just a little bit foreign, but not too much. So there’s a lot of like subjective tuning. On the part of people. So they it’s ultimately making, making it a very, instead of making it like a discrete solution, what they really want is that they want a lot of control over exactly how thick the accent is and how robotic or UN robotic, um, their voice sounds.

[00:11:30] Aman: So is that, is it two different problems to say the thickness of the accent and the, um, the, and how robotic you sound? Or are they different faces of the exact same problem?

[00:11:42] Ganna: No. So I wanna clarify a little bit. So, uh, the, the technology that we are building will. Uh, is preserves people’s voice and

[00:11:51] Aman: Yeah.

[00:11:53] Ganna: and it’s, it changes their accent, and it can be any, and in the first [00:12:00] iteration, we’re doing a mild accent conversion. In further iterations, we will enable users to choose if they want to do a

[00:12:07] Ganna: more severe, like harder accent conversion, um, or not. And, you know, for people who are using it for their own speech, maybe like, it seems like the mild version is, is, is very desirable. For people who are using it to alter the speech of their colleagues or customers, uh, they, they, they might want to have a very strong alteration.

[00:12:28] Ganna: So they, you know, like they have a very clear understanding of what the person is saying. And it’s, it’s very close to what their brain understands the easiest way. So, uh, in a, in a, you know, there is not really a tug of war with sounding robotic because the, the technology itself should not make you sound robotic.
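
To make the "dial" idea concrete, here is a minimal sketch of how a conversion-strength control could work, assuming speech is represented as per-frame feature vectors and that a fully converted version of each frame is already available. The function name `blend_features` and the linear interpolation are illustrative assumptions, not a description of SaySo's actual method.

```python
from typing import List, Sequence


def blend_features(original: Sequence[float],
                   converted: Sequence[float],
                   strength: float) -> List[float]:
    """Interpolate between an original and a fully converted feature frame.

    strength = 0.0 keeps the speaker's own accent untouched,
    strength = 1.0 applies the full conversion (hypothetical dial).
    """
    if len(original) != len(converted):
        raise ValueError("feature frames must have the same length")
    # Clamp the dial so it never extrapolates beyond the two endpoints.
    s = min(1.0, max(0.0, strength))
    return [(1.0 - s) * o + s * c for o, c in zip(original, converted)]


# A "mild" setting for one's own speech, a "strong" one for a listener.
mild = blend_features([0.0, 2.0], [1.0, 4.0], strength=0.5)
strong = blend_features([0.0, 2.0], [1.0, 4.0], strength=1.0)
```

A single scalar keeps the control simple for the user; a real system would likely need something more sophisticated than linear blending to keep the voice natural.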

[00:12:47] Ganna: Uh, what we have right now is not fully perfect, so it cuts off some frequencies of speech. So there is, there is some, uh, some element of robotic. Actually, it’s not gonna be there when we launch it [00:13:00] for, for people. So I, I guess this is what you’re referring to. Yes, no, it’s, you know, we are a very hard-tech startup.

[00:13:08] Ganna: There is a real technology there and real innovation and there are patents behind it. Uh, it, it takes time to develop technology. It takes time to develop speech technology, and it takes multiple iterations, and, you know, to some degree, the process is actually never perfect. So if you look at standards across the board, if you look at, uh, speech recognition, for example, those products have been around for 10 plus years and they’re still being like perfected and, uh, developed further.

[00:13:36] Ganna: And I, I think this will be the case with SaySo. The work is never done, but you, you kind of aim for better and better standards as you go along. Yeah. Um, while you try to deliver value in, in what you offer, you know, what you release. So I dunno if it answers the question. I hope

[00:13:57] Aman: So, uh, so

[00:13:59] Ganna: some of it.

[00:13:59] Aman: [00:14:00] from a that it, it, it kind of answers from, uh, one aspect and, uh, the other aspect is from a technical perspective. Uh, so what you, so what you basically said is that. It doesn’t actually make people sound like they’re actually using at least in the first iteration, uh, because you had to kind of prioritize the different design requirements.

[00:14:26] Aman: Right? Where if people, like, if somebody says, oh, I want to have an accent, but I still want them to feel like I am using some, like some technology to hide my accent. Like that’s probably lower on the priority list than saying, hey, I want to sound intelligible. Um, you know, I want to sound clear to my coworkers, to my customers.

[00:14:48] Ganna: So, yeah, so improving intelligibility is our, like, very bottom-line goal. So whatever we do, the north star is for people to have an easier time [00:15:00] communicating with each other so they can speak clearly and understand each other with ease, and this is like a north star for anything, any effort. And, you know, because it, it creates productivity gain.

[00:15:13] Ganna: So saved time, reduced frustration, improved collaboration and creativity, also, uh, improved access to opportunity and upward mobility for some populations. So when we improve intelligibility, we go towards all of those goals. And then like adding, uh, adding gears for people to dial up or dial down, to fiddle with like their sound, it’s, it’s an extra feature.

[00:15:37] Ganna: It’s something that they want to have and we will provide it for them. So we will provide those features so that this can be integrated in, in software solutions, but it’s, it’s, it’s less of a priority than intelligibility improvement.

[00:15:49] Aman: Mm, makes sense. Makes sense. I, I think that’s a good lesson, which is that, you know, And I think this is something that differentiates a lot of like AI research projects from real world [00:16:00] applications, which is that a lot of people who are building AI projects, they, tend to come from research backgrounds or whatever.

[00:16:08] Aman: And, uh, they start with a big, hairy problem, which is good, you know, because that’s what, that’s, what moves the field forward. But we have to realize that when we are building a product that uses AI. It’s first a product and then AI, right? the AI stuff has to happen within the context of solving the problem that you’re trying to solve in the first place.

[00:16:31] Aman: And AI is just a tool, uh, and you have to use the tool in a way that fits the problem. Not like to changing the problem so that it fits, uh, your research aspirations or, you know, whatever you want to build. Right. Uh that’s I think the really great lesson.

[00:16:47] Ganna: Yeah. So I, I, I can elaborate a bit on that because I love science and I love scientific projects and I, I really love hard tech. But you know, like the difference between doing research and doing product [00:17:00] is, is it’s, it’s a huge, huge difference in, in attitude on what you do.

[00:17:04] Ganna: Because when you do research, you do it to satisfy intellectual curiosity or to answer like some problem. And then you start working on it and it takes you in different directions. Your curiosity always shifts also with what you find and what you discover. So you float like in the ocean, and natural forces push you left, right, forward or backward.

[00:17:25] Ganna: And then, you know, discoveries are made this way. Um, usually it’s, it’s a lengthy process of doing scientific discovery takes years with, with, you know, with being, being a startup, being a commercial entity, you don’t have luxury of having all time in the world. You don’t have unlimited funding. And when you building a product, you have to keep a really strong focus on what.

[00:17:49] Ganna: You build who you’re building for and how they will use it. And then like maybe for some listeners, it will be very obvious. Uh it’s maybe for some, it will [00:18:00] not be very obvious, but like when you’re doing something, when you’re making design decisions and when you are making like pivot decisions or where you wanna shift your research, you have to check always with your customers to be so like, honestly, you have to talk to your customers like once a day, Like, if you are an entrepreneur, try to make a goal of talking to like one new customer once a day.

[00:18:20] Ganna: And like you have to interview them. Like you have to really have curiosity as much as in scientific inquiry, you have to have curiosity about your user and your, your customer, you know, like, what are their problems? What are they struggling with? Like, don’t come with with. Some, any assumptions to this, this conversation, like try to understand how you can be helpful, how your product can be helpful.

[00:18:42] Ganna: And sometimes like very simple features that you would think are useless or like are not so attractive to you are actually some features that users really want and need. And it’s easy to miss out on that if you don’t talk to them. But, but you know, talking to people, talking to users, talking to customers, [00:19:00] give, gives companies and founders like a huge advantage when the products are complete, because they already, you know, have some natural product market fit or conceptual product market fit.

[00:19:09] Ganna: And then they see customers coming to their doors and wanting to buy. And, and this is what we experience actually at SaySo. We have a lot of very organic interest and, you know, we don’t even have search engine optimization. We spent like, like really, we don’t have like word optimization. We, like, don’t have any words or any advertising, no social media, like a very small profile on LinkedIn.

[00:19:32] Ganna: And yet we have like really big and small companies coming to our door, wanting to do business or being curious and interested in us. Um, so we also had a very organic TechCrunch article about us. It’s, you know, but, but this all is happening when you try to build something that people need and would love.

[00:19:52] Ganna: And then like it, it creates natural interest. It creates much better business success. And frankly, you have to fund your research. So [00:20:00] unless you are a governmental institution, or you’re somewhere in academia and you have like access to grants, which puts you in a good place, if you are a commercial entity, you have to fund your research.

[00:20:09] Ganna: That is like, so tremendously expensive. You, you better make sure that you’re building product that people want, you know, you want to survive and stick around. Yeah. It’s, it’s, it’s quite important.

[00:20:20] Aman: Yeah. So from a, you know, from a technical perspective, talk to us about how the technology itself evolved.

[00:20:28] Aman: Of course, you know, you have some things that you, that IP, which you can’t share, uh, but in general, in terms of just educating the audience about a solution like this, you know, so we can again, move the field forward. What was the process of actually solving the problem? Once you had the design requirements and.

[00:20:46] Ganna: Yeah. So, you know, I have, I have a background in IT, so I, I studied IT. I have an undergrad in IT, and that was quite generic. But it gave me the opportunity to understand, [00:21:00] uh, what is happening in, in, uh, technology and read papers and understand state of the art and like research to my, uh, you know, like to my own advantage and my understanding.

[00:21:11] Ganna: So I had design requirements where I wanted to create real time technology, that transforms speech I wanted to have people’s voices and intonation preserved. And I, I didn’t want to sound to have it sound robotic. So I was, I, I realized very fast that I’m looking to create technology that didn’t exist because I couldn’t rely on natural language processing because whatever happens with natural language processing, you would have to synthesize speech on the other output of like on the other side of this, which would not sound natural.

[00:21:47] Ganna: It would sound, maybe it would sound okay. So AI is getting better, but it would not be the original emotional conversation and people will have no impact on how they sound on the other side of it. And we knew it’s not, it’s, [00:22:00] it’s a no go. And then also it’s questionable whether you can make things in real time with NLP.

[00:22:05] Ganna: And if, if you were to try to do this, it would be very computationally heavy and probably not work on-prem. Um, so, so I understood that. And then I was looking

[00:22:15] Aman: uh, by the way, when you say on-prem just for anybody who’s listening, on-prem means, uh,

[00:22:20] Ganna: on premises on device

[00:22:23] Ganna: of,

[00:22:23] Aman: uh, so like there, there are some software that you can run on a gaming laptop with like with endless RAM and endless graphics processing power, but running the same software on your phone,

[00:22:35] Aman: if you’re speaking to Siri or Alexa, something like that, uh, you know, some things don’t transfer well to small devices from big ones. So that’s what she means by on-prem, you know. That’s a technical term right there.

[00:22:46] Ganna: Yeah. So actually like on mobile devices, you can do much more than on desktop devices. Like it, it, it was a shocker to me when I, when I saw about it, like, was reading about it. But like mobile devices have much more computational [00:23:00] power than lots of desktop devices. We, we wanted to do it on desktop devices because that was our first iteration, to add to the conferencing software and call center software.

[00:23:10] Ganna: So we wanted to do something also lightweight.

[00:23:13] Ganna: You know, I came, I came in with like very strong user requirements because I knew that if we create something that users wanted, users will buy. And I started interviewing potential candidates who could be like a technical lead or like CTO in my company and create the technology.

[00:23:28] Ganna: And they were telling me, Hey, like, this is a really nice dream, but this is impossible.

[00:23:33] Aman: Mm.

[00:23:34] Ganna: We cannot build technology that is doing all of that. And then I was like, okay, I’m I’m, it was nice to meet you

[00:23:41] Aman: Yeah.

[00:23:42] Ganna: thank you. And, and goodbye and you know, like it, I, I was, I was getting a lot of those conversations. I was, I also had somebody who was trying to put together a black box AI approach.

[00:23:57] Ganna: So we throw like a really large amount [00:24:00] of data at the program. So we compare native and non-native speakers and we kind of tried to achieve some result. And this approach was not working very well for us either, so

[00:24:10] Aman: What was the approach again? Can you, can you elaborate just a little bit on that approach?

[00:24:14] Ganna: yeah, so that was throwing like a lot of, we took some parallel data from native and non-native speakers. So people saying the same thing. And then like, we tried to teach AI to, to alter speech, so it sounds closer to the native speaker,

[00:24:33] Aman: What were the, what were the outputs,

[00:24:35] Ganna: that was all in a sound wave output. So we were just changing sound wave to sound wave. And that was this. This was not working very well because there is like too much complexity. And also we needed like super huge amounts of data, and parallel data is, is quite challenging in many ways. Right now you can use synthetic data, but synthetic data also has problems, [00:25:00] uh, because it makes things sound more synthetic and there is not much diversity in speakers

[00:25:05] Ganna: if you try to generate large volumes of synthetic data, um,

[00:25:09] Aman: So, so if I understand correctly, the first approach was okay. You want at the end, a sound wave output, which has the person’s words in the right accent and right. Everything. And that’s the output of the model. And to do that, the approach that you were trying is to input the model with, um, with like what, what were the inputs.

[00:25:35] Ganna: Yeah. So the inf, so like the, so the training was to train the model on parallel utterances that are force-aligned. So, you know, force-align it, so make the model understand what this word sounds like if it’s said by a native speaker, or like this part of the speech, and then, you know,

[00:25:58] Aman: ah, I see. [00:26:00] So take a word in one, from one, in one accent, and then try to train the model to convert that particular word into

[00:26:07] Ganna: to the sound closer to the different accent.
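
As a rough illustration of the word-level setup described above, the sketch below turns two force-aligned recordings of the same sentence into per-word training pairs. It assumes an external forced aligner has already produced `(word, start_seconds, end_seconds)` spans; the `WordSpan` type, the function name, and the 16 kHz sample rate are hypothetical choices for this example, not details from the interview.

```python
from typing import List, Tuple

# (word, start_seconds, end_seconds), as a forced aligner might report it.
WordSpan = Tuple[str, float, float]
SampleRange = Tuple[int, int]


def paired_word_segments(native: List[WordSpan],
                         accented: List[WordSpan],
                         sample_rate: int) -> List[Tuple[str, SampleRange, SampleRange]]:
    """Return (word, accented_sample_range, native_sample_range) pairs."""
    if [w for w, _, _ in native] != [w for w, _, _ in accented]:
        raise ValueError("utterances are not parallel: word sequences differ")
    pairs = []
    for (word, n_start, n_end), (_, a_start, a_end) in zip(native, accented):
        # Convert times to sample indices so audio arrays can be sliced.
        native_range = (round(n_start * sample_rate), round(n_end * sample_rate))
        accented_range = (round(a_start * sample_rate), round(a_end * sample_rate))
        pairs.append((word, accented_range, native_range))
    return pairs


native_alignment = [("ship", 0.0, 0.5), ("sails", 0.5, 1.0)]
accented_alignment = [("ship", 0.0, 0.75), ("sails", 0.75, 1.5)]
pairs = paired_word_segments(native_alignment, accented_alignment, sample_rate=16000)
```

Each pair gives the model an input segment (the accented word) and a target segment (the native one). The two segments usually differ in length, which is one reason word-by-word conversion is harder than it first looks.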

[00:26:10] Aman: So include,

[00:26:11] Ganna: That was,

[00:26:12] Aman: I see. And then the speech would be built up word by word. So each word would go in, get translated into the different accent, and then out comes the new word, and you keep adding words to a sentence that

[00:26:24] Ganna: Yeah. So we, we, so yeah, we did it on words. So it didn’t go very far because there are different words that sound similar. So ship and sheep and, you know, like,

[00:26:32] Aman: Yeah.

[00:26:33] Ganna: There are lots of things like that that get distorted, but it was more of a test idea. Like how much data do we need and how, how much effort do we need and how much control do we actually have on the output?

[00:26:44] Ganna: And then the, the result of it was that we had like little control on how the output is. And then like if you retrain the model, it, it would sometimes produce like very different output. Uh, it was hard to manage and navigate and, and [00:27:00] obviously it was slow. So we could improve on doing, making it faster, like look at more granular parts of speech, not at words, but like this, this approach was a no go and, uh, it, it, you know, I was still looking for somebody else who could, who could try to do it differently.

[00:27:17] Ganna: And then I, I, I was interviewing a bunch of people. I posted on, uh, angel.co that I was looking for, uh, an advisor. Actually, I was not looking for a founder. I said, you, at that point, I was looking, just spreading the, the word, who, who can build something like this. And then, uh, my co-founder Bryce came about and he said like, Hey.

[00:27:39] Ganna: I think I can do it. Let me just go and show you. And I really love as a founder. I love when people say like, let me just go and do it. Let me show you. Instead of like, well, you know, like I need this and that. Like maybe think the lie and I will be able to do something like, but I’m busy right now. Like give me couple weeks.[00:28:00]
[00:28:00] Ganna: I, I love when people come and say like, Hey, like just let me go and show you. And. I’ll I’ll try to do this. I’m excited about this problem. So like, this was the beginning of actually my collaboration with Bryce and our current approach. Uh, he came back in two weeks or so, or even faster, something like 10 days, maybe two weeks.

[00:28:20] Ganna: And he had a proof of concept. Uh, he had a technology that was changing accent. Um, the output of it was horrific. So it sounded like, it was, it was very distorted. It sounded like Major Tom talking to you from a space station, where you can barely understand that it’s speech, not noise, but you could hear something that was happening to the accent.

[00:28:43] Ganna: And this, this approach was real-time compatible, which was the most important. And then, you know, as a founder, I got like, super excited about it. So I was like, yeah, you can solve the problem of accent. Yay. You know, uh, then we built some other proof of concepts, uh, based on parallel [00:29:00] data. So we used the same approach, but then we, we did.

[00:29:03] Ganna: Yes. Parallel that

[00:29:04] Aman: that? Can you explain that for the audience?

[00:29:06] Ganna: Yeah. So parallel data is when you take, uh, an utterance, uh, from one speaker with, say, one particular accent, and then you take exactly the same utterance, uh, with a different accent.
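
Two recordings of the same utterance rarely have the same timing, so parallel data usually has to be aligned frame by frame before it can be used for training. One textbook way to do that is dynamic time warping (DTW). The sketch below is a minimal pure-Python DTW over one-dimensional feature sequences; it is offered as an illustration of the general idea only, and the interview does not say that SaySo uses DTW.

```python
from typing import List, Tuple


def dtw_path(a: List[float], b: List[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two feature sequences; return (total cost, list of (i, j) index pairs)."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = abs(a[i - 1] - b[j - 1])
            cost[i][j] = dist + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Walk back from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# The second "speaker" holds the middle sound for one extra frame.
total_cost, alignment = dtw_path([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

In practice the sequences would be multi-dimensional spectral features rather than single numbers, and the quadratic cost of this basic version is a reason it suits offline data preparation rather than real-time use.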

[00:29:20] Aman: Mm.

[00:29:21] Ganna: And what we did here. So like to, to reduce noises and reduce distortion, what we did here, like I found people who can produce two different accents.

[00:29:30] Ganna: So I found accent actors, and I got them to record recordings exactly. Of the same words and same sentences in two different accents. So we had exactly same voices. So the voice doesn’t change. And then we had. Good quality, clear recordings. And we used this to, to convert their speech from one accent to another.

[00:29:49] Ganna: So we trained a model on this, and then we used this speech. So the model was very familiar with it, but like, the output of this sounded very, very compelling and sounded very good and [00:30:00] sounded actually close to what our product sounds like right now after all of this, like a year and a half of work on it. The difference between then and now is that we did like a fake, uh, experiment, almost like a proof of concept.

[00:30:13] Ganna: And it was on parallel data and the data that the system saw and knew. And right now we have something that is working in real time on speakers it’s never seen and doesn’t know. And, uh, it doesn’t require any, you know, like familiarity with the speaker. That was a big amount of work to be done between now and then.

[00:30:34] Ganna: Yeah. But, um, yeah, it, it, it happened
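
The parallel-data setup Ganna describes can be sketched roughly as follows. This is only an illustration, not SaySo's code: the record layout, the accent labels, the file names, and both helper functions are hypothetical. It pairs takes of the same sentence by the same speaker in two accents, then holds out a whole speaker so that evaluation runs on a voice the model has never seen.

```python
from collections import defaultdict

# Hypothetical recording metadata: (speaker, sentence, accent, audio file).
recordings = [
    ("actor_1", "good morning", "accent_a", "a1_gm_a.wav"),
    ("actor_1", "good morning", "accent_b", "a1_gm_b.wav"),
    ("actor_2", "good morning", "accent_a", "a2_gm_a.wav"),
    ("actor_2", "good morning", "accent_b", "a2_gm_b.wav"),
    ("actor_2", "see you soon", "accent_a", "a2_sy_a.wav"),  # no accent_b take
]

def build_parallel_pairs(records, source, target):
    """Pair takes of the same sentence by the same speaker in two accents."""
    by_key = defaultdict(dict)
    for speaker, sentence, accent, path in records:
        by_key[(speaker, sentence)][accent] = path
    pairs = []
    for (speaker, _sentence), takes in sorted(by_key.items()):
        # Keep only utterances that exist in both accents.
        if source in takes and target in takes:
            pairs.append((speaker, takes[source], takes[target]))
    return pairs

def split_by_speaker(pairs, held_out_speakers):
    """Hold out whole speakers so the test set contains only unseen voices."""
    train = [p for p in pairs if p[0] not in held_out_speakers]
    test = [p for p in pairs if p[0] in held_out_speakers]
    return train, test

pairs = build_parallel_pairs(recordings, "accent_a", "accent_b")
train, test = split_by_speaker(pairs, {"actor_2"})
```

Splitting by speaker rather than by utterance is what separates the early "fake" experiment, where the system already knew the voices, from a test on speakers it has never seen.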

[00:30:38] Aman: Interesting. So the ultimate, so the ultimate, uh, what worked was, uh, there’s multiple aspects to this first is you changed the outputs that you were looking for, right? um, and then you change of course, the model. And then by default, you had to change what kind of data you were looking for, uh, which was to have first [00:31:00] the right, the same person speak two different sentences in two different accents, right?

[00:31:07] Aman: With the same

[00:31:07] Ganna: Yeah. We also, you know, I have to say, because I told you about a very simplistic black box approach, where you throw a lot of data at the system and it changes accent. What, what we’re doing now, the current approach we have is very different. This approach is very precise and it’s a mixture of digital signal processing and artificial intelligence.

[00:31:27] Ganna: And like, if you want to do things in real time and you want to do them with low computational complexity so they can be done on user devices instead of in the cloud, you have to be very, very precise in what you’re doing, because if, if you are not, it will, it will increase computational complexity a lot. And it would very much increase the data requirements for input, for the technology to work.

[00:31:52] Ganna: So what we’re doing right now is, is like a really like multi-module AI approach. And, and we do as much as we [00:32:00] can on the digital signal processing level, uh, with the sound wave directly, before we actually get any AI to work in, in this approach. Like there are, there are multiple steps. So like first is like recognizing what the person is saying, like recognizing sounds of speech, because we work on sounds of speech and very granular sounds of speech.

[00:32:19] Ganna: And then like when we recognize, we can convert it and, you know, we, we split voice and intonation from the actual sounds of speech, the sounds of words. We convert those sounds of speech, sounds of words, to resemble more closely the accent desired by our user. And then we combine this with the original voice and intonation of the speaker.

[00:32:44] Ganna: So there are multiple operations here. And then like after we combined, we still apply some filters and we still apply some post-processing to make it sound good, to make it sound better. So there are like so many steps. [00:33:00] Uh, maybe it sounds simple, but actually there is nothing simple about it. Uh, like it it’s pretty cool.

[00:33:08] Ganna: Uh, as a research project, it’s a very cool research project.
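
The order of the steps described above (recognize the sounds, split them from voice and intonation, convert the sounds, recombine, post-process) can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the `Frame` fields, the sound labels, the `SOUND_MAP` table, and the energy clamp used as "post-processing" are hypothetical stand-ins, not SaySo's modules.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    sound: str       # recognized speech sound (hypothetical label)
    pitch_hz: float  # stands in for the speaker's voice and intonation
    energy: float    # loudness of the frame

# Hypothetical table mapping source-accent sounds to target-accent sounds.
SOUND_MAP = {"r_trilled": "r_approximant"}

def split_content_and_prosody(frames):
    """Separate what is said (sounds) from how it is said (pitch, energy)."""
    sounds = [f.sound for f in frames]
    prosody = [(f.pitch_hz, f.energy) for f in frames]
    return sounds, prosody

def convert_sounds(sounds, mapping):
    """Replace only the sounds that differ between the two accents."""
    return [mapping.get(s, s) for s in sounds]

def recombine(sounds, prosody):
    """Put the converted sounds back together with the original prosody."""
    return [Frame(s, pitch, energy) for s, (pitch, energy) in zip(sounds, prosody)]

def post_process(frames, max_energy=1.0):
    """Toy stand-in for the final filtering step: clamp overly loud frames."""
    return [Frame(f.sound, f.pitch_hz, min(f.energy, max_energy)) for f in frames]

def transform_accent(frames, mapping):
    sounds, prosody = split_content_and_prosody(frames)
    return post_process(recombine(convert_sounds(sounds, mapping), prosody))

# Output of a hypothetical recognition step for three short frames.
recognized = [
    Frame("w", 120.0, 0.4),
    Frame("r_trilled", 125.0, 1.3),
    Frame("a", 118.0, 0.6),
]
converted = transform_accent(recognized, SOUND_MAP)
```

In this sketch only the sound labels change; the pitch values pass through untouched, which mirrors the idea of keeping the speaker's own voice and intonation.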

[00:33:12] Aman: Yeah. So I think,

[00:33:13] Ganna: give us, yeah. Huh.

[00:33:14] Aman: so I think what, you’re, what you’re describing, which I’m, you know, I’ll try to summarize it for the, uh, for the audiences. Um, the problem,

[00:33:23] Aman: the way you’re solving the problem at a high level is you take speech from a certain user. You split that speech into different parts, right?

[00:33:35] Aman: One is the actual words that are being spoken. So the information in that speech, then there is the voice and intonation of the person, uh, which are again, two different things, there’s voice. And then there’s intonation. And last is whatever we call the accent.

[00:33:50] Aman: And I think accent itself is a very complex thing. I don’t know if people can really define. What accent really means.

[00:33:58] Ganna: Maybe I can elaborate a little [00:34:00] bit on how technology works and how we think about speech and how we think about accent. So, um, you know, accent is just part of the speech and it’s some, some sound. So we think of speech, which we process in a very organic way. So when we talk, the sound starts somewhere here.

[00:34:16] Ganna: Somewhere in the diaphragm. So pretty much deep in your vocal tract, like the diaphragm and larynx.

[00:34:22] Ganna: And this creates your, um, voice, and those parts of your body create your intonation. So emotional intent is coming from there. There is some, you know, and this is a relatively simple sound wave. It, it has a simple shape, ups and downs.

[00:34:39] Ganna: This sound wave is passed to your upper vocal tract, to your throat and your mouth, and your mouth shape and your mouth movement create the sounds of speech. And those sounds can, those are very simple sounds, so they can be, for example, rounded or unrounded, they can be nasal, labial, [00:35:00] glottal. It, it, it’s some vibrations that our mouth creates and, you know, the sound wave that is coming from our mouth still has some basic features of your basic voice.

[00:35:12] Ganna: And intonation.

[00:35:13] Aman: Yeah.

[00:35:14] Ganna: But your mouth movement, you know, your sound of speech, add much more complexity to it. It, it, it, it is like a very, you know, finicky sound wave. And then like, what we do is we would take this sound wave that’s coming out of your mouth and we would split it in small fragments. And then in those fragments, we would separate those parts that are to do with speech.

[00:35:37] Ganna: And separate those parts like features that have to do with voice and intonation. And it gives us ability to alter both of them in parallel. So if we wanted to, we could change also voices in, in real time in parallel, or we can just do real time, voice transformation. If you will, with this technology, like actually very easy, because changing voice is much simpler than changing accent.
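
Splitting the sound wave into small fragments and pulling out separate features per fragment can be sketched with NumPy. This is a simplified illustration under stated assumptions, not the method used in the product: it uses an autocorrelation pitch estimate as a stand-in for voice and intonation, and a windowed magnitude spectrum as a stand-in for the speech-sound features. The frame length, hop size, and pitch search range are arbitrary choices.

```python
import numpy as np

def split_into_frames(signal, frame_len, hop):
    """Cut a 1-D signal into overlapping fragments of frame_len samples."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def estimate_pitch_hz(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Rough pitch estimate: the lag with the strongest autocorrelation."""
    frame = frame - frame.mean()
    # Keep non-negative lags only; index 0 is lag 0.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    lag = lag_min + int(np.argmax(ac[lag_min : lag_max + 1]))
    return sample_rate / lag

def magnitude_spectrum(frame):
    """Windowed magnitude spectrum, a crude proxy for speech-sound features."""
    return np.abs(np.fft.rfft(frame * np.hanning(len(frame))))

# Synthetic one-second test tone at 200 Hz instead of real speech.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 200.0 * t)

frames = split_into_frames(signal, frame_len=640, hop=160)  # 40 ms frames, 10 ms hop
pitches = [estimate_pitch_hz(f, sample_rate) for f in frames]
spectra = np.stack([magnitude_spectrum(f) for f in frames])
```

Once each fragment has its own pitch track and its own spectrum, the two can in principle be altered independently, which is the property the description above relies on.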

[00:35:59] Aman: [00:36:00] yep.

[00:36:00] Ganna: And, you know, then after the, the transformation happens, we can by, and, and how we think about accents. So accent is, you know, like we have only a certain number of sounds of speech that the human mouth can make. And those sounds are completely the same for any language or any dialect or any accent. How we perceive this speech

[00:36:23] Ganna: Really depends on the, uh, prevalence of particular sounds and how we kind of like, you know, the order of how we group them, and this, this is different for different accents and dialects. So the core technology actually really can be applied to any language or any, uh, you know, like any dialect. Uh, the, the part where it is specific is that we need to understand, you know, what this particular person is trying to say and in which accent or language they talk.

[00:36:55] Ganna: So we need some data, uh, you know, to recognize the speech. So we need like to [00:37:00] have a module that is recognizing that, and then we can do conversion in the correct way. Yeah. But like, really like it’s, it’s all about, accents are all about like the prevalence and the combination and like, it’s, it’s, we’re all, you know, like actually we all speak in a very similar way.

[00:37:18] Ganna: The human brain is designed in such a beautiful way that it understands like very strongly pays attention to, to slight changes and, you know, like small variations. And if you look, you know, like how speech is changing on a plot, so like phonetic similarity, or like sound similarity, there are actually like really slight differences.

[00:37:38] Ganna: Uh, that the human brain perceives as very, very significant changes. So it comes up, you know, like for, for us, we have to do tests, obviously, like algorithmic tests, to have like data, like numbers on how we, we do it. But like, we also do a lot of perceptual testing and, and those, those, [00:38:00] uh, metrics are very different from like perceptual.

[00:38:02] Ganna: It’s, it’s very like significant change.

[00:38:03] Aman: Interesting. Well, I’ve never actually thought about this. So voice and intonation, can’t really, isn’t something that you learn. It’s just something that you generate based on your emotional, uh, capacity and your genetics, right?

[00:38:18] Aman: The voice you have is fairly genetic. You know, your lifestyle affects it, but you know, we know that it’s not what, uh, makes up your accent. And the accent is really how you have trained the muscles of your mouth and your face to convert that voice into speech. Right. And so what you’re really saying here is that the accent is the, the way that we have trained the muscles in our face.

[00:38:49] Aman: Right. And by repetition, by repetition, by listening and by repetition and by trying to generate the same style. We convert our, we put [00:39:00] the voice from that’s coming from our throat and our diaphragm through this mouthpiece, literally a mouthpiece, uh, which we have trained over time, uh, like a muscle, which generates our accent.

[00:39:13] Aman: And so the problem that you’re solving really is to separate the voice from the mouthpiece and like flip around different pieces. Um,

[00:39:23] Ganna: So it’s a lot of, you know, like it’s a lot of how our brains are actually trained in our original mother tongue. So for example, in Polish, like there are lots of sounds like “sh” and “ch” and “chi”. And like, I have a natural tendency when I speak other languages to create, to group those sounds. So like, I have a tendency to, to kind of pronounce it, to make my mouth a little different where, where it actually doesn’t belong in English as much as in Polish.

[00:39:52] Ganna: And like every language has those particular ways. And like in many languages, like how we pronounce “a” or “oo”, [00:40:00] it’s like very different. We stretch it more or less, or like make it faster, emphasize and de-emphasize. So it’s, it’s really, our brain is geared towards our mother tongue always. And we only can learn like very like native-like pronunciation up to a particular age. Somewhere in their teenage years most people lose this opportunity, ability to learn.

[00:40:23] Ganna: Like in a native, like capacity because our brains just mature and move on from that. And like, you just kind of stuck with like what you have as your like default way of, and then you, you see sometimes accent training, you see like, people are like the, the, the, you have to like flatten your, how to exactly what you’re saying.

[00:40:41] Ganna: Like you have to make particular like movement to create this sound in a, in a way that is prescribed by a different accent or language.

[00:40:50] Aman: Yeah. I remember like, uh, the first time, um, I, I had. Polish friend. And, uh, she taught me one [00:41:00] word in Polish, which is, uh, I’m gonna probably butcher this, but, uh,

[00:41:04] Ganna: It’s,

[00:41:05] Aman: MB, right.

[00:41:06] Ganna: This is great. You, you did it very well. It’s a toothbrush. Yeah.

[00:41:10] Aman: And I struggled with this for like two hours, like to just generate these sounds, like szczoteczka do zębów. And then she told me, oh, it actually just means toothbrush.

[00:41:20] Aman: Like something that took me like two hours to learn to pronounce. And I had noticed, and I told her some things in Hindi and she couldn’t pronounce them. So we were like, okay, we now we’re even, you know, uh, uh, but it totally makes sense. Like the way we train our facial muscles using our mother tongue

[00:41:38] Aman: Has direct effects on what kind of things we can pronounce and not pronounce or pronounce with difficulty in a different language. Right? So this, this makes a lot of sense.

[00:41:48] Aman: Let’s switch gears a little bit here. And, uh, let’s talk about your go to market strategy and your fundraising journey, because like you said, we said you are a startup.

[00:41:58] Aman: How did that [00:42:00] strategy evolve for you?

[00:42:01] Ganna: So, like, I, I wanna start with saying that there’s nothing simple about startup and the, if you, if you feel like being a founder, you have to just love it and you have to love your, your problem that you’re solving and you have to love your, uh, team and your work. Uh, cuz like there’s nothing easy about it anywhere you touch it.

[00:42:20] Ganna: So. You know, we started with, um, we started with the accent problem. We were solving. And along the way as we went, went on, we, we created this amazing technology that could do so much more, that we can use to change people’s voices in real time. And we can also, you know, remove noise from recordings. Uh, we can emphasize particular parts.

[00:42:42] Ganna: We can also do speech analytics. Um,

[00:42:44] Aman: Hmm.

[00:42:45] Ganna: So, you know, like also we gather data so we can be a data company. And then our, like our sales vision, our monetization vision evolved from, um, only being like accent conversion to actually being a [00:43:00] company focused on selling to software companies and the developer community to help them create delightful speech experiences in their products. You know, like there is very little right now available to improve speech experiences in, in products that exist today. So there is very little to improve human-to-human communication, and there is almost nothing to improve human-to-technology communication. You know, the spoken word is the easiest way for people to convey complex ideas, yet when communicating with technology, we rely on clicks and buttons, and this is very unnatural and, and, and bad for us.

[00:43:39] Ganna: So we, we could be so much more comfortable as people using the spoken word. But then, you know, like, so, so the vision evolved and our, we want to provide developers and software companies with SDKs, with software development kits that, uh, that improve those experiences. Because our product is so new and [00:44:00] unknown.

[00:44:00] Ganna: Like we have to educate people that such a product exists. And we also have to have very strong metrics to show for our product, improving, um, intelligibility and, and saving time for people. So right now we have the first initial rollout. We’re gonna roll out accent transformation and we’re gonna, we are gonna roll it out as

[00:44:22] Ganna: an app. So, so we can get like good metrics and then we can get testing and then we’re gonna roll it out as an SDK, uh, in, in six or eight months or so from now. Um, we have design partners. We actually have customers and we, they gave us big requirements, what we need to meet in terms of quality and, and standards.

[00:44:42] Ganna: And we are building alongside those requirements and then we we’re gonna roll out the product so we can start testing it in real life environment. Feedback from users and metrics and improve upon what we’re doing. But yeah, like we have, I’m not gonna name companies, but, uh, we have three very big companies, [00:45:00] uh, really interested in adding this to their tech stack.

[00:45:04] Ganna: Um, so, so that’s, that’s, that’s kind of like the rollout strategy and then we’re gonna build on this. So we’re gonna do product led growth, uh, as, as the company origin. So. Based on feedback from users based on what is useful. We’re gonna, uh, grow some branches of this business and, and grow other branches maybe less or maybe later.

[00:45:25] Ganna: But yeah, it’s, it’s gonna be a constant conversation between us and, and between the market between the user.

[00:45:31] Aman: interesting.

[00:45:32] Ganna: is what I hope for at least.

[00:45:34] Aman: Yeah. Yeah. Always, uh, uh, I think this, this was a really good VC pitch, you know, because I, because any investor listening to this would go like, whoa, that’s a lot of, you know, potential. That’s a lot of big market, you know, that you gonna capture, you know, like a large addressable market and all that thing.

[00:45:50] Aman: So, uh, it’s a good story and it makes, it’s a sensible story. It’s not like, uh, you know, of course. Once you start with this problem, you generate the IP, [00:46:00] which can, you know, help you go after other markets down the line. So you just have to start with one cornerstone problem, which in our case happens to be, uh, accent conversion.

[00:46:10] Aman: But you

[00:46:11] Ganna: Yeah. And then.

[00:46:12] Aman: sorry, go ahead.

[00:46:13] Ganna: No, the, uh, I wanted to say quickly, quick words about market size and fundraising. So the market size for this is 174 billion. And like some of this would come from, uh, teleconferencing market and like, uh, contact center market. Also voice gaming is really big and like, uh, you know, like real live streaming and, uh, smart devices where it’s

[00:46:35] Aman: interpretation. All of that.

[00:46:37] Ganna: Yeah, and we don’t even take into consideration metaverse and Web3 because those revenues are hard to assess. So we don’t have like numbers, but those, those verticals will definitely grow opportunities for monetization. And then to fundraising, like, hey, if you are a founder, your fundraising is never done.

[00:46:53] Ganna: So you’re always fundraising. We’re always fundraising, like always looking for investors, always looking for good connections, [00:47:00] always looking for customers. So like this is ongoing and this like there’s no, there are no breaks from that.

[00:47:06] Aman: One thing, one thing that you mentioned, uh, in passing, but I, it hit me for a second was that you talked about setting the right metrics so what are the metrics that you target and how do you measure your company?

[00:47:22] Ganna: Well, you know, there, like, because I told you we have like lots of parts to the technology. So like, you know, like we measure different things. So for those modules that use AI, we always measure the loss function. And we look at like, what is the loss function? And like how it, it relates to, to the bottom-line performance.

[00:47:43] Ganna: Uh, we are looking at, when we look holistically at the solution, we look at audio quality. So like the audio degradation, because the audio is heavily processed. So there is some audio degradation and like, we definitely need to improve on that metric because audio degradation [00:48:00] is still like much more significant, like much heavier than we want to have.

[00:48:05] Ganna: So like that that’s like a really important metric for us. And then we have our own metrics that we develop. So one is accentedness, we measure accentedness, uh, and

[00:48:15] Ganna: we, we measure overall

[00:48:17] Aman: Just tell us a little, tell us a little bit like, like a one minute, uh, quick description of what accentedness means.

[00:48:24] Ganna: Yeah. So like actually, like it’s, it’s quite straightforward. So we measure using, using Mechanical Turk. So like we, we play a recording for the person and I think it’s on, on a scale from zero to nine, uh, like zero is like no accent and nine is like very heavy accent. Like how do you, like, what is, how would you rate this recording?

[00:48:44] Ganna: And you play like a mix of recordings where you play like different accents. So like you, you also have questions like, like very simple yes-or-no questions. So like you, you can play a recording and say like, is this in like accent A or [00:49:00] this is like not accent A, so like, this is like accent A, or this is a foreign accent, like, what is this?

[00:49:06] Ganna: So put it in one bucket or another bucket. So then you kind of like combine those metrics coming, like which bucket the recording is in. And then like, you know, like how accented, like on a scale from zero to nine, like how heavy the accent is, if you will. So yeah, like.
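
A minimal sketch of how the two kinds of listener answers might be summarized for one recording. The conversation does not say how the ratings and the bucket votes are actually combined, so this simply reports them side by side; the function name, the field names, and the sample answers are hypothetical.

```python
from statistics import mean

def accentedness_summary(ratings, bucket_votes, target="accent_a"):
    """Summarize listener answers for one recording.

    ratings: 0-9 scores (0 = no accent, 9 = very heavy accent).
    bucket_votes: one label per listener, e.g. "accent_a" or "foreign".
    """
    mean_rating = mean(ratings)
    share_target = sum(vote == target for vote in bucket_votes) / len(bucket_votes)
    return {
        "mean_rating": mean_rating,
        "normalized_rating": mean_rating / 9,  # rescaled to the range 0..1
        "share_target": share_target,          # fraction who heard the target accent
    }

# Hypothetical answers from four listeners for one converted recording.
summary = accentedness_summary(
    ratings=[2, 3, 1, 2],
    bucket_votes=["accent_a", "accent_a", "foreign", "accent_a"],
)
```

Running the same summary on a recording before and after conversion would give a simple perceptual comparison to sit alongside the algorithmic metrics.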

[00:49:24] Aman: Interesting.

[00:49:25] Ganna: That’s like, I know it’s not like super like groundbreaking, but you have

[00:49:30] Ganna: to measure somehow.

[00:49:31] Ganna: Yeah.

[00:49:32] Aman: Why I love this story, why I love this particular story. And I’m glad I asked this question is because, you know, and this is for the audience when you’re building an AI product. You’re not, you know, you can use a lot of the metrics you can use around the techniques that other people building non-AI products use as their metrics.

[00:49:49] Aman: Right? You can combine both worlds. It’s not like everything that you change has to be in the form of a mathematical equation in a research paper that your loss function has to be something like very [00:50:00] mathematical, very, uh, signal processy or whatever you can use. Um, you know, this is the, it opens up your creativity, like how do you find.

[00:50:10] Aman: You know, how do you combine traditional software development and product management methods? And you combine those with deep learning research to choose creative, new, uh, loss functions, creative, new algorithms, uh, to optimize upon. Right. Um, so that’s, I think the real lesson.

[00:50:30] Ganna: Yeah, I would say like, hey, like if you’re building a product for people to use or enterprises to use, like you’d better use some like non-mathematical ways of measuring, like you’d better use some perceptual assessment of it, or like practical assessment, give it to users to test somehow, like get like human feedback, not machine feedback. Because at the end of the day, unless you’re building something for machines to use, then like you are, you are best with machine metrics, but if it is like human.

[00:50:56] Ganna: User is involved. Like you absolutely [00:51:00] better use human feedback before you evaluate your solution. Yeah.

[00:51:05] Aman: Yeah. Excellent. Well, thank you so much, uh, Ganna and, uh, to the audience, if you want to find them to buy their, uh, to buy from them or to invest in them, um, either way or, you know, contact them in

[00:51:20] Aman: general. Yeah. Or to collaborate or to work for them. Uh, they, I think their website is sayso.ai. So S-A-Y-S-O dot AI. Um, and, uh, yeah, as they said, they’re looking for, they’re always fundraising.

[00:51:36] Aman: They’re looking forward to it. Uh, and, uh, thank you so much Ganna for sharing.

[00:51:41] Ganna: Yeah, thank you for having me Aman. I really appreciate the time and I’m glad it was interesting and curious. I, I, I’m glad when I can, you know, make people curious about technology. I, I love technology also for any, uh, female listeners there. Um, you. [00:52:00] Technology is so cool. Uh, and you can do it. Absolutely.

[00:52:03] Ganna: You can do it and you can lead teams. Uh, you can lead male. My team is, is right now, completely male. So we’re actually looking for more diversity, uh, on the team, so keep that in mind. Like, uh, absolutely as a female founder, you can lead male teams very successfully. Uh,

[00:52:23] Aman: So, uh, you know, I love that as well. Like, you know, like, uh, so this is a female founder telling you, Hey, don’t worry. Like, even if you’re in a deep tech field and you’re gonna be surrounded by dudes, like, just go for it, you know? Uh, nothing’s stopping you.

[00:52:39] Ganna: you absolutely. Yes, absolutely. Go for it and dare to do it. You can do it. If, if this is something you feel like doing, just go for it.

[00:52:48] Aman: Awesome.

Using A.I. to Mask Foreign Accents on Voice Calls — SaySo (Ep 22)
