Are you a non-technical executive or entrepreneur who wants to be more AI- and tech-fluent? See our program.
This is The Age of AI Series, where we talk to the foremost entrepreneurs and innovators around the planet using ML to transform industries. (Join our special mailing list!)
In today’s episode, we’ll talk about the new generation of voice-controlled apps and devices.
We’ve all seen voice assistants explode into the consumer space recently (Siri, Alexa, and Google Assistant come to mind). But their adoption has been slower than ideal, because they feel very “unnatural” to use: it’s like a walkie-talkie conversation where two people take turns speaking.
Due to the inherent delay in turn-based dialogue, the speaker gets no continuous feedback reassuring them that they’re being understood as they speak. Eliminating this lag, and the discomfort it causes, is the holy grail of voice control technology; solve it, and adoption would skyrocket, with almost everything controlled by voice.
This brings us to my new guest, Otto Söderlund, the CEO of Speechly — a company trying to fix that discomfort using their patented technology.
(Make sure you scroll down to read the 2-minute summary and takeaways!)
Here’s what we discuss:
- 01:50 — Brief history and driving forces behind speech recognition technology
- 07:00 — How speech and “entity” models work today: a simple explanation
- 11:15 — Limits and shortcomings of current technology
- 22:55 — Why the user experience of voice technology is still poor: error handling
- 31:21 — How models like GPT-3 have advanced but also slightly hurt the field of ML
- 38:12 — What Speechly’s technology does
- 42:45 — SPEECHLY DEMO and use-cases
- 46:21 — How Speechly decided what to patent and what to keep a trade secret
- 54:55 — The machine learning bootstrapping problem, and using synthetic data to fix it
Aman’s 2-Minute Summary and Key Takeaways
The AI architecture of most voice assistants is very sequential: first convert speech into text, then extract entities and sentiments from the text, use those to “understand” the meaning of the speech, generate a response, and finally run the whole thing in reverse to deliver the output to the user in the form of speech or text.
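To make the sequential architecture concrete, here is a minimal Python sketch. Every function is a hypothetical stub (the names and the example utterance are my own, not Speechly’s or any real library’s); in a real assistant, each stage would be its own model.

```python
# Hypothetical sketch of the classic sequential voice-assistant pipeline.
# Each stage is a stub standing in for a dedicated ML model.

def speech_to_text(audio: bytes) -> str:
    """ASR stage: convert raw audio into a transcript (stub)."""
    return "book a flight to london tomorrow"

def extract_entities(text: str) -> dict:
    """NLU stage: pull intent and entities out of the transcript (stub)."""
    return {"intent": "book_flight", "destination": "london", "date": "tomorrow"}

def generate_response(meaning: dict) -> str:
    """Dialogue stage: decide what to say back (stub)."""
    return f"Booking a flight to {meaning['destination']} for {meaning['date']}."

def handle_utterance(audio: bytes) -> str:
    # Note that the pipeline only starts after the user has finished
    # speaking, and each stage waits for the previous one. That strict
    # ordering is exactly where the perceived lag comes from.
    text = speech_to_text(audio)
    meaning = extract_entities(text)
    return generate_response(meaning)
```

The key point isn’t the stubs themselves but the shape of `handle_utterance`: nothing reaches the user until every stage has run to completion.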
It’s very intuitive and makes sense. And for each block in this stack, the ML community has already solved the toughest technical problems and reached near-human parity.
But because of this sequential processing, there’s an inherent lag: a short “waiting time” after you’ve spoken to the device while it makes sense of your input. Moreover, when we speak, we humans correct ourselves all the time, and natural speech is incoherent and choppy. Sequential systems are poor at handling such corrections and disfluencies in real time.
Speechly’s claim to fame is that they’ve redeveloped the sequential stack to be real-time, making it 5-10 times faster. As you start speaking, you don’t wait; the system immediately starts understanding and streaming visual feedback and/or responses back to you. They provide an API that developers can use to add this functionality to their applications, from e-commerce to flight booking to social media. They also make heavy use of synthetic data to fine-tune their system for new use cases as they arise.
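The streaming idea can be sketched as a generator that emits a partial interpretation after every word instead of waiting for the full utterance. This is not Speechly’s actual API (the episode doesn’t show their code); it’s a toy Python illustration of the real-time feedback loop, with function and field names of my own invention.

```python
from typing import Iterator

def streaming_understand(words: Iterator[str]):
    """Yield a partial interpretation after every incoming word,
    rather than waiting for the utterance to finish (toy example)."""
    transcript = []
    for word in words:
        transcript.append(word)
        # In a real streaming system, a neural model would update its
        # intent/entity hypotheses here; we just echo the partial text.
        yield {"partial_transcript": " ".join(transcript)}

# Usage: the UI can render feedback as each partial result arrives,
# so the user sees they are being understood while still speaking.
for partial in streaming_understand(iter(["show", "me", "red", "shoes"])):
    print(partial["partial_transcript"])
```

The design point is that output is incremental: the caller receives four progressively longer results for a four-word phrase, instead of one result at the end.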
Another critical aspect we touched on was how they decided what to patent and what to keep a trade secret.
Their secret sauce is a specific deep neural network that can handle streaming audio, along with certain aspects of how they operate it. These can’t be patented, so they remain trade secrets; instead, they patented other parts of their stack, such as the user interface and their methodologies for training models. It’s interesting to see how patent law applies to ML technologies.
My 2 cents: the technical details are impressive, but even more impressive is seeing it in action. If you watch their demos, you too might get a glimpse of the future of websites and apps.
Also, Speechly’s buyers are business folks who want to gain a competitive advantage and don’t understand ML, but their users are software developers, which I suspect makes for an interesting sales and marketing cycle!
(Ethics Policy: These opinions are 100% my own as an independent observer and educator. I don’t own stock in guests’ companies or their competitors, nor do I get paid by them in any form for any reason at the time of publishing, unless specifically stated. Episodes are also not intended to be an automatic endorsement of any company or its products and services.)