For years we’ve been promised a computing future where our commands aren’t tapped, typed, or swiped, but spoken. Embedded in this promise is, of course, convenience; voice computing will be not only hands-free but genuinely helpful, and it will rarely miss the mark.
That hasn’t quite panned out. Voice assistant usage has risen in recent years as more smartphone and smart home customers opt into (or in some cases, accidentally “wake up”) the AI living in their devices. But ask most people what they use these assistants for, and the voice-controlled future sounds almost primitive, filled with weather reports and dinner timers. We were promised boundless intelligence; we got “Baby Shark” on repeat.
Google now says we’re on the cusp of a new era in voice computing, due to a combination of advancements in natural language processing and in chips designed to handle AI tasks. During its annual I/O developer conference today in Mountain View, California, Google’s head of Google Assistant, Sissie Hsiao, highlighted new features that are part of the company’s long-term plan for the virtual assistant. All of that promised convenience is closer to reality now, Hsiao says. In an interview before I/O began, she gave the example of quickly ordering a pizza using your voice during your commute home from work by saying something like, “Hey, order the pizza from last Friday night.” The Assistant is getting more conversational. And those clunky wake words, like “Hey, Google,” are slowly going away, provided you’re willing to use your face to unlock voice control.
It’s an ambitious vision for voice, one that prompts questions about privacy, utility, and Google’s endgame for monetization. And not all of these features are available today, or across all languages. They’re “part of a long journey,” Hsiao says.
“This is not the first era of voice technology that people are excited about. We found a market fit for a class of voice queries that people repeat over and over,” Hsiao says. On the horizon are much more complicated use cases. “Three, four, five years ago, could a computer talk back to a human in a way that the human thought it was a human? We didn’t have the ability to show how it could do that. Now it can.”
Um, Interrupted
Whether or not two people speaking the same language always understand each other is probably a question best posed to marriage counselors, not technologists. Linguistically speaking, even with “ums,” awkward pauses, and frequent interruptions, two humans can understand each other. We’re active listeners and interpreters. Computers, not so much.
Google’s aim, Hsiao says, is to make the Assistant better understand these imperfections in human speech and respond more fluidly. “Play the new song from…Florence…and the something?” Hsiao demonstrated on stage at I/O. The Assistant knew that she meant Florence and the Machine. It was a quick demo, but one built on years of research into speech and language models. Google had already made speech improvements by doing some of the speech processing on device; now it’s deploying large language model algorithms as well.
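To make the idea concrete, here is a toy sketch in Python of what resolving a hesitant, half-remembered request against a catalog of known names can look like. It uses simple fuzzy string matching from the standard library and an invented artist list; it is an illustration of the problem only, not Google’s on-device speech stack or its language models.

```python
# A toy illustration (not Google's actual approach) of resolving a disfluent
# fragment like "Florence... and the something" against a catalog of known
# artists, using fuzzy string matching from Python's standard library.
import difflib

# Hypothetical catalog; a real assistant would query a music service.
KNOWN_ARTISTS = [
    "Florence and the Machine",
    "Fleetwood Mac",
    "Lorde",
    "The National",
]

# Words a speech recognizer might pass through that carry no content.
FILLERS = {"um", "uh", "er", "something", "someone"}


def resolve_artist(fragment: str) -> str | None:
    """Return the best catalog match for a hesitant, incomplete utterance."""
    words = [w.strip(".,?!…") for w in fragment.lower().split()]
    cleaned = " ".join(w for w in words if w and w not in FILLERS)

    lowered = [a.lower() for a in KNOWN_ARTISTS]
    # get_close_matches scores similarity between 0 and 1; a low cutoff
    # tolerates the missing word at the end of the fragment.
    matches = difflib.get_close_matches(cleaned, lowered, n=1, cutoff=0.3)
    if not matches:
        return None
    return KNOWN_ARTISTS[lowered.index(matches[0])]


print(resolve_artist("Florence… and the something?"))
# -> Florence and the Machine
```

Run on the “Florence…and the something” fragment, the sketch returns “Florence and the Machine.” The real Assistant does this with learned models rather than string similarity, and with far messier input, which is where the language-model work comes in.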
Large language models, or LLMs, are machine-learning models trained on giant text-based data sets that enable software to recognize, process, and engage in more humanlike interactions. Google is hardly the only entity working on this. Perhaps the best-known LLM is OpenAI’s GPT-3, alongside its sibling image generator, DALL-E. And Google recently shared, in an extremely technical blog post, its plans for PaLM, or Pathways Language Model, which the company claims has achieved breakthroughs in computing tasks “that require multi-step arithmetic or common-sense reasoning.” Your Google Assistant on your Pixel or smart home display doesn’t have these smarts yet, but it’s a glimpse of a future that passes the Turing test with flying colors.
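For a sense of what “trained on giant text-based data sets” means in practice, here is a minimal sketch using the open source Hugging Face transformers library and the small, publicly available GPT-2 model as a stand-in. PaLM and the Assistant’s production models are vastly larger and aren’t runnable this way, so treat the snippet as an illustration of the technique rather than of any Google system.

```python
# Minimal sketch: a pretrained language model continuing a text prompt.
# Uses GPT-2 via the Hugging Face `transformers` library as a small,
# public stand-in for the much larger models discussed in the article.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Hey, order the pizza from last Friday night."
outputs = generator(
    prompt,
    max_new_tokens=40,       # how much text to generate beyond the prompt
    do_sample=True,          # sample rather than always taking the likeliest word
    top_p=0.9,               # nucleus sampling keeps the output coherent
    num_return_sequences=1,
)

print(outputs[0]["generated_text"])
```

GPT-2 will produce plausible but often meandering text; the point is only the interface, a text prompt in and a humanlike continuation out, which is the same basic loop the far larger models scale up.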