3 Reasons Why It’s Time to Talk about Voice UI

Communicating with machines used to mean clicking a mouse. Today, talking to computers involves actually talking to computers through voice recognition—and hearing them talk back.

As an interaction designer, my job is to help humans communicate with computers. At the start of my career, most of this communication happened through clicks. I would help users navigate a graphical user interface (GUI) by guiding them where to click their mouse and type with their keyboard. Today, I also focus on touch interfaces, helping people everywhere communicate with the tiny mobile computer in their hands through taps and swipes.

Looking forward, voice-based user interfaces (VUIs) represent a massive jump in this interaction evolution. Instead of forcing users to learn the computer’s GUI in order to give it instructions, voice technology puts the onus on computers to “speak” our language.

frog has been receiving more client requests for VUI work for some time, but I still wanted a broader industry perspective. That’s why last fall, I attended the second annual All About Voice conference in Munich. Hosted by 169 labs, the event featured a variety of speakers with voice expertise. Topics for discussion ranged from the current state of smart speaker technology to the design of personalities for voice assistants. Usually, I attend these types of conferences (virtually these days) with a few questions in mind. Primarily, should we care about voice? Is voice UI just a passing trend? Or will it fundamentally change how humans interact with the world?

My conclusion? We are at the cusp of a paradigm shift—and there is no going back now.

Voice of a Generation

Already, kids are being born into a world where they can talk to their phones, their lights and even their kitchen appliances. Touchscreens made human-computer interaction more intuitive, but I believe voice has the potential to flatten that learning curve completely. Anyone who can converse in their native language will be capable of voice-first interactions that feel natural. And while I, as an older millennial, may not be entirely comfortable with an Alexa in my microwave, I doubt my future kids will even question it.

While many presenters at the conference acknowledged that the underlying voice technology is still experiencing growing pains, it’s maturing quickly. We’re seeing rapid adoption of smart listening devices in homes and cars throughout the world. Andrea Muttoni of the Amazon Alexa team reiterated this growth with his summary of the myriad smart speaker-enabled devices that Amazon released at the end of September. From optical glasses to microwaves, Muttoni didn’t hesitate to explain Amazon’s vision to incorporate Alexa into every possible touchpoint in the consumer’s life. The goal, as he put it, is “Alexa everywhere.”

Another example of the power of voice was seen in Google’s release of the Pixel Buds 2, the “always listening” wireless earbuds. Plus, the company’s latest smartphone, the Pixel 4, features “raise to talk” functionality, which activates the Google Assistant as soon as you pick up the phone. This means that Google is expecting voice to become the user’s primary way to interact with their device.

Why Voice Matters Now

There is plenty of reason to believe that voice is the interaction model of the future. Here are a few examples:

1. Speakers are everywhere.

During his presentation at All About Voice, Bret Kinsella, founder of voicebot.ai, pointed out that between 2018 and 2019, smart speaker installations in US homes went up by nearly 40 percent. That means that by Q3 of 2019, more than 80 million Americans, about 32 percent of the national population, had smart speakers installed in their homes. Adoption in the EU is also growing steadily, with the UK at 21.1 percent and Germany at 11.6 percent at the end of 2019.

2. Voice UI is a matter of inclusivity.

High-quality VUI is not simply an issue related to convenience or preference, but one of inclusivity. For people with disabilities, such as those with impairments in vision, mobility and motor skills, voice technology represents a more personalized way to communicate and control both their physical and digital lives. For elderly and socially isolated people, VUI represents a meaningful opportunity for company and comfort.

3. Talking is natural.

Compared to click-and-touch interfaces, speech is by far the most natural way we’ve interacted with computers yet. Of course, the personality of the VUI will play a role in how natural this experience feels. Adva Levin, Founder and CEO of Pretzel Labs, a company that designs voice apps for kids, said, “Designing a personality for a voice assistant is a lot like creating a character. How old are they? What is their background? How do they talk?” These are now critical design considerations.

How to Talk to Humans

As humans, we have high expectations for technologies that imitate human behavior. Compared to clicking buttons, voice is a much more intimate and emotional form of interaction because it is so fundamental to who we are as human beings. This also means that if the computer gets it wrong, most of us will get much more frustrated with an unsatisfying interaction.

Unfortunately, conversation is inherently messy, even between people who speak the same mother tongue. While the human brain is pretty good at dealing with messiness, computers are not. They prefer logical structure over emotional nuance, which leaves a lot of potential for error when interpreting voice commands.

“Voice apps are going to be judged by how they handle the misunderstandings in conversation,” said Jon Bloom, Senior Conversation Designer at Google, during his talk on error-handling.

Bloom discussed a number of key challenges involved in designing for voice, as well as ways in which voice designers can manage them. One of the biggest issues, he said, is recognition. This can either be when the device didn’t hear you (e.g. too much ambient noise in the room) or when it didn’t understand you (e.g. long pause or odd wording). Bloom explained that knowing when and how the voice assistant should re-prompt the user in different situations is critical to creating a positive experience.

For example, when a voice assistant can’t understand what the user wants, the typical strategy is to ask the user to repeat themselves or to try rephrasing the question. After a couple of unsuccessful re-prompts, however, it may be better to just close the microphone and let the user start over, rather than continuing to frustrate them. Whether this is the “right” approach usually depends on how deep you are into the conversation and what type of content you’re discussing.
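To make the re-prompt strategy concrete, here is a minimal sketch in Python. It is only an illustration, not code from Bloom's talk or from any voice platform: the names `Recognition`, `Turn` and `next_turn`, the prompt wording, and the limit of two re-prompts (one reading of "a couple") are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Recognition(Enum):
    MATCH = "match"        # the assistant understood the request
    NO_INPUT = "no_input"  # nothing usable was heard (e.g. ambient noise)
    NO_MATCH = "no_match"  # speech was heard but not understood


@dataclass
class Turn:
    action: str  # "fulfil", "reprompt", or "close_mic"
    prompt: str  # what the assistant says next (empty when fulfilling)


# Assumed limit: how many re-prompts to try before giving up.
MAX_REPROMPTS = 2


def next_turn(result: Recognition, failures_so_far: int) -> Turn:
    """Decide the assistant's next move after one recognition attempt.

    failures_so_far counts the failed attempts *before* this one.
    """
    if result is Recognition.MATCH:
        return Turn("fulfil", "")
    if failures_so_far >= MAX_REPROMPTS:
        # Stop frustrating the user: close the microphone and let them restart.
        return Turn("close_mic", "Sorry, I'm having trouble. Let's try again later.")
    if result is Recognition.NO_INPUT:
        # The device did not hear anything, so ask for a repeat.
        return Turn("reprompt", "Sorry, I didn't hear that. Could you say it again?")
    # The device heard something but could not interpret it, so ask for a rephrase.
    return Turn("reprompt", "Sorry, I didn't catch that. Could you put it another way?")
```

In a real product the limit would probably vary with how deep the user is into the conversation and what is being discussed, which is why it is kept as a single adjustable constant here.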

Another common challenge Bloom described was around attention span (or rather the lack of it). While we might like the idea of completing an entire trip reservation flow by voice, in reality, most people don’t want to wait while the computer reads out a list of 20 flight options one by one. Therefore, voice designers must decide when it makes sense to leverage a multi-modal approach and pull up those 20 flight listings on the user’s smartphone screen for faster browsing and selection. This requires convergent design: the ability to look across silos that may otherwise be a barrier to this type of innovation.
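The hand-off decision described above could be sketched roughly as follows. This is a hypothetical illustration in Python: the function `present_options`, the threshold of three spoken options, and the response wording are assumptions made for the example, not a rule from the talk.

```python
from typing import Dict, List, Union

# Assumed threshold: the most options worth reading aloud in one turn.
MAX_SPOKEN_OPTIONS = 3


def present_options(
    options: List[str], has_screen: bool
) -> Dict[str, Union[str, List[str]]]:
    """Choose between reading results aloud and handing off to a screen."""
    count = len(options)
    if count <= MAX_SPOKEN_OPTIONS:
        # A short list is comfortable to listen to, so stay in voice.
        return {
            "modality": "voice",
            "speech": f"I found {count} options: {', '.join(options)}.",
            "screen_items": [],
        }
    if has_screen:
        # A long list plus an available screen: hand off for faster browsing.
        return {
            "modality": "screen",
            "speech": f"I found {count} options. I've sent them to your phone.",
            "screen_items": list(options),
        }
    # A long list but no screen: read only the first few and offer more.
    first = ", ".join(options[:MAX_SPOKEN_OPTIONS])
    return {
        "modality": "voice",
        "speech": (
            f"I found {count} options. Here are the first "
            f"{MAX_SPOKEN_OPTIONS}: {first}. Want to hear more?"
        ),
        "screen_items": [],
    }
```

For instance, calling `present_options` with 20 flight listings and `has_screen=True` takes the screen branch, while the same list without a screen falls back to reading out only the first three.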

frog + VUI

frog works with a lot of clients in automotive, healthcare and consumer goods, who are already exploring voice in their products and services. More and more, we’re advising brands on how to approach this new technology by examining their various use cases and recommending where voice can best elevate the customer experience. When would you use voice in the home versus in the car versus at work? In which context is voice the most efficient or enjoyable input mode?

On the flip side, it is also our responsibility to understand where choosing a different interaction model might result in a more satisfying experience. In the car, for example, most core features that can be controlled by voice (e.g., GPS and media) also support redundant touch inputs for those situations when there’s too much background noise. You might ask your GPS to navigate to an address verbally, but then use the graphical map on your head-up display instead of relying on the car to read out directions. As designers, we need to understand different contexts.

Until voice technology becomes smart enough to handle highly complex requests in challenging environments, this multi-modal approach (i.e., transitioning between speech/listening and sight/touch) can be a powerful tool for managing the limitations of our current voice assistants.

Designing Artificial Personalities for Real People

This multi-modal approach is often informed by convergent design techniques, a practice that integrates products, services and digital in new ways to create transformative solutions and experiences. Along with advising clients on these strategies from an experience and innovation perspective, at frog we are sometimes asked for guidance on one of the most distinctive elements of this new interaction model: personality. Whereas a GUI can exhibit a certain level of brand personality through the designer’s choice of colors, typefaces and imagery, designing for voice is entirely different.

We can no longer ignore language. With voice design, we cannot continue to fall back on lorem ipsum text because dialog is the interface. Designers working with voice must understand how people respond emotionally to voice and to the characteristics of different voices, which often requires knowledge from social science fields like psychology, sociology, and linguistics—and even from humanities subjects including literature, philosophy and history.

Kinsella said in his presentation, “Voice is about being human and the tech understanding us, rather than us having to learn the language of the devices we use.” As an interaction designer, I am excited by the possibility of bringing human-centered voice experiences within reach, or rather within earshot, of the humans who will appreciate them most.

Author

Kim Gladow

Principal Interaction Designer, frog Munich

Originally from Seattle, Kim currently works out of frog’s Munich studio. With over 10 years of design consulting experience, she both leads creative teams and goes hands-on to create high-level user experiences for a range of digital products and services. She is also passionate about sustainability and how circular design principles can be applied to the digital world.
