May 09, 2018
At Voxable, we design and develop conversational user interfaces. We build everything from voice user interfaces, like Alexa Skills and Actions on Google, to custom chatbots. In our workshops, talks, and discussions with clients, we find it helpful to break down the conversational interaction loop: the way humans and machines communicate. It's essential for anyone working with this technology to understand how machines handle conversation. For a human to converse with a machine at the most basic level, the machine must listen to what the human says, understand it, process it, and respond appropriately.
I'm going to explore the core technology behind how machines accomplish these tasks and how that technology differs depending on whether it's a voice or text conversation.
When a human speaks to a machine, the machine needs to recognize the sounds of the human's voice and convert them into text. This is the job of automatic speech recognition (ASR): machine learning algorithms, trained to recognize human language from streaming audio, identify the words a user is saying and convert them into text in real time.
ASR technology is built into browsers, operating systems, phones, and voice-first devices like Amazon Echo and Google Home. ASR accuracy has improved dramatically over the past few years with the introduction of new machine learning techniques, like deep learning, that make the technology much more reliable and usable on a broad scale. Because the interaction loop stops if the machine cannot transcribe what the user said, accurate ASR is a prerequisite for a highly functional voice interface.
Nevertheless, ASR still faces challenges depending on the job it's being asked to do and the context in which it's being used. The wide range of voices, dialects, and accents interacting with ASR demands more from the technology. Rapid speech and background noise also pose a challenge. It's necessary for conversational designers to understand how ASR affects their users specifically.
As you might guess, the process is somewhat simpler when a human is conversing with a machine via text because there is no translation from audio to text. Graphical user interfaces, like messaging apps, also offer a variety of design affordances that improve interactions. Menus, image cards, and quick reply buttons are just some of the tools conversational designers can leverage to facilitate conversation.
Whether the text originates from a human speaking or typing, language is messy and machines have a hard time understanding it. Machines must be taught to turn it into structured data automatically. Enter natural language understanding (NLU), the technology that translates messy human language into structured data. In a conversational interface, the data that NLU extracts includes the pieces of information, or entities, that are important to accomplishing a user's goal.
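To make "structured data" concrete, here is a minimal sketch of what an NLU step might hand back for a piece of text. It is a toy that uses hand-written regular expressions, not the statistical models real NLU platforms rely on, and every name in it (`ParsedUtterance`, `parse`, the `order_pizza` and `check_weather` labels, the `size` and `city` entities) is an assumption made up for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ParsedUtterance:
    """Structured data extracted from one piece of user text."""
    intent: str
    entities: dict = field(default_factory=dict)


# Hypothetical hand-written patterns standing in for a trained language model.
INTENT_PATTERNS = {
    "order_pizza": re.compile(r"\b(order|want|get)\b.*\bpizza\b", re.IGNORECASE),
    "check_weather": re.compile(r"\bweather\b", re.IGNORECASE),
}
SIZE_PATTERN = re.compile(r"\b(small|medium|large)\b", re.IGNORECASE)
CITY_PATTERN = re.compile(r"\bin ([A-Z][a-z]+)\b")


def parse(text: str) -> ParsedUtterance:
    """Return the first matching intent plus any entities found in the text."""
    intent = "unknown"
    for name, pattern in INTENT_PATTERNS.items():
        if pattern.search(text):
            intent = name
            break
    if intent == "unknown":
        return ParsedUtterance(intent)

    entities = {}
    size = SIZE_PATTERN.search(text)
    if size:
        entities["size"] = size.group(1).lower()
    city = CITY_PATTERN.search(text)
    if city:
        entities["city"] = city.group(1)
    return ParsedUtterance(intent, entities)


if __name__ == "__main__":
    print(parse("I'd like to order a large pizza"))
    print(parse("What's the weather in Austin?"))
```

The point is the shape of the result rather than the matching technique: free-form text goes in, and a label for what the user wants plus the details needed to act on it come out.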
The rise of accessible and affordable NLU platforms like Google's Dialogflow, IBM Watson, and Microsoft LUIS has fueled innovation in the conversational interface industry. Startups and independent developers can utilize these platforms and no longer need to invest in a team of data scientists and PhDs to build custom statistical models of human language.
Technology companies invest massive amounts of time and money to advance machine learning and increase machines' ability to understand humans. As a result, machines are learning to understand human elements of communication like the context of a conversation and the emotional state of users.
Once meaning and data are extracted from the text, a system or logic must be put in place for how the machine handles that information – this is called bot intelligence. Bot intelligence takes data extracted from the NLU and combines it with contextual data and business logic to handle the complexity of conversation. For example, a chatbot may need to ask follow-up questions to clarify the user's intent in order to perform the correct action.
In a conversational interface, bot intelligence manages the:
Bot intelligence encompasses the logic a machine requires to have in-depth conversations with humans and accomplish their goals.
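The follow-up question behaviour described above can be sketched as a small slot-filling routine. This is one common way to structure such logic, offered as an illustration under stated assumptions rather than as how any particular platform does it. The names `REQUIRED_SLOTS`, `PROMPTS`, and `next_action`, along with the pizza-ordering example, are hypothetical.

```python
# Hypothetical business logic: which details each intent needs before acting.
REQUIRED_SLOTS = {
    "order_pizza": ["size", "topping"],
}

# Hypothetical follow-up questions, one per missing detail.
PROMPTS = {
    "size": "What size would you like?",
    "topping": "Which topping would you like?",
}


def next_action(intent: str, entities: dict, context: dict) -> dict:
    """Combine freshly extracted entities with remembered context and decide
    whether to ask a follow-up question or carry out the request."""
    # Contextual data: remember details the user supplied on earlier turns.
    slots = {**context.get("slots", {}), **entities}
    context["slots"] = slots

    missing = [name for name in REQUIRED_SLOTS.get(intent, []) if name not in slots]
    if missing:
        slot = missing[0]
        return {"type": "ask", "slot": slot, "prompt": PROMPTS[slot]}
    return {"type": "fulfill", "intent": intent, "slots": slots}


if __name__ == "__main__":
    context = {}
    # Turn 1: the user only mentioned a size, so the bot asks about the topping.
    print(next_action("order_pizza", {"size": "large"}, context))
    # Turn 2: the answer fills the last gap, so the bot can act.
    print(next_action("order_pizza", {"topping": "mushroom"}, context))
```

Keeping the remembered details in a `context` dictionary that survives between turns is what lets the second call succeed even though the user never repeats the size.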
After performing an action and retrieving information, the machine must generate a response that is valuable to the human. The response could communicate the state of an action, relay information, or request more information from the user.
Voice-based interactions use audio feedback, synthesized speech, and recorded audio to deliver responses. Using synthesized voice is the most cost-effective method for a voice interface to deliver responses. Synthesized voice, also known as text-to-speech (TTS), is a technology that mimics a human voice by translating written language into sounds that humans recognize as speech. Its voice quality has become increasingly human-like, to the point that it can imitate the voices of real celebrities and politicians.
Recorded audio options, such as voice-overs and sound effects, are also important elements in a voice interface and, when used well, further reinforce the brand and improve communication.
For text-based interactions, the response is delivered through a graphical user interface like a messaging app. The various affordances of each messaging app – such as menus, image cards, and quick reply buttons – help deliver rich information and structured content.
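As a rough illustration of how one response can be delivered differently by channel, the sketch below renders the same message for a voice interface and for a messaging app. The `BotResponse` class, the `render` function, and the payload keys (`speech`, `text`, `quick_replies`) are assumptions invented for this example and do not correspond to any specific platform's format.

```python
from dataclasses import dataclass, field


@dataclass
class BotResponse:
    """A channel-neutral response: the message plus optional suggested replies."""
    text: str
    quick_replies: list = field(default_factory=list)


def render(response: BotResponse, channel: str) -> dict:
    """Turn a channel-neutral response into a channel-specific payload."""
    if channel == "voice":
        # No buttons in a voice interface, so the options are read aloud
        # as part of the text handed to text-to-speech.
        speech = response.text
        if response.quick_replies:
            speech += " You can say: " + ", ".join(response.quick_replies) + "."
        return {"speech": speech}
    if channel == "text":
        # A messaging app can show the options as tappable quick reply buttons.
        return {"text": response.text, "quick_replies": list(response.quick_replies)}
    raise ValueError(f"Unsupported channel: {channel}")


if __name__ == "__main__":
    question = BotResponse("What size would you like?", ["small", "medium", "large"])
    print(render(question, "voice"))
    print(render(question, "text"))
```

Separating what the bot wants to say from how each channel presents it keeps the conversation logic the same whether the user is speaking or typing.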
Hopefully, this breakdown of the way conversational user interfaces work demystifies how humans and machines communicate. Successful interaction between a human and computer is similar to successful interaction between two humans: it involves listening, understanding, processing, and responding appropriately. Building a shared understanding of the users' needs and helping them achieve their goals is a lofty task, but it's core to the work of successful conversational design.