5 Considerations for Designing Multimodal Bots

Lauren Golembiewski

Designing and developing intuitive voice interfaces and chatbots is challenging, to say the least. Building a bot that interacts in both the voice and visual space adds to that challenge. People often ask me, “What’s your design approach for a multimodal bot?” I receive that question so often that I was inspired to write about the considerations I keep in mind during this specific design process.

First, let me break down a few key terms:

Voice-first - Devices or applications with voice as the primary input and output. The Amazon Echo and Google Home are both voice-first devices. Voice-first is also a design approach in which one starts designing the voice interface before adding text or visual UI elements to ensure users can interact both hands-free and eyes-free.

Multimodal - Conversational interfaces that include voice interaction as well as text and visual UI elements. Devices like the Echo Show, Google Home Hub, Fire TV, and smartphones all support multimodal interaction.

Multi-channel - Bots that are integrated into multiple conversational channels like Amazon Alexa, Google Assistant, Facebook Messenger, and webchat.

It’s essential for designers to consider how to capitalize on each channel’s unique voice and visual affordances when creating a conversational experience. Amazon Alexa and Google Assistant, the leading conversational channels, now provide multimodal interaction with the latest screened devices and platform updates. When visual UI elements are successfully implemented, they improve usability and help users navigate an interface more easily. Even when adopting a voice-first design approach, it's important for designers to take into account the channel's available visual UI elements that might improve interface usability.

When extending a bot to new channels, designers benefit from multimodal design considerations. A multimodal design approach helps designers determine how the interaction will change for a new channel and which UI elements should be used. Designers should familiarize themselves with the vocal, aural, and visual UI elements available in each channel as well as contemplate the following when designing any conversational interface.

1. Writing

The main difference between a voice-first and a multimodal interaction is that the latter has visual cues such as text, images, and buttons to assist users. In a voice-first interface, the prompts must be clear because users have no way to visually orient themselves in the application or guide themselves when they get off track. Users must know how to answer the bot’s prompts and questions for interactions to be successful. Develop standards for your team that drive a consistent bot voice, tone, and persona, and that ensure the bot’s responses move the conversation forward to achieve users' goals.

A voice interface requires the user’s attention throughout an interaction and, with modern society’s increased distraction and decreased attention span, that level of engagement is easy to lose. Consuming information aurally imposes more cognitive load on users, which makes clarity and consistency the pillars of quality voice interface content. For a voice interface to be successful, its responses must be well considered and tested. Check out Rebecca Evanhoe's talk at VOICE Summit 2018 for valuable insight on how to write for voice interfaces.

Act out scripts written for the voice interface to assess how they sound. If the dialogue doesn’t seem natural between two humans, it’s definitely not going to improve when one of those humans is replaced by a machine. Perform Wizard-of-Oz usability tests on the content and interactions before building the interface. Usability testing enables designers to understand users’ mental models and prevents significant usability issues before the bot is developed, saving resources, time, and frustration. A voice-first writing approach is an effective way to refine language for bots that are multi-channel and multimodal.

2. Structure

Users can say anything to a voice interface, so it must be able to anticipate and handle a wide range of inputs–this is why natural language understanding (NLU) is a necessary component of any voice interface. A multimodal interface benefits from being able to incorporate visual UI components to assist users. Visual UI components improve usability and guide the conversation forward. Cards, menus, and quick reply buttons present users with choices and give them the power to advance the conversation with a tap or click.
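As a rough illustration of the idea above, the sketch below models one bot turn that carries both a spoken prompt and optional quick replies, then adapts it to the channel. The names BotResponse, render, and the has_screen flag are hypothetical and not tied to any platform SDK. On a screened device the choices become buttons; on a voice-only device they are folded into the spoken prompt so users still know how to answer.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BotResponse:
    """One bot turn: what is spoken, plus optional visual choices (hypothetical model)."""
    speech: str
    quick_replies: List[str] = field(default_factory=list)


def join_options(options: List[str]) -> str:
    """Join choices into a phrase that sounds natural when read aloud."""
    if len(options) <= 2:
        return " or ".join(options)
    return ", ".join(options[:-1]) + ", or " + options[-1]


def render(response: BotResponse, has_screen: bool) -> Dict[str, object]:
    """Adapt a single response to the channel's capabilities."""
    if has_screen:
        # Multimodal channel: keep the prompt short and show choices as buttons.
        return {"speech": response.speech, "buttons": list(response.quick_replies)}
    if not response.quick_replies:
        # Voice-only channel with nothing to choose from: speak the prompt as-is.
        return {"speech": response.speech, "buttons": []}
    # Voice-only channel: the choices must be spoken, since nothing is visible.
    spoken = f"{response.speech} You can say {join_options(response.quick_replies)}."
    return {"speech": spoken, "buttons": []}


if __name__ == "__main__":
    turn = BotResponse("What size would you like?", ["small", "medium", "large"])
    print(render(turn, has_screen=True))
    print(render(turn, has_screen=False))
```

Writing the spoken prompt first and treating the buttons as an enhancement keeps this in line with a voice-first approach: the voice-only rendering is always complete on its own.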

Another helpful feature of a multimodal interface is the persistence of information within the interface. Historical information might move out of users’ frame of view, but most conversational channels allow users to scroll back or use menus to access historical data at any time. It’s also possible to display the structure of information in a multimodal interface, whereas information in a voice interface arrives as a linear, unstructured stream.

Capitalize on the visual UI each multimodal channel affords to enhance user experience, especially when users need to complete spatial tasks (e.g. comparing furniture to purchase). Also, think about how the design might visually integrate into other types of channels like a television, a phone application, or email.

3. Context

The context of an interaction plays a huge role in the way a conversation evolves. For conversational interfaces, context varies widely depending on the device, channel, environment, user, and stage of the conversation.

How a multimodal interface is used in physical space is an essential conversational design consideration. Voice-first interactions are inherently more public than text or visual interactions due to the need to speak aloud and listen to audio. Keeping this in mind, if a conversational use case involves the need for users to share private or sensitive information, it might be more appropriate to display text and visual inputs.
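The privacy consideration above can be expressed as a small decision rule. This is only a sketch under stated assumptions: the function name, the two boolean inputs, and the three mode labels are hypothetical, and the "handoff" fallback (moving the step to a companion device when no screen is present) is an illustrative choice rather than a rule from any platform.

```python
def choose_input_mode(is_sensitive: bool, has_screen: bool) -> str:
    """Pick how the bot should collect the next piece of information.

    Returns one of three hypothetical labels:
      "voice"   - ask aloud and listen for a spoken answer
      "screen"  - show a text or visual input instead of speaking the details
      "handoff" - assumed fallback: continue on a companion device with a screen
    """
    if not is_sensitive:
        # Nothing private is involved, so speaking aloud is acceptable.
        return "voice"
    if has_screen:
        # Private information on a screened device: avoid saying it out loud.
        return "screen"
    # Private information on a voice-only device: do not collect it aloud.
    return "handoff"


if __name__ == "__main__":
    print(choose_input_mode(is_sensitive=False, has_screen=False))
    print(choose_input_mode(is_sensitive=True, has_screen=True))
    print(choose_input_mode(is_sensitive=True, has_screen=False))
```

In practice, what counts as sensitive depends on the use case and should come out of user research rather than a hard-coded flag.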

Voice-first and multimodal devices can be fixed in a location like a room of a house. If a device's location is static, it becomes a social element of the environment, which can alter the applicable use cases. Certain voice skills and applications should be designed to integrate specifically into the environments where users interact with them most, such as the home, car, and workspace. Perform contextual research to understand the setting in which users are interacting with the voice interface.

Beyond the physical space or location of the interaction, it is important for conversational designers to identify the various contexts the bot needs to recognize. Conduct user research and usability testing to identify the important contexts of the conversational interface and how the bot will support them.

4. Brand

Due to the ever-evolving affordances and interactions available in conversational channels, multimodal interfaces give brands new ways to drive usability and recognition. Whereas highly visual interfaces can rely on more traditional branding elements such as logos, images, and brand colors, voice-first interfaces must focus on audio elements that represent the brand.

Consider developing and consistently using earcons and sound effects to anchor users back to a brand when designing a voice or multimodal interface. As voice-first and multimodal bots mature, businesses' investment in aural and multimodal brand elements will increase.

5. Feedback Loops

Feedback loops are an integral part of any interface. The bot must be able to initiate an interaction with users when it makes sense (and the user has given their permission). Feedback loops also drive the turn-taking nature of chat-based conversations–letting users know when the bot has responded and it’s their turn to talk.

Because voice interfaces require the user's attention to communicate, it’s difficult to create aural feedback loops without imposing on the user's personal space. No one wants their smart speaker to start talking or chiming at random. Therefore, smart speakers rely on external lights, screens, and deeper integrations into companion devices like a television or mobile application to give the user feedback about the interaction. When designing an experience for a voice-first device like a smart speaker, consider incorporating additional communication channels or devices to initiate an interaction with users.

Design—especially for new technology—is an evolving discipline that necessitates active participation in a community of peers. I’ve outlined my considerations when designing multimodal bots, but I am constantly refining my approach as new devices, features, and methodologies are released. I can confidently say the more designers become interested in and begin solving problems for the conversational space, the more varied approaches we’ll see that will advance the industry as a whole. I look forward to being a part of this conversational community for many years to come!