<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Context-First]]></title><description><![CDATA[Thoughts on Voice and Multimodal User Interfaces]]></description><link>https://www.context-first.com/</link><image><url>https://www.context-first.com/favicon.png</url><title>Context-First</title><link>https://www.context-first.com/</link></image><generator>Ghost 2.9</generator><lastBuildDate>Mon, 31 May 2021 08:08:21 GMT</lastBuildDate><atom:link href="https://www.context-first.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Dialogue Management: A Comprehensive Introduction (2021 Edition)]]></title><description><![CDATA[Understanding approaches to conversational voice and chat systems]]></description><link>https://www.context-first.com/dialogue-management-introduction/</link><guid isPermaLink="false">Ghost__Post__5f94594b435a8f3636a3515e</guid><category><![CDATA[Deep Dives]]></category><dc:creator><![CDATA[Jan König]]></dc:creator><pubDate>Tue, 25 May 2021 15:31:17 GMT</pubDate><media:content url="https://ghost.jovo.tech/content/images/2021/05/dialogue-management-introduction-3.jpg" medium="image"/><content:encoded><![CDATA[<html><head/><body><img src="https://ghost.jovo.tech/content/images/2021/05/dialogue-management-introduction-3.jpg" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/><p>“<em>When do you open tomorrow?</em>”</p><p>“<em>We open at 9 am. <strong>Do you want to book a table?</strong></em>”</p><p>How does a conversational system decide how it should respond to a user’s request? In which cases should it ask for clarification, deliver facts, or present a follow-up question?</p><p>In this post, I want to introduce the topic of <strong>dialogue management</strong> as one of the critical ingredients of conversational systems (like voice apps and chatbots). This 3,000+ words in-depth introduction provides answers to the following questions:</p><ul><li>What is dialogue management?</li><li>What are popular approaches like finite state machines, form-based systems, and probabilistic dialogue management? What are pros and cons of each?</li><li>Is there an ideal approach?</li></ul><h2 id="multi-turn-conversations-and-ridr">Multi-Turn Conversations and RIDR</h2><p>In <a href="https://www.context-first.com/introduction-voice-multimodal-interactions/">An Introduction to Voice and Multimodal Interactions</a>, I introduce the <strong>RIDR</strong> (Request - Interpretation - Dialogue & Logic - Response) Lifecycle as a framework for the various steps involved in getting from a user request (e.g. a spoken “<em>Are you open tomorrow?”</em>) to a system response (e.g. a spoken “<em>Yes, we open at 9 am</em>”).</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/ridr-lifecycle-1.png"><img src="https://ghost.jovo.tech/content/images/2021/05/ridr-lifecycle-1.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>This seems to be a straightforward interaction. 
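</p><p>To make these steps a bit more tangible, here is a minimal sketch of RIDR as a pipeline of plain functions. This is purely illustrative TypeScript, not the actual Jovo implementation, and all names are made up:</p><pre><code class="language-typescript">// A minimal, purely illustrative pass through the RIDR lifecycle.
interface Interpretation {
  intent: string;                        // e.g. 'OpenHoursIntent'
  entities: { [key: string]: string };   // e.g. { date: 'tomorrow' }
}

function request(): string {
  // Request: capture the raw user input (here simply hardcoded).
  return 'Are you open tomorrow?';
}

function interpret(rawInput: string): Interpretation {
  // Interpretation: in a real system, ASR and NLU services would process rawInput.
  return { intent: 'OpenHoursIntent', entities: { date: 'tomorrow' } };
}

function dialogueAndLogic(input: Interpretation): string {
  // Dialogue & Logic: decide on structured output based on the structured input.
  return input.intent === 'OpenHoursIntent' ? 'We open at 9 am.' : 'Sorry, I did not get that.';
}

function respond(message: string): void {
  // Response: a TTS service or visual renderer would return this to the user.
  console.log(message);
}

respond(dialogueAndLogic(interpret(request())));
</code></pre><p>Each step only consumes the output of the previous one, which is what makes the building blocks interchangeable.</p><p>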
To make it a little more interesting, let’s add a follow-up question to the response: <em>“Do you want to book a table?</em>”</p><p>A user answering this question would kick off another flow through the RIDR Lifecycle:</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/ridr-lifecycle-multiturn-conversations-1.png"><img src="https://ghost.jovo.tech/content/images/2021/05/ridr-lifecycle-multiturn-conversations-1.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>These kinds of strung together interactions are called <em>multi-turn conversations.</em> In that terminology, a <em>turn</em> is either the user or the system saying something. Multi-turn suggests that there is some back and forth between both parties.</p><p>For conversations like this, it becomes necessary that the system builds up some sort of memory. “<em>Yes</em>” can mean completely different things depending on what happened during the last turn.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/same-input-different-meaning.png"><img src="https://ghost.jovo.tech/content/images/2021/05/same-input-different-meaning.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>This is where the third step of RIDR comes into play: <em>Dialogue & Logic</em>.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/ridr-dialogue-logic.png"><img src="https://ghost.jovo.tech/content/images/2021/05/ridr-dialogue-logic.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>In general, <em>Dialogue & Logic</em> takes structured input (e.g. an <em>intent</em>) from the <em>Interpretation</em> step and determines some structured output. This output is then passed to the <em>Response</em> step where it is returned to the user.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/dialogue-logic-input-output.png"><img src="https://ghost.jovo.tech/content/images/2021/05/dialogue-logic-input-output.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>As we’ve learned above, though, an intent (e.g. “<em>Yes</em>”) is not enough. We need to find ways for the system to remember the last turn (and potentially more) and take into account other contextual factors to make decisions about next steps.</p><p>This is what <em>Dialogue Management</em>, a key element of <em>Dialogue & Logic</em>, is responsible for.</p><h2 id="what-is-dialogue-management">What is Dialogue Management?</h2><p>Dialogue management (or <em>dialog management</em>) is responsible for handling the conversational logic of a voice or chat system. 
It usually consists of two main areas of focus:</p><ul><li><strong>Context</strong>: All data that helps us understand where in the conversation we currently are</li><li><strong>Control</strong>: Deciding where the conversation should go next</li></ul><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/dialogue-management-context-control.png"><img src="https://ghost.jovo.tech/content/images/2021/05/dialogue-management-context-control.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><h3 id="dialogue-context">Dialogue Context</h3><p><em>Context</em> is the “you are here” pin for a conversational system. Tracking and managing it is essential: If we don’t know where we are, we don’t know where to go next.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/dialogue-context.jpg"><img src="https://ghost.jovo.tech/content/images/2021/05/dialogue-context.jpg" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>Context includes all types of data a system uses to relate the current interaction to the bigger picture of the conversation. Examples could be:</p><ul><li><strong>Interaction History</strong>: What happened before? Was the user’s request the answer to a previous question? This is not only important for the “<em>Yes</em>” example from above, but also for making sense of linguistic elements like <a href="https://en.wikipedia.org/wiki/Anaphora_(linguistics)">anaphora</a><strong>, </strong><a href="https://en.wikipedia.org/wiki/Ellipsis_(linguistics)">ellipsis</a>, and <a href="https://en.wikipedia.org/wiki/Deixis">deixis</a>.</li><li><strong><strong><strong>Request Context</strong></strong></strong>: What do we know about this interaction? What device is used, what types of modalities are supported?</li><li><strong><strong><strong>User Context</strong></strong></strong>: What do we know about the user? Do they have certain preferences?</li><li><strong><strong><strong>Environmental Context</strong></strong></strong>: What else is important for this conversation? Is it day or night? Weekend? This could also include sensory data from IoT devices.</li></ul><p>These types of data could come from different elements of the conversational system. All steps of the RIDR lifecycle could potentially read from and write into the context. This approach to data management is also called <em>information state update (ISU) theory</em> <a href="http://staff.um.edu.mt/mros1/csa5005/pdf/larsson-traum2000.pdf">in research</a>.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/ridr-context-1.png"><img src="https://ghost.jovo.tech/content/images/2021/05/ridr-context-1.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>A note on the term <em>context</em>: In research, this is also referred to as <em>dialogue state</em>, which sometimes causes confusion with the term <em>state machine</em> (more on that approach below). <em>Context</em> and <em>state</em> are defined differently across disciplines, so let’s dive a bit deeper and find a useful definition for this and upcoming articles.</p><p>To me, there is a slight difference between <em>context</em> and <em>state</em>. 
While <em>context</em> is all the data the system uses to evaluate the <em>current</em> interaction, <em>state </em>is all the data the system remembers from <em>previous</em> interactions.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/context-state.png"><img src="https://ghost.jovo.tech/content/images/2021/05/context-state.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>As a rule of thumb, <em>state</em> can be seen as historic <em>context</em>. The system decides which context elements should be remembered for later use and stores them in a database or other type of memory. In the next interaction, it then retrieves the state to take into account in the current context.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/ridr-context-state-1.png"><img src="https://ghost.jovo.tech/content/images/2021/05/ridr-context-state-1.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>In a later section, we’re going to dive into a few approaches to managing context and state.</p><h3 id="dialogue-control">Dialogue Control</h3><p><em>Dialogue Control</em> is responsible for navigating the next steps of a conversation.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/dialogue-control.jpg"><img src="https://ghost.jovo.tech/content/images/2021/05/dialogue-control.jpg" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>Control responds to the following questions:</p><ul><li><strong>Task Record</strong>: Is there any additional information that is missing and needs to be collected from the user?</li><li><strong>Domain Logic</strong>: Is there any data that we need that is relevant for the interaction? Do we need to make any API calls to internal or external services?</li><li><strong><strong><strong>Initiative</strong></strong></strong>: Who is leading the flow of the conversation? Does the system just respond to user requests (user initiative) or does the system guide the user through a series of steps (system initiative)? Mixed-initiative controls are also possible.</li></ul><p>There are many different ways <em>control</em> could be implemented (this task is also called <em>dialogue flow</em> sometimes). While many systems use a rules-based approach where custom logic determines next steps, there are also a number of emerging probabilistic methods.</p><p>In the next section, we’re going to examine three approaches to dialogue management that look into the differences in how context and control could be implemented.</p><h2 id="approaches-to-dialogue-management">Approaches to Dialogue Management</h2><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/dialogue-management-finite-state-forms-probabilistic.png"><img src="https://ghost.jovo.tech/content/images/2021/05/dialogue-management-finite-state-forms-probabilistic.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>If you talk to researchers and practitioners in the field of conversational AI, you quickly realize that the challenge of dialogue management is far from being solved. 
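</p><p>Before comparing the approaches, it can help to picture the context and state described in the previous sections as plain data. The following sketch is purely illustrative and not tied to any specific framework; all field names are made up:</p><pre><code class="language-typescript">// Illustrative only: context is assembled for the current interaction,
// state is what the system remembers from previous interactions.
interface DialogueState {
  lastPrompt?: string;                          // e.g. 'BookATable', saved after the last turn
  collectedSlots?: { [key: string]: string };   // data gathered in earlier turns
}

interface DialogueContext {
  interpretation: { intent: string; entities: { [key: string]: string } };  // current input
  request: { deviceType: string; supportsScreen: boolean };                 // request context
  user: { id: string; preferredLocation?: string };                         // user context
  environment: { localTime: string };                                       // environmental context
  state: DialogueState;                         // historic context, loaded from storage
}
</code></pre><p>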
There are still a lot of missing pieces holding us back from building a robust solution that understands unanticipated input, interprets and remembers all sorts of contextual information, and then intelligently makes decisions about next steps in a natural way.</p><p>To get closer to this, many approaches to dialogue management have been created and tested over the years. Three popular ones are:</p><ul><li><strong>Finite State</strong>: A model that uses a state machine to keep track of the conversation. If your conversational system is designed with tools like flowcharts, it’s probably using the finite state approach.</li><li><strong>Form-based</strong>: A model with the goal of reducing the number of potential paths that have to be explicitly designed. Especially useful for <em>slot filling</em>, the process of collecting data from the customer.</li><li><strong>Probabilistic</strong>: A model that uses training data to decide the next steps of the conversation.</li></ul><p>Let’s take a closer look at each of these approaches.</p><h3 id="finite-state-dialogue-management">Finite State Dialogue Management</h3><p><em>Finite state </em>uses the concept of a <a href="https://en.wikipedia.org/wiki/Finite-state_machine">state machine</a> to track and manage the dialogue. Although technically a bit different (<a href="https://en.wikipedia.org/wiki/State_diagram#State_diagrams_versus_flowcharts">you can find a comparison here</a>), so-called <em>flowcharts</em> are often used as a visual abstraction for state machine based conversation design.</p><p>Here’s a simplified version of a flowchart of a system (white boxes) asking a user (blue boxes) if they want to make a reservation:</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/finite-state-simple-chart-2.png"><img src="https://ghost.jovo.tech/content/images/2021/05/finite-state-simple-chart-2.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>For <em>context </em>and <em>state tracking</em>, finite state machines often use a single value like a string of text. When you return a response that requires an additional turn, you save the current state, for example <em>BookATable.</em></p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/ridr-state.png"><img src="https://ghost.jovo.tech/content/images/2021/05/ridr-state.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>In a flowchart this could look like a “you are here” pin that helps it remember the last node of the conversation:</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/finite-state-context-1.png"><img src="https://ghost.jovo.tech/content/images/2021/05/finite-state-context-1.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>The <em>control</em> part of dialogue management then uses this state to determine next steps. 
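</p><p>In code, this lookup from saved state plus incoming intent to the next step might look roughly like the following sketch (a hypothetical handler, not the API of any specific tool):</p><pre><code class="language-typescript">// Illustrative finite-state control: the saved state plus the new intent
// determine the response and the next state.
type DialogueStateName = 'NONE' | 'BookATable' | 'CollectingDetails';

function handleTurn(state: DialogueStateName, intent: string): { reply: string; nextState: DialogueStateName } {
  if (state === 'BookATable' && intent === 'YesIntent') {
    return { reply: 'Great, for how many people?', nextState: 'CollectingDetails' };
  }
  if (state === 'BookATable' && intent === 'NoIntent') {
    return { reply: 'Alright, see you tomorrow!', nextState: 'NONE' };
  }
  if (intent === 'OpenHoursIntent') {
    // Ask a follow-up question and remember where we are in the conversation.
    return { reply: 'We open at 9 am. Do you want to book a table?', nextState: 'BookATable' };
  }
  return { reply: 'Sorry, I did not get that.', nextState: state };
}
</code></pre><p>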
This can be understood as following the flowchart to the next node.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/finite-state-control.png"><img src="https://ghost.jovo.tech/content/images/2021/05/finite-state-control.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>It then stores the new state for the next incoming request:</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/finite-state-context-transition.png"><img src="https://ghost.jovo.tech/content/images/2021/05/finite-state-context-transition.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>Many development tools like <a href="https://www.jovo.tech">Jovo</a> have <a href="https://www.jovo.tech/docs/routing/states">state management</a> built into their systems to allow for simple dialogue state tracking.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/jovo-state-code-sample.png"><img src="https://ghost.jovo.tech/content/images/2021/05/jovo-state-code-sample.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>Finite state machines are a helpful holistic approach to structuring conversational systems. Flowcharts are a known concept and are relatively easy to read, understand, and communicate. This often makes state machines the tool of choice for cross-functional teams that work on conversational experiences.</p><p>The challenge of finite state machines can be their more or less rigid structure. Natural language is so open and flexible that trying to put a conversation into a two-dimensional, tree-based process can feel mechanical and error-prone.</p><p>This is especially true for interactions that require a lot of user input. Let’s use our restaurant booking example again. We might need the following information before we confirm the reservation:</p><ul><li>Number of guests</li><li>Date</li><li>Time</li><li>Phone number</li></ul><p>Asking for each of the values step by step might feel more like filling out a form than having a conversation. In a natural dialogue, people would answer the question of whether they want to book a table in a variety of ways. Here are a few examples: <em>“Yes”, “Yes, for 4 people”, “Yes, but a little later”, “No”, “How about the day after?”</em></p><p>A flowchart representing some of the potential interactions could look more like this:</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/finite-state-complicated-chart.png"><img src="https://ghost.jovo.tech/content/images/2021/05/finite-state-complicated-chart.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>As you can see from this (still very simplified) chart, implementing all possible interactions can be difficult, even impossible for some use cases. This is also referred to as the <a href="https://link.springer.com/chapter/10.1007/3-540-65306-6_21">state explosion problem</a>. 
Chas Sweeting illustrates this challenge in his great post <a href="https://medium.com/voiceflow/the-heavy-lifting-required-to-build-alexa-skills-conversational-interfaces-a0e5752319e7">The heavy-lifting required to build Alexa Skills & conversational interfaces</a>.</p><p>And all the points mentioned above don’t even cover other types of context that were mentioned in previous sections of this article. Just to mention a few examples, where would we highlight differences depending on the time of day, the device, or specific user preferences?</p><p>Let’s take a look at other dialogue management techniques that attempt to solve those problems.</p><h3 id="form-based-dialogue-management">Form-Based Dialogue Management</h3><p><em>Form-based </em>(or frame-based) systems focus on the data needed to proceed, not on the conversational flow. This type of dialogue management is especially useful for <em>slot filling</em>, which is the process of collecting required user input.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/form-based-dialogue-management.png"><img src="https://ghost.jovo.tech/content/images/2021/05/form-based-dialogue-management.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>A <em>form</em> (also called <em>frame</em>) can be seen as a sheet of data that needs to be filled. The next step of the conversation is only reached when all required information is collected.</p><p>This frees up the design process of a conversational system by not having to design a new state machine branch for any potential interaction. Instead, it offers an abstraction by relying on clear rules, including:</p><ul><li><strong><strong><strong>Source</strong></strong></strong>: The information we collect could either be <em>implicit</em> (already known from previous interactions or other data) or <em>explicit </em>(stated by the user).</li><li><strong><strong><strong>Prompt</strong></strong></strong>: How we ask for the information, for example "<em>To send you a final confirmation, may I have your phone number?</em>"</li><li><strong><strong><strong>Priority</strong></strong></strong>: Some slots could be <em>required</em>, some others <em>optional</em>.</li><li><strong><strong><strong>Sequence</strong></strong></strong>: Data could be collected one by one or even in a single phrase (“<em>4 people at noon tomorrow please</em>”).</li><li><strong><strong><strong>Quantity</strong></strong></strong>: There could be multiple values for one type of information, similar to Alexa’s <a href="https://developer.amazon.com/en-US/blogs/alexa/alexa-skills-kit/2021/01/now-available-use-multi-value-slots-to-build-natural-conversations">multi-value slots</a>.</li><li><strong><strong><strong>Validation</strong></strong></strong>: What type of data do we expect and how do we handle cases that the system can’t understand? How can we make sure a user isn’t stuck in a loop if their input isn’t accepted?</li><li><strong><strong><strong>Confirmation</strong></strong></strong>: Depending on the importance of some of the data, we could use <em>implicit</em> (“<em>Alright, 4 guests. [...]</em>”) or <em>explicit</em> (“<em>I understood 4 guests, is that correct?</em>”) confirmation to make sure we get everything right. 
Confirmation can happen for each individual slot, and also a final confirmation when all the necessary data is collected.</li><li><strong><strong><strong>Adjustment</strong></strong></strong>: The user should have the ability to make corrections (“<em>Ah, wait, 5 people</em>”).</li></ul><p>Form-based systems track what data was already collected in an overview called <em>task record</em>. This record is retrieved as one element of dialogue <em>context</em>. </p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/form-based-dm-context.png"><img src="https://ghost.jovo.tech/content/images/2021/05/form-based-dm-context.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>The <em>control</em> part of dialogue management then uses this record to prompt for elements that are missing.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/form-based-dm-control.png"><img src="https://ghost.jovo.tech/content/images/2021/05/form-based-dm-control.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>Advanced form-based systems also don’t necessarily have the user tell the values in a specific order as defined in the task record. The customers are able to decide which slots they want to fill when, and even tell information in one sentence like “<em>3 people at noon, please.</em>”</p><p>Concepts like the <a href="https://developer.amazon.com/en-US/docs/alexa/custom-skills/delegate-dialog-to-alexa.html">Alexa Dialog Interface</a>, <a href="https://cloud.google.com/dialogflow/es/docs/intents-actions-parameters#required">Dialogflow slot filling</a> or <a href="https://rasa.com/docs/rasa/forms">Rasa Forms</a> implement versions of form-based dialogue management.</p><p>While this approach solves some of the issues of finite state machines, it’s still a mostly rules-based approach: Each conversation needs to be explicitly defined. The design and build processes might take less time because some parts are abstracted by adding rules. It still requires manual work, though, making it difficult for the system to handle unanticipated interactions.</p><p>Let’s take a look at an approach that attempts to solve this by using machine learning.</p><h3 id="probabilistic-dialogue-management">Probabilistic Dialogue Management</h3><p>Instead of relying on rules, <em>probabilistic</em> methods look at existing data to decide about next steps of the conversation.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/probabilistic-dialogue-management.png"><img src="https://ghost.jovo.tech/content/images/2021/05/probabilistic-dialogue-management.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>There is an ongoing debate in the conversational AI industry whether rules-based dialogue management techniques like finite state machines will ever yield good enough results.</p><p>As mentioned in a previous section, natural language is so open that it’s difficult to create rules for each potential interaction between a system and its users. 
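</p><p>To make the form-based approach from the previous section a bit more concrete, here is a minimal sketch of a task record and a control step that prompts for missing slots. It is purely illustrative and not the actual API of Alexa, Dialogflow, or Rasa:</p><pre><code class="language-typescript">// Illustrative form-based control: fill whatever slots the user provided,
// then prompt for the next required slot that is still missing.
interface Slot {
  name: string;
  prompt: string;
  required: boolean;
  value?: string;
}

// The task record for the table reservation form.
const taskRecord: Slot[] = [
  { name: 'guests', prompt: 'For how many people?', required: true },
  { name: 'date', prompt: 'For which day?', required: true },
  { name: 'time', prompt: 'At what time?', required: true },
  { name: 'phone', prompt: 'May I have your phone number for the confirmation?', required: true },
];

function fillSlots(entities: { [key: string]: string }): void {
  // Slots can arrive in any order, even several in one utterance ("4 people at noon").
  for (const slot of taskRecord) {
    if (entities[slot.name] !== undefined) {
      slot.value = entities[slot.name];
    }
  }
}

function nextAction(): string {
  const missing = taskRecord.find((slot) => slot.required && slot.value === undefined);
  // Prompt for the next missing slot, or ask for a final confirmation once the form is complete.
  return missing ? missing.prompt : 'I have all the details. Should I confirm the booking?';
}
</code></pre><p>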
Probabilistic dialogue management promises a solution: by relying on data and machine learning models instead of handwritten rules, it offers more natural and scalable ways to automate conversations.</p><p>Some tools that already implement this are Rasa (using <a href="https://rasa.com/docs/rasa/stories">stories</a> to train the model) and <a href="https://developer.amazon.com/en-US/blogs/alexa/alexa-skills-kit/2020/07/introducing-alexa-conversations-beta-a-new-ai-driven-approach-to-providing-conversational-experiences-that-feel-more-natural">Alexa Conversations</a>.</p><p>Probabilistic dialogue management works by using sample data in the form of conversations. Instead of looking at just one sentence (usually the case for natural language understanding), a complete conversation across multiple turns is considered.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/dialogue-management-sample-conversation.png"><img src="https://ghost.jovo.tech/content/images/2021/05/dialogue-management-sample-conversation.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>This data is labeled by humans and then used to train a machine learning model. The advantage of this approach is that it forces us to test our system very early in the process and learn from real-world interactions. By looking at conversation data, we can see what our users wanted from the system. The disadvantage is the amount of training data that is potentially needed for robust interactions.</p><p>The system makes an educated guess about the dialogue context by looking at past interactions and how they might fit into the machine learning model. This is sometimes also called a <em>state hypothesis</em> or a <em>belief state</em>.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/probabilistic-dm-context.png"><img src="https://ghost.jovo.tech/content/images/2021/05/probabilistic-dm-context.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>Some approaches also calculate multiple hypotheses and assign probabilities. This is especially helpful if the system discovers at a later point that its main hypothesis was wrong. It can then go back to another hypothesis and try again.</p><p>The <em>control</em> part of dialogue management then looks at these hypotheses and determines next steps using the model that was trained on the sample conversations.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/probabilistic-dm-control.png"><img src="https://ghost.jovo.tech/content/images/2021/05/probabilistic-dm-control.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>By looking at the training data, it can seem as if there’s no real difference between writing stories (training data) for probabilistic dialogue management and defining paths in a state machine. <em>Aren’t both some kind of rules?</em></p><p>The big difference is the following: while rules-based systems need a clearly defined rule for each type of input, machine-learning-based dialogue management tries to understand <em>unanticipated</em> input that was not explicitly defined beforehand. 
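</p><p>A rough sketch of what tracking multiple weighted state hypotheses (a belief state) could look like is shown below. It is illustrative only and makes no claim about how Rasa or Alexa Conversations implement it:</p><pre><code class="language-typescript">// Illustrative belief-state tracking: several hypotheses about the dialogue state,
// each with a probability that gets updated after every turn.
interface StateHypothesis {
  description: string;
  probability: number;
}

let beliefState: StateHypothesis[] = [
  { description: 'user wants to book a table for tomorrow', probability: 0.7 },
  { description: 'user is only asking for opening hours', probability: 0.3 },
];

function updateBeliefState(observationScores: number[]): void {
  // A trained model would score each hypothesis against the new turn;
  // here we just multiply by made-up scores and renormalize.
  beliefState = beliefState.map((hypothesis, index) => ({
    description: hypothesis.description,
    probability: hypothesis.probability * observationScores[index],
  }));
  const total = beliefState.reduce((sum, hypothesis) => sum + hypothesis.probability, 0);
  beliefState = beliefState.map((hypothesis) => ({
    description: hypothesis.description,
    probability: hypothesis.probability / total,
  }));
}

function bestHypothesis(): StateHypothesis {
  // Control acts on the most likely hypothesis, but can fall back to another one
  // if the conversation later shows that the top guess was wrong.
  return beliefState.reduce((best, hypothesis) => (hypothesis.probability > best.probability ? hypothesis : best));
}
</code></pre><p>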
And the more training data is available, the better it is supposed to get at its job, promising a scalable approach to dialogue management.</p><p>While the probabilistic dialogue management approach comes with many advantages, there are also some things to consider.</p><p>First, it usually requires a lot of training data, which causes an initial investment for data labeling. This also means that bigger companies with more resources and usage are at an advantage.</p><p>Also, due to the machine learning approach, a lot of the decision making of the system is outsourced to a “black box” that can make it difficult to debug certain behavior in a deterministic way. The more training data and use cases are covered, the more difficult it might be to dive into problems and make changes.</p><p>Overall, probabilistic dialogue management is an important approach that I believe will be integrated into any conversational experience in the future. Tool builders around the world are also working on solving the issues mentioned above by combining different approaches. More on that in the next section.</p><h2 id="what-s-the-ideal-approach-to-dialogue-management">What’s the Ideal Approach to Dialogue Management?</h2><p>This post provided an overview of dialogue management with its two main tasks: context (<em>where in the conversation are we right now?</em>) and control (<em>where should we go next?</em>).</p><p>We took a closer look at these three popular approaches that manage conversations in different ways:</p><ul><li><strong>Finite State</strong>: Using a state machine (or flowchart)</li><li><strong>Form-based</strong>: Using a task record of slots to fill</li><li><strong>Probabilistic</strong>: Using training data</li></ul><p>We learned about the pros and cons of each of the methods. While rules-based dialogue management (finite state machines and form-based systems) offers full control over the user experience, it can be tricky to build it in a scalable way to respond to all sorts of unanticipated input. Probabilistic dialogue management offers more scalability with the potential drawbacks of needing large amounts of training data and giving up some control to machine learning models.</p><p>The question now is: What’s the ideal approach?</p><p>In my opinion, it doesn’t have to be black and white. For some interactions and use cases, a state machine with clear rules might make sense, for others a probabilistic method might be more effective. 
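</p><p>One way to combine them is a small dispatcher that routes each turn to whichever strategy fits best: explicit rules where control matters, a form for slot filling, and a trained policy for open-ended turns. A hypothetical sketch, with all handlers as stand-ins:</p><pre><code class="language-typescript">// Illustrative hybrid dispatch: explicit rules for critical steps, a form for slot
// filling, and a probabilistic policy for everything else. All handlers are stand-ins.
interface Turn {
  intent: string;
  formOpen: boolean;
}

const ruleBased = (turn: Turn) => 'Handled by an explicit rule for ' + turn.intent;
const formBased = (turn: Turn) => 'Prompting for the next missing slot';
const probabilistic = (turn: Turn) => 'Response chosen by the trained policy';

function routeTurn(turn: Turn): string {
  if (turn.intent === 'CancelBookingIntent') {
    return ruleBased(turn);       // deterministic behavior where full control matters
  }
  if (turn.formOpen) {
    return formBased(turn);       // delegate slot collection to the form
  }
  return probabilistic(turn);     // open-ended turns go to the learned model
}
</code></pre><p>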
Powerful conversational systems mix them for the best outcome.</p><p>For example, by mixing the concepts of a finite state machine and form-based dialogue management, you can declutter the process of input collection while still having a clear, deterministic process for next steps.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/dialogue-management-mixed-approach.png"><img src="https://ghost.jovo.tech/content/images/2021/05/dialogue-management-mixed-approach.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>Or, a probabilistic approach could delegate to a rules-based system for specific interactions.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2021/05/dialogue-management-probabilistic-rules-1.png"><img src="https://ghost.jovo.tech/content/images/2021/05/dialogue-management-probabilistic-rules-1.png" class="kg-image" alt="Dialogue Management: A Comprehensive Introduction (2021 Edition)"/></a></figure><p>There are some tools that already support a mix of these dialogue management types:</p><ul><li>Rasa offers rules, forms, and probabilistic methods.</li><li>Alexa offers the ability to build custom skill code (rules), a dialog interface for slot filling (forms) and the newly added Alexa Conversations (probabilistic) feature.</li></ul><p>I suspect that we’re going to see more tool providers add different methods of dialogue management in the near future. </p><h2 id="zooming-in-and-out">Zooming in and out</h2><p>While doing research for this post and looking at the example illustrations of mixed methods from the previous session, one thing became clear to me: One of the most important features of designing and building conversational systems is the ability to zoom in and out.</p><p>In some cases, it’s important to dive into all the details. What should I do if a user wants to make changes to their previous input? How do I prompt for a slot value? In other cases, it’s important to keep track of the bigger picture. Combining different dialogue management methodologies can help with this from both a design and development perspective. And the full value of this can only be unlocked if elements of the system are modularized in the right way.</p><p>This is why, in my next post, I’m going to introduce a topic called “<em>Atomic Design for Conversational Interfaces</em>.”</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet" data-width="550"><p lang="en" dir="ltr">Finally! 
I just published my longest article so far:<br><br>✨ Dialogue Management: A Comprehensive Introduction ✨<br><br>- 3,000+ words<br>- almost 30 illustrations<a href="https://t.co/AxJlQSUm4Y">https://t.co/AxJlQSUm4Y</a> <a href="https://t.co/LcpW3oICc1">pic.twitter.com/LcpW3oICc1</a></br></br></br></br></br></p>— Jan König (@einkoenig) <a href="https://twitter.com/einkoenig/status/1397551429863215105?ref_src=twsrc%5Etfw">May 26, 2021</a></blockquote> <script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"/> </figure><p><em>Thanks a lot to </em><a href="https://twitter.com/lehrjulian"><em>Julian Lehr</em></a><em>, </em><a href="https://twitter.com/marktucker"><em>Mark Tucker</em></a><em>, </em><a href="https://www.linkedin.com/in/andrew-francis-48233064/"><em>Andrew Francis</em></a><em>, </em><a href="https://twitter.com/solyarisoftware"><em>Giorgio Robino</em></a><em>, </em><a href="https://twitter.com/basche42"><em>Ben Basche</em></a><em>, </em><a href="https://twitter.com/modi74"><em>Manja Baudis</em></a><em>, </em><a href="https://twitter.com/ElleForLanguage"><em>Brielle Nickoloff</em></a><em>, </em><a href="https://twitter.com/techpeace"><em>Matt Buck</em></a><em>, </em><a href="https://twitter.com/alexswetlow"><em>Alex Swetlow</em></a><em>, and </em><a href="https://www.linkedin.com/in/lars-lipinski/"><em>Lars Lipinski</em></a><em> for reading drafts of this post.</em></p><p><em>I also tried a new experiment: While working on the post, I shared insights and open questions on Twitter. I learned a lot from the feedback and discussions there! </em><a href="https://twitter.com/einkoenig/timelines/1364848300243959814"><em>You can find a collection of all tweets here</em></a><em>.</em></p></body></html>]]></content:encoded></item><item><title><![CDATA[An Introduction to Voice and Multimodal Interactions]]></title><description><![CDATA[RIDR: Request - Interpretation - Dialog & Logic - Response]]></description><link>https://ghost.jovo.tech/introduction-voice-multimodal-interactions/</link><guid isPermaLink="false">Ghost__Post__5ec6edf6435a8f3636a35011</guid><category><![CDATA[Deep Dives]]></category><dc:creator><![CDATA[Jan König]]></dc:creator><pubDate>Fri, 18 Sep 2020 19:51:36 GMT</pubDate><media:content url="https://ghost.jovo.tech/content/images/2020/09/ridr-lifecycle-voice-multimodal.jpeg" medium="image"/><content:encoded><![CDATA[<html><head/><body><img src="https://ghost.jovo.tech/content/images/2020/09/ridr-lifecycle-voice-multimodal.jpeg" alt="An Introduction to Voice and Multimodal Interactions"/><p>“<em>Are you open tomorrow?</em>”</p><p>“<em>Yes, we open at 9am.</em>”</p><p>The conversation above seems very simple, right? The goal of most voice and chat interactions is to provide a user experience that is as simple as possible. However, this doesn’t mean that these types of interactions are as easy to design or to build. Quite the contrary. There is a lot going on under the hood that the users never see (or, in this case, hear).</p><p>And it only becomes more complex as more modalities are added: With additional interfaces like visual and touch, maybe even gestures or sensory input, the design and development challenge can become multidimensional quickly. This is why it’s important to have a clear, abstracted process when building multimodal experiences.</p><p>In this post, I will walk you through some of the steps involved in building voice and multimodal interactions. 
We’ll take a look at the RIDR lifecycle, a concept that we introduced with the <a href="https://www.context-first.com/introducing-jovo-v3-the-voice-layer/">launch of Jovo v3</a> earlier this year.</p><p>To kick things off, let’s take a look at a typical voice interaction and then see how this can be expanded to multimodal experiences.</p><h2 id="example-of-a-voice-interaction">Example of a Voice Interaction</h2><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/conversation-intro.png"><img src="https://ghost.jovo.tech/content/images/2020/09/conversation-intro.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>In the introduction of this post, a user asks a question (“<em>Are you open tomorrow?</em>”) and the system (e.g. a bot or an assistant) responds with a (hopefully appropriate) answer like “<em>Yes, we open at 9am.</em>”</p><p>This is what we call an <em>interaction</em>. <a href="https://www.jovo.tech/docs/requests-responses">In our definition</a>, an interaction is a single pair of a user request and a system response. </p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/request-response.png"><img src="https://ghost.jovo.tech/content/images/2020/09/request-response.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>What might appear like a simple interaction actually requires many steps under the hood to deliver a meaningful response. It looks more like this:</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/ridr-lifecycle.png"><img src="https://ghost.jovo.tech/content/images/2020/09/ridr-lifecycle.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>With the <a href="https://context-first.com/introducing-jovo-v3-the-voice-layer-bf369db4808e">launch of Jovo v3</a>, we introduced the RIDR (<em>Request - Interpretation - Dialog & Logic - Response</em>) lifecycle with the goal to establish an abstracted process to get from request to response and make it possible to plug into (interchangeable) building blocks for each step.</p><p>The pipeline includes four key elements:</p><ul><li><u>R</u>equest</li><li><u>I</u>nterpretation</li><li><u>D</u>ialog & Logic</li><li><u>R</u>esponse</li></ul><p>Let’s briefly take a look at each of the steps.</p><h3 id="request">Request</h3><p>The <em>Request</em> step starts the interaction and captures necessary data.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/ridr-request.png"><img src="https://ghost.jovo.tech/content/images/2020/09/ridr-request.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>If we use a voice-first device as an example, there are a few things that need to be handled, like:</p><ul><li>Knowing when to record input (e.g. after a button is pushed or using wake word detection)</li><li>Recording the input</li><li>Figuring out when the person stopped speaking (e.g. 
with silence detection)</li><li>Processing audio to be passed to the next step</li></ul><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/request-detail.jpg"><img src="https://ghost.jovo.tech/content/images/2020/09/request-detail.jpg" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>These initial steps usually happen on the device the user is interacting with. Platforms like Amazon Alexa do all these things for you, but if you want to build your own custom voice system (e.g. voice-enabling a web or mobile app, building your own hardware), you may want to handle everything yourself. Jovo Client libraries like “<a href="https://www.jovo.tech/marketplace/jovo-client-web">Jovo for Web</a>” help with some of these elements, like recording mechanisms, visual elements, silence detection, and audio conversions.</p><p>After the input is recorded, the request containing the audio is sent to the <em>Interpretation</em> step of the pipeline.</p><h3 id="interpretation">Interpretation</h3><p>The <em>Interpretation</em> step tries to make sense of the data gathered from the <em>Request</em>.<br/></p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/ridr-interpretation.png"><img src="https://ghost.jovo.tech/content/images/2020/09/ridr-interpretation.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>In our voice example, the previously recorded audio is now turned into structured meaning by going through multiple steps:</p><ul><li>An automated speech recognition (ASR) service turns the audio into text</li><li>The text is then turned into a structure with <em>intents</em> and <em>entities</em> by a natural language understanding (NLU) service</li><li>Additional steps could include speaker recognition, sentiment analysis, voice biometrics, and more</li></ul><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/interpretation-detail.png"><img src="https://ghost.jovo.tech/content/images/2020/09/interpretation-detail.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>The <em>Interpretation</em> takes an audio file as input, runs it through various services, and then outputs structured, machine-readable data. This is usually a result of a natural language understanding (NLU) service that is trained with multiple phrases:</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/nlu-intro.png"><img src="https://ghost.jovo.tech/content/images/2020/09/nlu-intro.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>The service then matches the text provided by the ASR to an intent and optionally a set of entities. 
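</p><p>The structured result that is handed to the next step might look roughly like this. The exact shape differs from one NLU service to another; this is just an illustration:</p><pre><code class="language-typescript">// Illustrative NLU output for "Are you open tomorrow?" (the exact shape differs per provider).
interface NluResult {
  text: string;                                   // what the ASR transcribed
  intent: { name: string; confidence: number };   // the matched intent
  entities: { name: string; value: string }[];    // additional structured information
}

const result: NluResult = {
  text: 'are you open tomorrow',
  intent: { name: 'OpenHours', confidence: 0.92 },
  entities: [{ name: 'date', value: 'tomorrow' }],
};
</code></pre><p>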
In our case it would be the <em>OpenHours</em> intent with additional information (“tomorrow”) in the form of an entity.</p><p>This structured data is then passed to the actual logic of the conversational app in <em>Dialog & Logic</em>.</p><h3 id="dialog-logic">Dialog & Logic</h3><p>The <em>Dialog & Logic</em> step determines what should be said to the user and how.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/ridr-dialog.png"><img src="https://ghost.jovo.tech/content/images/2020/09/ridr-dialog.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>This step involves a couple of important things, such as:</p><ul><li>User Context: Is this a new or existing user? Is any additional data about them available, like a preferred location?</li><li>Dialog Management: Where in the conversation are we right now? Is there any additional input we need to collect from the user?</li><li>Business Logic: Is there any specific information about the business that we need to retrieve?</li></ul><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/dialog-detail.png"><img src="https://ghost.jovo.tech/content/images/2020/09/dialog-detail.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>In the current example, we would collect data about opening hours, maybe specific to the user’s preferred location (if the system manages multiple locations).</p><p>All the necessary data is gathered and then handed off to the <em>Response</em>.</p><h3 id="response">Response</h3><p>In the final <em>Response</em> step, the data from the previous step is assembled into the appropriate output for the specific platform or device.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/ridr-response-1.png"><img src="https://ghost.jovo.tech/content/images/2020/09/ridr-response-1.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>This also usually involves:</p><ul><li>Collecting data from a content management system (CMS) with e.g. localization features</li><li>Sending the output to a text-to-speech (TTS) service that turns it into a synthesized voice</li></ul><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/response-detail.jpg"><img src="https://ghost.jovo.tech/content/images/2020/09/response-detail.jpg" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>The output is then played back to the user, either stopping the session (closing the microphone) or waiting for additional input (a new <em>Request</em>). Rinse and repeat.</p><p>This example of a voice-only interaction still seems like a manageable process to design and build. What if we add more modalities though? 
Let’s take a look.</p><h2 id="multimodal-experiences-beyond-voice">Multimodal Experiences Beyond Voice</h2><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/multimodal-request-response-1.png"><img src="https://ghost.jovo.tech/content/images/2020/09/multimodal-request-response-1.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>The previous example shows how RIDR works with a simple interaction that uses both speech input and output. Early voice applications for platforms like Amazon Alexa focused mainly on voice-only interactions, for example when people talk to a smart speaker without a display. (<em>Note: You could argue that the LED light ring—which indicates that Alexa is listening—already makes it a multimodal interaction.</em>)</p><p>As technology evolves, interactions between users and their devices are becoming increasingly complex. As I highlight in <a href="https://www.context-first.com/introducing-context-first/">Introducing Context-First</a>, products are becoming multimodal by default. This means that we’re seeing more interactions that either offer multiple input modalities (e.g. speech, touch, gestures) or output channels (e.g. speech, visual). <a href="https://www.context-first.com/alexa-please-send-this-to-my-screen/">Alexa, Please Send This To My Screen</a> covers how multimodal interactions can either be continuous, complementary, or consistent experiences.</p><p>Let’s take a look at some multimodal experiences and how they could work with RIDR.</p><h3 id="examples-of-multimodal-interactions">Examples of Multimodal Interactions<br/></h3><p>A multimodal experience could be as little as displaying additional (helpful) information on a nearby screen:</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/visual-output.png"><img src="https://ghost.jovo.tech/content/images/2020/09/visual-output.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>Or, the visual display could offer touch input for faster interactions, for example in the form of a button (they are sometimes called <em>quick replies</em>):</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/quick-replies.png"><img src="https://ghost.jovo.tech/content/images/2020/09/quick-replies.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>It gets increasingly interesting, and challenging to implement, when two input modalities are used in tandem:</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/gestures-voice.png"><img src="https://ghost.jovo.tech/content/images/2020/09/gestures-voice.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>The above is similar to <a href="https://www.youtube.com/watch?v=RyBEUyEtxQo">Put That There</a> which I mentioned in my previous post <a href="https://www.context-first.com/introducing-context-first/">Introducing Context-First</a>: Someone says something and provides additional context by pointing at an object. Interactions like this are challenging to decode and interpret. 
The process of making sense of multiple modalities is called <em>multimodal fusion</em>.</p><h3 id="multimodal-interactions-and-ridr">Multimodal Interactions and RIDR<br/></h3><p>Let’s take a look at how RIDR can be abstracted even more to work with multimodal interactions. The goal is that each of the building blocks can easily be replaced depending on the current context of the interaction.</p><p>For example, the <em>Request </em>step does not necessarily need to rely on just a microphone; it could also use a camera or sensors to collect user input. We could call each of these elements a <em>modality recorder</em>.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/request-multimodal.png"><img src="https://ghost.jovo.tech/content/images/2020/09/request-multimodal.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>The <em>Interpretation</em> step could also be divided into two distinct parts: a <em>recognizer</em> that turns raw input (like audio, video, even an emoji or a location) into a format that is easier to process for an <em>interpreter</em>.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/interpretation-multimodal.png"><img src="https://ghost.jovo.tech/content/images/2020/09/interpretation-multimodal.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>Not every interaction would need to go through each of the steps. Different input modalities might require different treatment:</p><ul><li>Text-based interactions (e.g. chatbots) can skip the speech recognition</li><li>Touch-based interactions need to take into account the payload of e.g. a button (where was it clicked?)</li><li>Vision-based interactions (e.g. gestures) need different steps that involve computer vision and interpretation</li></ul><p>As mentioned earlier, this can get even more complex as you add multiple modalities at once. For this, an additional step for <em>multimodal fusion</em> is introduced.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/multimodal-fusion.png"><img src="https://ghost.jovo.tech/content/images/2020/09/multimodal-fusion.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>The <em>Dialog & Logic</em> step from the voice example above can stay the same for now. We’ll take a deeper look at this in the next post, as there are many additional layers to dive into.</p><p>The <em>Response</em> step is another interesting one. Where voice requires a text-to-speech (TTS) service, a multimodal experience might need additional elements like visual output or a video avatar. We call services fulfilling this step <em>output renderers</em>.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/response-multimodal.png"><img src="https://ghost.jovo.tech/content/images/2020/09/response-multimodal.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>That’s how we currently envision multimodal user interfaces working under the hood. This model will be updated and improved as we iteratively learn and experiment with the addition of new modalities. 
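</p><p>To give the fusion step a more concrete shape, here is a deliberately simplified sketch of combining a spoken command with a pointing gesture, in the spirit of “Put That There”. It is an illustration, not how any shipping system implements fusion:</p><pre><code class="language-typescript">// Illustrative multimodal fusion: merge a spoken intent with a pointing gesture
// that happened at roughly the same time.
interface SpeechInput {
  intent: string;
  entities: { [key: string]: string };   // e.g. { object: 'that', location: 'there' }
  timestamp: number;
}

interface GestureInput {
  targetObjectId: string;                // what the user pointed at
  timestamp: number;
}

function fuse(speech: SpeechInput, gesture?: GestureInput): SpeechInput {
  const gapMs = gesture ? Math.abs(speech.timestamp - gesture.timestamp) : Number.MAX_VALUE;
  // Only fuse if there is a gesture, it is close enough in time,
  // and the utterance actually contains a deictic reference to resolve.
  if (!gesture || gapMs > 1500 || speech.entities['object'] !== 'that') {
    return speech;
  }
  return {
    intent: speech.intent,
    entities: { ...speech.entities, object: gesture.targetObjectId },
    timestamp: speech.timestamp,
  };
}
</code></pre><p>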
</p><h2 id="open-questions-and-outlook">Open Questions and Outlook<br/></h2><p>This post provided a first introduction to the many steps involved when building a seemingly simple voice interaction, and how this can be applied to multimodal experiences.</p><p>Again, this is a work in progress. Here are some additional questions I currently have:</p><ul><li>Right now, this only covers user-initiated (pull) request-response interactions. What if the system starts (push)? Could sensory data be used as a trigger?</li><li>Should interpretation and dialog/logic be tied together more closely? How about dialog/logic and response? <a href="https://blog.rasa.com/demonstration-of-our-ted-policy/">Rasa’s TED policy</a> is one example where the interpretation step is doing some dialog and response work.</li><li>Are there use cases where this abstraction doesn’t work at all? Do (currently still experimental) new models like <a href="https://openai.com/blog/openai-api/">GPT-3</a> work with this?</li></ul><p>While this was already getting a bit complex at the second half of the article, it was still a simple example. It was a single response to a user request. What if we wanted to provide a follow-up question, like asking the user if they want to book a table?</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/09/multiturn-dialog.png"><img src="https://ghost.jovo.tech/content/images/2020/09/multiturn-dialog.png" class="kg-image" alt="An Introduction to Voice and Multimodal Interactions"/></a></figure><p>There are many additional steps and challenges involved, especially in the <em>Dialog & Logic </em>step of the RIDR lifecycle. In the next post, I will provide an <a href="https://www.context-first.com/dialogue-management-introduction/">extensive introduction to dialogue management</a>.</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet" data-width="550"><p lang="en" dir="ltr">Finally! 
This post has been in the making for a while:<br><br>✨ An Introduction to Voice and Multimodal Interactions ✨<a href="https://t.co/5hue7SgDaA">https://t.co/5hue7SgDaA</a> <a href="https://twitter.com/hashtag/VoiceFirst?src=hash&ref_src=twsrc%5Etfw">#VoiceFirst</a> <a href="https://twitter.com/hashtag/ContextFirst?src=hash&ref_src=twsrc%5Etfw">#ContextFirst</a> 1/x <a href="https://t.co/ZqcydeOApe">pic.twitter.com/ZqcydeOApe</a></br></br></p>— Jan König (@einkoenig) <a href="https://twitter.com/einkoenig/status/1308025029817401345?ref_src=twsrc%5Etfw">September 21, 2020</a></blockquote> <script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"/> </figure><p><em>Thanks a lot for your valuable feedback </em><a href="https://twitter.com/dcoates"><em>Dustin Coates</em></a><em>, </em><a href="https://twitter.com/lehrjulian"><em>Julian Lehr</em></a><em>, </em><a href="https://twitter.com/ReamBraden"><em>Braden Ream</em></a><em>, </em><a href="https://twitter.com/rafalcymerys"><em>Rafal Cymeris</em></a><em>, </em><a href="https://twitter.com/n0rb3rtg"><em>Norbert Gocht</em></a><em>, </em><a href="https://twitter.com/alexswetlow"><em>Alex Swetlow</em></a><em>, and </em><a href="https://twitter.com/LaurenGolem"><em>Lauren Golembiewski</em></a><em>.</em></p><p><em>Photo by <a href="https://unsplash.com/@jaredkcreative?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Jared Weiss</a> on <a href="https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>.</em></p></body></html>]]></content:encoded></item><item><title><![CDATA[Introducing Context-First]]></title><description><![CDATA[Exploring voice and multimodal user interfaces in a new publication]]></description><link>https://ghost.jovo.tech/introducing-context-first/</link><guid isPermaLink="false">Ghost__Post__5f4cc956435a8f3636a35064</guid><category><![CDATA[Announcements]]></category><dc:creator><![CDATA[Jan König]]></dc:creator><pubDate>Mon, 31 Aug 2020 13:46:28 GMT</pubDate><media:content url="https://ghost.jovo.tech/content/images/2020/08/context-first-modalities-header-dark.jpg" medium="image"/><content:encoded><![CDATA[<html><head/><body><img src="https://ghost.jovo.tech/content/images/2020/08/context-first-modalities-header-dark.jpg" alt="Introducing Context-First"/><p>We live in a connected world.</p><p>We spend our days with various devices: Computers at work, phones and wearables on the go, smart speakers at home—our interactions with technology are spread across different touchpoints and interfaces.</p><p>The possibilities are endless. Yet, how we use technology often feels disconnected, even clunky.</p><p>Is this going to change? If yes, how and when?<br/></p><h2 id="towards-a-multimodal-multi-device-future">Towards a Multimodal, Multi-Device Future</h2><p>At least since <a href="https://www.youtube.com/watch?v=RyBEUyEtxQo">Put That There</a>, many have seen multimodal user interfaces as the holy grail in human-computer interaction. Being able to use any modality (speech, touch, gestures, eye gaze) that best fits your current situation is seen as liberating and inclusive for a wide range of users.</p><figure class="kg-card kg-embed-card"><iframe width="459" height="344" src="https://www.youtube.com/embed/RyBEUyEtxQo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""/></figure><p>Of course, this has been greatly influenced by SciFi, too. 
When people envision future multimodal user experiences, examples like Minority Report often come to mind where people are standing in front of gigantic machines that support all kinds of input modalities at once. Is this practical today?</p><!--kg-card-begin: html--><div style="width:100%;height:0;padding-bottom:50%;position:relative;"><iframe src="https://giphy.com/embed/BLVqLi1p4Pt7i" width="100%" height="100%" style="position:absolute" frameborder="0" class="giphy-embed" allowfullscreen=""/></div><p><a href="https://giphy.com/gifs/tom-cruise-steven-spielberg-minority-report-BLVqLi1p4Pt7i">via GIPHY</a></p><!--kg-card-end: html--><p>3.5 years ago (shortly before <a href="https://www.jovo.tech/">Jovo</a> joined betaworks <a href="https://betaworks.com/voicecamp/">Voicecamp</a>), I published an article called <a href="https://www.context-first.com/alexa-please-send-this-to-my-screen/">Alexa, Please Send This To My Screen</a> with some early thoughts on voice experience design. I came to the conclusion that most voice experiences need to be embedded into wider product ecosystems to become truly useful. To me, one of the biggest differences between SciFi and technology today is that our interactions are now spread across different devices. There is no more “one machine does everything.”</p><p>The new multimodal experience is not just one device that supports everything, it’s a product ecosystem that works across devices and contexts.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/08/multimodal-then-now.jpg"><img src="https://ghost.jovo.tech/content/images/2020/08/multimodal-then-now.jpg" class="kg-image" alt="Introducing Context-First"/></a></figure><p>And it seems like things are progressing. When I published the article, smart speakers were mostly disconnected voice-only devices. Just a few months later, the <a href="https://www.amazon.com/All-new-Echo-Show-2nd-Gen/dp/B077SXWSRP">Echo Show</a> was introduced as the first Alexa smart display and Google Assistant became more deeply integrated into Android phones, allowing for some early continuous experiences like “send this to my phone.” I’m confident that it soon will be normal for any device to be shipped with a microphone.</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/08/voice-then-now-soon.jpg"><img src="https://ghost.jovo.tech/content/images/2020/08/voice-then-now-soon.jpg" class="kg-image" alt="Introducing Context-First"/></a></figure><p>With all these new devices added to the product mix, it gets more and more complex to design, build, and improve connected experiences. Which devices and modalities should you build for? It’s easy to lose track of the user's needs with all these technological possibilities.<br/></p><h2 id="context-first-a-term-and-publication">Context-First, a Term and Publication</h2><blockquote>“<em>The right information, at the right time, on the right device.</em>”</blockquote><p>The quote from above was adapted from Michal Levin’s great book <a href="https://www.oreilly.com/library/view/designing-multi-device-experiences/9781449340391/">Designing Multi-Device Experiences</a> that I read a few years ago. 
For us, this is what contextual experiences should be all about: delivering value on the best available device and modality.</p><p>At <a href="https://www.jovo.tech">Jovo</a>, we’ve been mentioning the term <em>Context-First</em> for a while, most prominently in our <a href="https://www.context-first.com/introducing-jovo-v3-the-voice-layer/">v3 launch announcement</a>. There is so much overlap between “voice” and “multimodal” experiences that it always felt weird to us to use terms like “voice-first.” With more devices available, context plays an even more important role in product design decisions. It was important for us to find a term that reflects this.</p><p>We want to use this publication for documentation and open learning about building context-first products. When I started working on <a href="https://twitter.com/einkoenig/status/993818598530527232">my master’s thesis</a>, there was not a lot of actionable information available on multimodal user interfaces. It was either too concrete (focused on the “here”, e.g. simplistic tutorials) or too abstract (focused on the “there”, e.g. very visionary, futuristic essays). The industry still lacks content that focuses on the in-between (<a href="https://twitter.com/einkoenig/status/1259830479517343744">here are some counterexamples</a>).</p><figure class="kg-card kg-image-card"><a href="#" data-featherlight="https://ghost.jovo.tech/content/images/2020/08/actionable-middle.png"><img src="https://ghost.jovo.tech/content/images/2020/08/actionable-middle.png" class="kg-image" alt="Introducing Context-First"/></a></figure><p>Here are some topics that we will likely cover:</p><ul><li>Challenges that need to be solved to get from “here” to “there”</li><li>Deep dives into different device types and modalities</li><li>Case studies and examples that focus on a specific vertical</li><li>Providing business context for the more <a href="https://www.jovo.tech/news">technical Jovo announcements</a><br/></li></ul><p>What are you interested in? Say hi on Twitter and let me know! 
<a href="https://twitter.com/einkoenig">@einkoenig</a></p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet" data-width="550"><p lang="en" dir="ltr">I just published a new blog post:<br><br>✨ Introducing Context-First ✨<a href="https://t.co/6pJhjTkDaF">https://t.co/6pJhjTkDaF</a> 1/x <a href="https://t.co/JX5NfmoIaO">pic.twitter.com/JX5NfmoIaO</a></br></br></p>— Jan König (@einkoenig) <a href="https://twitter.com/einkoenig/status/1300475469255651328?ref_src=twsrc%5Etfw">August 31, 2020</a></blockquote> <script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"/> </figure></body></html>]]></content:encoded></item><item><title><![CDATA[Introducing Jovo v3, the Voice Layer]]></title><description><![CDATA[Doubling down on voice experiences that work anywhere]]></description><link>https://ghost.jovo.tech/introducing-jovo-v3-the-voice-layer/</link><guid isPermaLink="false">Ghost__Post__5ec2a085435a8f3636a34f98</guid><category><![CDATA[Announcements]]></category><dc:creator><![CDATA[Jan König]]></dc:creator><pubDate>Thu, 27 Feb 2020 17:41:18 GMT</pubDate><media:content url="https://ghost.jovo.tech/content/images/2020/05/jovo-v3-header.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><html><head/><body><strong>We just launched <a href="https://www.jovo.tech" target="_blank">Jovo</a> v3, which enables you to build highly flexible voice experiences that now work across even more devices and platforms, including Samsung Bixby, Twilio Autopilot, Web Apps, and more. To learn more, take a look at <a href="https://www.jovo.tech/news/2020-02-25-jovo-v3" target="_blank">the technical announcement</a> and <a href="https://github.com/jovotech/jovo-framework" target="_blank">our GitHub</a>, or <a href="https://www.jovo.tech/enterprise" target="_blank">contact sales</a> for enterprise support.</strong><img src="https://ghost.jovo.tech/content/images/2020/05/jovo-v3-header.jpg" alt="Introducing Jovo v3, the Voice Layer"/><p/> <p>A little over a year ago, we <a href="https://medium.com/@einkoenig/introducing-jovo-framework-v2-c98326ac4aca" target="_blank">released Jovo v2</a>, a more powerful way to build voice applications that work across platforms like Alexa and Google Assistant.</p> <p>Jovo has since become the go-to open source framework for voice app development, powering tens of millions of live interactions every month, with 20,000+ created projects and 70+ contributors to the open source code so far.</p> <p>And today, we’re excited to take Jovo to the next level: Jovo v3 is the open source voice layer that allows you to integrate voice into any product or device.</p> <figure class="wp-caption"> <p><a href="#" data-featherlight="https://cdn-images-1.medium.com/max/1200/1*xaxCm5BZe-dZ2LS1PtHBRA.jpeg"><img data-width="1690" data-height="694" src="https://cdn-images-1.medium.com/max/1200/1*xaxCm5BZe-dZ2LS1PtHBRA.jpeg" alt="Introducing Jovo v3, the Voice Layer"/></a></p><figcaption class="wp-caption-text">Build voice experiences for Alexa, Google Assistant, Samsung Bixby, Twilio, the web, and more with Jovo v3</figcaption></figure> <p>Here are some of the features that got added with v3 (<a href="https://www.jovo.tech/news/2020-02-25-jovo-v3" target="_blank">read the technical announcement for more details</a>):</p> <ul> <li>Jovo RIDR Pipeline: ASR, NLU, and TTS Integrations</li> <li>Build voice and chat experiences for the web, mobile, and custom hardware</li> <li>Build capsules for Samsung Bixby with Jovo</li> <li>Build phone assistants with 
Jovo and Twilio Autopilot</li> <li>New features that make voice app development easier and more powerful: Conversational Components and Input Validation</li> </ul> <p>You can also watch the video here:</p> <p><iframe width="500" height="281" src="https://www.youtube.com/embed/8eFaUeBD2FY?feature=oembed" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""/></p> <h3>The Most Important Trend in Voice Today</h3> <p>When we started working on Jovo, voice was mostly seen as a <em>platform</em>. Companies built apps on top of the major voice assistants mostly for marketing reasons: they wanted to reach new potential customers through Alexa and Google Assistant.</p> <p>The more we work in this space, the more we think that smart speakers can be seen as training wheels. Consumers are growing to expect to be able to use voice as a <em>modality</em> when interacting with technology. We sometimes call this the <em>Smart Speaker Ripple Effect</em>.</p> <p>When voice becomes one of the most important elements of human-machine interactions, do companies really want to put all their fate in the hands of a few gatekeepers? We’re already seeing that companies are starting to build their own specialized assistants.</p> <div class="embed-twitter"> <blockquote class="twitter-tweet" data-width="500" data-dnt="true"> <p lang="en" dir="ltr">"The most important trend in voice today" 💯 </p> <p>Moving away from just integrating with general purpose assistants, towards building owned, specialized assistants. <a href="https://twitter.com/hashtag/AAV19?src=hash&ref_src=twsrc%5Etfw">#AAV19</a> <a href="https://t.co/rCNlyK8C1L">pic.twitter.com/rCNlyK8C1L</a></p> <p>— Jan König (@einkoenig) <a href="https://twitter.com/einkoenig/status/1182572629758218246?ref_src=twsrc%5Etfw">October 11, 2019</a></p></blockquote> <p><script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"/></p></div> <p>Businesses are starting to think in product ecosystems today, and we’re providing the layer to add voice to the mix.</p> <h3>Jovo v3: The Open Source Voice Layer</h3> <p>The biggest addition in Jovo v3 is our RIDR Pipeline that enables you to build voice experiences that work with any device or platform.</p> <figure class="wp-caption"> <p><a href="#" data-featherlight="https://cdn-images-1.medium.com/max/1200/1*gOwDkF3muAgOUyOWZBWrdQ.jpeg"><img data-width="1920" data-height="999" src="https://cdn-images-1.medium.com/max/1200/1*gOwDkF3muAgOUyOWZBWrdQ.jpeg" alt="Introducing Jovo v3, the Voice Layer"/></a></p><figcaption class="wp-caption-text">Jovo RIDR Pipeline</figcaption></figure> <p>The Jovo Framework is free and open source and allows you to replace and customize any building block. 
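</p> <p>To make the building-block idea a bit more concrete, here is a minimal sketch (in TypeScript) of how a Jovo v3 app can be wired up. It is meant as an illustration rather than a copy-paste setup: the intent name and the responses below are made up for this example.</p> <pre><code>// Illustrative sketch: platforms (and, in the same way, ASR/NLU/TTS plugins)
// are registered as replaceable modules next to the dialogue logic.
import { App } from 'jovo-framework';
import { Alexa } from 'jovo-platform-alexa';
import { GoogleAssistant } from 'jovo-platform-googleassistant';

const app = new App();

// Swap or add integrations here without touching the handlers below.
app.use(new Alexa(), new GoogleAssistant());

app.setHandler({
  LAUNCH() {
    // ask() keeps the session open and waits for the user's answer.
    this.ask('We open at 9 am. Do you want to book a table?', 'Do you want to book a table?');
  },
  // The intent name is illustrative; it depends on your language model.
  YesIntent() {
    // tell() responds and closes the session.
    this.tell('Great, your table is booked!');
  },
});

export { app };
</code></pre> <p>Replacing Google Assistant with another platform, or adding a speech or NLU integration for custom hardware, follows the same pattern: register a different plugin with <code>app.use()</code>.</p> <p>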
In a time where voice becomes an essential part of product strategies, we want to help companies stay in control and embed our tools into their own infrastructure and scalable enterprise stack.</p> <p>And this is just the beginning.</p> <h3>Jovo’s Mission: Enabling “Context-First” Experiences</h3> <blockquote><p>The right information, at the right time, on the right device.</p></blockquote> <p>We’re especially excited for our web and mobile integrations because they open up a lot of possibilities for multimodal experiences.</p> <p>Almost 3 years ago, I published this article about building multimodal user experiences:</p> <p><a href="https://context-first.com/alexa-please-send-this-to-my-screen-6f4839eb415a">https://context-first.com/alexa-please-send-this-to-my-screen-6f4839eb415a</a></p> <p>A lot has changed in the last 3 years, our vision remained the same: We want to enable product teams to build product systems that deliver the right information, at the right time, on the right device.</p> <p>Jovo v3 is the next step towards connected multimodal user experiences.</p> <p>We can’t wait to continue that journey with you.</p> <h3>Next Steps</h3> <ul> <li> <strong>Get started</strong>: Follow our <a href="https://www.jovo.tech/docs/quickstart" target="_blank">Quickstart Guide</a>.</li> <li> <strong>Support Jovo</strong>: <a href="https://github.com/jovotech/jovo-framework" target="_blank">Give us a star on GitHub</a> </li> <li> <strong>Need help?</strong> Contact us for an <a href="https://www.jovo.tech/enterprise" target="_blank">enterprise support package</a> </li> </ul> <!--kg-card-end: html--></body></html>]]></content:encoded></item><item><title><![CDATA[Data schema]]></title><description><![CDATA[This is a data schema stub for Gatsby.js and is not used. It must exist for builds to function]]></description><link>https://demo.ghost.io/data-schema-page/</link><guid isPermaLink="false">Ghost__Post__5bbafb3cb7ec4135e42fce56</guid><category><![CDATA[Data schema primary]]></category><dc:creator><![CDATA[Data Schema Author]]></dc:creator><pubDate>Tue, 04 Dec 2018 13:59:14 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1532630571098-79a3d222b00d?ixlib=rb-0.3.5&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=1080&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ&s=a88235003c40468403f936719134519d" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1532630571098-79a3d222b00d?ixlib=rb-0.3.5&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=1080&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ&s=a88235003c40468403f936719134519d" alt="Data schema"/><p>This is a data schema stub for Gatsby.js and is not used. 
It must exist for builds to function</p>]]></content:encoded></item><item><title><![CDATA[Alexa, Please Send This To My Screen]]></title><description><![CDATA[Thoughts on designing seamless voice experiences]]></description><link>https://ghost.jovo.tech/alexa-please-send-this-to-my-screen/</link><guid isPermaLink="false">Ghost__Post__5ec2a085435a8f3636a34f93</guid><category><![CDATA[Thoughts]]></category><dc:creator><![CDATA[Jan König]]></dc:creator><pubDate>Wed, 08 Mar 2017 15:46:05 GMT</pubDate><media:content url="https://ghost.jovo.tech/content/images/2020/05/alexa-multiscreen.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><html><head/><body><blockquote><img src="https://ghost.jovo.tech/content/images/2020/05/alexa-multiscreen.jpg" alt="Alexa, Please Send This To My Screen"/><p>“Alexa, what’s new?”</p></blockquote> <p><em>I walk out of the bedroom and listen to the news while preparing breakfast.</em></p> <p><em>I’ve gotten very used to Alexa. It’s not that it has dramatically changed how I consume content, but it’s a great way to get things done hands-free while focusing on other stuff in the living room or the–</em></p> <p>Wait.</p> <p>My mind’s wandering again.</p> <p><em>What was Alexa just saying? Who was that guy again, the one who was just mentioned? I recognize his name, but I can’t remember what he looks like.</em></p> <p>Let’s find out.</p> <p>I’m reaching for my phone, which is on the table next to my couch.</p> <blockquote><p><strong>TL;DR: We need to think in ecosystems and build voice interfaces into continuous and complementary experiences to finally make them useful.</strong></p></blockquote> <p>Here’s why:</p> <h4>Voice as an Interface</h4> <p>There are now more than 10,000 skills for Amazon Alexa (= apps for the virtual assistant living inside of Amazon Echo). However, we’re just beginning to grasp what place voice will take in our everyday lives. Currently, it’s not the best experience. Most voice skills have enormous problems with <a href="http://www.recode.net/2017/1/23/14340966/voicelabs-report-alexa-google-assistant-echo-apps-discovery-problem" target="_blank">retention</a>.</p> <p>Voice skills will probably go through the same hype as chatbots did: most will feel like search experiences that could also be done on a website. I don’t see this as a problem, it’s typical when a new interface emerges.</p> <p>As <a href="https://medium.com/u/230bb844b2ac" target="_blank">John Borthwick</a> <a href="https://render.betaworks.com/listening-to-bots-1b22688160c#.g3a30vf1y" target="_blank">mentioned</a>, the first 6 months after the introduction of the App Store, new iOS apps rather resembled iterations of web pages. It took time to develop native experiences. And it will be the same with voice interfaces.</p> <h4><strong>What is a Voice Interface?</strong></h4> <p>With the increasing amounts of Amazon Echo devices in households, now everyone seems to be talking about voice interfaces. 
But what are we exactly talking about?</p> <blockquote><p>“I define a Voice Computing product as any bot-like service which has no GUI (Graphic User Interface) at all — its only input or output from a user’s perspective is through his or her voice and ears.” — <a href="https://medium.com/u/8fe961141dd3" target="_blank">Matt Hartman</a> </p></blockquote> <p>One of the most important differences I see in voice interfaces is how distinctively we can separate usage between the two forms of interaction: <em>input </em>and<em> output.</em></p> <figure> <p><a href="#" data-featherlight="https://cdn-images-1.medium.com/max/1200/1*l8o4yYmmxw5csmfTli6G9Q.jpeg"><img data-width="1808" data-height="889" src="https://cdn-images-1.medium.com/max/1200/1*l8o4yYmmxw5csmfTli6G9Q.jpeg" alt="Alexa, Please Send This To My Screen"/></a><br> </br></p></figure> <p><strong>🎤 Voice as Input is about controlling software just by saying something.</strong></p> <p>Whatever I’m doing, preparing food, cleaning the kitchen, I can still access a computer without having to reach my phone, the remote, or go to my laptop. Even <em>when</em> I’m on my laptop, I can still ask for stuff without having to stop what I’m currently doing.</p> <p><em>Today, you can already </em><a href="http://www.tomsguide.com/us/connect-philips-hue-amazon-echo,review-3471.html" target="_blank"><em>control your lights</em></a><em> or </em><a href="https://support.myharmony.com/en-us/harmony-experience-with-amazon-alexa" target="_blank"><em>change TV channels</em></a><em> with Alexa.</em></p> <p><strong>👂 Voice as Output is about consuming information just by listening.</strong></p> <p>While this isn’t something completely new (we’ve been running around wearing headphones for years), it’s still worth mentioning. I can listen to the news or weather forecasts while doing something else.</p> <p>Voice as output is especially interesting as it opens up the possibility to be connected to information at times we cannot look at a screen.</p> <p>When brainstorming new skills for Alexa, <a href="https://medium.com/u/167881cce7c4" target="_blank">Alexander</a> and I are constantly trying to separate <em>input </em>from <em>output</em> to understand how we could add the most value to current solutions.</p> <p>But currently, that’s very difficult to do with Alexa skills.</p> <h3><strong>Alexa: Input -> Output</strong></h3> <figure> <p><a href="#" data-featherlight="https://cdn-images-1.medium.com/max/1200/1*wL1XkNXAbKbat9hMx0GpHQ.jpeg"><img data-width="1808" data-height="889" src="https://cdn-images-1.medium.com/max/1200/1*wL1XkNXAbKbat9hMx0GpHQ.jpeg" alt="Alexa, Please Send This To My Screen"/></a><br> </br></p></figure> <p>Currently, Alexa mostly offers <em>input </em>and <em>output</em> bundled together. We say something and get information right away.</p> <p>These two forms of interaction can work extremely well together in several contexts (in the kitchen, on the couch), but they can also be completely separated. I wouldn’t talk to my phone on the bus, but would definitely love to listen to things. And in other cases, I might want to tell my apps to do something, but consume the content later.</p> <p>And I believe this is especially interesting when we take into account that contexts can change while we’re interacting with technology.</p> <h4><strong>Shifting Contexts</strong></h4> <p>We’re always talking about the interaction with voice while doing something else like cooking or watching TV. 
However, what I found my Echo most useful for is when I’m preparing to leave the house: I’m asking a Berlin-based transportation skill which route to take.</p> <p>The first part of the information (which subway to take, how long it takes, when to leave) is perfect at home. But I need to consume additional information (where to transfer, where to walk afterwards) in a different context: while on the go.</p> <p>Currently, an interaction like this is not possible, as most skills lack the functionality to move content across devices.</p> <h4><strong>Current Skills are Silos</strong></h4> <figure class="wp-caption"> <p><a href="#" data-featherlight="https://cdn-images-1.medium.com/max/800/1*52IQqfB1VvpbD8P8O01dmQ.jpeg"><img data-width="1920" data-height="1080" src="https://cdn-images-1.medium.com/max/800/1*52IQqfB1VvpbD8P8O01dmQ.jpeg" alt="Alexa, Please Send This To My Screen"/></a></p><figcaption class="wp-caption-text">Almost no interaction between voice interfaces and other devices</figcaption></figure> <p>For skill developers (without a lot of development effort, without an existing user base in their own native apps), it’s currently almost impossible to build something other than immediate <em>input/output</em> apps.</p> <p>There’s some functionality available by Amazon with their <a href="https://developer.amazon.com/public/solutions/alexa/alexa-skills-kit/docs/providing-home-cards-for-the-amazon-alexa-app" target="_blank">Home Cards</a> and even the possibility to display content on a <a href="http://lovemyecho.com/2016/04/29/voicecast-the-little-known-tip-that-gives-amazon-echos-alexa-a-screen/" target="_blank">Fire Tablet</a>. But that’s it.</p> <p>In our opinion, however, it’s critical to think about experiences that connect different devices that work best for each context.</p> <h3>Multi-Device Experiences</h3> <blockquote><p>“Greater benefit would come from people getting the <strong>right thing</strong>, at the <strong>right time</strong>, on the <strong>best (available) device”<em>–Michal Levin</em></strong> </p></blockquote> <p>With the beginning of mobile technology and fragmented devices and screen sizes, people started thinking about ways to provide the best experience across platforms.</p> <p>I read the book “<a href="https://www.amazon.com/dp/B00IFMZVK4/" target="_blank">Designing Multi-Device Experiences</a>” by <a href="https://twitter.com/michall79" target="_blank">Michal Levin</a> about 2 years ago and was fascinated by her thinking about using context to design product experiences across devices. And I find it especially useful now, as we need to add voice to the mix as well.</p> <p>In her 3Cs framework, Levin describes how products can be designed to serve the needs across devices:</p> <ul> <li>Consistent</li> <li>Continuous</li> <li>Complementary</li> </ul> <h4><strong>Consistent: Keep the same experience across devices</strong></h4> <figure> <p><a href="#" data-featherlight="https://cdn-images-1.medium.com/max/800/1*7XjZnbHx6AySuv7Ww5JDiQ.jpeg"><img data-width="1920" data-height="480" src="https://cdn-images-1.medium.com/max/800/1*7XjZnbHx6AySuv7Ww5JDiQ.jpeg" alt="Alexa, Please Send This To My Screen"/></a><br> </br></p></figure> <p>This approach displays content in the same way across the ecosystem (with some adjustments, e.g. smaller fonts). 
Responsive design can still be seen as some form of consistent design, as it’s still displaying the same content across devices.</p> <p>The problem with current bots and voice skills is that they mostly use a <em>consistent</em> approach, as they replicate search experiences from the web and mobile apps. And for some contexts, that’s great! For others, not so much.</p> <p>But there are two more approaches that are interesting to look into:</p> <h4><strong>Continuous: The experience is passed from one device to the other</strong></h4> <figure> <p><a href="#" data-featherlight="https://cdn-images-1.medium.com/max/800/1*qnMJHf3qkmZJ8YFWw76vrg.jpeg"><img data-width="1920" data-height="480" src="https://cdn-images-1.medium.com/max/800/1*qnMJHf3qkmZJ8YFWw76vrg.jpeg" alt="Alexa, Please Send This To My Screen"/></a><br> </br></p></figure> <p>Great continuous experiences can be ones where you do the research on your desktop at home and then access the content while on the go (e.g. shopping list in a cooking app, Google Maps directions).</p> <p>I have always been a great fan of tools like <a href="https://www.pushbullet.com/" target="_blank">Pushbullet</a> for pushing content across devices. Or this functionality in Google Maps:</p> <figure> <p><a href="#" data-featherlight="https://cdn-images-1.medium.com/max/800/1*0hX-9WPLgJAGJTPXNU-aNQ.png"><img data-width="987" data-height="792" src="https://cdn-images-1.medium.com/max/800/1*0hX-9WPLgJAGJTPXNU-aNQ.png" alt="Alexa, Please Send This To My Screen"/></a><br> </br></p></figure> <p><em>Continuous</em> design is perfect for use cases that could undergo shifting contexts, like a change of location.</p> <h4><strong>Complementary: Devices work as a connected group and complement each other in a new experience.</strong></h4> <figure> <p><a href="#" data-featherlight="https://cdn-images-1.medium.com/max/800/1*c6SrhIXl6-bvW56srYUasQ.jpeg"><img data-width="1920" data-height="480" src="https://cdn-images-1.medium.com/max/800/1*c6SrhIXl6-bvW56srYUasQ.jpeg" alt="Alexa, Please Send This To My Screen"/></a><br> </br></p></figure> <p>This can be done either with <em>collaboration</em> (devices work with each other) or <em>control</em> (one device is used as some sort of remote).</p> <p>An example of complementary design could be a smart home hub that controls your lights from your tablet, or an iPad multiplayer game that uses iPhones as controllers.</p> <figure class="wp-caption"> <p><a href="#" data-featherlight="https://cdn-images-1.medium.com/max/800/1*G8wD1f0F5nUwOvcp_CQsJQ.png"><img data-width="595" data-height="401" src="https://cdn-images-1.medium.com/max/800/1*G8wD1f0F5nUwOvcp_CQsJQ.png" alt="Alexa, Please Send This To My Screen"/></a></p><figcaption class="wp-caption-text">Image taken from <a href="http://www.geeky-gadgets.com/padracer-ipad-game-uses-iphone-as-its-controller-29-04-2010/" target="_blank">here</a></figcaption></figure> <p>And here is where it gets interesting: Some smart home skills for Alexa are already complementary experiences.</p> <p>To have a more seamless experience across devices, I wish there were more skills that use <em>continuous</em> and <em>complementary</em> design.</p> <h3>Voice in a Multi-Device World</h3> <figure> <p><a href="#" data-featherlight="https://cdn-images-1.medium.com/max/800/1*nytEFgHS5TyBMexuaMbQqA.jpeg"><img data-width="1920" data-height="1080" src="https://cdn-images-1.medium.com/max/800/1*nytEFgHS5TyBMexuaMbQqA.jpeg" alt="Alexa, Please Send This To My Screen"/></a><br> </br></p></figure> <p>Voice 
will never be the only interface we use. But it will be one with significantly growing usage.</p> <p>We believe that voice should be intertwined in current technology and allow exchanging information across devices. Software should allow us to shift contexts, as we interact with it. It should offer the right interface at the right moment.</p> <p>Or, as <a href="https://medium.com/u/5c6977d2a94f" target="_blank">M.G. Siegler</a> put it, it should be “<a href="https://500ish.com/computing-in-concert-679e68c14b8a#.ez37b6jh7" target="_blank">computing in concert.</a>”</p> <h4><strong>Example: Mosaic</strong></h4> <p>One of the only examples that show signs of multi-device experiences that incorporate voice is <a href="http://saymosaic.com" target="_blank">Mosaic</a>.</p> <p>The startup offers chatbots and voice skills that help you talk to your home. When I got the most important news (in a morning workflow, which is a feature Mosaic offers), Alexa asked me if I want to have the article sent. I said “send article” and immediately got a message by the Mosaic chatbot with a link.</p> <figure> <p><a href="#" data-featherlight="https://cdn-images-1.medium.com/max/800/1*95ZzDIJLgtPZ1WW4e7VBvA.jpeg"><img data-width="1920" data-height="1080" src="https://cdn-images-1.medium.com/max/800/1*95ZzDIJLgtPZ1WW4e7VBvA.jpeg" alt="Alexa, Please Send This To My Screen"/></a><br> </br></p></figure> <p>For me, this is a great, seamless experience. At a later point, I would love to immediately add articles like this to my Instapaper.</p> <h3><strong>Ecosystem Thinking</strong></h3> <blockquote><p>“Deliver a <strong>product ecosystem</strong> that serves the end-to-end user journey across devices” — Michal Levin</p></blockquote> <p>Experiences like this require product designers to think in ecosystems: What are the contexts my users could be in, and how could I deliver information in the best way across devices?</p> <p>I believe this will become more important as we’re increasingly moving between devices and interface modalities, as <a href="https://medium.com/u/2229dec1a44f" target="_blank">Chris Messina</a> put it:</p> <div class="embed-twitter"> <blockquote class="twitter-tweet" data-width="500" data-dnt="true"> <p lang="en" dir="ltr">What happens in between screens? This is the white space for 2017. <a href="https://twitter.com/hashtag/wpnext17?src=hash&ref_src=twsrc%5Etfw">#wpnext17</a> <a href="https://t.co/pL4tNiFv5E">pic.twitter.com/pL4tNiFv5E</a></p> <p>— ADLANDIA (@adlandiapodcast) <a href="https://twitter.com/adlandiapodcast/status/837695497544564736?ref_src=twsrc%5Etfw">March 3, 2017</a></p></blockquote> <p><script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"/></p></div> <h4><strong>Current Challenges</strong></h4> <p>Currently, Amazon’s <a href="https://developer.amazon.com/public/solutions/alexa/alexa-skills-kit/docs/linking-an-alexa-user-with-a-user-in-your-system" target="_blank">Account Linking</a> only allows linking a skill to <em>one </em>account. This makes it extremely difficult for voice-first startups that don’t have existing user bases with native apps on their smartphones.</p> <p>To allow cross-device pushing, developers would need to set up their own oAuth system, develop an app, make the user download the app, and link the <em>skill-user</em> to the <em>app-user</em>. And this is quite a hassle to do for one skill.</p> <p>We figure Amazon (and other voice platforms) are currently working on a solution, but don’t know exactly how it could look like. 
And hopefully, this won’t result in walled gardens.</p> <h3>Conclusion</h3> <p>The challenge with current voice experiences is to build something that seamlessly fits into current product ecosystems.</p> <p>To do this, we need to think about creating <em>continuous</em> and <em>complementary</em> interfaces based on the context the user is in. And we also need to keep in mind that context can change while a user interacts with software.</p> <p>Building voice into cross-device experiences is difficult to do, but necessary to take it to the next level.</p> <hr> <h3>A few more points to think about</h3> <p>Here are some more thoughts that fall into this category, but didn’t really fit in the text above.</p> <h4><strong>The Interaction Effect</strong></h4> <p>As I’m writing this, I’m still thinking in terms of old experiences, based on what I learned from websites, native apps, and chatbots. It’s difficult to imagine how the mainstream usage of voice could change how people use their smartphones or computers. This is why I believe it’s important to come back to these thoughts in a few months (and years) to see if anything has changed.</p> <p>Michal Levin calls this the Interaction Effect: “<em>The usage patterns of one device change depending on the availability of another device.</em>”</p> <h4><strong>Notifications</strong></h4> <p>Push notifications are the thinnest available layer that allows apps to make content accessible across devices. And the prevalence of messaging apps makes notifications even better. Why? You can plug into an existing platform (like Mosaic did with Facebook Messenger) and let the user decide where to consume the content.</p> <p>Some people have noticed a shift from <a href="http://cdixon.org/2014/12/21/two-eras-of-the-internet-pull-and-push/" target="_blank">pull to push</a> in recent years, where we’re moving from people actively seeking information to software knowing when to push the right information.</p> <p>However, I’m wondering how the current developments of voice interfaces fit into that trend, as they mostly allow us to pull information rather than push.</p> <figure> <p><a href="#" data-featherlight="https://cdn-images-1.medium.com/max/800/1*N_BEdFDdpBedhYWRD1PSqQ.jpeg"><img data-width="1920" data-height="480" src="https://cdn-images-1.medium.com/max/800/1*N_BEdFDdpBedhYWRD1PSqQ.jpeg" alt="Alexa, Please Send This To My Screen"/></a><br> </br></p></figure> <p>It will probably always be an interplay of both push and pull. However, I find it interesting to think about when software should appear and when not:</p> <h4><strong>Be like Batman</strong></h4> <p>I believe the rise of voice will make <a href="https://en.wikipedia.org/wiki/Adaptive_user_interface" target="_blank">adaptive user interfaces</a> even more important. Whether you call it <a href="https://en.wikipedia.org/wiki/Ubiquitous_computing" target="_blank">ubiquitous computing</a> or <a href="https://render.betaworks.com/interfaces-on-demand-336d38123080" target="_blank">on-demand interfaces</a>, it will be necessary for software to only appear when it’s needed.</p> <p><em>Like Batman.</em></p> <p>I once wrote an article about onboarding and used this quote:</p> <blockquote><p>“The signal goes on and he shows up. That’s the way it’s been, that’s the way it will be.” ― James Gordon</p></blockquote> <p>Today, I feel it has become even more important. 
<a href="https://medium.com/welcome-aboard/batman-onboarding-999d19f0cab9" target="_blank">Check it out here.</a></p> <p>If you want to read more about adaptive systems, I highly recommend <a href="https://www.smashingmagazine.com/2012/12/creating-an-adaptive-system-to-enhance-ux/" target="_blank">this article</a> by <a href="https://medium.com/u/bceda12eabd5" target="_blank">Avi Itzkovitch</a> (old, but gold).</p> <!--kg-card-end: html--></hr></body></html>]]></content:encoded></item></channel></rss>