The Rise Of Intelligent Conversational UI
Burke Holland
2018-04-05T13:00:48+02:00
For a long time, we’ve thought of interfaces strictly in a visual sense: buttons, dropdown lists, sliders, carousels (please no more carousels). But now we are staring into a future composed not just of visual interfaces, but of conversational ones as well. Microsoft alone reports that three thousand new bots are built every week on their bot framework. Every. Week.
The importance of Conversational UI cannot be overstated, even if some of us wish it wasn’t happening.
The most important advancement in Conversational UI has been Natural Language Processing (NLP). This is the field of computing that deals not with deciphering the exact words a user said, but with parsing their actual intent out of those words. If the bot is the interface, NLP is the brain. In this article, we’re going to take a look at why NLP is so important, and how you (yes, you!) can build your own.
Speech Recognition vs. NLP
Most people will be familiar with Amazon Echo, Cortana, Siri or Google Home, all of which have an interface that is primarily conversational. They are also all using NLP.
Aside from these intelligent assistants, most Conversational UIs have nothing to do with voice at all. They are text driven. These are the bots we chat with in Slack, Facebook Messenger or over SMS. They deliver high quality gifs in our chats, watch our build processes and even manage our pull requests.
Conversational UIs built on text are nice because there is no speech recognition component. The text is already parsed.
When it comes to a verbal interaction, the fundamental problem is not recognizing the speech. We’ve mostly got that one down.
OK, so maybe it’s not perfect. I still get voicemails every day like a game of Mad Libs that I never asked to play. iOS just sticks a blank line in whenever it doesn’t know exactly what was said.
Google, on the other hand, just tries to guess. Like this one from my father. I have absolutely no idea what this message is actually trying to say other than “Be Safe” which honestly sounds like my mom, and not my dad. I have a hard time believing he ever said that. I don’t trust the computer.
I’m picking on voice mail transcriptions here, which might be the hardest speech recognition to do given how degraded the audio quality is.
Nevertheless, speech recognition is largely a solved problem. It’s even built right into Chrome and it works remarkably well.
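In fact, if you want to try it yourself, here’s a minimal sketch of the speech recognition that ships in Chrome (webkit-prefixed at the time of writing) — just a few lines of JavaScript:

// A minimal sketch of the speech recognition built into Chrome.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;

const recognition = new SpeechRecognition();
recognition.lang = 'en-US';

recognition.onresult = (event) => {
  // The transcript of whatever the user just said
  const transcript = event.results[0][0].transcript;
  console.log(`You said: ${transcript}`);
};

recognition.start();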
After we solved the problem of speech recognition, we started to use it everywhere. That was unfortunate because speech recognition on its own doesn’t do us a whole lot of good. Interfaces that rely solely on speech recognition require the user to state things in a precise way, and they can only state the limited set of exact words or phrases that the interface knows about. This is not natural. This is not how a conversation works.
Without NLP, Conversational UI can be a true nightmare.
Conversational UI Without NLP
We’re probably all familiar with automated phone menus. These are known as Interactive Voice Response systems — or IVRs for short. They are designed to take the place of the traditional operator and automatically transfer callers to the right place without having to talk to a human. On the surface, this seems like a good idea. In practice, it’s mostly just you waiting while a recorded voice reads out a list of menu items that “may have changed.”
A 2011 study from New York University found that 83% of people feel IVR systems “provide either no benefit at all, or only a cost savings benefit to the company.” They also noted that IVR systems “score lower than any other service option.” People would literally rather do anything else than use an automated phone menu.
NLP has changed the IVR market rather significantly in the past few years. NLP can pick a user’s intent out of anything they say, so it’s better to just let them say it and then determine if you support the action.
Check out how AT&T does it.
AT&T has a truly intelligent Conversational UI. It uses NLP to let me just state my intent. Also, notice that I don’t have to know what to say. I can fumble all around and it still picks out my intent.
AT&T also uses information that it already has (my phone number) and then leverages text messaging to send me a link to a traditional visual UI, which is probably a much better UX for making a payment. NLP drives the whole experience here. Without it, the rest of the interaction would not be nearly as smooth.
NLP is powerful, but more importantly, it is also accessible to developers everywhere. You don’t have to know a thing about Machine Learning (ML) or Artificial Intelligence (AI) to use it. All you need to know how to do is make an AJAX call. Even I can do that!
Building An NLP Interface
So much of Machine Learning still remains inaccessible to developers. Even the best YouTube videos on the subject quickly become hard to follow with topics like Neural Networks and Gradient Descent. We have, however, made significant progress in the field of Language Processing, to the point that it’s accessible to developers of nearly any skill level.
Natural Language Processing differs based on the service, but the overall idea is that the user has an intent, and that intent contains entities. That means exactly nothing to you at the moment, so let’s work up a hypothetical Home Automation bot and see how this works.
The Home Automation Example
In the field of Natural Language Processing, the canonical “Hello World” is usually a Home Automation demo. This is because it helps to clearly demonstrate the fundamental concepts of NLP without overloading your brain.
A Home Automation Bot is a service that can control hypothetical lights in a hypothetical house. For instance, we might want to say “Turn on the kitchen lights”. That is our intent. If we said “Hello”, we are clearly expressing a different intent. Inside of that intent, there are two pieces of information that we need to complete the action:

- Where the lights are that we want to control (the kitchen);
- Whether we want them on or off (on).

These two pieces of information (Location, Power) are known as entities.
When we are finished designing our NLP interface, we are going to be able to call an HTTP endpoint and pass it our intent: “Turn on the kitchen lights.” That endpoint will return to us the intent (Control Lights) and two objects representing our entities: Location and Power. We can then pass those into a function which actually controls our lights…
function controlLights(location, power) {
  console.log(`Turning ${power} the ${location} lights`);
  // TODO: Call an imaginary endpoint which controls lights
}
There are a lot of NLP services out there that are available today for developers. For this example, I’m going to show the LUIS project from Microsoft because it is free to use.
LUIS is a completely visual tool, so we won’t actually be writing any code at all. We’ve already talked about Intents and Entities, so you already know most of the terminology that you need to know to build this interface.
The first step is to create a “Control Lights” intent in LUIS.
Before I do anything with that intent, I need to define my Location and Power entities. Entities can be different types — kind of like types in a programming language. You can have dates, lists and even entities that are related to other entities. In this case, Power is a list of values (on, off) and Location is a simple entity, which can be any value.
It will be up to LUIS to be smart enough to figure out exactly what the Location is.
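Conceptually, the two entities we just defined look something like this. (This is just an illustration to make the idea concrete, not the format LUIS actually stores them in.)

// Conceptual picture of the two entities — an illustration only.
const entities = {
  // A list entity: a closed set of allowed values.
  Power: { type: 'list', values: ['on', 'off'] },

  // A simple entity: free-form, learned from labeled examples.
  Location: { type: 'simple' }
};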
Now we can begin to train this model to understand all of the different ways that we might ask it to control the lights in a different location. Let’s think of all the different ways that we could do that:
- Turn off the kitchen lights;
- Turn off the lights in the office;
- The lights in the living room, turn them on;
- Lights, kitchen, off;
- Turn off the lights (no location).
As I feed these into the Control Lights intent as utterances, LUIS tries to determine where in the intent the entities are. You can see that because Power is a discrete list of values, it gets that right every time.
But it has no idea what a Location even is. LUIS wants us to go through this list and tell it where the Location is. That’s done by clicking on a word or group of words and assigning it to the right entity. As we are doing this, we are really creating a machine learning model that LUIS is going to use to statistically estimate what qualifies as a Location.
When I’m done telling LUIS where in these utterances all the locations are, my dashboard looks like this…
Now we train the model by clicking on the “Train” button at the top. Do you feel like a data scientist yet?
Now I can test it using the test panel. You can see that LUIS is already pretty smart. The Power is easy to pick out, but it can actually pick out Locations it has never seen before. It’s doing what your brain does — using the information that it has to make an educated guess. Machine Learning is equal parts impressive and scary.
If we try hard enough, we can fool the AI. The more utterances we give it and label, the smarter it will get. I added 35 utterances to mine before I was done, and it is close to bulletproof.
So now we get to the important part, which is how we actually use this NLP in an app. LUIS has a “Publish” menu option which allows us to publish our model to the internet where it’s exposed via a single HTTP endpoint. It will look something like this…
https://westus.api.cognitive.microsoft.com/luis/v2.0/apps/c4396135-ee3f-40a9-8b83-4704cddabf7a?subscription-key=19d29a12d3fc4d9084146b466638e62a&verbose=true&timezoneOffset=0&q=
The very last part of that query string is the q= variable. This is where we pass the text that we want LUIS to parse.
https://westus.api.cognitive.microsoft.com/luis/v2.0/apps/c4396135-ee3f-40a9-8b83-4704cddabf7a?subscription-key=19d29a12d3fc4d9084146b466638e62a&verbose=true&timezoneOffset=0&q=turn on the kitchen lights
The response that we get back is just a JSON object.
{
  "query": "turn on the kitchen lights",
  "topScoringIntent": {
    "intent": "Control Lights",
    "score": 0.999999046
  },
  "intents": [
    {
      "intent": "Control Lights",
      "score": 0.999999046
    },
    {
      "intent": "None",
      "score": 0.0532306843
    }
  ],
  "entities": [
    {
      "entity": "kitchen",
      "type": "Location",
      "startIndex": 12,
      "endIndex": 18,
      "score": 0.9516622
    },
    {
      "entity": "on",
      "type": "Power",
      "startIndex": 5,
      "endIndex": 6,
      "resolution": {
        "values": [
          "on"
        ]
      }
    }
  ]
}
Now this is something that we can work with as developers! This is how you add NLP to any project — with a single REST endpoint. Now you’re free to create a bot with some real brains!
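Here’s a minimal sketch of what that wiring might look like: it calls the published endpoint with fetch, picks the entities out of the JSON above, and hands them to the controlLights function from earlier. The app ID and subscription key are placeholders; use the URL from your own “Publish” page.

// A sketch of wiring the LUIS response into controlLights().
// YOUR_APP_ID and YOUR_KEY are placeholders for your own published model.
const LUIS_ENDPOINT =
  'https://westus.api.cognitive.microsoft.com/luis/v2.0/apps/YOUR_APP_ID' +
  '?subscription-key=YOUR_KEY&verbose=true&timezoneOffset=0&q=';

async function handleUtterance(utterance) {
  const response = await fetch(LUIS_ENDPOINT + encodeURIComponent(utterance));
  const result = await response.json();

  if (result.topScoringIntent.intent === 'Control Lights') {
    // Pull the two entities we trained out of the response
    const location = result.entities.find(e => e.type === 'Location');
    const power = result.entities.find(e => e.type === 'Power');

    if (location && power) {
      controlLights(location.entity, power.entity);
    }
  }
}

handleUtterance('turn on the kitchen lights');
// => "Turning on the kitchen lights"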
Brian Holt used the browser speech API and a LUIS model to create a voice-powered calculator that runs right inside of CodePen. Chrome is required for the speech API.
See the Pen Voice Calculator by Brian Holt (@btholt) on CodePen.
Bot Design Is Still Hard
Having a smart bot is only half the battle. We still need to account for all of the actions that our system might expose, and that can lead to a lot of different logical paths, which makes for messy code.
Conversations also happen in stages, so the bot needs to be able to intelligently direct users down the right path without frustrating them or leaving them stranded when something goes wrong. It needs to be able to recover when the conversation dies midstream and then starts up again. That’s a whole other article, and I’ve included some resources below to help.
When it comes to language understanding, the AI platforms are mature and ready to use today. While that won’t help you perfectly design your bot, it will be a key component to building a bot that people don’t hate.
Great UI Is Just Great UI
A final note: As we saw from the AT&T example, a truly smart interface combines great speech recognition, Natural Language Processing, different types of conversational UI (speech and text) and even a visual UI. In short, great UI is just that — great UI — and it is not a zero sum game. Great UIs will leverage all of the technology available to provide the best possible user experience.
Special thanks to Mat Velloso for his input on this article.
Further Resources:
- “Principles Of Bot Design,” Microsoft Azure (Doc written by Mat Velloso, Kamran Iqbal, and Robert Standefer)
- “Conversational Design Essentials: Tips For Building A Chatbot,” Anton Skorniakov, Smashing Magazine
- “Language Understanding Tutorials,” LUIS, Microsoft Azure
(rb, ra, yk, il)