Alexa Fundamentals

Ankur Jain
Published in The Startup
5 min read · Nov 30, 2020

Overview

When smartphones first arrived, what made the experience magical, apart from texting and calling, was the explosion of mobile applications. Without mobile apps, it would still have been a phone, just not a smart one.

Similarly, when the Amazon Echo first arrived, it was a cool toy that could do a handful of things. What made the Amazon Echo what it is today is the ability to build skills for Alexa. What mobile applications are to smartphones, Alexa skills are to the Amazon Echo.

The table below will help you to understand how similar these two things are even with their differences.

Now that we are clear on what Alexa skills are, let’s dive into some of the fundamentals. Specifically, this article covers the following points:

  1. What’s the flow for an actual Alexa skill?
  2. What are the various components involved?
  3. What is an intent?
  4. What is a slot?
  5. What is a skill interaction model?
  6. What actually is VUI?
  7. What’s next?

What’s the flow for an actual Alexa skill?

Image credit: developer.amazon.com

Whenever you talk to Alexa, the audio is streamed to the Alexa service, and the following things happen:

  1. ASR (automatic speech recognition): Using ASR, the Alexa service converts the streamed audio to text.
  2. NLU (natural language understanding): The converted text is passed to NLU, which interprets it and breaks it into meaningful tokens.
  3. Using these tokens, the Alexa service works out which skill the request is meant for and which action (intent) is expected of that skill.
  4. The skill’s backend service receives the request with the action/intent, processes it, and sends a response back to the Alexa service.
  5. The Alexa service converts the response to speech and streams it back to your device.
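
To make step 4 a little more concrete, here is a minimal Python sketch of what a backend might do with the request it receives. It assumes a heavily simplified version of the JSON request and response exchanged with the Alexa service (only the fields used here are shown). The function names handle_request and build_response and the reply texts are made up for illustration; a real skill would normally use the ASK SDK rather than raw dictionaries.

```python
def build_response(speech: str) -> dict:
    """Wrap plain text in a simplified Alexa-style JSON response."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }


def handle_request(event: dict) -> dict:
    """Pick a reply based on the request type and the intent name."""
    request = event.get("request", {})
    if request.get("type") == "IntentRequest":
        intent_name = request.get("intent", {}).get("name")
        if intent_name == "TellStoryIntent":
            # Hypothetical reply; a real skill would look up a story here.
            return build_response("Once upon a time...")
    return build_response("Sorry, I did not understand that.")


if __name__ == "__main__":
    # Simplified request, as the backend might see it after ASR and NLU.
    sample_event = {
        "request": {
            "type": "IntentRequest",
            "intent": {"name": "TellStoryIntent"},
        }
    }
    reply = handle_request(sample_event)
    print(reply["response"]["outputSpeech"]["text"])
```

The backend only ever sees structured data: the audio-to-text and text-to-intent work has already been done by the Alexa service before this function is called.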

What are the various components involved?

Typically, an Alexa skill is invoked like this:

  1. Wake word: This word wakes up Alexa, which means the device starts streaming audio to the Alexa service.
  2. Invocation name: This is your skill’s name, which tells the Alexa service which skill to invoke.
  3. Utterance: The utterance is the sentence that tells the skill to do something.
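
As a rough illustration of these three parts, the Python sketch below splits a typed-out request into wake word, invocation name and utterance. This is only a toy: Alexa does this work in the cloud on audio, not with string operations. The function split_request and the example sentence are assumptions made for this sketch, and "story time" is the invocation name used in the interaction model later in this article.

```python
def split_request(spoken: str, invocation_name: str, wake_word: str = "alexa") -> dict:
    """Split a spoken request (as text) into its three parts."""
    text = spoken.lower().strip()
    if not text.startswith(wake_word):
        raise ValueError("wake word not found")

    # Everything after the wake word, without the comma or spaces that follow it.
    rest = text[len(wake_word):].lstrip(" ,")

    # Text before the invocation name is a launch word such as "ask".
    _launch_word, found, after = rest.partition(invocation_name)
    if not found:
        raise ValueError("invocation name not found")

    utterance = after.strip()
    # Drop the connecting word "to" if the utterance starts with it.
    if utterance.startswith("to "):
        utterance = utterance[len("to "):]

    return {
        "wake_word": wake_word,
        "invocation_name": invocation_name,
        "utterance": utterance,
    }


if __name__ == "__main__":
    parts = split_request(
        "Alexa, ask story time to tell me a princess story", "story time"
    )
    print(parts)
```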

What is an Intent?

The wake word and invocation name are pretty self-explanatory, so let’s look closely at the utterance. Every utterance tells your skill to do something specific, that is, to perform an action. In Alexa skill terms, this action is called an intent.

When you design your skill, you lay out all the tasks you want it to perform. The list of these tasks or actions is the list of intents of your skill.

For instance, from the example above we can define an intent named TellStoryIntent. To define an intent, you give it a name and then write a number of utterances that should invoke it, so that Alexa knows when to invoke this intent. By providing a set of utterances, you give the Alexa engine hints so it can run its ML magic and work out when to invoke the intent. For the intent TellStoryIntent, we might define utterances like:

  1. Tell me a princess story
  2. Do you know some story
  3. I want to listen to a story
  4. Can you tell me a haunted story please
  5. I would love to hear a romantic story.
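
The relationship between an intent and its sample utterances can be sketched as plain data in Python. The lookup below is a deliberately naive exact-match stand-in, not how Alexa resolves intents: the real service uses machine learning and generalizes beyond the samples you provide. The names INTENTS, normalize and resolve_intent are hypothetical.

```python
import re
from typing import Optional

# Intent name -> sample utterances, taken from the list above.
INTENTS = {
    "TellStoryIntent": [
        "Tell me a princess story",
        "Do you know some story",
        "I want to listen to a story",
        "Can you tell me a haunted story please",
        "I would love to hear a romantic story.",
    ],
}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and collapse repeated spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return " ".join(cleaned.split())


def resolve_intent(utterance: str) -> Optional[str]:
    """Return the intent whose sample matches exactly, or None."""
    wanted = normalize(utterance)
    for intent_name, samples in INTENTS.items():
        if any(normalize(sample) == wanted for sample in samples):
            return intent_name
    return None


if __name__ == "__main__":
    print(resolve_intent("tell me a princess story"))
    print(resolve_intent("What is the weather today"))
```

Notice that every story type needs its own sample here, which is exactly the problem the next section addresses.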

What is a slot?

If you look at the utterances in the section above, you will find a loophole. Let’s say your skill can handle 10 different types of story, e.g. princess story, romantic story, haunted story, moral story, animal story, fairy story etc. If you start writing utterances to cover every type, you will soon find that the number of combinations becomes unmanageable. To overcome this, you turn the story type into a variable, i.e. a slot, which is defined to accept a set of values.

For example, let’s say you define a slot named StoryType, which can take values like princess, romantic, haunted etc. Now you can change your intent’s utterances to something like this:

  1. Tell me a {StoryType} story
  2. Do you know {StoryType} story
  3. I want to listen to a story
  4. Can you tell me a {StoryType} story please
  5. I would love to hear a {StoryType} story.
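
One way to picture how a slot works is to treat each sample utterance as a template and the slot as a named placeholder. The Python sketch below converts the samples above into regular expressions and pulls out the slot value. This is an illustration only: Alexa fills slots with its own language models, not with regular expressions, and the helper names sample_to_regex and extract_slots are made up for this sketch.

```python
import re
from typing import Optional

# Allowed values per slot (a subset of the story types mentioned above).
SLOT_VALUES = {"StoryType": ["princess", "romantic", "haunted"]}

SAMPLES = [
    "Tell me a {StoryType} story",
    "Do you know {StoryType} story",
    "I want to listen to a story",
    "Can you tell me a {StoryType} story please",
    "I would love to hear a {StoryType} story",
]


def sample_to_regex(sample: str) -> re.Pattern:
    """Turn 'Tell me a {StoryType} story' into a regex with a named group."""
    # re.split with a capture group puts slot names at the odd indexes.
    parts = re.split(r"\{(\w+)\}", sample)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part.lower())
        else:
            choices = "|".join(re.escape(value) for value in SLOT_VALUES[part])
            pattern += f"(?P<{part}>{choices})"
    return re.compile(f"^{pattern}$")


def extract_slots(utterance: str) -> Optional[dict]:
    """Return the slot values of the first matching sample, or None."""
    text = utterance.lower().strip()
    for sample in SAMPLES:
        match = sample_to_regex(sample).match(text)
        if match:
            return match.groupdict()
    return None


if __name__ == "__main__":
    print(extract_slots("Tell me a haunted story"))
    print(extract_slots("I want to listen to a story"))
```

Four templates now cover every story type, and adding a new type only means adding one value to the slot, not a new set of utterances.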

What is an Interaction Model?

The interaction model for a skill is nothing but the collection of intents and slots for that skill. Your skill may perform more than one action and hence have more than one intent. Similarly, your skill can be configured with multiple slots. A JSON document containing the list of all these intents and slots is known as the interaction model.

{
  "interactionModel": {
    "languageModel": {
      "invocationName": "story time",
      "intents": [
        {
          "name": "TellStoryIntent",
          "slots": [
            {
              "name": "StoryType",
              "type": "StoryTime_StoryType"
            }
          ],
          "samples": [
            "Tell me a {StoryType} story",
            "Do you know {StoryType} story",
            "I want to listen to a story",
            "Can you tell me a {StoryType} story please",
            "I would love to hear a {StoryType} story"
          ]
        },
        {
          "name": "StoryCountIntent",
          "slots": [
            {
              "name": "StoryType",
              "type": "StoryTime_StoryType"
            }
          ],
          "samples": [
            "Tell me how many {StoryType} stories do you have",
            "How many {StoryType} stories do you know",
            "Count the number of {StoryType} stories"
          ]
        }
      ],
      "types": [
        {
          "name": "StoryTime_StoryType",
          "values": [
            {
              "name": {
                "value": "princess"
              }
            },
            {
              "name": {
                "value": "romantic"
              }
            },
            {
              "name": {
                "value": "fairy"
              }
            },
            {
              "name": {
                "value": "haunted"
              }
            },
            {
              "name": {
                "value": "christmas"
              }
            },
            {
              "name": {
                "value": "animal"
              }
            },
            {
              "name": {
                "value": "indian"
              }
            },
            {
              "name": {
                "value": "moral"
              }
            }
          ]
        }
      ]
    }
  }
}
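
Because the interaction model is plain JSON, it is easy to sanity-check with a few lines of code before uploading it. The Python sketch below loads a trimmed-down copy of the model above and checks that every custom slot type used by an intent is declared under "types". The helper name undefined_slot_types is hypothetical, and the check is only one example of what you might verify. Built-in slot types, whose names start with "AMAZON.", are skipped because they do not need to be declared.

```python
import json

# Trimmed-down copy of the interaction model shown above.
MODEL_JSON = """
{
  "interactionModel": {
    "languageModel": {
      "invocationName": "story time",
      "intents": [
        {
          "name": "TellStoryIntent",
          "slots": [{"name": "StoryType", "type": "StoryTime_StoryType"}],
          "samples": ["Tell me a {StoryType} story"]
        }
      ],
      "types": [
        {
          "name": "StoryTime_StoryType",
          "values": [
            {"name": {"value": "princess"}},
            {"name": {"value": "haunted"}}
          ]
        }
      ]
    }
  }
}
"""


def undefined_slot_types(model: dict) -> set:
    """Return custom slot types that intents use but 'types' does not declare."""
    language_model = model["interactionModel"]["languageModel"]
    declared = {slot_type["name"] for slot_type in language_model.get("types", [])}
    used = {
        slot["type"]
        for intent in language_model["intents"]
        for slot in intent.get("slots", [])
        if not slot["type"].startswith("AMAZON.")  # built-in types need no declaration
    }
    return used - declared


if __name__ == "__main__":
    model = json.loads(MODEL_JSON)
    intents = model["interactionModel"]["languageModel"]["intents"]
    print([intent["name"] for intent in intents])
    print(undefined_slot_types(model))
```

json.loads also rejects trailing commas and curly quotes, so simply parsing the file catches the most common copy-and-paste mistakes.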

What actually is VUI?

Just as a GUI designer carefully plans a website to provide a good user experience, an Alexa skill developer must carefully design the VUI, i.e. the voice user interface. As a skill developer, you need to plan which intents the skill will support and how users can easily invoke them. You also need to plan for the various paths a user can take while interacting with your skill and design the voice flow accordingly.

Doing this activity beforehand will help you, as a developer, write the best interaction model you can and give your users a rich and delightful experience.

What’s next?

In this tutorial I tried to explain the main elements of a voice skill. You can always refer to the official Alexa documentation to learn more: https://developer.amazon.com/en-US/alexa/alexa-skills-kit/start

In the next tutorial we will go through the basics of setting up a “Hello world” Alexa skill! Till then, ciao ciao!

Ankur Jain

Hi, I am a software developer and I like to learn and share :)