How Does My Voice Assistant Know What I’m Saying?


Editor’s Note: In “Make Me Smarter,” a writer looks at a piece of smarthome technology they don’t understand, and talks to experts about how it actually works.

For more than half a decade, I’ve used Siri to set my alarms.

Every night, before going to bed, I hold down the button and tell my phone to wake me at a given time, with very little understanding of how that actually happens. If anybody asked me, I would stumble through a vague answer along the lines of “something something waveform analysis, something something machine learning.”

Now, several years and dozens of preset alarms later, I’ve set out to get a better sense of a tool I’ve come to rely on, and one of a kind that has generated some popular concern.

In my quest for knowledge, I learned quite a few cool things.

Alexander Rudnicky, a research professor with the Carnegie Mellon Speech Group, explained to me how programs like Alexa or Siri translate the words I say into text, how a program analyzes that text to understand what I want it to do (in this case, wake me at 8:00), and how it lets me know that it’s done the job (the depressingly familiar announcement “Your 8:00 a.m. alarm is on”). And Meredith Broussard, a data journalist and professor at NYU, brought up the “unreasonable effectiveness of data”: these programs get so good at what they do, understanding the way people speak, by gathering and analyzing as many examples as possible.

To get a better handle on how these programs work, I spoke individually with Broussard and Rudnicky. A transcript of our conversations — edited for clarity and length — appears below.

How does Siri, or any other voice assistant, translate the words I’m saying into something it can understand?

Meredith Broussard: The computer takes in the waveform of your speech. Then it breaks that up into words, which it does by looking at the micro pauses you take in between words as you talk.
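That splitting step can be sketched in a few lines. This is only a toy illustration of the idea Broussard describes, not how any real assistant is built; the sample values and thresholds are invented, and real systems work on dense audio streams, not short lists.

```python
# Toy sketch of silence-based segmentation: split a waveform into
# "word" chunks wherever the amplitude stays near zero for a while.

def split_on_silence(samples, threshold=0.05, min_gap=3):
    """Return runs of consecutive loud samples, split at quiet gaps."""
    words, current, quiet_run = [], [], 0
    for s in samples:
        if abs(s) < threshold:
            quiet_run += 1
            if quiet_run >= min_gap and current:
                words.append(current)   # a long enough pause ends a word
                current = []
        else:
            quiet_run = 0
            current.append(s)
    if current:
        words.append(current)
    return words

# Two bursts of sound separated by a stretch of near-silence:
wave = [0.4, 0.5, 0.3] + [0.01] * 5 + [0.6, 0.7]
print(split_on_silence(wave))  # [[0.4, 0.5, 0.3], [0.6, 0.7]]
```

The pause acts as the delimiter: each returned chunk is one candidate “word” to hand to the recognizer.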

Alexander Rudnicky: Let’s say we have a particular speech sound, like the word “one.” If I have a couple thousand examples of “one,” I can compute the statistics of its acoustic properties, and the more data I have (the more samples of “one”), the more precise the description becomes. And once I have that, I can build fairly powerful recognition systems.
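Here is a minimal sketch of that statistical idea: fit a simple model (a mean and a spread) to many examples of each word, then classify a new sound by which model fits it best. The single “feature” number per recording is made up for illustration; real recognizers use vectors of spectral features and far richer models.

```python
import math

def fit(samples):
    """Summarize many examples of a sound by mean and variance."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var

def log_likelihood(x, model):
    """How well a new measurement fits a Gaussian model."""
    mean, var = model
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Pretend acoustic feature values from many recordings of each word:
one_model = fit([1.1, 0.9, 1.0, 1.2, 0.8])
two_model = fit([2.0, 2.2, 1.9, 2.1, 2.3])

new_sound = 1.05
best = max([("one", one_model), ("two", two_model)],
           key=lambda m: log_likelihood(new_sound, m[1]))
print(best[0])  # the new sound fits the "one" model better
```

More samples tighten the estimated mean and variance, which is exactly why more data makes “the description” more precise.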

Once it’s “heard” me, how does a voice assistant act on what I’m saying?

Broussard: It has to use a kind of grammar to put those words into what the computer understands as concepts, in order to decide what to do in response, as a command. And that used to be very, very difficult to do, because we only had rules for voice commands. So if I said “Siri, send an email,” maybe it could recognize that. But if I said “Hey Siri, I want to send an email to Joseph,” it couldn’t. What we have now are advances in artificial intelligence that allow the computer to guess that “Siri, send an email” and “Hey, I want to send an email to Joseph!” are probably the same thing.
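The contrast Broussard draws can be sketched as two toy parsers: one that only accepts an exact command phrase, and one that matches on the key ideas in any phrasing. Real assistants use trained statistical models rather than keyword lists; this only illustrates why rigid rules break.

```python
def rigid_parse(utterance):
    """Old style: only one exact command phrase is recognized."""
    return "SEND_EMAIL" if utterance == "Siri, send an email" else None

def flexible_parse(utterance):
    """Looser style: any phrasing containing the key ideas matches."""
    words = utterance.lower().replace(",", "").split()
    if "email" in words and ("send" in words or "write" in words):
        return "SEND_EMAIL"
    return None

print(rigid_parse("Hey Siri, I want to send an email to Joseph"))     # None
print(flexible_parse("Hey Siri, I want to send an email to Joseph"))  # SEND_EMAIL
```

Both utterances map to the same intent under the flexible matcher, which is the behavior modern systems learn from data instead of hand-written keyword rules.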

We’ve made really great mathematical advances that allow AI to seem smarter, but one of the important things to remember is that AI is actually just math. It’s easy to get confused between what is real in AI and our Hollywood images of AI. Just because the computer can take sound and turn sound waves into representations, into words, doesn’t actually mean there’s a brain inside the computer.

We know that Alexa’s always listening, even if it’s not always recording what you say. How does it know when to actively listen?

Broussard: There’s always a pattern. There’s an invocation word that basically turns on the program. With Alexa, you say “Alexa, play ‘California Girls’” and it’s listening and recognizing the waveform of “Alexa.” When it sees that waveform, it knows to start the program that takes in several seconds of recording, turns it into a command, and gives it to the computer, which executes that command.
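The always-listening loop she describes can be sketched with a stream of already-recognized words standing in for raw audio. Nothing is kept until the invocation word appears; then the next few words are captured and handed off as a command. The word lists, wake word handling, and fixed capture window are all simplifications for illustration.

```python
WAKE_WORD = "alexa"

def listen(stream, command_length=3):
    """Scan a word stream; on the wake word, capture a short command."""
    commands = []
    it = iter(stream)
    for word in it:
        if word.lower() == WAKE_WORD:
            # Capture the next few words as the command window.
            command = [next(it, "") for _ in range(command_length)]
            commands.append(" ".join(w for w in command if w))
    return commands

chatter = ["what", "a", "day", "alexa", "play", "california", "girls",
           "anyway", "as", "I", "was", "saying"]
print(listen(chatter))  # ['play california girls']
```

Everything before and after the wake word is discarded; only the short window after “alexa” becomes a command.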

Why does Siri sometimes have trouble understanding me (or, apparently, most of Scotland)?

Broussard: Voice recognition and responding to voice commands are really hard problems computationally. We’ve made amazing strides forward, but there’s still a fair way to go. Accents, for example, are still really hard for these kinds of systems, and so are complex commands.

There are fundamental limits to what you can do with math, and there are really infinite variations out there in the world when it comes to people’s voices and people’s accents. Even if you managed to record every single person in the entire world saying every single word that exists right now — which, by the way, you couldn’t — then you would still have the problem of new words. And then there are things that happen with speech. I might learn Korean next year and then I’d start speaking in Korean, but I’d have a terrible accent. Or what if I had some sort of accident that affected my voice, like a stroke. It would change the ground truth. And computers are not great at catching up. The world changes really fast.
