
‘Okay Google, Play ‘Dura.’’: Voice Assistants Still Can’t Understand Bilingual Users

Photo-Illustration: Select All

I’m hoping to get the latest Daddy Yankee song going on my Google Home because nothing gets my house cleaner than a steady stream of reggaeton. The promise of bumping along to uninterrupted jams from room to room is one of the reasons I allowed my husband to set up three devices in our one-bedroom apartment. The problem is that most of the time it can’t understand me when I pronounce Spanish words in Spanish.

This time, the virtual assistant apologizes for being unable to find songs from Dora the Explorer. I try again, saying the Spanish word with a heavy American accent. Instantly my Google Home begins streaming the song.

It’s frustrating because, as someone who doesn’t get the chance to practice my Spanish enough, I want the few times I do to be correct. I probably wouldn’t have even noticed if it weren’t for the ease with which my nonimmigrant husband, who grew up in the Midwest, uses voice commands with 99 percent accuracy. My stepdad, of similar descent, uses Siri to call me. In his phone, my name is shortened to “Ximy,” the way someone might abbreviate Ryan to Ry. The correct Spanish pronunciation is “Him-E.” The system only understands if he pronounces it “Zim-E.”

It seems Siri, Google Home, and Alexa require butchered Spanish in order for their systems to work. At best, it feels like an erasure of language. At worst, it’s the exclusion of people who speak more than one language. This tech is supposed to simplify lives, but it isn’t accessible to everyone, which seems backward considering that 43 percent of the world’s population speaks at least two languages, 3 percentage points more than the share who speak only one.

The Washington Post recently released an “accent gap” study in which the paper found that Chinese and Spanish accents are the hardest for Alexa and Google Home to understand. Even though these are the top two languages spoken in the world, these non-native English speakers are being left behind in the voice command revolution. The reason this happens is, quite literally, by design.

“These tools were ‘standardized’ by, and consequently for, a very specific audience,” says Carolina Barrera-Tobon, an assistant professor of modern languages at DePaul University in Chicago. “[It’s for] people who speak standard American English. This sheds some light on who these companies think are important, and to whom they are marketing.”

According to the Post, this happens because the standard used to program devices is “broadcast English,” which the paper classifies as the “predominantly white, nonimmigrant, non-regional dialect of TV newscasters.” This leads to a homogenization of language and a failure to understand people who don’t fall into these categories.

“Language is connected to social privilege and that perpetuates itself in these tools,” says Barrera-Tobon. “Is correct pronunciation important? Should Alexa be able to pronounce ‘Despacito’ correctly? Absolutely. But I think a better question that the developers should have considered from the beginning is what is English, and what does it sound like? ‘Correct’ is a social construct and loosely related to language.” Put simply: Assistant AI comes with bias, whether intentional or accidental.

“Many AI systems are created today using algorithms, training models, and data sets that may be flawed due to unresolved biases,” explains Tiffany Li, attorney and resident fellow at Yale Law School’s Information Society Project, specializing in tech and AI. “These voice assistants likely were trained on understanding speech using data sets that primarily included people with standard American accents. It’s likely developers did not realize that this would create a problem for users who have other accents.”

Barrera-Tobon says there are two issues at play. First is the difficulty voice assistants have in comprehending various accents. The other is how hard it is for AI to recognize code-switching, or changes between two or more languages in the same conversation.

The process of making AI bilingual is complicated, in part because of the intricacies of language and how much structure can vary from one language to another. Add layers of idioms, colloquial sayings, and regional dialects and it’s like trying to catch a fish with your bare hands. According to Steve Davis, an expert on machine learning and co-founder of Signafire, a data analytics firm, breakthroughs have only begun to happen in recent years due to the increased popularity of smartphones and Internet of Things devices. Voice recognition and natural language processing are the two factors at play when assistant AI tries to learn a new language. These aren’t new concepts, but the strategies to improve them are.

“Things like the order of nouns and adjectives, how verbs are conjugated, and tenses expressed vary wildly,” Davis told me via email. “As a result, the models that we’ve built for English have relatively little applicability to Latin-based languages like Spanish. They’re almost entirely irrelevant for a language like Chinese.” And yet, advances are happening.

Google says it’ll offer more language support by the end of this year. And in the Harvard Business Review, Richard Socher, chief scientist at Salesforce, argues AI will grow smarter as it takes on harder tasks. Which is great, but what about the proper pronunciation of Spanish words, or of words from Chinese, French, Arabic, or any other language, when spoken alongside English? While virtual assistants may be able to speak more languages, they’ll still have difficulty understanding users who do. This isn’t necessarily an issue of AI learning to be bilingual, but rather a need to promote diversity and cultural sensitivity among the workforce that makes these systems and devices. The homogeneity of Silicon Valley is well documented, but that shouldn’t be permitted to excuse this lack of mindfulness around non-native English speakers. Especially when you consider how Google Assistant has no problem saying or understanding the name of German writer Goethe, pronounced “Ger-ta.”

I’m reminded of this prioritization whenever I take the bus in Chicago, pass the street of the same name, and hear the pre-recorded PA announcement say it accurately. If the Chicago Transit Authority could hire an announcer (whose self-appointed responsibilities were to put himself in the shoes of the commuter and learn to say words properly) back in 1998, then a tech giant can surely figure it out.