Alexa, Play Music
- Natalie Lee
- Sep 12, 2020
- 3 min read
Updated: Sep 21, 2020
Alexa, how does speech recognition work? If you ask her this, she will give you a lengthy definition of speech recognition. But how does she determine what you said? Speech recognition is not nearly as easy for a computer as it is for a human. In this article, we're going to delve into the complexity of speech recognition, its challenges, and how it can be improved.

The first progress in speech recognition actually began with Alexander Graham Bell back in the 19th century. Although he never managed to build a successful device to communicate with his deaf wife, his research into transforming sound into images ultimately spurred the creation of his famous invention: the telephone.
To explain how speech recognition works, we must first understand the concept of speech itself. Speech is a group of sounds strung together to express a comprehensible thought or idea. Speech is composed of phonemes, bits of sound that form words when put together; you can think of them as the building blocks of speech.
Fun Fact: Each language contains a different number of phonemes: English has about 46 distinct phonemes, whereas Spanish uses about 24.
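
To make the idea of phonemes concrete, here is a minimal Python sketch of a pronunciation dictionary. The phoneme symbols and the three-word lexicon are purely illustrative assumptions on my part; real recognizers use much larger dictionaries.

```python
# A tiny, made-up pronunciation dictionary: each word maps to its phonemes.
# The symbols are ARPAbet-style and purely illustrative; real systems use
# large lexicons covering tens of thousands of words.
LEXICON = {
    "red":   ["R", "EH", "D"],
    "read":  ["R", "EH", "D"],   # past-tense "read" sounds identical to "red"
    "apple": ["AE", "P", "AH", "L"],
}

def phonemes_for(phrase):
    """Concatenate the phoneme sequences of the words in a phrase."""
    sounds = []
    for word in phrase.lower().split():
        sounds.extend(LEXICON.get(word, ["<unknown>"]))
    return sounds

print(phonemes_for("red apple"))   # ['R', 'EH', 'D', 'AE', 'P', 'AH', 'L']
print(phonemes_for("read apple"))  # identical output: the homophone problem
```

Notice that "red apple" and "read apple" produce exactly the same phoneme sequence, which is the ambiguity we'll come back to below.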

Before a computer can recognize speech, it has to convert what you say into data it can understand. Using an analog-to-digital converter, it translates your analog sound waves into digital data. Oftentimes, when you are speaking, there are other sounds around you; the computer must filter out this unwanted noise by isolating the frequencies of your voice. It also has to normalize the speed of the digitized speech so that, later on, it is easier to match what you said to the sounds in its library.
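
Here is a rough Python sketch of that digitization step, using a simple tone as a stand-in for your voice. The 16 kHz sample rate and 16-bit quantization are common choices for speech audio, but they are assumptions here, not what any particular assistant uses.

```python
import numpy as np

# Stand-in for the analog signal: a pure 440 Hz tone instead of real speech.
SAMPLE_RATE = 16_000   # samples per second; 16 kHz is a common rate for speech
duration = 1.0         # one second of "audio"

t = np.linspace(0.0, duration, int(SAMPLE_RATE * duration), endpoint=False)
analog_like = 0.5 * np.sin(2 * np.pi * 440 * t)

# The analog-to-digital step: measure the wave at fixed intervals (done above
# by evaluating it at discrete times) and quantize each measurement to a
# 16-bit integer, the format most audio hardware records in.
digital = (analog_like * 32767).astype(np.int16)

print(digital[:10])    # the first few digitized samples the computer works with
```

The output is just a long list of numbers; everything the recognizer does afterward (noise filtering, speed normalization, phoneme matching) happens on data like this.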

After finalizing the digitized version of your speech, the computer combs through its stored phonemes to find the sounds that match what you said. This is the most challenging step because there are numerous phoneme combinations that could correspond to what you said. The two most common issues are homophones and continuous speech. For instance, "red" and "read" are homophones (they sound the same), so it is difficult for the computer to tell those two words apart. It is also hard for the computer to separate words when the speaker talks quickly without pauses between words. To get around these two challenges, the computer uses statistics and probability to calculate the combinations that make the most sense. Referring to the example above, the combination "red apple" appears far more often than "read apple."
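
Here is a hedged sketch of that idea in Python. The word-pair counts are invented for illustration, but they show how a recognizer can use how often word combinations occur to choose between homophones.

```python
# A toy word-pair table standing in for the statistics a real recognizer
# gathers from enormous amounts of text. The counts are invented purely
# for illustration.
PAIR_COUNTS = {
    ("red", "apple"): 120,
    ("read", "apple"): 2,
    ("read", "aloud"): 95,
    ("red", "aloud"): 1,
}

def pick_likelier(candidates, next_word):
    """Given homophone candidates (e.g. 'red' vs 'read'), return the one
    that most often precedes the following word in the counts table."""
    return max(candidates, key=lambda w: PAIR_COUNTS.get((w, next_word), 0))

print(pick_likelier(["red", "read"], "apple"))  # -> 'red'
print(pick_likelier(["red", "read"], "aloud"))  # -> 'read'
```

The same sound gets interpreted differently depending on the words around it, which is exactly how the computer decides that you probably meant "red apple" rather than "read apple."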
Even with those two issues resolved, there are many other obstacles in speech recognition that have yet to be overcome by modern technology. If the person doesn't speak clearly and loudly enough, it is difficult for the computer to determine what they are saying, especially if there is background noise. And if multiple people are talking at the same time, the computer can't tell who is actually addressing it.
Currently, speech recognition is used in popular "personal assistant" apps such as Amazon's Alexa, Apple's Siri, Microsoft's Cortana, and Google's Google Assistant and Google Home. Some smartphone keyboards also use speech recognition to let people type simply by speaking out loud. This feature is useful for long paragraphs that people don't want to type out on their phones, but because the software still makes recognition mistakes, people have to read over the text and correct them, which takes time.
Fun Fact: As of 2020, roughly 70% of smart speaker users in the United States use an Amazon Echo.
Speech recognition is still in its early years, and there is much to improve: for example, the computer has to process a lot of information, so it is not the most efficient system. There is a lot of potential in the future of speech recognition, including universal translators and digital friends that can hold a conversation with people. I hope you learned something new today! Keep a lookout for the next post!