We live in a time when our smartphones and other devices do what we want when we simply talk to them: locating the nearest café, calling Mom or writing out a text message that we dictate while driving. Computers can even answer our questions or act as our virtual assistants, even though it’s frustrating when they sometimes get things wrong.
It’s called voice recognition, and it’s an example of machine learning. Despite the occasional awkward response or failure to understand some accents, this type of technology has made a lot of progress in the last couple of years. Just look at recent versions of Apple’s Siri, Amazon’s Alexa, Microsoft’s Cortana and Ford’s SYNC infotainment system.
So, how does our phone, tablet or car decipher the things we say and respond accordingly? Here’s the basic idea:
- It’s not really about the sound — it’s actually about the sound wave that comes out when we say something
A sound is created through tiny changes in air pressure, and it enters our ears as one continuous sound wave. But computers aren’t like people, so they need a way to “hear” the words that are said and turn them into text. When a sound enters one of our devices, its computer measures the height of the sound wave at one point in time, stores that number, then measures again a fraction of a second later, repeating this thousands of times per second. The result: the sound you made is now digitized as a list of numbers the computer can work with. As you can imagine, this process has to be very precise, which is one reason our smart devices sometimes mistake what we say. If part of the wave is missed or measured poorly, the digital copy will not match what was actually said.
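The measure-store-repeat routine described above can be sketched in a few lines of Python. This is only an illustration: the 440 Hz test tone, the 16,000 measurements per second and the 16-bit rounding are assumed values, and the function names are made up for the example rather than taken from any real device.

```python
import math

def sample_wave(wave, duration_s, sample_rate_hz):
    """Measure a continuous wave at evenly spaced points in time."""
    count = round(duration_s * sample_rate_hz)
    return [wave(i / sample_rate_hz) for i in range(count)]

def quantize(samples, bits=16):
    """Round each measurement (expected in -1.0..1.0) to a whole number."""
    max_level = 2 ** (bits - 1) - 1
    return [round(max(-1.0, min(1.0, s)) * max_level) for s in samples]

def tone(t):
    """Assumed stand-in for a voice: a pure 440 Hz tone."""
    return math.sin(2 * math.pi * 440 * t)

# Assumed settings: 10 milliseconds of sound, 16,000 measurements per second.
samples = sample_wave(tone, duration_s=0.01, sample_rate_hz=16000)
digital = quantize(samples)
print(len(digital), digital[:5])
```

The more often the wave is measured, the closer the list of numbers follows the original sound, at the cost of more data to store and process.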
- The sound of a word vs. the sound of something else
Once a sound is recorded digitally, the computer uses algorithms to figure out which parts of it deserve attention. To determine whether chunks of digitized sound are actually words, rather than noise from a car engine or a radio, the computer applies a series of mathematical operations that separate speech from everything else.
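One of the simplest operations of this kind is to cut the recording into short chunks and check how much energy each chunk carries. The Python sketch below does only that, with an invented signal and an assumed threshold. It can tell loud chunks from quiet ones, but it could not tell a voice from a car engine; real systems would also need to look at other properties of the sound.

```python
def frame_energy(samples, frame_size):
    """Average squared amplitude of each fixed-size chunk of samples."""
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(sum(s * s for s in frame) / frame_size)
    return energies

def flag_active_frames(samples, frame_size=4, threshold=0.01):
    """Mark each chunk True if its energy exceeds the (assumed) threshold."""
    return [e > threshold for e in frame_energy(samples, frame_size)]

# Invented data: a quiet stretch, a loud stretch, then quiet again.
quiet = [0.01, -0.02, 0.01, 0.0]
loud = [0.5, -0.6, 0.4, -0.5]
signal = quiet + loud + quiet

print(flag_active_frames(signal))  # [False, True, False]
```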
- Same word, different accents
Voice recognition works by breaking speech up into small segments called phonemes. English alone has about 40 different phonemes. The computer is trained to recognize what each phoneme looks like digitally, but the same phoneme doesn’t always look the same. For instance, sounds vary with different accents and with their placement in a word, and words that sound identical can have different spellings (e.g. “to” vs. “two” vs. “too”). Based on a dictionary word list and contextual relationships, the computer in your gadget can make an educated guess at what you’re saying. So, if your friend Mary is in your contact list, the command “call Mary” is linked to “Mary” and not “merry”.
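The “call Mary” example can be sketched as a dictionary lookup followed by a context check. Everything in this Python snippet is hypothetical: the tiny `LEXICON`, the phoneme labels (written in a common dictionary-style notation) and the rule about the word “call” are invented to show the idea, not how any particular product works.

```python
# Hypothetical pronunciation dictionary: phoneme sequence -> words that match it.
LEXICON = {
    ("K", "AO", "L"): ["call"],
    ("M", "EH", "R", "IY"): ["merry", "Mary"],
    ("T", "UW"): ["to", "two", "too"],
}

def pick_word(phonemes, contacts, previous_word=None):
    """Choose a spelling for a phoneme sequence, using context when available."""
    candidates = LEXICON.get(tuple(phonemes), [])
    if not candidates:
        return None  # sound sequence not in the dictionary
    # Assumed context rule: after "call", prefer a name from the contact list.
    if previous_word == "call":
        for word in candidates:
            if word in contacts:
                return word
    return candidates[0]  # otherwise fall back to the first listed word

contacts = {"Mary", "Ahmed"}
print(pick_word(["M", "EH", "R", "IY"], contacts, previous_word="call"))
print(pick_word(["M", "EH", "R", "IY"], contacts))
```

With the contact list and the preceding word “call”, the lookup settles on “Mary”; without that context it falls back to “merry”.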
“With enhanced voice recognition, you can talk to SYNC 3 with simple real-world voice commands and the system responds naturally to your voice,” says Mark Porter, Supervisor, Asia Pacific Infotainment Systems, Ford Motor Company. “It’s even been fine-tuned to deal with the Australian accent, and in China, it can understand a string of Chinese characters written by hand on its graphical interface.”
- Predicting what the next word in a sentence might be
There can be many different word combinations in a single speech stream simply because there are lots of phonemes that sound similar to one another when said quickly. Sometimes the result can be a wacky sequence of words that don’t really make sense. To avoid this, the computer system applies models based on how people actually talk to figure out how likely one word is to follow another.
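A very small version of such a model simply counts which words follow which in example sentences. The Python sketch below uses an invented four-sentence corpus and a standard smoothing trick (adding one to every count) so that unseen word pairs still get a small, non-zero chance. Real systems learn from vastly more text, and the names here are made up for the example.

```python
from collections import Counter

# Invented mini-corpus standing in for "how people actually talk".
CORPUS = [
    "please call mary now",
    "call mary at home",
    "call me now",
    "merry christmas to you",
]

def train(sentences):
    """Count how often each word is followed by each other word."""
    pair_counts, first_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        for first, second in zip(words, words[1:]):
            pair_counts[(first, second)] += 1
            first_counts[first] += 1
    return pair_counts, first_counts, len(vocab)

def follow_probability(first, second, model):
    """Estimate how likely `second` is to come right after `first`."""
    pair_counts, first_counts, vocab_size = model
    # Add-one smoothing keeps never-seen pairs from scoring exactly zero.
    return (pair_counts[(first, second)] + 1) / (first_counts[first] + vocab_size)

model = train(CORPUS)
for guess in ("mary", "merry"):
    print(guess, round(follow_probability("call", guess, model), 3))
```

In this toy corpus “mary” follows “call” far more often than “merry” does, so the model favors “call mary” even though the two words sound alike.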
- Presenting the best result as quickly as possible
Once all the calculations are done and the guesses are made, the computer can finally present its best result, whether that means showing it on a screen, choosing from a pre-set menu or giving a vocal response. “New, state-of-the-art voice recognition technology can achieve incredibly fast response times and are more intuitive than ever before,” explains Mr. Porter. “A user of SYNC 3 can command their car to ‘Tune to <frequency> FM’, while other systems still require you to say ‘Radio’ then points you to another list and prompts you again to say the frequency of the radio station you want to listen to.”
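Picking the “best result” can be pictured as ranking a handful of competing guesses. In the Python sketch below, each guess carries two invented scores: how well it matches the sound and how plausible it is as a phrase. The numbers and the way they are combined are assumptions made for illustration, not a description of SYNC 3 or any other system.

```python
import math

def best_hypothesis(hypotheses):
    """Return the text of the guess with the highest combined score.

    Each hypothesis is (text, sound_match_probability, phrase_probability).
    Logarithms are added instead of multiplying the raw probabilities,
    which avoids trouble when the numbers get very small.
    """
    def combined(hypothesis):
        _, sound_match, phrase = hypothesis
        return math.log(sound_match) + math.log(phrase)
    return max(hypotheses, key=combined)[0]

# Invented scores: "call merry" fits the sound slightly better,
# but "call Mary" is a far more plausible thing to say.
candidates = [
    ("call merry", 0.52, 0.01),
    ("call Mary", 0.48, 0.30),
]
print(best_hypothesis(candidates))  # call Mary
```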
With faster and more accurate technology now available, voice-activated commands are making our lives better in myriad ways. Although at times it may seem like your device is just out to annoy you with its bizarre answers, consider all the tedious calculations and complex transformations it has to do behind the scenes to recognize a single word, let alone an entire sentence. For your gadget to be even remotely able to decipher what you say and then piece together a semi-coherent response is amazing, especially since some humans are still trying to master this skill.