Google voice recognition

If you have an Android phone (which is pretty likely if you’re reading this), you might have used Google voice recognition. The command “OK Google” opens a wide range of possibilities for interacting with your device, but do you know how it works?

Voice recognition is a technique which has been used for years as a way of interacting with computers. In the world of smartphones in general, and Android in particular, voice recognition became particularly useful following the release of Jelly Bean, when recognition errors were reduced by 25%. Let’s take a look at how this is done.

The general concept: algorithms, patterns and vocabulary

In order to understand the basis of how a voice recognition system works, think about how our brains work when we learn a new language. Have you ever tried to watch a foreign film in its original language? You might have encountered one of several situations: you understand everything, you understand some things, or you don’t understand anything at all.

By playing the video slower and with headphones, you might find it easier to identify which words are being said. If you managed to identify more words with headphones and at a lower speed, it means that your weak spot was your brain algorithms. Now, do you know what the words you identified mean? If not, it’s because you also need to know the vocabulary.

Brain algorithms can be defined as shortcuts your brain uses to decipher what is being said (even if you don’t know what it means). By watching the video at a slower speed and with headphones, you make it a little easier for your brain to understand what is being said, and you rely less on your language skills to identify what is being said.

In order to identify as many spoken words as possible, there are two key points: refining your algorithms through practice so that your brain works better, and being able to recognise patterns. In other words, remembering how words sound, so that when you hear a conversation in another language you can compare what you hear with the sounds you recognise and decide which word is most probably being said. This comparison is made mathematically using cross-correlation (similar to the method Shazam uses to identify a song; here is a very interesting article on the subject), and it is the same process your brain carries out without you even realising it.

Cross-correlation formula (Source: Wikipedia)
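This pattern-matching step can be sketched in a few lines of Python with NumPy. The following is a toy illustration, not Shazam’s or Google’s actual code: each stored pattern is scored against the heard signal by the peak of their normalised cross-correlation, and the highest score wins.

```python
import numpy as np

def best_match(heard, patterns):
    """Pick the stored pattern whose peak cross-correlation
    with the heard signal is highest (a toy illustration)."""
    scores = {}
    for name, ref in patterns.items():
        # Normalise both signals so loudness doesn't dominate the score
        a = (heard - heard.mean()) / (heard.std() * len(heard))
        b = (ref - ref.mean()) / ref.std()
        # The peak of the cross-correlation is the best alignment score
        scores[name] = np.max(np.correlate(a, b, mode="full"))
    return max(scores, key=scores.get)

# Two reference "sounds" and a noisy, time-shifted copy of the first
np.random.seed(0)
t = np.linspace(0, 1, 400)
patterns = {"low tone": np.sin(2 * np.pi * 5 * t),
            "high tone": np.sin(2 * np.pi * 20 * t)}
heard = np.roll(patterns["low tone"], 37) + 0.3 * np.random.randn(400)
print(best_match(heard, patterns))  # best match: "low tone"
```

Even though the heard signal is shifted and noisy, the correlation against the matching pattern still produces by far the highest peak, which is exactly why this technique works for comparing sounds.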

Going back to the example of Shazam, its algorithms (the equivalent of our brain algorithms) allow a song to be recorded as cleanly as possible, with the least amount of noise and distortion, so that it can be compared with a database of songs to decide which song it is. This means that there needs to be an enormous database of songs; in the case of learning a language, we need to know the pronunciation of a large variety of words to be able to recognise what is being said.

Shazam audio capture process (Source: www.toptal.com)
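The database lookup can be sketched as follows. This is a deliberately simplified, hypothetical fingerprint (the dominant frequency of each time slice); real systems like Shazam hash constellations of spectral peaks, but the principle of reducing audio to a compact key and looking it up is the same.

```python
import numpy as np

def fingerprint(signal, slice_len=256):
    """Toy fingerprint: the dominant frequency bin of each time
    slice (real systems hash constellations of several peaks)."""
    slices = [signal[i:i + slice_len]
              for i in range(0, len(signal) - slice_len + 1, slice_len)]
    return tuple(int(np.argmax(np.abs(np.fft.rfft(s)))) for s in slices)

# Build a tiny "database" of known songs, keyed by fingerprint
np.random.seed(1)
t = np.linspace(0, 1, 2048)
songs = {"song A": np.sin(2 * np.pi * 40 * t),
         "song B": np.sin(2 * np.pi * 90 * t) + 0.5 * np.sin(2 * np.pi * 15 * t)}
database = {fingerprint(audio): name for name, audio in songs.items()}

# A noisy recording of song B still yields the same fingerprint
recording = songs["song B"] + 0.05 * np.random.randn(2048)
print(database.get(fingerprint(recording), "unknown"))  # "song B"
```

The key property is robustness: small amounts of noise change the raw samples completely, but not the dominant frequencies, so the fingerprint survives and the dictionary lookup still succeeds.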

As well as brain algorithms we have the vocabulary of the language (which might seem like the same thing as the “database of sounds”, but in this case I am referring to the meaning). If we are capable of identifying what is being said in a conversation without any problems, but we do not know the meaning of what is being said, there is not much we can do. This part usually requires more stamina than skill. Anyone can learn 1,000 Swedish words and be able to recognise them when they see them written down, but that doesn’t mean they will be able to understand a conversation in Swedish because their brain algorithms are not trained to do so.

The same happens in the world of tech: it is more complicated to identify words correctly than it is to know what each one means. It costs a computer almost nothing (a few bytes) to store the meaning of a word, but it is much harder for its algorithms to be good enough to isolate and identify words, not to mention establishing the complex semantic relationships of human language. You might learn a word and recognise it in a conversation without knowing what it means. For computers the situation is reversed: once they recognise a word, looking up its meaning is immediate and will never be forgotten, and it is almost impossible for them not to know a word, as whole dictionaries fit in just a few megabytes. But just like humans, they do have problems with understanding a word in context.

Google and neural networks

After that introduction, let’s now focus on your smartphone. How does Google’s voice recognition work so well? It relies on a huge number of stored patterns in order to make comparisons, and on many computers working together as a neural network, processing the information they receive to reach a common result. We have Vincent Vanhoucke and Jeff Dean to thank for applying this technology to Google voice recognition, but the concept of neural networks is not only used in voice recognition. For example, it has also been used to improve Google’s algorithms for detecting human and cat faces (to do so, the system analysed thousands and thousands of YouTube videos looking for patterns).

Face patterns detected by the neural network (Source: www.phonearena.com)

Google systems based on neural networks can adjust their behaviour to improve, learn and evolve. A voice recognition system could, for example, give different results depending on the person using it, drawing on their search history to help understand what they have said. Ultimately it all comes down to reducing errors in the conversion, and probability plays a big role in this.
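The role of probability can be illustrated with a toy re-ranking step (the numbers and the two-word vocabulary are invented for the example; this is not Google’s actual model): when two words sound almost identical, a prior drawn from the user’s history can tip the balance.

```python
# Toy rerank: combine an acoustic score with a prior built from the
# user's history (illustrative numbers, not a real recogniser)
def rerank(hypotheses, history_prior):
    """Return the hypothesis with the highest combined probability,
    falling back to a small default prior for unseen words."""
    return max(hypotheses,
               key=lambda h: hypotheses[h] * history_prior.get(h, 0.01))

# Two words that sound almost identical to the recogniser
acoustic = {"wear": 0.51, "where": 0.49}

# A user who often searches for directions favours "where"
prior = {"where": 0.30, "wear": 0.05}
print(rerank(acoustic, prior))  # "where"
```

Even though “wear” scored slightly higher acoustically, the user’s history makes “where” the more probable intention, which is exactly the kind of per-user adjustment described above.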

Another fundamental point in this technology is that the more data it has access to, the more precise it is. In this regard, the amount of data Google has to work with is simply enormous (and getting bigger by the day).

In practice, when you speak to Google’s voice recognition engine, it converts your voice from an analogue to a digital signal, transforms it from the time domain to the frequency domain, splits the resulting spectrogram into parts, and sends them to different Google computers around the world.

Example of a time-frequency spectrogram (Source: Wikipedia)
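The time-to-frequency step is a short-time Fourier transform, which can be sketched with NumPy (a minimal version; production systems add mel filtering and other refinements):

```python
import numpy as np

def spectrogram(signal, frame_len=128, hop=64):
    """Short-time Fourier transform: window the signal and take the
    FFT of each frame, giving frequency content over time."""
    window = np.hanning(frame_len)  # smooth window reduces leakage
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # Rows = time frames, columns = frequency bins
    return np.array([np.abs(np.fft.rfft(f)) for f in frames])

# A chirp whose pitch rises over time, a bit like a sliding vowel
t = np.linspace(0, 1, 4096)
chirp = np.sin(2 * np.pi * (50 + 200 * t) * t)
spec = spectrogram(chirp)
print(spec.shape)  # (number of frames, frame_len // 2 + 1)
```

Each row of `spec` describes which frequencies were present during one small slice of time; for the chirp, the dominant frequency bin climbs steadily from the first row to the last.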

These computers process the audio using the neural network model and try to identify the individual components of the voice (vowels and consonants) in one neural network layer. Next, another layer tries to identify the groups these fundamental sounds form, until a final estimate of the word is obtained.

Neural network architecture (Source: www.egomachines.com)
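The layered idea can be sketched with a tiny two-layer network. The weights, phonemes and two-word vocabulary below are hand-picked for illustration (a real acoustic model has millions of learned weights): the first layer turns acoustic features into phoneme evidence, and the second turns phoneme evidence into word scores.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hand-picked toy weights, purely illustrative
W1 = np.array([[ 2.0, -1.0],   # layer 1: detector for phoneme /o/
               [-1.0,  2.0]])  # layer 1: detector for phoneme /i/
W2 = np.array([[ 1.5, -0.5],   # layer 2: word "go"  = mostly /o/
               [-0.5,  1.5]])  # layer 2: word "gee" = mostly /i/
words = ["go", "gee"]

def recognise(features):
    phonemes = relu(W1 @ features)   # first layer: fundamental sounds
    scores = softmax(W2 @ phonemes)  # second layer: word estimate
    return words[int(np.argmax(scores))]

print(recognise(np.array([0.9, 0.1])))  # features close to /o/ -> "go"
```

In a trained system the weights are learned from data rather than written by hand, but the flow is the same: raw features in, sound units in the middle, a probability over words out.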

The power of neural networks lies in their ability to predict what a user wants, what a sound means, how a word changes the meaning of a sentence, how a molecule will behave in a certain situation, the weather, the stock exchange, and so on. Obviously they are not human brains, but they try to minimise the difference between their abilities and ours by asking: What is the most likely thing that will happen?

And that’s just the beginning…

Everything I’ve talked about here is just the tip of the iceberg. Google’s ability to understand us is constantly improving, and this is changing the way we interact with our devices. Natural language understanding is coming within reach, and the importance of context is increasing, so we will soon be less conscious of the fact that we are talking to a machine. That is voice recognition; the next big revolution is translating voice into other languages, which is constantly improving thanks to neural networks that let the system keep learning, minimise errors and become more like a human being (here is a video showing how the system easily translates from English to Chinese during a presentation).

A device will never take the place of a face-to-face conversation with another person, but it can make things easier if you have a clear idea of what you need and of the limitations of the device. So now you know how voice recognition works, try asking Google to do things for you, like searching for information on the internet. When you do so, think about the incredible technology which makes it possible.

Carlos Ávila is an information systems administrator with 10 years of experience in the field. Passionate about science and technology, he maintains his own blog which he uses to share his knowledge about the subject. He also collaborates with BQ, writing articles related to smartphones, tablets, networks and technology.