AI · Ages 11-13 · Intermediate · 9 min read

How AI Recognizes Speech

How AI recognizes speech, step by step: from sound waves to digital numbers, to phonemes and words, plus why it mishears and how language models help fix it.

Key takeaways

  • Speech recognition turns the sound of your voice into written text
  • A microphone first changes sound waves into a stream of numbers
  • The model learns to map sound patterns to the small units of speech, then to words
  • A language model helps choose the most likely sentence among similar-sounding options
  • It still mishears when there is background noise, an unfamiliar accent, or a rare word, so it is not perfect

From a sound in the air to words on a screen

When you talk to a phone and watch your words appear as text, something genuinely clever is happening. Speech recognition, sometimes called automatic speech recognition or speech-to-text, is the technology that turns the sound of your voice into written words. It powers voice typing, captions on videos, and the first step of every voice assistant. If you want the bigger picture of voice assistants, see How Voice Assistants Work. Here we zoom in on the hardest part: how the machine figures out which words you said.

The journey has several stages, and each one is a small puzzle.

Stage 1: Sound becomes numbers

Your voice is a wave: air pressure wobbling back and forth. A microphone measures that wobble and a computer samples it, meaning it records the wave's height many thousands of times every second. A common rate is 16,000 measurements per second for speech. The result is a long list of numbers that describes the sound.
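
Here is a minimal sketch of sampling in Python. It is only an illustration: there is no real microphone, so a made-up pure tone stands in for your voice, and the names sample_tone and SAMPLE_RATE are invented for this example.

```python
import math

SAMPLE_RATE = 16_000  # measurements per second, the common speech rate


def sample_tone(frequency_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Measure the height of a pure tone sample_rate times per second."""
    count = int(duration_s * sample_rate)
    return [
        math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(count)
    ]


# Half a second of a 440 Hz tone (a stand-in for a real voice).
samples = sample_tone(frequency_hz=440, duration_s=0.5)

print(len(samples))                         # 8000 numbers for half a second
print([round(s, 3) for s in samples[:5]])   # the first few wave heights
```

Half a second of sound already becomes 8,000 numbers, which is why the next stage has to boil the data down.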

Computers only work with numbers, so this step is essential. From now on, your voice is not "sound" to the machine; it is data, the same kind of data that all AI learns from.

Stage 2: Finding the important features

A raw list of numbers is noisy and huge. The next stage cleans it up and pulls out the features that matter for speech. The computer slices the audio into tiny overlapping windows, often about 25 milliseconds each, and measures the mix of frequencies in each window, from low rumbles to high hisses and everything in between. Vowels, consonants, and silences all have different frequency fingerprints.

This produces a kind of picture of the sound over time, sometimes drawn as a spectrogram. You can almost read speech off a spectrogram once you know what to look for: it shows where the energy sits at each moment.
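
The windowing idea can be sketched in a few lines of Python. This is a simplified illustration, not how a real recognizer is written: the audio is two made-up tones, the 10 millisecond step between windows is an assumption, and the helper names frames and strength_at are invented here. Real systems measure many frequencies at once with a faster method.

```python
import cmath
import math

SAMPLE_RATE = 16_000
WINDOW = 400   # 25 ms at 16,000 samples per second
STEP = 160     # assumed step of 10 ms, so neighbouring windows overlap


def frames(samples, window=WINDOW, step=STEP):
    """Slice the samples into overlapping windows."""
    last_start = len(samples) - window
    return [samples[i:i + window] for i in range(0, last_start + 1, step)]


def strength_at(frame, frequency_hz, sample_rate=SAMPLE_RATE):
    """Measure how strongly one frequency is present in a window."""
    total = sum(
        x * cmath.exp(-2j * math.pi * frequency_hz * n / sample_rate)
        for n, x in enumerate(frame)
    )
    return abs(total) / len(frame)


# Fake audio: 50 ms of a low tone (200 Hz), then 50 ms of a high tone (3000 Hz).
low = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(800)]
high = [math.sin(2 * math.pi * 3000 * n / SAMPLE_RATE) for n in range(800)]
audio = low + high

windows = frames(audio)
for index in (0, len(windows) - 1):
    frame = windows[index]
    print(index,
          round(strength_at(frame, 200), 2),
          round(strength_at(frame, 3000), 2))
```

The first window sits inside the low tone, so it scores high at 200 Hz and close to zero at 3000 Hz. The last window sits inside the high tone, so the scores swap. A spectrogram is this same measurement repeated for many frequencies and every window.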

Stage 3: Sounds become speech units

Now comes the machine learning. Spoken language is built from a small set of basic sound units called phonemes. English has roughly 44 of them. The word "cat", for example, is made of three: "k", "a", "t". A model, usually a neural network, is trained on enormous amounts of recorded speech that has been matched to written transcripts. By comparing sounds to their known text, it learns to map each slice of audio to the phonemes it most likely represents.

This is supervised learning in action: the recordings are the examples and the matching transcripts are the labels. If the idea of labelled training is new, read Supervised vs Unsupervised Learning.
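
To see examples and labels at work, here is a toy sketch in Python. It is a big simplification: real systems use neural networks and thousands of hours of speech, while this one swaps in a much simpler method (pick the label whose average example is closest) and uses a handful of made-up numbers. The two features, the labels, and the function names train and predict are all assumptions for illustration.

```python
# Made-up training data: each example is (features, label).
# The two features stand for "low-frequency energy" and "high-frequency energy".
training = [
    ((0.9, 0.1), "a"), ((0.8, 0.2), "a"),
    ((0.1, 0.9), "s"), ((0.2, 0.8), "s"),
    ((0.0, 0.1), "silence"), ((0.1, 0.0), "silence"),
]


def train(examples):
    """Work out the average features seen for each label."""
    sums, counts = {}, {}
    for features, label in examples:
        totals = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            totals[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [t / counts[label] for t in totals]
        for label, totals in sums.items()
    }


def predict(model, features):
    """Pick the label whose average is closest to the new features."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(model[label], features))
    return min(model, key=distance)


model = train(training)
print(predict(model, (0.85, 0.15)))  # a
print(predict(model, (0.15, 0.7)))   # s
```

The model was never told any rules about sounds. It only saw labelled examples, and that was enough to label new slices it had not seen before.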

Modern systems often skip naming individual phonemes and learn to go almost straight from sound to text, but the principle is the same: huge amounts of paired audio and text teach the model the connection.

Stage 4: Speech units become words and sentences

Here is the deep problem. Speech is full of phrases that sound nearly identical. Say "recognise speech" out loud, then "wreck a nice beach". The sounds are almost the same. So the model needs more than ears; it needs a sense of what people are likely to say.

That job belongs to a language model, the same family of technology behind chatbots. It has learned which word sequences are common and which are rare. Given two possibilities that sound alike, it favours the more likely sentence. "Please recognise my speech" is a normal request; "please wreck a nice beach" almost never is. The system weighs the sound evidence against the language evidence and picks the most probable sentence. To go deeper on how those language models work, see How Chatbots Work.
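
The weighing of sound evidence against language evidence can be sketched like this. Every probability below is made up purely for illustration, and the names candidates and score are invented here. A real system compares far more than two sentences.

```python
import math

# Made-up probabilities, for illustration only.
# "sound": how well the sentence matches the audio.
# "language": how likely people are to say the sentence at all.
candidates = {
    "please recognise my speech": {"sound": 0.45, "language": 0.01},
    "please wreck a nice beach": {"sound": 0.55, "language": 0.00001},
}


def score(evidence):
    """Add log-probabilities, which is the same as multiplying probabilities."""
    return math.log(evidence["sound"]) + math.log(evidence["language"])


best = max(candidates, key=lambda sentence: score(candidates[sentence]))
print(best)  # please recognise my speech
```

In this sketch the beach sentence matches the audio slightly better, but it is so unlikely as a sentence that the combined score still favours the normal request.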

Why it still gets things wrong

Speech recognition has improved dramatically, but it is not magic, and being honest about its limits matters:

  • Background noise covers up the sound features, so the model has less to work with.
  • Accents and dialects that were rare in the training data are harder to map correctly. A system trained mostly on one accent performs worse on others, which is a fairness issue rooted in training data and bias.
  • Rare names, new slang, and technical words may never have appeared in training, so the model has no idea they exist.
  • Homophones like "their", "there" and "they're" sound identical, so the system must guess from context, and sometimes guesses wrong.
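
The homophone problem in the last bullet can be shown with a toy sketch. The counts below are invented, and a real language model looks at much more context than one following word. The names pair_counts and guess_spelling are made up for this example.

```python
# Made-up counts of how often each pair of words appeared in some training text.
pair_counts = {
    ("their", "dog"): 50, ("there", "dog"): 1, ("they're", "dog"): 0,
    ("their", "is"): 1, ("there", "is"): 80, ("they're", "is"): 0,
    ("their", "happy"): 2, ("there", "happy"): 1, ("they're", "happy"): 40,
}

HOMOPHONES = ["their", "there", "they're"]


def guess_spelling(next_word):
    """Choose the spelling seen most often before next_word."""
    return max(HOMOPHONES, key=lambda word: pair_counts.get((word, next_word), 0))


print(guess_spelling("dog"))    # their
print(guess_spelling("is"))     # there
print(guess_spelling("happy"))  # they're
print(guess_spelling("zebra"))  # their
```

The last guess is blind. The word "zebra" never appeared in the counts, so all three spellings tie at zero and the first one in the list wins. That is the rare-word problem and the homophone problem showing up together.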

It is also worth remembering that recognising words is not the same as understanding them. The sound-to-text step can be flawless while the meaning is completely missed; that is a separate problem for a different system.

Try noticing it yourself

Next time you use voice typing, watch closely. You will often see the text change as you keep talking: the system updates its guess once later words give it more context. That live correction is the language model doing its job, balancing sound against likelihood in real time. The same careful, step-by-step problem solving you practise in coding is exactly the thinking that built each of these stages.

Quick quiz

What is the very first thing that happens to your voice in speech recognition?

What is a phoneme?

Why does 'recognise speech' sometimes get heard as 'wreck a nice beach'?

What does the language model add to speech recognition?

Which situation makes speech recognition harder?

FAQ

Speech recognition only converts sound into text; it does not understand meaning. A separate system, often a language model, handles understanding and responding. That is why a device can type your words perfectly yet still give a useless answer.

Powerful speech models are large, so some run on company servers in the cloud. Many phones now also include smaller models that work offline, but they may be a little less accurate than the cloud versions.