Training, Testing and Accuracy
Learn how AI is built: how a model trains on examples, why it must be tested on data it never saw, and what accuracy really means (and when it lies).
Key takeaways
- Training is when a model studies examples and adjusts itself to get answers right
- Testing checks the model on fresh examples it never saw during training
- If you test on the same data you trained on, the score is misleading
- Accuracy is the share of answers a model gets right, but it can hide serious problems
- On rare events, a high accuracy can still mean the model is almost useless
Building an AI in two big phases
Most modern AI is built in two big phases: training and testing. Training is when the model learns. Testing is when you find out how good it really is. Getting both right, in the right order, is the difference between an AI you can trust and one that just looks clever.
If the idea of a computer learning from examples is new to you, read What Is Machine Learning? first. This lesson picks up from there and shows how engineers actually measure whether the learning worked.
Training: the learning phase
During training, the model studies a large pile of examples and slowly adjusts itself to make fewer mistakes. Picture building a model that decides whether an email is spam or not spam. You gather thousands of emails, and a person has marked each one as "spam" or "safe". Those correct markings are called labels.
The training loop looks like this:
1. Show the model one email.
2. Let it guess: "spam" or "safe".
3. Compare its guess to the real label.
4. If it was wrong, nudge its internal settings so it is a little more likely to be right next time.
5. Repeat thousands or millions of times.
Step 4 is the heart of learning. Each nudge is tiny, but after millions of examples the model gets good at spotting patterns that go with spam, like urgent demands for money or strange links. Importantly, no human hands the model a fixed rulebook; it discovers the patterns from the examples themselves. The examples you choose matter enormously, which is why Training Data and Bias in AI is worth reading too.
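The loop above can be sketched in a few lines of Python. This is only a toy illustration, not how a real spam filter is built: the two features, the four example emails, the learning rate and the simple "nudge" rule (a perceptron-style update) are all assumptions chosen to keep the example small.

```python
# Toy data (made up for illustration): each email is reduced to two features,
# (mentions_money, has_strange_link), with label 1 = spam and 0 = safe.
emails = [
    ((1, 1), 1),
    ((1, 0), 1),
    ((0, 1), 1),
    ((0, 0), 0),
]

weights = [0.0, 0.0]   # the model's "internal settings"
bias = 0.0
learning_rate = 0.1    # how big each nudge is (an assumed value)


def guess(features):
    """Return 1 (spam) if the weighted score is positive, otherwise 0 (safe)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0


for _ in range(20):                     # repeat many times
    for features, label in emails:      # show the model one email
        prediction = guess(features)    # let it guess
        error = label - prediction      # compare the guess to the real label
        if error != 0:                  # wrong: nudge the settings a little
            for i, x in enumerate(features):
                weights[i] += learning_rate * error * x
            bias += learning_rate * error

print([guess(features) for features, _ in emails])
```

On this tiny dataset the settings stop changing after a couple of passes, and the model's guesses match every label. Real models have far more settings and use more sophisticated update rules, but the loop has the same shape.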
Testing: the honesty check
Here is the trap that catches beginners. After training, you might want to ask: "How good is my model?" If you test it using the same emails it trained on, it will probably score brilliantly. But that score is a lie.
Why? Because the model may have simply memorised those exact emails. Memorising the answers to a specific test does not prove you understand the subject. It just proves you have a good memory. The real question is whether the model can handle emails it has never seen before, because that is what it will face in the real world.
So engineers split their data into two parts before training begins:
- A training set (often around 80% of the data) that the model learns from.
- A test set (the remaining 20%) that is locked away and never shown during training.
Only after training is finished do you unlock the test set and see how the model does on those fresh examples. That score is honest, because the model could not have memorised answers it never saw. This split is one of the most important ideas in all of machine learning.
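Here is a minimal sketch of that split, assuming the data fits in a Python list. The function name `split_train_test`, the fixed seed and the placeholder email names are illustrative choices, not part of any particular library.

```python
import random


def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples, then lock away a fraction as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # shuffle so the split is not ordered
    n_test = round(len(shuffled) * test_fraction)
    test_set = shuffled[:n_test]
    train_set = shuffled[n_test:]
    return train_set, test_set


# Placeholder data: 100 made-up email names.
emails = [f"email_{i}" for i in range(100)]
train_set, test_set = split_train_test(emails)

print(len(train_set), len(test_set))
```

Shuffling before splitting matters: if the emails were sorted by date or by label, taking the last 20% without shuffling could give a test set that looks nothing like the training set.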
Accuracy: a number that can fool you
Once you have a test score, you need a way to describe it. The simplest measure is accuracy: the share of answers the model got right.
Accuracy = correct answers ÷ total answers
If a model judges 100 test emails and gets 95 right, its accuracy is 95%. Simple. But accuracy hides a dangerous trap, and learning to spot it is what separates a careful thinker from a careless one.
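As a small sketch, the formula translates directly into code. The function name `accuracy` and the made-up labels and predictions below are assumptions for illustration.

```python
def accuracy(predictions, labels):
    """Share of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# 100 made-up test emails; the model gets 5 of them wrong.
labels = ["spam"] * 40 + ["safe"] * 60
predictions = list(labels)
for i in range(5):
    predictions[i] = "safe"   # five spam emails wrongly judged safe

print(accuracy(predictions, labels))  # 0.95
```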
Imagine a medical test for a rare disease that only 1 person in 100 actually has. Now imagine a lazy model that always says "no disease" for everyone, without even looking. How accurate is it? It is right 99 times out of 100, so its accuracy is 99%. That sounds amazing! But the model is completely useless: it never finds a single sick person. The one patient who truly needed help was missed.
This is why experts almost never trust accuracy alone. When one answer is much rarer than the other, a high accuracy can hide total failure on the part that matters most.
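The lazy model is easy to simulate. The sketch below assumes exactly 100 patients with one sick person among them, matching the example above; the names are made up for illustration.

```python
# 100 made-up patients: 1 = has the disease, 0 = healthy.
labels = [1] + [0] * 99


def lazy_model(patient):
    """Always answers "no disease" without looking at the patient."""
    return 0


predictions = [lazy_model(patient) for patient in range(len(labels))]

correct = sum(p == y for p, y in zip(predictions, labels))
lazy_accuracy = correct / len(labels)
sick_found = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)

print(lazy_accuracy, sick_found)  # 0.99 0
```

The accuracy is 0.99, yet the count of sick patients found is 0. The single number looks excellent while the model fails on the only case that matters.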
Better ways to measure
To see past the accuracy trap, engineers look at what kind of mistakes a model makes. There are two very different kinds:
- A false positive: the model shouts "yes!" when the answer was "no" (a safe email sent to the spam folder).
- A false negative: the model says "no" when the answer was "yes" (a dangerous email that slips into your inbox).
These mistakes are not equally bad, and which one matters more depends on the job. For a spam filter, a false positive (losing a real email) might be worse. For a disease test, a false negative (missing a sick patient) is much worse. A good engineer asks, "Which mistake hurts more here?" and measures that, rather than hiding everything inside one accuracy number.
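Counting the two kinds of mistake separately takes only a few lines. This is a sketch with made-up data; the function name `count_mistakes` and the choice of "spam" as the "yes" answer are assumptions for illustration.

```python
def count_mistakes(predictions, labels, positive="spam"):
    """Return (false_positives, false_negatives) for the given "yes" answer."""
    pairs = list(zip(predictions, labels))
    false_positives = sum(1 for p, y in pairs if p == positive and y != positive)
    false_negatives = sum(1 for p, y in pairs if p != positive and y == positive)
    return false_positives, false_negatives


# Six made-up emails: true labels and the model's guesses.
labels      = ["spam", "spam", "spam", "safe", "safe", "safe"]
predictions = ["spam", "safe", "safe", "spam", "safe", "safe"]

false_positives, false_negatives = count_mistakes(predictions, labels)
print(false_positives, false_negatives)  # 1 2
```

Here the model made three mistakes in total, but they are not the same kind: one real email was lost to the spam folder, and two spam emails slipped through.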
Two measures help here. Precision asks: of all the times the model shouted "yes", how often was it right? Recall asks: of all the real "yes" cases out there, how many did the model actually catch? A model can have brilliant precision and terrible recall, or the other way around. Going back to the rare-disease example, the lazy model that always says "no disease" has zero recall: it caught none of the sick people, no matter how high its overall accuracy looked. Reporting precision and recall together makes that failure impossible to hide.
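Both measures can be computed from simple counts. The sketch below is one possible way to write it; the function name, the use of `None` when a measure is undefined, and the second "cautious" model are assumptions added for illustration.

```python
def precision_recall(predictions, labels, positive=1):
    """Return (precision, recall); a value is None when it is undefined."""
    true_positives = sum(
        1 for p, y in zip(predictions, labels) if p == positive and y == positive
    )
    predicted_yes = sum(1 for p in predictions if p == positive)
    actual_yes = sum(1 for y in labels if y == positive)
    # Precision: of all the "yes" answers given, how many were right?
    precision = true_positives / predicted_yes if predicted_yes else None
    # Recall: of all the real "yes" cases, how many were caught?
    recall = true_positives / actual_yes if actual_yes else None
    return precision, recall


# 100 made-up patients, one of whom is sick (1 = sick, 0 = healthy).
labels = [1] + [0] * 99

lazy_predictions = [0] * 100                # always says "no disease"
cautious_predictions = [1, 1] + [0] * 98    # flags the sick patient and one healthy one

print(precision_recall(lazy_predictions, labels))      # (None, 0.0)
print(precision_recall(cautious_predictions, labels))  # (0.5, 1.0)
```

The lazy model never says "yes", so its precision has nothing to measure and its recall is zero. The cautious model raises one false alarm, which halves its precision, but it catches every sick patient.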
Why the data split must be done fairly
There is one more subtle trap worth knowing. Splitting data into training and test sets only works if the split is fair. Suppose you are building a model to recognise people's handwriting, and the same person's samples end up in both the training set and the test set. The model could learn that person's particular style during training and then "recognise" it at test time, scoring high without proving it can read a stranger's handwriting. To avoid this, engineers make sure that closely related examples stay together on the same side of the split. The golden rule is simple: the test set must represent the kind of new data the model will really meet, with no sneaky overlap from training.
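One way to keep related examples together is to split by writer rather than by individual sample. The sketch below assumes each sample records who wrote it; the function name `split_by_group` and the made-up writers and pages are illustrative only.

```python
import random


def split_by_group(samples, group_of, test_fraction=0.2, seed=0):
    """Split so that every group lands entirely in train or entirely in test."""
    groups = sorted({group_of(sample) for sample in samples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [s for s in samples if group_of(s) not in test_groups]
    test = [s for s in samples if group_of(s) in test_groups]
    return train, test


# Made-up handwriting data: 10 writers with 5 pages each.
samples = [(f"writer_{w}", f"page_{p}") for w in range(10) for p in range(5)]
train, test = split_by_group(samples, group_of=lambda sample: sample[0])

train_writers = {writer for writer, _ in train}
test_writers = {writer for writer, _ in test}
print(len(train), len(test), train_writers & test_writers)
```

Because whole writers are assigned to one side, no writer appears in both sets, so a good test score has to come from reading handwriting the model has never seen.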
A quick comparison
| Idea | What it is | Why it matters |
|---|---|---|
| Training | Model learns from examples | This is where the model gets its skill |
| Testing | Model is judged on unseen data | The only honest measure of real performance |
| Accuracy | Share of answers that are correct | Easy to read, but can hide bad failures |
| False negative | Missed a real "yes" | Often the most dangerous mistake |
The honest takeaway
The next time you read that some AI is "95% accurate," do not just nod. Ask three questions. Was it tested on data it had never seen? How rare is the thing it is looking for? And which kind of mistake does it make? Those questions reveal whether the number is a real achievement or a clever illusion. Thinking this carefully about evidence is a skill that helps far beyond AI, in science, maths and everyday life.
Quick quiz
What happens during training?
Training is the learning phase: the model sees examples, makes guesses, checks them against the correct answers, and adjusts its settings to do better.
Why must test data be different from training data?
If a model is tested on the exact examples it memorised, it scores high without proving it can handle anything new. Fresh test data shows how it will really perform.
A spam filter is 95% accurate. What does that mean?
Accuracy is the fraction of all decisions that were correct. 95% accuracy means 95 of every 100 judgements matched the true answer.
Why can high accuracy still be misleading for rare events?
If only 1 in 100 emails is dangerous, a model that always says 'safe' is 99% accurate yet never catches a single dangerous email. Accuracy alone hides this.
What is the purpose of splitting data into training and test sets?
You split the data so the model learns from the training part and is then judged honestly on the test part it has never seen.
FAQ
A validation set is a third slice of data, separate from training and final testing. Engineers use it while building the model to compare different settings and choices. Keeping a final test set untouched until the very end means the last score is not influenced by all that tweaking.
A perfect 100% score is usually a warning sign, not a triumph. Real-world data is messy, so a perfect score often means the model was tested on data it had already seen, or the test was too easy. Honest models on hard problems rarely score 100%.