What is a validation set?

It is a third slice of data, separate from training and final testing. Engineers use it while building the model to compare different settings and choices. Keeping a final test set untouched until the very end means the last score is not influenced by all that tweaking.

Is 100% accuracy a good thing?

Usually it is a warning sign, not a triumph. Real-world data is messy, so a perfect score often means the model was tested on data it had already seen, or the test was too easy. Honest models on hard problems rarely score 100%.

Training, Testing and Accuracy

Building an AI in two big phases

Most modern AI is built in two big phases: training and testing. Training is when the model learns. Testing is when you find out how good it really is. Getting both right, in the right order, is the difference between an AI you can trust and one that just looks clever.

If the idea of a computer learning from examples is new to you, start with "What Is Machine Learning?". This lesson picks up from there and shows how engineers actually measure whether the learning worked.

Training: the learning phase

During training, the model studies a large pile of examples and slowly adjusts itself to make fewer mistakes. Picture building a model that decides whether an email is spam or not spam. You gather thousands of emails, and a person has marked each one as "spam" or "safe". Those correct markings are called labels.

The training loop looks like this:

1. Show the model one email.
2. Let it guess: "spam" or "safe".
3. Compare its guess to the real label.
4. If it was wrong, nudge its internal settings so it is a little more likely to be right next time.
5. Repeat thousands or millions of times.

Step 4 is the heart of learning. Each nudge is tiny, but after millions of examples the model gets good at spotting patterns that go with spam, like urgent demands for money or strange links. Importantly, the model is never handed a fixed rulebook by a human; it discovers the patterns from the examples themselves. The examples you choose matter enormously, which is why "Training Data and Bias in AI" is worth reading too.
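The five steps above can be sketched in a few lines of Python. This is only a toy illustration, not how a real spam filter is built: the two features (whether an email mentions money, whether it has a strange link), the four example emails, and the names predict and train are all made up for this sketch, and the nudge rule is a simple perceptron-style update chosen because it is easy to read.

```python
# Toy training loop (perceptron-style). Everything here is an assumption
# made for illustration: the features, the data and the function names.
# Each email is reduced to two features: (mentions_money, has_strange_link).
# Label: 1 means "spam", 0 means "safe".
EXAMPLES = [
    ((1, 1), 1),
    ((1, 0), 1),
    ((0, 1), 1),
    ((0, 0), 0),
]


def predict(weights, bias, features):
    """Step 2: let the model guess. A score above zero means 'spam'."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0


def train(examples, epochs=20, step=0.1):
    """Run the loop: show, guess, compare, nudge, repeat."""
    weights = [0.0, 0.0]  # the model's "internal settings"
    bias = 0.0
    for _ in range(epochs):                             # Step 5: repeat
        for features, label in examples:                # Step 1: show one email
            guess = predict(weights, bias, features)    # Step 2: guess
            error = label - guess                       # Step 3: compare
            if error != 0:                              # Step 4: nudge if wrong
                weights = [w + step * error * x
                           for w, x in zip(weights, features)]
                bias += step * error
    return weights, bias


weights, bias = train(EXAMPLES)
for features, label in EXAMPLES:
    print(features, "guess:", predict(weights, bias, features), "label:", label)
```

Checking the guesses against the same four emails only shows that the loop ran and the nudges had an effect. It says nothing about how the model would do on emails it has never seen.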

Testing: the honesty check

Here is the trap that catches beginners. After training, you might want to ask: "How good is my model?" If you test it using the same emails it trained on, it will probably score brilliantly. But that score is a lie.

Why? Because the model may have simply memorised those exact emails. Memorising the answers to a specific test does not prove you understand the subject. It just proves you have a good memory. The real question is whether the model can handle emails it has never seen before, because that is what it will face in the real world.

So engineers split their data into two parts before training begins:

A training set (often around 80% of the data) that the model learns from.
A test set (the remaining 20%) that is locked away and never shown during training.

Only after training is finished do you unlock the test set and see how the model does on those fresh examples. That score is honest, because the model could not have memorised answers it never saw. This split is one of the most important ideas in all of machine learning.
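Here is a minimal sketch of that split, using only Python's standard library. The function name split_train_test, the fixed seed and the placeholder emails are assumptions for this example. Shuffling first is a common precaution so that the test set is not just "whatever came last in the file".

```python
import random


def split_train_test(examples, test_share=0.2, seed=0):
    """Shuffle a copy of the data, then cut it into (train, test)."""
    shuffled = list(examples)              # copy so the original is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(round(len(shuffled) * test_share))
    test_set = shuffled[:n_test]           # locked away until training is done
    train_set = shuffled[n_test:]          # the only data the model learns from
    return train_set, test_set


# Placeholder data standing in for 100 labelled emails.
emails = [f"email_{i}" for i in range(100)]
train_set, test_set = split_train_test(emails)

print(len(train_set), len(test_set))
print(set(train_set) & set(test_set))  # empty set: no email is on both sides
```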

Accuracy: a number that can fool you

Once you have a test score, you need a way to describe it. The simplest measure is accuracy: the share of answers the model got right.

Accuracy = correct answers ÷ total answers

If a model judges 100 test emails and gets 95 right, its accuracy is 95%. Simple. But accuracy hides a dangerous trap, and learning to spot it is what separates a careful thinker from a careless one.
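The same arithmetic as a short Python sketch. The function name accuracy and the invented list of 100 labels are assumptions for the example; the five wrong guesses are planted by hand so the result matches the 95 out of 100 case above.

```python
def accuracy(predictions, labels):
    """Share of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# 100 invented test emails: 40 spam, 60 safe.
labels = ["spam"] * 40 + ["safe"] * 60

# Pretend the model got everything right except the first five emails.
predictions = list(labels)
for i in range(5):
    predictions[i] = "safe"   # these were really spam, so these guesses are wrong

print(accuracy(predictions, labels))
```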

Imagine a medical test for a rare disease that only 1 person in 100 actually has. Now imagine a lazy model that always says "no disease" for everyone, without even looking. How accurate is it? It is right 99 times out of 100, so its accuracy is 99%. That sounds amazing! But the model is completely useless: it never finds a single sick person. The one patient who truly needed help was missed.

This is why experts almost never trust accuracy alone. When one answer is much rarer than the other, a high accuracy can hide total failure on the part that matters most.
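The rare-disease example is easy to check in code. In this sketch the 100 patients, the function lazy_model and the variable names are all invented for illustration; 1 stands for "has the disease" and 0 for "no disease".

```python
# 100 invented patients: exactly one has the disease (1), the rest do not (0).
patient_labels = [1] + [0] * 99


def lazy_model(_patient):
    """Always answers 'no disease' without looking at the patient."""
    return 0


lazy_predictions = [lazy_model(p) for p in range(len(patient_labels))]

# Accuracy: share of answers that match the truth.
lazy_correct = sum(p == y for p, y in zip(lazy_predictions, patient_labels))
lazy_accuracy = lazy_correct / len(patient_labels)

# How many of the truly sick patients did the model flag?
sick_found = sum(1 for p, y in zip(lazy_predictions, patient_labels)
                 if p == 1 and y == 1)

print("accuracy:", lazy_accuracy)
print("sick patients found:", sick_found)
```

The accuracy comes out at 0.99, while the count of sick patients found is 0, which is exactly the failure the single accuracy number hides.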

Better ways to measure

To see past the accuracy trap, engineers look at what kind of mistakes a model makes. There are two very different kinds:

A false positive: the model shouts "yes!" when the answer was "no" (a safe email sent to the spam folder).
A false negative: the model says "no" when the answer was "yes" (a dangerous email that slips into your inbox).

These mistakes are not equally bad, and which one matters more depends on the job. For a spam filter, a false positive (losing a real email) might be worse. For a disease test, a false negative (missing a sick patient) is much worse. A good engineer asks, "Which mistake hurts more here?" and measures that, rather than hiding everything inside one accuracy number.

Two measures help here. Precision asks: of all the times the model shouted "yes", how often was it right? Recall asks: of all the real "yes" cases out there, how many did the model actually catch? A model can have brilliant precision and terrible recall, or the other way around. Going back to the rare-disease example, the lazy model that always says "no disease" has zero recall: it caught none of the sick people, no matter how high its overall accuracy looked. Reporting precision and recall together makes that failure impossible to hide.
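As a rough sketch, precision and recall can be computed by first counting the four possible outcomes. The function names and the small invented data sets below are assumptions for this example. Returning None when a measure is undefined (for instance, precision when the model never says "yes") is one possible convention, not the only one.

```python
def confusion_counts(predictions, labels):
    """Count true positives, false positives, false negatives, true negatives.
    1 means 'yes', 0 means 'no'."""
    tp = fp = fn = tn = 0
    for guess, truth in zip(predictions, labels):
        if guess == 1 and truth == 1:
            tp += 1          # said yes, was yes
        elif guess == 1 and truth == 0:
            fp += 1          # false positive: said yes, was no
        elif guess == 0 and truth == 1:
            fn += 1          # false negative: said no, was yes
        else:
            tn += 1          # said no, was no
    return tp, fp, fn, tn


def precision(tp, fp):
    """Of all the 'yes' answers, how many were right? None if there were none."""
    return tp / (tp + fp) if (tp + fp) else None


def recall(tp, fn):
    """Of all the real 'yes' cases, how many were caught? None if there were none."""
    return tp / (tp + fn) if (tp + fn) else None


# A cautious model: it only says "yes" once, and is right that time,
# but it misses two of the three real "yes" cases.
cautious_labels = [1, 1, 1, 0, 0, 0, 0, 0]
cautious_preds = [1, 0, 0, 0, 0, 0, 0, 0]
c_tp, c_fp, c_fn, c_tn = confusion_counts(cautious_preds, cautious_labels)
print("cautious:", precision(c_tp, c_fp), recall(c_tp, c_fn))

# The lazy rare-disease model: always "no" for 100 patients, one of them sick.
lazy_labels = [1] + [0] * 99
lazy_preds = [0] * 100
l_tp, l_fp, l_fn, l_tn = confusion_counts(lazy_preds, lazy_labels)
print("lazy:", precision(l_tp, l_fp), recall(l_tp, l_fn))
```

The cautious model shows brilliant precision with poor recall. The lazy model has a recall of 0.0, and its precision is undefined because it never said "yes" at all.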

Why the data split must be done fairly

There is one more subtle trap worth knowing. Splitting data into training and test sets only works if the split is fair. Suppose you are building a model to recognise people's handwriting, and the same person's samples end up in both the training set and the test set. The model could learn that person's particular style during training and then "recognise" it at test time, scoring high without proving it can read a stranger's handwriting. To avoid this, engineers make sure that closely related examples stay together on the same side of the split. The golden rule is simple: the test set must represent the kind of new data the model will really meet, with no sneaky overlap from training.
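One way to keep related examples together is to split by group rather than by individual example. In the sketch below, each handwriting sample is tagged with its writer, and whole writers are sent to one side or the other. The function name split_by_group, the writer names and the sample names are invented for illustration.

```python
import random


def split_by_group(samples, group_of, test_share=0.2, seed=0):
    """Split so that every sample from the same group lands on the same side."""
    groups = sorted({group_of(s) for s in samples})   # unique groups, stable order
    random.Random(seed).shuffle(groups)               # repeatable shuffle
    n_test = max(1, round(len(groups) * test_share))  # at least one test group
    test_groups = set(groups[:n_test])
    train = [s for s in samples if group_of(s) not in test_groups]
    test = [s for s in samples if group_of(s) in test_groups]
    return train, test


# 10 invented writers with 5 handwriting samples each: (writer, sample_name).
samples = [(f"writer_{w}", f"sample_{w}_{i}")
           for w in range(10) for i in range(5)]

train_samples, test_samples = split_by_group(samples, group_of=lambda s: s[0])

train_writers = {writer for writer, _ in train_samples}
test_writers = {writer for writer, _ in test_samples}
print(len(train_samples), len(test_samples))
print(train_writers & test_writers)  # empty set: no writer appears on both sides
```

With this split, a high test score has to come from reading strangers' handwriting, because no test writer was ever seen during training.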

A quick comparison

Idea	What it is	Why it matters
Training	Model learns from examples	This is where the model gets its skill
Testing	Model is judged on unseen data	The only honest measure of real performance
Accuracy	Share of answers that are correct	Easy to read, but can hide bad failures
False negative	Missed a real "yes"	Often the most dangerous mistake

The honest takeaway

The next time you read that some AI is "95% accurate," do not just nod. Ask three questions. Was it tested on data it had never seen? How rare is the thing it is looking for? And which kind of mistake does it make? Those questions reveal whether the number is a real achievement or a clever illusion. Thinking this carefully about evidence is a skill that helps far beyond AI, in science, maths and everyday life.

Key takeaways