AI | Ages 14-18 | Advanced | 12 min read

Reinforcement Learning (Learning by Reward)

A clear, honest guide to reinforcement learning: how an AI agent learns from rewards through trial and error, with real examples, key terms, and its real limits.

Key takeaways

  • Reinforcement learning trains an agent through trial and error, guided by rewards
  • There are no labelled answers; the agent discovers good moves by their consequences
  • Key parts: an agent, an environment, actions, states, and a reward signal
  • The agent must balance exploring new moves against exploiting what already works
  • Reward design is hard: a badly chosen reward can teach the wrong behaviour

Learning the way you learned to ride a bike

Think about how you learned to ride a bicycle. Nobody handed you a labelled list of exactly how much to lean for every situation. Instead, you tried, you wobbled, you fell, and you adjusted. A successful balance felt good and you did more of it; a fall was a clear signal to do something different. Over many attempts, trial and error turned into skill.

This is the core idea behind reinforcement learning, often shortened to RL. It is a distinct branch of machine learning, different from the supervised and unsupervised approaches covered in Supervised vs Unsupervised Learning. Instead of learning from a fixed set of examples with correct answers, an RL system learns by acting in a world and receiving rewards for the consequences of its actions. It is, quite literally, learning by reward.

The pieces of the puzzle

Every reinforcement learning setup is built from the same handful of parts. Once you know them, you can spot RL anywhere.

  • The agent is the learner and decision-maker, the "brain" we are training. It could control a game character, a robot, or a thermostat.
  • The environment is everything the agent interacts with, the world it lives in. For a chess agent, the environment is the board and the rules.
  • A state is a snapshot of the situation right now, what the agent can observe, such as the current positions of all the chess pieces.
  • An action is a move the agent can make in a given state, such as moving a piece.
  • The reward is a number the environment gives back after an action, signalling how good or bad the outcome was. Winning the game might give +1; losing, -1.

The loop runs like this: the agent observes the current state, chooses an action, the environment responds with a new state and a reward, and the agent uses that reward to update how it will behave in future. Round and round it goes, thousands or millions of times.
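The loop can be sketched in a few lines of Python. This is only an illustration, not a real RL library: the CorridorEnv class, the run_episode function, and the reward of 1.0 for reaching the goal are all made up for this example. The agent here acts at random and does not learn yet; the sketch shows only the observe, act, receive-reward cycle.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells with a goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the state is simply the agent's cell index

    def step(self, action):
        # action is -1 (step left) or +1 (step right); walls clamp the position
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # assumed reward: 1 for reaching the goal
        return self.position, reward, done


def run_episode(env, choose_action, max_steps=100):
    """Run one episode of the agent-environment loop and return the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)            # the agent picks an action
        state, reward, done = env.step(action)   # the environment responds
        total_reward += reward                   # a learning agent would update itself here
        if done:
            break
    return total_reward


def random_policy(state):
    # An agent that has learned nothing: it ignores the state entirely.
    return random.choice([-1, +1])


random.seed(0)
print(run_episode(CorridorEnv(), random_policy))
```

A learning agent would replace random_policy with something that changes its choices based on the rewards it has seen.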

The goal: maximise reward over time

Here is the subtle and powerful part. The agent's goal is not to grab the biggest reward right now. It is to maximise its total reward over the long run. These are not the same thing.

Consider a chess agent. Capturing your opponent's queen gives an immediate thrill, but if it leaves your own king open to checkmate three moves later, it was a terrible choice. A good RL agent learns to value actions by their long-term consequences, sometimes accepting a small loss now for a bigger gain later. This is captured by an idea called the value of a state: not just the reward you get immediately, but the total reward you can expect to collect from that point onward if you keep playing well.
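To make "total reward from that point onward" concrete, here is a small sketch that adds up the rewards along one sequence of moves. The reward numbers are invented purely for illustration (real chess agents are usually rewarded only at the end of the game), and the function name returns_from_each_step is hypothetical. A true state value is an average over many possible futures; this shows just one.

```python
def returns_from_each_step(rewards):
    """Return a list whose i-th entry is the sum of rewards from step i onward."""
    totals = []
    running = 0.0
    for r in reversed(rewards):  # walk backwards so each sum is built in one pass
        running += r
        totals.append(running)
    totals.reverse()
    return totals


# Invented reward sequences for two lines of play.
greedy_line = [9.0, 0.0, 0.0, -100.0]    # grab the queen now, get checkmated later
patient_line = [0.0, -1.0, 0.0, 100.0]   # give up a pawn, win the game later

print(returns_from_each_step(greedy_line))
print(returns_from_each_step(patient_line))
```

The first entry of each list is the total for the whole line of play. The greedy line starts with the bigger immediate reward but ends up with the far worse total.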

Learning these values is the heart of many RL algorithms. The agent gradually builds up an estimate of which situations are promising and which are traps, and it steers towards the promising ones. Crucially, it works all this out from experience, with no human ever labelling the "correct" move. The rewards alone teach it.
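One well-known way to learn such values from experience is tabular Q-learning. The sketch below applies it to a made-up five-cell corridor where only reaching the right-hand end pays a reward. Everything specific here is an assumption for illustration: the corridor, the learning rate alpha, the number of episodes, and the discount factor gamma (a standard RL device, not covered above, that makes rewards arriving sooner count slightly more than rewards arriving later).

```python
import random

N_CELLS = 5          # hypothetical corridor: cells 0..4, goal at cell 4
ACTIONS = (-1, +1)   # step left or step right


def step(state, action):
    """Toy environment dynamics: move, clamp at the walls, reward 1 at the goal."""
    next_state = max(0, min(N_CELLS - 1, state + action))
    done = next_state == N_CELLS - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, max_steps=200, seed=0):
    """Learn a table of action values Q[(state, action)] purely from experience."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Act at random while learning; Q-learning can still learn
            # the values of the best behaviour from random experience.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Value of the best action available afterwards (0 once the episode ends).
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            target = reward + gamma * best_next
            # Nudge the old estimate a fraction alpha of the way towards the target.
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
            if done:
                break
    return q


def greedy_action(q, state):
    """Pick the action with the highest learned value in this state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


q = q_learning()
print([greedy_action(q, s) for s in range(N_CELLS - 1)])
```

Nobody tells the agent that "right" is correct. The reward at the goal gradually propagates backwards through the table until stepping right looks best in every cell.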

Explore or exploit? The central dilemma

Suppose you have found a restaurant you like. Every visit is reliably good. But there might be an even better restaurant next door you have never tried. Do you keep returning to the safe favourite, or risk a meal at the unknown one?

This is the exploration versus exploitation dilemma, and it sits at the very centre of reinforcement learning.

  • Exploitation means using what you already know works, choosing the action that has paid off before.
  • Exploration means trying something new, which might be worse, but might be much better and is the only way to discover improvements.

An agent that only exploits will get stuck on the first decent strategy it finds and never discover anything better. An agent that only explores wastes its time on random moves and never settles into good behaviour. Successful RL carefully balances the two, often exploring a lot early on and exploiting more as it grows confident. Getting this balance right is one of the reasons RL is hard to do well.
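A common, simple way to strike this balance is called epsilon-greedy: with probability epsilon the agent explores by picking at random, and otherwise it exploits its current best guess. The sketch below applies it to the restaurant problem and shrinks epsilon over time, so the agent explores a lot early on and exploits more later. The function name run_bandit, the quality scores, the noise level, and the linear schedule for epsilon are all assumptions chosen for illustration.

```python
import random


def run_bandit(true_means, steps=2000, epsilon_start=1.0, epsilon_end=0.05, seed=0):
    """Epsilon-greedy choice among options with unknown average reward."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n        # how often each option has been tried
    estimates = [0.0] * n   # running average reward seen for each option
    for t in range(steps):
        # Shrink epsilon linearly from epsilon_start down to epsilon_end.
        epsilon = epsilon_start + (epsilon_end - epsilon_start) * t / max(1, steps - 1)
        if rng.random() < epsilon:
            choice = rng.randrange(n)                          # explore: try anything
        else:
            choice = max(range(n), key=lambda i: estimates[i])  # exploit: best so far
        reward = rng.gauss(true_means[choice], 1.0)            # a noisy experience
        counts[choice] += 1
        # Incremental update of the running average for the chosen option.
        estimates[choice] += (reward - estimates[choice]) / counts[choice]
    return estimates, counts


# Two hypothetical restaurants: the familiar one and a better one next door.
estimates, counts = run_bandit([3.5, 4.5])
print(estimates, counts)
```

Setting epsilon_start and epsilon_end both to 0 gives a pure exploiter, and setting both to 1 gives a pure explorer, which makes it easy to see both failure modes described above.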

Where reinforcement learning shines

RL has produced some of the most striking results in modern AI, and it is worth being precise about why.

Games. RL systems have learned to play Go, chess, and many video games at superhuman levels, often by playing millions of games against themselves. Games are an ideal testing ground because the rules are clear, the reward (winning) is obvious, and you can run unlimited fast trials safely. The lesson Games Computers Can Play explores this further.

Robotics and control. RL can teach a robot to walk, grasp objects, or balance, learning motions that would be extremely hard to program by hand. Much of this training happens in simulation, where a robot can fail a million times without breaking anything real.

Optimisation problems. RL has been used to manage energy use in data centres, route delivery vehicles, and tune complex systems, anywhere the goal is to make a long sequence of decisions that add up to a good outcome.

The honest limits

It would be easy to read all this and conclude RL is a universal solution. It is not, and being clear-eyed about its weaknesses is part of understanding it properly.

It is hungry for trials. RL often needs millions of attempts to learn well. That is fine inside a fast simulation, but in the real world, where each trial takes real time and real mistakes can be costly or dangerous, this is a serious obstacle. You cannot crash a real self-driving car a million times to teach it.

Reward design is treacherous. An agent optimises exactly what you reward, not what you intended. If you reward a cleaning robot for the amount of mess it collects, it might learn to knock things over so it can clean them up again. These "reward hacking" failures are common and sometimes funny, but they reveal a deep problem: translating a real goal into a reward number is genuinely hard, and small mistakes lead to bizarre behaviour. This is one reason RL connects so closely to questions in AI Ethics and Fairness: an agent rewarded for the wrong thing can cause real harm while technically doing its job perfectly.
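The cleaning-robot example can be turned into a tiny simulation. This is a hand-written toy, not a trained agent: the simulate function, the two policies, and the action names are hypothetical. It compares a robot that behaves as intended with one that exploits the reward, using "pieces of mess collected" as the reward.

```python
def simulate(policy, initial_mess=3, steps=10):
    """Run a toy cleaning robot and report the reward alongside what really happened."""
    mess = initial_mess
    collected = 0      # the reward we chose: pieces of mess picked up
    knocked_over = 0   # damage the reward never mentions
    for _ in range(steps):
        action = policy(mess)
        if action == "clean" and mess > 0:
            mess -= 1
            collected += 1
        elif action == "knock_over":
            mess += 1
            knocked_over += 1
        # any other action (such as "wait") changes nothing
    return {"reward": collected, "mess_left": mess, "knocked_over": knocked_over}


def tidy_policy(mess):
    # What the designer intended: clean up, then stop.
    return "clean" if mess > 0 else "wait"


def hacking_policy(mess):
    # What the reward actually encourages: make new mess whenever none is left.
    return "clean" if mess > 0 else "knock_over"


print("tidy:   ", simulate(tidy_policy))
print("hacking:", simulate(hacking_policy))
```

The hacking policy earns twice the reward of the tidy one while leaving the room messier and knocking things over. One possible repair is to reward how clean the room is at the end instead of how much mess was collected, although every such fix needs checking for loopholes of its own.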

Training can be unstable. Small changes in settings can make the difference between an agent that learns brilliantly and one that never learns at all. RL results can also be hard to reproduce.

It is not human understanding. When an RL agent masters a game, it has found a strategy that earns reward, not an understanding of the game in any human sense. Move it to a slightly different situation it never trained on and it can fail completely, because it learned a narrow policy, not general wisdom.

The big picture

Reinforcement learning captures something genuinely deep: that intelligent behaviour can emerge from nothing more than trying things and learning from their consequences, guided by a signal of what is good and what is bad. That single idea has produced agents that beat world champions and robots that teach themselves to move.

But the same idea carries real responsibility. Because an agent will relentlessly chase whatever reward you give it, the question "What exactly are we rewarding?" becomes one of the most important questions in the whole field. Learning by reward is powerful precisely because it takes the reward seriously, which means we have to take it seriously too.

Quick quiz


How does a reinforcement learning agent learn?

What is the 'reward signal'?

What is the exploration vs exploitation dilemma?

Why is reward design considered difficult?

How does reinforcement learning differ from supervised learning?

FAQ

Reinforcement learning is loosely inspired by how animals and people learn from rewards and punishments, and the comparison is useful for intuition. But real RL algorithms are mathematical procedures, not models of the brain. Treat the animal analogy as a helpful picture, not a literal claim about biology.

Reinforcement learning is not the right tool for every problem. It needs huge numbers of trials, which is fine in a simulation but risky or impossible in the real world, where mistakes can be costly or dangerous. It is also sensitive to reward design and can be unstable to train. For many problems, supervised learning is simpler and more reliable.