Coding | Ages 14-18 | Intermediate | 11 min read

How Data Is Compressed

How data compression works: lossless vs lossy, run-length encoding, Huffman coding, and why JPEG and MP3 throw data away. Clear worked examples and a quiz for teens.

Key takeaways

  • Compression makes files smaller by storing the same information in fewer bits
  • Lossless compression rebuilds the original exactly (ZIP, PNG); lossy throws away detail you barely notice (JPEG, MP3)
  • Run-length encoding replaces repeated values with a count; Huffman coding gives common symbols shorter codes
  • Lossy compression can shrink media far more, but each save can lose a little more quality

Why we squeeze data

Every file on a computer is stored as bits: 1s and 0s. A photo, a song, or a document might be made of millions of them. Compression is the art of storing the same information using fewer bits, so files take less space and travel faster across a network.

Compression is everywhere: ZIP archives, the JPEG photos on your phone, the MP3 and streaming audio you listen to, and the video you watch online. Without it, a single high-quality movie could fill an entire hard drive.
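To get a feel for the scale, here is a rough back-of-the-envelope calculation. The resolution, frame rate, and running time are assumed values chosen for illustration, and the sound track is ignored.

```python
# Rough size of an uncompressed movie. All numbers are assumed for illustration.
WIDTH, HEIGHT = 1920, 1080   # pixels per frame (Full HD)
BYTES_PER_PIXEL = 3          # one byte each for red, green, and blue
FRAMES_PER_SECOND = 24
MINUTES = 120                # a two-hour film


def raw_video_bytes(width, height, bytes_per_pixel, fps, minutes):
    """Bytes needed to store every pixel of every frame with no compression."""
    bytes_per_frame = width * height * bytes_per_pixel
    total_frames = fps * minutes * 60
    return bytes_per_frame * total_frames


size = raw_video_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL, FRAMES_PER_SECOND, MINUTES)
print(f"{size / 1e12:.2f} TB")  # 1.07 TB for the picture alone
```

Under these assumptions the raw picture data comes to roughly a terabyte, which is about the size of a typical consumer hard drive.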

There are two big families of compression, and the difference between them matters a lot.

Lossless: get it all back

Lossless compression shrinks a file in a way that is completely reversible. When you decompress it, you get back exactly the original data, bit for bit. Nothing is lost. This is essential for things where every detail counts: text, program code, spreadsheets, and images with sharp lines.
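You can check the "bit for bit" promise yourself. The sketch below uses Python's built-in zlib module, which implements the DEFLATE method also found inside ZIP and PNG. The sample sentence is made up for the demonstration.

```python
import zlib

# A made-up, very repetitive sample so the saving is easy to see.
original = ("the quick brown fox jumps over the lazy dog. " * 200).encode("utf-8")

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "bytes before")
print(len(compressed), "bytes after")
print(restored == original)  # True: every byte came back unchanged
```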

ZIP files, PNG images, and FLAC audio all use lossless compression. The trick is to find and remove repetition and patterns in the data, then describe them more briefly. Let's see two classic methods.

Run-length encoding

Imagine a row of pixels in a simple image, where W is white and B is black:

WWWWWWWWWWWWBBBWWWWWWWWW

That is 24 letters. Run-length encoding (RLE) replaces each "run" of the same value with the value and a count:

12W 3B 9W

Now we store three short pairs instead of 24 letters. RLE works brilliantly on data with long runs, such as plain backgrounds, scanned documents, and simple icons. It works poorly on noisy data where values keep changing, because then there are no long runs to shorten.
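The pixel row above can be encoded and decoded in a few lines of Python. This is a minimal sketch: the function names are invented for this example, and real RLE formats pack the counts into bytes rather than text.

```python
from itertools import groupby


def rle_encode(text):
    """Turn 'WWWB' into [(3, 'W'), (1, 'B')]."""
    return [(len(list(run)), value) for value, run in groupby(text)]


def rle_decode(pairs):
    """Rebuild the original string from (count, value) pairs."""
    return "".join(value * count for count, value in pairs)


def rle_to_text(pairs):
    """Show the pairs in the '12W 3B 9W' style used in this article."""
    return " ".join(f"{count}{value}" for count, value in pairs)


row = "W" * 12 + "B" * 3 + "W" * 9   # the pixel row from the example
pairs = rle_encode(row)
print(rle_to_text(pairs))            # 12W 3B 9W
print(rle_decode(pairs) == row)      # True: nothing was lost

noisy = "WBWBWB"                     # no long runs at all
print(rle_to_text(rle_encode(noisy)))  # 1W 1B 1W 1B 1W 1B
```

The last line shows the weakness: with no runs to shorten, the "compressed" version is longer than the input.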

Huffman coding

Normally every character uses the same number of bits β€” for example 8 bits each. But in real text, some symbols appear far more often than others. Huffman coding takes advantage of this by giving common symbols short codes and rare symbols long codes.

Suppose a message uses only four letters with these frequencies:

A: very common   →  code 0
B: common        →  code 10
C: rare          →  code 110
D: rare          →  code 111

The letter A now takes just 1 bit instead of 8. Because A appears so often, the average number of bits per letter drops well below 8, and the whole message gets smaller. Decompression still works perfectly because no code is the start of another code, so the decoder always knows where one symbol ends. Real ZIP tools combine ideas like this with pattern-matching to do even better.
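Here is a small sketch of those four codes in action. The sample message and its letter counts (8 A, 4 B, 2 C, 1 D) are assumptions made up for this example, and the helper names are invented too. The last function builds a Huffman tree from those counts to check that it arrives at the same code lengths.

```python
import heapq
from itertools import count

# Code table from the article.
CODES = {"A": "0", "B": "10", "C": "110", "D": "111"}


def encode(message, codes):
    """Swap each letter for its bit string."""
    return "".join(codes[symbol] for symbol in message)


def decode(bits, codes):
    """Read bits until they match a code. No code is the start of another,
    so the first match is always the right one."""
    lookup = {code: symbol for symbol, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    return "".join(out)


def huffman_code_lengths(frequencies):
    """Repeatedly merge the two rarest groups; each merge adds one bit
    to the code of every symbol inside those groups."""
    tiebreak = count()  # keeps heap entries comparable when counts are equal
    heap = [(freq, next(tiebreak), {symbol: 0}) for symbol, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


message = "AAAAAAAABBBBCCD"   # assumed sample: 8 A, 4 B, 2 C, 1 D
bits = encode(message, CODES)
print(len(message) * 8)                 # 120 bits at 8 bits per letter
print(len(bits))                        # 25 bits with the short codes
print(decode(bits, CODES) == message)   # True
lengths = huffman_code_lengths({"A": 8, "B": 4, "C": 2, "D": 1})
print(lengths["A"], lengths["B"], lengths["C"], lengths["D"])  # 1 2 3 3
```

For this sample the average works out to 25 / 15, or about 1.7 bits per letter instead of 8. A different message with different counts would give different savings.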

Lossy: throw away what you won't miss

Lossy compression takes a bolder approach: it permanently discards some data to make files much smaller. The cleverness is in choosing data your senses are unlikely to miss. You cannot get the original back exactly, but if it is done well, you cannot tell.
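One simple way to discard data is to round values, a step often called quantization. The sketch below rounds made-up brightness values to the nearest multiple of 8. It only illustrates the idea: real formats such as JPEG apply rounding to a transformed version of the image, not to raw pixel values like this.

```python
def quantize(samples, step):
    """Round each value to the nearest multiple of `step`. This is the lossy part."""
    return [(value + step // 2) // step * step for value in samples]


# Made-up brightness values: a dark patch next to a bright patch.
brightness = [13, 14, 15, 14, 200, 201, 199, 202]
lossy = quantize(brightness, step=8)

print(lossy)                 # [16, 16, 16, 16, 200, 200, 200, 200]
print(lossy == brightness)   # False: the small differences are gone for good
print(max(abs(a - b) for a, b in zip(brightness, lossy)))  # 3, the biggest error
```

Notice that the rounded list is made of two long runs, so a lossless method such as run-length encoding can now shrink it much further than the original.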

JPEG (photos). Human eyes are very sensitive to brightness but much less sensitive to fine colour detail and tiny changes between neighbouring pixels. JPEG keeps the important structure of an image and throws away subtle detail we are unlikely to notice. This can shrink a photo to a small fraction of its original size. Push it too far, though, and you start to see blocky artefacts, which are the gaps left by the missing detail becoming visible.

MP3 and streaming audio. These use a model of human hearing. If a loud sound and a much quieter sound at a similar pitch happen at the same moment, you usually cannot hear the quiet one: it is masked. Lossy audio compression removes sounds you would probably not have heard anyway, plus frequencies that are too high or too low to matter, saving a great deal of space.

Video (MP4, etc.) goes further still by noticing that most of one frame looks almost identical to the frame before it, so it only stores what changed.
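The "only store what changed" idea can be sketched with two tiny made-up frames, each just a list of ten brightness values. Real video formats are far more sophisticated than this; the function names here are invented for the example.

```python
def frame_delta(previous, current):
    """List (position, new_value) for every pixel that changed."""
    return [
        (i, new)
        for i, (old, new) in enumerate(zip(previous, current))
        if old != new
    ]


def apply_delta(previous, delta):
    """Rebuild the next frame from the previous frame plus the changes."""
    frame = list(previous)
    for index, value in delta:
        frame[index] = value
    return frame


frame1 = [0, 0, 0, 9, 9, 0, 0, 0, 0, 0]
frame2 = [0, 0, 0, 0, 9, 9, 0, 0, 0, 0]   # the bright blob moved one pixel right

delta = frame_delta(frame1, frame2)
print(delta)                                # [(3, 0), (5, 9)]
print(apply_delta(frame1, delta) == frame2)  # True
```

Only two changes need storing instead of all ten pixels, and the saving grows with the size of the frame.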

Lossless vs lossy: choosing well

                Lossless                Lossy
Reversible?     Yes, exact              No, data is discarded
Typical use     Text, code, PNG, ZIP    Photos, music, video
Size saving     Modest                  Often huge
Risk            None                    Quality loss if overdone

The rule of thumb: use lossless when every bit matters, and lossy when a small, unnoticeable loss is worth a much smaller file.

A limit worth knowing

Compression depends on patterns. Data that is already random, or already compressed, has almost no patterns left to exploit. That is why zipping a folder of JPEGs barely shrinks them, and why no lossless algorithm can shrink every possible file. Compression works by using up predictability, and once that is used up there is nothing left to squeeze.
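You can watch this limit appear with Python's built-in zlib module. The data below is generated for the experiment: one block uses only four different byte values, the other uses random bytes. Exact sizes may vary between zlib versions, so none are shown.

```python
import random
import zlib

rng = random.Random(0)   # fixed seed so the run is repeatable
SIZE = 100_000

# Only four different symbols: lots of predictability to exploit.
few_symbols = bytes(rng.choice(b"ABCD") for _ in range(SIZE))
# Every byte value equally likely: nothing to exploit.
noise = bytes(rng.getrandbits(8) for _ in range(SIZE))

once = zlib.compress(few_symbols)
twice = zlib.compress(once)          # compressing the compressed data
packed_noise = zlib.compress(noise)

print("four symbols:    ", len(few_symbols), "->", len(once))
print("compressed again:", len(once), "->", len(twice))
print("random bytes:    ", len(noise), "->", len(packed_noise))
```

The first line should show a big saving. The second and third should show almost none, and may even show a slight increase from the format's bookkeeping.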

Try this activity

Be a compressor. Take a short string with lots of repetition, such as AAAAABBBBBBBBCCAAAA, and write its run-length encoding by hand. Count the characters before and after. Then try a string with no repeats, like ABCDEFG, and explain why RLE makes it longer. Finally, list three files on your device that are probably lossy (photos, songs, videos) and three that must be lossless (a document, your code, a spreadsheet).

To understand the 1s and 0s being compressed, see How Images and Sound Are Stored as Data, and for the patterns algorithms search for, Lists and Arrays.

Quick quiz


What is the goal of data compression?

What is the key feature of lossless compression?

How does run-length encoding shrink data?

In Huffman coding, which symbols get the shortest codes?

Why is a JPEG usually much smaller than the original photo data?

FAQ

It can. Compression relies on patterns and repetition. Data that is already random or already compressed (like a JPEG inside a ZIP) has almost no patterns left, so the file may shrink only slightly or even grow a little because of the extra bookkeeping the format adds.

JPEG is lossy, so each save throws away a bit more detail. Re-saving a JPEG decompresses the already-damaged image and compresses it again, adding new losses on top of the old ones. This build-up is sometimes called generation loss.

Use lossless (PNG, ZIP, FLAC) when every bit matters: text, code, spreadsheets, or images with sharp edges and text. Use lossy (JPEG, MP3, MP4) for photos, music, and video where small, unnoticeable losses are an acceptable trade for much smaller files.