How Data Is Compressed
How data compression works: lossless vs lossy, run-length encoding, Huffman coding, and why JPEG and MP3 throw data away. Clear worked examples and a quiz for teens.
Key takeaways
- Compression makes files smaller by storing the same information in fewer bits
- Lossless compression rebuilds the original exactly (ZIP, PNG); lossy throws away detail you barely notice (JPEG, MP3)
- Run-length encoding replaces repeated values with a count; Huffman coding gives common symbols shorter codes
- Lossy compression can shrink media far more, but each save can lose a little more quality
Why we squeeze data
Every file on a computer is stored as bits: 1s and 0s. A photo, a song, or a document might be made of millions of them. Compression is the art of storing the same information using fewer bits, so files take less space and travel faster across a network.
Compression is everywhere: ZIP archives, the JPEG photos on your phone, the MP3 and streaming audio you listen to, and the video you watch online. Without it, a single high-quality movie could fill an entire hard drive.
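To get a feel for the movie claim, here is a rough back-of-the-envelope calculation. The figures are assumptions chosen for illustration (a 2-hour film at 1920x1080, 24 frames per second, 3 bytes of colour per pixel, no sound), not measurements of any real file.

```python
# Assumed figures, for illustration only: a 2-hour film at 1920x1080,
# 24 frames per second, 3 bytes (24 bits) of colour per pixel, no audio.
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 3
FRAMES_PER_SECOND = 24
DURATION_SECONDS = 2 * 60 * 60


def uncompressed_video_bytes(width, height, bytes_per_pixel, fps, seconds):
    """Return the raw size if every pixel of every frame is stored in full."""
    bytes_per_frame = width * height * bytes_per_pixel
    return bytes_per_frame * fps * seconds


total = uncompressed_video_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL,
                                 FRAMES_PER_SECOND, DURATION_SECONDS)
print(f"{total:,} bytes, about {total / 1e12:.2f} TB")
# prints: 1,074,954,240,000 bytes, about 1.07 TB
```

Under these assumptions the raw film comes to roughly a terabyte, which is about the size of a whole consumer hard drive.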
There are two big families of compression, and the difference between them matters a lot.
Lossless: get it all back
Lossless compression shrinks a file in a way that is completely reversible. When you decompress it, you get back exactly the original data, bit for bit, with nothing lost. This is essential for things where every detail counts: text, program code, spreadsheets, and images with sharp lines.
ZIP files, PNG images, and FLAC audio all use lossless compression. The trick is to find and remove repetition and patterns in the data, then describe them more briefly. Let's see two classic methods.
Run-length encoding
Imagine a row of pixels in a simple image, where W is white and B is black:
WWWWWWWWWWWWBBBWWWWWWWWW
That is 24 letters. Run-length encoding (RLE) replaces each "run" of the same value with the value and a count:
12W 3B 9W
Now we store three short pairs instead of 24 letters. RLE works brilliantly on data with long runs, such as plain backgrounds, scanned documents, and simple icons. It works poorly on noisy data where values keep changing, because then there are no long runs to shorten.
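The pixel row above can be run-length encoded in a few lines of Python. This is a minimal sketch rather than any real file format: the function names rle_encode and rle_decode are made up for this example, and the pairs are kept as a Python list instead of being packed into bits.

```python
from itertools import groupby


def rle_encode(text):
    """Return a list of (count, value) pairs, one per run of equal values."""
    return [(len(list(run)), value) for value, run in groupby(text)]


def rle_decode(pairs):
    """Rebuild the original string from (count, value) pairs."""
    return "".join(value * count for count, value in pairs)


row = "W" * 12 + "B" * 3 + "W" * 9  # the pixel row from the example
pairs = rle_encode(row)
print(" ".join(f"{count}{value}" for count, value in pairs))  # 12W 3B 9W

# Lossless: decoding the pairs gives back exactly the row we started with.
assert rle_decode(pairs) == row

# Noisy data has no long runs, so every single value becomes its own pair.
print(len(rle_encode("WBWBWB")))  # 6 pairs for 6 letters
```

The last line shows the weakness: when nothing repeats, RLE stores a count next to every value and the "compressed" version ends up bigger than the original.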
Huffman coding
Normally every character uses the same number of bits, for example 8 bits each. But in real text, some symbols appear far more often than others. Huffman coding takes advantage of this by giving common symbols short codes and rare symbols long codes.
Suppose a message uses only four letters with these frequencies:
A: very common, code 0
B: common, code 10
C: rare, code 110
D: rare, code 111
The letter A now takes just 1 bit instead of 8. Because A appears so often, the average number of bits per letter drops well below 8, and the whole message gets smaller. Decompression still works perfectly because no code is the start of another code, so the decoder always knows where one symbol ends. Real ZIP tools combine ideas like this with pattern-matching to do even better.
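Here is a small sketch of how codes like these can be built and used. The message and its letter counts (8 As, 4 Bs, 2 Cs, 2 Ds) are assumptions invented for this example, and the function names are hypothetical. The method is the standard Huffman idea: keep merging the two least common groups, adding one bit to the front of their codes each time.

```python
import heapq
from collections import Counter


def build_huffman_codes(text):
    """Return a dict mapping each symbol in text to its Huffman bit string."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:  # a single distinct symbol still needs one bit
        return {symbol: "0" for symbol in counts}
    # Heap entries: (total count, tie-breaker, {symbol: code so far}).
    heap = [(count, order, {symbol: ""})
            for order, (symbol, count) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_order = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent groups, adding a bit to each code.
        count_a, _, codes_a = heapq.heappop(heap)
        count_b, _, codes_b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes_a.items()}
        merged.update({s: "1" + c for s, c in codes_b.items()})
        heapq.heappush(heap, (count_a + count_b, next_order, merged))
        next_order += 1
    return heap[0][2]


def encode(text, codes):
    """Join the code of every symbol into one long bit string."""
    return "".join(codes[symbol] for symbol in text)


def decode(bits, codes):
    """Read bits until they match a code; no code is the start of another."""
    lookup = {code: symbol for symbol, code in codes.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            symbols.append(lookup[current])
            current = ""
    return "".join(symbols)


message = "ABACABADABACABAD"  # assumed sample: 8 As, 4 Bs, 2 Cs, 2 Ds
codes = build_huffman_codes(message)
bits = encode(message, codes)
print(codes)      # {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
print(len(bits))  # 28 bits, instead of 16 letters x 8 bits = 128
assert decode(bits, codes) == message
```

For this sample the average works out at 28 / 16 = 1.75 bits per letter. With different letter counts the exact codes may come out differently, but common letters will still get the shorter ones.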
Lossy: throw away what you won't miss
Lossy compression takes a bolder approach: it permanently discards some data to make files much smaller. The cleverness is in choosing data your senses are unlikely to miss. You cannot get the original back exactly, but if it is done well, you cannot tell.
JPEG (photos). Human eyes are very sensitive to brightness but much less sensitive to fine colour detail and tiny changes between neighbouring pixels. JPEG keeps the important structure of an image and throws away subtle detail we are unlikely to notice. This can shrink a photo to a small fraction of its original size. Push it too far, though, and you start to see blocky artefacts: that is the lost data showing through.
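The core idea of "throwing away subtle detail" can be shown with a toy example. This is a deliberately simplified sketch, not the JPEG algorithm itself: real JPEG rounds off frequency values produced by a mathematical transform, while this example just rounds raw brightness values. The pixel values and the step size of 8 are assumptions made up for illustration.

```python
def quantize(values, step):
    """Store each value as a whole number of steps (this is where detail is lost)."""
    return [round(value / step) for value in values]


def dequantize(levels, step):
    """Turn step counts back into approximate brightness values."""
    return [level * step for level in levels]


STEP = 8  # assumed step size: bigger steps mean smaller files but more loss
pixels = [200, 201, 203, 202, 198, 120, 121, 119]  # assumed brightness values

levels = quantize(pixels, STEP)
restored = dequantize(levels, STEP)

print(levels)    # [25, 25, 25, 25, 25, 15, 15, 15]
print(restored)  # [200, 200, 200, 200, 200, 120, 120, 120]

# The tiny differences between neighbours are gone for good.
assert restored != pixels
```

Notice two things. The restored values are close to the originals but not identical, so the step cannot be undone. And the rounded values now contain long runs of the same number, which is exactly the kind of pattern a lossless method can then squeeze further.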
MP3 and streaming audio. These use a model of human hearing. If a loud sound and a much quieter sound happen at the same moment, you cannot hear the quiet one because it is masked. Lossy audio compression simply removes sounds you would not have heard anyway, plus frequencies that are too high or too low to matter, saving enormous space.
Video (MP4, etc.) goes further still by noticing that most of one frame looks almost identical to the frame before it, so it only stores what changed.
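The "only store what changed" idea can be sketched like this. It is a simplified illustration under stated assumptions: each frame is a flat list of brightness values of the same length, and the function names are hypothetical. Real video formats use far more sophisticated methods, such as tracking blocks that move between frames.

```python
def frame_delta(previous, current):
    """Return {position: new_value} for every pixel that changed."""
    return {index: new
            for index, (old, new) in enumerate(zip(previous, current))
            if old != new}


def apply_delta(previous, delta):
    """Rebuild the current frame from the previous frame plus the changes."""
    frame = list(previous)
    for index, value in delta.items():
        frame[index] = value
    return frame


# Assumed toy frames: 16 pixels each, with only two pixels changing.
frame1 = [10] * 16
frame2 = list(frame1)
frame2[5] = 200
frame2[6] = 210

delta = frame_delta(frame1, frame2)
print(delta)  # {5: 200, 6: 210}

# Two stored changes are enough to rebuild all 16 pixels of the next frame.
assert apply_delta(frame1, delta) == frame2
```

This step on its own loses nothing. Video formats get their big savings by combining it with lossy steps like the ones used for photos.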
Lossless vs lossy: choosing well
| | Lossless | Lossy |
|---|---|---|
| Reversible? | Yes, exact | No, data is discarded |
| Typical use | Text, code, PNG, ZIP | Photos, music, video |
| Size saving | Modest | Often huge |
| Risk | None | Quality loss if overdone |
The rule of thumb: use lossless when every bit matters, and lossy when a small, unnoticeable loss is worth a much smaller file.
A limit worth knowing
Compression depends on patterns. Data that is already random, or already compressed, has almost no patterns left to exploit. That is why zipping a folder of JPEGs barely shrinks them, and why no algorithm can compress every possible file. Compression trades away predictability, and you can only do that once.
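You can see this limit for yourself with Python's built-in zlib module, which uses the same family of compression as ZIP. The sample data below is made up for the experiment: one block of text built from a tiny vocabulary (lots of patterns) and one block of pseudo-random bytes (no patterns). Exact sizes may vary between zlib versions, so none are quoted here.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so every run uses the same sample data

# Assumed sample 1: 2,000 words drawn from a tiny vocabulary (many patterns).
words = [b"data", b"bits", b"file", b"code", b"zip", b"run"]
patterned = b" ".join(rng.choice(words) for _ in range(2000))

# Assumed sample 2: the same number of bytes, but with no patterns at all.
noisy = bytes(rng.getrandbits(8) for _ in range(len(patterned)))

once = zlib.compress(patterned)     # compress the patterned text
twice = zlib.compress(once)         # try to compress the result again
noisy_packed = zlib.compress(noisy)

print("patterned text:  ", len(patterned), "bytes")
print("compressed once: ", len(once), "bytes")
print("compressed twice:", len(twice), "bytes")
print("random bytes:    ", len(noisy), "->", len(noisy_packed), "bytes")

# Lossless check: decompressing gives back the exact original.
assert zlib.decompress(once) == patterned
```

The first compression should shrink the text a lot. The second pass and the random bytes should barely change in size, and may even grow by a few bytes because of the bookkeeping the format adds.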
Try this activity
Be a compressor. Take a short string with lots of repetition, such as AAAAABBBBBBBBCCAAAA, and write its run-length encoding by hand. Count the characters before and after. Then try a string with no repeats, like ABCDEFG, and explain why RLE makes it longer. Finally, list three files on your device that are probably lossy (photos, songs, videos) and three that must be lossless (a document, your code, a spreadsheet).
To understand the 1s and 0s being compressed, see How Images and Sound Are Stored as Data, and for the patterns algorithms search for, Lists and Arrays.
Quick quiz
What is the goal of data compression?
Compression encodes the same information in fewer bits, so files take less space and transfer faster.
What is the key feature of lossless compression?
Lossless compression is reversible: decompressing restores the original data bit for bit.
How does run-length encoding shrink data?
Run-length encoding stores 'this value, repeated N times' instead of writing the value N times.
In Huffman coding, which symbols get the shortest codes?
Huffman coding gives frequent symbols short codes and rare symbols longer codes, lowering the average length.
Why is a JPEG usually much smaller than the original photo data?
JPEG is lossy: it discards fine detail and subtle colour changes we are unlikely to see, which saves a lot of space.
FAQ
Compression does not always make a file smaller. It relies on patterns and repetition. Data that is already random or already compressed (like a JPEG inside a ZIP) has almost no patterns left, so the file may shrink only slightly or even grow a little because of the extra bookkeeping the format adds.
JPEG is lossy, so each save throws away a bit more detail. Re-saving a JPEG decompresses the already-damaged image and compresses it again, adding new losses on top of the old ones. This build-up is sometimes called generation loss.
Use lossless (PNG, ZIP, FLAC) when every bit matters: text, code, spreadsheets, or images with sharp edges and text. Use lossy (JPEG, MP3, MP4) for photos, music, and video where small, unnoticeable losses are an acceptable trade for much smaller files.