It is not that complicated. To make a simple example with strings:
AAAABBBABABAB takes up 13 characters, but written (compressed) as 4A3B3AB, reading 3AB as "AB three times", it takes up only 7, cutting the size nearly in half.
Now double it to AAAABBBABABABAAAABBBABABAB with 26 characters and write it as 2(4A3B3AB) with 10 characters, and it takes only about 38% of the space.
Compression algorithms just look for those repeated patterns.
Take those letters and imagine them as the colored pixels of a picture, and you have the basic idea of compressing an image.
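To make the letter example concrete, here is a rough sketch of run-length encoding in Python. The names `rle_encode` and `rle_decode` are made up for illustration, and the sketch assumes the input contains no digits. It only collapses runs of a single repeated character, so it does not find repeated groups like the ABABAB part.

```python
import re
from itertools import groupby

def rle_encode(text: str) -> str:
    """Collapse each run of identical characters into <count><char>."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))

def rle_decode(encoded: str) -> str:
    """Expand every <count><char> pair back into a run (assumes no digits in the data)."""
    pairs = re.findall(r"(\d+)(\D)", encoded)
    return "".join(char * int(count) for count, char in pairs)

print(rle_encode("AAAABBB"))        # 4A3B
print(rle_encode("AAAABBBABABAB"))  # 4A3B1A1B1A1B1A1B
```

Note that the second result is longer than the input: a naive scheme only pays off when the data really contains long runs, which is why real compressors look for repeated groups as well.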
Once you get into audio, images, and video, it revolves a lot more around converting temporal and/or positional data into the frequency domain than around simple token replacement.
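As a rough illustration of the frequency-domain idea, here is a sketch of a plain discrete cosine transform (DCT-II) in pure Python. The function names and the 8-sample test signal are assumptions chosen for the example, not how any particular codec does it. The point is that a smooth signal needs only a few frequency coefficients, so the rest can be dropped or stored cheaply.

```python
import math

def dct(signal):
    """Unnormalized DCT-II: samples -> frequency coefficients."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]

def idct(coeffs):
    """Inverse of dct() above: frequency coefficients -> samples."""
    n = len(coeffs)
    return [
        coeffs[0] / n
        + (2 / n) * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k) for k in range(1, n))
        for i in range(n)
    ]

# Eight samples of a single smooth cosine wave (hypothetical input).
samples = [math.cos(math.pi / 8 * (i + 0.5) * 2) for i in range(8)]
coeffs = dct(samples)

# Zero out coefficients that are only floating-point noise.
kept = [c if abs(c) > 1e-9 else 0.0 for c in coeffs]
restored = idct(kept)

print(sum(1 for c in kept if c != 0.0), "of", len(kept), "coefficients needed")
```

Eight samples collapse to a single meaningful coefficient here. Real audio and image data is messier, so lossy codecs keep the large coefficients and round or discard the small ones.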
Fair enough. The general idea is correct, I just found that example rather jarring... It is generally more difficult to compress an already small amount of data anyway.
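A quick way to see the small-data point for yourself, using Python's standard `zlib` module (the inputs are just the string from the example, once as-is and once repeated 1000 times):

```python
import zlib

small = b"AAAABBBABABAB"
large = small * 1000  # same pattern, repeated many times

for label, data in (("small", small), ("large", large)):
    packed = zlib.compress(data)
    print(f"{label}: {len(data)} bytes -> {len(packed)} bytes")

# Even empty input produces a few bytes of header and checksum.
print("empty:", len(zlib.compress(b"")), "bytes")
```

The format has a fixed overhead, so there is little room to win on a 13-byte input, while the long repeated input shrinks to a tiny fraction of its size.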