Benford’s Law is really quite amazing, at least at first glance: for a wide variety of kinds of data, about 30% of the numbers will begin with a 1, about 18% with a 2, and so on down to just under 5% beginning with a 9. Can you spot the fake list of populations of European countries?
|Country or territory||List #1||List #2|
|Bosnia and Herzegovina||3,964,388||3,014,202|
|Isle of Man (UK)||73,873||43,345|
|Faroe Islands (Denmark)||46,011||32,668|
Looking at these lists we have a clue as to when and how Benford’s Law works. [spoiler]
In one of the lists, the populations are distributed more or less evenly on a linear scale; that is, there are about as many populations from 1 million to 2 million as there are from 2 million to 3 million, 3 million to 4 million, etc. (Well, actually the distribution isn’t quite even, because the fake data was made to look similar to the real data, and so has a few of its characteristics.)
The real list, like many other kinds of data, is distributed in a more exponential manner; that is, the populations are spread roughly evenly across orders of magnitude, with about as many populations from 100,000 to 1,000,000 as from 1,000,000 to 10,000,000 or from 10,000,000 to 100,000,000. This is all pretty approximate, so you can’t take it at face value, but you’ll see in the list of real data that, very roughly speaking, there are about as many populations in any one order of magnitude as in any other, at least for a while.
Data like this has a kind of “scale invariance”, especially if this kind of pattern holds over many orders of magnitude. What this means is that if we scale the data up or down (multiplying every value by the same factor) and throw out the outliers, it will look about the same as before.
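Here is a minimal sketch of what that scaling claim looks like in practice. It uses made-up data, not the population lists: the samples are generated to be evenly spread over six orders of magnitude, and the factors 2.54 and 7 are arbitrary choices. The helper names `leading_digit` and `digit_shares` are just for illustration.

```python
import random
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant digit of a positive number, read off scientific notation."""
    return int(f"{x:.15e}"[0])

def digit_shares(values):
    """Map each leading digit 1..9 to the fraction of values that start with it."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

random.seed(0)  # fixed seed so the run is repeatable

# Assumption: synthetic data spread evenly over six orders of magnitude (1e3 to 1e9).
data = [10 ** random.uniform(3, 9) for _ in range(100_000)]

original = digit_shares(data)
# Multiply every value by the same factor and count leading digits again.
rescaled = {factor: digit_shares([factor * v for v in data]) for factor in (2.54, 7.0)}

for d in range(1, 10):
    row = "  ".join(f"x{factor}: {shares[d]:.3f}" for factor, shares in rescaled.items())
    print(f"digit {d}  original: {original[d]:.3f}  {row}")
```

If the data really is scale invariant, the three columns should agree closely for every digit, whichever factor is used.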
The key to Benford’s Law is this scale invariance. Data that has this property will automatically satisfy Benford’s rule. Why is this? If we plot such data on a linear scale it won’t be distributed uniformly but will be all stretched out, becoming sparser and sparser. But if we plot it on a logarithmic scale (which you can think of as approximated by the number of digits in each value), then such data is smoothed out and evenly distributed.
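To see the difference between the two kinds of plot without drawing anything, we can simply count how many values land in each bin. This is a rough sketch on synthetic data (again assumed to be evenly spread over the orders of magnitude from 1e3 to 1e9); the bin counts and the function names are illustrative choices, not part of Benford’s Law itself.

```python
import math
import random

random.seed(1)  # fixed seed so the run is repeatable

LOW_EXP, HIGH_EXP = 3, 9  # assumption: data spans 1e3 up to 1e9
samples = [10 ** random.uniform(LOW_EXP, HIGH_EXP) for _ in range(60_000)]

def linear_bin_counts(values, low, high, bins):
    """Count values in equal-width bins on a linear scale."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        index = min(max(int((v - low) / width), 0), bins - 1)
        counts[index] += 1
    return counts

def log_bin_counts(values, low_exp, high_exp):
    """Count values in one bin per order of magnitude (equal width on a log scale)."""
    counts = [0] * (high_exp - low_exp)
    for v in values:
        index = min(max(int(math.log10(v)) - low_exp, 0), len(counts) - 1)
        counts[index] += 1
    return counts

linear_counts = linear_bin_counts(samples, 10 ** LOW_EXP, 10 ** HIGH_EXP, bins=10)
log_counts = log_bin_counts(samples, LOW_EXP, HIGH_EXP)

print("linear bins:", linear_counts)  # crowded at the low end, sparse at the high end
print("log bins:   ", log_counts)     # roughly the same count in every decade
```

On the linear scale nearly everything piles into the first bin and the rest thin out, while on the logarithmic scale each decade holds about a sixth of the samples.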
But presto! Look at how the leading digits are distributed on such a logarithmic scale!
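Within each factor of ten on that scale, the stretch from 1 to 2 takes up about 30% of the length, the stretch from 2 to 3 a bit less, and so on down to the stretch from 9 to 10, which takes up less than 5%. That’s mostly 1’s, a bit fewer 2’s, etc. on down to a much smaller proportion of 9’s.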
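Those proportions can be worked out directly. On a logarithmic scale, the numbers with leading digit d fill the stretch from log10(d) to log10(d + 1) within each factor of ten, so their share is log10(1 + 1/d), which is the usual way Benford’s Law is stated. A small sketch (the function name `benford_share` is just for illustration):

```python
import math

def benford_share(digit: int) -> float:
    """Fraction of a decade, measured on a log scale, covered by numbers starting with `digit`."""
    return math.log10(1 + 1 / digit)

for d in range(1, 10):
    print(f"{d}: {benford_share(d):.1%}")

# Prints:
# 1: 30.1%
# 2: 17.6%
# 3: 12.5%
# 4: 9.7%
# 5: 7.9%
# 6: 6.7%
# 7: 5.8%
# 8: 5.1%
# 9: 4.6%
```

The nine shares add up to 1, since together the nine stretches cover the whole decade.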