Insignificant figures

This is a very niche data gremlin but is something that turns up time and time again when I'm working with open datasets. It can seem unimportant but it has a surprising impact that I'll come to later. My complaint is about significant figures. No, not privacy and anonymisation issues. I'm not referring to people; this is about numbers. Too many numbers.

Open datasets - particularly geographic ones - often list latitudes and longitudes to an excessive number of decimal places. What do I mean by excessive? It is very common to see 15 decimal places given (e.g. 53.796782814910291, -1.533757130238925) but I've seen as many as 51 decimal places! None of this is because a human has decided to provide that many numbers or because the measurements are that accurate. It is always down to the default export options of the software generating the dataset. Most of those decimal places are due to errors in the way computers store numbers internally and aren't 'real'. How do I know that? Well, let's work it out.
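To see where the surplus digits can come from, here is a minimal Python sketch, assuming the coordinate is held as a standard 64-bit float (which is what most software does by default). It compares the shortest text that reproduces the stored number with the exact decimal expansion of the binary value. The variable names are just for illustration, and I'm not claiming this is how any particular export tool works; it only shows that very long strings of digits can be an artefact of storage rather than of measurement.

```python
from decimal import Decimal

# Latitude from the example above, stored as a 64-bit float (an assumption
# about how the exporting software holds it).
lat = 53.796782814910291

# Shortest decimal string that converts back to exactly the same float.
shortest = repr(lat)

# Exact decimal expansion of the binary value that is actually stored.
exact = str(Decimal(lat))

print("shortest round-trip form:", shortest)
print("exact stored value:      ", exact)
print("decimal places needed:", len(shortest.split(".")[1]))
print("decimal places stored:", len(exact.split(".")[1]))
```

Anything beyond the shortest round-trip form carries no extra information about the location; it just spells out how the nearest binary fraction looks in base 10.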

The Earth has a mean radius of about 6,371 km. We'll assume it is a sphere (it isn't quite but it's a good enough approximation for this back-of-the-webpage calculation). That means it has a circumference (2*pi*radius as you might remember from school) of around 40,000 km. There are 360 degrees in a circle. That means every degree of longitude is about 111 km on the ground (at the equator). So, the fifth decimal place in the longitude represents around 1 metre on the ground. The fifteenth decimal place is about a tenth of a nanometre, which is about the size of a single atom. We have good measurements these days but we don't know the coordinates of bus stops or grit bins to atomic precision. 51 decimal places is equivalent to around 10^-46 metres. That is really, really tiny. In fact it is roughly a hundred billion times smaller than the Planck length, which Wikipedia describes as:

...if a particle or dot about 0.005mm in size (which is the same size as a small grain of silt) were magnified in size to be as large as the observable universe, then inside that universe-sized "dot", the Planck length would be roughly the size of an actual 0.005mm dot.
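The back-of-the-webpage arithmetic above can be checked with a few lines of Python. This is only a sketch under the same assumptions as the text: a spherical Earth with a 6,371 km radius, and distances measured along the equator. The function name and the rounded value for the Planck length (about 1.616e-35 m) are my own additions for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000     # mean radius in metres; a sphere is assumed
PLANCK_LENGTH_M = 1.616e-35    # approximate value, for comparison only


def metres_per_decimal_place(places: int) -> float:
    """Distance at the equator covered by one unit in the given decimal
    place of a longitude expressed in degrees."""
    metres_per_degree = 2 * math.pi * EARTH_RADIUS_M / 360
    return metres_per_degree * 10.0 ** -places


for places in (0, 5, 15, 51):
    print(f"{places:>2} decimal places ~ {metres_per_decimal_place(places):.2e} m")

ratio = PLANCK_LENGTH_M / metres_per_decimal_place(51)
print(f"Planck length is about {ratio:.1e} times larger than the 51st decimal place")
```

Away from the equator a degree of longitude covers less ground, so these figures are upper bounds for longitude.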

This overzealous quoting of decimal places is amusing but it has a practical effect. Those unnecessary, insignificant numbers increase the sizes of files. By how much depends on the amount of other data included, but I often find that I can reduce the size by 20-40% just by limiting the coordinates to 5 or 6 decimal places. This ONS dataset of ward boundaries (accurate to 20 metres) could be reduced from 46.5 MB to 24.8 MB if they truncated their decimal places. As many of the things I do are web visualisations, it is good to save bandwidth for both me and the end user (especially if they are paying per MB on mobile devices).
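As a rough illustration of the trimming step, here is a small Python sketch that limits coordinates to 5 decimal places. It assumes the data is a GeoJSON FeatureCollection already loaded into a dictionary; the helper names `round_coords` and `trim_geojson` are hypothetical, and it rounds rather than strictly truncates. It doesn't handle GeometryCollection geometries, and it is not the process used to get the file sizes quoted above.

```python
import json


def round_coords(node, places=5):
    """Recursively round every number in a nested GeoJSON coordinates list."""
    if isinstance(node, (int, float)):
        return round(node, places)
    return [round_coords(child, places) for child in node]


def trim_geojson(feature_collection, places=5):
    """Return a copy of a FeatureCollection with rounded coordinates."""
    out = json.loads(json.dumps(feature_collection))  # simple deep copy
    for feature in out["features"]:
        geometry = feature.get("geometry")
        if geometry and "coordinates" in geometry:
            geometry["coordinates"] = round_coords(geometry["coordinates"], places)
    return out


# Made-up single-point dataset using the coordinates quoted earlier.
sample = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"name": "example point"},
        "geometry": {
            "type": "Point",
            "coordinates": [-1.533757130238925, 53.796782814910291],
        },
    }],
}

trimmed = trim_geojson(sample)
before = len(json.dumps(sample))
after = len(json.dumps(trimmed))
print(trimmed["features"][0]["geometry"]["coordinates"])
print(f"{before} characters -> {after} characters")
```

The saving on a single point is small; it adds up on boundary files where almost all of the content is coordinates.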

In summary, dear data publishers, please check your export options and give a precision that matches the accuracy of your data. Also, please check out our Open Data Tips for more suggestions (and add to it).