
Make it easy for people to get your data

Back in February, we published a blog post all about the phrase 'eating your own dog food' - if you made dog food and had to feed it to your own dogs, you'd make it as good as it could be, right? Everyone benefits. In terms of open data, this means using the same data that you publish for others. Your open data isn't an extra - it is built in from the beginning and you rely on it. That means you quickly find any problems with what you are publishing, and you have an incentive to fix them.

Use the web as it was intended

Taking these thoughts a step further, you should publish your good-quality, usable open data on the web in a way that is easy to find, easy to access, and in a widely accepted format, i.e. something that could be opened by anyone with a wide range of software. You want as many people as possible to be able to re-use the data, in as many contexts as possible.

Unless it makes no sense, provide a simple download of the data at a URL - one of the building blocks of the web. That doesn't mean you can't also have fancy application programming interfaces (APIs) that filter, slice, and dice the data. But if you are expecting people to make use of a big fraction of your open data set, having a URL that people can use to grab a copy is using the web as it was intended. That makes it possible for people to link to your data, so they can let others know about it and link back to original sources. Having a URL allows coders to use it automatically in apps, software, and web pages. Avoid putting your "open" data behind a password or API key as much as possible. Provide clever API access if you can, but the basic thing should be a simple download at a URL.
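To make the point concrete, here is a minimal sketch of why a plain download URL matters: a few lines of standard-library Python are enough to grab and parse the data. The URL in the usage comment is a hypothetical example, not a real dataset.

```python
import csv
import io
import urllib.request


def parse_csv(text: str) -> list[dict]:
    """Turn CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_csv(url: str) -> list[dict]:
    """Download a CSV published at a fixed URL and parse it."""
    with urllib.request.urlopen(url) as resp:  # a plain GET - no API key, no login
        return parse_csv(resp.read().decode("utf-8"))


# Hypothetical usage - the URL is an assumption for illustration:
# rows = fetch_csv("https://data.example.gov.uk/business-rates/latest.csv")
```

Because the data sits at a URL, the same two-line fetch works from a spreadsheet, a notebook, a dashboard, or a scheduled script.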

Cool URIs don't change

If you are a regular user of published datasets you've probably come across a common problem: people move things around on their websites, or publish new versions of datasets at seemingly random URLs. When the URL of a dataset changes, people have to manually find the new address the next time they need a copy. This is time-consuming for them, and the effort is multiplied by the number of people relying on your dataset. Say you update your city's Business Rates data monthly. Anyone who wants to see the latest version - from citizens to developers to local journalists - has to go and find it every month. That is a waste of time and effort.

A way to help users of your data is to have URLs that stay fixed. As the inventor of the web, Sir Tim Berners-Lee, says, "Cool URIs don't change". Try to have fixed URLs for things. Do your best to avoid "link rot". Think like an archivist or librarian: you want things to be findable in as many ways as possible, not just via your site's search engine.

If you have a regular snapshot of data that you'd like people to use the most up-to-date version of, you could have URLs for the archive versions and a URL for the "latest" version. Say you were Balamory Council publishing the Business Rates of Josie Jump, Edie McCredie and Suzie Sweet. You might create a URL that always returns the latest copy of the data, and put historical versions at URLs that follow a pattern so consistent that people could even guess them. You'd then keep those URLs for as long as you could. If you did have to move things at some point in the future, you'd put in redirects to make sure that people going to the old links got sent to the right places.
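A sketch of that scheme, with hypothetical URLs for the fictional Balamory Council, might look like this: dated, guessable archive URLs plus one stable "latest" URL.

```python
from datetime import date

# Hypothetical base URL - any consistent, permanent scheme will do.
BASE = "https://data.balamory.example/business-rates"

# Always serves (or redirects to) the newest monthly snapshot.
LATEST_URL = f"{BASE}/latest.csv"


def archive_url(snapshot: date) -> str:
    """Permanent, guessable URL for a given monthly snapshot."""
    return f"{BASE}/{snapshot:%Y-%m}.csv"


print(archive_url(date(2020, 3, 1)))
# https://data.balamory.example/business-rates/2020-03.csv
```

With this pattern, a journalist can bookmark the latest URL, a developer can point a dashboard at it, and anyone needing last March's figures can guess the archive address without searching your site.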

Breaking the web

These issues all came into sharp focus this week when Public Health England changed the way they were publishing data about confirmed cases of coronavirus in the UK. People are desperate for data during the coronavirus outbreak, not just on a national scale but regionally and locally. Public Health England's dashboard was the definitive source for England. They were publishing the raw data as CSV files at fixed URLs, and they provided links to those. People were using them in live visualisations and analysis: we had a Local Authority COVID-19 dashboard, and Trafford Data Lab had a COVID-19 monitor. Then PHE unveiled a new dashboard - which broke everyone else's, including ours.

PHE removed the CSV files and replaced them with JSON. JSON has some advantages and disadvantages compared with CSV: it is better for less tabular data, and it let them include the history of values for each Upper Tier Local Authority. But they could also have kept producing the CSV files that people were actively using. PHE were quick to reinstate the links to CSV files, but with a slight caveat - these links weren't real URLs. They were effectively buttons on a page that you have to click to download a CSV file. Unfortunately that doesn't work for robots or automated scripts, and the alternative method (go to a URL that serves up some XML, parse the response, and try to find the real URL of the latest JSON data) excludes users who don't have the technical knowledge. A fixed URL would keep the data accessible to as many people - and robots - as possible.
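For illustration, the robot workaround described above looks something like this sketch. The index URL and the XML element names are assumptions for the example, not PHE's actual schema - which is exactly the problem: every consumer has to reverse-engineer details like these, and non-technical users are shut out entirely.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical index URL - stands in for the XML file listing.
INDEX_URL = "https://example.org/covid-data/index.xml"


def latest_data_url(xml_text: str) -> str:
    """Scan an XML file listing and pick the newest entry, assuming
    filenames sort chronologically (e.g. .../2020-04-01.json)."""
    root = ET.fromstring(xml_text)
    urls = [el.text for el in root.iter("url")]  # element name is an assumption
    return sorted(urls)[-1]


def fetch_latest(index_url: str = INDEX_URL) -> dict:
    """Fetch the index, work out the latest JSON file, then download it."""
    with urllib.request.urlopen(index_url) as resp:
        data_url = latest_data_url(resp.read().decode("utf-8"))
    with urllib.request.urlopen(data_url) as resp:
        return json.load(resp)
```

Compare all that machinery with the one-line fetch a fixed "latest" URL allows.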

There might be a fixed URL in the future, but for now the decisions made in this example have led a lot of folk to change how their own visualisations or dashboards work. They might have to do this again and again with every change made to the data source. Following the practices outlined above would go a long way towards maintaining reliability and would save a lot of wasted time and effort. By making your data easy to find, easy to access, and easy to use, you increase the number of people who can (and will) use it, thus increasing the value you get back from it. Win-win.