Let's chat about open data publishing
As much as we love open data, not all of it is created equal. There is a lot of useful and usable data out there, but we often find ourselves jumping through hoops created by unhelpful publishing. And we're not the only ones - this blog post was inspired by Lesley-Anne Kelly's Twitter thread on her frustrations with open data publishing.
What goes wrong?
Open data can be published in many formats, but sometimes that means it isn't compatible with certain systems or excludes some use cases. In Lesley-Anne's example, the data was only available in JSON and XML. Those formats aim to take us towards a future Linked Data utopia, but they also exclude a lot of people in the present: they require quite a bit of specialist knowledge, and most people don't have graphical analysis tools for them. For Lesley-Anne this was fixed by an extra step of converting the data to CSV. If that were the only extra step it might be fine - but she then had to join on other files to make sense of the data, translating gender codes and people codes into more meaningful values. All of these steps require some research and, as a result, time.
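As a rough sketch of the kind of extra work involved, here is what converting JSON records to CSV while translating codes might look like. The field names, codes, and lookup values below are entirely made up for illustration - they aren't the real dataset:

```python
import csv
import io
import json

# Hypothetical JSON records, as an API might return them
raw = json.loads("""[
  {"person_code": "P001", "gender_code": 1, "value": 42},
  {"person_code": "P002", "gender_code": 2, "value": 17}
]""")

# Hypothetical lookup table for translating codes to meaningful labels -
# in practice this is the separate file you have to find and join on
gender_lookup = {1: "Male", 2: "Female"}

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["person_code", "gender", "value"])
writer.writeheader()
for record in raw:
    writer.writerow({
        "person_code": record["person_code"],
        "gender": gender_lookup[record["gender_code"]],
        "value": record["value"],
    })
print(out.getvalue())
```

None of this is difficult in itself, but every dataset that needs it adds research and time before any actual analysis can start.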
She's not alone in these problems. In our work, we often come across files that exist in hard-to-use formats. A recent example is COVID-19 vaccination data, which is published as a many-sheet Excel file styled to be more easily understood by humans than machines - it includes empty cells, explanatory text, and multiple tables in single sheets. While this is great for humans, it makes feeding such files into software much more difficult, since the unnecessary content has to be stripped out first. Including this data in our COVID-19 dashboard requires a human to spend time extracting and cleaning the data each week.
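To give a flavour of that cleaning step, here is a minimal sketch of locating and extracting a table from a sheet that mixes explanatory text and blank rows. The sheet contents and figures are invented for illustration - the real file is considerably messier:

```python
# Hypothetical sheet contents as a list of rows, mimicking a spreadsheet
# that mixes titles, explanatory notes, blank rows, and the actual table
sheet = [
    ["COVID-19 vaccinations", None, None],
    ["Notes: figures are provisional", None, None],
    [None, None, None],
    ["Area", "Dose 1", "Dose 2"],
    ["Leeds", 1200, 900],
    ["York", 800, 650],
]

# Find the header row by looking for a known column name, then keep
# only the rows from there down, dropping fully blank rows
header_index = next(i for i, row in enumerate(sheet) if row[0] == "Area")
table = [row for row in sheet[header_index:] if any(cell is not None for cell in row)]

print(table[0])       # the header row
print(len(table) - 1) # the number of data rows
```

Fragile heuristics like "find the row that starts with Area" are exactly what breaks when the publisher reshuffles the sheet - which is why this ends up being a recurring human job rather than a one-off script.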
How does this happen?
On one hand, publishing open data may be a by-product rather than a core function for some organisations. As a result, they are likely to publish in whatever format they already use to avoid extra work, even if it isn't as useful to outsiders as it could be. That's not necessarily bad - at least the data is openly available - but it is inconvenient. On the other hand, examples like the COVID-19 vaccination data are clearly aimed at a general audience used to nicely presented reports, rather than at automated processes.
How do we fix it?
We need to look at the data we're trying to publish from an outside perspective. In the present (and probably in the future too) the "audience" for your data is likely to be both people and machines. Try not to make things hard for either. Here are some suggestions:
- The most portable file type is CSV (for tabular data), which can be read by robots, opened in a simple text editor, and opened in Excel, LibreOffice, etc. If your data is a table, provide it as CSV.
- When you provide data as a CSV, avoid putting lots of explanatory text in the CSV file, as that makes it hard for robots to know where the table begins and ends - put that in a separate metadata file*.
- Avoid having multiple header lines as these make it hard for machines to know where the data starts.
- Data portals or publishers could do the obvious joins at their end rather than making most end-users do them themselves.
- Where things have standardised codes, you could include both the code and a more human-friendly value such as a name (e.g. for people, gender, constituency). This isn't to say you shouldn't also publish a separate lookup table - you should - but think about what a human will need to do to understand the values. Codes are really good for software because they avoid ambiguity in spelling, case, and formatting. For example, the local authority York gets written in many different ways - "York", "City of York", "York, City of" - in various datasets, so having the ONS code E06000014 helps automated processes.
- If you include ONS geography codes, it is also helpful to use a column heading that matches the ONS heading style (e.g. LSOA11CD, MSOA11CD, LAD20CD, PCON17CD, WD19CD...) as these helpfully encode the type of geography along with the year it applies to.
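Putting the code-plus-name suggestion into practice, here is a minimal sketch of adding a human-friendly name column alongside an ONS code column. The lookup and the figures are made up for illustration; the column names follow the ONS CD (code) / NM (name) convention:

```python
# Hypothetical lookup of ONS local authority codes to names
lad_lookup = {"E06000014": "York", "E08000035": "Leeds"}

# Hypothetical published data keyed on the unambiguous ONS code
data = [
    {"LAD20CD": "E06000014", "vaccinations": 800},
    {"LAD20CD": "E08000035", "vaccinations": 1200},
]

# Publish both: the code for machines, the name for humans
for row in data:
    row["LAD20NM"] = lad_lookup[row["LAD20CD"]]

print(data[0])
```

Doing this join once at publication saves every downstream user from repeating it - and spares humans from having to memorise what E06000014 means.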
We have plenty more Open Data Tips with advice on publishing tabular data, GIS data, metadata, and using international date formats so be sure to check those out.
Of course, it is difficult to anticipate exactly what people want from open data, but that's why we're having this conversation and should continue to do so. It's the best way to figure out what works, what doesn't, and what wastes time (even your own). So it's always best to listen.
* If you really, really must include metadata in the CSV file, you might consider adding "#" symbols at the start of rows that aren't part of the table. This isn't a standard feature of CSV, but some CSV parsing libraries (in Ruby, R, Pandas, and Node) can treat these lines as comments and know to ignore them when reading the data.
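A minimal sketch of reading such a file, using only Python's standard library and a made-up example CSV - here we filter out the "#" lines ourselves before parsing (Pandas users can instead pass `comment="#"` to `read_csv`):

```python
import csv
import io

# A hypothetical CSV with "#"-prefixed metadata lines above the table
raw = """# Source: example publisher
# Licence: OGL v3
area,value
York,42
Leeds,99
"""

# Drop the comment lines, then hand the rest to the CSV parser
data_lines = [line for line in raw.splitlines() if not line.startswith("#")]
rows = list(csv.DictReader(io.StringIO("\n".join(data_lines))))
print(rows[0])  # → {'area': 'York', 'value': '42'}
```

Because not every consumer's tools will understand comment lines, a separate metadata file remains the safer option.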