Council spending data
In June 2010 the UK Government told local authorities that they should publish items of spending over £500. That came into affect in January 2011 although only 293 of the 326 councils met that target.
At The Bureau Local's local journalism hack day I found myself wrangling the files published by Leeds City Council to help a local journalist who was tired of copying and pasting from multiple files. Thankfully, because they are all published in the same dataset on Data Mill North, it wasn't too hard to write code to grab all the files. Combining them was harder (more on that below).
I ended up making a visualisation of Leeds City Council spending that showed a monthly summary as well as let you search by beneficiary/supplier. I then had people from neighbouring authorities ask for versions for Bradford and Calderdale. I've since added Newcastle, Sheffield and Transport for London.
One of my biggest headaches has been inconsistencies in formatting of CSV files. Column headings vary with time and between councils. The precise names of beneficiaries vary within a council often due to random truncation of text or changes in case. The "Organisational Unit" within a Council can change name over time. It really looks as if there is quite a bit of manual entry of fields that should be well defined.
Within columns I've had to deal with a large number of unnecessary complications that our Open Data Tips document warns against. One example being around dates. Councils use a huge range of date formats (sometimes within one CSV file). Here are some of the date formats I've found that would all refer to the same date:
- 14 June 2012
- 14th Jun 2012
- 41072 (this is the number of days since 1900)
The American format dates were the ones that caught me most by surprise (looking at you Sheffield City Council). As a result I now pre-parse a CSV file looking for XX/ZZ/YYYY dates and then counting up how many of them had a number greater than 12 in the XX and ZZ parts of the date to decide if the file is American or British. If there is a mix of American and British within a file I have no chance of knowing which is which. Thankfully that hasn't happened... yet. Ideally, everyone would be publishing dates using the international standard ISO8601, as recommended by the Government Digital Service, to remove the uncertainty. I'd love it if the Microsoft Excel programmers could make CSV export default to storing dates as ISO8601.
My next problem was the Amount column which goes by a variety of names. These often contained numbers with thousands placeholders. This is fine in human-friendly Excel but please don't do this in CSV files. It just adds to potential ambiguity (some countries use "." for the thousands separator and "," for the decimal point); just use a decimal point. Also, please don't include currency symbols in each cell. These can be hard to parse especially if the character encoding of the document has got confused.
All of these problems are not insurmountable by hand but the sheer number of exceptions becomes harder to cope with in code. When the format of files change from month to month you almost feel like someone is deliberately trying to make it hard for you to compare data. In reality it's more likely down to someone simply clicking "export" to tick a box without thinking of what happens next. The lack of consistency in the spending data has been in sharp contrast to my recent Brownfield land register project where the format is defined nationally. That made it easy to create a tool that worked for multiple authorities.
When you publish data have a mind to the people - journalists, programmers, and citizens - who will be using it and how they might do that. Make it easier for them and they'll create tools for you. Make sure you provide feedback mechanisms so they can let you know where the errors are and you can increase your data quality. Transparency is about more than just publishing some files.