Project Cygnus - Energy (under)Performance Certificates
The eagle-eyed among you will have noticed that we didn't publish our third Project Cygnus blog post last week. That's because it has been rolled into a two-parter - a final way to polish off the work and share our findings. 'Surely you didn't need an extra long blog post looking at just EPC data?' I hear you ask. The answer is far more complex than that, and has revealed data gremlins that would make you weep. There is a lot of potential for EPC data though, if those gremlins can be banished.
As with a lot of our projects, we can learn just as much from the problems building a thing as the thing itself. Right back in the first blog post, we had lofty ambitions to build a data-driven tool that people could use to explore EPC data. We highlighted back then that there would be caveats and conditions to making a tool, the biggest of which being the condition of the data. So how did the team get on?
To say that we were pulled into various rabbit holes would be an understatement. In our efforts to be thorough with the data, and also look for interesting connections or stories, we just kept finding dead ends or mysteries befitting an Agatha Christie novel.
We're not alone. In a startling coincidence*, EPC data and retrofitting houses were mentioned during our most recent #RadicallyOpenRecovery session where Sarah Longlands, Director of IPPR North, shared the findings from their Northern Powerhomes report. You can watch the session for yourself to hear Sarah's comments in full, but from their exploration of the data it's obvious that it needs to be better if we are to use it when making decisions about retrofitting, heating networks, and more. Sarah's experience echoes our own, as she and the team at IPPR North found that EPC data could be great but was riddled with common data problems.
*By coincidence we really mean that Tom Forth mentioned it on purpose.
Positives and Problems
EPC data is scary and can be intimidating but really we need to appreciate how good we have it. In reality having open access to millions and millions of observations about energy efficiency is not a given. Yes, EPCs are collected across the EU (the one that got away), but often these registers are locked away behind walls of bureaucracy. In England, however, we are in the comfortable position that the MHCLG is pursuing an open data strategy. This means we can access, analyse, and attempt to understand this data. Given the size of the dataset, there were some problems that became immediately apparent.
A prominent issue we encountered when venturing into the data was the lack of standardisation. This is likely due to many of these certificates being filled in by hand, which opens them up not only to a range of different naming conventions but, quite naturally, to human error. As a result, many of the columns contain a range of unique values that could have been grouped together, or that don't necessarily reflect the heading at all. This might be a symptom of yet another prevalent issue EPC records bring to the table - the lack of helpful metadata. Yes, there are guidance notes online that offer some helpful variable definitions. However, they are missing some key pieces of information, such as measurement units, or the fact that building reference numbers can refer to entire buildings or individual flats (which we discovered halfway through our exploration). What they do highlight are nine different ways of describing a missing value, all of which appear throughout the data. In fact, there are even a lot of EPCs missing from the data. This is largely because lodging EPCs on the full register only became mandatory in September 2008, but also because certificate holders can opt out of disclosing their records. This means that we really only get a limited view of the bigger picture. But in the end it's not the size of the dataset but how you use it, right?
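To give a flavour of what tidying those hand-entered values involves, here is a minimal Python sketch rather than the code we actually ran. The markers in MISSING_MARKERS are illustrative assumptions, not the nine listed in the guidance notes, and clean_value is a made-up helper name.

```python
# Illustrative placeholders only: swap in the missing-value descriptions
# listed in the official guidance notes.
MISSING_MARKERS = {"", "no data!", "n/a", "unknown", "invalid!"}


def clean_value(value):
    """Collapse stray whitespace and map missing-value markers to None."""
    text = " ".join(value.split())  # trims the ends and squeezes repeated spaces
    if text.lower() in MISSING_MARKERS:
        return None
    return text


def clean_record(record):
    """Apply clean_value to every field of one row (a dict of strings)."""
    return {field: clean_value(value) for field, value in record.items()}
```

Mapping every variant to a single None makes it possible to count the gaps per column before deciding whether that column is usable at all.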
Depending on the quality of the data, the time and resources available, and the use cases (amongst other factors), having such a large dataset can be either a curse or a blessing - in our case, it turned out to be a mixture of both. It's quite rare to come across a publicly accessible dataset with over 19 million records and 90 fields - especially when it's a government department publishing it! The question is, where do you actually start? For us, that wasn't an easy question to answer.
The first thing we did was download the entire dataset from MHCLG's Open Data Communities (a 3.7GB ZIP file) and unzip it, revealing over 300 CSV files totalling around 30 gigabytes of data.
Since EPC data is geographically based by its very nature, we decided that the most logical way to begin exploring the data would be to simply limit it to one particular area - perhaps unsurprisingly, we chose Leeds! This seemed like a sensible decision, as the downloaded EPC data was split into one folder per Local Authority District, with each folder containing a 'certificates.csv' file. So, instead of unzipping the entire dataset, we could simply extract a single CSV file containing all of the published EPC records for Leeds, reducing memory and storage usage massively. The other benefit was that we know Leeds pretty well!
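Pulling a single district's file out of the archive can be sketched with Python's standard zipfile module. This is a minimal illustration under assumptions: the function name and the folder name passed in are hypothetical, and the real folder names inside the download should be checked before relying on it.

```python
import zipfile
from pathlib import Path


def extract_district_csv(zip_path, district_folder, dest_dir,
                         filename="certificates.csv"):
    """Extract one district's CSV without unpacking the rest of the archive.

    district_folder is whatever the folder is called inside the ZIP;
    the naming scheme is an assumption here, so list the archive first.
    """
    member = f"{district_folder}/{filename}"
    with zipfile.ZipFile(zip_path) as archive:
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        # extract() writes only the named member, keeping its folder path.
        return Path(archive.extract(member, path=dest_dir))
```

Only the requested file is decompressed to disk, which is what keeps storage use down compared with unpacking all 300-plus CSVs.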
So, now we had a sample of around 290,000 out of 19 million total records - still a large quantity of data, but much more manageable to explore. But for each of those 290,000 rows we still had 90 columns to look at, so the question remained - where do we start?
In the EPC dataset, the geographical location of each property is given by address and postcode - which seems logical when you consider that each EPC is for one property, and in the UK, the most common way of referring to a property's location is by its name and address. Unfortunately, Royal Mail own the intellectual property rights for postcodes and addresses, and of course address data is also classed as personal data. To avoid having to navigate licensing and GDPR issues, we decided to map all of the postcodes to Lower Layer Super Output Areas (LSOAs), then delete all of the address data for each record. We chose LSOAs because they have fixed definitions (since 2011), have non-overlapping boundaries, and contain an average of 650 households, thereby giving us areas of geography small enough to be able to draw meaningful conclusions and aggregations, but large enough to reduce the risk of re-identification of personal data. Using LSOAs also gives us the advantage of easily being able to aggregate to a number of other geographies in the future - such as MSOAs, wards, and Local Authority Districts.
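The postcode-to-LSOA step can be sketched in a few lines of Python. This is an illustration under assumptions, not our actual pipeline: the column names (POSTCODE, ADDRESS and so on), the function names, and the shape of the lookup (a dict from postcode to LSOA code) are all hypothetical and should be matched to the real files.

```python
# Assumed column names for anything that could identify a property.
ADDRESS_FIELDS = ("ADDRESS", "ADDRESS1", "ADDRESS2", "ADDRESS3", "POSTCODE")


def normalise_postcode(postcode):
    """Uppercase and strip all whitespace so 'ls1 4ap' matches 'LS1 4AP'."""
    return "".join(postcode.split()).upper()


def anonymise_records(records, postcode_to_lsoa, address_fields=ADDRESS_FIELDS):
    """Attach an LSOA code to each record, then drop address and postcode."""
    lookup = {normalise_postcode(pc): lsoa
              for pc, lsoa in postcode_to_lsoa.items()}
    cleaned = []
    for record in records:
        lsoa = lookup.get(normalise_postcode(record.get("POSTCODE", "")))
        if lsoa is None:
            continue  # unmatched postcodes are dropped rather than guessed
        row = {k: v for k, v in record.items() if k not in address_fields}
        row["LSOA"] = lsoa
        cleaned.append(row)
    return cleaned
```

Dropping unmatched rows is a design choice for the sketch; logging them instead would show how many records the lookup fails to cover.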
With all of these challenges and workarounds, what could the team do? You can read all about the next steps in part 2 of this blog.