EPC - lessons learned

Over the last two months, we have taken time to understand, explore and gather insights from the openly available EPC data for Leeds. Having published our code and several blog posts along the way, we decided to summarise and review our work to lay the groundwork for future initiatives. To put our experience with EPCs into context, we aimed, as always, to be radically open and involve the experts we met along the way.

The friends we made

It is rarely best to face a challenging situation alone. Luckily, EPC data is tricky enough that there are lots of people out there who have been in the same position we were in at the beginning of our exploration, and the ones we connected with were more than willing to help. We wouldn't even be here without Icebreaker One's Project Cygnus and their initial discussion of the importance of EPC data. The ONS happily shared their research methodology with us and encouraged us to give feedback on EPC data quality. Dr Kate Simpson from the Alan Turing Institute pointed us to some helpful research papers and offered to help along the way. Friends we made at IBM through the Emergent Alliance gave us access to IBM's Cloud Pak for Data environment, which provides the processing power we need to extend our research beyond Leeds. This once again shows the value of being open and building a cooperative network.

The data

The openly available EPC data is massive. It comprises 30GB of CSV files, and as if 90 columns and 19 million observations for England and Wales (290k for Leeds alone) weren't enough, this is only a subset of all existing EPCs. This scale makes it an incredibly useful source, but it also invites pitfalls such as standardisation issues (e.g. inconsistent expressions for missing values), duplicates and inconsistencies. The ambiguity of building reference number assignment, in particular, has proven to be an interesting challenge. Experienced EPC users like Kate Simpson acknowledge these shortcomings (not least because the data focuses only on carbon) but are keen to highlight its value as a relatively new dataset and tool that can enable policy change.
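To make the missing-value and duplicate issues concrete, here is a minimal pandas sketch of the kind of basic cleaning this calls for. It assumes the column names from the published EPC schema (BUILDING_REFERENCE_NUMBER, LODGEMENT_DATE) and an illustrative, non-exhaustive list of missing-value tokens; the real files mix several spellings, so the list would need extending.

```python
import pandas as pd

# Illustrative tokens used for missing values (assumed list, not exhaustive).
MISSING_TOKENS = ["NO DATA!", "INVALID!", "unknown", ""]

def clean_epc(df: pd.DataFrame) -> pd.DataFrame:
    """Standardise missing values and keep one certificate per building.

    Keeps only the most recently lodged certificate for each
    BUILDING_REFERENCE_NUMBER, one simple way to handle duplicates.
    """
    df = df.replace(MISSING_TOKENS, pd.NA)
    df["LODGEMENT_DATE"] = pd.to_datetime(df["LODGEMENT_DATE"])
    df = df.sort_values("LODGEMENT_DATE")
    return df.drop_duplicates("BUILDING_REFERENCE_NUMBER", keep="last")
```

Keeping the latest lodgement is only one possible policy; depending on the question, earlier certificates (e.g. pre-retrofit) may be exactly what you want to keep.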

What we learned

Our Leeds-based EPC exploration ranged from baseline insights about average energy efficiency ratings and age bands to more specific outputs, such as the spread of heating systems and the reasons for EPC inspections (all of which can also be found in our data mapper). The ONS confirms these results and shows similar trends across England and Wales. Since we did not want to simply replicate existing research, we used this as a warm-up for more specific analysis, which allowed us to assess the effectiveness of Green Deal measures and highlight standardisation issues. Inspired by this, we explored two types of inconsistency, concerning New Dwellings and unusual relationships between current and potential values. To deal with at least some of this ambiguity in the EPC data, we developed some quick basic cleaning code. A paper by Hardy and Glew, in particular, gave us an overview of potential errors and helped us avoid biasing the data.
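The two inconsistency types above can be flagged with simple rules. The sketch below is a simplified illustration, not our actual cleaning code: the column names follow the published EPC schema, but the "onwards" heuristic for recent age bands and the exact rule definitions are assumptions for the example.

```python
import pandas as pd

def flag_inconsistencies(df: pd.DataFrame) -> pd.DataFrame:
    """Add boolean columns flagging two kinds of suspicious records."""
    df = df.copy()
    # A "new dwelling" certificate should not carry a historic construction
    # age band (recent bands typically read "... onwards" - a rough heuristic).
    df["odd_new_dwelling"] = (
        df["TRANSACTION_TYPE"].str.lower().eq("new dwelling")
        & ~df["CONSTRUCTION_AGE_BAND"].str.contains("onwards", na=False)
    )
    # The current efficiency score should never exceed the potential score.
    df["odd_potential"] = (
        df["CURRENT_ENERGY_EFFICIENCY"] > df["POTENTIAL_ENERGY_EFFICIENCY"]
    )
    return df
```

Flagging rather than dropping such records keeps the choice of how to treat them open, which matters when the "error" might in fact be a data-entry convention.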

Final thoughts

The main thing that can be said at this point in our EPC journey is that there is a lot of room for improvement - for EPCs, but also for us. This has been a massive learning experience that has taught us that there is still a lot more to be done, and that a lot of what we have done so far can be done better. As always, we want to release outputs and tools that have a positive and useful impact, which means we are looking forward to your feedback and ideas!