EPC you later!


No, you are not seeing double (or triple, or even more than that): this is yet another blog post on EPC data. It is time for us to wrap things up and close the chapter that is EPC exploration, but this isn't because we have run out of patterns to spot or problems to crack. To cut a long story short, trying to map all of the ins and outs of EPC data could fill a novel (or several blog posts, it seems). And while we always aim to go above and beyond, we have had to learn when to stop. So we have decided to take this moment to share a few final bits of work (which you can find on our GitHub) and some concluding thoughts.

Cleaning

At this point in our very extended look into EPCs, we live and breathe energy efficiency data, and our standardisation nightmares are slowly fading away. But it took us a while to get there, and we decided that nobody else should have to go through this. That is why we wrote some basic cleaning code to pave the way towards a better EPC experience.

In it, we attempted to standardise missing value expressions, reduce the number of columns where appropriate, remove duplicate observations where it was reasonable, and add more insightful geographies while also upholding privacy. We kept this as basic as possible to avoid making decisions that might bias different research objectives.

However, we did have to make some assumptions. In the interest of convenience and processing speed, we based our code on the Leeds EPC subset. This means the missing value expressions we found might be incomplete in the context of all lodged EPCs. Users may want to double-check this, since we remove columns with a certain percentage of missing values.
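To make the missing-value step a bit more concrete, here is a minimal pandas sketch of the idea rather than the cleaning code itself. The placeholder strings in `MISSING_TOKENS`, the column names in the example, and the `max_missing_share` threshold are all assumptions for illustration; the real list of placeholders should be checked against whichever EPC subset you are working with.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder strings; check these against your own EPC extract.
MISSING_TOKENS = ["", "NO DATA!", "N/A", "INVALID!"]


def standardise_missing(df: pd.DataFrame, tokens=MISSING_TOKENS) -> pd.DataFrame:
    """Trim whitespace from text values and turn known placeholders into NaN."""
    trimmed = df.apply(
        lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v)
    )
    # mask() replaces every cell where the condition is True with NaN.
    return trimmed.mask(trimmed.isin(tokens))


def drop_sparse_columns(df: pd.DataFrame, max_missing_share: float = 0.9) -> pd.DataFrame:
    """Drop columns whose share of missing values exceeds the threshold."""
    missing_share = df.isna().mean()
    keep = missing_share[missing_share <= max_missing_share].index
    return df[keep]


# Tiny made-up example; column names are assumptions for illustration.
example = pd.DataFrame({
    "CURRENT_ENERGY_RATING": ["C", "NO DATA!", "D", "E"],
    "SHEATING_ENERGY_EFF": ["N/A", "N/A", "", "N/A"],
    "PROPERTY_TYPE": [" House", "Flat", "House ", "Bungalow"],
})
cleaned = drop_sparse_columns(standardise_missing(example), max_missing_share=0.5)
print(cleaned)
```

Keeping the placeholder list and the threshold as parameters means anyone working beyond Leeds can extend or tighten them without touching the functions.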

Moreover, when it came to cleaning out redundant observations, we focused on removing duplicated full cases. We also considered removing older, incomplete duplicates of the same building reference number that were lodged within the same day or week, since research suggests that these were later updated. But due to the general confusion over building reference numbers (and recurring nightmares of biasing the data by accidentally removing legitimate observations), we decided against this.
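As a rough illustration of the difference between the two approaches, the sketch below drops rows that are identical in every column (our reading of "duplicated full cases") and only flags, rather than removes, certificates that were followed by another one for the same building reference number within a few days. The column names `BUILDING_REFERENCE_NUMBER` and `LODGEMENT_DATE`, the seven-day window, and the flag name are assumptions, and the flag ignores completeness, so treat it as a starting point for inspection.

```python
import pandas as pd


def drop_full_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows that are identical in every column."""
    return df.drop_duplicates().reset_index(drop=True)


def flag_quick_relodgements(
    df: pd.DataFrame,
    id_col: str = "BUILDING_REFERENCE_NUMBER",  # assumed column name
    date_col: str = "LODGEMENT_DATE",           # assumed column name
    window_days: int = 7,
) -> pd.DataFrame:
    """Flag certificates followed by another one for the same building within the window."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    out = out.sort_values([id_col, date_col]).reset_index(drop=True)
    # Date of the next certificate lodged for the same building reference number.
    next_date = out.groupby(id_col)[date_col].shift(-1)
    gap_days = (next_date - out[date_col]).dt.days
    # The last certificate per building has no successor, so it is never flagged.
    out["SUPERSEDED_WITHIN_WINDOW"] = gap_days.le(window_days)
    return out


# Made-up records for illustration only.
records = pd.DataFrame({
    "BUILDING_REFERENCE_NUMBER": [1, 1, 1, 2, 2],
    "LODGEMENT_DATE": ["2015-01-01", "2015-01-01", "2015-01-03", "2016-05-01", "2018-05-01"],
    "CURRENT_ENERGY_RATING": ["D", "D", "C", "E", "D"],
})
deduplicated = drop_full_duplicates(records)
flagged = flag_quick_relodgements(deduplicated)
print(flagged)
```

Flagging instead of deleting keeps the legitimate observations in the data while still making the suspicious ones easy to filter out later.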

Error Exploration

While writing the cleaning code, we discovered occasional anomalies in the Leeds-based EPC data and decided to explore two of them.

One of them is the transaction type "new dwelling". Typically, you would assume that an EPC done for a new dwelling would be the first registered observation for its building reference number. We explored the building references where this was not the case and discovered that a majority of them had been repeatedly registered as new dwellings since their first EPC. This implies that some of them might be updates, meaning older iterations could be redundant. Beyond this, the most common first transaction type preceding "new dwelling" is a marketed sale, which might imply a provisional EPC. However, the next most common first transaction type, "private rental", is a bit more of a head-scratcher, and might mean that something went wrong when these EPCs were lodged.
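For anyone wanting to repeat this check, here is one way it could be sketched in pandas. It is not the code we ran: the column names and the transaction type labels ("new dwelling", "marketed sale", "rental (private)") are assumptions, so adjust them to match the values in your extract.

```python
import pandas as pd


def later_new_dwellings(
    df: pd.DataFrame,
    id_col: str = "BUILDING_REFERENCE_NUMBER",  # assumed column name
    date_col: str = "LODGEMENT_DATE",           # assumed column name
    type_col: str = "TRANSACTION_TYPE",         # assumed column name
    new_label: str = "new dwelling",            # assumed label
) -> pd.DataFrame:
    """Return 'new dwelling' certificates that are not the first for their building."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    out = out.sort_values([id_col, date_col])
    grouped = out.groupby(id_col)
    out["LODGEMENT_ORDER"] = grouped.cumcount()             # 0 = first certificate
    out["FIRST_TYPE"] = grouped[type_col].transform("first")
    out["PREVIOUS_TYPE"] = grouped[type_col].shift(1)
    is_anomaly = (out[type_col] == new_label) & (out["LODGEMENT_ORDER"] > 0)
    return out.loc[is_anomaly, [id_col, date_col, "FIRST_TYPE", "PREVIOUS_TYPE"]]


# Made-up records for illustration only.
records = pd.DataFrame({
    "BUILDING_REFERENCE_NUMBER": ["A", "A", "B", "B", "C", "D", "D"],
    "LODGEMENT_DATE": [
        "2010-01-01", "2011-01-01", "2012-01-01", "2013-01-01",
        "2014-01-01", "2015-01-01", "2016-01-01",
    ],
    "TRANSACTION_TYPE": [
        "new dwelling", "new dwelling", "marketed sale", "new dwelling",
        "new dwelling", "rental (private)", "new dwelling",
    ],
})
anomalies = later_new_dwellings(records)
# Which transaction types did these buildings start out with?
print(anomalies["FIRST_TYPE"].value_counts())
```

Counting `FIRST_TYPE` gives the breakdown by first transaction type, while `PREVIOUS_TYPE` shows what came immediately before each out-of-place "new dwelling".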

The second inconsistency we looked into is the relationship between current and potential values lodged in the register. Here, the main problem is that some of the catalogued potentials reflect *worse* performance than the respective dwellings currently achieve. Since this contradicts the very idea of a potential, it indicates that something might have gone wrong. There is some good news, though: hardly any EPCs show this kind of error in the energy efficiency column. The other columns measuring current and potential values, however, show a significant amount of inconsistency with the concept of a potential, especially when it comes to Hot Water Cost. If someone were to look into this further, it would be interesting to find out whether there is a consistent pattern in potential value errors across individual dwellings, which may allow for their exclusion. The same goes for observations where currents consistently equal potentials, especially if these are valued at 1, which may hint at provisional EPCs. But these outliers, again, need to be handled with care.
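One detail worth spelling out is that "worse" points in different directions depending on the column: for an efficiency score a lower potential is worse, whereas for a cost a higher potential is worse. The sketch below encodes that explicitly. The column names and the list of pairs are assumptions for illustration, and the numbers are made up.

```python
import pandas as pd

# (current column, potential column, True if higher values mean better performance)
# Column names are assumptions; extend the list with the other current/potential pairs.
PAIRS = [
    ("CURRENT_ENERGY_EFFICIENCY", "POTENTIAL_ENERGY_EFFICIENCY", True),
    ("HOT_WATER_COST_CURRENT", "HOT_WATER_COST_POTENTIAL", False),
]


def share_potential_worse(df: pd.DataFrame, pairs=PAIRS) -> pd.Series:
    """Share of records, per pair, where the potential is worse than the current value."""
    shares = {}
    for current, potential, higher_is_better in pairs:
        both = df[[current, potential]].dropna()  # only compare complete pairs
        if higher_is_better:
            worse = both[potential] < both[current]
        else:
            worse = both[potential] > both[current]
        shares[current] = worse.mean()
    return pd.Series(shares)


# Made-up values for illustration only.
records = pd.DataFrame({
    "CURRENT_ENERGY_EFFICIENCY": [60, 70, 55, 80],
    "POTENTIAL_ENERGY_EFFICIENCY": [75, 70, 80, 85],
    "HOT_WATER_COST_CURRENT": [100, 120, 90, 80],
    "HOT_WATER_COST_POTENTIAL": [80, 150, 95, 80],
})
shares = share_potential_worse(records)
print(shares)
```

Records where current equals potential are not counted as errors here; they would need a separate check if you want to look for possible provisional EPCs.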

Some Analysis

Despite the seemingly endless hurdles to doing some meaningful analysis on the data, we did manage to link the EPC data to the Valuation Office Agency's Housing Stock 2020 data. Part of our cleaning process was mapping each record to its respective LSOA (Lower Layer Super Output Area), which tends to be the geographical granularity used by many national statistics datasets, meaning we could easily link our version of the EPC data to a range of open datasets. For example, the data could be linked with IMD (Indices of Multiple Deprivation) data, which is LSOA-based, or any number of other open geographical datasets from the ONS, the VOA, MHCLG, Data Mill North, etc.

We chose the VOA Housing Stock data simply because it was an easy dataset to work with - it's released under an Open Government Licence, well standardised and documented, and didn't require us to do any data cleaning. We decided to look at two questions:

  1. What impact does property age have on energy performance?
  2. In any given LSOA, what percentage of properties have a published EPC?

The answer to the first question might seem fairly obvious - older properties tend to have a lower EPC rating than younger ones. The answer to the second might surprise you though - in Leeds, only about 46% of properties have a published EPC. If you want to delve into the more technical side of things, we've created an open Jupyter Notebook containing our code, methodology, and outputs.
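To give a flavour of how the second question can be approached, here is a simplified sketch that is not lifted from the notebook. It assumes the cleaned EPC data has an `LSOA_CODE` column, that each property is identified by `BUILDING_REFERENCE_NUMBER`, and that the housing stock table has one row per LSOA with an `ALL_PROPERTIES` count. All of these names are hypothetical, and the numbers are made up.

```python
import pandas as pd


def epc_coverage_by_lsoa(
    epc: pd.DataFrame,
    stock: pd.DataFrame,
    lsoa_col: str = "LSOA_CODE",                # hypothetical column name
    id_col: str = "BUILDING_REFERENCE_NUMBER",  # assumed column name
    stock_col: str = "ALL_PROPERTIES",          # hypothetical column name
) -> pd.DataFrame:
    """Percentage of properties in each LSOA with at least one published EPC."""
    # Count each property once, however many certificates it has.
    with_epc = (
        epc.groupby(lsoa_col)[id_col]
        .nunique()
        .rename("PROPERTIES_WITH_EPC")
        .reset_index()
    )
    # A left join keeps LSOAs that have no EPCs at all.
    merged = stock.merge(with_epc, on=lsoa_col, how="left")
    merged["PROPERTIES_WITH_EPC"] = merged["PROPERTIES_WITH_EPC"].fillna(0)
    merged["EPC_COVERAGE_PCT"] = 100 * merged["PROPERTIES_WITH_EPC"] / merged[stock_col]
    return merged


# Made-up data for illustration only.
epc = pd.DataFrame({
    "LSOA_CODE": ["X", "X", "X", "Y"],
    "BUILDING_REFERENCE_NUMBER": [1, 1, 2, 3],
})
stock = pd.DataFrame({
    "LSOA_CODE": ["X", "Y", "Z"],
    "ALL_PROPERTIES": [4, 10, 5],
})
coverage = epc_coverage_by_lsoa(epc, stock)
print(coverage)
```

Counting unique building reference numbers rather than rows matters here, because a property with several certificates would otherwise inflate the coverage figure.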

However, our ultimate aim in doing this small (and frankly fairly simple) piece of analysis was not to answer the questions mentioned above. We wanted to demonstrate what can be done with the EPC data once a lot of time and work has gone into exploring it, cleaning it, and trying to understand vague documentation.

Future Potential & Conclusions

The EPC dataset is a great resource to have, and although it might sound as though we've just been complaining about it, we can't forget the positives: it's a huge dataset with over 19 million EPC records covering the whole of England and Wales, the data quality is generally good, it has easily accessible metadata, and it even has a well-documented API.

As we mentioned, we have tried to give an example of what could be done - but even working as a team with some data science experience, it took us a lot of trial and error, data wrangling, and cleaning just to get the data into a usable format. Our question is - what if we didn't have to do that? If the data were made more accessible (i.e. if you didn't have to register to access it), and MHCLG were able to document the errors and limitations we have discovered (or even better, fix them - but we know this is much easier said than done), then it would be much easier to use, link, and analyse the data, ultimately requiring a lot less time and work. Similarly, the National Data Strategy underlines the importance of UK-wide standardisation for resource preservation and bureaucratic efficiency. Check out ODI Leeds' response to the National Data Strategy here.

The ONS have already done some great work analysing the data and, having read their methodology, we can see that they found the same errors and limitations we have. We've been in contact with them and they're very keen for feedback - so please, if you have any thoughts, get in touch with us and we'll pass them on.

In the future, provided that enough people see this work as being useful, we'd love to be able to expand it to the whole of the UK. Through our work with the Emergent Alliance, IBM have very kindly given us access to their Cloud Pak for Data platform, which we could potentially use to perform this analysis on a much larger scale.

For now, we would love to hear your thoughts, feedback and ideas - this ultimately determines what we do next, so please do get in touch.