Open data foundations - building a housing tool
The goals were:
- Demonstrate that planning-related events can be linked together
- Show that information about property counts can be extracted from planning data
- Build a confidence score based on assumptions drawn from such a derived dataset
- Test the architectural design
- Demonstrate that a prototype can be created that meets these needs
We started with data sources. Through a combination of research, crowdsourced suggestions, and conversations with people who worked in planning departments or with planning data, we compiled everything into an openly accessible Google Doc. As we evaluated the datasets for their relevance and usability, we kept encountering the same issues:
- Commercial licensing, especially with address and postcode data
- Missing metadata
These might not come as a surprise to a lot of folk who regularly work with open data and/or housing data. Of all the datasets we assessed, we chose two from Leeds City Council: Planning Applications and Council Tax Registrations. We chose them because they are good proxies for the planning and the completion of schemes, respectively.
Now that we had our datasets, the next step was building a prototype to test the theory of extracting useful information from linked data. By building something, we encountered (and fixed) problems as we went and could continually evaluate how well the approach was working. Where it was possible and prudent, we made improvements to the underlying architecture, but we also documented what wasn't working (and why), and what the alternatives could be.
Some issues are outlined below:
- In many cases, multiple planning permissions are associated with a single geographic area. In some cases, these are updates to the already planned scheme. In others, however, there is no relationship between the former and latter schemes, and they should be treated as separate schemes.
- Inferring concrete data from unstructured data (the text description of a planning application in this instance) proved difficult. The property counting algorithm that we created was often confused by multiple numbers in a planning summary, many of which referred to irrelevant details.
- Datasets are rarely published with geospatial data. As this is required for matching, the prototype had to make use of other sources of geospatial data (e.g. API calls). Even the use of UPRN, which is meant to tie to geometry, is limited. This appears to be associated with a lack of clarity around licensing for geographical data. Many geospatial datasets have licenses that explicitly preclude publishing open datasets derived from (or at the least including) the geographical data they contain. While UPRN was intended as a means of unlocking this, many data managers are unwilling to publish UPRNs in the belief that they are covered by the same terms. The same is true of address data, postcodes in particular.
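The property counting problem in the second point can be illustrated with a minimal sketch. This is not the prototype's algorithm: the regular expression, the `DWELLING_WORDS` list and the `count_properties` function are all assumptions made for illustration. It simply sums any number that appears shortly before a word for a dwelling.

```python
import re

# Hypothetical list of nouns that signal a residential unit (an assumption,
# not taken from the prototype).
DWELLING_WORDS = r"(?:dwellings?|houses?|flats?|apartments?|bungalows?|units?)"

# A number, then up to three intervening words, then a dwelling noun.
PATTERN = re.compile(
    r"\b(\d+)\s+(?:[\w-]+\s+){0,3}?" + DWELLING_WORDS + r"\b",
    re.IGNORECASE,
)


def count_properties(description: str) -> int:
    """Sum every number that appears shortly before a dwelling noun."""
    return sum(int(n) for n in PATTERN.findall(description))
```

A description such as "Erection of 12 dwellings with 24 car parking spaces" gives 12, because the parking spaces are ignored. But "Demolition of 1 house and construction of 3 detached houses and 2 flats" gives 6, because the demolished house is counted alongside the five new properties. A naive approach like this one has no way of telling which numbers matter.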
Despite the challenges, a working prototype with a simple UI was built. It successfully demonstrated that datasets could be linked, and that useful information could be inferred from the linked data. In the screenshot below, the example planning scheme shows that a number of properties were identified from the planning application description text. Using geospatial data as the common factor between the two datasets (planning applications and council tax registrations), the number of identified properties was then checked against the number of properties registered for council tax in the same geospatial area and within the timeframe of the planning application. This relies on the assumption that council tax bands have not been recalculated in Leeds for at least 10 years, so a new registration indicates a newly completed property.
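As a rough sketch of that check (not the prototype's actual code), the comparison could look something like the following. The `Registration` class, the 100 metre radius, the use of point coordinates rather than site boundaries, and the ratio used as a confidence value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from math import asin, cos, radians, sin, sqrt
from typing import Iterable, Optional


@dataclass
class Registration:
    """A council tax registration with a point location (hypothetical shape)."""
    lat: float
    lon: float
    registered_on: date


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    earth_radius_m = 6_371_000
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))


def completion_confidence(
    expected: int,
    site_lat: float,
    site_lon: float,
    decided_on: date,
    registrations: Iterable[Registration],
    radius_m: float = 100.0,       # assumed search radius around the site
    until: Optional[date] = None,  # optional end of the timeframe
) -> float:
    """Share of expected properties matched by nearby, later registrations."""
    if expected <= 0:
        return 0.0
    matched = sum(
        1
        for reg in registrations
        if reg.registered_on >= decided_on
        and (until is None or reg.registered_on <= until)
        and haversine_m(site_lat, site_lon, reg.lat, reg.lon) <= radius_m
    )
    # Cap at 1.0 so extra registrations do not push confidence above certainty.
    return min(matched / expected, 1.0)
```

With four properties expected and two registrations found near the site after the decision date, this returns 0.5. A real implementation would more likely match against site boundaries than a fixed radius.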
Additional helpful UI elements included a 'timeline' of events, which essentially chronicles a planning scheme from the start of a process to completion (though this proved visually problematic when there were a lot of properties associated with a single scheme).
One of several potential next steps for this work is to add further events to this timeline based on data from other sources, such as utility connections, registration for bin collection, etc. These events would also contribute to the overall confidence that these properties exist.
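To give a sense of how events from different sources might sit on one timeline, here is a minimal sketch. The `Event` class, its field names and the event kind labels are hypothetical and are not taken from the prototype repositories.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Event:
    """One dated event linked to a planning scheme (hypothetical shape)."""
    scheme_id: str
    occurred_on: date
    kind: str  # e.g. "planning_application", "council_tax_registration"


def build_timeline(events: Iterable[Event], scheme_id: str) -> List[Event]:
    """Return the events for one scheme, oldest first."""
    return sorted(
        (e for e in events if e.scheme_id == scheme_id),
        key=lambda e: e.occurred_on,
    )


def evidence_kinds(timeline: Iterable[Event]) -> Set[str]:
    """Distinct kinds of event seen for a scheme."""
    return {e.kind for e in timeline}
```

Keeping every source in the same simple shape means a new one, such as utility connections, only needs to be converted into `Event` records to appear on the timeline. The number of distinct kinds could then feed into the confidence score.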
Code for the prototype elements is shared in four separate git repositories in the Open Production GitHub organisation, which is owned by ODI Leeds, so you can take a look at the approach we used.
So what happens next?
There was a lot of interest in the tool from potential end-users (local authorities, planning departments, data analysts, housing associations), so a thorough exploration of user needs would be a solid next step towards building relevant functionality into future versions of the tool. There is also scope for the tool (or at least the underlying architecture of matching data and event aggregation) to be applied to other aspects of planning, or even more broadly, to help derive deeper insight about specific topics.
Something can also be said for improving the education/understanding about data licensing and the need for good metadata. From the experience of working with the open data collaboration group at ODI Leeds, a lot of progress was made when local authorities worked together on data standards for business rates, so there is scope for a similar collaboration between those interested in making planning and housing data better.
Of course, we would also encourage more data to be made openly available. This would remove a great many barriers, both in terms of accessing data and sharing it. When barriers come down, people are free to innovate and experiment. Open data and innovation create a surplus - of energy, ideas, data. From this surplus comes value that everyone can benefit from.