Anonymisation and re-identification
We work to encourage local authorities and government bodies to open up data. Data exists on a spectrum and most often when we talk about open data we mean non-personal things such as bus fares or bin collections. Once in a while a local authority or combined authority will ask us about datasets that have a wider use but involve personally-identifying information.
Protecting privacy is critical. Generally speaking, all personal data should be removed. However, there are some cases where a dataset would be near useless without some notion of individuals, even though you shouldn't be able to match those individuals to real-life people. In these cases people turn to anonymisation methods.
Anonymisation can involve removing unnecessary fields (e.g. names) and aggregating other fields (e.g. providing postcode outcodes instead of full postcodes). These methods mostly group people and avoid small groups (by age, gender, geography etc.) in the data. However, anonymisation becomes more prone to re-identification as more fields are included, particularly if individuals turn out to be unique when the dataset is looked at as a whole. If any of the fields can be linked to external data that carries extra information, this could also lead to re-identification.
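To make the "unique when looked at as a whole" problem concrete, here is a minimal sketch in Python that counts how many records share each combination of fields and reports the combinations that fall below a threshold. The field names, the sample records, and the threshold `k` are all made up for illustration; a suitable threshold depends on your data and is a judgement call.

```python
from collections import Counter


def small_groups(records, fields, k=5):
    """Return the field-value combinations shared by fewer than k records."""
    counts = Counter(tuple(record[f] for f in fields) for record in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical records that have already had names and addresses removed.
records = [
    {"outcode": "LS9", "age_bracket": "25-44", "gender": "F"},
    {"outcode": "LS9", "age_bracket": "25-44", "gender": "F"},
    {"outcode": "LS9", "age_bracket": "25-44", "gender": "M"},
    {"outcode": "LS6", "age_bracket": "65-84", "gender": "F"},
]

# With k=2, any combination held by only one record is reported.
risky = small_groups(records, ["outcode", "age_bracket", "gender"], k=2)
for combo, n in risky.items():
    print(combo, n)
```

Running the check again each time you add a field shows how quickly groups shrink as fields are combined.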
Sonia Duarte of ODI HQ recently published a blog post about the perceived risks of re-identification and about an ODI HQ R&D project on managing the risks of re-identification. It is a three-year project running until March next year and you can get in touch with them to provide feedback.
Tips
If you are currently trying to anonymise data, here are some tips and things to keep in mind:
- Don't include obviously identifying things such as names, postal addresses, and unique IDs.
- If geography has to be included you have a few ways to group people:
  - Postcode outcodes e.g. "LS9" instead of "LS9 8AG" (there are around 3000 of these in the UK so each contains roughly 20,000 people on average).
  - LSOA - Lower Layer Super Output Area - is a common geography created by the UK's Office for National Statistics. Many ONS datasets are available by LSOA so this is useful for matching to, say, the Index of Multiple Deprivation.
  - Wards contain an average of 5,500 people so may be good enough depending on your dataset.
- Age can sometimes be useful in published datasets e.g. to see if some age groups are losing out. You should not include dates-of-birth as these are clearly very identifying. Publishers often attempt to retain some useful information by putting people into age brackets e.g. 0-17, 18-24, 25-44, 45-64, 65-84, and 85+. This works if you are publishing data that doesn't allow an individual to be tracked across time, either within the dataset itself or across updated releases in the future. However, if individuals can be tracked over time in your dataset(s), their date-of-birth can "leak" out when they move from one age bracket to the next. You could avoid this in a couple of ways:
  - Drop the month and day when calculating the age bracket, i.e. treat everyone as born on 1st January of their birth year.
  - Provide their birth decade instead, because this doesn't change over time.
- Time-based information could identify people in a larger dataset if combined with other fields or other, external, information. It could even be used to add more precision to aggregated geography if, say, the events happen along a bus route or in some sequence. Only provide times at the resolution that is actually needed. The ISO 8601 date format is good for reducing resolution because it allows shortened values such as "2019-03" (a month) or "2019" (a year).
- Gender can be identifying in cases where people have non-binary gender or have changed gender within published datasets. You can remove these individuals from your datasets to avoid them being re-identified. Be aware that this will bias your dataset with regard to non-binary/trans groups.
- If you create a unique ID/hash for an individual, do not use any of the individual's personal data to create the ID/hash as there is a small (but increasing with time) chance of someone being able to reverse engineer this with advances in computing power. Also be aware that if you assign unique IDs in a specific order, the ordering may add information to your dataset e.g. if your dataset was originally ordered by name when an ordered ID was assigned, an individual's placement relative to other individuals gives some idea of what their name was.
- Don't include small groups. When all fields are combined, individuals shouldn't be in groups small enough that they can be re-identified.
- Be aware of hard edges in the data. If you have aggregated fields into groups with "hard boundaries" (e.g. age brackets), make sure that individuals can't be matched across these boundaries in a way that allows the original value (e.g. date-of-birth) to be recovered.
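As a rough illustration of the geography, age and time tips above, the following Python sketch reduces a full postcode to its outcode, turns a date of birth into either a birth decade or a birth-year-based age bracket, and cuts a timestamp down to month resolution. The function names are made up for this example, the brackets are simply the ones listed above, and the outcode helper assumes it is given a complete, valid UK postcode.

```python
from datetime import date, datetime

# Age brackets from the tips above; anything older falls into "85+".
BRACKETS = [(0, 17), (18, 24), (25, 44), (45, 64), (65, 84)]


def outcode(postcode):
    """Return the outward part of a full UK postcode, e.g. "LS9 8AG" -> "LS9"."""
    compact = postcode.replace(" ", "").upper()
    # The inward part of a full postcode is always its last three characters.
    return compact[:-3]


def birth_decade(date_of_birth):
    """Return the decade of birth, which never changes between releases."""
    return f"{date_of_birth.year // 10 * 10}s"


def age_bracket(date_of_birth, on_date):
    """Bracket an age calculated as if the person was born on 1st January."""
    age = on_date.year - date_of_birth.year
    for low, high in BRACKETS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "85+"


def to_month(timestamp):
    """Reduce a timestamp to an ISO 8601 year-month string."""
    return timestamp.strftime("%Y-%m")


print(outcode("LS9 8AG"))
print(birth_decade(date(1987, 6, 15)))
print(age_bracket(date(1987, 6, 15), date(2020, 3, 1)))
print(to_month(datetime(2019, 3, 14, 8, 30)))
```

Because `age_bracket` ignores the month and day, everyone born in the same year changes bracket on the same day, so a change of bracket between releases reveals at most a birth year.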
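For the tip about unique IDs, one possible approach is to generate IDs randomly and keep the lookup table private, so that nothing in the published ID is derived from personal data or from the original ordering. This is only a sketch: `build_id_map`, the sample rows and the choice of 16-character hex IDs are assumptions made for the example, and shuffling the published rows is an extra precaution that the tips above don't explicitly ask for.

```python
import random
import secrets


def build_id_map(internal_keys):
    """Map each internal key to a random ID unrelated to any personal data."""
    id_map = {}
    used = set()
    for key in internal_keys:
        if key in id_map:
            continue  # the same individual keeps the same ID
        new_id = secrets.token_hex(8)  # 16 random hexadecimal characters
        while new_id in used:  # a collision is very unlikely, but check anyway
            new_id = secrets.token_hex(8)
        used.add(new_id)
        id_map[key] = new_id
    return id_map


# Hypothetical internal rows, originally ordered by name.
rows = [
    {"name": "Alice Example", "visits": 3},
    {"name": "Bob Example", "visits": 1},
    {"name": "Carol Example", "visits": 7},
]

# Keep id_map private; it is the only link back to real people.
id_map = build_id_map(row["name"] for row in rows)

# Publish the random ID in place of the name, then shuffle the rows so that
# their position no longer reflects the original alphabetical ordering.
published = [{"id": id_map[row["name"]], "visits": row["visits"]} for row in rows]
random.SystemRandom().shuffle(published)

for row in published:
    print(row)
```

If the same individuals appear in future releases, reusing the private `id_map` keeps their IDs consistent without ever deriving an ID from their details.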