At ODI Leeds we like open data. However, that doesn't mean we think all data should be open. Data exists on a spectrum. Personal data - from banking, to medical records, to location data - should often be kept under lock and key because the consequences for some individuals can be serious if it becomes available.
Before I go further: ODI HQ have principles for handling personal data, and these give important context on consent and transparency when sharing personal data.
The issues around personal data can, quite rightly, cause tension when trying to open up some "public" data sets. Public bodies collect data about our use of their services. We want to be able to check that those services are being used and not abused, and that they are not excluding some groups via badly-worded policy or biases (conscious, unconscious, societal, etc.). To check for demographic bias we need to know about demographics. But having the demographics also makes it possible, potentially, to identify individuals.
There are several ways you might "anonymise" data:
- Strip out all identifying data, including information that could link records within your dataset, e.g. unique IDs, geo-location, demographics, relationships with others. Perhaps provide only aggregate summaries rather than every item in a data set. This will probably result in a data set of very limited use.
- Censor specific data or individuals where they fall into small groups of, say, five or ten in the data set. This is often used by organisations dealing with vulnerable people such as children.
- Aggregate data into groups, e.g. use age bands rather than dates of birth.
- Degrade the resolution. If you are collecting information about travel use, say with an Oyster-type travel card, a timestamp to the nearest second could be very identifying when combined with other information in the data set. Degrading the time as much as possible, whilst still allowing some useful analysis, may mean grouping timestamps into 10-minute or half-hour bins.
- Perturb the data with statistical noise or 'fudges', e.g. data.police.gov.uk move crime locations to the nearest point in a master list of locations to help protect the victims of crime.
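The banding, degrading and suppression techniques above can be sketched in a few lines of Python. This is a minimal illustration with made-up band widths and thresholds, not a complete anonymisation routine:

```python
from datetime import datetime

def age_band(age, width=5):
    """Replace an exact age with a band, e.g. 52 -> "50-54"."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def degrade_timestamp(ts, minutes=30):
    """Round a timestamp down to the start of its half-hour slot."""
    slot = (ts.minute // minutes) * minutes
    return ts.replace(minute=slot, second=0, microsecond=0)

def suppress_small_groups(counts, threshold=5):
    """Censor any count below the threshold, as done for small groups."""
    return {k: (v if v >= threshold else None) for k, v in counts.items()}

print(age_band(52))                                        # "50-54"
print(degrade_timestamp(datetime(2016, 3, 1, 8, 47, 12)))  # 2016-03-01 08:30:00
print(suppress_small_groups({"A": 12, "B": 3}))            # {'A': 12, 'B': None}
```

The exact band widths, time bins and suppression thresholds are a judgement call that depends on the data set and who might attack it.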
These methods of anonymising data are useful, but we must also think carefully about how they are applied. It can be possible to uniquely identify people from only a few pieces of information.
If you want to anonymise a dataset, it is crucial that you understand the data. In fact, know it inside out.
In "anonymising" data we need to think like a person (or machine) trying to de-anonymise the data. Our anonymisation methods can, counter-intuitively, add information to a data set that may allow for de-anonymisation later. Are we accidentally leaving clues?
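One practical way to think like the attacker is a k-anonymity check: group the records by the fields an attacker might plausibly know, and find the smallest group. If the answer is 1, at least one person is unique on those fields alone. A sketch, using hypothetical field names:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest group size when records are grouped by the given
    quasi-identifier fields; k=1 means at least one record is unique."""
    groups = Counter(tuple(r[f] for f in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical records: even banded and coarsened fields can leave someone unique.
people = [
    {"age_band": "50-54", "postcode_area": "LS1", "sex": "F"},
    {"age_band": "50-54", "postcode_area": "LS1", "sex": "F"},
    {"age_band": "65-69", "postcode_area": "LS6", "sex": "M"},  # unique combination
]
print(k_anonymity(people, ["age_band", "postcode_area", "sex"]))  # 1
```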
Let's take date-of-birth as an example. We may have replaced a date-of-birth column with an age or age category. This degrades the resolution so doesn't, for an isolated data point, let us work out someone's date-of-birth. If someone is placed in, say, a 50-55 age category (sporting events often do this), we may think we've solved the problem. However, if the data set spans a long enough period of time, a person's age will vary. Every time their age crosses a boundary (say from 60-65 to 65-70) we are given information that helps us recover the date-of-birth much more accurately. If we want to protect date-of-birth, we need to make sure an individual's records cannot be linked across the artificial boundaries introduced to "anonymise" the data.
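To see how much the boundary crossing gives away, suppose we can link an individual's records and observe the last date they appeared in one band and the first date they appeared in the next. Their birthday at the band boundary falls between those two dates, which pins the date-of-birth to a window of the same length. A sketch with hypothetical observation dates:

```python
from datetime import date

def dob_window_from_crossing(last_in_old_band, first_in_new_band, boundary_age):
    """If someone last appears in (say) the 60-64 band on one date and
    first appears in the 65-69 band on a later date, their 65th birthday
    falls between those dates. Their date of birth therefore lies after
    the first returned date and on or before the second."""
    earliest = last_in_old_band.replace(year=last_in_old_band.year - boundary_age)
    latest = first_in_new_band.replace(year=first_in_new_band.year - boundary_age)
    return earliest, latest

# Records one week apart narrow the date-of-birth to a one-week window.
lo, hi = dob_window_from_crossing(date(2016, 3, 1), date(2016, 3, 8), boundary_age=65)
print(lo, hi)  # 1951-03-01 1951-03-08
```

The more frequently records are published, the narrower the window: daily records would recover the date-of-birth exactly.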
In the same way, leaving gaps (for groups of 10 or fewer) could even add information to a data set. If there are enough pieces of information, and the data set is not truly random, we could recover some of the omitted information. Take, for example, a sudoku puzzle.
A sudoku puzzle is, effectively, a partially anonymised data set. At the start, more than half the numbers have been removed. However, a sudoku has rules that determine how numbers are removed, and rules that link neighbouring cells. We can apply the logic of the underlying data, and the rules for removal, to fill in the blanks.
Real-life data isn't usually as rule-based or deterministic as a sudoku puzzle. But, if we have enough information and data, we may be able to recover some of the gaps.
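This sudoku-style inference has a very direct analogue in published statistics: if a table suppresses a small-group cell but also publishes the row total, simple subtraction recovers the "censored" value. A toy illustration with made-up numbers:

```python
def recover_suppressed(row, total):
    """If exactly one cell in a published row is suppressed (None) but
    the row total is published, subtraction recovers the hidden value."""
    missing = [k for k, v in row.items() if v is None]
    if len(missing) == 1:
        known = sum(v for v in row.values() if v is not None)
        return {missing[0]: total - known}
    return {}  # zero or several gaps: nothing (or only bounds) recoverable

# The suppressed small-group count for "B" falls straight out.
print(recover_suppressed({"A": 120, "B": None, "C": 45}, total=168))  # {'B': 3}
```

This is why suppression usually has to be applied to totals and neighbouring cells too (so-called secondary suppression), not just the sensitive cell itself.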
This blog post is not intended to scare people off sharing data sets as open data. We should be sharing data sets that help make our public and, in some cases, private services better. At the same time, we must be mindful of how de-anonymisation can work and try to counter it by not inadvertently leaking information.
We will be collating our thoughts on anonymisation and privacy along with our other open data tips on Github. You are welcome to contribute. You can also read the ICO's Anonymisation Code of Practice and the UK Anonymisation Network's Anonymisation Decision-making Framework, which provides a useful and practical look at how to go about anonymisation. Check out the ODI's worked exercise on anonymisation with Titanic passenger data.