Exploring methods for synthetic A&E data
This guest blog post - written by Jonny Pearson, Senior Analytical Manager for NHS England - is the next instalment of the 'SynAE' project with NHS England, exploring the technical methods for creating synthetic data. There is a vast amount of new insight to be gleaned from datasets that contain sensitive information. Synthetic data presents an opportunity to explore those datasets whilst keeping people's information safe, and provides a platform for innovation that could make a real difference across the health services. We encourage everyone who is interested in synthetic data to contribute to the online collaboration doc or register for the workshop in March.
Open data has immense value for development and innovation. As part of my role in the NHS I come across a huge variety of rich data and wanted to investigate how to open up the value within these data, whilst keeping patients' personal data secure.
I decided to start with patients flowing through A&E departments, including some information about their arrival and departure. The first step was then to link the A&E and admitted patient sources of data together. These data are collected by hospitals and submitted to NHS Digital each month. They are then checked and processed into the Secondary Uses Service (SUS) data tables, which are made accessible to a variety of users including myself. At this point the data have had all patient identifiable information stripped away, and a derived pseudo NHS number has been created to enable records to be linked together. Therefore, I can use a join in SQL to bring together all emergency patients flowing through A&E with any available admitted information. To make the extract large enough, and to include some seasonal effects, I took four years' worth of A&E data (from April 2014 to March 2018) and only linked the first episode of any admitted records which occurred within 24 hours of the A&E attendance.
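As an illustration only, the linkage rule described above could be sketched in Python rather than SQL. The field names (`pseudo_id`, `arrival`, `admitted`, `diagnosis`) and the example rows are hypothetical, and the sketch assumes the admission starts at or after the A&E arrival. It is not the query used for the extract.

```python
from datetime import datetime, timedelta


def link_first_admission(attendances, admissions, window=timedelta(hours=24)):
    """Attach to each A&E attendance the earliest admission for the same
    pseudo ID that starts within `window` of the arrival (else None)."""
    by_patient = {}
    for adm in admissions:
        by_patient.setdefault(adm["pseudo_id"], []).append(adm)

    linked = []
    for att in attendances:
        candidates = [
            adm
            for adm in by_patient.get(att["pseudo_id"], [])
            if timedelta(0) <= adm["admitted"] - att["arrival"] <= window
        ]
        # Keep every attendance (like a left join), with at most one admission.
        first = min(candidates, key=lambda adm: adm["admitted"], default=None)
        linked.append({**att, "admission": first})
    return linked


# Hypothetical example rows.
attendances = [
    {"pseudo_id": "A1", "arrival": datetime(2016, 1, 5, 22, 0)},
    {"pseudo_id": "B2", "arrival": datetime(2016, 1, 6, 9, 30)},
]
admissions = [
    {"pseudo_id": "A1", "admitted": datetime(2016, 1, 6, 1, 0), "diagnosis": "first"},
    {"pseudo_id": "A1", "admitted": datetime(2016, 1, 9, 8, 0), "diagnosis": "later"},
]
linked = link_first_admission(attendances, admissions)
```

Attendances with no admission inside the window are kept with an empty admission, which mirrors "any available admitted information" in the text.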
At this point I have over 70 million records with some patients having multiple records across the time-period. These data contain a range of insights including:
- who attends A&E?
- why they attend?
- when they attend?
- if they are admitted? and
- if they are admitted, what for?
This will allow users of the data to use predictive models on most likely attenders, or who is most likely to be admitted based on what we know when they first enter A&E, or perhaps to build classifiers for patient types and usage, or even to highlight why some records appear to be outliers.
However, a person could still be identified from these data by finding the unique records and matching them to other available data. Therefore, I need to ensure the data is completely secure whilst at the same time maintaining as much of the information in the data as possible to allow users to start investigating the above questions.
Making the data non-identifiable
As some of the individual fields could enable personal identification, I had to ensure that no single field could be used to identify a unique record.
Where a patient lives
I started with the postcode of the patient's resident lower super output area (LSOA). This is a geographical definition with an average of 1500 residents, created to make reporting in England and Wales easier. I wanted to keep some basic information about the area where the patient lives whilst completely removing any information regarding any actual postcode. A key variable in health care inequalities is the patient's Index of Multiple Deprivation (IMD) decile (a broad measure of relative deprivation), which gives an average ranked value for each LSOA. By replacing the patient's resident postcode with an IMD decile I have kept a key bit of information whilst making this field non-identifiable.
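A minimal sketch of this replacement, assuming a lookup from LSOA code to IMD decile is available. The field names, the placeholder area codes and the decile values below are invented for illustration and are not real LSOA or IMD data.

```python
def replace_lsoa_with_imd(records, imd_decile_by_lsoa):
    """Drop the area code from each record and keep only its IMD decile."""
    cleaned = []
    for rec in records:
        rec = dict(rec)               # copy so the input is left untouched
        lsoa = rec.pop("lsoa", None)  # remove the identifying geography
        rec["imd_decile"] = imd_decile_by_lsoa.get(lsoa)  # None if unknown
        cleaned.append(rec)
    return cleaned


# Placeholder lookup and records (not real codes or deciles).
imd_lookup = {"AREA_1": 3, "AREA_2": 9}
patients = [
    {"record": 1, "lsoa": "AREA_1"},
    {"record": 2, "lsoa": "AREA_2"},
    {"record": 3, "lsoa": "AREA_9"},
]
patients_imd = replace_lsoa_with_imd(patients, imd_lookup)
```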
The next field to look into was the individual hospitals. It wouldn't be too difficult to work out an approximate locality for an individual record based on the location of the hospital. As each hospital has its own complex case mix and health system, using these data to identify poor performance or possible improvements would be invalid and unhelpful. Therefore, I decided to replace the hospital code with a random number. Obviously, it would still be possible to start to match these random numbers to the original hospitals by comparing the number of people attending against official records, and so for the final extract I'll only provide X% of the records, which along with the other changes to the data will make it impossible to accurately re-identify this field.
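One way to picture the random relabelling is the sketch below. The function name, the `hospital` field and the placeholder codes are hypothetical, and the sampling of a fraction of records is left out because the percentage is not stated.

```python
import random


def pseudonymise_hospitals(records, seed=None):
    """Replace each distinct hospital code with a randomly assigned number."""
    rng = random.Random(seed)
    codes = sorted({rec["hospital"] for rec in records})
    labels = list(range(1, len(codes) + 1))
    rng.shuffle(labels)  # random assignment of labels to hospitals
    mapping = dict(zip(codes, labels))
    relabelled = [{**rec, "hospital": mapping[rec["hospital"]]} for rec in records]
    return relabelled, mapping


# Placeholder hospital codes.
visits = [{"hospital": "H_A"}, {"hospital": "H_B"}, {"hospital": "H_A"}]
relabelled_visits, hospital_mapping = pseudonymise_hospitals(visits, seed=1)
```

In practice the mapping would need to be kept out of the released extract, since it reverses the relabelling.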
Time in the data
The next obvious step was to simplify some of the time information I have available as health care system analysis doesn't need to be responsive enough to work on a second and minute basis. Thus, I removed the time information from the 'arrival date', mapped the 'arrival time' into 4-hour chunks, and rounded the 'time in A&E' to 10-minute chunks.
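The three time simplifications can be sketched as follows. This assumes the arrival is held as a single timestamp and the time in A&E as whole minutes, and it rounds to the nearest 10 minutes (halves rounded up), which is one reading of "rounded". The names are illustrative.

```python
from datetime import datetime


def coarsen_times(arrival, minutes_in_ae):
    """Keep the arrival date, a 4-hour arrival band and a rounded duration."""
    band_start = (arrival.hour // 4) * 4
    return {
        "arrival_date": arrival.date(),  # time of day removed
        "arrival_band": f"{band_start:02d}:00-{band_start + 4:02d}:00",
        "minutes_in_ae": (minutes_in_ae + 5) // 10 * 10,  # nearest 10 minutes
    }


coarse = coarsen_times(datetime(2017, 3, 14, 13, 47, 9), 127)
```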
Large value outliers
I also noticed that some of the interval fields such as 'length of stay' and 'distance from patient residence to nearest A&E' had outlier values, especially very large ones. By capping these fields to 180 days and 200 miles respectively, and additionally capping the number of investigations, diagnoses, and treatments to a maximum of 10, I can remove many unique records from the dataset whilst removing very little value in the overall data.
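The capping itself is a simple clamp. In this sketch the limits come from the text above, while the field names are assumptions.

```python
# Upper limits taken from the text; field names are hypothetical.
CAPS = {
    "length_of_stay_days": 180,
    "distance_miles": 200,
    "investigations": 10,
    "diagnoses": 10,
    "treatments": 10,
}


def cap_record(rec, caps=CAPS):
    """Clamp each capped field to its maximum; leave other fields alone."""
    return {
        key: min(value, caps[key]) if key in caps and value is not None else value
        for key, value in rec.items()
    }


capped = cap_record(
    {
        "length_of_stay_days": 400,
        "distance_miles": 12.5,
        "investigations": 14,
        "diagnoses": 2,
        "treatments": 10,
        "sex": "F",
    }
)
```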
I next looked at the patients' age and sex. I decided to only include records with a sex of male or female in order to reduce the risk of re-identification through low numbers. For the patient's age it is common practice to group these into bands, and so I've used a standard set - 1-17, 18-24, 25-44, 45-64, 65-84, and 85+ - which, although non-uniform, are well-used segments defining different average health care usage.
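A small sketch of the banding and the sex filter. The band labels are those listed above. How ages below 1 are treated is not stated, so the sketch returns None for them, and the `sex` coding is an assumption.

```python
from bisect import bisect_right

# Lower edges of every band after the first, with the labels from the text.
BAND_EDGES = [18, 25, 45, 65, 85]
BAND_LABELS = ["1-17", "18-24", "25-44", "45-64", "65-84", "85+"]


def age_band(age):
    """Map an age in whole years to its band label."""
    if age < 1:
        return None  # assumption: handling of ages below 1 is not stated
    return BAND_LABELS[bisect_right(BAND_EDGES, age)]


def keep_record(rec):
    """Keep only records whose sex is recorded as male or female."""
    return rec.get("sex") in {"male", "female"}


bands = [age_band(a) for a in (1, 17, 18, 24, 25, 64, 65, 84, 85, 101)]
```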
Health care coding
The final area of additional pseudo-anonymisation and grouping which I investigated was health care grouping, including the Healthcare Resource Group (HRG) in A&E, the Primary Diagnosis Code (ICD-10) and the treatment function code for the admitted patient. As these variables have the largest range of input codes (and therefore the most potential value) I left these to the end in order to see how re-identifiable the data was at this point and thus how much I needed to do to these fields.
To do this I grouped the data fields into patient demographic, time and date information, and health care coding. For each of the three groups I calculated how many possible combinations of the fields within them there were and counted the number of records relating to each. When I found unique records, I would group the smallest healthcare codes together (e.g. if fewer than 10 patients with treatment function code "570" exist, then this is grouped with the most similar code to it). By iterating this approach, I eventually found a grouping which meant no unique records were present. This is important as it means that there is no combination of health care coding which can be related to time and person information to identify a single case, and thus it is impossible to identify any particular record relating to any real-world event.
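The count-and-merge loop can be sketched roughly as below. The choice of "most similar code" is a judgement that the sketch does not attempt, so `merge_into` is a hypothetical hand-built mapping, and the codes "A" and "B" are placeholders rather than real treatment function codes.

```python
from collections import Counter


def unique_combinations(records, fields):
    """Return the field-value combinations that occur in exactly one record."""
    counts = Counter(tuple(rec[f] for f in fields) for rec in records)
    return [combo for combo, n in counts.items() if n == 1]


def merge_rare_codes(records, field, merge_into, threshold=10):
    """Recode values of `field` seen fewer than `threshold` times, using the
    hand-built `merge_into` mapping of rare code -> similar code."""
    counts = Counter(rec[field] for rec in records)
    merged = []
    for rec in records:
        code = rec[field]
        if counts[code] < threshold and code in merge_into:
            code = merge_into[code]
        merged.append({**rec, field: code})
    return merged


# Placeholder data: three records with code "A" and one with rare code "B".
coded = [{"age_band": "25-44", "code": "A"} for _ in range(3)]
coded.append({"age_band": "25-44", "code": "B"})

before = unique_combinations(coded, ["age_band", "code"])
regrouped = merge_rare_codes(coded, "code", {"B": "A"}, threshold=2)
after = unique_combinations(regrouped, ["age_band", "code"])
```

Repeating the two steps until `unique_combinations` returns an empty list corresponds to the iteration described above.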
Swapping the data
At this point I'm confident and pleased that the data can't be used to identify an individual record against any single event. However, I can't say the data is synthetic; in fact, it's still patient data, just with much of the detail stripped away. In order to make it synthetic I need to process the data in a way that no original data exists in the extract whilst maintaining the richness of the data itself.
Originally, I wanted to apply a 'random forest' at this point. This algorithm builds multiple decision trees and merges them together to get a more accurate and stable prediction. Each decision tree attempts to build a set of rules identifying, for a subset of the data, which variables are key to the final outcome. By combining the learning from each decision tree into a random forest I could find which variables had the least impact on the data and thus which we could alter most easily without impacting the underlying relationships. However, I found that I could only push ~250,000 records through the randomForest algorithm in R. Other applied algorithms faced a similar restriction and wouldn't produce an output.
I also followed the excellent work produced by the Simulacrum project on cancer pathways. Here the team have produced synthetic data by starting with a blank canvas and producing data based on the profiles and inter-relationships between variables they found in historic data. This approach has produced a good dataset and I look forward to comparing the approaches with the aim of establishing which is better suited for future similar work.
In the end I have used a stepwise Pearson's correlation to identify the variables with the least relationship to any other variable in the extract. Several variables were found to have very low coefficients of correlation, meaning that these variables show little to no direct linear relationship with any other variable. The profile of these variables was then extracted and the data column overwritten using sampling with replacement from the variable profile. Thus, I end up with the variable profile being maintained but the inter-relationships of these variables with the rest of the data being destroyed. As these variables have been identified as having little, if any, correlation with the others, I aim to lose as little value in the data as possible.
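A simplified, non-stepwise sketch of the idea: score each variable by its largest absolute Pearson correlation with any other variable, then overwrite the low-scoring columns by sampling with replacement from their own values. It assumes every variable is already numerically encoded. The 0.1 threshold, the function names and the toy columns are all assumptions rather than the values used for the extract.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no linear relationship
    return sxy / math.sqrt(sxx * syy)


def max_abs_correlation(columns):
    """For each column, its largest |r| against any other column."""
    return {
        name: max(
            abs(pearson(values, other))
            for other_name, other in columns.items()
            if other_name != name
        )
        for name, values in columns.items()
    }


def resample_column(values, rng):
    """Overwrite a column by sampling with replacement from its own values."""
    return rng.choices(values, k=len(values))


# Toy columns: b is a multiple of a, c is unrelated to both.
columns = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [2, 4, 6, 8, 10, 12, 14, 16],
    "c": [1, -1, -1, 1, 1, -1, -1, 1],
}
scores = max_abs_correlation(columns)
to_swap = [name for name, score in scores.items() if score < 0.1]

rng = random.Random(0)
swapped_columns = {
    name: resample_column(values, rng) if name in to_swap else values
    for name, values in columns.items()
}
```

Resampling draws from the observed values, so the marginal profile of a swapped column is only preserved approximately, and any link between that column and the rest of each record is broken.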
I now run the data through a duplicate function in Stata to see if any record after the swapping has remained the same as before. If no record is the same then I can say that, at the very least, the data now contains only new records, none of which is identical to a real data record. If duplicates remain then I go back to the stepwise correlation and incrementally swap more variables until no duplicates exist. The question is then how much value I have lost in order to ensure no record is original.
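The duplicate check could look something like the sketch below, written in Python rather than Stata. It flags any swapped record that matches any original record on all listed fields, which is a slightly stricter test than comparing row by row. The field names and rows are placeholders.

```python
def surviving_originals(original, swapped, fields):
    """Return swapped records that are identical to some original record."""
    seen = {tuple(rec[f] for f in fields) for rec in original}
    return [rec for rec in swapped if tuple(rec[f] for f in fields) in seen]


# Placeholder rows.
check_fields = ["age_band", "imd_decile", "code"]
original_rows = [
    {"age_band": "25-44", "imd_decile": 3, "code": "A"},
    {"age_band": "65-84", "imd_decile": 9, "code": "B"},
]
swapped_rows = [
    {"age_band": "25-44", "imd_decile": 9, "code": "A"},
    {"age_band": "65-84", "imd_decile": 9, "code": "B"},
]
leftover = surviving_originals(original_rows, swapped_rows, check_fields)
```

A non-empty result would mean going back and swapping further variables, as described above.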
I am working with ODI Leeds to bring together experts and interested colleagues from a wide range of backgrounds to discuss and comment upon this methodology. The aim of these discussions is to find a balance between idealism and pragmatism for a future standardised synthetic methodology. These discussions are open to all through the workshop on 5th March as well as the open online documents.
Guest blogger - Senior Analytical Manager, NHS England