Reproducible Analytical Pipelines! Reproducible Analytical Pipelines! Reproducible Analytical Pipelines!
There’s a lot of talk about Reproducible Analytical Pipelines at the moment, ever since the Goldacre Review identified them as a crucial component of a modern, open health service.
> Promote and resource 'Reproducible Analytical Pathways' (RAP, a set of best practices and training created in ONS) as the minimum standard for academic and NHS data analysis: this will produce high quality, shared, reviewable, re-usable, well-documented code for data curation and analysis; minimise inefficient duplication; avoid unverifiable 'black box' analyses; and make each new analysis faster.
Now, I’ll confess I have a slight problem with them. Not the concept, which is excellent and will go a long way to improving quality and efficiency.
No: my problem is with the term. Firstly, it tends to get shortened to RAP, which is OK as an acronym, although it sometimes leads to confusion about whether the conversation is about the urban musical form or data processing. (On reflection, maybe that’s just our office culture.)
This minor complaint, however, pales into insignificance when considering the expanded name: Reproducible Analytical Pipelines. I don’t even mind that it’s a bit of a mouthful. It’s just that, every time I hear it, this is the image it conjures:
(If you’re none the wiser, this short clip from the cartoon Family Guy will explain everything. Or maybe not.)
So, having got that off my chest, and with a brief apology for implanting that meme in your brain, let’s look at Reproducible Analytical Pipelines.
A RAP is essentially a description, in code and configuration, of how a dataset or report is produced. Ideally this code describes the entire process, from the source system or systems where the data resides through to the finished report. Better still, the entire process can run without any human intervention. The benefits are a reduction in busywork and consistent quality.
Before we dive into how RAPs are built, let’s take a look at how this kind of thing is typically done. Very often, reports and visualisations are produced using office productivity software: mainly spreadsheets. There’s nothing wrong with this in principle, as many people have access to these tools and have a reasonable understanding of how to use them. That said, the excellent Excel as a Database web comic by Rory Blyth paints an all-too-recognisable picture of productivity software abuse. Take a read… I’ll wait.
An example of a manual pipeline could be something like
- Extract data from the customer and booking systems as CSV files.
- Copy and paste these two files into Excel in separate tabs.
- Add a formula to cross-reference between customers and bookings.
- Calculate waiting time based on specified types of contact.
- Create a pivot table / chart showing waiting time broken down by postcode.
- Export the chart as a picture and paste it into a Word document.
- Save the Word document as a PDF.
- Upload the PDF to the team SharePoint site.
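To show how little code the scripted equivalent can take, the cross-referencing and summarising steps above can be sketched in a few lines of Python with pandas. This is a minimal, hypothetical sketch: the column names, sample data, and waiting-time rule are all invented for illustration, not taken from a real pipeline.

```python
import pandas as pd

# Invented sample data standing in for the CSV extracts from the
# customer and booking systems (the first two manual steps).
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "postcode": ["LS1", "LS2"],
})
bookings = pd.DataFrame({
    "customer_id": [1, 2],
    "contact_date": ["2022-06-01", "2022-06-03"],
    "booking_date": ["2022-06-08", "2022-06-06"],
})

# Cross-reference customers and bookings (the Excel formula step).
merged = bookings.merge(customers, on="customer_id", how="left")

# Waiting time in days between first contact and booking (rule invented here).
merged["waiting_days"] = (
    pd.to_datetime(merged["booking_date"]) - pd.to_datetime(merged["contact_date"])
).dt.days

# Summarise by postcode: the same shape of output as the pivot table.
summary = merged.groupby("postcode")["waiting_days"].mean()
print(summary)
```

From here, saving a chart and publishing the result become further scripted steps, so the whole process can be re-run with no copying and pasting.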
There are potential fracture points in these manual processes:
- Much of the organisational knowledge resides with a potentially small subset of the people involved.
- It may be difficult to inspect and reason about the processes.
- This means that provenance of datasets is not clear. There may be all kinds of ‘adjustments’ being made.
- The process may become over-reliant on those “in the know”. If they are unavailable the process will not run and the organisation will not have access to the latest data.
- People tend to be introduce errors¹ into the process.
So, these kinds of Irreproducible Manual Pipelines (I may have just inadvertently coined the term IMP) are inherently fragile.
By building a RAP, we are capturing the organisational knowledge in code. It is immediately possible to see how a dataset / report is created. If this code is built into a re-runnable pipeline, then a job that takes a number of hours could be significantly shortened, at least in terms of human attention. Take the next step to running this automatically, and suddenly people are freed up to focus on more important things like answering questions posed by data.
Given that RAPs originated in government, there is a preference for open source tools. This is no bad thing, as it means that anyone with access to the source data and your code can check your reasoning, which would be more difficult with proprietary software. At Open Innovations, we have recently been using some excellent Python libraries such as pandas to manipulate data. We’ve even used Python libraries to undertake fuzzy matching when joining datasets. Since we also work completely in the open, we use GitHub Actions to run our pipelines.
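To give a flavour of that fuzzy-matching step, here is a minimal sketch using `difflib` from Python’s standard library as a stand-in (the post doesn’t name the libraries actually used, and the organisation names below are invented):

```python
from difflib import get_close_matches

# Invented example: the same organisations named slightly differently
# in two datasets that need joining.
canonical = ["Leeds Teaching Hospitals NHS Trust", "Mid Yorkshire Hospitals NHS Trust"]
messy = ["Leeds Teaching Hospitals Trust", "Mid Yorks Hospitals NHS Trust"]

# Map each messy name to its closest canonical equivalent. The cutoff is a
# similarity threshold between 0 and 1; it needs tuning for real data.
matches = {
    name: (get_close_matches(name, canonical, n=1, cutoff=0.6) or [None])[0]
    for name in messy
}
print(matches)
```

Dedicated fuzzy-matching libraries offer more control over the similarity measure, but the principle is the same: the matching rule lives in code, where it can be inspected and re-run, rather than in someone’s head.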
I described the aim as consistent quality, and this bears some unpacking. There may still be errors in the pipeline, but at least the data is comparable between runs. Contrast this with a human-run pipeline, where all sorts of errors could be introduced and we’d never know: transcription errors, problems with copy and paste, bugs in the formulas used, or simply mislabelling files or putting them in the wrong locations. If we notice an error in a RAP, we can fix it and, if desired, relatively easily recreate the historical data by rerunning the batches.
Interested and want to find out more? The Government Analysis Function has some excellent guidance on RAPs. We’re also hosting an Open Data Saves Lives Unconference on the topic of RAPs next week.
¹ sic, i.e. this error intentionally left in