Reproducible Analytical Pipelines (RAP) case studies

Reproducible Analytical Pipelines (RAP) use software engineering best practice to deliver efficient, high quality and transparent analysis.

Developing RAPs can be difficult for teams. So we have shared some case studies on this page to demonstrate how teams have overcome challenges to produce RAPs.

If you think you have a story that would make a good case study, please contact us.

Supporting the coronavirus press briefings

A team of analysts from the Office for National Statistics supported the Cabinet Office and Number 10 by supplying slides, data and notes for speakers at the public press briefings on coronavirus (COVID-19) held throughout the pandemic.

From April to July 2020 the press briefings were daily. The situation on the ground and the resulting policy response changed rapidly. As a result, the main messages for the public and the information needed to support them were highly variable.

Content and information requests were subject to change at short notice. Data feeds for the press briefings were entirely manual at first, came in from more than 20 different providers and formats, and content changed frequently as it evolved to meet the emergency. The time available to prepare for the briefings was short: a window of about six hours each day to set up and quality assure the data; prepare the slides, data and notes; get sign-off from all of the data providers, ministerial private offices and senior leaders, including the Chief Medical Officer and Chief Scientific Adviser; and send content through for broadcast and to presenters.

The analysts had two main aims: to ingest, process and quality assure the input data needed for slide production, and to design, produce, assure and sign off the slides, data pack and speaker notes for the briefings against a very tight deadline. The analysis team had a mix of advanced skills, including coding, data visualisation and statistical production, but these skills were unevenly distributed.

Initially, we focused on end-to-end automation of the workflow to produce charts for the briefings, but quickly realised that this was the wrong solution. The charts, data pack and speaker notes were subject to frequent last-minute change requests, which made automation of this final step inefficient and impractical. What the briefing team needed instead was clean, standardised data templates in Excel that they could use to quickly generate and tweak the charts and data for the slides.

We divided the work of the team into slide production and automatic data generation, with the relevant specialists on each side. For automation, we prioritised the slides and the manual workflows that took the most time, such as complex multiple charts on a single page, calculation of moving averages or reformatting of inputs.

Using Reproducible Analytical Pipelines (RAP) principles, we built open-source, modular pipelines to ingest and process more than 20 different data feeds from across government and beyond, and to generate Excel workbooks with standard tabs, formatting and graphs to enable rapid slide production. Non-government data included international data from Johns Hopkins University and the European Union, and mobility analysis from Google and Apple. We used R to generate the most complex small multiple charts and maps, which would have taken a lot of time to produce manually. We built the pipelines in R, with version control in Git and unit testing of the bespoke functions and classes that we wrote.
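The team's pipelines were written in R; as a flavour of what unit testing a bespoke transformation looks like, here is a minimal sketch in Python (the language used elsewhere in these case studies). The function, data and test are hypothetical illustrations, not the team's actual code:

```python
import pandas as pd

def seven_day_average(series: pd.Series) -> pd.Series:
    """Rolling 7-day mean of a daily series, the kind of derived
    figure calculated for the briefing charts."""
    return series.rolling(window=7).mean()

def test_seven_day_average():
    daily = pd.Series([7.0] * 14)           # constant daily input...
    result = seven_day_average(daily)
    assert result.iloc[6:].eq(7.0).all()    # ...gives a constant average
    assert result.iloc[:6].isna().all()     # no full 7-day window for the first 6 days
```

A test like this runs automatically on every code change, so a broken transformation is caught before it ever reaches a chart.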

Using RAP automation meant that we were able to adapt rapidly. It increased the team's ability to meet very challenging deadlines and time-critical requests for new outputs. RAP removed the burden of time-consuming manual processing, enabling the analysts to concentrate on responding to changing demands. It enabled us to re-use modular code and quickly adapt it to new data requirements. It also gave us confidence that the data we presented were properly quality assured, through the automated validation and visualisation checks that we built into the pipelines.

This was our first experience of using RAP principles in an ad hoc environment with enormous time pressures. Using RAP improved our ability to respond quickly and with confidence, meant that everybody in the team understood the processes, and helped us to onboard new team members very quickly because documentation was built into the workflows.

Transforming the COVID Infection Survey data processing pipeline

In May 2020, the Office for National Statistics and academic partners launched the COVID Infection Survey (CIS), which estimates the prevalence of COVID-19 in the UK. The partners built a pipeline within the Secure Research Service (SRS) to process and analyse the survey data, using Stata.

This pipeline produced results, but there were concerns about its scalability, its sustainability and the quality of its outputs. The pipeline required a lot of manual intervention, and analysts worked overtime. A team was assembled to develop a new data processing pipeline, which we wrote using Python and PySpark in the ONS secure Data Access Platform (DAP).

The original scripts were long and disorganised; the new pipeline is modular. We wrote functions and grouped related code into modules. Users specify which pipeline stages to run using a config file, which also contains the names and locations of the data. During development, we used dummy versions of the input data.
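As a flavour of this config-driven design, here is a minimal sketch. The stage names, config keys and JSON format are hypothetical; the real pipeline uses PySpark and its own configuration layout:

```python
import json
from pathlib import Path

# Each stage is a plain function that takes the config and does its work.
def ingest(config): ...
def process(config): ...
def export(config): ...

STAGES = {"ingest": ingest, "process": process, "export": export}

def run_pipeline(config_path: str) -> None:
    """Run only the stages switched on in the config file."""
    config = json.loads(Path(config_path).read_text())
    for name, stage in STAGES.items():
        if config["stages"].get(name, False):
            stage(config)

# config.json might look like:
# {"stages": {"ingest": true, "process": true, "export": false},
#  "input_path": "dummy_data/visits.csv", "output_path": "out/"}
```

Keeping the data names and locations in the config means the same code runs unchanged against dummy data in development and real data in production.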

Using Git improved our collaboration. For the first time, we used an open GitHub repository for code to be run on the ONS Data Access Platform. The team agreed on standards for writing code, and we automatically enforced these standards and checked for secrets before uploading code to GitHub. We also checked code quality by writing unit and regression tests. As well as this automated quality assurance, at least one team member reviewed each code change, and we presented regular updates to experts and stakeholders.
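A regression test in this setting pins the pipeline's behaviour on a fixed dummy input to a stored known-good output, so an accidental change in behaviour fails the test suite. A minimal pytest-style sketch, in which the `process` stage, its module and the file paths are hypothetical:

```python
import pandas as pd
from pipeline.stages import process  # hypothetical module under test

def test_process_matches_known_good_output():
    dummy_input = pd.read_csv("tests/data/dummy_input.csv")
    expected = pd.read_csv("tests/data/expected_output.csv")
    result = process(dummy_input)
    # Fails loudly, with a readable diff, if any value drifts.
    pd.testing.assert_frame_equal(result, expected)
```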

The team comprised analysts, data engineers, data scientists and Python developers, many of whom were seconded part time from other parts of the office. It was important that we worked well together and shared our different skill sets. We set up a weekly paired programming day and encouraged collaboration during the rest of the week. This group working helped to keep part-time members of the team up to date and to focus their time on the project.

The team worked hard to develop the new pipeline; however, we faced several obstacles. There were often quality issues with the data we received. Our aim was to push data quality issues to the source, rather than build around them, but this was a challenge.

Another challenge was communication between and within teams. Initially, team members and stakeholders were often not included in the decision-making process. One example of this was when several months were spent designing a database for the pipeline; when the designs were complete and development was about to start, we discovered that a suitable database was not available within the secure environment.

A better approach to creating the pipeline would have been to work in an Agile way from the beginning. The team initially spent about six months gathering requirements. However, because of the changing nature of the pandemic, these requirements were often out of date by the time we started developing, and several deadlines were missed because of the delays this caused. It would have been better to develop a minimal pipeline at pace, then add further requirements incrementally. In the end we adopted this approach. Working in focussed sprints led to better coherence between code from the various developers in the team. It also meant that we always had a working product to demonstrate our progress and could add value from early in the project.

Gathering requirements was difficult because we were deriving them from the existing code, which had been written under immense pressure by external academics, and it was often unclear what the code was doing. Ideally, we would have gathered requirements from the users of our data, but those teams were under great pressure to deliver their own outputs. This siloed working was evident elsewhere: when other teams provided methods, shared understanding and communication were poor. In general, it would have been very beneficial for us to take a DevOps approach, working more closely with the teams facilitating and operating the pipeline.

Despite these difficulties, the team worked hard to improve communications. As time went on, the team became more self-managing, with the responsibility for allocating work moving from management to the team members. There was also more cross-functional working, where originally analysts and developers had been separated. We got better at communicating with our users and stakeholders to refine the pipeline requirements. This simplified the work, and meant we could offer the users an improvement rather than exactly the same product.

Automated execution of the new pipeline has removed manual intervention from those parts of the weekly data processing and has allowed us to run them outside office hours. So far, this saves around four hours per run, meaning that the downstream processing can be completed four hours earlier.

RAP at the Health and Safety Executive (HSE)

Despite a challenging IT environment and limited access to software, our RAP champion and a small support team have developed publication ready material that has saved considerable staff time and limited errors.

I cannot wait to get a wider deployment of software so we can develop a full RAP from data ingest to final report for many of our statistical products. This will save large amounts of statisticians’ time. We will be able to invest this time in statistical development and support projects that we have been unable to progress whilst producing statistical reports without use of RAP.

Dr Simon Clarke, Chief Statistician at the Health and Safety Executive (HSE)

Analysts at the Health and Safety Executive (HSE) have successfully started using RAP practices despite significant technical barriers.

The existing IT infrastructure made coding and the development of RAP challenging, as the production environment was not designed to meet different professional needs. The team had no access to common coding languages or tools: analysts in the statistics team only had access to programming languages through shared off-network laptops, and it was not possible to update software, install packages or communicate directly with colleagues on these machines.

A data scientist was able to build a prototype RAP using openly available data in their home computer environment. Using this RAP they were able to recreate existing publications in a more automated way using RMarkdown. While they were only able to produce existing publications in this way, these prototypes functioned as a proof of concept, showing how RMarkdown can be used to automate creation of publications. Within 5 months of this work starting, they were able to create similar templates for 15 different publications.

The analysis team were then able to use this proof of concept to convince colleagues to start using RAP for regular publications. Since then, the analysis sections of around 75% of the statistics team's publications have been automated, with development on the others starting soon.

This progress has helped the team make the business case for getting more coding tools installed on their work computers. This will help reduce reliance on off-network machines. By demonstrating the benefits of RAP the team also has the necessary support to start automating data pre-processing. RAP is currently saving approximately 20 hours of analysts’ work per document, per year. Furthermore, efficiency savings are expected to increase when more of the pre-processing is automated.

The success of this work depended on demonstrating the value of RAP early on and throughout development, through presentations, seminars and demonstrations. By concentrating on the parts of the process that were easy to automate using the current IT infrastructure, the analysts could quickly make enough progress to get support for future development.

The analysts are now looking to get support for automated deployment of pipelines. This will enable them to use RAP as the default approach for new and ad-hoc analyses.

The Coronavirus (COVID-19) in the UK Dashboard

The Coronavirus (COVID-19) in the UK Dashboard is the official UK government website for data on COVID-19. It is a single source of essential data and statistics, providing weekly updates to public and professional users on testing, cases, deaths, vaccinations and healthcare.

The task of getting the data onto the dashboard is hugely complex. We receive multiple inputs from over 20 sources. The process starts with the first data ingest at around 8:00am, which automatically triggers the next stage in the pipeline. In total, we create over 200 different metrics and publish to a 4:00pm deadline. The dashboard is currently updated each Thursday, but at the peak of the pandemic we published data 7 days a week. Reproducible Analytical Pipelines (RAPs) were, and are, essential to process these high volumes of data, and having an enterprise data platform was also important for this work.

As a public-facing source of information, the dashboard is open to a high level of scrutiny. A team of analysts quality assure the incoming data. Throughout the day, they check the progress of the pipeline and troubleshoot any issues. They also quality assure the output, with the help of automatically generated reports.
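The dashboard's own checks are not published here, but a minimal sketch of the kind of automated input validation that can feed a generated QA report might look like the following. The feed, column names and checks are hypothetical:

```python
import pandas as pd

def validate_feed(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable issues for a QA report."""
    issues = []
    if df["date"].duplicated().any():
        issues.append("duplicate dates in feed")
    if (df["new_cases"] < 0).any():
        issues.append("negative case counts")
    if df["new_cases"].isna().any():
        issues.append("missing case counts")
    return issues

feed = pd.read_csv("incoming/cases.csv", parse_dates=["date"])
for issue in validate_feed(feed):
    print(f"QA WARNING: {issue}")
```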

We built the dashboard pipeline on the NHS Foundry platform, with separate areas for development and production. We were able to see a diagram of the pipeline, known as a directed acyclic graph (DAG); this single view improved the overall end-to-end visibility. Our data pipelines are also open to colleagues in the health system. For example, we made the case numbers data pipeline available to NHS colleagues, who made their healthcare data pipelines available to us.
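A DAG simply records which stages depend on which, so a platform can draw the whole pipeline and run stages in a valid order. A minimal sketch using Python's standard library, with hypothetical stage names:

```python
from graphlib import TopologicalSorter

# Maps each stage to the set of stages it depends on.
PIPELINE = {
    "ingest_cases": set(),
    "ingest_tests": set(),
    "derive_metrics": {"ingest_cases", "ingest_tests"},
    "qa_report": {"derive_metrics"},
    "publish": {"qa_report"},
}

# static_order() raises CycleError if the graph has a cycle; otherwise
# it yields the stages in an order that respects every dependency.
for stage in TopologicalSorter(PIPELINE).static_order():
    print("running", stage)
```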

The success of the dashboard is the result of several teams working together. These teams consist of analysts, data engineers, developers, user researchers, and content designers. The strong relationship between analysis, development and operations roles enables analysts to highlight requirements. Solutions can then be built, tested, and rolled out rapidly.

We recommend using RAP. It has enabled us to be far more productive by concentrating on the most important outputs. We have been able to work together on a single codebase, enabling end-to-end visibility for the entire team. The RAP approach has helped us to produce reliable data pipelines and sleep well at night!

There are several things to consider when starting a RAP project. We found that:

  • team capability is very important – skilled people are needed for all stages of the process
  • good communication in the team and with data owners is essential
  • if you change one part of the flow it can have unintended consequences later — for example, unexpected changes to input data risk breaking the pipeline, but communication can mitigate that risk
  • it takes time to incorporate fundamental changes or new reporting processes — make sure you plan your work accordingly

You can find more insights in the dashboard’s user engagement case study. If you have any feedback, you can email us at Coronavirus-Tracker@ukhsa.gov.uk.

Supporting the COVID-19 Infection Survey analysis (CISA) team

The Analysis Function supported the COVID-19 Infection Survey analysis (CISA) development team. Our support emphasised the benefits of working together and learning by doing, rather than formal courses and classroom teaching. We worked with the CISA team to help analysts upskill and develop multiple Reproducible Analytical Pipelines (RAPs).

The COVID-19 Infection Survey was developed quickly and in a challenging development environment. Code often had to be produced to demanding deadlines, with resource and priorities changing rapidly. Although the analysis was already being done in code, the team were aware of its importance and invited us to help them improve their quality assurance.

Over time the analysis code had become more and more complex. It had also accumulated substantial “technical debt”: improvement work that had been put off to the future in order to produce analysis on time.

Many existing processes were not fully automated. The code was repetitive and not modular or parametrised. This meant that it had to be changed regularly and run one section at a time. Because of this:

  • the system took a long time to run
  • the analysis needed careful manual quality assurance
  • the analysis was difficult to review

Requirements frequently changed, but the code base had become difficult to work with and change.
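To illustrate what “modular and parametrised” buys you, here is a hedged before-and-after sketch. The column names, files and population figures are invented for illustration, not taken from the survey:

```python
import pandas as pd

# Before: the same block copied for every region and edited by hand each run.
# england = pd.read_csv("england.csv")
# england["rate_per_100k"] = england["positive_tests"] / 56_000_000 * 100_000
# wales = pd.read_csv("wales.csv")
# wales["rate_per_100k"] = wales["positive_tests"] / 3_100_000 * 100_000

# After: one parametrised function that covers every case.
def positivity_rate(path: str, population: int) -> pd.DataFrame:
    """Load one region's results and add a positivity rate per 100,000."""
    df = pd.read_csv(path)
    df["rate_per_100k"] = df["positive_tests"] / population * 100_000
    return df

regions = {"england.csv": 56_000_000, "wales.csv": 3_100_000}
results = [positivity_rate(path, pop) for path, pop in regions.items()]
```

A change to the calculation now happens in one place, the whole run can be executed end to end, and the function is small enough to unit test.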

The team had good programming capability, but RAP development was inhibited because:

  • there were not enough people with the right skills and time to work on the project
  • there was pressure to produce regular outputs
  • there were rapidly changing priorities
  • there was an initial time cost of upskilling in RAP good practices
  • analysts were working in an unsuitable development environment

How the pipelines were improved

Our work with the development team concentrated on the main pipeline. With additional resource, and with support in using Git, it was possible to ensure that a minimum of two people were always working on the pipeline. This meant that:

  • the team could do more frequent reviews of the system
  • more problems could be fixed at source – this helped to reduce technical debt
  • the team were able to keep a comprehensive audit trail – this included details about who made which changes, when and why
  • the team could backtrack to older versions of the code when needed

The initial time invested in learning to use Git paid off immediately by making code development faster and safer.

Having more analysts working on development also allowed the use of “paired programming”. This is where one person writes code and shares their screen, while the other offers review and guidance. Paired programming made it easier to onboard analysts and avoid single points of failure.

A shortage of coding resource was an important barrier to RAP, so we took on some of the coding work. This gave the development team time to upskill without having to sacrifice business-as-usual or code development work. The combined team had the necessary domain knowledge and statistical experience, as well as RAP development experience, which meant we could create a better pipeline as a team than any one member could create on their own.

Alongside the new pipeline came new ways of working. The development team created a “definition of done” to clearly define when a piece of work could be considered complete, and ensured that time estimates included the work needed to meet these standards. Code that met the standards was quality assured through documentation, review and automated testing.

Finally, the team designed functionality that was adaptable and could be extended in future to meet changing requirements. This means the code will not need to be re-written as often in the long term. The pressure to produce results has never gone away, but the high-quality code base has meant less time needs to be spent dealing with technical debt. Analysts now have more time to concentrate on meeting user needs.

As part of the work, the pipeline was migrated to a new cloud platform, which helped remove many of the technical issues the development team previously had to deal with. This was a significant effort that required a multi-disciplinary team and a code base that was self-contained, modular and adaptable. The RAP transformation not only made for a more efficient and robust pipeline, but also made it possible to use more suitable technologies.

The CISA team produced three mature RAPs within six months while we were working with them. These pipelines now run at least monthly, with some running multiple times per week, saving thousands of hours of analyst time per year. The pipelines are rigorously version controlled, tested, and documented. The analysts we worked with have since trained others in their division to follow these good practices.

The outcome

The new pipelines are faster and demand less analyst time. The team are now using a new cloud platform, which has also helped to reduce runtimes to less than a quarter of what they were before. The move to the new cloud platform would not have been possible without first moving to better pipelines.

Code development is more productive because:

  • colleagues can work together more effectively
  • the team are following better working practices
  • the code base is easier to review

Many of the practices that improved reproducibility had the added benefit of making development faster, easier and less prone to error. Increased efficiency led to increased quality assurance, as analysts spent more time writing automated tests rather than manually checking intermediate outputs. As there are multiple contributors to each pipeline, the pipelines now carry fewer risks thanks to:

  • better business continuity and knowledge management
  • code that is better reviewed
  • better quality code due to increased knowledge-sharing

RAP development has a high upfront cost in terms of resource. Without committing enough time to RAP development there is a risk that the project fails, wasting that resource. But with the right resource commitment and a multi-disciplinary team, the RAP transformation in CISA was a success.

The COVID-19 Infection Survey has provided high-quality information to aid the government’s response to the pandemic. Our support, along with the enthusiasm and skill of the great analysts in the CISA team, enabled the survey to become more efficient, higher quality, and more responsive.

Other sources of case studies

Other sources of case studies include:

Original blog posts

There are several blog posts that share examples of RAP principles being used across government, including: