RAP: it’s all about the code

Jade Lester

What is RAP?

RAP stands for Reproducible Analytical Pipeline. A pipeline is a coding project that takes data as an input, transforms or cleans it, and outputs the result. While that might describe almost every coding project, the ‘reproducible’ part is what separates a RAP from general analysis or processing.

The Reproducible Analytical Pipeline is designed to be run and re-run numerous times and produce a consistent output. A consistent output is easier to quality assure, as you can show (using the output) that the pipeline repeatedly does what you said it does.
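To make the idea concrete, here is a minimal sketch of a pipeline in Python. Everything in it is an assumption made for illustration: the data is made up, and the function names (load_rows, clean_rows, run_pipeline) are hypothetical rather than part of any RAP tooling. The point is only that the same input, run through the same steps, gives the same output every time.

```python
import csv
import io

# Made-up input data, used only for illustration.
RAW_CSV = """county,population
County A, 1200
county b,850
County C,
"""


def load_rows(csv_text):
    """Read CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def clean_rows(rows):
    """Tidy county names, convert populations to integers and
    drop rows where the population is missing."""
    cleaned = []
    for row in rows:
        population = row["population"].strip()
        if not population:
            continue  # skip rows with no population value
        cleaned.append(
            {
                "county": row["county"].strip().title(),
                "population": int(population),
            }
        )
    return cleaned


def run_pipeline(csv_text):
    """Input, transform, output: the whole pipeline in one call."""
    return clean_rows(load_rows(csv_text))


# Running the pipeline twice on the same input gives identical output.
first_run = run_pipeline(RAW_CSV)
second_run = run_pipeline(RAW_CSV)
assert first_run == second_run
```

Because the steps do not depend on anything outside the input (no manual edits, no hidden state), the output can be compared between runs as a simple check.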

A tool for reliability

You may think you are already running a RAP, especially if you are processing data repeatedly. However, there are standards to meet which make pipelines more effective, less risky, and better quality; all things RAP!

Small datasets that rarely change or update can usually be cleaned and changed with simple processes. When we start dealing with large datasets, which are potentially linked with other datasets or span a period of time, the process can become more complex and unwieldy. The risk of errors creeping in increases each time the code is run. If we focus on the idea of reproducibility, we can design processes with the RAP standards in mind.

We are creating a tool: like any tool you expect to use repeatedly, it will need to be maintained for future uses – preferably with some nice instructions so you don’t hurt yourself when you use it.

Three standards for success

To make reproducible analytical pipelines which help your team, consider the RAP standards:

Readability

We all get caught in the act of creating and forget that other people may use our tools and might not understand the logic or the purpose.

An easy win is to give functions and objects clear names – df1 might not say much at all, but population_by_county tells an analyst a bit about the data structure, and even what it should be. Similarly, a ‘process_data’ function could mean a lot of things, whereas ‘filter_on_date’ is far clearer. This helps analysts better understand code at a glance, and can support easier maintenance.
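As a rough illustration of those names in use, the sketch below is written in Python with made-up records. The field names (county, recorded_on, population) and the way filter_on_date works are assumptions for the example, not a prescribed design.

```python
from datetime import date

# Made-up records, for illustration only. The name says what the data holds.
population_by_county = [
    {"county": "County A", "recorded_on": date(2023, 6, 30), "population": 1200},
    {"county": "County B", "recorded_on": date(2024, 6, 30), "population": 850},
]


def filter_on_date(records, start_date, end_date):
    """Keep records whose 'recorded_on' date falls within the range (inclusive)."""
    return [
        record
        for record in records
        if start_date <= record["recorded_on"] <= end_date
    ]


# The call reads almost like a sentence: filter population_by_county on date.
population_2024 = filter_on_date(
    population_by_county, date(2024, 1, 1), date(2024, 12, 31)
)
```

Compare that final call with something like df2 = process_data(df1): the logic could be identical, but a new reader would have to open the function to find out what it does.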

Modularity

What happens if you write a part that’s good – like really good? Surely you’ll want to use it again in this project or another one?

Using classes and functions makes the process much easier for multiple reasons. Firstly, it reduces the likelihood of errors being introduced from manually copying and pasting the code, and means updates only need to happen in one place. Secondly, it helps isolate bugs to that one function, making them easier to squish later! Using functions in particular also complements the next RAP standard, reliability.
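Here is a small sketch of that idea in Python. The function standardise_county_name and both datasets are hypothetical, invented for this example. The same cleaning step is written once and reused on two datasets, so a fix to the function reaches both.

```python
def standardise_county_name(name):
    """Trim surrounding whitespace, collapse repeated spaces and use title case."""
    return " ".join(name.split()).title()


# Two made-up datasets that record the same county in different ways.
population_rows = [{"county": "  county a ", "population": 1200}]
area_rows = [{"county": "COUNTY   A", "area_km2": 35}]

# The same function is reused for both, rather than copying and pasting the logic.
for row in population_rows + area_rows:
    row["county"] = standardise_county_name(row["county"])

# Both datasets now use the same spelling, so they can be linked on county.
assert population_rows[0]["county"] == area_rows[0]["county"]
```

If a bug turned up in how names are cleaned, there would be one function to look at and one place to fix it.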

Reliability

Utilising unit testing and custom dummy data means that we can test different parts of the code to make sure it’s all working as expected. Thinking about what we want our code to do in different scenarios also allows us to identify and work out how to handle ‘edge cases’. This could include missing or incomplete data, or an unexpected value. These tests help to ensure our code is robust and to minimise breakages.
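A minimal sketch of what this could look like in Python is below. The function rate_per_thousand is hypothetical, and returning None for values that cannot be calculated is just one possible way to handle edge cases; your project may prefer to raise an error instead. The tests use small dummy values rather than real data.

```python
def rate_per_thousand(count, population):
    """Return the count per 1,000 people, or None when it cannot be calculated."""
    if count is None or population is None or population <= 0:
        return None
    return count * 1000 / population


def test_typical_values():
    # Dummy data: 5 events in a population of 1,000 is a rate of 5 per 1,000.
    assert rate_per_thousand(5, 1000) == 5.0


def test_missing_count():
    # Edge case: missing data should not crash the pipeline.
    assert rate_per_thousand(None, 1000) is None


def test_missing_population():
    assert rate_per_thousand(5, None) is None


def test_zero_population():
    # Edge case: an unexpected value that would otherwise cause a division error.
    assert rate_per_thousand(5, 0) is None
```

Saved in a file with a name such as test_rates.py, functions named like this can be collected and run by a test runner such as pytest. Each test covers one scenario, so a failure points straight at the behaviour that broke.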

The many benefits of RAP

These features all interact with and support each other. A more modular code base is easier to test, as you are testing smaller sections of it, making it more reliable and more consistent. More reliable code is more likely to be used for a longer period of time, as it’s simpler to use, meaning more people will be able to use it.

Readability is important so everyone can understand what’s going on and not only run the code but adapt or debug it. If everyone understands it, it’s easier to know if a function or feature is too big and needs to be broken down into more pieces, which in turn makes the code more modular. This all contributes to code that’s easier to maintain and continue to improve on – it’s a positive feedback loop.

Ultimately, these standards mean you have reliable and effective code resulting in more accurate data outputs and more accurate analysis. The analysis is also more consistent, making it far easier to quality check and to ensure the code does exactly what it has been designed to do (further feeding into the Reliability standard).

This reliable and effective code can be used and reused for a longer period, getting the most use from your new tool. It’s also easily built on, creating more interesting and broader analysis and providing better insight for all users.

Applying RAP in your work

In case you still need convincing of the benefits of RAP, improving code quality was highly rated as a factor contributing to job satisfaction of developers in the Stack Overflow 2024 Developer Survey. So why not get started with RAP – it could be more enjoyable than you think.

Hopefully, it’s now clear how incorporating even a few RAP fundamentals can help with projects across several different measures. It doesn’t end here – there’s more to RAP than just coding. Our next blog, ‘How to RAP’, will explain ways of working in the code and how to set yourself up for success.

Reproducible Analytical Pipeline (RAP) Champion Network
The Reproducible Analytical Pipeline (RAP) Champion Network supports the implementation of reproducible analytical pipelines across government. You can learn more about the network and join as a champion or practitioner through their GSS webpages.