Open sourcing analytical code

Policy details

Metadata item Details
Publication date: 15 March 2023
Owner: Analytical Standards and Pipelines team at the Office for National Statistics (ONS)
Who this is for: Government analysts

What we mean by open source code

Open source code is code that is made freely available for users to use, modify, and share. Users and other programmers are able to contribute to open source projects. Analysts may choose to work in the open, where work is visible even before it is complete.

Who this guidance is for

This guidance is for anyone working on analysis code in government, whether they are currently open sourcing their code or not. In particular, the guidance should be read by managers who own analysis pipelines and need to decide whether to open source analytical code.

What this guidance is for

This guidance is intended to help analysts evaluate how they could benefit from open sourcing their code. It will explain the risks and offer guidance about how to open source code safely.

About the authors

This guidance was written on behalf of the Analysis Function by the Analysis Standards and Pipelines team at the Office for National Statistics (ONS). We work to support Reproducible Analytical Pipelines (RAP) development across government.

Why we should make analytical code open source

The Central Digital and Data Office (CDDO) recommends open sourcing code whenever possible and making code open from the start of the project. CDDO also provide guidance on how to decide when open sourcing is not appropriate.

Open sourcing code is also an important part of RAP. You can find out more about open sourcing code and RAP on GitHub.

Open your code to increase trust in your analysis

Open sourcing your analysis code gives users and citizens the assurance that it can be reviewed by anyone. When you use code to carry out analysis, it is possible to see exactly how it was done and reproduce your findings. But your users will only have access to this information if you open your code to the public.

As taxpayers fund this work, it must be transparent and open. The CDDO service standard states:

“Public services are built with public money. So, unless there’s a good reason not to do so, the code they’re based on should be made available for people to reuse and build on.”

Use open source code to encourage colleagues to work together

Open sourcing has great benefits for your users. But it will also benefit your team. Opening your code allows other analysts and users to review it, which will help you improve your product.

It is much easier to share open source code with colleagues in different teams and departments. This makes it much easier to ask colleagues to review your work and support you with it. You do not have to accept contributions from others when your code is open source. You should accept feedback on your code, but it is fine to state that you are not looking for others to directly contribute to your open code.

Open sourcing code improves quality

Open code needs to be simple enough for others to understand. Because of this, working in the open will encourage your team to maintain:

  • a high standard of documentation
  • readable code
  • a good audit trail

Writing code that anyone can understand means it is easy to introduce new people to the project and improves business continuity.

Help others benefit from your work

Analysts across government often solve the same problems. For example, they create similar visualisations, apply the same statistical methods, or handle similar data. Sharing your code helps others avoid duplicating work that has already been done and encourages them to return the favour.

Deciding when to open source code

Open sourcing code is easy, but it should be done with care. There are free tools available to share code openly. For example, GitLab, GitHub and Bitbucket are platforms that have version control and additional software management features. All these tools use Git, which is version control software. Version controlling analytical code using Git is a core requirement for RAP, so Git should be available in your department.

While this makes opening your code very easy to do, you should not do it lightly. There are security and disclosure risks associated with making code publicly available, so it is important that you take great care with this.

Consider whether it is appropriate to open source your code

Analysis code should be open by default. But there are times where all or part of your code should not be open. For example, this includes code involved in:

  • counter fraud – this code is used to assess and flag potential criminal activity
  • statistical disclosure – this code automates statistical disclosure control, and could be used to identify individuals in datasets
  • work that is sensitive in nature – for example, if the analysis refers to unreleased policy

If your code cannot be open sourced in its current form, consider how you may be able to change it so that you can open it. You should also consider your organisation’s security policy. Your organisation may have its own rules on whether, and how, people can open source their code.

Know your code base

You should manage the risk of disclosure from open source code like any other disclosure risk. If you manage other analysts, you may not write the code yourself. But you should still have a good understanding of how the data and the code are structured so that you can understand where the disclosure risks are.

Make yourself aware of mitigations for disclosure risks and where to find guidance and advice. If you are a manager, you must make sure everyone knows what is expected of them regarding disclosure risks and what the relevant policies in your department are.

Ensure those around you know how to work in the open

Everyone working on open source code should have the knowledge to avoid basic errors, like publishing raw data or sensitive information. If information is shared accidentally, they should know how to:

  • remove information from the source code platform
  • report the error to their data protection officer

The Duck Book provides detailed guidance on using Git safely and training for Git.

Managers should be aware of good code practices. They should also ensure their team have the time and support to learn the necessary coding skills and practices.

Consider different ways of open sourcing code

The decision about when and how to open source code should be based on public value, user needs, risk management, and technical constraints. You should review your approach as circumstances change. For example, you should review your approach if the skill level of your team changes.

Coding fully in the open maximises the benefits of open source, but it also carries more risk. It may be suitable if your team already has the right technical skills and awareness of disclosure risk. Working in the open may not be feasible right away if:

  • some team members are too inexperienced to safely use Git
  • the code can only be shared safely at certain times, meaning the team must keep it closed until the release day
  • the project will be safe to share in future, but the team is not confident about sharing the entire version history

Alternatively, you can choose to maintain two repositories, with one private and one public. In this case, you can release new versions of the public code at important times, for example with each publication cycle. This allows more control over what is in the public repository. But this can also create more work than working fully in the open. It will also mean the public will not have access to a full version history and audit trail, which limits transparency.

Lastly, you may choose to keep your repository private until the team is ready to start working in the open. But this is likely to create more work before you can open the repository to the public. For example, you may need to write new documentation and check the entire version history. If you start the project in the open you can encourage these good practices to be followed from the beginning.

Good open-source coding practices

Use open source software

If users need to buy expensive software licenses like SAS, SPSS, or Stata to run your code then it is not truly accessible. Using open source tools such as R and Python means that when we open source our code, anyone can pick it up and run it. It also means we can re-use other people’s open source code in our own pipelines.

Use an appropriate license

An open source license makes it clear to your users what they are allowed to do with your code. In most cases, this will be a Massachusetts Institute of Technology (MIT) license. You can see an example of an open source license in practice in the Government Digital Service (GDS) Way. This license allows your users to use your code as they wish. But it also makes it clear that the code comes without warranty. The license states that the copyright notice should be included if the code is copied and used elsewhere.

Design your code to be easy to open source

Even if you are unsure when or whether you will be open sourcing your code, you can save a lot of additional work by designing code that can be made open in future. Usually this means using general good coding practices, such as using version control software and writing comprehensive documentation.

When working towards open sourcing code, teams must implement practices to avoid disclosure. We recommend that you do this routinely for all your coding projects. Embed good practices from the start and take the time to remove disclosive files right away. When opening a previously closed repository, the entire version history must be safe to share, not just the latest version.
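As a rough illustration of checking the whole history rather than only the latest version, the Python sketch below lists every file path that any commit has ever touched. It assumes Git is installed and that you run it against a local clone. The function names are hypothetical and not part of any existing tool.

```python
import subprocess


def paths_in_history(git_log_output: str) -> list[str]:
    """Return sorted, de-duplicated paths from `git log --name-only` output."""
    return sorted(
        {line.strip() for line in git_log_output.splitlines() if line.strip()}
    )


def list_all_paths_ever_committed(repo_dir: str) -> list[str]:
    """List every path touched by any commit on any branch (hypothetical helper)."""
    result = subprocess.run(
        # An empty pretty format leaves only the file names in the output.
        ["git", "log", "--all", "--pretty=format:", "--name-only"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return paths_in_history(result.stdout)
```

A list like this only tells you which files have existed in the history. Someone still needs to review the contents of anything that looks like data, outputs or credentials, and Git may quote file names that contain unusual characters.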

Reduce risk by structuring projects the right way

You should structure your projects to lower the risk of disclosing sensitive information like raw data and passwords. Most of the time you can achieve this by placing those files outside of the repository so they will not end up in the version history. To be safe, avoid saving the following in the same folder as your repository:

  • raw data
  • outputs
  • configuration files
  • credentials such as application programming interface (API) keys and passwords
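One common way to keep credentials out of the repository is to read them from environment variables when the code runs. The Python sketch below is a minimal example of that approach. The function name and the variable name `EXAMPLE_API_KEY` are made up for illustration, and your department may prefer a different secrets mechanism.

```python
import os


def get_credential(name: str) -> str:
    """Read a credential from an environment variable, failing loudly if absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Set it outside the repository instead of hard-coding the value."
        )
    return value


# Example use (hypothetical variable name):
# api_key = get_credential("EXAMPLE_API_KEY")
```

Because the value never appears in a script, it cannot end up in the version history by accident.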

Different tools and software have their own pitfalls. If you are using Jupyter notebooks, be aware that notebooks store outputs as well as code. If you are working in RStudio, you should disable automatic saving of data and history. These can contain data and other sensitive material.
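To illustrate the Jupyter point, the Python sketch below removes stored outputs from a notebook before it is committed. It assumes the notebook uses the version 4 JSON format, where each code cell has `outputs` and `execution_count` fields. The function names are hypothetical, and established tools such as nbstripout do the same job more thoroughly.

```python
import json
from pathlib import Path


def strip_notebook_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dictionary with code-cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def strip_notebook_file(path: Path) -> None:
    """Rewrite a .ipynb file in place without its stored outputs."""
    notebook = json.loads(path.read_text(encoding="utf-8"))
    cleaned = strip_notebook_outputs(notebook)
    path.write_text(json.dumps(cleaned, indent=1) + "\n", encoding="utf-8")
```

This only clears outputs. Data pasted into code or markdown cells would still need to be checked by a person.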

It can be difficult to manually check your repository for all these files. You can use a .gitignore file to exclude certain types of files from the version history. Other tools, such as govcookiecutter, are available to help you add additional layers of assurance.
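Part of this check can be automated. The Python sketch below walks a project folder and flags files whose names suggest data or credentials. The lists of suffixes and file names are illustrative assumptions, not an official or complete list, so adapt them to your own project.

```python
from pathlib import Path

# Illustrative assumptions only: extend these for your own project.
RISKY_SUFFIXES = {".csv", ".xlsx", ".sav", ".dta", ".sas7bdat", ".rds", ".pem"}
RISKY_NAMES = {".env", ".Rhistory", ".RData"}


def is_risky(relative_path: Path) -> bool:
    """Flag a path whose name or suffix suggests data or credentials."""
    return (
        relative_path.name in RISKY_NAMES
        or relative_path.suffix.lower() in RISKY_SUFFIXES
    )


def find_risky_files(repo_root: Path) -> list[Path]:
    """Return risky-looking files under repo_root, ignoring Git's own folder."""
    flagged = []
    for path in sorted(repo_root.rglob("*")):
        relative = path.relative_to(repo_root)
        if ".git" in relative.parts:
            continue
        if path.is_file() and is_risky(relative):
            flagged.append(relative)
    return flagged
```

A scan like this is a safety net, not a guarantee. It cannot spot sensitive values written inside an otherwise ordinary script.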

Managers and technical leads should ensure everyone working on the project has a good understanding of which files should be included in the repository and which should not. While tools exist to help you, ultimately it is your team’s responsibility to ensure code is shared safely and responsibly.

Follow good practice guidelines for code

Good code does not contain disclosive information or security risks. For example, data and file locations should not be hard-coded into scripts. Well-written code is easier to check and review, which reduces the risk of disclosive information ending up in the code without someone realising.
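One way to avoid hard-coded file locations is to keep them in a configuration file stored outside the repository and tell the code where that file is at run time. The Python sketch below assumes a JSON file with a `paths` section and an environment variable called `PIPELINE_CONFIG`. Both names are hypothetical.

```python
import json
import os
from pathlib import Path


def paths_from_settings(settings: dict) -> dict[str, Path]:
    """Convert the 'paths' section of a settings dictionary into Path objects."""
    return {name: Path(location) for name, location in settings["paths"].items()}


def load_paths(config_env_var: str = "PIPELINE_CONFIG") -> dict[str, Path]:
    """Load file locations from a JSON config file kept outside the repository."""
    config_file = os.environ.get(config_env_var)
    if not config_file:
        raise RuntimeError(
            f"Set {config_env_var} to the location of your configuration file."
        )
    with open(config_file, encoding="utf-8") as handle:
        return paths_from_settings(json.load(handle))


# Example use, assuming the config contains {"paths": {"raw_data": "..."}}:
# raw_data_path = load_paths()["raw_data"]
```

The scripts then refer to names such as `raw_data`, so the real locations never appear in the published code.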

Making the most of open source code

Share your work

You will get more value out of your open source code if more people know about it. More users of your code means more people to review your work, suggest improvements, and even contribute to your projects.

The Government data science Slack workspace is a great forum for sharing our projects. You can also use internal and external blogs to promote open source work. You can create GitHub Organisations for your team or department, so users can find related open source repositories more easily. If you are publishing your analysis, make sure you include a link to your open source code in the publication.

Document your work

Documenting your code base and project makes your repository easier for other people, and new team members, to understand. It can help you get more outside reviewers and contributors, which means more people can review the code and suggest improvements. It also makes it much more likely that other government analysts will be able to re-use your code.

A lot of open source projects contain contributing guidance so that everyone knows what rules to follow when adding to the code base. This is useful if you want outside contributors to actively support the project. While this might not apply to some analytical pipelines, this will be helpful if you are designing code that others can re-use. The Turing Way provides a great example of contributing guidance.

Generalise your code

A lot of analytical pipelines contain code that others would like to use. Often, the problem is that the code is designed to solve a specific problem or work with a specific dataset.

Consider how other analysts might be able to repurpose your code when you are writing it. When you are writing functions, try to make them as generic as possible. Your team and others will get more value out of your code if you package up useful functions for re-use. Generalised code is easier to understand and re-use internally. The gptables and a11ytables packages, which simplify producing accessible statistical outputs, are good examples of this.
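As a small illustration of making a function generic, the Python sketch below calculates a mean for each group without assuming any particular dataset. The column names are passed in as arguments instead of being written into the function. The function name and the example columns are invented for this sketch.

```python
def mean_by_group(records: list[dict], group_key: str, value_key: str) -> dict:
    """Return the mean of value_key for each distinct value of group_key."""
    totals: dict = {}
    counts: dict = {}
    for record in records:
        group = record[group_key]
        totals[group] = totals.get(group, 0) + record[value_key]
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# The same function works for any grouping and value columns:
pay = [
    {"region": "North", "weekly_pay": 500},
    {"region": "North", "weekly_pay": 700},
    {"region": "South", "weekly_pay": 650},
]
mean_pay = mean_by_group(pay, group_key="region", value_key="weekly_pay")
```

A version with `"region"` and `"weekly_pay"` written into the function body would do the same job for one pipeline, but nobody else could re-use it without editing it.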

Include dummy data

For analytical pipelines, sharing the code may not be enough. If your users do not have access to the real data, they will find it difficult to check whether your code is fit for purpose. Include example or dummy data with your code so that users have a dataset to run the pipeline with.

Dummy data is fictional, but it has the same column headings and types as real data. It can be used as a placeholder for the real data during development. This means it is easier to keep your data separate from the code base. While dummy data makes open sourcing easier, it is also helpful for testing purposes. Dummy data can be tailored to allow you to test how your code behaves when given realistic data with specific attributes.
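The Python sketch below shows one way to generate dummy data with fixed column names and types. The columns, value ranges and region names are invented for illustration and do not describe any real dataset. A seed is used so that the same dummy data can be recreated for tests.

```python
import random
from datetime import date, timedelta

REGIONS = ["North", "South", "East", "West"]  # invented categories


def make_dummy_data(n_rows: int, seed: int = 0) -> list[dict]:
    """Create fictional rows with fixed column names and types."""
    rng = random.Random(seed)  # a seeded generator keeps the output reproducible
    rows = []
    for i in range(n_rows):
        rows.append(
            {
                "person_id": i + 1,
                "region": rng.choice(REGIONS),
                "age": rng.randint(16, 90),
                "weekly_pay": round(rng.uniform(100, 1500), 2),
                "survey_date": date(2023, 1, 1) + timedelta(days=rng.randint(0, 364)),
            }
        )
    return rows
```

To test a specific case, such as missing values or extreme ages, you could add rows by hand to the generated list.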
