How do you solve a problem like missing data?
The Office for National Statistics (ONS) is exploring how administrative data sources can be used to produce estimates of the population in England and Wales.
A substantial body of work has shown the potential of administrative data for producing population estimates, as well as for estimating characteristics of the population and combining those characteristics.
A significant benefit of using administrative data is that the ONS may be able to use data that has already been collected, on a large scale, by other government departments and organisations. These include the National Health Service (NHS), His Majesty’s Revenue and Customs (HMRC), the Department for Work and Pensions (DWP), and many others.
There are legal gateways which can allow accredited and approved researchers to access administrative data for research and statistical purposes, and certain criteria must be met for this to happen. But one major challenge that the ONS is facing with administrative data is how to handle missing data in these data sources.
Missing data is not a problem unique to administrative data. There are already statistical methods in place to handle missing data in other data sources. But there are unique challenges with missing administrative data, which teams at the ONS have been investigating.
It is important to deal with missing data in administrative data sources for several reasons. For example, the data must have very high coverage of the population of interest, such as the usual resident population in England and Wales. Without that coverage, the quality of outputs based on the administrative sources will not be sufficient.
We also want to be able to allocate records to a variety of geographies, from national and regional level down to small areas like local authorities. This will help us derive a range of statistics. Allocating records in this way is particularly challenging when using administrative data sources to measure the whole population, because it depends on how the administrative data are derived. Administrative data are usually collected when citizens access a service, for instance registering with a GP, or submitting Self-Assessment forms if they are self-employed. We need to make sure the data used are accurate and inclusive for statistical purposes.
Another common challenge with any data source covering people who use services is that some of the information in records may be:
- missing — this could be because service users have not provided certain pieces of information
- inconsistent — this could be where there is different information when linking across multiple data sources
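As a minimal sketch of how these two cases might be flagged, assume two hypothetical linked extracts (`source_a` and `source_b`) that share a made-up `person_id` key and a `postcode` field. None of these names or values come from ONS systems. They are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical linked extracts; all names and values are illustrative only.
source_a = pd.DataFrame({
    "person_id": [1, 2, 3],
    "postcode": ["AB1 2CD", None, "EF3 4GH"],
})
source_b = pd.DataFrame({
    "person_id": [1, 2, 3],
    "postcode": ["AB1 2CD", "ZZ9 9ZZ", "XY5 6ZW"],
})

# Link the two sources on the shared identifier.
linked = source_a.merge(source_b, on="person_id", suffixes=("_a", "_b"))

# Missing: no value recorded in at least one source.
linked["missing"] = linked[["postcode_a", "postcode_b"]].isna().any(axis=1)

# Inconsistent: both sources hold a value, but the values disagree.
linked["inconsistent"] = ~linked["missing"] & (
    linked["postcode_a"] != linked["postcode_b"]
)
```

Keeping the two flags separate matters because a missing value and a conflicting value may call for different treatment later on.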
To solve this problem, we considered using statistical imputation methodologies that aim to replace missing or implausible items with realistic values. This means that when there is missing or implausible data, the imputation method will estimate a value based on other information in the data.
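To make the idea concrete, here is a deliberately simple sketch of one imputation approach: filling a missing value with the median of observed values from similar records. It assumes a hypothetical table with made-up `region` and `age` columns. It is not the method chosen by the ONS or recommended in the review, only an illustration of estimating a value "based on other information in the data".

```python
import pandas as pd

# Hypothetical records; column names and values are illustrative only.
records = pd.DataFrame({
    "region": ["North", "North", "North", "South", "South", "South"],
    "age": [34.0, None, 40.0, 61.0, 59.0, None],
})

# Median observed age within each region, aligned to every row.
region_median = records.groupby("region")["age"].transform("median")

# Keep a flag so imputed values can be told apart from observed ones.
records["age_imputed"] = records["age"].isna()

# Replace only the missing ages with the regional median.
records["age"] = records["age"].fillna(region_median)
```

Retaining the `age_imputed` flag is a design choice in this sketch: it lets later analysis check how sensitive results are to the imputed values.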
One challenge we faced was to decide what type of imputation methodology to use. There are various types of imputation models that differ depending on the:
- type and complexity of data being used
- amount of missing data
- availability and ‘ease-of-use’ of the methods and data sources to resolve the missing data
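The amount of missing data can be profiled before choosing a method. The sketch below assumes a hypothetical table with made-up `age`, `sex` and `postcode` columns, and computes the share of missing values per variable and the share of records with at least one missing item.

```python
import pandas as pd

# Hypothetical records; column names and values are illustrative only.
records = pd.DataFrame({
    "age": [34, None, 40, 61],
    "sex": ["F", "M", None, None],
    "postcode": ["AB1 2CD", "EF3 4GH", "XY5 6ZW", "ZZ9 9ZZ"],
})

# Share of missing values per variable, most affected variable first.
missing_share = records.isna().mean().sort_values(ascending=False)

# Share of records with at least one missing item.
incomplete_share = records.isna().any(axis=1).mean()
```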
We decided that an extensive but rapid literature review of potential methods would be beneficial, and chose to outsource this work to an external organisation. Third-party contracts with external organisations can fill gaps in resource and get work completed quickly. The Methodological Research Hub awarded this work to Alma Economics following a commercial tender process. Alma Economics produced:
- a literature review
- an evidence map — this is an interactive tool that categorises the reviewed papers into succinct themes
Finally, the team from Alma Economics researched available code options in Python. This has enabled us to explore how we can use the reviewed methods in our research project.
Having this literature review and coding examples puts the ONS in a strong position to tackle missing administrative data. It also highlights which methods may be best suited to address these issues. As we have outlined, there are several types of missing data, and the literature review gives us an awareness of the variety of methods needed to deal with them.
Of course, the success of any method for dealing with missing data will always depend on the availability of high-quality data that can inform the imputation. These considerations about data availability, together with the information in the review, will help us understand the latest and best-quality methods that we could use in the future where missing administrative data threatens the quality of that data for ONS purposes.
In summary, through this work the Methodological Research Hub at the ONS received a quality product that made a significant contribution to our research, and we benefitted from outsourcing to an external organisation. This meant we could free up resource within the ONS and ensure our work progressed at pace. The work will inform our consideration of future methods to address missing data in various applications that make use of administrative data.