Tips for urgent quality assurance of data
This guidance sits alongside ‘Tips for urgent quality assurance of ad-hoc statistical analysis’. That guidance focuses on the analysis; this one looks at the data used in the analysis.
Publication date: 6 May 2020
Owner: Data Quality Hub
Who this is for: Users and producers of statistics
The Code of Practice for Statistics says “quality means that statistics fit their intended uses, are based on appropriate data and methods, and are not materially misleading.”
This is supported by the principle Q1: Suitable data sources: “Statistics should be based on the most appropriate data to meet intended uses. The impact of any data limitations for use should be assessed, minimised and explained”.
The Office for Statistics Regulation produced a Regulatory Standard including a toolkit on the Quality Assurance of Administrative Data (QAAD) for statistical producers. Using the toolkit helps decide on the level of assurance required by considering two dimensions: the likely risks to quality and the profile of the resulting statistics.
Once the level of assurance is decided, producers quality assure data through actions on investigation, management and communication.
If you are rapidly acquiring data to urgently answer questions, possibly in a rapidly changing situation, you will not have the opportunity to work systematically through a managed framework. The data will typically also lack the comprehensive metadata and quality management information to help understand its properties. Here we offer some tips to help address the requirements of quality assurance in such circumstances.
Urgency modifies the profile of the resulting statistics because, for example, the decisions being taken will change. This emphasises that the profile isn’t static, and this is also true of the risks to quality. So, even if a thorough QAAD analysis has previously been carried out on these data you will have to think about the data in the light of the current situation.
“No Assurance” is never an option, as it could lead to statistics and analysis that “materially mislead”. This means you will still need some time for quality assurance.
Broad tips on your approach
How will the statistics or analysis be used?
If you are to respond quickly, you will need to be able to work closely with the users of the data to find out what they need to know from the statistics. In doing so, discuss their appetite for the residual risk from a rapid assessment of data quality. Internal use for management information may allow for more risk than wider publication.
Try to determine how this user base may evolve over the near future. Will a narrow initial use be followed by publication to a wider audience? This would suggest a phased quality assurance with more in-depth scrutiny following on from initial urgent work.
Be open about the added uncertainty
A consequence of not being able to investigate in depth is that you will have a weaker grasp on the likely limitations of the data, including measurement errors and whether the recorded attributes reflect reality well. This can be mitigated by labelling the statistics and analysis with a frank description of what has been summarised. This should give an indication of the coverage and timing, and the method of measurement. You should also be transparent about areas you could not investigate in time. This is part of the trade-off being made: the public interest being served by the best available information, accompanied by straightforward, frank explanations.
Tips on detailed actions
Tip 1: What exactly is in the data?
What are the units or records in each table in the data – what is it that the recorded attributes refer to: events, people, adults, patients, households, addresses, businesses, care homes? Are these consistent throughout, or have different types of units been mixed in a data table?
What is the coverage of the data? Which units should be recorded in the data, and which are actually recorded? Try to be as explicit as possible about what is included, noting, for example, that geographic coverage could refer to usual residence, presence on a particular night, workplace, etc.
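A quick first pass at these questions is to count records by type of unit and look for identifiers that appear more than once. The following is a minimal sketch in Python using only the standard library; the field names `unit_id` and `unit_type` and the sample records are hypothetical and would need to be replaced with whatever your data actually contain.

```python
from collections import Counter

# Hypothetical extract: each record is supposed to describe one unit.
records = [
    {"unit_id": "A1", "unit_type": "care home"},
    {"unit_id": "A2", "unit_type": "care home"},
    {"unit_id": "A2", "unit_type": "care home"},  # repeated identifier
    {"unit_id": "P9", "unit_type": "patient"},    # a different type of unit
]


def profile_units(rows, id_field="unit_id", type_field="unit_type"):
    """Count records per unit type and list identifiers that occur more than once."""
    type_counts = Counter(row[type_field] for row in rows)
    id_counts = Counter(row[id_field] for row in rows)
    duplicates = sorted(uid for uid, n in id_counts.items() if n > 1)
    return dict(type_counts), duplicates


type_counts, duplicate_ids = profile_units(records)
print("records by unit type:", type_counts)
print("identifiers recorded more than once:", duplicate_ids)
```

More than one unit type in a table, or repeated identifiers, is not necessarily an error, but either is worth raising with the supplier before you aggregate.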
Tip 2: How did the data get here?
While you won’t get an in-depth picture of the motivations and context of the data inputters and the errors that might arise, you should build the best impression you can. Can you get access to the form or forms used to capture the data at each point in time, together with any associated instructions? Are there checks built into the instrument that captures the data? Where data register an event, is there a time lag between the event and registration, and are you clear about which date variables record each?
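If the data hold both an event date and a registration date, a simple summary of the gap between them can show how much delay to expect and whether the two variables have been confused. This is a rough sketch only; the field names `event_date` and `registration_date` and the sample dates are assumptions made for illustration.

```python
from datetime import date
from statistics import median

# Hypothetical records with separate event and registration dates.
records = [
    {"event_date": date(2020, 4, 1), "registration_date": date(2020, 4, 3)},
    {"event_date": date(2020, 4, 2), "registration_date": date(2020, 4, 10)},
    # Registered before the event: the date variables may have been swapped.
    {"event_date": date(2020, 4, 5), "registration_date": date(2020, 4, 4)},
]


def registration_lags(rows):
    """Return the lag in days between event and registration for each record."""
    return [(row["registration_date"] - row["event_date"]).days for row in rows]


lags = registration_lags(records)
negative_lags = [lag for lag in lags if lag < 0]

print("median lag in days:", median(lags))
print("records registered before the event:", len(negative_lags))
```

A long or variable lag suggests that the most recent periods are incomplete, which is worth stating when you label the statistics.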
Can you sketch out the data journey from that initial input to the set you have in front of you? What systems does it pass through? Errors can be introduced at any step where data are passed from one system to another, merged or linked with other data, or stored in a format such as a spreadsheet.
Where variables in the data have been derived from others, can you get access to the code used to create these? Be aware that brief variable and value labels typically don’t capture the complexity of the decisions taken.
Where data have been linked, find out how the linking was performed and what the quality of the linkage was. Linking may use a common key variable, or a combination of attributes, possibly in an approximate match.
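Where the link uses a common key and you can see the key values on both sides, a basic summary of the match rate gives a first indication of linkage quality. The sketch below covers exact key matching only, not approximate matching, and the key values and the function name `linkage_summary` are hypothetical.

```python
from collections import Counter

# Hypothetical key values from two data sets linked on a common key.
left_keys = ["K1", "K2", "K3", "K4", "K4"]
right_keys = ["K2", "K3", "K4", "K5"]


def linkage_summary(left, right):
    """Summarise an exact-key link: match rate, unmatched keys and repeated keys."""
    right_set = set(right)
    matched = [key for key in left if key in right_set]
    left_counts = Counter(left)
    return {
        "match_rate": len(matched) / len(left),
        "unmatched_left": sorted(set(left) - right_set),
        "repeated_left": sorted(key for key, n in left_counts.items() if n > 1),
    }


summary = linkage_summary(left_keys, right_keys)
print(summary)
```

Unmatched records are rarely a random subset, so it helps to look at which kinds of units fail to link as well as how many.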
Tip 3: Learn from other uses of the data
What else are these data used for and what quality assurance has already been done? If the data are used to determine eligibility for a service or benefit, then the attributes used in that determination are likely to have been scrutinised better. Your supplier may be able to describe that scrutiny.
If there are other analyses of the data, including internal management information, can you get access to those and partially reproduce them? Doing so helps confirm that you have the right data and gives insight into the properties of the data.
Tip 4: Triangulate with other sources
Where other sources produce estimates for the same or related phenomena, do preliminary estimates from your new data lead to coherent patterns? There will be differences, but can you triangulate between sources and produce a convincing explanation for the size and direction of those differences?
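A simple way to begin triangulating is to line up estimates for the same periods or groups and look at the size and direction of the relative differences. The following sketch assumes two hypothetical sets of weekly counts; the 25% tolerance is an arbitrary illustration, not a recommended threshold.

```python
# Hypothetical weekly counts for the same phenomenon from two sources.
new_source = {"week 1": 120, "week 2": 150, "week 3": 90}
established_source = {"week 1": 100, "week 2": 160, "week 3": 150}


def relative_differences(new, reference):
    """Relative difference of the new estimate from the reference, for shared periods."""
    shared = sorted(set(new) & set(reference))
    return {
        period: (new[period] - reference[period]) / reference[period]
        for period in shared
    }


def flag_large(differences, tolerance=0.25):
    """Periods where the absolute relative difference exceeds the tolerance."""
    return [period for period, diff in differences.items() if abs(diff) > tolerance]


differences = relative_differences(new_source, established_source)
flagged = flag_large(differences)

for period, diff in differences.items():
    print(f"{period}: {diff:+.1%}")
print("periods needing an explanation:", flagged)
```

The flagged periods are prompts for investigation: differences in coverage, timing or definition between the sources may fully account for them.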
Tip 5: Consider distortive effects
In any quality assurance of data, you should consider distortive effects arising from, for example, pressure to meet a target. You may not be able to fully explore these, but frank labelling, as mentioned above, will help to alert users to this possibility. In addition, where you are acquiring data to respond to an urgent situation, consider whether that situation could motivate data inputters differently, or put stress on the system in such a way as to distort the data.
Tip 6: Put your data into a tidy form
Manipulate the data into a tidy format, with units in rows, attributes (or variables) stored in columns and different types of units stored in different tables.
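Data often arrive in a "wide" layout, for example with one column per reporting week. As an illustration, the sketch below reshapes such a table so that each row holds a single observation. The table, the column names and the helper `to_tidy` are all hypothetical; in practice a data-frame library would usually do this in one call.

```python
# Hypothetical "wide" table: one row per care home, one column per week.
wide_rows = [
    {"home_id": "H1", "2020-04-06": 3, "2020-04-13": 5},
    {"home_id": "H2", "2020-04-06": 0, "2020-04-13": 2},
]


def to_tidy(rows, id_field, variable_name, value_name):
    """Reshape wide rows so each output row holds one unit, one period and one value."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue  # the identifier is carried onto every output row
            tidy.append(
                {id_field: row[id_field], variable_name: column, value_name: value}
            )
    return tidy


tidy_rows = to_tidy(wide_rows, "home_id", "week_commencing", "cases")
for row in tidy_rows:
    print(row)
```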
Where there are hierarchies in the data, what is the relationship between the tables? Should units in one table have related entries in other tables, or is it possible for that to be zero?
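These relationships can be checked directly by comparing the keys held in each table. The sketch below assumes a hypothetical household table and person table joined on `household_id`; whether a household with no people is an error depends on what the data are meant to cover.

```python
# Hypothetical parent and child tables sharing the key "household_id".
households = [
    {"household_id": "HH1"},
    {"household_id": "HH2"},
    {"household_id": "HH3"},
]
people = [
    {"person_id": "P1", "household_id": "HH1"},
    {"person_id": "P2", "household_id": "HH1"},
    {"person_id": "P3", "household_id": "HH4"},  # refers to a household not in the table
]


def check_hierarchy(parents, children, key):
    """Compare key values in two related tables and report the mismatches."""
    parent_keys = {row[key] for row in parents}
    child_keys = {row[key] for row in children}
    return {
        "children_without_parent": sorted(child_keys - parent_keys),
        "parents_without_children": sorted(parent_keys - child_keys),
    }


hierarchy_report = check_hierarchy(households, people, "household_id")
print(hierarchy_report)
```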
Tip 7: Plot your data
Univariate and multivariate plots of your data will help you understand the distributions and quickly see individual departures from general patterns. For example, spikes in a distribution may come from default values or imputed data, or they may reflect expected peaks around known values such as benefit rates.
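You would normally use a plotting library for this. As a dependency-free stand-in, the sketch below prints a text histogram and lists values that account for a large share of records, which is where default or imputed values tend to show up. The sample values and the 20% share threshold are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical recorded values; 0 and 99 might be defaults or missing-value codes.
values = [0, 0, 0, 0, 0, 0, 12, 15, 15, 18, 20, 22, 99, 99, 99, 99, 99]


def text_histogram(data):
    """One line per distinct value, with a bar whose length is the frequency."""
    counts = Counter(data)
    return [f"{value:>4} | {'#' * counts[value]}" for value in sorted(counts)]


def spikes(data, share=0.2):
    """Values that make up at least the given share of all records."""
    counts = Counter(data)
    return sorted(value for value, n in counts.items() if n / len(data) >= share)


for line in text_histogram(values):
    print(line)
print("possible spikes:", spikes(values))
```

A spike is only a prompt to ask the supplier what the value means; it may equally be a genuine peak.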
Tip 8: Carry out credibility checks
As with any analysis, try to anticipate the magnitude of the results you will get from the data. Discuss this with your users in advance – how many cases were they expecting? Can you use other sources to inform these predictions? Then compare your rough predictions with simple aggregates from the data. Can differences, possibly for different groups, be explained?
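One way to make this comparison routine is to write the rough predictions down before looking at the data, then compare them with simple counts by group. In the sketch below, the records, the predicted counts and the tolerance of 50% either side are all hypothetical and should be agreed with your users.

```python
from collections import Counter

# Hypothetical record-level data and rough predictions agreed with users in advance.
records = [{"region": "A"}] * 108 + [{"region": "B"}] * 9
predicted = {"A": 100, "B": 40, "C": 10}


def credibility_check(rows, predictions, group_field, tolerance=0.5):
    """Compare observed counts per group with predictions and flag large departures."""
    observed = Counter(row[group_field] for row in rows)
    results = {}
    for group, prediction in predictions.items():
        ratio = observed[group] / prediction  # a Counter returns 0 for absent groups
        within = (1 - tolerance) <= ratio <= (1 + tolerance)
        results[group] = "plausible" if within else "needs explanation"
    return results


check = credibility_check(records, predicted, "region")
print(check)
```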
Tip 9: Focus resource on what matters
Your ability to check and edit the data will be restricted by the time available. Try to focus that effort on observations and attributes that are likely to be influential. Also consider which data will affect the most high-profile estimates, where errors are more likely to mislead or cast doubt on the credibility of the analysis and your users. Which inconsistencies in the data might lead to such concerns?
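For a total, one rough measure of influence is each record's share of that total, so that checking starts with the records that would move the estimate most. The sketch below illustrates the idea with hypothetical business records; it assumes non-negative values and a positive total, and other estimates would need a different measure of influence.

```python
# Hypothetical records contributing to a published total.
records = [
    {"business_id": "B1", "turnover": 50},
    {"business_id": "B2", "turnover": 20},
    {"business_id": "B3", "turnover": 900},
    {"business_id": "B4", "turnover": 30},
]


def rank_by_contribution(rows, id_field, value_field):
    """Order records by their share of the overall total, largest first."""
    total = sum(row[value_field] for row in rows)
    if total <= 0:
        raise ValueError("shares are only meaningful for a positive total")
    ranked = sorted(rows, key=lambda row: row[value_field], reverse=True)
    return [(row[id_field], row[value_field] / total) for row in ranked]


ranking = rank_by_contribution(records, "business_id", "turnover")
for business_id, share in ranking:
    print(f"{business_id}: {share:.0%} of total")
```

Here a single record accounts for most of the total, so confirming that one value would do more for the estimate than checking the other three.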