Anonymisation and data confidentiality
|Publication date:|31 January 2020|
|Owner:|Data Quality Hub|
|Who this is for:|Members of the Government Statistical Service|
This guidance provides advice for maintaining data confidentiality while making the data as useful as possible.
What is data confidentiality?
Data confidentiality ensures that confidential information is not disclosed to parties who do not have authority to access it.
What is anonymisation?
Anonymisation is a process that transforms personal or identifiable data into data from which individuals cannot be identified. It requires the removal, obscuring or aggregation of identifiers. To achieve anonymity, combinations of variables that allow indirect identification must also be protected by one or more methods of disclosure control.
What is Statistical Disclosure Control?
Statistical Disclosure Control (SDC) is the application of methods that reduce the risk of disclosing information about data subjects. SDC methods usually restrict the amount of data released or reduce its detail. As an alternative, access to unaltered data may be controlled by releasing them to a small number of users under licence.
What does SDC achieve?
SDC modifies data so that the risk of data subjects being identifiable is within acceptable limits. The aim is to make the data as useful as possible, while maintaining appropriate confidentiality.
Where should SDC be applied?
- on cells that have frequencies with low counts, which could enable identification of data subjects and, potentially, their associated attributes
- on cells that have magnitudes where there is a dominating contributor or contributors
- on unique records in the data with respect to a small number of defined variables. Such data are usually released under licence
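The first two cases above can be illustrated with a minimal Python sketch. It assumes a suppression threshold of 5 for frequency cells and a simple dominance check in which a magnitude cell is flagged when its largest contributors supply more than 80% of the total. Both values, and the function names, are illustrative assumptions rather than recommended GSS settings.

```python
from typing import Dict, List, Optional


def suppress_small_counts(counts: Dict[str, int], threshold: int = 5) -> Dict[str, Optional[int]]:
    """Return a copy of `counts` with counts from 1 to threshold - 1 set to None.

    The threshold of 5 is an assumed, illustrative value.
    """
    return {cell: (None if 0 < n < threshold else n) for cell, n in counts.items()}


def is_dominated(contributions: List[float], n: int = 1, k: float = 0.8) -> bool:
    """Flag a magnitude cell whose `n` largest contributors exceed share `k` of the total.

    The defaults n=1 and k=0.8 are assumptions for illustration only.
    """
    total = sum(contributions)
    if total <= 0:
        return False  # nothing to dominate in an empty or zero-valued cell
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return top_n / total > k


# Hypothetical frequency table: "Area B" falls below the threshold
frequency_table = {"Area A": 12, "Area B": 3, "Area C": 0}
print(suppress_small_counts(frequency_table))

# Hypothetical magnitude cell: one contributor supplies 90% of the total
print(is_dominated([90.0, 5.0, 5.0]))
```

In a published table, suppressed cells may also need secondary suppression so that they cannot be recovered from row or column totals. That step is not shown here.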
Common challenges in disclosure were identified in the National Statistician’s Quality Review (NSQR). These challenges were:
- assessing what is possible versus what is workable in a statistical context. In practice, balancing risk of disclosure with data utility is a key consideration
- predicting and planning for future change amid evolving public attitudes to privacy. This is influenced by publicity around any data privacy breaches
- evaluating the risk of intruder attacks, particularly when linking data
- keeping up-to-date with the latest methodological research and technological advancements, and building capability
- developing approaches to better exploit new data sources and technological advancements, especially when their usefulness has not yet been demonstrated
- assessing and developing the potential of specialist software and automation
- communicating disclosure risk, choice of privacy and confidentiality methods and their trade-offs
- future-proofing data releases and anticipating future releases by different data providers
The Introduction to statistical disclosure control course covers typical disclosure risks and SDC methods. This course is open to all members of the GSS.
Data ethics is covered in the introduction to data ethics online course, which is open to all members of the GSS.
The Better Use of Data page provides information about the statistics and research strands of the Digital Economy Act 2017. The Office for National Statistics (ONS) hosts a data protection information page that provides privacy information for ONS data subjects.
The GSS website hosts materials on various forms of disclosure control:
- guidance for tables produced from surveys and case studies for tables produced from surveys (PDF, 38KB)
- guidance for tables produced from administrative sources and case studies for tables produced from administrative sources (PDF, 85KB)
- guidance for microdata produced from social surveys and case studies for microdata produced from social surveys (PDF, 38KB)
The anonymisation code of practice (PDF, 1.84MB) gives guidance for managing data protection risk.
The National Statistician’s Quality Review (NSQR) of privacy and data confidentiality methods made recommendations around the increasing detail and volume of available data. This NSQR also discussed challenges to protecting the confidentiality of personal information.
Data confidentiality champions
A cross-government network group of data confidentiality champions has been set up to:
- identify, promote and share best practice across government
- act as departments’ central point of contact for advice and support
- identify areas within departments needing extra support
- help to test and disseminate GSS tools, guidance and training
Terms of reference for champions and further information will be released on this web page. If you would like to become a champion of data confidentiality, please contact email@example.com.
The Disclosure Control Centre of Expertise
A Disclosure Control Centre of Expertise (DisCoE) has been set up, led by the Methods, Data and Research directorate of the ONS. This group carries out research into new methods for disclosure control. The centre will also expand the toolkit of GSS courses and guidance that is available.
Publications and information about DisCoE will be shared on this page as they are released.
Future anonymisation and disclosure control methods
There is usually a trade-off between maintaining the privacy of data subjects and the accuracy of the information disclosed. Differential privacy gives a precise quantification of the disclosure risk. This allows statistical producers to inject a specific amount of noise into the data to safeguard against disclosure. Knowing the exact amount of noise needed, statisticians are able to maximise the quality of their data. To learn more, see the NSQR chapter on differential privacy (PDF, 964KB).
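One common way to add a calibrated amount of noise is the Laplace mechanism, sketched below in Python for a single count. This is a minimal illustration under stated assumptions, not necessarily the approach described in the NSQR chapter. The function names are hypothetical, the privacy parameter epsilon is chosen by the producer, and the sensitivity of 1 assumes that adding or removing one person changes the count by at most one.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one value from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return `true_count` plus Laplace noise with scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    sensitivity = 1.0  # assumption: one person changes the count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(2020)  # fixed seed so the sketch is repeatable
for epsilon in (0.1, 1.0, 10.0):
    # Smaller epsilon means stronger privacy protection and noisier output
    print(epsilon, round(private_count(100, epsilon, rng), 2))
```

The scale of the noise is fixed by epsilon and the sensitivity alone, which is what lets a producer state in advance how much accuracy is being traded for privacy.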
Synthetic data have the properties of real data sets but are generated using a statistical model. In practice, a synthetic data set is only as good as the model underpinning it. Producing realistic synthetic data is challenging. It is difficult to capture conditional relationships between variables and there is usually still a risk of disclosure. Elliot and Domingo-Ferrer (PDF, 566KB) discuss the potential of machine learning in synthetic data, and research conducted to synthesise data using deep learning techniques. Page et al (PDF, 936KB) identify differentially private synthetic data as one path for model deployment and discuss some potential advantages of this technique.
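The difficulty of capturing conditional relationships can be seen in a deliberately simple Python sketch. It assumes a toy data set of two categorical variables and a hypothetical model that samples each variable independently from its observed distribution. The records, variable meanings and function name are invented for illustration.

```python
import random
from typing import List, Tuple

Record = Tuple[str, str]  # (age band, employment status)


def synthesise_independently(records: List[Record], n: int, rng: random.Random) -> List[Record]:
    """Generate `n` synthetic records by sampling each variable on its own.

    Each column keeps its observed marginal distribution, but any
    relationship between the columns is ignored by this model.
    """
    columns = list(zip(*records))
    return [tuple(rng.choice(column) for column in columns) for _ in range(n)]


# Invented example data: age band fully determines employment status
real = [("0-15", "not working")] * 3 + [("16-64", "employed")] * 3
synthetic = synthesise_independently(real, 200, random.Random(48))

# Combinations that never occur in the real data but appear in the synthetic data
implausible = set(synthetic) - set(real)
print(sorted(implausible))
```

Here the marginal distributions are preserved, yet the synthetic data can contain children recorded as employed, which the real data never show. A more realistic model would need to capture the relationship between the two variables.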