Privacy and data confidentiality methods: a Data and Analysis Method Review (DAMR)
|Publication date:||13 December 2018|
|Owner:||Data Quality Hub|
|Who this is for:||Members of the Government Statistical Service|
This is the Data and Analysis Method Review (DAMR) on privacy and data confidentiality methods (this was previously called a National Statistician’s Quality Review).
Foreword from John Pullinger
The rapid increase in the detail, volume and frequency of data collected, alongside the diversification of data sources available, presents a real opportunity for the statistical community to innovate. This richer detail will provide better statistics that deepen our understanding of society and better support decision making.
On the other hand, making more data available raises new concerns and new challenges to protecting privacy and confidentiality of personal information. In this challenging and rapidly changing landscape, the statistical community has a legal and an ethical obligation to protect the confidentiality of data, while at the same time striving to meet evolving user demands for more detailed and helpful statistics.
I welcome the findings of this National Statistician’s Quality Review (NSQR) of privacy and data confidentiality methods, which will help the Government Statistical Service (GSS) to take full advantage of the cutting edge developments and research conducted by world leading experts from across academia and the private sector. Closely collaborating and joining forces with these experts helps us to ensure that the GSS prepares for the future and identifies opportunities to improve and innovate.
We are delighted to collaborate with leading experts in privacy and confidentiality external to the GSS and grateful for their contribution: Professor Natalie Shlomo and Professor Mark Elliot from the University of Manchester, Professor Josep Domingo-Ferrer from Rovira i Virgili University in Tarragona, Spain, Professor Kerina Jones and Professor David Ford from Swansea University, Professor Jim Smith and Professor Felix Ritchie from the University of the West of England, Dr Hector Page and Charlie Cabot from Privitar, and Professor Kobbi Nissim from Georgetown University, USA.
The Digital Economy Act (DEA) 2017 enables better sharing and use of data across organisational boundaries, at a time of dramatic increase in the volume of data available and a wide range of sources. New and updated legislation, including the General Data Protection Regulation (GDPR), has brought about major changes to the way organisations process personal data, encouraging greater transparency and accountability.
These developments present an unprecedented opportunity to innovate with data, while safeguarding privacy and fostering public trust. On the other hand, against a backdrop of increasingly sophisticated attacks being developed by intruders and the potentially serious consequences of data breaches, the data revolution presents significant new challenges to protecting privacy and confidentiality.
It is vital that the statistical community understands and addresses these evolving challenges to provide a solid foundation for innovation to take place. This is not a straightforward task, and there is a need to draw upon the full range of expertise in this fast-developing field.
By joining forces with leading experts, this Data and Analysis Method Review explores the latest advances in these methods. From a suite of contributing articles setting out the latest developments, the review draws out emerging themes and articulates the challenges facing the statistical community. It also sets out the steps that need to be taken across the statistical system, which will be further developed and implemented through engagement across organisations.
In collaboration with leading experts, this review has explored the latest research and trends for privacy and confidentiality methods and has identified the following challenges currently faced by producers of statistics:
- Assessing and balancing what is theoretically possible with what is feasible and appropriate in a statistical context. In practice, balancing risk of disclosure with data utility is a key consideration.
- Predicting and planning for future change when the nature of that change is uncertain and public attitudes to privacy are evolving; these attitudes are heavily influenced by publicity around any data privacy breaches.
- Evaluating the risk of intruder attacks and new types of privacy attacks, particularly from linking with information available from other sources.
- Keeping up-to-date with the latest methodological research and technological advancements, and building capability.
- Identifying and developing approaches and methods that better exploit the potential of new data sources and technological advancements, and investigating the feasibility of new developments whose practical usefulness has not yet been clearly demonstrated (such as machine learning and artificial intelligence applications, or practical applications of differential privacy).
- Assessing and developing the potential of specialist software and automation.
- Communicating disclosure risk, choice of privacy and confidentiality methods and their trade-offs to the statistical community, decision makers and the public.
- Future-proofing data releases and considering what is going to be (or is likely to be) released in the future by different data providers. Legally, producers of statistics do not have to consider any data not already (or due to be) in the public domain, but best practice should be to consider future-proofing in expectation of what is likely to be released.
This review aims to support decision makers as well as privacy and data confidentiality experts from across the Government Statistical Service (GSS) to understand the art of the possible for these methods and make informed decisions on the next steps required to keep pace with latest developments.
Follow-up work will help to expand understanding of how these methods can be developed and implemented across the GSS. Further strands of work are to be confirmed, but it is expected that they will support activities such as identification of priority areas, further understanding of the current practices and development of jointly agreed implementation plans.
The greatest challenge for all specialists in confidentiality protection and for producers of statistics is to move with the times. An important outcome from this review must be to help make informed decisions on further developments needed for these methods and to build capability across the Government Statistical Service (GSS). The key strands are:
- Establish a disclosure control centre of expertise led by the Office for National Statistics (ONS). The centre should look to build expertise across the GSS and should act as a central point of contact for disclosure control advice, including reviews of the current methods used against best practice and developments in this field.
- Further expand collaborations between the GSS, academia, private sector and other National Statistics Institutes (NSIs). This will include setting up an international working group with other NSIs to share knowledge, research and future plans on privacy and confidentiality. Engagement and collaboration with experts in academia will also be utilised to ensure the most up-to-date methods are understood, researched and applied across the GSS.
- Reinforce and encourage a balanced disclosure risk vs. data utility approach. Develop tools and capability for risk assessment, including approaches for addressing and evaluating the risks brought by new types of intruder attacks. Ensure there is good practice in risk management rather than only focusing on risk reduction. Acknowledge that the risk can never be zero and have procedures in place in the event of a confidentiality breach, including a more formal breach policy.
- Expand the toolkit of courses and guidance materials, ensuring they meet user needs. This could include provision of e-learning and other activities such as clinics and workshops, drawing on existing and emerging expertise across GSS. One of the first tasks here will be to assess what training and guidance are currently used.
- Explore the applications of new techniques and technology to identify and develop new approaches and methods that take better advantage of the opportunities provided by the data environment. Identify exploratory research needed to demonstrate the practical application for these new approaches for the production of statistics and highlight their trade-offs, advantages and limitations.
- Explore the potential for practical applications of new software and technologies in protecting privacy and confidentiality. Coordinate the development of specialist software and automation to support the assessment and implementation of anonymisation best practice. Explore options for applying SDC methods to support the development of new dissemination strategies, such as automated table generators, determining their applications and limitations.
- Develop an overarching confidentiality protection framework that offers data providers a way of measuring the risk and data utility across Statistical Disclosure Control (SDC) methods and privacy models in a more unified way. Explore options for this confidentiality framework to provide anonymisation transparency and help communicate disclosure risk, choice of privacy and confidentiality methods and their trade-offs to the decision makers and the public.
- Establish a GSS Task Force to implement the next steps from this DAMR. This forward-focused group should report to the GSS Statistical Policy and Standards Committee, and draw on the expertise and experience in anonymisation practice from across the GSS.
There is a need to continue to build on the research already carried out as part of this DAMR, working across the Government Statistical Service (GSS) and engaging support from the academic community and private sector.
Suggested topics for the forward programme of research include:
- Investigate how statistical disclosure control methods may be adapted to learn from insights provided by differential privacy by conducting a pilot on a “non-sensitive dataset”.
- Assess whether disclosure risk is increased by providing the parameters of the protection method.
- Review the set of intruder scenarios to include potential and emerging threats to privacy and confidentiality.
- Develop strategies for defending against reconstruction attacks.
- Assess the risk pertaining to inferential disclosure.
- Analyse the threat of social media and other data sources to identify appropriate strategies to protect data releases against the risk of a linking attack.
- Assess the practicality of producing and using synthetic datasets.
- Further develop the measures to quantify risk of disclosure and information loss including taking context into account.
- Explore the practical applications of machine learning and artificial intelligence in the protection of privacy and confidentiality, and options for avoiding the replication of human biases.
Summary of GSS challenges, next steps and further research and development
|GSS challenge||GSS next steps||GSS further research and development|
|Balancing theory with what is possible/feasible in a statistical context||Disclosure control centre of expertise led by the Office for National Statistics (ONS)||Develop the measures to quantify the risk of disclosure and information loss including taking context into account|
|Predicting and planning for future change||Expand collaborations between the Government Statistical Service (GSS), academia, private sector and other National Statistical Institutes|
|Evaluating the risk of intruder attacks||Develop tools and the capability for risk assessment||Review the set of intruder scenarios to include potential and emerging threats to privacy and confidentiality; develop strategies for defending against reconstruction attacks; assess the risk pertaining to inferential disclosure; analyse the threat of social media and other data sources to identify appropriate strategies to protect data releases against the risk of a linking attack|
|Keeping updated with the latest methodology||Expand the tool kit of courses and guidance materials|
|Identify and develop new approaches/methods that take better advantage of the opportunities provided by the data environment||Identify exploratory research needed to demonstrate the practical application for the approaches/methods||Investigate how Statistical Disclosure Control (SDC) methods may be adapted to learn from insights provided by Differential Privacy (DP) by conducting a pilot on a “non-sensitive dataset”; assess the practicality of producing and using synthetic datasets; explore the practical applications of machine learning and artificial intelligence in the protection of privacy and confidentiality|
|Assessing and developing the potential of specialist software||Develop specialist software and automation to support the assessment and implementation of anonymisation best practice||Develop the measures to quantify risk of disclosure and information loss including taking context into account|
|Communicating disclosure risk||Develop an overarching confidentiality protection framework||Assess whether disclosure risk is increased by providing parameters of the protection method; further develop measures to quantify risk of disclosure and information loss including taking context into account|
|Future-proofing data releases||Ongoing work including Statistical Policy and Standards Committee (SPSC) task force||Analyse the threat of social media and other data sources to identify appropriate strategies to protect data releases against the risk of a linking attack|
Contributed articles from experts
Methods to assess and quantify disclosure risk and information
Statistical Disclosure Control (SDC) methods are selected to be consistent, practical and implementable in a reasonable timeframe with available resources. Shlomo’s ‘Methods to assess and quantify disclosure risk and information’ provides a comprehensive description and examples for the different SDC approaches. Shlomo emphasises that to achieve the optimal balance between risk and utility it is important to have quantitative measures for both risk of disclosure and information loss.
To access this document please email: DQHub@ons.gov.uk
About the author
Professor Natalie Shlomo, The University of Manchester
Natalie Shlomo is a Professor of Social Statistics at the University of Manchester. She is a survey statistician with interests in survey design and estimation, record linkage, statistical disclosure control, statistical data editing and imputation and small area estimation. Natalie is an elected member of the International Statistical Institute, currently serving as Vice President and a fellow of the Royal Statistical Society and the International Association of Survey Statisticians. She is also on the editorial board of several journals and a member of several national and international methodology advisory boards.
The future of statistical disclosure control
In ‘The future of statistical disclosure control’, Elliot and Domingo-Ferrer discuss privacy models and their application to control for disclosure risk. The privacy models (for example k-anonymity) can be seen as alternative approaches to SDC methods, but in practice they can be strengthened by using one or several SDC methods. Decisions on which methods to use also need to consider the different types of data and contexts (referred to as the data environment), including previously released information and intruders’ knowledge of the population. Elliot and Domingo-Ferrer discuss the relationship between the data environment and the data intended to be released, and draw attention to the inherent difficulty of assessing risk of disclosure given that the data held in private databases are a key source of uncertainty for the SDC process.
To access this document please email: DQHub@ons.gov.uk
About the authors:
Professor Mark Elliot, The University of Manchester
Mark Elliot is a Professor of Data Science at the University of Manchester and one of the key international researchers in the field of anonymisation. His research cuts across computer science, statistics, law, sociology and psychology. He has extensive experience in collaboration with non-academic partners, particularly with national statistical agencies where he has been a key influence on disclosure control methodology, and his advice is regularly sought by organisations across all sectors of the United Kingdom (UK) economy. Mark leads the UK Anonymisation Network and is co-director of the National Centre for Research Methods.
Professor Josep Domingo-Ferrer, Rovira i Virgili University
Josep Domingo-Ferrer is a distinguished Professor of Computer Science and an ICREA-Acadèmia researcher at Universitat Rovira i Virgili, Tarragona, Catalonia. He holds the UNESCO Chair in Data Privacy and is the director of CYBERCAT-Center for Cybersecurity Research of Catalonia. His research interests are in data anonymisation, statistical disclosure control, information security and cryptographic protocols. He is an Institute of Electrical and Electronics Engineers (IEEE) Fellow and an Association for Computing Machinery (ACM) Distinguished Scientist.
Privacy, confidentiality and practicalities in data linkage
In ‘Privacy, confidentiality and practicalities in data linkage’, Jones and Ford discuss the different linking methods with their advantages and disadvantages. The choice of data linking method depends on a variety of factors, including burden on the data provider and technological limitations.
To access this document please email: DQHub@ons.gov.uk
About the authors:
Associate Professor Kerina Jones, Swansea University
Kerina Jones is an Associate Professor of Health Informatics at Swansea University. She is the academic lead for Information Governance and Public Engagement to ensure data protection and maximise socially-acceptable data utility across the various Swansea University-based data intensive/linkage initiatives, including: the Secure Anonymised Information Linkage (SAIL) Databank, Administrative Data Research Centre Wales, Farr Centre for Improvement in Population Health through E-records Research (CIPHER) and the recently awarded Health Data Research UK collaboration between Swansea University and Queen’s University Belfast.
Professor David Ford, Swansea University
David Ford is a Professor of Health Informatics at Swansea University, where he is the Principal Investigator and Director of the Administrative Data Research Centre Wales (ADRCW). He is also the Deputy Director of Farr-CIPHER: one of the four UK Centres of Excellence for E-Health Research, funded by a consortium of top UK research funders, as part of the Farr Institute. David is also joint lead of the SAIL Databank, an internationally recognised data linkage resource that safely and securely shares linked and carefully de-identified data from a wide variety of routinely collected data from across Wales.
Confidentiality and linked data
In ‘Confidentiality and linked data’, Ritchie and Smith discuss the potential of re-identification through linking and unpredictability of future scenarios. There is promising initial evidence for new solutions to issues resulting from linking diverse datasets. Ritchie and Smith discuss the advantages and challenges posed by using new techniques such as machine learning based record linking, highlighting that the same methods and analytical tools can also be used in malicious linking attacks.
To access this document please email: DQHub@ons.gov.uk
About the authors:
Professor Felix Ritchie, University of the West of England
Felix Ritchie is a Professor of Applied Economics at the University of the West of England (UWE), with research interests in data management, confidentiality, government statistics and the evidence base for policy-making. He has previously worked at the Office for National Statistics (ONS) where he designed and ran the Virtual Microdata Laboratory. He also developed the Five Safes model, Active Researcher Management and Principles-Based Output Statistical Disclosure Control. He continues to advise public and private sector organisations on governance, confidentiality, efficient processes and data quality and use.
Professor Jim Smith, University of the West of England
Jim Smith is Professor of Interactive Artificial Intelligence (AI) at the University of the West of England (UWE). His research in adaptive and human-interactive systems for machine learning and optimisation has been supported by the European Commission, Engineering and Physical Sciences Research Council (EPSRC), Innovate UK, Defence Science and Technology Laboratory, other government agencies and industry, and has attracted awards. He is a member of the EPSRC review college, a reviewer for the British Council and the Leverhulme Trust, and sits on the board of two major AI journals.
Differential privacy: an introduction for statistical agencies
In ‘Differential privacy: an introduction for statistical agencies’, Page, Cabot and Nissim define Differential Privacy (DP) as a privacy model that bounds privacy risk and formalises the intuitive view about privacy that a statistical output should reveal (almost) no information specific to an individual within the data set. Protecting against disclosure risk using differential privacy is a much more context-free approach and is easily quantifiable once the parameter in the model is set, but setting this parameter is one of DP’s practical challenges.
To access this document please email: DQHub@ons.gov.uk
About the authors:
Dr. Hector Page, Privitar
Hector Page was awarded a PhD in Computational Neuroscience from the University of Oxford in 2015, after which he joined the University College London (UCL) Institute of Behavioural Neuroscience, where he worked on spatial cognition. As a member of the Privitar research team, Hector works on bridging the gap between differential privacy theory and practice.
Charlie Cabot, Privitar
Charlie Cabot leads the research and data science team at Privitar, working on solving industry problems with privacy enhancing technologies. Charlie focuses on formal privacy models, risk metrics, and the statistical impact of anonymisation on analytics and data science. He has led data privacy consulting projects for clients across industries, and worked on productising state of the art privacy techniques. Previously, working in cyber security, Charlie engineered machine learning driven approaches to malware detection and modelled cyber attacks on computer networks.
Professor Kobbi Nissim, Georgetown University
Professor Kobbi Nissim is a McDevitt Term Chair in Computer Science, Georgetown University. Nissim’s work is focused on the mathematical formulation and understanding of privacy, bridging these with formulations in privacy law. With collaborators, Nissim presented some of the basic constructions supporting differential privacy, and has studied differential privacy in various contexts, including statistics, computational learning, mechanism design, and social networks. Other contributions of Nissim include the Boneh, Goh and Nissim homomorphic encryption scheme, and the research of private approximations. Nissim was awarded the Gödel Prize in 2017, the IACR TCC Test of Time Award in 2016 and in 2018, and the ACM PODS Alberto O. Mendelzon Test-of-Time Award in 2013.
Latest research and emerging themes
The rapid growth in the number and range of bodies releasing data has made it increasingly difficult for the Government Statistical Service (GSS) to keep track of relevant publicly available data.
The push for open data and transparency has been countered by the increase in privacy and confidentiality concerns and the increase in threat, both perceived and real, posed by multiple data sources from multiple data providers and possible linkage between those sources. The threat has been exacerbated particularly by the growth in use of social media and people’s willingness to post considerable detail about themselves on these platforms. It is possible that data collected and published by a government department could be matched to information posted on social media, leading to confidential details being revealed. Elliot and Domingo-Ferrer note that in practice it is difficult to determine which matches are incorrect, so the assumption is made that any possible match constitutes some level of disclosure risk.
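To make the linkage threat concrete, the following is a minimal Python sketch of how a released file might be matched to information gathered from public profiles. All records, names and field names (`age_band`, `district`, `occupation`, `benefit`) are invented for illustration, and the function `link` is hypothetical rather than taken from any of the contributed articles. In line with the assumption described above, every possible match is treated as carrying some risk, with a single candidate being the most serious case.

```python
from collections import defaultdict

# Hypothetical released records: no names, but some quasi-identifying attributes.
released = [
    {"age_band": "30-34", "district": "CF10", "occupation": "nurse", "benefit": "yes"},
    {"age_band": "30-34", "district": "CF10", "occupation": "teacher", "benefit": "no"},
    {"age_band": "65-69", "district": "SA1", "occupation": "retired", "benefit": "yes"},
    {"age_band": "65-69", "district": "SA1", "occupation": "retired", "benefit": "no"},
]

# Hypothetical information an intruder has collected from social media.
external = [
    {"name": "Person A", "age_band": "30-34", "district": "CF10", "occupation": "nurse"},
    {"name": "Person B", "age_band": "65-69", "district": "SA1", "occupation": "retired"},
]

KEYS = ("age_band", "district", "occupation")


def link(released_rows, external_rows, keys=KEYS):
    """For each external name, return the released rows sharing all key values."""
    index = defaultdict(list)
    for row in released_rows:
        index[tuple(row[k] for k in keys)].append(row)
    return {
        person["name"]: index.get(tuple(person[k] for k in keys), [])
        for person in external_rows
    }


matches = link(released, external)
for name, rows in matches.items():
    status = "unique match" if len(rows) == 1 else f"{len(rows)} candidate matches"
    print(name, "->", status)
```

In this toy example Person A matches exactly one released record, so the sensitive `benefit` value would be revealed, whereas Person B matches two records with different values and the intruder cannot tell which is correct.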
The Digital Economy Act (DEA) and developments in data technology will make it easier for the Government Statistical Service (GSS) to harness the power of data by linking and matching two or more data sets. Linking data sets can be a methodologically challenging, multi-stage process and it is important that privacy and confidentiality are protected at all stages.
Jones and Ford discuss the different linking methods with their advantages and disadvantages, while Ritchie and Smith discuss the potential of re-identification through linking and unpredictability of future scenarios. The choice of data linking method depends on a variety of factors, including burden on the data provider and technological limitations. Some of the measures for managing disclosure risk are specific to data linkage.
Identifiable data are required at some stage in all data linking methods and the resulting linked data sets are not immune to disclosure risks. As intruder attacks are becoming more sophisticated and easier to conduct with the aid of new technologies, it is essential to identify vulnerabilities at each stage of the data linking process and to develop approaches to protecting the data that will minimise the risks to an acceptable level.
There is promising initial evidence for new solutions to issues resulting from linking diverse datasets. Ritchie and Smith discuss the advantages and challenges posed by using new techniques such as machine learning based record linking, highlighting that the same methods and analytical tools can also be used in malicious linking attacks. All these techniques are in early stages of development and will require more research before providing sufficient evidence that they can become viable options.
In achieving high data utility, it is important to maintain a balance of risk and utility in outputs and avoid being too risk averse. Defining what this balance should look like in practice is not straightforward, and guidance on this is not clear.
The assessment is almost always dependent on the risk appetite of the data provider or the Information Asset Owner (IAO). It is important to be able to provide good advice and guidance to IAOs to ensure practice that is, wherever possible, consistent across the Government Statistical Service (GSS) and encourages risk management as opposed to complete risk avoidance.
Statistical disclosure control (SDC) methods should minimise the risk of disclosure to an acceptable level (in line with legal requirements and good practice) while releasing as much information as possible. The decision of which SDC methods to use is dependent on trade-offs between minimising the disclosure risk and having the least possible adverse impact on the usefulness of data.
This decision also needs to consider the different types of data and contexts (referred to as the data environment), including previously released information and intruders’ knowledge of the population. Elliot and Domingo-Ferrer discuss the relationship between the data environment and the data intended to be released, and draw attention to the inherent difficulty of assessing risk of disclosure given that the data held in private databases are a key source of uncertainty for the SDC process. Shlomo emphasises that to achieve the optimal balance between risk and utility it is important to have quantitative measures for both risk of disclosure and information loss.
The impact of SDC methods used on the utility of the data can be communicated through a disclosure risk – data utility map. SDC application entails using different methods with different parameters or thresholds and the resulting disclosure risk and data utility should be quantified, to support the decision on the best risk-utility reconciliation.
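As a rough illustration of how the points on such a map could be produced, the Python sketch below applies one method (suppressing small cells) at several thresholds and records a risk and a utility value for each. The table, the thresholds and both measures are assumptions chosen for simplicity: risk is taken as the share of cells published with a count of 1 or 2, and utility as the share of the original total still published. Neither measure is drawn from the review, and real assessments would use the more developed measures described by the contributors.

```python
# Hypothetical frequency table: cell label -> count.
table = {"A": 1, "B": 2, "C": 4, "D": 7, "E": 12, "F": 30}


def suppress_below(cells, threshold):
    """Suppress (set to None) every non-zero cell whose count is below the threshold."""
    return {k: (None if 0 < v < threshold else v) for k, v in cells.items()}


def risk(protected, small=3):
    """Illustrative risk measure: share of cells published with a count below `small`."""
    published_small = sum(
        1 for v in protected.values() if v is not None and 0 < v < small
    )
    return published_small / len(protected)


def utility(original, protected):
    """Illustrative utility measure: share of the original total still published."""
    kept = sum(v for v in protected.values() if v is not None)
    return kept / sum(original.values())


# One (threshold, risk, utility) point per parameter setting.
risk_utility_map = []
for threshold in (1, 2, 3, 5, 10):
    protected = suppress_below(table, threshold)
    risk_utility_map.append((threshold, risk(protected), utility(table, protected)))

for threshold, r, u in risk_utility_map:
    print(f"threshold={threshold:>2}  risk={r:.2f}  utility={u:.2f}")
```

Plotting risk against utility for each setting gives the map: raising the threshold lowers the illustrative risk but also discards more of the table, which is the trade-off the decision maker has to reconcile.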
Elliot and Domingo-Ferrer discuss privacy models and their application to control for disclosure risk. The privacy models (for example k-anonymity) can be seen as alternative approaches to SDC methods, but in practice they can be strengthened by using one or several SDC methods.
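For readers unfamiliar with k-anonymity, a dataset satisfies it when every combination of quasi-identifier values is shared by at least k records. The Python sketch below, using invented records and hypothetical helper names (`k_anonymity`, `recode_age`), computes k for a small file and then shows how one SDC method, recoding age into bands, can raise it. The 20-year band width is an arbitrary choice made for the example.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def recode_age(record, width=20):
    """Replace exact age with a band of the given width (a simple global recoding)."""
    lower = record["age"] // width * width
    return {**record, "age": f"{lower}-{lower + width - 1}"}


# Hypothetical microdata.
records = [
    {"age": 34, "sex": "F", "region": "Wales"},
    {"age": 36, "sex": "F", "region": "Wales"},
    {"age": 51, "sex": "M", "region": "Wales"},
    {"age": 58, "sex": "M", "region": "Wales"},
]
quasi_identifiers = ("age", "sex", "region")

k_before = k_anonymity(records, quasi_identifiers)
k_after = k_anonymity([recode_age(r) for r in records], quasi_identifiers)
print("k before recoding:", k_before)
print("k after recoding: ", k_after)
```

With exact ages every record is unique (k = 1); after recoding, each record shares its quasi-identifier values with one other record (k = 2), at the cost of less detailed age information.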
One of the latest additions to privacy models is Differential Privacy (DP). Page, Cabot and Nissim define DP as a privacy model that bounds privacy risk and formalises the intuitive view about privacy that a statistical output should reveal (almost) no information specific to an individual within the data set.
More powerful intruder attacks including new types of attack (such as reconstruction attacks) are some of the motivating factors that led to the development of DP. Protecting against disclosure risk using differential privacy is a much more context-free approach and is easily quantifiable once the parameter in the model is set, but setting this parameter is one of DP’s practical challenges.
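The role of the parameter can be illustrated with the Laplace mechanism, one widely used way of achieving differential privacy for a count. The Python sketch below is illustrative only: the parameter is conventionally called epsilon, the true count of 120 and the epsilon values are invented, and the function names are hypothetical. A counting query changes by at most 1 when one person is added or removed, so noise with scale 1/epsilon is added. The fixed seed is used purely to make the sketch repeatable and would not be appropriate in a real release, and a production implementation would need more care than this floating-point version.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return the count with Laplace noise of scale sensitivity / epsilon added."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(2018)  # fixed seed for repeatability of the illustration only
true_count = 120           # hypothetical confidential count

# Average absolute error of the published count for each parameter setting.
errors = {}
for epsilon in (0.1, 1.0, 10.0):
    draws = [dp_count(true_count, epsilon, rng) for _ in range(5000)]
    errors[epsilon] = sum(abs(d - true_count) for d in draws) / len(draws)
    print(f"epsilon={epsilon:>4}: mean absolute error is about {errors[epsilon]:.2f}")
```

Smaller values of epsilon give stronger protection but noisier outputs, and larger values the reverse, which is why choosing the parameter is a practical rather than a purely technical question.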
As DP is a rather new addition to privacy models, it is still in the early stages of transition from research to practical applications for statistical production, and users are still learning how to apply it effectively in practice (see Page, Cabot and Nissim and Shlomo).
Elliot and Domingo-Ferrer point out that these privacy models should not be regarded as competing models and an understanding of their applications and limitations can inform the use of one model to boost another.
An added difficulty in developing a unified framework that offers data providers a way of measuring the risk and data utility across SDC methods and privacy models is inconsistency in how they are applied in practice. This is in part a result of varying risk appetite, with individual context being part of the risk assessment. Whereas data with ‘zero risk’ (none or negligible) or ‘total risk’ (identified) are relatively easy to define, an intermediate position that provides sufficient utility while still providing sufficient protection depends very much on the relative importance subjectively placed on the two requirements.
Elliot and Domingo-Ferrer conclude that, in general, methods developed to quantify and understand data utility are less developed than methods designed to measure disclosure risk. Overall more research is needed for both types of methods to assess whether they are satisfactory. Further innovation and development will better support decision making and provide solutions to the tension between demand for more detailed information and the obligation to ensure privacy and confidentiality. This becomes more pressing when considering big data, data mining and a rapidly expanding data environment.
Across the Government Statistical Service (GSS), Statistical Disclosure Control (SDC) methods are selected to be consistent, practical and implementable in a reasonable timeframe with available resources. Natalie Shlomo provides a comprehensive description and further examples for the different SDC approaches.
Traditionally, a statistician’s choice of SDC method (or combination of methods) and threshold parameters is guided by expert advice and previous good practice examples of dealing with disclosure risk and information loss for similar statistics. For example, for frequency tables, record swapping is the preferred pre-tabulation method (implemented on microdata prior to constructing the table) due to ease of implementation, but table redesign, random rounding or noise addition is advised for post-tabulation treatment (after the table is constructed). For magnitude tables, primary and secondary suppression is advised with consideration given to the sensitivity of the disclosive cell and threshold rules (see Shlomo for a classification of disclosure risks). A high-level classification of these methods by type of output can also be found in Elliot and Domingo-Ferrer.
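Two of the post-tabulation treatments mentioned above can be sketched briefly in Python. The first function applies unbiased random rounding to a frequency count: a count is rounded up with probability equal to its remainder divided by the base, so the expected published value equals the true count. The second applies a simple threshold rule for primary suppression in a magnitude table. The base of 3, the minimum of three contributors, the tables and the function names are all assumptions made for illustration. Secondary suppression, which protects suppressed cells from being recovered from totals, is not shown.

```python
import random


def random_round(count, base, rng):
    """Unbiased random rounding of a non-negative integer count to a multiple of `base`."""
    remainder = count % base
    if remainder == 0:
        return count  # already a multiple of the base, left unchanged
    lower = count - remainder
    # Round up with probability remainder / base so the expected value equals the count.
    return lower + base if rng.random() < remainder / base else lower


def primary_suppress(cells, min_contributors=3):
    """Threshold rule: publish a cell value only if it has enough contributors."""
    return {
        label: (value if contributors >= min_contributors else None)
        for label, (value, contributors) in cells.items()
    }


rng = random.Random(0)  # fixed seed for repeatability of the illustration only

# Hypothetical frequency table, rounded to base 3.
frequency_table = {"North": 1, "South": 5, "East": 9, "West": 14}
rounded = {label: random_round(count, 3, rng) for label, count in frequency_table.items()}
print(rounded)

# Hypothetical magnitude table: label -> (total turnover, number of contributors).
magnitude_table = {"Retail": (5200.0, 12), "Mining": (880.0, 2), "Fishing": (140.0, 1)}
published = primary_suppress(magnitude_table)
print(published)
```

Random rounding keeps every cell within one base of its true value, while the threshold rule removes cells outright, so the two treatments lose information in different ways.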
Intruder testing can help identify vulnerabilities and help measure disclosure risks. However, this is a complex and resource intensive process. Elliot and Domingo-Ferrer discuss the different stages, expertise and resource required, and recommend this approach for new data situations where the calibration of disclosure risk measures is difficult to achieve.
A disclosure risk that is becoming more prominent with the development of automatic table generators and remote analysis servers is inferential disclosure. More widely, the inferential disclosure risk has led the statistical community to recognise the increased need to use perturbative methods (see Shlomo and Page, Cabot and Nissim). Inferential disclosure risk is an area where further research is needed to explore options and identify solutions that can help inform the choice of methodology used to protect the different types of statistics. Overall, further research is needed to help develop approaches that can deal with, and build protection against, malicious behaviours and intruder attacks.
Elliot and Domingo-Ferrer discuss the challenges posed to the existing SDC model by the changes in the data environment, and the increase in the data available. Contributing to this challenge is the increase in capacity to process data (such as linking). However, minimising the risk of disclosure when working with different types of data sets has exposed some of the limitations of the standard SDC model and the need to develop new approaches and identify new risks.
Machine learning could be a useful tool to help further develop the SDC model but this is at an early stage of development.
As SDC methods are developed and evaluated, threat models are updated to reflect what are considered potential threats at a particular point in time. These models have to be reviewed regularly to allow for new developments, especially increases in computing power and in the richness of auxiliary data.
Synthetic data have the properties of real data sets but are composed of artificial individual cases generated using a particular model. Elliot and Domingo-Ferrer touch on the lack of consensus between experts on whether synthetic data can be viewed as a statistical disclosure control (SDC) method or as an alternative to SDC. In practice, a synthetic dataset is as good as the model underpinning it; it is difficult to capture conditional relationships between variables and the risk of disclosure is not completely eliminated. Elliot and Domingo-Ferrer discuss the potential of machine learning in synthetic data generation and model building, and the initial research conducted to synthesise data using deep learning techniques.
Page, Cabot and Nissim identify differentially private synthetic data as one path for the differential privacy model deployment, and discuss some potential practical advantages such as allowance for indefinite queries of the data.
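The review does not specify a particular differential privacy mechanism. As background only, the sketch below shows the textbook Laplace mechanism for a single counting query, which is one common building block of differentially private releases. The function names and the privacy parameter values are illustrative assumptions, and the floating-point sampling used here is not suitable for production use.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centred at 0.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(true_count, epsilon, rng=None):
    """Return a noisy count under the Laplace mechanism.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only so the demo is repeatable
    for eps in (0.1, 1.0, 10.0):  # hypothetical privacy budgets
        print(eps, round(dp_count(100, eps, rng), 2))
```

Smaller values of `epsilon` give stronger protection but noisier counts, which is the same risk-utility trade-off discussed throughout the review.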
It is relatively common for experts and researchers to use different terms to refer to the same concept, or alternatively, to use the same term when referring to different concepts. While some of these terms have legal implications, they are not always used consistently. To complicate matters further, in this fast-evolving field there isn’t always consensus between experts on how some of the new developments can be implemented in practice. In addition, the terminology used is evolving.
Anonymisation, privacy, confidentiality and Statistical Disclosure Control (SDC) are sometimes used interchangeably. SDC methods are just one of the tools in the anonymiser’s toolbox, and to fully identify privacy concerns an assessment that goes beyond disclosure control is needed.
Elliot and Domingo-Ferrer differentiate between anonymisation (transforming personal into non-personal or anonymised data) and SDC methods, with the latter being a subset of methods used as part of the anonymisation process to manipulate and minimise disclosure risk.
This inconsistent use of terminology makes the dissemination of findings to a non-expert audience more difficult, and is at times a source of confusion that could be avoided by developing guidance and encouraging the consistent and clear use of terminology across the Government Statistical Service (GSS) and more widely. When communicating to the public how risks of disclosure are dealt with, consistent and meaningful language is essential in building public trust and alleviating concerns around privacy breaches.
Increasing demand for more accessible and detailed statistical data has led to the development of new methods for dealing with privacy and data confidentiality threats. This work resulted in new and innovative dissemination strategies: remote access data enclaves, and web based dissemination tools such as flexible or automated table generators and remote analysis servers.
Shlomo discusses how the development of flexible table generators is driven by demand from policy makers and researchers. These tools allow users to define their own outputs from a set of pre-defined variables and categories. The automatically generated table is checked against a list of criteria and, if these criteria are met, it is released to the researcher without the need for human intervention.
Remote analysis servers are online systems that provide similar outputs to flexible table generators, without the need for human intervention. With remote analysis servers all Statistical Disclosure Control (SDC) methods are applied pre- or post-tabulation based on the rules and thresholds programmed in the system (see Shlomo and Ritchie and Smith).
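The automated check described above can be sketched with a single criterion, a minimum cell count in the spirit of the threshold rule. This is a minimal sketch, not the logic of any actual table generator: the threshold of 3, the function names and the decision to leave zero cells unflagged are all assumptions.

```python
def unsafe_cells(table, threshold=3):
    """Return (row, column) positions of cells that fail the threshold rule.

    Assumption: a cell is flagged when its count is non-zero but below the
    threshold. Zero cells are not flagged here, although empty cells can
    raise their own disclosure issues and a real system may treat them too.
    """
    return [
        (i, j)
        for i, row in enumerate(table)
        for j, count in enumerate(row)
        if 0 < count < threshold
    ]


def can_release(table, threshold=3):
    """A table passes the automated check only if no cell is flagged."""
    return not unsafe_cells(table, threshold)


if __name__ == "__main__":
    # Hypothetical user-defined table of counts.
    requested_table = [[12, 4, 0], [2, 9, 15]]
    print(unsafe_cells(requested_table))  # positions needing treatment
    print(can_release(requested_table))
```

A real system would typically combine several criteria and apply a treatment (for example suppression or perturbation) rather than simply refusing the table.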
The challenges for the future development of SDC methods are to examine the potential of privacy guarantees pertaining to differential privacy and to develop applications for new and innovative statistical dissemination strategies.
|Active learning||A branch of semi-supervised learning where at each iteration the system identifies the unlabeled items that are most likely to be informative and asks a user to provide labels/responses from which it can learn.|
|Active Server Pages||A Microsoft technology for creating web pages with embedded scripts or programs that perform some computation before returning results to the user.|
|Aggregate data||Record level data summed to create a table. Aggregate data includes frequency tables and magnitude tables.|
|Anonymisation||A process that transforms personal and/or identifiable data into unidentifiable data.|
|Approved researcher||Researchers with permission to access personal information to assist in statistical research. They must meet criteria under the Statistics and Registration Service Act (2007) to be accredited as an approved researcher.|
|Attribute disclosure||A type of disclosure where an intruder can discover new information about a specific individual, household or business. This form of disclosure usually occurs in tabular data releases and arises from the presence of empty cells either in a released table or linkable set of tables after any subtraction has taken place.|
|Base classifiers||The individual models within an ensemble of machine learning models (see ensemble learning).|
|Bayesian latent model||A Bayesian model where the parameter space includes latent variables; used in classical record linkage.|
|Binary decision||A decision in machine learning that can take two possible values, such as true/false or linked/not linked.|
|Confidentiality||The right or expectation of an individual or organisation to not have information about them disclosed.|
|Comparison vector||Given a set of F common features present in two datasets A and B, and two records a (from A), and b (from B), the comparison vector C is a list of length F, containing the differences between the values of a and b for each feature.|
|Data divergence||The sum of all differences between two data sets (in format or granularity, or due to variations in coding practice, errors on one dataset or the other etc.).|
|Distributed access||A set of protocols that allow organisations to control the extent to which remote users have access to their systems and data, in order to preserve confidentiality.|
|Distributed data||Datasets which are physically distributed over several locations.|
|Dominance rule||See (n, k) rule.|
|End User Licence (EUL)||A licence used for data, normally microdata, that have been de-identified and partially anonymised. The user cannot attempt to identify an individual, nor claim to have (inadvertently) identified an individual.|
|Ensemble learning||A machine learning approach which creates several different models (base classifiers) and then combines their predictions by voting etc. Different models will make different errors, therefore combining the votes of several models should improve accuracy.|
|Euclidean distance||The straight-line distance between two points calculated as the square root of the sum of the squared distances for each feature/dimension. Best known in 2 dimensions via Pythagoras’ theorem relating the lengths of the sides of a right-angled triangle.|
|Expectation Maximisation (EM)||A statistical method for estimating the most likely value for parameters in a machine learning model that represents a dataset.|
|Feasibility intervals / region||In an optimisation problem, interval or region containing all values of the variables that satisfy the constraints of the problem. The optimal solution must be in the feasibility interval/region.|
|Five Safes||A protocol to ensure that personal information stored at the ONS (and other organisations) is secure. The safes are Safe People, Safe Projects, Safe Settings, Safe Outputs, Safe Data.|
|Frequency tables||Tables of counts often used to display data collected in surveys and censuses. Each cell in a table represents the frequency or count of the defined combination of categories.|
|Group disclosure||A type of disclosure usually seen in tabular data when information about a small group can be determined. If all respondents in a group fall into a sensitive category then group disclosure is possible.|
|Global data environment||Theoretically, the sum of all data in the world. Pragmatically, the set of all data that might be linked to a given dataset. The concept is most relevant for open data releases.|
|Herfindahl concentration index||A measure of the relative size of different firms in a market. Calculated as the sum of the squares of their market share.|
|Heuristic parameter choice||Choice of a parameter by trial and error or a rule of thumb.|
|Hypergraph partitioning||A hypergraph is a set of nodes (points of intersection), with a set of edges (hyperedges) connecting two or more nodes. An example is a simple 2-D table with row and column totals. Hypergraph partitioning is the problem of subdividing a hypergraph whilst minimising the number of hyperedges that link partitions.|
|Identification dataset||Data set containing identifier attributes that is linked to an anonymised dataset with the goal of reidentifying the subjects to whom the records in the latter dataset correspond.|
|Identification disclosure||The act of identifying a person or statistical unit in the table. This identification could lead to the disclosure of potentially sensitive information about the respondent.|
|Identification key / key variables||A small number of key variables which can be linked to determine whether a record is a sample unique. These variables are usually visible and can assist in identifying respondents in the dataset.|
|K-anonymity||A measure used to assess whether there is sufficient uncertainty within a microdata set. There must be at least K records within the de-identified microdata set that have the same combination of indirect identifiers.|
|L-diversity||An extension of k-anonymisation. Each sensitive variable contains at least L categories with one or more records.|
|Linear programming||A set of mathematical optimisation methods to determine the best result of a mathematical model given a set of constraints represented by linear relationships.|
|Machine learning||An application of artificial intelligence that allows systems to learn automatically and improve from experience without being explicitly programmed. Machine learning algorithms receive input data, using statistical analysis to look for patterns in the data whilst constantly updating outputs as new data become available.|
|Magnitude data||A variable in a dataset where the cell entries are continuous variables such as trade with a particular country, number of employees etc.|
|Magnitude tables||Summed (or averaged) magnitude data. These are often used to display data from business surveys. Each cell would represent a total value for the businesses which are contributors to that cell.|
|Manhattan distance||Also known as city-block distance. Calculated as the sum of the absolute differences in values for each feature/dimension.|
|Markov matrix||Also known as a stochastic matrix, transition matrix or probability matrix, it is a square matrix containing the probabilities of transition between several states.|
|Meta-heuristic feature selection approaches||The use of artificial intelligence optimisation methods such as evolutionary algorithms to find ‘good’ subsets of features from a dataset. Often used to improve the performance of machine learning algorithms when datasets contain irrelevant and/or highly correlated features.|
|Microdata||Microdata (also known as record-level data or row-level data) are data on the characteristics of units of a population, such as individuals, households, or establishments, collected by a census, sample survey, or from administrative data. The data are in the form of tables where each row corresponds to an individual person, business, other statistical unit or event.|
|Microdata Review Board||Microdata Review Boards are situated within organisations releasing statistical data to inform decisions about releasing microdata and the mode of access.|
|Multiple imputation||Multiple imputation is a method for assigning values to missing data in a sample, based on a model of the data that are available. Although it was initially proposed to deal with missing data, multiple imputation has also been used to generate synthetic data.|
|Naïve Bayes learning||A form of machine learning that creates classifiers that give the probability of different labels for an observation. They apply Bayes’ rule under the assumption that the values of descriptive features are independent.|
|(n,k) rule||Applied to each cell in a magnitude table. Under this rule a cell is regarded as unsafe if the n largest units contribute more than k % to the cell total.|
|Non-parametric methods||These methods seek to make inferences on a data sample without making any hypothesis on the distribution of that sample (as opposed to parametric modelling).|
|Non-perturbative methods||The appearance of the data (but not the data itself) is changed. This includes methods such as table redesign and suppression.|
|NP-hard||In computer science, NP-hardness (non-deterministic polynomial-time hardness) denotes a problem that is at least as hard as the hardest problem in NP. In other words, NP-hard problems are those that are likely to be unsolvable in polynomial time.|
|Ontology||A set of concepts and categories in a subject area or domain that shows their properties and the relations between them. It can be roughly understood as a taxonomy or a classification of concepts.|
|Parametric modelling||Consists of adjusting to a data sample a family of distributions that is hypothesised to fit the data. The adjustment process entails finding the distribution parameters that best fit the data.|
|Perturbative methods||Changes have been made to some values of the data. This includes methods such as rounding and the addition of noise.|
|Privacy||Privacy is breached if a unit in the data can be identified through unique or rare combinations of variables. Privacy is applicable to data subjects whereas confidentiality applies to data.|
|Privacy preserving record linkage||Methods for efficiently linking records from different datasets without compromising the confidentiality of the information included in either.|
|Pseudo-F metric||Measure of the quality of clustering found by an unsupervised learning method. It describes the ratio of between-cluster variance to within cluster variance.|
|Pseudonymisation||The initial step when protecting microdata. This is the act of removing direct identifiers from a record and replacing with another code (such as a row number).|
|Quasi-identifying variables||Variables which identify individuals indirectly such as age, gender, occupation, place of residence.|
|Random noise||Random numbers that are generated to be added to raw data values. Noise may be positive or negative. It is often chosen from a fixed statistical distribution centered at 0, e.g. the Laplace distribution.|
|Record-level data||See microdata.|
|Relational databases||A standard way of storing data in a set of tables, designed to minimise duplication of information.|
|Remote Access||Online access to disclosive microdata. Analysis can be carried out remotely by the researcher but no data are downloaded to the researcher’s computer and outputs will be checked by the organisation with responsibility for the data prior to release.|
|Row-level data||See microdata.|
|R-U (risk-utility) map||A graphical representation of the trade-off between disclosure risk and data utility. The aim is to use a method of disclosure control which maintains as much utility as possible.|
|Secure Research Service||The Secure Research Service (SRS), formerly the Virtual Microdata Laboratory (VML), is an ONS facility for providing secure access to sensitive detailed data, typically microdata.|
|Seeds||The initial set of items with labels in semi-supervised machine learning.|
|Self-learning||A branch of semi-supervised learning where at each iteration the system makes predictions for the unlabeled items. It then accepts those predictions for which it has greatest confidence, to expand the training set for the next iteration of learning.|
|Semi-supervised learning||A form of model learning that makes use of a set of training examples, where the correct response is only available for some of them.|
|Server-side web page delivery||Technologies that perform some processing on the webserver before returning results to a user on a client machine. Found in many online business platforms.|
|Supervised learning||A form of model learning that makes use of a set of training examples, alongside the correct response for each. The differences between the system’s outputs and the true outputs are used to adapt the learned model.|
|Synthetic data||Data that do not relate to real statistical units but have the look and structure of real data. They will have been generated from one or more population models, designed to be non-disclosive, and used either for teaching purposes, for testing code, or for use in developing methodology.|
|Topology||Mathematical description of the shape and structure of a landscape or space.|
|Threshold rule||A cell in a table of frequencies is defined to be sensitive if the number of respondents is less than some specified number (between 3 and 5 are commonly used).|
|Uniform Resource Identifier||A Uniform Resource Identifier (URI) is a string of characters that unambiguously identifies a particular resource. The most common form of URI is the Uniform Resource Locator (URL), frequently referred to informally as a web address (W3C definition).|
|Univariate distribution||A probability distribution of a one-dimensional random variable.|
|Unsupervised learning||A form of model learning that makes use of a set of training examples where there are no labels or responses. Typically used for clustering. Model adaptation is driven by statistical measures describing the ‘quality’ of the clustering.|
|Weak learners||Simple machine learning models whose performance is better than random guessing, but may not be extremely accurate.|
|Within group disclosure||Occurs in tables when there is one respondent in a single category, with all other respondents in a different category. This would enable the single respondent to be an intruder and find out additional information about those members of the other category.|
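The (n,k) rule defined in the glossary above can be written as a short check on the contributions to a single magnitude-table cell. This is a minimal sketch: the default values n = 1 and k = 75, the function name and the example contributions are assumptions for illustration, and contributions are assumed to be non-negative.

```python
def is_unsafe_nk(contributions, n=1, k=75.0):
    """Apply the (n,k) dominance rule to one magnitude-table cell.

    The cell is regarded as unsafe if its n largest contributors account
    for more than k per cent of the cell total.
    """
    total = sum(contributions)
    if total <= 0:
        return False  # assumption: empty or zero-valued cells are not flagged
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total > k


if __name__ == "__main__":
    # Hypothetical turnover values for the businesses in two cells.
    print(is_unsafe_nk([90, 5, 5]))    # one business holds 90% of the total
    print(is_unsafe_nk([40, 30, 30]))  # largest business holds 40%
```

Cells flagged in this way would be candidates for the primary suppression described earlier in the review.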
To find out more about this Data and Analysis Method Review please email DQHub@ons.gov.uk