Disclosure Limitation Techniques
Techniques for disclosure limitation can be classified in a number of ways. For the purposes of this paper, techniques are grouped into information limiting methods and data perturbation methods.1 Information limiting methods are those that delete, mask, suppress, or obscure data fields or values in order to prevent re-identification. Data perturbation methods are those that use statistical means to alter either the underlying data itself or query results drawn from the data.
Information Limiting Techniques
PII, Anonymization and the Re-identification Problem
The simplest method of disclosure limitation is to strip PII from a dataset, removing all fields (or suppressing or masking these fields in some way) that could directly and uniquely identify an individual, such as name, Social Security number, and phone number. Through the 1990s and into the 2000s, PII was often used as something of a bright-line approach to data anonymization. It defined what data needed to be protected, with the remaining data fields considered harmless for disclosure from a privacy perspective. However, in the mid-2000s it became clear that a wide range of other data categories can be used to identify individuals.2
The GDPR expands the scope of protected information beyond PII, instead using the term “personal data”—a broader range of potentially identifying information as defined in Article 4(1):
‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.
Most, if not all, privacy laws rely on some concept of data being either personally identifiable or not to determine whether the law applies. As the GDPR continues, “The principles of data protection should therefore not apply to anonymous information, namely information that does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” Defining “identifiable” data, however, becomes complicated by the possibility of re-identification attempts.
Re-identification is the matching of anonymized data back to an individual. In recent years,3 faith in anonymization has been greatly shaken by studies demonstrating re-identification of released data.4 Two high-profile examples are the 2006 Netflix Prize study5 by Arvind Narayanan and Vitaly Shmatikov, which re-identified individuals from Netflix’s release of over 100 million user ratings of movies, and a 2009 Social Security number study by Alessandro Acquisti and Ralph Gross showing that data about an individual's place and date of birth can be used to predict their Social Security number.6
Most commonly, re-identification is performed using external databases to infer information about the anonymized data (known as “linkage attacks”). The Narayanan and Shmatikov Netflix study identified a subset of individuals by cross-referencing the Netflix data with non-anonymized movie ratings from the Internet Movie Database (IMDb). The Acquisti and Gross study used the Social Security Administration’s Death Master File to detect statistical patterns in SSN assignment, and used multiple sources for inferring birthdate, such as voter registration lists, online white pages, and social media.
Researchers have developed new anonymization frameworks such as k-anonymity, l-diversity, and t-closeness to more formally protect against re-identification. Each framework relies on its own assumptions and limitations, and therefore each protects against only certain types of attacks. k-anonymity, for example, requires that at least k records share any given combination of quasi-identifiers (attributes that are not direct identifiers, but might potentially contribute to identification, such as age and sex). In a 3-anonymous table containing medical condition by zip code and age, every combination of zip code and age values must appear at least three times. This can prevent the identification of a specific individual’s record in a table, but is still susceptible to what is known as a homogeneity attack. If an individual’s quasi-identifier values are known, then even though their actual record may not be identified (as there are at least k records that all share the same quasi-identifier values), their presence alone in a certain dataset could be confirmed and could reveal sensitive information about that person. Consider again a table of medical conditions 3-anonymized by zip code and age. If it happened that the only three records for a certain zip code and age combination all had a heart condition diagnosis, it would be possible to deduce information about anyone matching that zip code and age. If we know Bob’s age, where he lives (and thus his zip code), and knew he had been in the hospital, we would then be able to tell that it was because of a heart condition.7
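As a concrete illustration, here is a minimal Python sketch (using an invented toy table) of checking whether a set of records satisfies k-anonymity over a given set of quasi-identifiers:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears
    in at least k records."""
    combos = Counter(tuple(r[qi] for qi in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

# Toy table: medical condition by zip code and age bracket.
table = [
    {"zip": "10001", "age": "30-39", "condition": "heart disease"},
    {"zip": "10001", "age": "30-39", "condition": "heart disease"},
    {"zip": "10001", "age": "30-39", "condition": "heart disease"},
    {"zip": "20002", "age": "40-49", "condition": "flu"},
    {"zip": "20002", "age": "40-49", "condition": "asthma"},
    {"zip": "20002", "age": "40-49", "condition": "flu"},
]

print(is_k_anonymous(table, ["zip", "age"], 3))  # True: each combination appears 3 times
```

Note that this toy table is 3-anonymous even though every record in the first group shares the same diagnosis; k-anonymity says nothing about that, which is exactly the homogeneity attack described above and the gap that l-diversity targets.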
While anonymization has come under scrutiny as a method of privacy protection, the concepts of anonymity and identification are still key to the applicability and interpretation of privacy laws. Difficult questions concerning de-identification will likely be around for the foreseeable future. As techniques for de-identification improve, so will methods of attack. Ongoing research continues to provide techniques for evaluating and measuring re-identification risks,8 but it is unclear whether such methods can adequately account for external data sources that may become available in the future. Datasets cannot be taken back once released, so even if data is anonymized effectively based on current standards, future techniques could remove protections. Additionally, datasets are becoming more detailed and increasingly longitudinal (covering greater periods of time), and these higher-dimension, long-term datasets will be more challenging to effectively anonymize.
Anonymization of data is perhaps best considered through a risk management perspective. Removing or obscuring potentially identifying data can be of benefit,9 not in providing certainty of de-identification, but in lowering risks when combined with other privacy and security controls. Anonymization techniques should not be a stand-alone approach to privacy, but rather one tool in the disclosure limitation toolkit. Other methods, as discussed in this paper, can be combined with anonymization to lower risks to acceptable levels.
Aggregating, Coarsening, and Suppressing Data
Beyond removing or masking direct PII identifiers, a number of disclosure limitation techniques have been developed for obscuring data in some manner, by coarsening, rounding, aggregating, or suppressing data.
Data coarsening (or generalization) techniques reduce the detail of the data so that individuals whose information is reflected in categories with low n-count values10 cannot be uniquely identified. One such technique involves top- and bottom-coding methods to place bounds on the reporting of data in order to prevent identification of outliers. Essentially, this means broadening data categories to include more unusual or extreme values, so they are not listed on their own, and thus potentially identifiable. For example, if only one individual in a dataset were age 99, their age could be recoded to a broader 90+ category.
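A top-coding rule like the one just described can be sketched in a few lines of Python (the age cap of 90 is simply the example value from above):

```python
def top_code(age, cap=90):
    """Recode any age at or above the cap into a single broad category,
    so an outlier like a lone 99-year-old is no longer individually visible."""
    return f"{cap}+" if age >= cap else str(age)

ages = [34, 57, 99, 88, 91]
print([top_code(a) for a in ages])  # ['34', '57', '90+', '88', '90+']
```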
Cell suppression is the withholding of information in the cell of a table output, based on some threshold rule for what counts or aggregates would implicitly or explicitly reveal confidential information. For table cells in which the data would allow estimating a single individual’s value too closely, missing (or imputed) values are displayed. Suppression is often used when there are very few values contributing to a cell, or when one or two large values are the dominant contributors to the aggregate statistics. The tables below show the original data and a very simple suppression of the cell showing that two women in the dataset live in New York City.
|  | NYC | DC | LA | Total |
|---|---|---|---|---|
| Male | 10 | 14 | 6 | 30 |
| Female | 2 | 12 | 9 | 23 |

|  | NYC | DC | LA | Total |
|---|---|---|---|---|
| Male | 10 | 14 | 6 | 30 |
| Female | * | 12 | 9 | 23 |
Whether across the rows and columns of a single table or across multiple tables, this “primary” cell suppression alone does not always protect the data. Confidential values of primary cells can be determined by comparing and subtracting other values. In such cases, secondary (or “complementary”) cell suppression is necessary, in which additional values are also displayed as missing. In the example above, the suppressed value of 2 could be recovered by simply subtracting 12 and 9 from 23. Thus, as shown in the following table, a secondary suppression would be made.
|  | NYC | DC | LA | Total |
|---|---|---|---|---|
| Male | 10 | 14 | 6 | 30 |
| Female | * | 12 | * | 23 |
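The two-stage logic above can be sketched in Python. This is a simplified illustration that only guards row totals (production suppression routines must also consider column totals and linked tables), and the threshold of 3 is an assumed rule:

```python
def suppress(rows, threshold=3):
    """Primary suppression: mask any cell below the threshold.
    Secondary suppression: if a row ends up with exactly one masked cell,
    also mask the smallest remaining cell, so the first cannot be
    recovered by subtracting visible cells from the row total."""
    masked = [[None if v < threshold else v for v in row] for row in rows]
    for cells in masked:
        hidden = [i for i, v in enumerate(cells) if v is None]
        if len(hidden) == 1:
            visible = [i for i, v in enumerate(cells) if v is not None]
            cells[min(visible, key=lambda i: cells[i])] = None
    return masked

# The example from the text: city counts for the Male and Female rows.
print(suppress([[10, 14, 6], [2, 12, 9]]))  # [[10, 14, 6], [None, 12, None]]
```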
Data Perturbation Methods
Beginning in the 1970s, as computing moved beyond the days of punch cards and mainframes, growing numbers of researchers and others were able to easily access and query databases. As a result, computer scientists began thinking about the associated risks to privacy and how to prevent queries from revealing information about particular individuals. An initial approach was to restrict or audit database queries, blocking those that would return data that could identify individuals. The simplest such limit allows users to run only queries that return aggregate, or statistical, results, and prevents queries that would return individual, unaggregated records. For example, developers could design a system so that queries only return results drawn from at least a threshold number of records, n; with an n of two, a query that would return data from a single, individual record would be blocked. The problem is that multiple queries can use differentials and overlapping sets (calculating values across query results by, say, subtracting the results of one query from another) to obtain confidential information. A simple example is running a query seeking the total income for all individuals in the data, and then running a second query seeking the total income for all individuals in the data excluding one certain individual. Subtracting the second result from the first provides the income of that individual.
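The differencing attack just described is easy to demonstrate. In this sketch, a hypothetical system only ever answers aggregate sum queries, yet two of them combine to reveal one person's income (the names and figures are invented):

```python
# Hypothetical confidential table of incomes.
incomes = {"Alice": 85000, "Bob": 72000, "Carol": 91000}

def total_income(exclude=None):
    """An 'aggregate-only' query: returns a sum, never a single record."""
    return sum(v for name, v in incomes.items() if name != exclude)

# Each query alone looks safe, but their difference is Bob's exact income.
print(total_income() - total_income(exclude="Bob"))  # 72000
```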
To guard against these re-identification attempts, researchers can employ techniques that prevent the release of statistics when the number of records common to a set of queries exceeds a given threshold. Further research has produced increasingly sophisticated methods of query restriction. Ultimately, however, query restriction provides no real guarantee of privacy against a sophisticated user, who could potentially generate sets of queries that eventually reveal information about an individual.11
The other approach researchers developed to protect databases is data perturbation—either altering the actual, underlying data or altering query outputs to protect confidentiality. This usually involves adding statistical noise—altering values of the data while still maintaining the statistical relationships between data fields. There is a long history of use of perturbation in disclosure limitation, and there are a number of perturbation approaches and techniques.
Data swapping, first proposed by Tore Dalenius and Steven Reiss in the late 1970s,12 is a technique through which researchers use statistical models to find pairs of records with similar attributes and switch personally identifying or sensitive data values between the records. After this manipulation, outside researchers or attackers will not be able to determine which values correspond to which individuals. However, the aggregate values of the data are preserved sufficiently to enable researchers to make statistical inferences.
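A minimal sketch of the idea in Python: records are grouped on a shared attribute and a sensitive value is exchanged within each pair, so the overall distribution of sensitive values is unchanged while record-to-value links are broken. Real swapping implementations use statistical matching models rather than this naive grouping, and the records here are invented:

```python
import random

def swap_sensitive(records, match_key, sensitive, seed=0):
    """Pair up records that share match_key and swap their sensitive
    values. Aggregates over the sensitive field are preserved exactly,
    since the multiset of values never changes."""
    rng = random.Random(seed)
    groups = {}
    for r in records:
        groups.setdefault(r[match_key], []).append(r)
    for group in groups.values():
        rng.shuffle(group)
        for a, b in zip(group[::2], group[1::2]):
            a[sensitive], b[sensitive] = b[sensitive], a[sensitive]
    return records

people = [
    {"age_bracket": "30s", "salary": 50},
    {"age_bracket": "30s", "salary": 70},
    {"age_bracket": "40s", "salary": 60},
    {"age_bracket": "40s", "salary": 80},
]
swapped = swap_sensitive(people, "age_bracket", "salary")
print(sorted(r["salary"] for r in swapped))  # [50, 60, 70, 80] -- aggregates intact
```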
Other perturbation techniques involve changing the values of the data by some random amount, while still preserving the underlying statistical properties of the data within a certain range. For example, the perturbation technique of additive noise works by replacing true values x with the values of x+r, drawing value r from some distribution. The r amounts are such that the replacement x+r values preserve the statistical relations between data records.
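A minimal sketch of additive noise, here drawing r from a zero-mean Gaussian (the distribution, scale, and sample values are illustrative choices; real deployments calibrate the noise to the data):

```python
import random

def add_noise(values, sigma=1.0, seed=42):
    """Replace each true value x with x + r, where r ~ N(0, sigma^2).
    Because the noise is zero-mean, means and other aggregate statistics
    are approximately preserved over large samples."""
    rng = random.Random(seed)
    return [x + rng.gauss(0, sigma) for x in values]

true_values = [52.0, 47.5, 60.2, 49.9]
noisy_values = add_noise(true_values)  # each value shifted by a small random amount
```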
While perturbation has been used extensively and effectively by the Census Bureau and other organizations, perturbation methods do have weaknesses. If strongly correlated attributes can be found in real-world data, this correlation can potentially be used to filter out the additive randomization.13 The same can occur if someone has certain background knowledge about the data. Additionally, a number of mathematical and statistical techniques, such as spectral filtering, can be used to separate the random noise from perturbed data, recovering the original values.14
Synthetic Data
Synthetic data are datasets that seek to replicate the statistical properties of real-world datasets, serving as analytical replacements. Synthetic datasets can be either fully synthesized, with all of the original dataset values generated synthetically, or partially synthesized, with only certain fields or a portion of records synthesized. Often, the data must be presented in the same form and structure as the original data in order to be compatible with existing systems, algorithms, and software. Created through various modeling techniques,15 synthetic data differs from perturbation techniques that alter the original, underlying data as discussed above. Instead, synthetic data creates completely new data by using models that fit the original data (or by using defined parameters and constraints) to generate statistically comparable data independent from the underlying, real-world data.
Synthetic data can greatly reduce the risks of re-identification through ancillary datasets, since matching synthetic records against external databases is difficult. The ability to adjust the model also provides an additional confidentiality advantage. Researchers modeling the data can make decisions about which relationships of the real-world data will be preserved. Relationships omitted from the model will not be discoverable by analysts, since they will not be present in the synthetic data, making it possible to keep certain data correlations, and the sensitive information they might reveal, confidential. For example, if the correlation of interest in a research dataset is between gender, age, and health condition, that could be the only correlation preserved in the synthetic dataset. Other data fields might be generated and included in this dataset, modeled to provide broad, aggregate statistics, but without statistically significant correlations to combinations of certain other variables, correlations that could predictively reveal information about individuals.
There are still confidentiality concerns that need to be taken into account with synthetic data, however. Synthetic datasets can potentially leak the underlying data if the model fits too closely. For example, if the synthetic data has enough different fields per individual, and the model is closely fitted to the original data, outliers can potentially be identifiable. However, there is ongoing research aimed at developing the means to generate synthetic data in a way that would provide formal privacy guarantees (discussed in the section on differential privacy below).16
Adversarial machine learning, the use of machine learning techniques to probe models for vulnerabilities, can potentially be used to determine information about a record used in modeling the synthetic dataset (i.e., the “real-world” data). However, those seeking to uncover the information would also need key information about the model used to create the data.17
Apart from its value in disclosure limitation, another potential benefit of synthetic data is the ability to generate large volumes of research data at low cost.18 Machine learning requires running large volumes of training data through algorithms, and synthetic data may have great value as a way to rapidly provide these large volumes of data. These datasets contain no real-world data, but are statistically similar enough to real-world data to be of training value. While companies such as Google and Facebook generate large datasets as part of their business, smaller companies may be able to use this synthetic data to jump-start a machine learning program without collecting data about real people. From a privacy standpoint, using synthetic data to train artificial intelligence is attractive in that it avoids the need for collecting, storing, and using real-world data in the large amounts needed for machine learning.
Differential Privacy
A group of prominent computer scientists first introduced the concept of differential privacy (also known as “formal privacy”) in their 2006 paper, Calibrating Noise to Sensitivity in Private Data Analysis,19 although precursors to the technique go back decades. It is not a single tool or method, but rather a privacy standard that provides formal mathematical guarantees of privacy that can be implemented in various ways. Differential privacy’s guarantee is that an adversary can learn virtually nothing more about an individual based upon disclosures from a dataset than they would learn if that person’s record were not included in the dataset. In other words, whether or not your personal data is included, resulting outputs from a dataset would be approximately the same. Strong privacy protection is provided, while still allowing an analyst to derive useful statistical results. Differential privacy provides a promising solution to database reconstruction and re-identification attacks, as it would be highly difficult to link the noisy, approximate results to external data sources.
In practice, differential privacy works by injecting a precisely calculated amount of statistical noise into query results (so it can essentially be thought of as a perturbation method applied to query outputs). What differential privacy provides is an approximation of the true value; the exact same query could produce two different answers.20 The difference between the data value provided in differentially private outputs and the real-world value can be tuned to be larger or smaller (via what is known as the privacy loss parameter), but at a trade-off between accuracy and privacy. Differential privacy defines privacy risk as an allowable leakage of data on an individual in comparison to a hypothetical database without that individual. The allowed deviation between outputs from data that includes an individual and data that does not is usually represented as ε (epsilon).
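As a concrete illustration, a counting query ("how many people in the dataset have condition X?") has sensitivity 1: adding or removing one person changes the answer by at most 1. The canonical Laplace mechanism then adds noise drawn from a Laplace distribution with scale 1/ε. A minimal Python sketch, using the fact that the difference of two independent exponential draws is Laplace-distributed (the count of 1019 is an invented example):

```python
import random

def dp_count(true_count, epsilon, seed=None):
    """Return an epsilon-differentially private version of a count.
    The difference of two independent Exponential(epsilon) samples is a
    Laplace(0, 1/epsilon) sample, the noise scale the Laplace mechanism
    requires for a sensitivity-1 query."""
    rng = random.Random(seed)
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise

# The same query asked twice returns two different, nearby answers.
print(dp_count(1019, epsilon=0.5))
print(dp_count(1019, epsilon=0.5))
```

Smaller ε means more noise (stronger privacy, lower accuracy); larger ε means the reverse.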
While an individual’s personal information is almost irrelevant to the outputs produced via differential privacy, some insignificantly small change in belief about an individual can potentially be made based on the information released. The probability that some inference can be made about an individual is at most 1 + ε times the probability that the inference could be made without the individual’s data. For example, if the baseline probability of an individual developing a certain disease is 3 percent (say, for a female in the United States), then with an ε of 0.01, the known probability under differential privacy would rise from 3 percent to at most 3.03 percent (3 × (1 + ε)).21 Put another way, the difference in the outputs between a dataset with the individual included and one without is at most 0.03 percent.
Differential privacy also measures and bounds the total privacy loss over multiple analyses. Part of applying differential privacy involves establishing a “privacy budget,” which limits the overall amount of data that can be disclosed. Setting this budget requires determining the cumulative risk of data disclosures over the lifespan of the data. With all disclosure limitation techniques, there is no avoiding the fundamental fact that when multiple analyses are performed using an individual’s data, disclosure risk increases by some amount. Each statistical release, or query, under differential privacy thus leaks some small amount of potentially private information. While risk does increase with each release, the privacy budget ensures that it accumulates in a bounded way. Queries are analyzed to determine their privacy cost (ε) and whether the remaining balance of the privacy budget (the total budget minus a running tally of ε over all queries) is sufficiently high to run them. Setting a privacy budget thus returns us to the ever-present trade-off between the informational value of the data and confidentiality: potentially releasing identifiable information (if the privacy budget is set too high) versus data releases not being informationally useful (if the budget is set too low). Methods for optimally calculating the privacy budget are an area of current research.22
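The bookkeeping described above can be sketched as a simple accountant using basic sequential composition (total privacy loss is at most the sum of the per-query ε values); the overall budget of 1.0 and the per-query costs here are made-up numbers:

```python
class PrivacyBudget:
    """Track cumulative privacy loss and refuse any query that would
    push the running total of epsilon past the overall budget."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted; query refused")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.4)  # first query runs
budget.charge(0.4)  # second query runs
# A third charge of 0.4 would raise: only 0.2 of the budget remains.
```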
Differential privacy has recently seen a number of real-world uses by companies. Uber uses differential privacy to protect internal analyses, such as those done on driver revenue.23 Apple is using differential privacy to protect user privacy while improving the usability of features such as lookup hints.24 Federal agencies such as the Census Bureau are also beginning to adopt differential privacy.25 Additional tools for making differential privacy use more accessible are under development, with some being provided open-source, such as Google’s differential privacy kit, which is available via GitHub and allows users to calculate differentially private simple statistics from a dataset.26
Differential privacy stands as one of the most promising disclosure limitation techniques, one that can provide formal, mathematical assurances of privacy while unlocking valuable research data. However, like any other disclosure limitation method, it is not an absolute assurance. In addition to concerns about correctly calculating the privacy budget, covert-channel attacks can potentially be mounted against differentially private query systems and need to be protected against. Information other than the query values themselves, such as the time a query takes to complete, could reveal information such as the presence of an individual in a database.27 For example, a query looking for an individual in a dataset of cancer patients may take one second to run if the individual is not present, versus a half hour if the individual is in the dataset.
Citations
- There is some blurring of lines between these two categories.
- Starting in 2006, a number of studies demonstrated the ability to re-identify individuals in publicly released, anonymized data. These include re-identifications from the 2006 AOL release of users' search queries (see Michael Barbaro and Tom Zeller Jr., "A Face Is Exposed for AOL Searcher No. 4417749", New York Times, Aug. 9, 2006, source), the Massachusetts government’s release of state employee hospital visit data (see Daniel Barth-Jones, “The 'Re-Identification' of Governor William Weld's Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now”, SSRN (July 2012), source), and the two other examples discussed below in this paper.
- A 2016 comprehensive review of re-identification attacks found that 72.7% of all successful attacks have taken place since 2009. Jane Henriksen-Bulmer and Sheridan Jeary, "Re-identification Attacks—A Systematic Literature Review", International Journal of Information Management 36 (December 2016): 1184-1192, source
- Ohm, 1716-1720
- Arvind Narayanan and Vitaly Shmatikov, "Robust De-anonymization of Large Sparse Datasets", SP '08: Proceedings of the 2008 IEEE Symposium on Security and Privacy (May 2008): 111–125, source
- Alessandro Acquisti and Ralph Gross, "Predicting Social Security numbers from public data", Proceedings of the National Academy of Sciences 27 (July 2009): 10975-10980, source
- Methods to protect against such attacks continue to be developed, however. See for instance Qian Wang, Zhiwei Xu and Shengzhi Qu, "An Enhanced K-Anonymity Model against Homogeneity Attack", Journal of Software 6 (October 2011): 1945-1952, source, and Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam, "l-Diversity: Privacy Beyond k-Anonymity", ACM Transactions on Knowledge Discovery from Data 1 (March 2007): 1-52, source
- See, for instance, on qualitative and quantitative risk measurement for randomized control trial data, Parveen Kumar and Rajan Sareen, “Evaluation of Re-identification Risk for Anonymized Clinical Documents”, Canadian Journal of Hospital Pharmacy 62 (July–August 2009): 307-319, source
- In revisiting Latanya Sweeney’s well known re-identification study showing that 87% of the US population could be identified by gender, date of birth, and ZIP code, the authors found that re-identification would drop to .02% by replacing date of birth with month and year only, and zip code with county. Philippe Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population”, Proceedings of the 5th ACM workshop on Privacy in Electronic Society (October 2006): 77–80, source
- When data is broken down by categories in tables, only a few individuals (“low n” for number of individuals) may fall into some of the categories, such as only one or two students of a certain race and gender being in a particular college program.
- See Appendix B in Irit Dinur and Kobbi Nissim, “Revealing Information while Preserving Privacy”, Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (June 2003): 202–210, source
- Tore Dalenius and Steven P. Reiss, "Data-swapping: A Technique for Disclosure Control", Journal of Statistical Planning and Inference 6, no 1 (1982): 73-85, source
- See Kun Liu, Chris Giannella, and Hillol Kargupta, “A Survey of Attack Techniques on Privacy-Preserving Data Perturbation Methods”, in Privacy-Preserving Data Mining: Models and Algorithms, ed. Charu C. Aggarwal and Philip S. Yu (New York, NY: Springer, 2008) 359-381, source
- See Songtao Guo and Xintao Wu, "On the Use of Spectral Filtering for Privacy Preserving Data Mining", Proceedings of the ACM Symposium on Applied Computing, Dijon, France, April 23-27, 2006
- See Surendra H and Mohan HS, “A Review of Synthetic Data Generation Methods for Privacy Preserving Data Publishing”, International Journal of Scientific & Technology Research 6 (March 2017): 95-101, source
- See Haoran Li, Li Xiong, and Xiaoqian Jiang, “Differentially Private Synthesization of Multi-Dimensional Data using Copula Functions”, Advanced Database Technology 2014 (2014): 475–486, source, and National Institute of Standards and Technology, “2018 Differential Privacy Synthetic Data Challenge” (accessed Jan 10, 2020), source
- See for instance, Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov, “Membership Inference Attacks against Machine Learning Models”, Proceedings of the IEEE Symposium on Security and Privacy (2017), source
- Methods for generating synthetic data itself via machine learning are currently being developed using techniques such as generative adversarial networks (GAN).
- Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith, “Calibrating Noise to Sensitivity in Private Data Analysis”, in Theory of Cryptography, Lecture Notes in Computer Science, ed. Shai Halevi and Tal Rabin (Berlin: Springer, 2006), 265-284, source
- Providing individuals in the dataset plausible deniability. As an example, for a population, 1,022 may be returned one time and 1,016 another, but this would be irrelevant as part of statistical analyses.
- For the mathematical proof of differential privacy, see Cynthia Dwork, “A Firm Foundation for Private Data Analysis”, Communications of the ACM, 54 (January 2011):86-95, source
- See for instance, Anis Bkakria and Aimilia Tasidou, “Optimal Distribution of Privacy Budget in Differential Privacy”, Risks and Security of Internet and Systems. CRiSIS 2018. Lecture Notes in Computer Science 11391 (2019), source
- source
- Apple Inc., “Differential Privacy Overview”, source
- The Census Bureau’s move to differential privacy is discussed below.
- source
- Andreas Haeberlen, Benjamin C. Pierce, and Arjun Narayan, “Differential Privacy Under Fire”, Proceedings of the 20th USENIX conference on Security (August 2011) source