Abstract

This paper provides an overview of some of the privacy issues involved with data releases (disclosures), and how disclosure limitation techniques can be used to protect the confidentiality of individuals whose data is included in disclosures. It provides an overview of some of the primary methods that have traditionally been used, as well as those that have emerged more recently. It does not aim to be an exhaustive list of disclosure limitation methods, but will hopefully provide pointers to further reading. The Census Bureau, which has been a primary center of developing disclosure limitation techniques, is used as an example of how disclosure limitation is practiced and how it has evolved.

Acknowledgments

We would like to thank the Bill & Melinda Gates Foundation for its generous support of our work. The views expressed in this report are those of its author and do not necessarily represent the views of the foundation, its officers, or its employees.

Protecting Privacy in Data Releases

Introduction

The falling cost of data storage and the spread of the internet have led to an acceleration in the collection of data about individuals. Many organizations, both private and public, gather and store information from a myriad of sources, resulting in an accumulation of data exceeding 40 zettabytes (40 trillion gigabytes) globally.¹ Data holds the potential for substantial gains to society in building knowledge, supporting research, informing policy, and providing information to the public. However, publishing or sharing data creates privacy risks by exposing individuals to potential financial, reputational, and other harms and liabilities. Organizations have ethical and often legal requirements to protect the confidentiality of data, but this involves a tradeoff with the usefulness of the data. Protecting confidentiality necessitates excluding, aggregating, or obscuring the data in some way that reduces its detail and exactness. Thus, a balance must be struck between confidentiality and the informational value of data.

“Disclosure” refers to the release of data by some means, including making it publicly available or available to another entity or individual (such as the sharing of records with a researcher).² The primary privacy concern³ created by disclosure arises when the data released either contains direct personally identifiable information (PII) or contains other fields or aspects that can be used in some way (often in conjunction with other available datasets) to identify a person.⁴ Disclosures may include sensitive information about individuals, but the risk lies in being able to link data in the disclosure to a specific person. “Disclosure limitation” (also known as “disclosure avoidance” and “disclosure control”) refers to the safeguards and statistical methods used to reduce the risk of disclosure of identifiable information in a data release.

Privacy concerns about disclosure have often focused on the release of public-use government data. The Census Bureau, with a primary mission of disclosing data to the public, has been at the forefront of cutting-edge research on and empirical use of methods for disclosure limitation. Other statistical agencies at both the federal and state level also regularly release data, and non-statistical agencies are about to start doing the same, as the 2019 OPEN Government Data Act will require federal agencies to publish much of their information online as open data.

There is also growing concern about how corporations use the data they hold. As corporate data warehouses build in volume and detail over time, they become valuable for discovering information relationships about customers through analytic techniques (a process known as data mining). The potential for derivation of highly sensitive information through data mining carries serious ethical implications.⁵ There are calls for comprehensive laws that would control collection and use of personal data by companies, but even without those laws, we are seeing groundbreaking research into disclosure limitation from many sources in the private sector.⁶

Traditionally, government and private entities seeking to disclose information without creating privacy harms have attempted to provide data in aggregate or anonymized form, so that sensitive information cannot be related back to any particular individual. In recent years, however, it has become clear that traditional techniques of anonymization and aggregation of data are not as privacy protecting as had been thought.⁷ The challenges of balancing the quality and usefulness of disclosures with the fundamental rights of confidentiality and privacy have become much more complex as both technological advances and public perceptions of privacy have changed. There is a wide range of methods for suppressing, aggregating, and obscuring data, all with a goal of creating a release of information that reduces individually identifiable information. However, increases in computing power, the advancement of analytical techniques and sophistication of attacks, the growth of available data sources on individuals, and other factors have weakened the protections of many traditional disclosure techniques. While these older methods are still useful in reducing disclosure risks and continue to be refined, there has been an accelerating shift to the modern, formal disclosure limitation techniques of differential privacy.

This paper provides an overview of some of the privacy issues involved with data disclosures, and how disclosure limitation techniques can be used to protect the confidentiality of individuals whose data is included in disclosures. It provides an overview of some of the primary methods that have traditionally been used, as well as those that have emerged more recently. It does not aim to be an exhaustive list of disclosure limitation methods, but will hopefully provide pointers to further reading. The Census Bureau, which has been a primary center of developing disclosure limitation techniques, is used as an example of how disclosure limitation is practiced and how it has evolved.

Citations

Jeff Desjardins, “How much data is generated each day?” World Economic Forum, April 17, 2019, source
A disclosure is distinguished from a breach, which is an unintentional release of information.
“Confidential” as used in this paper means any information that was not intended to be released as part of data made available, which includes both personal and non-personal information.
In some instances, there may be other disclosure concerns apart from personal identification, such as release of classified information or certain sensitive or proprietary information about an organization or company.
Such as the noted case of Target determining and revealing a teen pregnancy to her father. See Kashmir Hill "How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did", Forbes.com (accessed Dec 9, 2020), source
The field of research in protecting privacy and confidentiality in data mining is known as Privacy Preserving Data Mining (PPDM), and it uses many of the disclosure limitation techniques discussed in this paper. One example of private-sector research is Google’s development of tools for secure multiparty computation and differential privacy.
Paul Ohm, Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, UCLA Law Review 57 (August 2010):1701-1777. source

Laws Governing Disclosure

Data privacy is not covered by a comprehensive law in the United States, though the idea is under active discussion. Instead, a variety of federal and state laws form a patchwork of privacy protections for the disclosure of data in the United States. These include certain sector-specific laws, such as the Health Insurance Portability and Accountability Act (HIPAA) for personal medical data, the Family Educational Rights and Privacy Act (FERPA) for educational records, and the Gramm-Leach-Bliley Act (GLBA) for financial information. Some industry best practice standards, such as the Health Information Trust Alliance framework and the Payment Card Industry Data Security Standard, also address disclosure, but focus more on data security controls.

The Privacy Act of 1974 covers the collection and release of information contained in U.S. federal government agency systems of records. It restricts disclosure of personally identifiable records, prohibits disclosure of an individual’s record without written consent—with certain exceptions, such as the release of certain information under a Freedom of Information Act (FOIA) request—and requires recordkeeping of all disclosures and releases of data. Activities of statistical agencies and units of the government are also governed by the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA), which limits and protects the use of statistical data and is discussed below further in the context of the Census Bureau.

Signed into law in 2019, the Open, Public, Electronic and Necessary (OPEN) Government Data Act provides a mandate for all federal agencies to publish all nonsensitive information assets in “modern, open, and electronic format.” In 2009, the White House issued the Open Government Directive to improve data transparency in the federal government, which included an increase in the release of data online through the Data.gov site. The OPEN Government Data Act makes the Open Government Directive a requirement in statute, rather than a policy. In implementation of the act, the Office of Management and Budget (OMB) is set to issue guidance to agencies on “risks and restrictions related to the disclosure of personally identifiable information.” This includes the risk that although an individual data asset in isolation does not pose a privacy or confidentiality risk, this data “when combined with other available information may pose such a risk.”

The European Union has had a comprehensive privacy law for years, which was overhauled by the passage of the General Data Protection Regulation (GDPR), which went into effect in May 2018. The GDPR restricts disclosure of personally identifiable information under Recital 26 to data that is “anonymized.” The complicated issues of anonymization and potential re-identification are further discussed below in the section on de-identification of data, using the GDPR as an example.

Other country-specific laws, such as Canada’s Personal Information Protection and Electronic Documents Act (PIPEDA) and the Australian Privacy Principles (APPs), govern aspects of privacy and disclosure practices in varying ways. Internationally, privacy principles that define practices to follow in handling data, such as those developed by the Organisation for Economic Co-operation and Development¹ and the Asia-Pacific Economic Cooperation Cross-Border Privacy Rules,² have also been adopted by some countries.

Citations

Organisation for Economic Co-operation and Development, OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data, 2013. source
Asia-Pacific Economic Cooperation, What is the Cross-Border Privacy Rules System?, April 15, 2019. source

Disclosure Limitation Techniques

Techniques for disclosure limitation can be classified in a number of ways. For the purposes of this paper, techniques are grouped by information limiting methods and data perturbation methods.¹ Information limiting methods are those that delete, mask, suppress, or obscure data fields or values in order to prevent re-identification. Data perturbation methods are those that use statistical means to alter either the underlying data itself or query results drawn from the data.

Information Limiting Techniques

PII, Anonymization and the Re-identification Problem

The simplest method of disclosure limitation is to strip PII from a dataset, removing all fields (or suppressing or masking these fields in some way) that could directly and uniquely identify an individual, such as name, social security number, and phone number. Through the 1990s and into the 2000s, PII was often used as something of a bright-line approach to data anonymization. It defined what data needed to be protected, with the remaining data fields considered harmless for disclosure from a privacy perspective. However, in the mid-2000s it became clear that a wide range of other data categories can be used to identify individuals.²

The GDPR expands the scope of protected information beyond PII, instead using the term “personal data”—a broader range of potentially identifying information as defined in Article 4(1):

‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.

Most, if not all, privacy laws rely on some concept of data being either personally identifiable or not to determine whether the law applies. As Recital 26 of the GDPR states, “The principles of data protection should therefore not apply to anonymous information, namely information that does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” Defining “identifiable” data, however, becomes complicated by the possibility of re-identification attempts.

Re-identification is the matching of anonymized data back to an individual. In recent years,³ faith in anonymization has been greatly shaken by studies demonstrating re-identification of released data.⁴ Two high profile examples are the 2006 Netflix Prize study⁵ by Arvind Narayanan and Vitaly Shmatikov, which re-identified individuals from Netflix’s release of over 100 million user ratings of movies, and a 2009 Social Security Number study by Alessandro Acquisti and Ralph Gross showing that data about an individual's place and date of birth can be used to predict their Social Security number.⁶

Most commonly, re-identification is performed using external databases to infer information about the anonymized data (known as “linkage attacks”). The Narayanan and Shmatikov Netflix study identified a subset of individuals by cross-referencing the Netflix data with non-anonymized movie ratings from the Internet Movie Database (IMDb). The Acquisti and Gross study used the Social Security Administration’s Death Master File to detect statistical patterns in SSN assignment, and used multiple sources for inferring birthdate, such as voter registration lists, online white pages, and social media.

Researchers have developed new anonymization algorithms such as k-anonymity, t-closeness, and l-diversity to more formally protect against re-identification. These frameworks each rely on their own assumptions and limitations, and therefore each protects against only certain types of attacks. For example, k-anonymity requires that every combination of quasi-identifier values (attributes that are not direct identifiers, but might potentially contribute to identification, such as age and sex) be shared by at least k different records. In a 3-anonymous table containing medical condition by zip code and age, every combination of zip code and age values that appears must appear at least three times. This can prevent the identification of a specific individual’s record in a table, but is still susceptible to what is known as a homogeneity attack. If the quasi-identifier values of an individual are known, then even though their actual record may not be identified (as there are at least k records that all have the same quasi-identifier values), their presence alone in a certain dataset could be confirmed and could reveal sensitive information about that person. Consider again a table of medical conditions 3-anonymized by zip code and age. If it happened that the only three records for a certain zip code and age combination all had a heart condition diagnosis, it could be possible for someone to deduce information about anyone matching that zip code and age. If we knew Bob’s age and where he lives (and thus his zip code), and knew he had been in the hospital, we would then be able to tell that it was because of a heart condition.⁷
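The k-anonymity condition and the homogeneity weakness described above can be checked mechanically. The following is a minimal sketch in Python; the field names (zip, age, condition), the sample records, and the helper functions are hypothetical and chosen only to mirror the example in the text.

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        groups[key].append(rec)
    return groups


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination appears at least k times."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def homogeneous_groups(records, quasi_identifiers, sensitive):
    """Return groups whose members all share one sensitive value.

    Anyone known to fall into such a group has the sensitive value
    revealed, even though no single record is identified.
    """
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return {
        key: group[0][sensitive]
        for key, group in groups.items()
        if len({rec[sensitive] for rec in group}) == 1
    }


# Hypothetical 3-anonymous table of medical conditions.
records = [
    {"zip": "10001", "age": "40-49", "condition": "heart condition"},
    {"zip": "10001", "age": "40-49", "condition": "heart condition"},
    {"zip": "10001", "age": "40-49", "condition": "heart condition"},
    {"zip": "10002", "age": "30-39", "condition": "asthma"},
    {"zip": "10002", "age": "30-39", "condition": "diabetes"},
    {"zip": "10002", "age": "30-39", "condition": "heart condition"},
]

print(is_k_anonymous(records, ["zip", "age"], k=3))
print(homogeneous_groups(records, ["zip", "age"], "condition"))
```

In this sketch the table passes the 3-anonymity check, yet the first group is flagged as homogeneous, which is the situation in the example of Bob.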

While anonymization has come under scrutiny as a method of privacy protection, the concepts of anonymity and identification are still key to the applicability and interpretation of privacy laws. Difficult questions concerning de-identification will likely be around for the foreseeable future. As techniques for de-identification improve, so will methods of attack. Ongoing research continues to provide techniques for evaluating and measuring re-identification risks,⁸ but it is unclear whether such methods can adequately account for external data sources that may become available in the future. Datasets cannot be taken back once released, so even if data is anonymized effectively based on current standards, future techniques could remove protections. Additionally, datasets are becoming more detailed and increasing longitudinally (covering a greater period of time), and these higher-dimension, long-term datasets will be more challenging to effectively anonymize.

Anonymization of data is perhaps best considered through a risk management perspective. Removing or obscuring potentially identifying data can be of benefit,⁹ not in providing certainty of de-identification, but in lowering risks when combined with other privacy and security controls. Anonymization techniques should not be a stand-alone approach to privacy, but rather one tool in the disclosure limitation toolkit. Other methods, as discussed in this paper, can be combined with anonymization to lower risks to acceptable levels.

Aggregating, Coarsening, and Suppressing Data

Beyond removing or masking direct PII identifiers, a number of disclosure limitation techniques have been developed for obscuring data in some manner, by coarsening, rounding, aggregating, or suppressing data.

Data coarsening (or generalization) techniques reduce the detail of the data so that individuals whose information is reflected in categories with low n-count values¹⁰ cannot be uniquely identified. One such technique involves top- and bottom-coding methods to place bounds on the reporting of data in order to prevent identification of outliers. Essentially, this means broadening data categories to include more unusual or extreme values, so they are not listed on their own, and thus potentially identifiable. For example, if only one individual in a dataset were age 99, their age could be recoded to a broader 90+ category.
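As a small illustration of top-coding, the sketch below recodes any age at or above a cutoff into a single open-ended category. The cutoff of 90 follows the example in the text; the function name and sample ages are hypothetical.

```python
def top_code_age(age, threshold=90):
    """Recode ages at or above the threshold into one open-ended category."""
    return f"{threshold}+" if age >= threshold else str(age)


ages = [34, 67, 99, 90, 89]
coded = [top_code_age(a) for a in ages]
print(coded)
```

Bottom-coding works the same way in the other direction, collapsing unusually low values into a single category.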

Cell suppression is the withholding of information in the cell of a table output, based on some threshold rule for what counts or aggregates would implicitly or explicitly reveal confidential information. For table cells in which the data would allow estimating a single individual’s value too closely, missing (or imputed) values are displayed. Suppression is often used when there are very few values contributing to a cell, or when one or two large values are the dominant contributors to the aggregate statistics. The two tables below show a very simple example: the original counts, followed by the same table with the cell showing that two women in the dataset live in New York City suppressed.

	NYC	DC	LA	Total
Male	10	14	6	30
Female	2	12	9	23

	NYC	DC	LA	Total
Male	10	14	6	30
Female	*	12	9	23

Across rows and columns of a table, or across multiple tables, this “primary” cell suppression alone does not always protect the data. Confidential values of primary cells can be determined by comparing and subtracting values. For these cases, secondary (or “complementary”) cell suppression is necessary, in which additional values are also displayed as missing. In the example above, the suppressed value of 2 could be determined by simply subtracting 12 and 9 from 23. Thus, as shown in the following table, a secondary suppression would be made.

	NYC	DC	LA	Total
Male	10	14	6	30
Female	*	12	*	23
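The primary and secondary suppression steps shown in the tables above can be sketched for a single row whose total is published. This is a simplified illustration, not a production method: the threshold of 3 and the rule of hiding the smallest remaining cell are assumptions made for the example, and real systems must also account for column totals and related tables.

```python
def suppress_row(counts, threshold=3):
    """Apply primary, then secondary suppression to one row of counts.

    Assumes the row total is published, so a single suppressed cell
    could be recovered by subtraction.
    """
    # Primary suppression: hide cells below the (assumed) threshold.
    shown = [None if c < threshold else c for c in counts]

    hidden = [i for i, v in enumerate(shown) if v is None]
    if len(hidden) == 1:
        # Secondary suppression: hide one more cell so the first cannot
        # be derived from the row total. Here the smallest remaining
        # cell is chosen, which is only one possible rule.
        candidates = [i for i, v in enumerate(shown) if v is not None]
        if candidates:
            extra = min(candidates, key=lambda i: counts[i])
            shown[extra] = None

    return ["*" if v is None else v for v in shown]


print(suppress_row([10, 14, 6]))   # Male row: NYC, DC, LA
print(suppress_row([2, 12, 9]))    # Female row: NYC, DC, LA
```

With these assumptions the Female row is displayed with both the NYC and LA cells suppressed, matching the final table above.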

Data Perturbation Methods

Beginning in the 1970s, as computers moved beyond the days of punch cards and mainframes, increasing numbers of researchers and others have been able to easily access and query databases. As a result, computer scientists began thinking about the associated risks to privacy and how to prevent queries from revealing information about particular individuals. An initial approach was to restrict or audit database queries, preventing queries that would pull back data that could identify individuals. The simplest such limit is to only allow users to run queries that would return aggregate, or statistical, results, and prevent queries that would pull back individual, unaggregated records. For example, developers could design a system so that queries return results only when the underlying result set is of at least a certain size. Using a certain threshold number, n, would ensure that only aggregate queries based on a set of at least n records can be run. With an n of two, for instance, a query that would return the data of a single, individual record would be blocked. The problem is that multiple queries can use differentials and overlapping sets (calculating values across query results by, say, subtracting the results of one query from another) to obtain confidential information. A simple example is running a query seeking the total income for all individuals in the data, and then running a second query seeking the total income for all individuals in the data excluding one certain individual. Subtracting the result of the second query from the result of the first would provide the income of that individual.
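The differencing problem described above can be demonstrated in a few lines. In this sketch the names, incomes, and the minimum query set size of two are all hypothetical; the point is only that each query passes the size restriction while the pair together reveals one person's value.

```python
# Hypothetical confidential data held by the query system.
incomes = {"alice": 52000, "bob": 87000, "carol": 61000, "dave": 45000}

MIN_QUERY_SET_SIZE = 2  # assumed threshold n


def total_income(names):
    """Aggregate query that refuses result sets smaller than the threshold."""
    if len(names) < MIN_QUERY_SET_SIZE:
        raise ValueError("query set too small")
    return sum(incomes[name] for name in names)


everyone = set(incomes)

# Both queries are allowed because each covers at least two records.
total_all = total_income(everyone)
total_without_bob = total_income(everyone - {"bob"})

# The difference between the two allowed answers is Bob's income.
inferred_income = total_all - total_without_bob
print(inferred_income)
```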

To guard against these re-identification attempts, researchers can employ techniques that prevent the release of statistics if the number of common records returned in a set of queries exceeds a given threshold. Further research has created increasingly sophisticated methods of query restriction. However, query restrictions ultimately provide no real guarantee of privacy against a sophisticated user who could potentially generate sets of queries that eventually reveal information about an individual.¹¹

The other approach researchers developed to protect databases is data perturbation—either altering the actual, underlying data or altering query outputs to protect confidentiality. This usually involves adding statistical noise—altering values of the data while still maintaining the statistical relationships between data fields. There is a long history of use of perturbation in disclosure limitation, and there are a number of perturbation approaches and techniques.

Data swapping, first proposed by Tore Dalenius and Steven Reiss in the late 1970s,¹² is a technique through which researchers use statistical models to find pairs of records with similar attributes and switch personally identifying or sensitive data values between the records. After this manipulation, outside researchers or attackers will not be able to determine which values correspond to which individuals. However, the aggregate values of the data are preserved sufficiently to enable researchers to make statistical inferences.
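A highly simplified sketch of data swapping follows. Instead of a statistical model for finding similar records, it assumes that records are "similar" when they match exactly on a few chosen attributes, and it swaps one field between randomly paired records within each matching group. The field names and sample records are hypothetical.

```python
import random
from collections import Counter, defaultdict


def swap_within_groups(records, match_keys, swap_key, seed=0):
    """Swap `swap_key` between random pairs of records that agree on `match_keys`."""
    rng = random.Random(seed)
    swapped = [dict(rec) for rec in records]  # leave the originals untouched

    # Group record positions by their matching attributes.
    groups = defaultdict(list)
    for index, rec in enumerate(swapped):
        groups[tuple(rec[k] for k in match_keys)].append(index)

    for indices in groups.values():
        rng.shuffle(indices)
        # Pair up consecutive positions; an odd record out is left as is.
        for a, b in zip(indices[0::2], indices[1::2]):
            swapped[a][swap_key], swapped[b][swap_key] = (
                swapped[b][swap_key],
                swapped[a][swap_key],
            )
    return swapped


records = [
    {"age_group": "30-39", "sex": "F", "zip": "10001"},
    {"age_group": "30-39", "sex": "F", "zip": "20002"},
    {"age_group": "30-39", "sex": "F", "zip": "90003"},
    {"age_group": "40-49", "sex": "M", "zip": "10001"},
    {"age_group": "40-49", "sex": "M", "zip": "20002"},
]

swapped = swap_within_groups(records, ["age_group", "sex"], "zip")

# The overall distribution of zip codes is unchanged by swapping.
print(Counter(r["zip"] for r in swapped) == Counter(r["zip"] for r in records))
```

Because values are only exchanged, not altered, counts for the swapped field are preserved overall and within each matching group, while the link between a given record and its value is broken.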

Other perturbation techniques involve changing the values of the data by some random amount, while still preserving the underlying statistical properties of the data within a certain range. For example, the perturbation technique of additive noise works by replacing true values x with the values of x+r, drawing value r from some distribution. The r amounts are such that the replacement x+r values preserve the statistical relations between data records.
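The additive noise approach can be sketched as follows. The choice of a zero-mean normal distribution for r, the noise scale, and the sample values are assumptions made for illustration; the text does not prescribe a particular distribution.

```python
import random
import statistics


def add_noise(values, scale, seed=0):
    """Replace each true value x with x + r, where r is drawn from N(0, scale)."""
    rng = random.Random(seed)
    return [x + rng.gauss(0, scale) for x in values]


# Hypothetical income values.
true_values = [50000 + 10 * i for i in range(1000)]
noisy_values = add_noise(true_values, scale=500)

# Individual values change, but the mean stays close to the original.
print(statistics.mean(true_values), statistics.mean(noisy_values))
```

Because the noise has mean zero, aggregate statistics such as the mean are approximately preserved even though each released value differs from the true one.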

While perturbation has been used extensively and effectively by the Census Bureau and other organizations, perturbation methods do have weaknesses. If strongly correlated attributes can be found in real-world data, this correlation can potentially be used to filter out the additive randomizations.¹³ The same can occur if someone has certain background knowledge about the data. Additionally, a number of mathematical and statistical techniques, such as spectral filtering, can be used to filter off the random noise from perturbed data, retrieving the original values.¹⁴

Synthetic Data

Synthetic data are datasets that seek to replicate the statistical properties of real-world datasets, serving as analytical replacements. Synthetic datasets can be either fully synthesized, with all of the original dataset values generated synthetically, or partially synthesized, with only certain fields or a portion of records synthesized. Often, the data must be presented in the same form and structure as the original data in order to be compatible with existing systems, algorithms, and software. Created through various modeling techniques,¹⁵ synthetic data differs from perturbation techniques that alter the original, underlying data as discussed above. Instead, synthetic data creates completely new data by using models that fit the original data (or by using defined parameters and constraints) to generate statistically comparable data independent from the underlying, real-world data.
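As a minimal illustration of generating data from a fitted model rather than altering original records, the sketch below fits a normal distribution to a single numeric field and samples entirely new values from it. The single-field normal model, the function names, and the sample ages are assumptions made for the example; real synthetic data generators model many fields and their relationships.

```python
import random
import statistics


def fit_normal(values):
    """Estimate the mean and standard deviation of the original field."""
    return statistics.mean(values), statistics.stdev(values)


def synthesize(values, n, seed=0):
    """Draw n new values from the fitted model; no original value is copied."""
    mu, sigma = fit_normal(values)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


original_ages = [23, 35, 41, 52, 29, 60, 47, 38]
synthetic_ages = synthesize(original_ages, n=2000)

print(statistics.mean(original_ages), statistics.mean(synthetic_ages))
```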

Synthetic data can greatly reduce the risks of re-identification through ancillary datasets, as matching synthetic records with external databases is difficult. The ability to adjust the model also provides an additional confidentiality advantage. Researchers modeling the data can make decisions about which relationships of the real-world data will be preserved. Relationships omitted from the model will not be discoverable by analysts, since they will not be present in the synthetic data, providing the ability to keep certain data correlations and the sensitive information they might reveal confidential. For example, if the correlation of interest in a research dataset is between gender, age, and health condition, that could be the only correlation preserved in the synthetic dataset. Other data fields might be generated and included in this dataset, modeled to provide broad, aggregate statistics, but not with statistically significant correlation to combinations of certain other variables (correlations that could predictively reveal information about individuals).

There are still confidentiality concerns that need to be taken into account with synthetic data, however. Synthetic datasets can potentially leak the underlying data if the model fits too closely. For example, if the synthetic data has enough different fields per individual, and the model is closely fitted to the original data, outliers can potentially be identifiable. However, there is ongoing research aimed at developing the means to generate synthetic data in a way that would provide formal privacy guarantees (discussed in the section on differential privacy below).¹⁶

Adversarial machine learning, which is the use of AI techniques to detect vulnerabilities, can potentially be used to determine information about a record used in the modeling of the synthetic dataset (i.e. “real-world” data). However, those seeking to uncover the information would also need key information about the model used to create the data.¹⁷

Apart from its value in disclosure limitation, another potential benefit of synthetic data is the ability to generate large volumes of research data at low cost.¹⁸ Machine learning requires running large volumes of training data through algorithms, and synthetic data may have great value as a way to rapidly provide these large volumes of data. These datasets contain no real-world data, but are statistically similar enough to real-world data to be of training value. While companies such as Google and Facebook generate large datasets as part of their business, smaller companies may be able to use this synthetic data to jump-start a machine learning program without collecting data about real people. From a privacy standpoint, using synthetic data to train artificial intelligence is attractive in that it avoids the need for collecting, storing, and using real-world data in the large amounts needed for machine learning.

Differential Privacy

A group of prominent computer scientists first introduced the concept of differential privacy (also known as “formal privacy”) in their 2006 paper, Calibrating Noise to Sensitivity in Private Data Analysis,¹⁹ although precursors to the technique go back decades. It is not a single tool or method, but rather a privacy standard that provides formal mathematical guarantees of privacy that can be implemented in various ways. Differential privacy’s guarantee is that an adversary can learn virtually nothing more about an individual based upon disclosures from a dataset than they would learn if that person’s record were not included in the dataset. In other words, whether or not your personal data is included, resulting outputs from a dataset would be approximately the same. Strong privacy protection is provided, while still allowing an analyst to derive useful statistical results. Differential privacy provides a promising solution to database reconstruction and re-identification attacks, as it would be highly difficult to link the noisy, approximate results to external data sources.

In practice, differential privacy works by injecting a precisely calculated amount of statistical noise into query results (so it can essentially be thought of as a perturbation method for query outputs). What differential privacy provides is an approximation of the true value; the exact same query could produce two different answers.²⁰ How far differentially private outputs may deviate from the real-world values can be tuned up or down, at a trade-off between accuracy and privacy. Differential privacy defines privacy risk as an allowable leakage of data on an individual in comparison to a hypothetical database without that individual. The allowed deviation between outputs from data that includes an individual and data that does not is governed by the privacy loss parameter, usually represented as ε (epsilon).
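One common way to add calibrated noise to a query result is the Laplace mechanism, sketched below for a counting query. The text does not name a specific mechanism, so this choice is an assumption for illustration, as are the function names and the example values. A count changes by at most 1 when one person is added or removed, so its sensitivity is taken to be 1.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return the count plus noise with scale sensitivity / epsilon.

    A smaller epsilon means more noise: stronger privacy, less accuracy.
    """
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)

# The same query asked twice gives two different approximate answers.
first_answer = private_count(100, epsilon=0.5, rng=rng)
second_answer = private_count(100, epsilon=0.5, rng=rng)
print(first_answer, second_answer)
```

Lowering ε in this sketch widens the noise distribution, which is the accuracy and privacy trade-off described above.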

While an individual’s personal information is almost irrelevant to the outputs produced via differential privacy, some insignificantly small change in belief about an individual can potentially be made based on the information released. For small ε, the probability that some inference can be made about an individual is at most roughly 1+ε times the probability that the inference could be made without the individual’s data. For example, if the baseline probability of an individual developing a certain disease is 3 percent (say for a female in the United States), with an ε of 0.01, the known probability under differential privacy would rise from 3 percent to only 3.03 percent (3 × (1+ε)) at most.²¹ Put another way, the difference in probability between outputs from a dataset with the individual and one without the individual is at most 0.03 percentage points.
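The arithmetic in this example can be reproduced directly. The sketch below computes the 1+ε bound used in the text alongside the e^ε factor that appears in the formal definition of differential privacy; for small ε the two are nearly identical. The variable names are illustrative only.

```python
import math

baseline = 0.03   # 3 percent baseline probability
epsilon = 0.01    # privacy loss parameter

# Approximate bound used in the text: baseline × (1 + ε).
linear_bound = baseline * (1 + epsilon)

# Bound using the e^ε factor from the formal definition.
exact_bound = baseline * math.exp(epsilon)

print(f"{linear_bound:.4%}")
print(f"{exact_bound:.4%}")
```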

Differential privacy also measures and bounds the total privacy loss over multiple analyses. Part of the application of differential privacy involves establishing a “privacy budget,” which limits the overall amount of data that can be disclosed. Setting this budget requires determining the cumulative risk of data disclosures over the lifespan of the data. With all disclosure limitation techniques, there is no avoiding the fundamental fact that when multiple analyses are performed using an individual’s data, disclosure risk increases by some amount. Thus with each statistical release, or query, under differential privacy, some small amount of potentially private information is leaked. Therefore, while risk does increase with each release, the privacy budget ensures that risk accumulates in a bounded way. Each query is analyzed to determine its privacy cost (ε) and whether the remaining balance of the privacy budget (the total budget less a running tally of ε over all queries) is sufficiently high to run it. Setting a privacy budget thus returns us to the always present trade-off between the informational value of the data and confidentiality: potentially releasing identifiable information (if the privacy budget is set too high) versus data releases not being informationally useful (if the budget is set too low). Methods for optimally calculating the privacy budget are an area of current research.²²
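The running-tally idea can be sketched with a small accounting class. It assumes the simplest accounting rule, in which the ε costs of successive queries add up; the class name, method names, and the budget values are hypothetical.

```python
class PrivacyBudget:
    """Track cumulative privacy loss, assuming per-query epsilons simply add up."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def remaining(self):
        return self.total - self.spent

    def charge(self, epsilon):
        """Record a query's cost, or refuse it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.4)   # first query runs
budget.charge(0.4)   # second query runs

try:
    budget.charge(0.4)   # a third query would exceed the budget
    third_query_allowed = True
except RuntimeError:
    third_query_allowed = False

print(third_query_allowed, round(budget.remaining(), 2))
```

Once the budget is spent, no further queries can be answered from the same data, which is why the size of the budget determines both the total information released and the total risk.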

Differential privacy has recently seen a number of real-world uses by companies. Uber uses differential privacy to protect internal analyses, such as those done on driver revenue.²³ Apple is using differential privacy to protect user privacy while improving the usability of features such as lookup hints.²⁴ Federal agencies such as the Census Bureau are also beginning to adopt differential privacy.²⁵ Additional tools for making differential privacy use more accessible are under development, with some being provided open-source, such as Google’s differential privacy kit, which is available via GitHub and allows users to calculate differentially private simple statistics from a dataset.²⁶

Differential privacy stands as one of the most promising disclosure limitation techniques, one that can provide formal, mathematical assurances of privacy while unlocking valuable research data. However, like any other disclosure limitation method, it is not an absolute assurance. In addition to concerns about correctly calculating the privacy budget, covert-channel attacks can potentially be used against differentially private query systems and need to be protected against. Information other than the query values, such as the time taken to complete the query, could potentially be used to reveal information such as the presence of an individual in a database.²⁷ For example, a query looking for an individual in a dataset of cancer patients may take one second to run if the individual is not present, versus a half hour to run if the individual is in the dataset.

Citations

There is some blurring of lines between these two categories.
Starting in 2006, a number of studies demonstrated the ability to re-identify individuals in publicly released, anonymized data. These include re-identifications from the 2006 AOL release of users' search queries (see Michael Barbaro and Tom Zeller Jr., "A Face Is Exposed for AOL Searcher No. 4417749", New York Times, Aug. 9, 2006, source), the Massachusetts government’s release of state employee hospital visit data (see Daniel Barth-Jones, “The 'Re-Identification' of Governor William Weld's Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now”, SSRN (July 2012), source), and the two other examples discussed below in this paper.
A 2016 comprehensive review of re-identification attacks found that 72.7% of all successful attacks have taken place since 2009. Jane Henriksen-Bulmer and Sheridan Jeary, "Re-identification Attacks—A Systematic Literature Review", International Journal of Information Management 36 (December 2016): 1184-1192, source
Ohm, 1716-1720
Arvind Narayanan and Vitaly Shmatikov, "Robust De-anonymization of Large Sparse Datasets", SP '08: Proceedings of the 2008 IEEE Symposium on Security and Privacy (May 2008): 111–125, source
Alessandro Acquisti and Ralph Gross, "Predicting Social Security numbers from public data", Proceedings of the National Academy of Sciences 27 (July 2009): 10975-10980, source
Methods to protect against such attacks continue to be developed, however. See for instance Qian Wang, Zhiwei Xu, and Shengzhi Qu, "An Enhanced K-Anonymity Model against Homogeneity Attack", Journal of Software 6 (October 2011): 1945-1952, source, and Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam, "l-Diversity: Privacy Beyond k-Anonymity", ACM Transactions on Knowledge Discovery from Data 1 (March 2007): 1-52, source
On qualitative and quantitative risk measurement for randomized control trial data, see for instance Parveen Kumar and Rajan Sareen, “Evaluation of Re-identification Risk for Anonymized Clinical Documents”, Canadian Journal of Hospital Pharmacy 62 (July–August 2009): 307-319, source
In revisiting Latanya Sweeney’s well-known re-identification study showing that 87% of the US population could be identified by gender, date of birth, and ZIP code, the author found that re-identification would drop to 0.02% by replacing date of birth with month and year only, and ZIP code with county. Philippe Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population”, Proceedings of the 5th ACM Workshop on Privacy in Electronic Society (October 2006): 77–80, source
When data is broken down by categories in tables, only a few individuals (“low n” for number of individuals) may fall into some of the categories, such as when only one or two students of a certain race and gender are in a particular college program.
See Appendix B in Irit Dinur and Kobbi Nissim, “Revealing Information while Preserving Privacy”, Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (June 2003): 202–210, source
Tore Dalenius and Steven P. Reiss, "Data-swapping: A Technique for Disclosure Control", Journal of Statistical Planning and Inference 6, no 1 (1982): 73-85, source
See Kun Liu, Chris Giannella, and Hillol Kargupta, “A Survey of Attack Techniques on Privacy-Preserving Data Perturbation Methods”, in Privacy-Preserving Data Mining: Models and Algorithms, ed. Charu C. Aggarwal and Philip S. Yu (New York, NY: Springer, 2008) 359-381, source
See Songtao Guo and Xintao Wu, "On the Use of Spectral Filtering for Privacy Preserving Data Mining", Proceedings of the ACM Symposium on Applied Computing, Dijon, France, April 23-27, 2006
See Surendra H and Mohan HS, “A Review of Synthetic Data Generation Methods for Privacy Preserving Data Publishing”, International Journal of Scientific & Technology Research 6 (March 2017): 95-101, source
See Haoran Li, Li Xiong, and Xiaoqian Jiang, “Differentially Private Synthesization of Multi-Dimensional Data using Copula Functions”, Advanced Database Technology 2014 (2014): 475–486, source, and National Institute of Standards and Technology, “2018 Differential Privacy Synthetic Data Challenge” (accessed Jan 10, 2020), source
See for instance, Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov, “Membership Inference Attacks against Machine Learning Models”, Proceedings of the IEEE Symposium on Security and Privacy (2017), source
Methods for generating synthetic data itself via machine learning are currently being developed using techniques such as generative adversarial networks (GAN).
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith, “Calibrating Noise to Sensitivity in Private Data Analysis”, in Theory of Cryptography, Lecture Notes in Computer Science, ed. Shai Halevi and Tal Rabin (Berlin: Springer, 2006), 265-284, source
Providing individuals in the dataset plausible deniability. As an example, for a population, 1,022 may be returned one time and 1,016 another, but this would be irrelevant as part of statistical analyses.
For the mathematical proof of differential privacy, see Cynthia Dwork, “A Firm Foundation for Private Data Analysis”, Communications of the ACM 54 (January 2011): 86-95, source
See for instance, Anis Bkakria and Aimilia Tasidou, “Optimal Distribution of Privacy Budget in Differential Privacy”, Risks and Security of Internet and Systems, CRiSIS 2018, Lecture Notes in Computer Science 11391 (2019), source
source
Apple Inc., “Differential Privacy Overview”, source
The Census Bureau’s move to differential privacy is discussed below.
source
Andreas Haeberlen, Benjamin C. Pierce, and Arjun Narayan, “Differential Privacy Under Fire”, Proceedings of the 20th USENIX Conference on Security (August 2011), source

The Census Bureau

While new legal rules mandating government transparency, such as the Open Data Act, will require agencies to release more data, the Census Bureau has long differed from other major government agencies in that the public release of data is one of its primary functions. The Census Bureau publishes a large amount of information on the demographics and economy of the United States, while endeavoring to protect the privacy of individuals. For the 2010 Census, the Bureau published 5.6 billion independent tabular summaries, based on over 300 million person records.¹ The Census Bureau faces a number of disclosure limitation challenges, including the high dimensionality of its data and the need to preserve associations among variables. In addition to releasing tabular summaries, the Census Bureau also publicly releases microdata (record-level data) from the decennial census and from many of its demographic and economic surveys.

Under Title 13 of the U.S. Code, the Bureau is prohibited from releasing data that allows “any particular establishment or individual” to be identified. In addition, the Census Bureau is bound by the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA). Applied primarily, but not exclusively, to the statistical agencies of the federal government, CIPSEA was created to provide federal agencies the ability to make a statutory commitment to confidentiality and to restrict data use to statistical purposes only. CIPSEA sets high penalties for disclosures, including fines and jail time. Other statistical agencies have often used data licensing agreements to provide data to specific users with confidentiality requirements. However, the Census Bureau cannot rely on such agreements because any Census data released to external parties is automatically considered publicly available. The Census Bureau thus has a strong focus on preserving confidentiality in the data it releases, and has been on the cutting edge of disclosure limitation methods.

The history of how the Census Bureau has protected public releases of information provides useful examples of disclosure limitation in practice. Early censuses in the 1800s used few privacy measures, often only removing names. As concerns about confidentiality grew, the 1929 census law established a requirement that “no publication shall be made by the Census Office whereby the data furnished by any particular establishment or individual can be identified.”² Protections were further codified and strengthened by the 1954 census law (Title 13 of the U.S. Code). Until the early 1960s, Census data was released only in printed volumes, greatly limiting the detail and amount of data that could be disclosed (and thus reducing privacy risks). With the move to publishing extensive electronic data in more recent times, and the greater attendant need to protect privacy, the Census Bureau has used a number of information reduction and data perturbation methods to limit disclosure. Information reduction techniques traditionally used by the Bureau include geographic thresholds, coding, and sampling.³ For example, to protect against identification, any geographic area identified on public-use files must have a population over 100,000, and every category of a categorical variable (a variable used for grouping, such as gender) must contain at least 10,000 people nationwide; otherwise the category is recoded into a broader one. Cell suppression, as described above, has also been a primary information restriction method used by the Census Bureau, and one it has continually worked to improve.⁴ In 1996, the Bureau began using data swapping and noise infusion techniques to perturb data by a confidential amount. The Census Bureau has also protected data from the decennial census and the American Community Survey (an ongoing population and housing survey) by creating partially and fully synthetic data.
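The geographic and category thresholds described above can be expressed as a simple check. The sketch below is only an illustration of the rule as stated in the text, not the Census Bureau's actual procedure: the function names, area names, categories, and counts are all hypothetical, and the mapping from a small category to its broader replacement is assumed to be supplied by the analyst.

```python
from collections import Counter

GEO_MIN_POPULATION = 100_000     # areas must have a population over this value
CATEGORY_MIN_NATIONAL = 10_000   # categories need at least this many people


def publishable_areas(area_populations: dict[str, int]) -> list[str]:
    """Return the areas large enough to be identified on a public-use file."""
    return [area for area, pop in area_populations.items()
            if pop > GEO_MIN_POPULATION]


def recode_small_categories(category_counts: dict[str, int],
                            broader: dict[str, str]) -> dict[str, int]:
    """Fold categories below the national threshold into a broader category.

    `broader` maps each small category to its (hypothetical) replacement.
    This is a single pass: a broader category could still fall below the
    threshold and would then need further recoding.
    """
    recoded: Counter = Counter()
    for category, count in category_counts.items():
        target = category if count >= CATEGORY_MIN_NATIONAL else broader[category]
        recoded[target] += count
    return dict(recoded)


# Made-up example data.
areas = {"Area A": 250_000, "Area B": 80_000}
counts = {"Occupation X": 45_000, "Occupation Y": 6_000, "Occupation Z": 7_000}
broader = {"Occupation Y": "Other occupations",
           "Occupation Z": "Other occupations"}

print(publishable_areas(areas))
print(recode_small_categories(counts, broader))
```

In this made-up example only Area A clears the geographic threshold, and the two small occupations are merged into a single broader category that does meet the 10,000-person minimum.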

To further protect against disclosure risks, the Census Bureau also uses procedural and administrative methods. Before dissemination, all data products released by the Census Bureau must be reviewed by its Disclosure Review Board (DRB). The DRB examines whether appropriate disclosure limitation techniques have been applied to a Census product, and also determines whether the product presents additional disclosure risks that need to be addressed. After an error was made in the release of a product in 2010, the Bureau also created the position of disclosure limitation officer. Each division at the Census Bureau that produces data releases must designate an officer to oversee all disclosure limitation activities and final submission to the DRB.

Centralized disclosure review boards (used by the Census Bureau, and other government agencies such as the Department of Education) offer added benefits beyond what a more limited, specific review of a disclosure would provide. Each data release can be considered in the context of all planned data releases by an agency. Additionally, centralized disclosure review boards bring together experts from across the agency, including staff with technical skills and those with specialized knowledge about particular data types and datasets.

Another approach to disclosure limitation is to restrict data access by legal and/or operational means. Due to confidentiality concerns, some Census data cannot be released publicly. To provide secure, authorized data access to researchers (rather than licensed release), the Census Bureau maintains 29 Federal Statistical Research Data Centers (RDCs) hosted at government agencies, universities, and nonprofit institutions. Using their own review and approval processes, statistical agencies (both the Census Bureau and others, including the Bureau of Labor Statistics and the Bureau of Economic Analysis) provide microlevel data to the secure RDC environments. Researchers must obtain Census Bureau Special Sworn Status by passing a background check and swearing a lifetime confidentiality oath. Under Title 13 and Title 26 of the U.S. Code, violations of the confidentiality requirements are punishable by a federal prison sentence of up to five years, a fine of up to $250,000, or both. Researchers work under the supervision of RDC employees on non-networked machines, and a researcher’s output, code, and notes all undergo disclosure review by an RDC analyst. Additionally, the statistical software used at the RDCs has certain commands restricted, such as those for copying or printing datasets. Certain projects may allow remote access through a secure communication network, with the code submitted by the researcher, executed on a computer in the RDC, and subject to the same code and output review provisions.

Recognizing that increases in computing power and the availability of external databases were raising the risk of re-identification, the Census Bureau has begun to move from legacy disclosure limitation methods to techniques based on formal privacy. In 2019, internal researchers at the Census Bureau discovered that confidential data could be reconstructed from the publicly released tabulations of the 2010 Census by using commercial data, potentially revealing the race and ethnicity of individuals.⁵ For the 2020 Census, differential privacy will be used to protect data through a new processing system developed in-house.⁶ The adoption of differential privacy will require the Census Bureau to closely evaluate how the quantity and nature of the statistics it releases affect its privacy budget, as each release of data will use a fraction of it. Tables for which high accuracy is critical will require a larger share of the privacy budget.
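To make the link between budget share and accuracy concrete, the sketch below splits a total ε across tables in proportion to priority weights and reports the resulting noise scale for each. It is a hypothetical illustration and does not describe the Census Bureau's actual system: the table names, weights, and total ε are invented, and the noise scale assumes the textbook Laplace mechanism (scale = sensitivity / ε) with a sensitivity of 1, as for simple counts.

```python
def allocate_budget(total_epsilon: float,
                    weights: dict[str, float]) -> dict[str, float]:
    """Split a total privacy budget across tables in proportion to their weights."""
    total_weight = sum(weights.values())
    return {name: total_epsilon * weight / total_weight
            for name, weight in weights.items()}


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale of the Laplace mechanism: a larger epsilon means less noise."""
    return sensitivity / epsilon


# Hypothetical priorities: a higher weight means accuracy matters more.
weights = {"state population totals": 3.0, "county age by sex": 1.0}
shares = allocate_budget(total_epsilon=1.0, weights=weights)

for table, eps in shares.items():
    scale = laplace_scale(sensitivity=1.0, epsilon=eps)
    print(f"{table}: epsilon share = {eps:.2f}, noise scale = {scale:.2f}")
```

The table given three times the weight receives three times the ε and therefore one third of the noise scale, which is the sense in which tables needing high accuracy consume a larger share of the budget.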

Citations

United States Census Bureau, American Fact Finder (accessed January 7, 2020), source
Included as part of the Reapportionment Act of 1929. Reapportionment Act of 1929, 71st Cong., 1st sess., June 18, 1929, 21-27, source
Amy Lauger, Billy Wisniewski, and Laura McKenna, “Disclosure Avoidance Techniques at the U.S. Census Bureau: Current Practices and Research”, Research Report Series, Center for Disclosure Avoidance Research #2014-2 (2014), source
Phyllis Singer and Nelson Chung, “Predicting Complementary Cell Suppressions Given Primary Cell Suppression”, Research Report Series, Center for Disclosure Avoidance Research #2016-5 (2016), Conditions source
John M. Abowd, “Starting Down the Database Reconstruction Theorem” (presentation at the American Association for the Advancement of Science Annual Meeting, Washington, DC, February 16, 2019), source
United States Census Bureau, “Disclosure Avoidance and the 2020 Census”, source

Conclusion

Collection of data continues to expand rapidly, growing datasets into longer-term repositories with increasing value. However, this higher-dimension, longitudinal data creates a greater risk of privacy harms and the corresponding need to develop more privacy-protective techniques and technologies. The tension in providing detailed enough data to be useful while maintaining confidentiality of the underlying information will always remain. When datasets include potentially identifiable personal information, steps to prevent disclosure of this information can limit the extent to which researchers can analyze data with granular and accurate enough calculations.

Both private and public organizations have long relied on notice and consent and de-identification for protecting privacy—methods that have been shown to be no longer reliable. There are no silver bullets in disclosure limitation, and no single privacy-enhancing technique or technology will completely remove privacy risks. However, recent advances in disclosure limitation hold great promise for protecting confidentiality while allowing data to be used to provide valuable information. The emerging techniques of differential privacy and synthetic data can help move us forward from debates about anonymization and re-identification, towards a better balancing of data disclosure and confidentiality based on formalized and measurable metrics. Traditional disclosure limitation techniques are still of value as well and can be used in conjunction with modern methods to greatly reduce privacy risks.¹ However, the focus on personally identifiable information in current privacy regulations presents complications when considering disclosures protected by modern means such as differential privacy. Existing and future laws and policies will need to take account of the more quantifiable, comprehensive concepts of privacy that formal privacy methods provide. Researchers and policymakers will need to ask tough questions about how much statistical noise is enough to adequately protect privacy while still providing useful data, and how to capture and define these considerations in regulations and policies.

Citations

Administrative and regulatory approaches, such as formal application and review, data use agreements, and secure data enclaves, can also be used to further minimize disclosure risks from data that is not publicly released.

More About the Authors

Chris Sadler

Education Data and Privacy Fellow, Open Technology Institute

