Introduction
The falling cost of data storage and the spread of the internet have led to an acceleration in the collection of data about individuals. Many organizations, both private and public, gather and store information from a myriad of sources, resulting in an accumulation of data exceeding 40 zettabytes (40 trillion gigabytes) globally.1 Data holds the potential for substantial gains to society in building knowledge, advancing research, informing policy, and providing information to the public. However, publishing or sharing data creates privacy risks, exposing individuals to potential financial, reputational, and other harms and liabilities. Organizations have ethical and often legal requirements to protect the confidentiality of data, but this involves a tradeoff with the usefulness of the data. Protecting confidentiality necessitates excluding, aggregating, or obscuring the data in some way that reduces its detail and exactness. Thus, a balance must be struck between confidentiality and the informational value of data.
“Disclosure” refers to the release of data by some means, including making it publicly available or available to another entity or individual (such as the sharing of records with a researcher).2 The primary privacy concern3 created by disclosure occurs when the released data contains direct personally identifiable information (PII), or when other fields or aspects of the released data can be used in some way (often in conjunction with other available datasets) to identify a person.4 Disclosures may include sensitive information about individuals, but the risk lies in being able to link data in the disclosure to a specific person. “Disclosure limitation” (also known as “disclosure avoidance” and “disclosure control”) refers to the safeguards and statistical methods used to reduce the risk of disclosure of identifiable information in a data release.
Privacy concerns about disclosure have often focused on the release of public-use government data. The Census Bureau, with a primary mission of disclosing data to the public, has been at the forefront of cutting-edge research on and empirical use of methods for disclosure limitation. Other statistical agencies at both the federal and state level also regularly release data, and non-statistical agencies will soon do the same, as the 2019 OPEN Government Data Act requires federal agencies to publish much of their information online as open data.
There is also growing concern about how corporations use the data they hold. As corporate data warehouses grow in volume and detail over time, they become valuable for discovering relationships in customer information through analytic techniques (a process known as data mining). The potential for derivation of highly sensitive information through data mining carries serious ethical implications.5 There are calls for comprehensive laws that would control the collection and use of personal data by companies, but even without those laws, we are seeing groundbreaking research into disclosure limitation from many sources in the private sector.6
Traditionally, government and private entities seeking to disclose information without creating privacy harms have attempted to provide data in aggregate or anonymized form, so that sensitive information cannot be related back to any particular individual. In recent years, however, it has become clear that traditional techniques of anonymization and aggregation of data are not as privacy protecting as had been thought.7 The challenges of balancing the quality and usefulness of disclosures with the fundamental rights of confidentiality and privacy have become much more complex as both technological capabilities and public perceptions of privacy have changed. There is a wide range of methods for suppressing, aggregating, and obscuring data, all with the goal of producing a release that reduces the risk of identifying individuals. However, increases in computing power, the advancement of analytical techniques and sophistication of attacks, the growth of available data sources on individuals, and other factors have weakened the protections of many traditional disclosure techniques. While these older methods are still useful in reducing disclosure risks and continue to be refined, there has been an accelerating shift to the modern, formal disclosure limitation techniques of differential privacy.
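To make the contrast with traditional methods concrete, a common building block of differential privacy is the Laplace mechanism: instead of releasing an exact statistic, the data holder adds calibrated random noise so that any one person's presence or absence has only a bounded effect on the output. The sketch below is purely illustrative and not drawn from any particular agency's implementation; the dataset, function names, and parameter choices are invented for the example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # A Laplace(0, scale) draw, generated as the difference of two
    # exponential draws, each with mean equal to `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    # A differentially private counting query. Counts have sensitivity 1:
    # adding or removing one person changes the true count by at most 1,
    # so Laplace noise with scale 1 / epsilon suffices for epsilon-DP.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical microdata: ages of seven survey respondents.
ages = [34, 29, 41, 57, 23, 45, 38]
rng = random.Random(0)

# "How many respondents are 40 or older?" The true answer is 3, but each
# release returns a noisy value, limiting what a single query reveals.
noisy_answer = dp_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
```

Smaller values of epsilon mean more noise and stronger privacy; larger values mean more accurate answers but weaker guarantees. This is the basic tradeoff between confidentiality and informational value described above, made explicit as a tunable parameter.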
This paper provides an overview of some of the privacy issues involved with data disclosures, and how disclosure limitation techniques can be used to protect the confidentiality of individuals whose data is included in disclosures. It provides an overview of some of the primary methods that have traditionally been used, as well as those that have emerged more recently. It does not aim to be an exhaustive list of disclosure limitation methods, but will hopefully provide pointers to further reading. The Census Bureau, which has been a primary center of developing disclosure limitation techniques, is used as an example of how disclosure limitation is practiced and how it has evolved.
Citations
- Jeff Desjardins, “How much data is generated each day?” World Economic Forum, April 17, 2019, source
- A disclosure is distinguished from a breach, which is an unintentional release of information.
- “Confidential” as used in this paper means any information that was not intended to be released as part of data made available, which includes both personal and non-personal information.
- In some instances, there may be other disclosure concerns apart from personal identification, such as release of classified information or certain sensitive or proprietary information about an organization or company.
- Such as the noted case of Target determining a teen girl’s pregnancy and revealing it to her father. See Kashmir Hill, “How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did,” Forbes.com (accessed Dec. 9, 2020), source
- The field of research in protecting privacy and confidentiality in data mining is known as Privacy Preserving Data Mining (PPDM), and it utilizes many of the disclosure limitation techniques discussed in this paper. One example is Google’s development of tools for secure multiparty computation and differential privacy.
- Paul Ohm, “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization,” UCLA Law Review 57 (August 2010): 1701–1777. source