April 21, 2021
The U.S. Census Bureau is currently tabulating and verifying the results of the 2020 Census. Delayed by the challenges of COVID-19, the agency is scheduled to deliver the first population counts by the end of April. An ongoing pandemic isn’t the only distinctive aspect of the 2020 Census, however. This will be the first census to protect confidentiality primarily through a privacy method known as differential privacy. While differential privacy has been successfully piloted by Microsoft, Apple, and a few other organizations (including limited use by the Census Bureau itself as far back as 2008), the 2020 Census will be its largest-scale use thus far.
The Census Bureau has long been at the forefront of developing new techniques for anonymizing data releases (more formally known as “disclosure limitation” or “disclosure avoidance” techniques), and the adoption of differential privacy continues this tradition. The move was necessitated by society’s advancing capacities for data processing and storage, as well as the increase in types of data being produced. These factors undermine the effectiveness of the Bureau’s traditional disclosure avoidance and anonymization techniques, allowing the re-identification of confidential data through the matching of that data with other, external data sources. Internal researchers at the Census Bureau discovered that confidential information on individuals could be reconstructed from the publicly released 2010 Census data. These problems of “database reconstruction” and the potential to identify individuals will only increase as available data sources become larger and more varied. This 2019 finding pushed the Bureau to accelerate work on building its new Disclosure Avoidance System (DAS) based on differential privacy, which can provide mathematically rigorous privacy protection in ways that traditional disclosure avoidance methods cannot.
Differential privacy was first defined in a 2006 paper by a group of prominent computer scientists. It works by adding a carefully calibrated amount of random statistical noise to data results. It can provide a guarantee that someone can learn essentially nothing more about an individual than they would if that individual’s record were not included in the data. In other words, whether or not one particular individual’s personal data is included, the outputs computed from a dataset would be approximately the same. This gets around the difficult task of determining which data elements are “identifying” by treating all information as potentially identifying. Enough noise just needs to be added to the results to conceal any individual’s contribution (so that, in essence, whether your data are included or not, the noisy results stay within the same statistical range). How much noise is needed depends on the sensitivity of the statistic being released: how much can one person’s information change the result?
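For readers who want the guarantee in symbols, the standard textbook statement is sketched below. The notation is an assumption introduced here for illustration, not taken from the Bureau's documentation: $M$ is the randomized release procedure, $D$ and $D'$ are two datasets that differ in one person's record, $S$ is any set of possible outputs, $\varepsilon$ is the privacy-loss parameter, and $\Delta f$ is the sensitivity of a statistic $f$. The second and third lines show one well-known way to meet the guarantee, the Laplace mechanism; the Bureau's own system may use a different noise distribution.

```latex
\begin{align*}
  \Pr[\,M(D) \in S\,] &\le e^{\varepsilon}\,\Pr[\,M(D') \in S\,]
    && \text{for all neighboring } D, D' \text{ and all } S \\
  \Delta f &= \max_{D,\,D'} \bigl|\, f(D) - f(D') \,\bigr|
    && \text{(sensitivity: the most one record can change } f\text{)} \\
  M(D) &= f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right)
    && \text{(Laplace mechanism, noise scale } \Delta f / \varepsilon\text{)}
\end{align*}
```

A smaller $\varepsilon$ forces the two probabilities closer together, which means stronger privacy and more noise. For a simple head count, adding or removing one person changes the answer by at most 1, so $\Delta f = 1$.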
The key to the process is the uncertainty the noise introduces in the data and the problems it presents for an attacker. As a simplified example, consider a company that releases some aggregate data on its 250 employees. The next month it releases the same data on its now 251 employees. An attacker could simply subtract the first set from the second to find out information on this new individual (perhaps finding on LinkedIn who newly joined the company that month). Differential privacy adds some noise to the employee total, making it, say, 248 or 252 the second month. These noise additions can make it very difficult to identify someone with any level of certainty. The goal is to keep the results of the data accurate enough to be useful for analysis (within certain statistical limits) but protect the privacy of individuals.
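The employee example can be sketched in a few lines of Python. This is a minimal illustration using the textbook Laplace mechanism, not the Census Bureau's actual system; the function names, the seed, and the value of `epsilon` are all hypothetical choices made for the example. Here both monthly releases receive independent noise.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent exponential draws, each with
    # mean `scale`, follows a Laplace distribution with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    # Rounding afterwards is post-processing and does not weaken the guarantee.
    return round(true_count + laplace_noise(sensitivity / epsilon, rng))


rng = random.Random(2020)  # fixed seed only so the sketch is repeatable
epsilon = 0.5              # hypothetical privacy-loss parameter

month1 = noisy_count(250, epsilon, rng)  # true headcount: 250
month2 = noisy_count(251, epsilon, rng)  # true headcount: 251

# Without noise the attacker's subtraction yields exactly 1 new employee.
# With noise the difference is no longer a reliable signal.
print("published counts:", month1, month2)
print("attacker's difference:", month2 - month1)
```

Because each published figure is off by a few people in an unpredictable direction, the difference between the two months could be negative, zero, or several people, and the attacker can no longer tell whether exactly one person joined.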
Differential privacy is also an improvement over traditional disclosure limitation techniques in that the designers of the system can calculate cumulative privacy losses. For any method of protecting data, it’s inevitable that as more and more data is released from a dataset, it becomes more likely that an attacker can identify an individual. Differential privacy, however, can quantify and track that privacy loss as data continues to be released, measuring it against a “privacy budget” that tells the owners of the dataset how close they are to the limit of privacy loss they have decided to accept.
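The bookkeeping behind a privacy budget can be sketched as a simple ledger. The example below assumes basic sequential composition, in which the privacy-loss parameters of successive releases simply add up; real systems may use tighter accounting rules. The class name `PrivacyBudget` and the budget figures are hypothetical.

```python
class PrivacyBudget:
    """Track cumulative privacy loss under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def remaining(self):
        """Return how much of the budget is still unspent."""
        return self.total_epsilon - self.spent

    def spend(self, epsilon):
        """Record a release costing `epsilon`, or refuse if it would overspend."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not block an exact fit.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted; release refused")
        self.spent += epsilon


budget = PrivacyBudget(total_epsilon=1.0)  # hypothetical overall limit
for table in ["table A", "table B", "table C"]:
    try:
        budget.spend(0.4)  # each release is assumed to cost epsilon = 0.4
        print(f"{table}: released, {budget.remaining():.1f} of budget left")
    except RuntimeError as err:
        print(f"{table}: {err}")
```

In this sketch the first two tables are released and the third is refused, because publishing it would push the cumulative loss past the limit the data owner set in advance.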
The move to differential privacy in the 2020 Census has not been without controversy. Researchers have raised questions about their ability to accurately conduct research using differentially private data, and Alabama recently filed a lawsuit backed by a number of other states questioning whether redistricting counts will be accurate enough under differential privacy. At heart, these are issues related to the trade-off between data confidentiality and data utility, a trade-off that exists with all disclosure limitation methods, differential privacy included. Adding too much noise makes the data unreliable, while adding too little weakens privacy protection. This is not much of an issue for data the Census Bureau releases for larger geographic regions, but for granular, small-population areas or tabulations, the amount of noise needed to protect confidentiality can make the data too inaccurate to be useful.
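A rough calculation shows why small populations are hit hardest. The sketch below assumes the textbook Laplace mechanism, under which the typical size of the noise is the same no matter how large the true count is, so the relative error grows as the population shrinks. The population figures and the value of epsilon are hypothetical, and the Bureau's actual system distributes noise across geographic levels in more elaborate ways.

```python
def expected_relative_error(true_count, epsilon, sensitivity=1.0):
    """Expected absolute Laplace noise divided by the true count."""
    # For Laplace noise with scale b, the expected absolute error equals b.
    scale = sensitivity / epsilon
    return scale / true_count


epsilon = 0.5  # hypothetical privacy-loss parameter
areas = [("census block", 20), ("small town", 2_000), ("state", 5_000_000)]

for label, population in areas:
    error = expected_relative_error(population, epsilon)
    print(f"{label:>12} (pop. {population:>9,}): {error:.5%} expected error")
```

With these assumed numbers, the same noise that is negligible for a state amounts to roughly a tenth of the count for a 20-person block, which is the accuracy concern critics have raised.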
There are valid questions about how the Bureau will set the privacy budget and how this will affect more detailed data. However, with the more traditional disclosure methods the Bureau used in the past, trade-offs were an internal agency decision that could not be made public. Differential privacy, with its ability to quantify both data accuracy and privacy loss, allows more transparency about how these trade-offs are made (revealing the differential privacy parameters used doesn’t compromise privacy), and thus more public discussion and debate. The Bureau continues to solicit feedback from its data user community and other stakeholders such as civil rights groups, and to release demonstration data to inform ongoing improvements to its use of differential privacy. There are also several avenues the Bureau can take to address concerns through use of alternative statistical products instead of public release, including facilitating and expanding access to Federal Statistical Research Data Center (RDC) locations where researchers can work with more detailed census data under strict protections. This would, however, entail legal changes and increased funding.
There has always been a tension between confidentiality and utility in the release of census data, a function of the Bureau’s dual mandate. While the Constitution specifies the Bureau must provide an “actual enumeration” of the U.S. population, Title 13 of the U.S. Code prohibits the Bureau from releasing data that allows “any particular establishment or individual” to be identified. Balancing these competing mandates is a challenge with which the Bureau continually grapples. But the Bureau realizes that the threat of re-identification attacks, which continue to grow in scope and sophistication, requires it to move to the latest disclosure limitation advances. While it remains to be seen how current controversies over differential privacy’s use in the 2020 Census will be resolved, this census will be a very significant one for how the Bureau balances the public’s trust in the data it provides with protecting the public’s privacy. It will also be a milestone in the use of differential privacy, advancing a promising tool for mitigating privacy risks when data are gathered, analyzed, and published.