The Census Bureau
While new legal rules mandating government transparency, such as the Open Data Act, will require agencies to release more data, the Census Bureau has long differed from other major government agencies in that the public release of data is one of its primary functions. The Census Bureau publishes a large amount of information on the demographics and economy of the United States while endeavoring to protect the privacy of individuals. For the 2010 Census, the Bureau published 5.6 billion independent tabular summaries, based on over 300 million person records.1 The Census Bureau faces a number of disclosure limitation challenges, including the high dimensionality of its data and the need to preserve associations among variables. In addition to releasing tabular summaries, the Census Bureau publicly releases microdata (record-level data) from the decennial census and from many of its demographic and economic surveys.
Under Title 13 of the U.S. Code, the Bureau is prohibited from releasing data that allows “any particular establishment or individual” to be identified. In addition, the Census Bureau is bound by the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA). Applied primarily, but not exclusively, to the statistical agencies of the federal government, CIPSEA was created to provide federal agencies the ability to make a statutory commitment to confidentiality and to restrict data use to statistical purposes only. CIPSEA sets high penalties for disclosures, including fines and jail time. Other statistical agencies have often used data licensing agreements to provide data to specific users with confidentiality requirements. However, the Census Bureau cannot rely on such agreements because any Census data released to external parties is automatically considered publicly available. The Census Bureau thus has a strong focus on preserving confidentiality in the data it releases, and has been on the cutting edge of disclosure limitation methods.
The history of how the Census Bureau has protected public releases of information provides useful examples of disclosure limitation in practice. Early censuses in the 1800s used few privacy measures, often only removing names. As concerns about confidentiality grew, the 1929 census law established a requirement that “no publication shall be made by the Census Office whereby the data furnished by any particular establishment or individual can be identified.”2 Protections were further codified and strengthened by the 1954 census law (Title 13 of the U.S. Code). Until the early 1960s, Census data was released only in printed volumes, which greatly limited the detail and amount of data that could be disclosed (and thus reduced privacy risks). With the more recent move to publishing extensive electronic data, and the greater attendant need to protect privacy, the Census Bureau has used a number of information reduction and data perturbation methods to limit disclosure. Information reduction techniques traditionally used by the Bureau include geographic thresholds, coding, and sampling.3 For example, to protect against identification, any geographic area identified on public-use files must have a population over 100,000, and every category of a categorical variable (a variable used for grouping, such as gender) must contain at least 10,000 people nationwide; otherwise the category is recoded into a broader one. Cell suppression, as described above, has also been a primary information restriction method at the Census Bureau, and one it has continually worked to improve.4 In 1996, the Bureau began using data swapping and noise infusion techniques to perturb data by a confidential amount. The Census Bureau has also protected data from the decennial census and the American Community Survey (an ongoing population and housing survey) by creating partially and fully synthetic data.
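The two information reduction thresholds quoted above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Bureau's actual procedure: the function names, the "Other" label, and the reading of the 10,000-person rule as applying to each category of a variable are assumptions made for the example.

```python
from collections import Counter

MIN_AREA_POPULATION = 100_000   # geographic threshold quoted in the text
MIN_CATEGORY_COUNT = 10_000     # nationwide category threshold quoted in the text


def meets_geographic_threshold(population, min_population=MIN_AREA_POPULATION):
    """Return True if an area is populous enough to be identified on a public-use file.

    The text says "over 100,000", so a strict comparison is used here.
    """
    return population > min_population


def recode_sparse_categories(values, min_count=MIN_CATEGORY_COUNT, broader="Other"):
    """Replace every category with fewer than `min_count` records by a broader label.

    `broader` is a hypothetical catch-all label chosen for this example.
    """
    counts = Counter(values)
    sparse = {category for category, n in counts.items() if n < min_count}
    return [broader if value in sparse else value for value in values]


if __name__ == "__main__":
    # Toy data: category "B" has too few records and is folded into "Other".
    records = ["A"] * 12_000 + ["B"] * 500 + ["C"] * 10_000
    print(Counter(recode_sparse_categories(records)))
    print(meets_geographic_threshold(150_000))
```

Note that the merged category can itself fall below the threshold, in which case a real recoding scheme would need to collapse categories further.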
To further protect against disclosure risks, the Census Bureau also uses procedural and administrative methods. Before dissemination, all data products released by the Census Bureau must be reviewed by its Disclosure Review Board (DRB). The DRB examines whether appropriate disclosure limitation techniques have been applied to a Census product, and also determines whether the product presents additional disclosure risks that need to be addressed. After an error was made in the release of a product in 2010, the Bureau also created the position of disclosure limitation officer. Each division at the Census Bureau that produces data releases must designate an officer to oversee all disclosure limitation activities and the final submission to the DRB.
Centralized disclosure review boards (used by the Census Bureau, and other government agencies such as the Department of Education) offer added benefits beyond what a more limited, specific review of a disclosure would provide. Each data release can be considered in the context of all planned data releases by an agency. Additionally, centralized disclosure review boards bring together experts from across the agency, including staff with technical skills and those with specialized knowledge about particular data types and datasets.
Another approach to disclosure limitation is to restrict data access by legal or operational means, or both. Due to confidentiality concerns, some Census data cannot be released publicly. To provide secure, authorized data access to researchers (rather than licensed release), the Census Bureau maintains 29 Federal Statistical Research Data Centers (RDCs) hosted at government agencies, universities, and nonprofit institutions. Using their own review and approval processes, statistical agencies (both the Census Bureau and others, including the Bureau of Labor Statistics and the Bureau of Economic Analysis) provide microlevel data to the secure RDC environments. Researchers must obtain Census Bureau Special Sworn Status by passing a background check and swearing a lifetime confidentiality oath. Under Title 13 and Title 26 of the U.S. Code, any violation of the confidentiality requirements is punishable by a federal prison sentence of up to five years, a fine of up to $250,000, or both. Researchers work under the supervision of RDC employees on non-networked machines, and a researcher’s output, code, and notes all undergo disclosure review by an RDC analyst. Additionally, certain commands in the statistical software used at the RDCs, such as those for copying or printing datasets, are restricted. Certain projects may allow remote access through a secure communication network, with the code submitted by the researcher, executed on a computer in the RDC, and subject to the same code and output review provisions.
Recognizing that increases in computing power and the availability of external databases were raising the risk of re-identification, the Census Bureau has begun to move from legacy disclosure limitation methods to techniques based on formal privacy. Internal researchers at the Census Bureau in 2019 discovered that confidential data could be reconstructed from the publicly released tabulations of the 2010 Census by using commercial data, potentially revealing the race and ethnicity of individuals.5 For the 2020 Census, differential privacy will be used to protect data through a new processing system developed in-house.6 The adoption of differential privacy will require Census to closely evaluate how the quantity and nature of the statistics it releases affect its privacy budget, as each release of data will use a fraction of it. Tables for which high accuracy is critical will require a larger share of the privacy budget.
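To make the idea of a privacy budget concrete, the sketch below splits a total budget (epsilon) across two tables and adds noise using the textbook Laplace mechanism for counting queries. It is not the Bureau's processing system, which this section does not describe; the table names, the budget shares, and the function names are hypothetical and chosen only for illustration.

```python
import random


def noise_scale(epsilon, sensitivity=1.0):
    """Laplace scale for a query with the given sensitivity under budget `epsilon`."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def release_tables(true_counts, total_epsilon, shares, seed=None):
    """Split `total_epsilon` across tables by `shares` and perturb each count.

    Under sequential composition the per-table budgets add up to the total,
    so the shares must sum to 1.
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("budget shares must sum to 1")
    rng = random.Random(seed)
    released = {}
    for name, count in true_counts.items():
        table_epsilon = total_epsilon * shares[name]
        # A counting query changes by at most 1 when one person is added or removed.
        released[name] = count + laplace_noise(noise_scale(table_epsilon), rng)
    return released


if __name__ == "__main__":
    # Hypothetical tables: the high-priority table gets 75% of the budget.
    counts = {"state_totals": 1000, "block_detail": 12}
    shares = {"state_totals": 0.75, "block_detail": 0.25}
    print(release_tables(counts, total_epsilon=1.0, shares=shares, seed=7))
```

A table given a larger share of the budget receives noise with a smaller scale, which is why tables that need high accuracy consume more of the budget and leave less for everything else.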
Citations
1. United States Census Bureau, American FactFinder (accessed January 7, 2020).
2. Included as part of the Reapportionment Act of 1929. Reapportionment Act of 1929, 71st Cong., 1st sess., June 18, 1929, 21-27.
3. Amy Lauger, Billy Wisniewski, and Laura McKenna, “Disclosure Avoidance Techniques at the U.S. Census Bureau: Current Practices and Research”, Research Report Series, Center for Disclosure Avoidance Research #2014-2 (2014).
4. Phyllis Singer and Nelson Chung, “Predicting Complementary Cell Suppressions Given Primary Cell Suppression”, Research Report Series, Center for Disclosure Avoidance Research #2016-5 (2016).
5. John M. Abowd, “Staring Down the Database Reconstruction Theorem” (presentation at the American Association for the Advancement of Science Annual Meeting, Washington, DC, February 16, 2019).
6. United States Census Bureau, “Disclosure Avoidance and the 2020 Census”.