Types of PETs and Plain-Language Explanations: A Glossary

Privacy-enhancing technologies (PETs) are highly technical, which makes it challenging for non-experts to understand how they work and, in turn, to fully appreciate their benefits. To bridge this gap, this report includes a glossary that provides clear, non-technical explanations of key PETs.

Additionally, after the glossary, there is a table to help practitioners navigate which PET may be best suited for different needs (see Table 1). The goal is to ensure that policymakers, data managers, and engineers can make informed decisions about which PETs to use in different scenarios.

Methodology for PET Selection

The PETs covered in this report were chosen based on their relevance to government use cases, their ability to protect privacy while maintaining data utility, and their compliance with regulatory requirements. The selection process considered:

  • Strength of privacy protections: The extent to which each PET minimizes exposure of sensitive data.
  • Suitability for government applications: Whether the PET can be effectively integrated into government workflows and data-sharing initiatives.
  • Regulatory and ethical considerations: How well each PET aligns with privacy laws and ethical data-use principles.

By understanding and implementing PETs, government agencies can ensure that data-driven initiatives remain both effective and privacy-compliant. The following sections provide an in-depth look at different PETs, their applications, and how they can be leveraged for secure data sharing.

Different Types of PETs

De-Identification

What it is: De-identification is the process of removing or altering personally identifiable information from datasets so that individuals cannot be easily identified. This is essential when sharing data for analysis, ensuring that personal details such as names, addresses, or contact information are not exposed. Redaction is a specific form of de-identification, where sensitive information in documents (e.g., health records or contracts) is blacked out or removed. This helps prevent privacy breaches while still allowing the relevant data to be used for research or policymaking.

How it works: De-identification can involve several techniques:

  • Pseudonymization: Replacing direct identifiers, like a name or Social Security number, with a pseudonym or code.
  • Suppression: Entirely removing sensitive identifiers that could lead to re-identification.
  • Generalization: Replacing exact data points with broader categories. For example, replacing a specific age with an age range like “30–40.”
  • Redaction: Systematically identifying and blacking out sensitive content, such as health information, financial details, or personal identifiers, so that unauthorized users cannot view or misuse the data.

These techniques preserve the utility of the data for analysis while ensuring individuals’ privacy.
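For technically inclined readers, the techniques above can be sketched in a few lines of Python. The field names, the 8-character pseudonym length, and the hash-based pseudonym scheme are illustrative choices, not a standard; production systems typically manage pseudonym keys in a separate, access-controlled table.

```python
import hashlib

def generalize_age(age):
    # Generalization: replace an exact age with a 10-year range.
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def pseudonymize(value, salt="per-dataset-secret"):
    # Pseudonymization: a keyed hash stands in for a lookup-table
    # pseudonym in this sketch (the salt is a hypothetical secret).
    return hashlib.sha256((salt + value).encode()).hexdigest()[:8]

def de_identify(record):
    return {
        "id": pseudonymize(record["name"]),          # direct identifier replaced
        "age_range": generalize_age(record["age"]),  # exact value generalized
        "zip3": record["zip"][:3] + "**",            # fine geography suppressed
    }

record = {"name": "Jane Doe", "age": 34, "zip": "20500"}
print(de_identify(record))
```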

Plain metaphor example: Imagine a basketball game where players wear jerseys with numbers but no names. You can analyze their performance, but you don’t know who they are unless you have the key to the roster.

Use case: De-identification of student data is essential for compliance with the U.S. Family Educational Rights and Privacy Act (FERPA), as it removes or obscures personally identifiable information to minimize the risk of unintended disclosure. When properly de-identified, a subset of data can be shared without obtaining further consent, allowing for educational research while maintaining confidentiality and ensuring privacy protections.

Differential Privacy

What it is: Differential privacy is a mathematical technique used to protect individual privacy while sharing data for analysis. It ensures that the inclusion or exclusion of any single person’s data cannot be determined based on the results of a statistical query or analysis. This is achieved by adding controlled random noise to the data or query responses, making it difficult to identify any specific individual’s information while still preserving overall patterns and trends. In simpler terms, differential privacy protects individuals by blending their data into the crowd, allowing researchers and analysts to learn about the group without revealing personal details about anyone.

How it works: Differential privacy works by adding random noise to datasets, making it difficult to link individual records back to a person. It is worth noting that with differential privacy, there is an inverse relationship between privacy and precision (the more noise is added for individual privacy, the less precise the data is).
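The noise-adding step can be sketched with the Laplace mechanism, a classic construction for counting queries. The epsilon values below are illustrative; choosing epsilon in practice is a policy decision about the privacy-precision trade-off described above.

```python
import random

random.seed(0)  # for reproducibility of this sketch

def laplace_noise(scale):
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def noisy_count(true_count, epsilon):
    # A counting query has sensitivity 1: adding or removing one person
    # changes the count by at most 1, so Laplace noise with scale
    # 1/epsilon gives epsilon-differential privacy.
    return true_count + laplace_noise(1 / epsilon)

true_count = 1000
print(noisy_count(true_count, epsilon=0.1))   # small epsilon: noisier, more private
print(noisy_count(true_count, epsilon=10.0))  # large epsilon: more precise, less private
```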

Plain metaphor example: Think of differential privacy like altering a picture to protect someone’s identity. Imagine you have a group photo where everyone’s faces are clear. If you want to share that photo without revealing anyone’s identity, you blur the faces just enough that no one can recognize individuals but can still tell it’s a group photo.

Differential privacy adds this “blur” by introducing small, random changes (noise) to the data. This ensures that no one can tell whether a particular person is present in the photo, but overall trends, like average height or hair color, remain visible and useful. It protects individual privacy while still allowing meaningful patterns to emerge.

Use case: In the 2020 Census, the U.S. Census Bureau implemented differential privacy to enhance the protection of individual respondents’ data. This approach involved adding controlled noise to the data before it was released, making it difficult to identify specific individuals while still allowing for meaningful analysis of population trends and demographics.

Encryption

What it is: Encryption is the process of converting data into a secure format that can only be accessed or decrypted by authorized parties. It is a critical component of data security, ensuring that information remains confidential and protected from unauthorized access during storage, transmission, and processing.

Think of it like locking valuables in a safe—only those with the key can access them. Whether the items are being transported (in transit) or stored in the safe (at rest), the lock ensures they remain protected from anyone who shouldn’t have access.

Encryption: In Transit

What it is: Encrypting data in transit ensures that information transmitted over a network remains confidential and protected from interception. Transport Layer Security (TLS) is the standard protocol for securing these communications, commonly seen in HTTPS connections, and it prevents attackers from accessing data during transmission.

How it works: When a client (such as your web browser) connects to a server, they negotiate a secure connection by agreeing on encryption methods and exchanging keys. Data is then encrypted before transmission, ensuring that only the intended recipient can decrypt and read it, even if the data is intercepted.
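The client side of this handshake is built into Python's standard ssl module; the sketch below shows the secure defaults, and the commented lines indicate where a real connection would be opened (the hostname is a placeholder).

```python
import ssl

# ssl.create_default_context() enables the protections described above:
# the client verifies the server's certificate, checks the hostname,
# negotiates cipher suites, and encrypts everything that follows.
context = ssl.create_default_context()

assert context.check_hostname                    # refuse mismatched certificates
assert context.verify_mode == ssl.CERT_REQUIRED  # refuse unverified servers

# Wrapping a TCP socket for a host (connection not opened in this sketch):
# import socket
# with socket.create_connection(("example.com", 443)) as raw:
#     with context.wrap_socket(raw, server_hostname="example.com") as tls:
#         tls.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
```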

Plain metaphor example: Imagine two spies using a secret language to communicate over a public radio channel. Even if someone listens in, they won’t understand the message without knowing the code.

Use case: Online banking websites and e-commerce platforms use TLS encryption to protect data transmitted between users and servers. When you log into your bank account or make a purchase online, TLS ensures that sensitive information—like login credentials or payment details—is encrypted during transmission. Even if an attacker intercepts the data, they won’t be able to read it because the data is encrypted before it leaves your device and decrypted only by the intended recipient server. This encryption process is facilitated by certificate authorities, which validate the authenticity of the encryption keys to ensure secure communication.

Encryption: At Rest

What it is: Encrypting data at rest protects data stored on devices, servers, or databases. Before data is saved to disk, it is encrypted using algorithms like the Advanced Encryption Standard (AES). When the data is needed, it is decrypted by authorized systems or users. This method ensures that even if someone gains access to the physical storage, they cannot read the data without the decryption key.

How it works: When data is saved to a storage medium, it is encrypted using a cryptographic algorithm and an encryption key. The data remains encrypted until it is needed, at which point it is decrypted by authorized users or systems with the correct decryption key.
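The encrypt-store-decrypt flow can be illustrated with a deliberately simplified stream cipher. To be clear, this is a toy: it XORs data with a SHA-256-based keystream purely to show the shape of the process. Real at-rest encryption uses AES (for example, AES-256-GCM) from a vetted cryptography library, never a hand-rolled scheme.

```python
import hashlib
import secrets

def keystream(key, nonce, length):
    # Derive a pseudorandom byte stream from the key and a per-file nonce.
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        out.extend(block)
        counter += 1
    return bytes(out[:length])

def encrypt(key, plaintext):
    nonce = secrets.token_bytes(16)
    ks = keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ k for p, k in zip(plaintext, ks))

def decrypt(key, blob):
    nonce, ciphertext = blob[:16], blob[16:]
    ks = keystream(key, nonce, len(ciphertext))
    return bytes(c ^ k for c, k in zip(ciphertext, ks))

key = secrets.token_bytes(32)        # the "vault combination"
stored = encrypt(key, b"SSN: 123-45-6789")   # example value, not real data
print(decrypt(key, stored))          # only the key holder can read it back
```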

Plain metaphor example: Storing encrypted data is like keeping valuables in a high-security bank vault. Even if a thief gets inside the bank, they can’t open the vault without the right combination.

Use case: Services like Dropbox and Google Drive, used across government settings, employ at-rest encryption to protect files stored on their servers. This ensures that if a hacker gains access to the storage, they cannot read the contents of the files without the decryption keys.

Homomorphic Encryption

What it is: Homomorphic encryption allows computations to be performed on encrypted data without needing to decrypt it first. This technique ensures that sensitive data remains encrypted throughout the entire process, even during analysis, making it a powerful tool for privacy-preserving computations.

How it works: With homomorphic encryption, data is encrypted using a special algorithm that allows mathematical operations, like addition or multiplication, to be performed directly on the encrypted data. The result of these operations is still encrypted, and only after the computations are complete can the encrypted result be decrypted to reveal the final output. This means that sensitive data, such as personal information or financial details, can be processed and analyzed by third parties without ever exposing the original data. It is especially useful in cloud computing, where data privacy is crucial, as it allows for secure data sharing and processing without compromising security.
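The "operate on ciphertexts, decrypt only the result" idea can be demonstrated with the Paillier cryptosystem, which supports addition on encrypted values. The primes below are tiny and insecure, chosen only for readability; real deployments use 2048-bit (or larger) moduli from a vetted library.

```python
import math
import random

p, q = 10007, 10009          # toy primes; far too small for real use
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)          # valid because the generator is g = n + 1

def encrypt(m):
    # Ciphertext: (n+1)^m * r^n mod n^2, with random r coprime to n.
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    x = pow(c, lam, n2)
    return (x - 1) // n * mu % n

# Multiplying two ciphertexts adds the underlying plaintexts:
c_sum = encrypt(3) * encrypt(4) % n2
print(decrypt(c_sum))  # → 7, computed without ever decrypting 3 or 4
```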

Plain metaphor example: Homomorphic encryption is like sending a locked box of ingredients to a chef to cook in a private kitchen. The chef uses those ingredients to prepare a meal and then locks the finished product inside the box. Only the person with the key can unlock the box and enjoy the final result, ensuring no one else can steal the ingredients or alter the meal.

Use case: IBM has used homomorphic encryption for cloud computing solutions, enabling clients to analyze sensitive data, such as medical or financial information, while keeping it encrypted throughout the process. Similarly, governments can use homomorphic encryption to securely outsource data processing and storage to cloud providers while maintaining control over the data and ensuring its confidentiality, which is critical for large-scale processing of sensitive data.

Federated Data Science

Federated data science is a collaborative approach to data analysis where multiple parties work together to analyze decentralized data without transferring or sharing sensitive information. By using techniques like federated learning and federated analytics, organizations can derive insights from distributed datasets while ensuring privacy and compliance with data protection regulations.

Federated Learning

What it is: Federated learning is a machine learning technique that allows multiple devices or parties to collaboratively train a model without sharing their raw data. Instead of sending data to a central server, each participant trains the model locally on their own device and only shares the model updates.

How it works: In federated learning, a global machine learning model is built collaboratively by many devices or entities (e.g., smartphones, hospitals, or organizations). Each participant trains the model on their local data, then sends only the model parameters (such as weights or gradients) to a central server, rather than the raw data itself. The server aggregates the updates from all participants to improve the model. This process is repeated across multiple rounds, allowing the model to learn from a diverse set of data sources without any party needing to expose their private data.
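The rounds described above can be sketched with federated averaging (FedAvg) for a one-parameter model y = w * x. The two "hospitals" and their data are hypothetical; the point is that the server only ever sees model weights, never the raw (x, y) pairs.

```python
clients = [
    [(1.0, 2.0), (2.0, 4.0)],   # hospital A's private data (y = 2x)
    [(3.0, 6.0), (4.0, 8.0)],   # hospital B's private data
]

def local_update(w, data, lr=0.05):
    # One gradient-descent step on mean squared error, local data only.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

w_global = 0.0
for _ in range(50):
    # Each client trains locally and shares only its updated weight.
    local_weights = [local_update(w_global, data) for data in clients]
    # The server aggregates by simple averaging.
    w_global = sum(local_weights) / len(local_weights)

print(round(w_global, 3))  # converges toward the true slope 2.0
```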

Plain metaphor example: Federated learning is like a group of chefs each perfecting their own recipe in separate kitchens. Each chef works with their own set of ingredients, refining their dish based on what they’ve learned. After a round of cooking, the chefs share what improvements they’ve made to their recipes, but none of them reveal their specific ingredients or the exact methods they used. The central restaurant combines these improvements to create the best dish possible, without needing to know what’s inside each chef’s kitchen.

Use case: As governments make strides to train their own artificial intelligence (AI) models, federated learning provides an opportunity to train on decentralized data rather than pooling data centrally in ways that may put personally identifiable information (PII) at risk. The Centers for Disease Control and Prevention and National Institutes of Health could use federated learning to train AI models on COVID-19 patient data from multiple hospitals, without hospitals sharing raw patient data.

Federated Analytics

What it is: Federated analytics is similar to federated learning, but instead of training machine learning models, it focuses on performing data analysis collaboratively while keeping the data decentralized and private.

How it works: In federated analytics, data remains on the local devices or servers, and only aggregated insights or analysis results are shared. For example, rather than sending raw data to a central server, each participant can perform calculations on their own data and then share only the aggregated results, such as averages or statistical summaries. This ensures that the original, detailed data is never exposed or transmitted, but collective insights can still be derived from all participants’ datasets.
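A minimal sketch of this pattern: each site (the agency names and figures are invented for illustration) reports only a local sum and count, and the coordinator derives a global average without ever seeing an individual record.

```python
site_data = {
    "agency_a": [52000, 61000, 58000],
    "agency_b": [47000, 49000],
    "agency_c": [73000, 68000, 71000, 70000],
}

def local_aggregate(values):
    # The only thing each site shares: an aggregate, not raw records.
    return sum(values), len(values)

reports = [local_aggregate(v) for v in site_data.values()]
total = sum(s for s, _ in reports)
count = sum(c for _, c in reports)
print(total / count)  # 61000.0: global mean, computed without pooling raw data
```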

Plain metaphor example: Federated analytics is like a line of cashiers closing out their registers at a supermarket for the night. Each cashier counts their bills and reports their respective final amounts to the supermarket’s log, but nobody knows how many 1s, 5s, 10s, 20s, 50s, or 100s there were in registers besides their own. Each cash register remains private, but the supermarket can still create a complete picture by combining their totals.

Use case: Google reported in 2020 that it used federated analytics to support the Now Playing feature on Google’s Pixel phones. This approach enhances privacy by ensuring that song recognition happens locally on the device, without transmitting raw or processed audio data. Because each phone received the same database, the feature ensured privacy while maintaining functionality.

Generalization

What it is: Generalization is a privacy technique where specific, granular data is replaced with broader categories or ranges to protect individuals’ identities while maintaining useful data for analysis. For example, rather than recording an individual’s exact salary, the data could be generalized into salary ranges such as “$40,000–$60,000.” Generalization helps ensure that specific individuals cannot be identified based on the data, even when combined with other publicly available information.

How it works: Generalization involves transforming detailed data into higher-level categories or ranges. For example, instead of recording an exact age, an age range (such as “30s”) might be used to avoid identifying an individual. Similarly, a person’s exact geographic location might be replaced with a broader region or area. This approach reduces the risk of re-identification by ensuring that data points are not unique to a person. The challenge is balancing the level of detail retained for analysis with the level of privacy provided to individuals.
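The transformations described above reduce to simple banding functions. The band widths below are illustrative choices; picking them is exactly the privacy-versus-detail balancing act just described.

```python
def generalize_age(age):
    # Exact age -> decade band, e.g., 34 -> "30s".
    return f"{(age // 10) * 10}s"

def generalize_salary(salary, width=20000):
    # Exact salary -> range, e.g., 52000 -> "$40,000-$60,000".
    low = (salary // width) * width
    return f"${low:,}-${low + width:,}"

def generalize_zip(zip_code):
    # Exact ZIP -> broader geographic area.
    return zip_code[:3] + "**"

print(generalize_age(34), generalize_salary(52000), generalize_zip("20500"))
```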

Plain metaphor example: Instead of displaying an exact street location, generalization zooms out so you only see the general area, like a heatmap that highlights trends without revealing individual lots.

Use case: Generalization helps protect patient privacy in health care by replacing specific data points—such as exact ages or test results—with broader categories, enabling medical research while reducing the risk of re-identification. Transforming patient data into fixed intervals and replacing values with carefully calculated averages, for example, allows researchers to analyze trends without exposing individual identities.

Hashing

What it is: Hashing is a process that converts input data, such as passwords or files, into a fixed-length string of characters, known as a hash. This hash is a unique identifier that represents the original data but cannot be reversed back to the original data—unlike a token, which can be reversed. Hashing is widely used for ensuring data integrity, verifying authenticity, and securely storing sensitive information like passwords.

How it works: Hashing algorithms take an input (such as a password) and apply a mathematical function that produces a fixed-length hash value. The important property of hashing is that it is a one-way function, meaning that once the data is hashed, it cannot be converted back to its original form. For example, when storing passwords, instead of saving the password itself, systems store the hash of the password. During login, the entered password is hashed and compared to the stored hash. If the hashes match, the password is correct. This approach ensures that even if the hash is exposed, the original password cannot be easily recovered.
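The store-hash-then-compare login flow can be sketched with PBKDF2 from Python's standard library. The salt ensures identical passwords do not produce identical hashes, and the iteration count (an illustrative 200,000 here) slows brute-force guessing.

```python
import hashlib
import secrets

def hash_password(password, salt=None):
    # Store the salt and digest, never the password itself.
    salt = salt or secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest

def verify_password(password, salt, stored_digest):
    # Hash the login attempt the same way and compare in constant time.
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return secrets.compare_digest(candidate, stored_digest)

salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))  # True
print(verify_password("wrong guess", salt, stored))                   # False
```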

Plain metaphor example: Hashing is like making a smoothie from a mixture of fruit. Once the fruits are blended into the smoothie, you can’t separate the individual pieces of fruit back out, but you can still tell the smoothie was made from those fruits based on its flavor.

Use case: In password security, including government passwords, when a user logs in, the system hashes the entered password and compares it to the stored hash.

K-Anonymity

What it is: K-anonymity is a privacy principle that ensures any individual’s data is indistinguishable from at least k-1 other individuals in a dataset, reducing the risk of identifying a specific person when analyzing or sharing data. It works by ensuring that any combination of an individual’s attributes (such as age, zip code, or gender) is shared by at least k-1 other people in the dataset. For example, when k=2, an individual’s data is indistinguishable from at least one other person’s data. The higher the value of k, the stronger the privacy protection.

How it works: K-anonymity is typically achieved by modifying data through generalization or suppression. For example, if an individual’s exact birth date is included in a dataset, it could be replaced with a broader date range, such as “January 1, 1980–December 31, 1989,” so that at least k-1 other individuals share the same birth date range. Similarly, exact geographic locations could be generalized into broader regions to ensure multiple individuals share the same location. K-anonymity ensures that data analysis is still possible without compromising individual privacy, even if other data sources are available for cross-referencing.
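A dataset's k can be measured directly: group records by the chosen quasi-identifiers and find the smallest group. The records and quasi-identifier columns below are invented for illustration.

```python
from collections import Counter

def k_of(records, quasi_identifiers):
    # Group records by their quasi-identifier values; the dataset's k is
    # the size of the smallest group.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [
    {"age_range": "30-39", "zip3": "205**", "diagnosis": "flu"},
    {"age_range": "30-39", "zip3": "205**", "diagnosis": "asthma"},
    {"age_range": "40-49", "zip3": "206**", "diagnosis": "flu"},
    {"age_range": "40-49", "zip3": "206**", "diagnosis": "diabetes"},
]

print(k_of(records, ["age_range", "zip3"]))  # 2: each combination appears twice
```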

Plain metaphor example: K-anonymity is like trying to spot your friends in a crowded stadium full of people wearing the home team’s colors. If you’re surrounded by 10 people in the same jersey, it’s hard to identify who’s who. The more people in the crowd with the same jersey, the harder it becomes to single out anyone, keeping everyone’s identity protected in the sea of fans.

Use case: K-anonymization protects privacy in local-level-area datasets by grouping data and suppressing low-value cells, reducing the risk of identification. Agencies like the U.S. Census Bureau use this approach to prevent attacks on the data, such as reverse geocoding and differencing, while preserving data utility.

Private Set Intersection

What it is: Private set intersection (PSI) is a type of multi-party computation that allows two parties to compare their datasets to identify common elements, while keeping the rest of the data private. PSI ensures that no party learns anything about the other party’s data, except for the items that are present in both sets.

How it works: In PSI, each of the two parties has a private set of data, and they engage in a protocol that allows them to securely compute the intersection (i.e., the common elements) between the two sets. During this process, neither party reveals any information about their private data outside of the common elements. This is accomplished using cryptographic techniques such as encryption or secure hashing. The result of the protocol is the list of common items, without any leakage of additional information about the sets. PSI is useful in scenarios where both parties need to know shared data, but neither is willing to share their entire dataset.
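One classic construction is Diffie-Hellman-based PSI, sketched below with toy parameters. Each party blinds the hash of its items with a secret exponent; because exponentiation commutes, double-blinded values match exactly when the underlying items match, and nothing else is revealed. This sketch assumes honest-but-curious parties and omits the shuffling and group-selection details a real protocol requires.

```python
import hashlib
import secrets

P = 2**127 - 1  # a Mersenne prime; real protocols use standardized groups

def h(item):
    # Hash an item into the group.
    return int.from_bytes(hashlib.sha256(item.encode()).digest(), "big") % P

a = secrets.randbelow(P - 2) + 1   # Alice's secret exponent
b = secrets.randbelow(P - 2) + 1   # Bob's secret exponent

alice_set = {"carol@example.com", "dave@example.com", "erin@example.com"}
bob_set = {"dave@example.com", "frank@example.com"}

# Alice sends H(x)^a; Bob raises each to b, yielding H(x)^(a*b).
alice_blinded = {x: pow(h(x), a, P) for x in alice_set}
double_blinded = {x: pow(v, b, P) for x, v in alice_blinded.items()}

# Bob sends H(y)^b; Alice raises each to a, also yielding H(y)^(a*b).
bob_double = {pow(pow(h(y), b, P), a, P) for y in bob_set}

# Matching double-blinded values reveal only the common items.
intersection = {x for x, v in double_blinded.items() if v in bob_double}
print(intersection)  # {'dave@example.com'}
```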

Plain metaphor example: PSI is like two people comparing their contact lists to see which friends they share, without showing each other their entire list of contacts. They can identify and only reveal the common names, keeping their individual contact lists private.

Use case: The strength of PSI lies in enabling the comparison of datasets without revealing unnecessary information. The FBI and local police departments could potentially use PSI to check whether a suspect appears on watchlists without revealing entire law enforcement records.

Secure Multi-Party Computation

What it is: Secure multi-party computation (SMPC) allows multiple parties to collaboratively compute a function over their private inputs while keeping those inputs confidential. The key feature of SMPC is that no participant learns anything about the others’ private data during the computation, ensuring privacy and security.

How it works: SMPC involves dividing the computation process into smaller pieces and distributing them across multiple parties, each of which performs a computation using its own private data. At no point does any participant receive access to the other participants’ raw data; they only receive partial results, which are combined at the end to produce the final output. This allows for joint computations, like analyzing shared data, comparing results, or making collective decisions, without revealing any individual’s private information. The security comes from the fact that the computations are designed in such a way that no single participant has enough information to infer anything about others’ inputs.
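The simplest SMPC building block is additive secret sharing, sketched below for a joint sum. Each party splits its private input into random shares; no single share (or any subset short of all of them) reveals the input, yet combining the parties' partial sums yields the correct total. The three "agencies" and their counts are hypothetical.

```python
import random

M = 2**61 - 1  # modulus; all share arithmetic is done mod M

def share(secret, n_parties):
    # Split a secret into random shares that sum to it mod M.
    shares = [random.randrange(M) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % M)
    return shares

inputs = [120, 340, 95]                      # hypothetical private counts
all_shares = [share(v, 3) for v in inputs]   # each agency shares out its input

# Party i receives the i-th share from every participant and sums them.
partial_sums = [sum(s[i] for s in all_shares) % M for i in range(3)]

# Combining the partial sums reveals only the total, never any input.
total = sum(partial_sums) % M
print(total)  # 555
```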

Plain metaphor example: SMPC is like a group of people each solving different pieces of a puzzle. Each person only sees their own piece of the puzzle, and at the end, they combine their pieces to see the complete image, but no one ever learns what the other pieces look like before the final combination.

Use case: In 2020, a group of European health care organizations leveraged SMPC to securely analyze patient data across multiple hospitals without exposing sensitive personal data. Using SMPC, they could compute joint results of disease spread and treatment efficacy on encrypted data without any hospital revealing its internal data or jeopardizing patient privacy.

Synthetic Data

What it is: Synthetic data is artificial, computer-generated information that mirrors the structure, patterns, and statistical properties of real-world data but does not contain any actual personal or sensitive information. It is created using algorithms and statistical models that replicate the patterns and relationships found in the real data. By design, synthetic data behaves similarly to real data in analysis, but because it doesn’t trace back to any individual’s real information, it can be shared more freely.

How it works: Synthetic data is created by models that learn from real data to generate new, similar data. One common method, generative adversarial networks (GANs), uses two parts: a generator that creates new data and a discriminator that checks if the data looks real. Other techniques focus on relationships between data points, like how age might be related to income. Instead of just changing ages randomly, the model keeps the overall pattern intact while altering the data, making it hard to reverse-engineer but still useful for analysis.
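The simplest possible generator illustrates the fit-then-sample idea: fit a distribution to a real numeric column and sample fresh values from it. Real generators (GANs, copulas, Bayesian networks) model many columns and their relationships jointly, but the principle is the same. The "real" ages below are themselves simulated stand-in data.

```python
import random
import statistics

random.seed(42)  # reproducibility of this sketch

# Stand-in "real" data: 1000 ages centered near 45.
real_ages = [random.gauss(45, 10) for _ in range(1000)]

# Fit: learn the column's statistical properties.
mu = statistics.fmean(real_ages)
sigma = statistics.stdev(real_ages)

# Sample: generate fresh values with the same distribution.
synthetic_ages = [random.gauss(mu, sigma) for _ in range(1000)]

# The synthetic column mirrors the real one statistically but contains
# no actual record from it.
print(round(statistics.fmean(synthetic_ages), 1))
```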

Plain metaphor example: An artist creates a painting based on a photograph. The artist doesn’t replicate the photo exactly but captures its key elements—such as color, shape, proportions, and scale—to create a unique painting that still holds a strong resemblance to the source photograph.

Use case: During the U.K.-U.S. PETs Prize Challenge, innovators were tasked with developing federated learning solutions to improve pandemic forecasting while maintaining privacy. Participants used a synthetic dataset that was created by the University of Virginia’s Biocomplexity Institute as a digital twin of a real population, preserving statistical and behavioral properties without exposing actual personal data. This approach demonstrated how synthetic data can support critical public health responses by enabling secure data sharing and analysis without compromising individual privacy.

Tokenization

What it is: Tokenization is the process of replacing sensitive data elements, such as credit card numbers or personal identifiers, with unique identifiers, or “tokens.” Unlike hashing, tokenization is reversible, making it suitable for situations where the original data may need to be retrieved. This makes tokenization ideal for scenarios where re-association may be needed, and it should not be used if the goal is to prevent re-identification by the primary data processor. The tokens do not have any meaningful value outside the context of the system that issued them, ensuring that they cannot be used to access the original data.

How it works: Tokenization works by generating a random string of text to stand in for a piece of sensitive data; the mapping between each token and its original value is kept in a secure lookup table, often called a token vault. The actual sensitive data is never transmitted or stored alongside the token. When the data needs to be accessed or processed, the system can use the token instead of the original data, ensuring that even if the token is intercepted, it has no useful value. This prevents sensitive information from being exposed during transactions or data storage.
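The token-vault pattern can be sketched in a few lines. Tokens are random rather than derived from the data, and the protected lookup table is the only path back to the original value (the class and card-number format below are illustrative).

```python
import secrets

class TokenVault:
    def __init__(self):
        self._vault = {}            # token -> original value (kept secured)

    def tokenize(self, sensitive_value):
        # Tokens are random, so they carry no information about the data.
        token = secrets.token_hex(8)
        self._vault[token] = sensitive_value
        return token

    def detokenize(self, token):
        # Reversal is possible only through the vault (authorized access).
        return self._vault[token]

vault = TokenVault()
token = vault.tokenize("4111-1111-1111-1111")   # example card number format
print(token)                                     # meaningless outside the vault
print(vault.detokenize(token))                   # reversible by the issuer
```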

Plain metaphor example: Tokenization is like getting a new membership card at a gym, where your real name and contact details are replaced with a unique number. Even though the number is on your card, no one can learn your real identity from just seeing the number—it’s stored securely by the gym.

Use case: Government programs such as the Supplemental Nutrition Assistance Program (SNAP), Medicare, and unemployment insurance limit data sharing with other agencies to safeguard citizens’ sensitive data. While this is a measure to minimize risk, keeping the data separate can hurt agencies’ ability to extract insights and understand their populations. Tokenization offers an opportunity for governments to share data across agencies without compromising individual privacy.

Trusted Execution Environment

What it is: A trusted execution environment (TEE) is a secure area within a processor that runs code in isolation from the rest of the system, ensuring that sensitive data is processed in a trusted and confidential manner. TEEs are designed to protect data and computations from being accessed or tampered with, even by the operating system or malicious software.

How it works: A TEE creates a secure enclave within a processor, where both code and data are isolated from the rest of the system. When a program runs inside a TEE, it is protected from external interference or observation, ensuring that sensitive operations can occur in a trusted environment. TEEs are typically used to process sensitive data, such as encryption keys or financial information, ensuring that this data remains private and secure even when the system itself may be compromised.

Plain metaphor example: A TEE is like having a locked safe inside your house that only you can access. Even if someone else enters your house, they can’t open the safe and see what’s inside because it’s securely isolated.

Use case: TEEs can be leveraged in any context in which a government entity might deal with sensitive data. For example, the Social Security Administration and the U.S. Department of Labor could use TEEs to detect fraudulent disability and unemployment claims without exposing the full databases of all individuals receiving government welfare.

Zero-Knowledge Proof

What it is: A zero-knowledge proof (ZKP) is a cryptographic method that allows one party to prove to another party that they know a piece of information (e.g., a password or secret) without revealing the information itself.

How it works: In a ZKP, the prover (who knows the secret) and the verifier (who wants to be convinced) engage in a protocol where the prover demonstrates knowledge of the secret without ever revealing it. The protocol typically involves the prover presenting evidence that they can correctly solve a problem or answer a question based on the secret, without actually disclosing the secret. ZKPs are used in many PETs to allow for secure authentication or transactions without revealing private data.
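The commit-challenge-respond rhythm can be sketched with the Schnorr identification protocol, a classic ZKP. The parameters below are toy values for readability; real systems use standardized groups.

```python
import secrets

p = 2**127 - 1   # Mersenne prime; illustrative only
g = 3

x = secrets.randbelow(p - 1)   # the prover's secret
y = pow(g, x, p)               # public key, known to the verifier

# Commit: the prover picks a random r and sends t = g^r.
r = secrets.randbelow(p - 1)
t = pow(g, r, p)

# Challenge: the verifier sends a random c.
c = secrets.randbelow(p - 1)

# Respond: the prover sends s = r + c*x (mod p-1); because r is random,
# s reveals nothing about x on its own.
s = (r + c * x) % (p - 1)

# Verify: g^s must equal t * y^c. Only someone who knows x can pass
# this check for a challenge they did not choose.
print(pow(g, s, p) == t * pow(y, c, p) % p)  # True
```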

Plain metaphor example: A ZKP is like a magician proving they know how a trick works, without ever revealing the secret behind the trick. They show you the result, but not the method used to achieve it.

Use case: Zero-knowledge proofs could enable users to prove they meet age requirements without revealing their exact age or identity. A 2022 demonstration developed by the innovation laboratory at France’s National Commission on Informatics and Liberty (CNIL) showcases a privacy-preserving age-verification system where a trusted third party certifies a user’s eligibility without disclosing personal data. This approach strengthens online privacy while ensuring compliance with age restrictions, offering a scalable solution for secure digital identity verification.
