How Automated Tools are Used in the Content Moderation Process

Automated tools are used to curate, organize, filter, and classify the information we see online. They are therefore pivotal in shaping not only the content we engage with, but also the experience each individual user has on a given platform. A host of automated tools, many fueled by artificial intelligence and machine learning, can be deployed during the content moderation process. These tools can be applied across a range of content categories and media formats, and at different stages of the content lifecycle, to identify, sort, and remove content. This section provides an overview of some of the most widely used automated tools and methods in this field, as well as their strengths and limitations.

Digital Hash Technology

Digital hash technology works by converting images and videos from an existing database into grayscale, overlaying them onto a grid, and assigning each square a numerical value. Together, these numerical values form a hash, or digital signature, which remains tied to the image or video and can be used to identify other iterations of the content either during ex-ante moderation or ex-post proactive moderation.1
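The grid-and-signature idea can be sketched as a toy "average hash", a much simpler cousin of production systems such as PhotoDNA, which use far more sophisticated signatures. In this illustration the image is assumed to be already decoded into a small 2D grid of grayscale values (0-255); all names and values are invented for the example.

```python
def average_hash(gray_grid):
    """One bit per grid cell: 1 if the cell is brighter than the average."""
    pixels = [p for row in gray_grid for p in row]
    avg = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > avg else 0)
    return bits

def hamming_distance(h1, h2):
    """Number of differing bits; small distances suggest near-duplicates."""
    return bin(h1 ^ h2).count("1")

original = [[10, 200], [30, 220]]
brightened = [[20, 210], [40, 230]]   # same image after a uniform brightness shift
distance = hamming_distance(average_hash(original), average_hash(brightened))
# distance == 0: the signature survives this simple manipulation
```

Because the hash encodes relative brightness rather than exact bytes, simple alterations leave the signature intact, which is what makes this family of techniques useful for finding "other iterations" of known content.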

Digital hash technology has thus far been widely adopted by internet platforms to identify CSAM and copyright-infringing material. The CSAM detection technology, known as PhotoDNA, was originally developed by Microsoft and has expanded to become a powerful tool used by companies such as Twitter, Google, and Facebook, as well as by law enforcement and organizations such as the National Center for Missing & Exploited Children. PhotoDNA generates digital hashes from a database of thousands of existing illegal CSAM images and can detect hashes across a broad spectrum in microseconds. In response to growing concerns around copyright infringement by its users, YouTube adapted PhotoDNA technology to create the ContentID technology. ContentID enables YouTube users to create digital hashes for their video content to help protect against copyright violations. Once these hashes have been created, all content that is subsequently uploaded to the YouTube platform is screened against its database of audio and video files in order to identify potential copyright violations.2 Both of these tools are particularly resilient against manipulation, including resizing, color alterations, and watermarking.3
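Once signatures exist, screening reduces to comparing each upload's hash against a database of known hashes. A minimal sketch, assuming integer hashes and an invented bit-distance threshold; the real matching criteria used by PhotoDNA and ContentID are proprietary.

```python
# Hypothetical screening of new uploads against known signatures.
KNOWN_HASHES = {0b10110100, 0b01011001}   # signatures of previously flagged content
MATCH_THRESHOLD = 2                       # max differing bits to count as a match

def bit_distance(a, b):
    """Number of bits on which two hashes disagree."""
    return bin(a ^ b).count("1")

def screen_upload(upload_hash):
    """True if the upload is close enough to any known signature."""
    return any(bit_distance(upload_hash, known) <= MATCH_THRESHOLD
               for known in KNOWN_HASHES)
```

An exact match is not required: the tolerance is what lets hash matching survive minor alterations such as resizing or recompression, the resilience described above.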

The databases of signatures that these algorithms are trained on are continuously updated. There are over 720,000 known instances of CSAM,4 and once new images or videos are identified, they are added to the database. Similarly, when copyright holders flag infringing content to YouTube, this content is added to the ContentID database so that future content screenings incorporate these materials.5 The expansion of these databases, along with the continuous evaluation of and updates to these software programs, aims to improve the effectiveness of these tools. This is one area where machine learning is deployed to use past learnings to inform future predictions and behaviors.6 Further, although image hashes are now much harder to circumvent, audio and video hashing can still be evaded by, for example, altering the length or encoding format of a file, since the altered file produces a different hash that no longer matches the database.7

Most recently, PhotoDNA technology has been adapted to detect and remove extremist content and terror propaganda-related images, video, and audio online. Like PhotoDNA, this tool, known as eGLYPH, is capable of detecting and removing content on a platform that has a corresponding hash, and is also able to prevent the upload of such content ex-ante.8 However, applying this automated technology to content moderation decisions around extremist content has raised significant concerns, because the definition of what constitutes extremist content, and what should therefore be included in hash databases, is vague and largely platform-dependent. In addition, most platforms focus their content moderation efforts on certain extremist groups, such as the Islamic State and al-Qaeda. As a result, these automated tools demonstrate a bias in terms of which groups they are trained to focus on, and are less reliable when addressing the larger corpus of extremist groups and movements that may use online services.

Furthermore, moderating extremist content often requires a nuanced understanding of varied regions and cultures and an appreciation for the context in which an image is posted, something automated tools do not have. For example, while platforms will want to take down terrorist propaganda that glorifies acts of gruesome violence, it is important to permit journalists and human rights organizations to raise awareness about terrorist atrocities. As a result, automated moderation of this content has resulted in overbroad takedowns and infringements on user expression.

There is also a significant lack of transparency and accountability around how digital hash technology is being deployed to identify and moderate extremist content. For example, in June 2017, Facebook, Microsoft, Twitter, and YouTube formed the Global Internet Forum to Counter Terrorism (GIFCT) in order to curb the spread of extremist content online. One of the main efforts of the GIFCT was the creation of a shared industry hash database that contains over 40,000 image and video hashes that can aid company efforts to moderate extremist content. However, despite the fact that the database has been used by companies for over two years, there has been little transparency around what specific groups this database focuses on, how content added to the database is vetted and verified as extremist content, and how much content and how many accounts have been correctly and erroneously removed across participating platforms as a result of the database.9 As a result, it is difficult to assess the effectiveness and accuracy of such tools.

Image Recognition

Digital hash technologies rely on image recognition, but image recognition can also be used more broadly during the content moderation process. For example, during ex-post proactive moderation, image recognition tools can identify specific objects within an image, such as a weapon, and decide, based on factors including user experience and risk, whether the image should be flagged to a human for review. Automated image recognition tools are currently employed by several internet platforms, as they help filter through and prioritize cases for human moderators, thus saving time.10 Although the algorithms that power image recognition tools are continuously reinforced when information about the ultimate decision a human moderator made is fed back into them, this feedback loop does not capture why the moderator made that decision. As a result, these algorithms are unable to develop into more dynamic tools that could incorporate nuanced and contextual insights into their detection procedures, such as whether content featuring a weapon is actually violent or, for example, satirical in nature.11 In addition, the accuracy of these models depends on the quality of the datasets they are trained on. If a model is trained on datasets that focus on specific types of weapons, it will reflect this bias and will not be able to accurately identify all potential instances of violent content containing weapons on a platform. There is also a lack of transparency around how these image recognition databases are compiled, what types of content they focus on, how effective and accurate they are across different categories of content, and how much user expression has been accurately and erroneously removed as a result of these tools.
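The flag-and-prioritize step described above can be sketched as a simple scoring queue. Here an image-recognition model is assumed to have already produced (label, confidence) detections per image; the labels, risk weights, and threshold are invented for illustration, since platforms do not publish their actual criteria.

```python
# Hypothetical triage for ex-post proactive moderation: score each image by
# its detections, then sort the human-review queue so riskier cases surface first.
RISK_WEIGHTS = {"weapon": 0.9, "blood": 0.7, "crowd": 0.2}

def risk_score(detections):
    """Sum of confidence-weighted risk over all detected objects."""
    return sum(conf * RISK_WEIGHTS.get(label, 0.0) for label, conf in detections)

def review_queue(images, threshold=0.5):
    """Keep only images scoring at or above the flagging threshold, riskiest first."""
    scored = [(img_id, risk_score(dets)) for img_id, dets in images]
    flagged = [(img_id, s) for img_id, s in scored if s >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

queue = review_queue([
    ("img-1", [("weapon", 0.8)]),   # high risk: routed to a human moderator
    ("img-2", [("crowd", 0.9)]),    # low risk: not flagged in this sketch
])
```

Note what this sketch cannot do: the score reflects only which objects were detected, not whether the image is violent, journalistic, or satirical, which is exactly the contextual gap discussed above.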

Metadata Filtering

Most digital files contain information that provides descriptive characteristics about their content, known as the file’s metadata. For example, an audio file containing a song might be labeled with the song’s title or length. Metadata filtering tools can be used during ex-ante and ex-post proactive moderation to search a series of files and identify content that fits a particular set of metadata parameters. These tools are used in particular to identify copyright-infringing materials. However, because a file’s metadata labels can be easily manipulated or mislabeled, the effectiveness and accuracy of metadata filtering tools is limited, and these tools can be easily gamed.12
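A minimal sketch of metadata filtering, and of why it is easy to game: the filter inspects only the descriptive tags, which a user can freely edit without touching the underlying content. The field names and values here are illustrative assumptions.

```python
def matches(metadata, params):
    """True only if every filter parameter is present and equal in the metadata."""
    return all(metadata.get(key) == value for key, value in params.items())

# A hypothetical filter targeting a known copyrighted recording.
copyright_filter = {"title": "Hit Song", "duration_sec": 215}

honest = {"title": "Hit Song", "duration_sec": 215}
relabeled = {"title": "My Vacation", "duration_sec": 215}  # same audio, edited tag

caught = matches(honest, copyright_filter)      # labels match, file is flagged
evaded = matches(relabeled, copyright_filter)   # one trivial edit defeats the filter
```

Unlike hash matching, which derives its signature from the content itself, metadata filtering trusts whatever label the file carries, which is why it is among the weakest of these tools.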

Natural Language Processing (NLP)

NLP is a set of techniques that use computers to parse text. In the context of content detection and moderation, text is typically parsed in order to make predictions about its meaning, such as the sentiments it expresses.13 Currently, a wide range of NLP tools can be purchased off the shelf and applied to a range of use cases, including spam detection, content filtering, and translation services. In the context of content moderation, NLP classifiers are particularly being used to detect hate speech and extremist content and to perform sentiment analysis on content.14 As outlined by researchers from the Center for Democracy & Technology, NLP classifiers are generally trained on text examples, known as documents, that have been annotated by humans to indicate whether they belong to a particular category (e.g., extremist content vs. not extremist content). When a model is provided a collection of documents, known as a corpus, it works to identify patterns and features associated with each annotated category. These corpora are pre-processed so that they numerically represent particular characteristics of the text, such as the presence or absence of a specific word. The annotated and pre-processed documents are used to train machine learning models to classify new documents, and the classifier is then tested on a separate, held-out sample of the data to determine how closely its classifications match those of the human coders.15
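The pipeline described above, from human-annotated documents to numeric bag-of-words features to a trained classifier, can be sketched end to end with a toy corpus. Everything here, from the four invented documents to the add-one smoothing, is a deliberately simplified illustration, not any platform's actual model or data.

```python
import math
from collections import Counter

# Toy annotated corpus: label 1 = objectionable, 0 = benign (human-assigned).
corpus = [
    ("attack the group now", 1),
    ("destroy them all", 1),
    ("lovely weather today", 0),
    ("see you at lunch", 0),
]

def train(documents):
    """Count word occurrences per annotated class (the bag-of-words features)."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in documents:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    """Pick the class whose unigram model (with add-one smoothing) gives the
    new document the highest log-probability."""
    vocab = set(counts[0]) | set(counts[1])
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        scores[label] = sum(
            math.log((counts[label][word] + 1) / (total + len(vocab)))
            for word in text.split()
        )
    return max(scores, key=scores.get)

model = train(corpus)
```

Even this toy version exhibits the limitations discussed below: it knows nothing about context or intent, and it can only ever separate the two narrow categories it was annotated for.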

Although internet platforms are increasingly exploring and adopting NLP classifiers, these technologies are limited for a number of reasons. First, NLP technologies are domain-specific, meaning that each can only focus on one particular type of objectionable content. In addition, because there is significant variation in how speech is expressed, these categories are very narrow: to maximize accuracy, a model is trained to detect and flag one specific type of hate speech.16 This means that a classifier trained to detect hate speech could only be trained to focus on a particular sub-domain of hate speech, such as anti-Semitic speech. Even if this classifier were trained on datasets that also included some examples of other forms of hate speech, it could still be applied with reasonable accuracy only to anti-Semitic speech. In addition, finding and compiling sufficiently comprehensive datasets to train NLP classifiers is a challenging, expensive, and tedious process. As a result, many researchers have resorted to filtering through content using search terms or hashtags that focus on subtypes of a particular domain of speech, such as hate speech directed at a certain religious group. However, this creates and operationalizes dataset and creator bias, which can disproportionately emphasize certain types of hate speech. This re-emphasizes that such tools cannot be widely applied to multiple forms of hate speech.17

Furthermore, in order for NLP classifiers to operate accurately, they need to be provided with clear and consistent parameters and definitions of speech. Depending on the type of speech, this can be challenging. For example, definitions around extremist content and disinformation are vague, and they are often unable to capture the full breadth, context, and nuances of such activity. On the other hand, tools that are developed based on definitions that are overly narrow may fail to detect some speech and may be easier to bypass.18
In addition, NLP classifiers are limited in that they are unable to comprehend the nuances and contextual elements of human speech, such as whether a word is being used in a literal or satirical context, or whether a derogatory term is being used in slang form. This decreases the accuracy of these classifiers, particularly when they are applied across platforms, content formats, languages, and contexts.19

Finally, there is also a lack of transparency around how corpora are compiled, what manual filtering processes—such as hashtag filtering—creators undergo to create these datasets, how accurate these tools are, and how much user expression these NLP tools remove both correctly and incorrectly.

Citations
  1. Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech."
  2. Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech."
  3. Kalev Leetaru, "The Problem With AI-Powered Content Moderation Is Incentives Not Technology," Forbes, March 19, 2019, source.
  4. Klonick, "The New Governors."
  5. Klonick, "The New Governors."
  6. Klonick, "The New Governors."
  7. Evan Engstrom and Nick Feamster, The Limits of Filtering: A Look at the Functionality & Shortcomings of Content Detection Tools, March 2017, source.
  8. Counter Extremism Project, "How CEP's eGLYPH Technology Works," Counter Extremism Project, last modified December 8, 2016, source.
  9. Some platforms, such as Facebook, YouTube, and Twitter, do provide limited disclosures on how much extremist content or how many accounts they remove in their transparency reports. Facebook also reports on how much extremist content the platform erroneously removed and restored. However, it is unclear what proportion of these removals were due to the use of the shared hash database.
  10. Accenture, Content Moderation: The Future is Bionic, 2017, source.
  11. Accenture, Content Moderation: The Future is Bionic.
  12. Engstrom and Feamster, The Limits of Filtering: A Look at the Functionality & Shortcomings of Content Detection Tools.
  13. Natasha Duarte, Emma Llansó, and Anna Loup, Mixed Messages? The Limits of Automated Social Media Content Analysis, November 28, 2017, source.
  14. Duarte, Llansó, and Loup, Mixed Messages?
  15. Duarte, Llansó, and Loup, Mixed Messages?
  16. Filippo Raso, Hannah Hilligoss, Vivek Krishnamurthy, Christopher Bavitz, and Levin Yerin Kim, Artificial Intelligence & Human Rights: Opportunities & Risks (September 25, 2018), Berkman Klein Center Research Publication No. 2018-6, available at SSRN: source or source.
  17. Duarte, Llansó, and Loup, Mixed Messages?
  18. Duarte, Llansó, and Loup, Mixed Messages?
  19. Raso et al., Artificial Intelligence & Human Rights: Opportunities & Risks.