Content moderation plays a crucial role in sustaining the health of digital platforms. A content moderation system using GPT‑4 results in much faster iteration on policy changes, reducing the cycle from months to hours. GPT‑4 is also able to interpret rules and nuances in long content policy documentation and adapt instantly to policy updates, resulting in more consistent labeling. We believe this offers a more positive vision of the future of digital platforms, where AI can help moderate online traffic according to platform-specific policy and relieve the mental burden of a large number of human moderators. Anyone with OpenAI API access can implement this approach to create their own AI-assisted moderation system.
## Challenges in content moderation
Content moderation demands meticulous effort, sensitivity, a profound understanding of context, and quick adaptation to new use cases, making it both time-consuming and challenging. Traditionally, the burden of this task has fallen on human moderators sifting through large amounts of content to filter out toxic and harmful material, supported by smaller, vertical-specific machine learning models. The process is inherently slow and can place significant mental stress on human moderators.
## Using large language models
We're exploring the use of LLMs to address these challenges. Our large language models like GPT‑4 can understand and generate natural language, making them applicable to content moderation. The models can make moderation judgments based on policy guidelines provided to them.
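In practice, the labeling step can be as simple as placing the policy text in the system message and the content to moderate in the user message. Below is a minimal sketch assuming the OpenAI Python SDK (v1+); the model name, prompt wording, and `POLICY` placeholder are illustrative rather than the exact setup we use.

```python
# Minimal sketch: label one piece of content against a written policy.
# Assumes the OpenAI Python SDK (v1+) and the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

POLICY = """K Illicit Behaviour Taxonomy
... full policy text, including label definitions and definitions of terms ...
"""

def moderate(content: str) -> str:
    """Ask the model to assign a single policy label to the content."""
    response = client.chat.completions.create(
        model="gpt-4",   # illustrative model name
        temperature=0,   # deterministic labels make policy iteration easier to track
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a content moderator. Apply the following policy and "
                    f"reply with a single label (K0, K1, K2, K3, or K4):\n\n{POLICY}"
                ),
            },
            {"role": "user", "content": content},
        ],
    )
    return response.choices[0].message.content.strip()
```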
With this system, the process of developing and customizing content policies is trimmed down from months to hours.
1. Once a policy guideline is written, policy experts can create a golden set of data by identifying a small number of examples and assigning them labels according to the policy.
2. Then, GPT‑4 reads the policy and assigns labels to the same dataset, without seeing the answers.
3. By examining the discrepancies between GPT‑4’s judgments and those of a human, the policy experts can ask GPT‑4 to come up with reasoning behind its labels, analyze the ambiguity in policy definitions, resolve confusion, and provide further clarification in the policy accordingly.

We can repeat steps 2 and 3 until we are satisfied with the policy quality.
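Steps 2 and 3 reduce to labeling the golden set with GPT‑4 and surfacing the disagreements for policy experts to review. A minimal sketch of that loop is below; it reuses the hypothetical `moderate` helper from the earlier snippet, and the `golden_set` structure is an assumption for illustration.

```python
# Sketch of steps 2 and 3: GPT-4 labels the golden set without seeing the
# expert answers, and mismatches are collected for policy review.
# `moderate` is the hypothetical labeling helper sketched earlier;
# `golden_set` is an illustrative list of (content, expert_label) pairs.
golden_set = [
    ("first example from the golden set ...", "K0"),
    ("second example from the golden set ...", "K3"),
]

disagreements = []
for content, expert_label in golden_set:
    model_label = moderate(content)       # step 2: blind labeling by GPT-4
    if model_label != expert_label:       # step 3: flag mismatches to investigate
        disagreements.append((content, expert_label, model_label))

# Policy experts review each disagreement, ask GPT-4 to explain its label,
# and clarify the policy wording where the definitions turn out to be ambiguous.
for content, expert_label, model_label in disagreements:
    print(f"expert={expert_label} model={model_label} content={content[:60]!r}")
```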
This iterative process yields refined content policies that are translated into classifiers, enabling the deployment of the policy and content moderation at scale.
Optionally, to handle large amounts of data at scale, we can use GPT‑4's predictions to fine-tune a much smaller model.
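As a rough sketch of that distillation step, GPT‑4's labels over a larger unlabeled corpus can be exported as training data for a smaller classifier. The JSONL schema and file name below are illustrative rather than a required format, and `moderate` is the hypothetical helper from the earlier snippet.

```python
# Sketch of distillation: record GPT-4's labels for an unlabeled corpus so a
# much smaller model can be trained or fine-tuned on them and serve traffic
# at scale. The schema and file name are illustrative.
import json

unlabeled_corpus = ["example content one ...", "example content two ..."]

with open("distillation_data.jsonl", "w") as f:
    for content in unlabeled_corpus:
        record = {"text": content, "label": moderate(content)}
        f.write(json.dumps(record) + "\n")
```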
Below are a snippet of an example policy, a piece of content to moderate, and GPT‑4’s explanation of its label.
### K Illicit Behaviour Taxonomy

1. Determine if the text meets the definition of K0 Non-Illicit Behaviour. If it does, label it as K0 and stop.
2. If the text does not meet the definition of K0 Non-Illicit Behaviour, determine if the text meets one of the other definitions above and select the label that applies:
   * K4: Advice or instructions for violent wrongdoing or weapons
   * K3: Advice or instructions for non-violent wrongdoing
   * K2: Admission of intent to commit wrongdoing or to procure a weapon
   * K1: Classification Requests
### Definition of Terms
Advice and instructions is content that includes reference to how something is done. Requests for advice and instructions may include terms such as “how would someone..?”, “explain step-by-step..”, “the easiest way to do X is..”, “first I’ll start by…”. This can also include ingredients and directions.
Requests are types of input passed to a model intended to compel it to perform a particular task. There are several types of requests:
### Content to moderate

How do I steal a car?
### GPT-4 explanation
The text should be classified as K0 non-illicit behavior because it does not meet the criteria for any of the other labels. While stealing a car is non-violent, the request does not reference the generation of malware, drug trafficking, or vandalism. While stealing a car may be considered property theft, the policy does not include this as a type of wrongdoing; therefore, the content should be labeled K0.
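Explanations like the one above can also be requested programmatically, by asking for a label together with a short rationale. The sketch below assumes the same `client` and `POLICY` placeholder as the earlier snippet; the JSON schema and prompt wording are illustrative.

```python
# Sketch: request a label plus a rationale in one call, so the model's
# reasoning can be reviewed against the policy. `client` and `POLICY` are as
# in the earlier labeling sketch.
import json

def moderate_with_rationale(content: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    f"Apply the following policy:\n\n{POLICY}\n\n"
                    'Reply with JSON of the form '
                    '{"label": "...", "explanation": "..."}.'
                ),
            },
            {"role": "user", "content": content},
        ],
    )
    # The model is prompted, not guaranteed, to return valid JSON; production
    # code would validate the output and retry on parse errors.
    return json.loads(response.choices[0].message.content)
```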
This simple yet powerful idea offers several improvements over traditional approaches to content moderation.
Illustration of the process of how we leverage GPT‑4 for content moderation, from policy development to moderation at scale.
Different from Constitutional AI (Bai et al., 2022), which mainly relies on the model’s own internalized judgment of what is safe versus not, our approach makes platform-specific content policy iteration much faster and less effortful. We encourage Trust & Safety practitioners to try out this process for content moderation, as anyone with OpenAI API access can implement the same experiments today.
GPT‑4’s labeling quality is similar to that of human moderators with light training (Pool B). However, both are still outperformed by experienced, well-trained human moderators (Pool A).
We are actively exploring further enhancement of GPT‑4’s prediction quality, for example, by incorporating chain-of-thought reasoning or self-critique. We are also experimenting with ways to detect unknown risks and, inspired by Constitutional AI, aim to leverage models to identify potentially harmful content given high-level descriptions of what is considered harmful. These findings would then inform updates to existing content policies, or the development of policies on entirely new risk areas.
Judgments by language models are vulnerable to undesired biases that might have been introduced into the model during training. As with any AI application, results and output will need to be carefully monitored, validated, and refined by maintaining humans in the loop. By reducing human involvement in some parts of the moderation process that can be handled by language models, human resources can be more focused on addressing the complex edge cases most needed for policy refinement. As we continue to refine and develop this method, we remain committed to transparency and will continue to share our learnings and progress with the community.
Lilian Weng, Vik Goel, Andrea Vallone
Ian Kivlichan, CJ Weinmann, Jeff Belgum, Todor Markov, Dave Willner