Paper ID

5471114e37448bea2457b74894b1ecb92bbcfdf6


Title

Bias-Aware Contrastive Learning for Fair Media Bias Detection in Language Models


Introduction

Problem Statement

Current language models struggle to detect and classify media bias consistently across demographic groups and topics, leading to uneven treatment of content in hate speech and misinformation detection. This inconsistency can produce biased outcomes, amplify societal inequalities, and undermine the effectiveness of content moderation systems.

Motivation

Existing methods often rely on supervised fine-tuning on labeled datasets or prompt engineering for zero-shot classification, which may not generalize well across diverse contexts and can perpetuate biases present in training data. Recent advances in contrastive learning and adversarial training inspire our novel approach to make language models more robust and fair in detecting media bias. By leveraging synthetic data generation and adversarial techniques, we aim to create a model that can identify bias while remaining invariant to specific demographic attributes.


Proposed Method

We introduce Bias-Aware Contrastive Learning (BACL), a two-stage training process. First, we generate a large corpus of synthetic biased and unbiased text pairs using a diverse set of demographic attributes and topics. We then train the model using a contrastive loss that encourages it to distinguish between biased and unbiased versions of the same content. To ensure fairness, we incorporate an adversarial component that penalizes the model for relying too heavily on specific demographic attributes when making bias predictions. This is achieved by training a separate adversarial classifier to predict the demographic attributes from the model's internal representations, and adding a gradient reversal layer to discourage the main model from learning these spurious correlations. Additionally, we introduce a novel 'bias swapping' data augmentation technique, where we automatically transform biased text to target different demographic groups, forcing the model to focus on the underlying bias rather than specific group associations.
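The adversarial component hinges on the gradient reversal layer; a minimal PyTorch sketch is given below (the scaling factor lambda_ and the helper names are our own illustrative choices, not fixed by the proposal):

```python
import torch
from torch.autograd import Function


class GradReverse(Function):
    """Identity in the forward pass; multiplies gradients by -lambda_ in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the encoder, so that
        # minimizing the adversary's loss pushes the encoder to *remove* demographic
        # information from its representations.
        return -ctx.lambda_ * grad_output, None


def grad_reverse(x, lambda_=1.0):
    return GradReverse.apply(x, lambda_)
```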


Experiments Plan

Step-by-Step Experiment Plan

Step 1: Data Preparation

Use GPT-4 to generate a synthetic dataset of 10,000 text pairs (biased and unbiased versions) covering various topics and demographic attributes. Ensure diversity in topics (e.g., politics, sports, entertainment) and demographic attributes (e.g., race, gender, age, religion). Use prompts like: 'Generate a biased news headline about [TOPIC] targeting [DEMOGRAPHIC]. Now rewrite it as an unbiased version.'
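A hedged sketch of the generation loop, assuming the OpenAI Python client; the topic and demographic lists, the JSON output schema, and the file name are illustrative placeholders, and in practice the loop would be repeated per combination until roughly 10,000 pairs are collected:

```python
import json
import itertools
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOPICS = ["politics", "sports", "entertainment"]      # illustrative subset
DEMOGRAPHICS = ["race", "gender", "age", "religion"]  # illustrative subset

PROMPT = (
    "Generate a biased news headline about {topic} targeting {demo}. "
    "Then rewrite it as an unbiased version. "
    "Return JSON with keys 'biased' and 'unbiased'."
)

pairs = []
for topic, demo in itertools.product(TOPICS, DEMOGRAPHICS):
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(topic=topic, demo=demo)}],
        temperature=1.0,
    )
    # Assumes the model followed the JSON instruction; a robust pipeline would validate this.
    record = json.loads(resp.choices[0].message.content)
    record.update({"topic": topic, "demographic": demo})
    pairs.append(record)

with open("synthetic_pairs.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```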

Step 2: Model Selection

Choose BERT-base-uncased as the base model for fine-tuning. We'll use the Hugging Face Transformers library for implementation.
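Loading the encoder with Transformers and pooling token states into a single text embedding might look as follows (mean pooling is an assumption; the proposal does not fix a pooling strategy):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")


def embed(texts, max_length=128):
    """Return one embedding per text via mean pooling over token hidden states."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)           # (B, 768)
```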

Step 3: Implement BACL

Implement the three components of BACL (a sketch of parts (a) and (c) follows below):

  a) Contrastive learning: implement a contrastive loss function that maximizes the similarity between the embeddings of a biased text and its unbiased version while minimizing similarity with other samples.
  b) Adversarial component: implement an adversarial classifier that tries to predict demographic attributes from the model's representations; use a gradient reversal layer to make the main model invariant to these attributes.
  c) Bias swapping: implement a function that takes a biased text and swaps demographic attributes to create additional training samples.
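Parts (a) and (c) might be realized as follows, with part (b) reusing the gradient reversal layer sketched under Proposed Method; the temperature value and the swap-lexicon format are assumptions:

```python
import torch
import torch.nn.functional as F


def contrastive_loss(biased_emb, unbiased_emb, temperature=0.07):
    """InfoNCE-style loss: each biased text should match its own unbiased rewrite
    and not the other rewrites in the batch. Temperature is an assumed hyperparameter."""
    b = F.normalize(biased_emb, dim=-1)
    u = F.normalize(unbiased_emb, dim=-1)
    logits = b @ u.t() / temperature                        # (B, B) similarity matrix
    targets = torch.arange(b.size(0), device=b.device)      # positives on the diagonal
    return F.cross_entropy(logits, targets)


def bias_swap(text, source_group, target_group, swap_lexicon):
    """Naive bias swapping: replace mentions of one demographic group with another
    using a hand-built lexicon (a placeholder for the automatic transformation
    described in the proposal)."""
    out = text
    for src, tgt in swap_lexicon.get((source_group, target_group), []):
        out = out.replace(src, tgt)
    return out
```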

Step 4: Training

Train the model using the BACL approach. Use 80% of the synthetic data for training and 20% for validation. Monitor the contrastive loss and the adversarial classifier's accuracy during training: the adversary's accuracy falling toward chance indicates that the encoder is becoming invariant to demographic attributes.
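A condensed training-step sketch under the assumptions above, reusing the embed, contrastive_loss, and grad_reverse helpers from the earlier sketches; the adversary architecture, loss weighting, and learning rate are illustrative:

```python
import torch
import torch.nn as nn

adversary = nn.Linear(768, 4)  # predicts the demographic attribute (4 classes assumed)
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(adversary.parameters()), lr=2e-5
)


def training_step(batch, lambda_adv=1.0):
    """batch is assumed to hold lists of 'biased'/'unbiased' texts and a LongTensor
    of demographic class indices under 'demographic_label'."""
    biased_emb = embed(batch["biased"])
    unbiased_emb = embed(batch["unbiased"])

    # (a) pull each biased text toward its own unbiased rewrite
    loss_con = contrastive_loss(biased_emb, unbiased_emb)

    # (b) the adversary tries to recover the demographic attribute; the gradient
    # reversal layer turns this into a penalty on the encoder
    adv_logits = adversary(grad_reverse(biased_emb, lambda_adv))
    loss_adv = nn.functional.cross_entropy(adv_logits, batch["demographic_label"])

    loss = loss_con + loss_adv
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_con.item(), loss_adv.item()
```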

Step 5: Evaluation

Evaluate the model on existing media bias datasets such as MBIC and BABE, as well as on hate speech detection benchmarks like HateXplain. Compare against baselines including fine-tuned BERT and RoBERTa models, as well as few-shot prompting with GPT-3.5 and GPT-4. Metrics to use: accuracy, F1-score, and demographic parity difference (to measure fairness across groups).
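Demographic parity difference can be computed directly from binary predictions and group labels; a minimal sketch of the metric:

```python
import numpy as np


def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive ('biased') prediction rate across demographic groups.
    y_pred: binary predictions (1 = biased); groups: group label per example."""
    y_pred, groups = np.asarray(y_pred), np.asarray(groups)
    rates = [y_pred[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)
```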

Step 6: Ablation Studies

Conduct ablation studies to quantify the impact of each component of BACL (these reduce to configuration variants of the same training script, as sketched below):

  a) Train without the adversarial component.
  b) Train without bias swapping.
  c) Train with different sizes of synthetic data.
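One way to organize these runs is a dictionary of configuration variants fed to the same training script; the variant names and the reduced data size are illustrative:

```python
ABLATIONS = {
    "full":         {"use_adversary": True,  "use_bias_swap": True,  "n_synthetic": 10_000},
    "no_adversary": {"use_adversary": False, "use_bias_swap": True,  "n_synthetic": 10_000},
    "no_bias_swap": {"use_adversary": True,  "use_bias_swap": False, "n_synthetic": 10_000},
    "quarter_data": {"use_adversary": True,  "use_bias_swap": True,  "n_synthetic": 2_500},
}
```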

Step 7: Analysis

Analyze the model's performance across different demographic groups and topics. Use techniques like LIME or SHAP to interpret the model's decisions and ensure it's not relying on spurious correlations.
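A sketch of the LIME analysis, assuming a trained classification head (here called classifier) on top of the pooled embeddings from the earlier sketch; SHAP could be substituted analogously:

```python
import torch
from lime.lime_text import LimeTextExplainer


def predict_proba(texts):
    """Return class probabilities for a list of texts (classifier is an assumed
    trained head over the embed pooling defined earlier)."""
    with torch.no_grad():
        logits = classifier(embed(list(texts)))
    return torch.softmax(logits, dim=-1).numpy()


explainer = LimeTextExplainer(class_names=["unbiased", "biased"])
explanation = explainer.explain_instance(
    "Young people these days are lazy and entitled, always expecting handouts.",
    predict_proba,
    num_features=8,
)
print(explanation.as_list())  # tokens LIME finds most influential for the prediction
```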

Test Case Examples

Baseline Prompt Input (Fine-tuned BERT)

Immigrants are flooding into our country, taking jobs from hardworking citizens.

Baseline Prompt Expected Output (Fine-tuned BERT)

Biased

Baseline Prompt Input (Few-shot GPT-4)

Classify the following statement as biased or unbiased: 'Women are too emotional to be effective leaders in high-stress situations.'

Baseline Prompt Expected Output (Few-shot GPT-4)

Biased

Proposed Prompt Input (BACL)

Young people these days are lazy and entitled, always expecting handouts.

Proposed Prompt Expected Output (BACL)

Biased (with confidence score and explanation: 'This statement makes a sweeping generalization about an entire age group, using negative stereotypes without factual basis.')

Explanation

While both baselines can identify obvious bias, BACL provides a more nuanced understanding by offering confidence scores and explanations. It's also designed to be more consistent across different demographic groups, which we would demonstrate through multiple examples targeting various groups.

Fallback Plan

If BACL doesn't significantly outperform baselines, we can pivot to an analysis paper exploring the challenges of fair bias detection. We would conduct a thorough error analysis to understand where and why the model fails, particularly focusing on differences across demographic groups. We could also explore the quality and diversity of our synthetic data, analyzing how different prompting strategies for data generation affect model performance. Additionally, we might investigate how the model's performance varies with the subtlety of bias, creating a spectrum from obvious to very subtle biases and analyzing performance across this spectrum. This could provide valuable insights into the limitations of current approaches and guide future research directions in fair AI for content moderation.

