Paper ID

0a4b8b161931799d5c6bc3ecf07c53bae0e9e502


Title

Combining human-annotated debiasing and KGAT to reduce gender and geographic biases in LLMs.


Introduction

Problem Statement

Combining debiasing models trained on human-annotated examples with knowledge graph-augmented training will significantly reduce both gender and geographic biases in large language models, as measured by improvements in demographic parity and equal opportunity metrics.

Motivation

Existing methods for bias mitigation in large language models (LLMs) typically address gender or geographic bias separately, using techniques such as fairness-aware neural language models or post-hoc debiasing. These approaches do not explore the potential synergy between debiasing models trained on human-annotated examples and knowledge graph-augmented training (KGAT) for addressing both kinds of bias simultaneously. This gap is significant because treating the biases in isolation may miss interactions that would yield more comprehensive mitigation. The hypothesis fills that gap by testing the combined effect of the two methods, which has not been extensively explored in the literature.


Proposed Method

This research explores the synergistic effect of combining debiasing models trained on human-annotated examples with knowledge graph-augmented training (KGAT) to mitigate gender and geographic biases in large language models (LLMs). The hypothesis posits that the combination reduces biases more comprehensively than either method alone. Debiasing on human-annotated examples fine-tunes the LLM on datasets in which biases have been manually identified and corrected, directly addressing known biases. KGAT integrates structured, domain-specific knowledge that supplies context and factual information, helping to correct biased associations. Combining the two lets the model benefit from both explicit bias correction and enhanced contextual understanding, which should yield improvements on fairness metrics such as demographic parity and equal opportunity. This addresses a gap in existing research, where the methods are typically applied in isolation and potential interactions are missed. The expected outcome is a significant reduction in both gender and geographic biases, making LLMs more equitable across diverse applications.

Background

Debiasing Models Trained on Human-Annotated Examples: This variable involves fine-tuning language models on datasets annotated to highlight and correct biases. The process includes collecting diverse datasets, having human annotators mark biased instances, and training models on these corrected datasets. This approach is expected to directly reduce biases by adjusting the model's internal representations. It is chosen for its ability to leverage human insights into bias correction, which is crucial for addressing nuanced biases that automated methods might miss.

Knowledge Graph-Augmented Training (KGAT): KGAT uses structured knowledge from real-world knowledge graphs to enhance the model's understanding and reduce biased outputs. The method integrates knowledge graphs during training to provide additional context that helps correct biased associations. It is selected because it supplies factual context that can counteract biases inherent in the training data. The expected role of KGAT is to improve the model's contextual understanding, thereby reducing biases related to geographic and demographic information.

Implementation

The proposed method involves two main steps. First, the language model is fine-tuned on a dataset of human-annotated examples in which biases have been identified and corrected. This step involves collecting a diverse dataset, having human annotators mark biased instances, and training the model on the corrected examples, with the goal of adjusting the model's internal representations to reduce biases. Second, the model undergoes knowledge graph-augmented training (KGAT), in which structured knowledge from real-world knowledge graphs is integrated into the training process to provide additional context and factual information that helps the model correct biased associations. The integration of the two methods is expected to leverage the strengths of both: direct bias correction from human annotations and enhanced contextual understanding from KGAT. The model produced by the debiasing step serves as the starting point for the KGAT step, so the final model benefits from both explicit bias correction and contextual enhancement. The expected outcome is a significant reduction in both gender and geographic biases, as measured by improvements in demographic parity and equal opportunity metrics.


Experiments Plan

Operationalization Information

Please implement an experiment to test the hypothesis that combining debiasing models trained on human-annotated examples with knowledge graph-augmented training (KGAT) will significantly reduce both gender and geographic biases in large language models, compared to using either method alone.

Experiment Overview

This experiment will compare three approaches to bias mitigation in LLMs:
1. Baseline 1: Debiasing with human-annotated examples only
2. Baseline 2: Knowledge Graph-Augmented Training (KGAT) only
3. Experimental: Combined approach (debiasing + KGAT)

The experiment should evaluate these approaches using gender and geographic bias datasets, measuring improvements in demographic parity and equal opportunity metrics.

Pilot Mode Implementation

Implement a global variable PILOT_MODE with three possible settings: MINI_PILOT, PILOT, or FULL_EXPERIMENT.

Start by running the MINI_PILOT, then if everything looks good, run the PILOT. After the pilot completes, stop and do not run the FULL_EXPERIMENT (a human will verify results and manually change to FULL_EXPERIMENT if needed).
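A minimal sketch of how the PILOT_MODE switch could be wired; the sample counts and epoch budgets below are illustrative assumptions rather than values fixed by this plan:

```python
# Global pilot-mode switch; the sizes below are illustrative assumptions.
PILOT_MODE = "MINI_PILOT"  # one of: "MINI_PILOT", "PILOT", "FULL_EXPERIMENT"

PILOT_CONFIGS = {
    # Tiny run to verify the code executes end to end.
    "MINI_PILOT": {"train_samples": 50, "eval_samples": 50, "epochs": 1},
    # Small but meaningful run for sanity-checking trends.
    "PILOT": {"train_samples": 1000, "eval_samples": 500, "epochs": 2},
    # Full data and training budget (only after human review of pilot results).
    "FULL_EXPERIMENT": {"train_samples": None, "eval_samples": None, "epochs": 3},
}

CONFIG = PILOT_CONFIGS[PILOT_MODE]
```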

Data Requirements

  1. Human-annotated bias datasets:
     - For gender bias: use a subset of the "Bias in Bios" dataset
     - For geographic bias: use a subset of the "FairFace" dataset
     - Each example should pair the original (potentially biased) text with the human-corrected (debiased) version

  2. Knowledge graph:
     - Use a pre-built knowledge graph containing factual information about gender and geographic entities
     - The knowledge graph should be in a standard format (e.g., RDF triples)

  3. Evaluation datasets:
     - Gender bias evaluation: use held-out examples from "Bias in Bios" (disjoint from training)
     - Geographic bias evaluation: use held-out examples from "FairFace" (disjoint from training)
Model Implementation

Base Model

Use a small pre-trained language model (e.g., DistilBERT or a small GPT-2) as the base model for all three approaches to ensure the experiment runs efficiently in the pilot modes.
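A minimal sketch of loading the base model with Hugging Face Transformers, assuming DistilBERT is used as a sequence classifier; the label count is a placeholder to be set from the actual task:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

BASE_MODEL_NAME = "distilbert-base-uncased"
NUM_LABELS = 2  # placeholder; set to the real number of task labels

def load_base_model():
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        BASE_MODEL_NAME, num_labels=NUM_LABELS
    )
    return tokenizer, model
```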

Baseline 1: Debiasing with Human-Annotated Examples

  1. Fine-tune the base model on the human-annotated debiasing dataset
  2. The model should learn to generate debiased outputs based on the human corrections
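A minimal sketch of the Baseline 1 fine-tuning step, assuming the human-corrected texts and their task labels are fed to the classifier loaded above; a plain PyTorch loop is shown rather than any particular trainer:

```python
import torch
from torch.utils.data import DataLoader

def finetune_on_debiased(model, tokenizer, examples, epochs=1, lr=2e-5,
                         batch_size=16, device="cpu"):
    """Fine-tune the classifier on human-corrected (debiased) texts."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    # collate_fn=list keeps each batch as a list of AnnotatedExample objects.
    loader = DataLoader(examples, batch_size=batch_size, shuffle=True, collate_fn=list)
    for _ in range(epochs):
        for batch in loader:
            texts = [ex.debiased_text for ex in batch]
            labels = torch.tensor([ex.label for ex in batch], device=device)
            enc = tokenizer(texts, padding=True, truncation=True,
                            return_tensors="pt").to(device)
            out = model(**enc, labels=labels)
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```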

Baseline 2: Knowledge Graph-Augmented Training (KGAT)

  1. Implement a KGAT approach that integrates knowledge graph information during model training
  2. For each training example, retrieve relevant knowledge graph triples
  3. Incorporate these triples into the training process (e.g., by adding them to the input or using them in a specialized attention mechanism)
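One simple, hedged realization of the KGAT step is to retrieve triples whose subject appears in the input and verbalize them as extra context before training on the augmented text; the string-matching retrieval below is an illustrative assumption, not the only (or strongest) way to integrate the knowledge graph:

```python
from typing import List, Tuple

def retrieve_triples(text: str, triples: List[Tuple[str, str, str]],
                     max_triples: int = 3) -> List[Tuple[str, str, str]]:
    """Naive retrieval: keep triples whose subject string occurs in the text."""
    text_lower = text.lower()
    hits = [t for t in triples if t[0].lower() in text_lower]
    return hits[:max_triples]

def augment_with_kg(text: str, triples: List[Tuple[str, str, str]]) -> str:
    """Prepend verbalized knowledge-graph facts to the input text."""
    facts = retrieve_triples(text, triples)
    if not facts:
        return text
    fact_str = " ".join(f"{s} {p} {o}." for s, p, o in facts)
    return f"{fact_str} [SEP] {text}"
```

The augmented texts can then be passed through the same fine-tuning loop sketched for Baseline 1.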

Experimental: Combined Approach

  1. First, fine-tune the model on the human-annotated debiasing dataset (as in Baseline 1)
  2. Then, apply KGAT to the resulting model (as in Baseline 2)
  3. Ensure that the debiased representations from step 1 are preserved while enhancing them with knowledge graph information
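A minimal sketch of how the combined condition could chain the two steps, reusing the hypothetical helpers above; using a smaller learning rate in the KGAT phase is one assumed way to keep the debiased representations from step 1 largely intact:

```python
def train_combined(model, tokenizer, debias_examples, kgat_examples, kg_triples,
                   device="cpu"):
    """Experimental condition: human-annotated debiasing followed by KGAT."""
    # Step 1: explicit bias correction from human annotations.
    model = finetune_on_debiased(model, tokenizer, debias_examples, device=device)

    # Step 2: KGAT on knowledge-graph-augmented inputs, with a lower learning
    # rate so the debiased representations from step 1 are largely preserved.
    for ex in kgat_examples:
        ex.debiased_text = augment_with_kg(ex.debiased_text, kg_triples)
    model = finetune_on_debiased(model, tokenizer, kgat_examples, lr=1e-5,
                                 device=device)
    return model
```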

Evaluation Metrics

Demographic Parity

Measure whether the model's outputs are independent of protected attributes (gender, geography):
1. Calculate the probability of a positive outcome for each demographic group
2. Compute the difference between these probabilities
3. A smaller difference indicates better demographic parity
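A minimal sketch of the demographic parity gap; `preds` are binary model decisions and `groups` the protected-attribute values (names are illustrative):

```python
import numpy as np

def demographic_parity_gap(preds, groups):
    """Spread of positive-prediction rates across demographic groups (max - min)."""
    preds, groups = np.asarray(preds), np.asarray(groups)
    rates = [preds[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)
```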

Equal Opportunity

Measure whether the model's true positive rates are equal across demographic groups:
1. Calculate the true positive rate for each demographic group
2. Compute the difference between these rates
3. A smaller difference indicates better equal opportunity
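A corresponding sketch for the equal opportunity gap, computed as the spread of true positive rates across groups:

```python
import numpy as np

def equal_opportunity_gap(preds, labels, groups):
    """Spread of true positive rates across demographic groups (max - min)."""
    preds, labels, groups = map(np.asarray, (preds, labels, groups))
    tprs = []
    for g in np.unique(groups):
        mask = (groups == g) & (labels == 1)
        if mask.sum() > 0:
            tprs.append(preds[mask].mean())
    return max(tprs) - min(tprs) if tprs else 0.0
```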

Statistical Significance

Use bootstrap resampling to determine if differences between approaches are statistically significant:
1. Perform bootstrap resampling on the evaluation results
2. Calculate 95% confidence intervals for each metric
3. Report whether differences between approaches are statistically significant
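A minimal sketch of a percentile bootstrap for either metric; the number of resamples and the fixed seed are illustrative choices:

```python
import numpy as np

def bootstrap_ci(metric_fn, data, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a fairness metric.

    `data` is a tuple of equal-length arrays (e.g., (preds, labels, groups)) and
    `metric_fn` takes the resampled arrays and returns a scalar.
    """
    rng = np.random.default_rng(seed)
    n = len(data[0])
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats.append(metric_fn(*(np.asarray(a)[idx] for a in data)))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

Non-overlapping intervals between two approaches can be reported as a significant difference; bootstrapping the difference between two approaches' metrics on the same resampled indices is a stricter alternative.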

Experiment Workflow

  1. Load and preprocess the datasets based on the current PILOT_MODE
  2. Implement and train the three model approaches
  3. Evaluate each model on the gender and geographic bias evaluation datasets
  4. Calculate demographic parity and equal opportunity metrics
  5. Perform statistical significance testing
  6. Generate a comprehensive report with results, visualizations, and analysis
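A hedged end-to-end sketch tying the hypothetical pieces above together, with seeds fixed for reproducibility; `train_condition` and `predict` are placeholder names for the per-condition training dispatch and the evaluation pass, which are not specified here:

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """Fix all random seeds for reproducibility across runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def run_experiment():
    set_seed(42)
    cfg = PILOT_CONFIGS[PILOT_MODE]

    # Paths and slicing are illustrative; None slices keep the full dataset.
    train = load_annotated_examples("data/train.jsonl")[: cfg["train_samples"]]
    evalset = load_annotated_examples("data/eval.jsonl")[: cfg["eval_samples"]]
    kg = load_kg_triples("data/kg_triples.tsv")

    results = {}
    for name in ("debias_only", "kgat_only", "combined"):
        tokenizer, model = load_base_model()
        model = train_condition(name, model, tokenizer, train, kg)   # placeholder dispatch
        preds, labels, groups = predict(model, tokenizer, evalset)   # placeholder eval pass
        results[name] = {
            "demographic_parity_gap": demographic_parity_gap(preds, groups),
            "equal_opportunity_gap": equal_opportunity_gap(preds, labels, groups),
            "dp_ci": bootstrap_ci(demographic_parity_gap, (preds, groups)),
            "eo_ci": bootstrap_ci(equal_opportunity_gap, (preds, labels, groups)),
        }
    return results
```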

Output Requirements

  1. Performance metrics for each approach on both gender and geographic bias datasets
  2. Statistical significance of differences between approaches
  3. Visualizations comparing the three approaches
  4. Analysis of which types of biases were most effectively mitigated by each approach
  5. Detailed logs of the training and evaluation process

Please implement this experiment following best practices for reproducibility and code organization. Ensure all random seeds are fixed for reproducibility across runs.

End Note:

The source paper is Paper 0: Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection (83 citations, 2022). This idea draws upon a trajectory of prior work, as seen in the following sequence: Paper 1 --> Paper 2 --> Paper 3 --> Paper 4 --> Paper 5 --> Paper 6. The analysis reveals a progression from understanding the biases in data filtering to embedding legal knowledge in AI systems and optimizing LLMs for legal tasks. The existing literature highlights the need for transparency, ethical considerations, and technical improvements in using LLMs for legal applications. A research idea that advances this field could focus on developing a framework for evaluating and mitigating biases in LLMs used for legal information extraction, ensuring that the models align with diverse legal standards and societal values.
The initial trend observed from the progression of related work highlights a consistent research focus. However, the final hypothesis proposed here is not merely a continuation of that trend — it is the result of a deeper analysis of the hypothesis space. By identifying underlying gaps and reasoning through the connections between works, the idea builds on, but meaningfully diverges from, prior directions to address a more specific challenge.


References

  1. Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection (2022)
  2. Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset (2022)
  3. Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans (2022)
  4. A Short Survey of Viewing Large Language Models in Legal Aspect (2023)
  5. Large Language Models are legal but they are not: Making the case for a powerful LegalLLM (2023)
  6. Automatic Information Extraction From Employment Tribunal Judgements Using Large Language Models (2024)
  7. Case Law as Data : Prompt Engineering Strategies for Case Outcome Extraction with Large Language Models in a Zero-Shot Setting (2024)
  8. Questioning Biases in Case Judgment Summaries: Legal Datasets or Large Language Models? (2023)
  9. InSaAF: Incorporating Safety through Accuracy and Fairness | Are LLMs ready for the Indian Legal Domain? (2024)
  10. Detecting and Mitigating Bias in LLMs through Knowledge Graph-Augmented Training (2025)
  11. ASTRAEA: Grammar-Based Fairness Testing (2020)
  12. Overview of the NLPCC 2025 Shared Task: Gender Bias Mitigation Challenge (2025)
  13. LFTF: Locating First and Then Fine-Tuning for Mitigating Gender Bias in Large Language Models (2025)
  14. Bias and Fairness in Large Language Models: A Survey (2023)
  15. Efficiently Mitigating Classification Bias via Transfer Learning (2020)
  16. Neutralizing the Narrative: AI-Powered Debiasing of Online News Articles (2023)
  17. BiasGuard: A Reasoning-enhanced Bias Detection Tool For Large Language Models (2025)
  18. An Actionable Framework for Assessing Bias and Fairness in Large Language Model Use Cases (2024)
  19. BiasWipe: Mitigating Unintended Bias in Text Classifiers through Model Interpretability (2024)
  20. Improving Commonsense Bias Classification by Mitigating the Influence of Demographic Terms (2024)