Paper ID

45653ad43124f02dc2cf2db3357be1d1d78ddb18


Title

Adversarial Fact Verification: Enhancing Robustness of Large Language Models Against Deceptive Claims


Introduction

Problem Statement

Current fact verification systems are vulnerable to adversarial attacks and can be easily fooled by subtle changes in claim phrasing or evidence presentation. This vulnerability limits their reliability in real-world applications where deliberate misinformation is prevalent.

Motivation

Existing fact verification benchmarks and systems primarily focus on naturally occurring claims and do not explicitly consider adversarial scenarios. In real-world applications, fact-checkers need to be robust against deliberate attempts to mislead. An adversarially-trained system would be more resilient and trustworthy, better equipped to handle the challenges of modern misinformation landscapes.


Proposed Method

We propose an Adversarial Fact Verification (AFV) framework consisting of two competing models: a Verifier and a Deceiver. The Verifier is trained to classify claims as true or false, while the Deceiver is trained to generate claims that fool the Verifier. These models are trained in an adversarial manner, similar to GANs. The Deceiver uses a large language model to generate claims and supporting evidence, employing techniques like paraphrasing, fact mixing, and context manipulation. The Verifier is a transformer-based model that learns to identify subtle inconsistencies and logical flaws. We introduce a novel 'deception score' that the Deceiver tries to maximize and the Verifier tries to minimize. This score combines the confidence of the Verifier's prediction with the semantic distance between the generated claim and the truth. To ensure relevance and coherence of generated claims, we include a 'naturalness' constraint in the Deceiver's objective.
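
One compact way to express this objective (a sketch only; D(c) denotes a deceptive claim generated from a true claim c, s is the deception score defined in the experiment plan below, and the threshold tau is a hypothetical formalization of the naturalness constraint):

```latex
% Sketch of the adversarial objective. D is the Deceiver, V the Verifier,
% s(.,.) the deception score from Step 4, and tau a naturalness threshold.
\max_{D} \; \min_{V} \;
  \mathbb{E}_{c \sim \mathcal{C}_{\text{true}}}
  \big[\, s\big(D(c),\, V\big) \,\big]
\quad \text{subject to} \quad
\mathrm{naturalness}\big(D(c)\big) \ge \tau
```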


Experiments Plan

Step-by-Step Experiment Plan

Step 1: Data Preparation

Use the FEVER dataset as the primary source of true and false claims. Split the dataset into training, validation, and test sets. Ensure a balanced distribution of true and false claims in each set.
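
A minimal data-preparation sketch, assuming the copy of FEVER distributed through the HuggingFace `datasets` library; split names, label strings, and the loading call may need adjustment for the exact FEVER release used:

```python
# Step 1 sketch: load FEVER, keep SUPPORTS/REFUTES claims for the binary
# setting, and carve out train/validation/test splits.
from datasets import load_dataset

fever = load_dataset("fever", "v1.0", trust_remote_code=True)

def is_binary(example):
    # Drop NOT ENOUGH INFO so every claim is unambiguously true or false.
    return example["label"] in ("SUPPORTS", "REFUTES")

binary = fever["train"].filter(is_binary).shuffle(seed=42)

# 80/10/10 split; class balance should be checked (and, if needed, enforced
# by downsampling the majority label) before training.
n = len(binary)
train_set = binary.select(range(int(0.8 * n)))
val_set = binary.select(range(int(0.8 * n), int(0.9 * n)))
test_set = binary.select(range(int(0.9 * n), n))
```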

Step 2: Implement the Deceiver

Use GPT-3.5 (text-davinci-003) as the Deceiver. Create prompts that instruct the model to generate deceptive claims based on true claims from the FEVER dataset. For example: 'Given the true claim "X", generate a false but plausible claim that is semantically similar.' Generate 3 deceptive claims for each true claim in the training set.
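
A sketch of the Deceiver's generation call, assuming the OpenAI Python SDK (v1.x) and a completion-style endpoint; text-davinci-003 has since been retired, so a stand-in such as gpt-3.5-turbo-instruct may be required, and decoding parameters are illustrative:

```python
# Step 2 sketch: generate 3 deceptive variants of a true claim via the prompt
# described above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ('Given the true claim "{claim}", generate a false but plausible '
          "claim that is semantically similar.")

def generate_deceptive_claims(true_claim: str, n: int = 3) -> list[str]:
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",  # stand-in for text-davinci-003
        prompt=PROMPT.format(claim=true_claim),
        n=n,
        max_tokens=64,
        temperature=0.9,  # higher temperature encourages diverse variants
    )
    return [choice.text.strip() for choice in response.choices]
```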

Step 3: Implement the Verifier

Use a pre-trained BERT model as the initial Verifier. Fine-tune it on the original FEVER dataset and the generated deceptive claims. Use binary cross-entropy loss for training.
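
A fine-tuning sketch for the Verifier using HuggingFace Transformers, with a single logit and binary cross-entropy as described above; hyperparameters are illustrative, not tuned:

```python
# Step 3 sketch: BERT Verifier over "[CLS] claim [SEP] evidence" pairs,
# trained with binary cross-entropy.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
verifier = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1  # single logit -> sigmoid -> P(true)
)
bce = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(verifier.parameters(), lr=2e-5)

def verifier_step(claims, evidence, labels):
    """One gradient step; labels are 1.0 for true claims, 0.0 for false."""
    batch = tokenizer(claims, evidence, padding=True, truncation=True,
                      return_tensors="pt")
    logits = verifier(**batch).logits.squeeze(-1)
    loss = bce(logits, torch.tensor(labels, dtype=torch.float))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```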

Step 4: Implement the Deception Score

Define the deception score as: score = alpha * verifier_confidence + (1 - alpha) * (1 - semantic_similarity), where alpha is a hyperparameter balancing the two terms, verifier_confidence is the probability the Verifier assigns to the (incorrect) 'true' label for the deceptive claim, and semantic_similarity is the cosine similarity between BERT embeddings of the original and deceptive claims.
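
A sketch of the score computation; semantic similarity is approximated here with cosine similarity over mean-pooled BERT embeddings, and verifier_confidence is supplied by the caller as the Verifier's probability of 'true' for the deceptive claim:

```python
# Step 4 sketch: deception score = alpha * confidence + (1 - alpha) * distance.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

emb_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
emb_model = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    tokens = emb_tokenizer(text, return_tensors="pt", truncation=True)
    hidden = emb_model(**tokens).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)            # mean pooling

def deception_score(verifier_confidence: float, original: str,
                    deceptive: str, alpha: float = 0.5) -> float:
    semantic_similarity = F.cosine_similarity(
        embed(original), embed(deceptive), dim=0).item()
    return alpha * verifier_confidence + (1 - alpha) * (1 - semantic_similarity)
```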

Step 5: Adversarial Training Loop

For each epoch: a) Use the Deceiver to generate a batch of deceptive claims. b) Calculate the deception score for each claim. c) Update the Deceiver using the deception score as a reward signal (via the REINFORCE algorithm). d) Train the Verifier on this batch of deceptive claims along with an equal number of true claims. e) Evaluate performance on the validation set.
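
A sketch of one adversarial epoch. REINFORCE needs the Deceiver's token log-probabilities, so this assumes a locally hosted, trainable Deceiver LM (an API-only model such as text-davinci-003 cannot be updated this way). `deceiver.sample` and `verifier_confidence` are hypothetical helpers: the first returns generated claims plus their summed log-probabilities, the second returns the Verifier's probability of 'true' for a generated claim.

```python
# Step 5 sketch: one adversarial epoch. Reuses deception_score (Step 4) and
# verifier_step (Step 3); a mean-reward baseline reduces REINFORCE variance.
import torch

def adversarial_epoch(deceiver, deceiver_opt, true_claims, true_evidence):
    # a) Deceiver generates one deceptive claim per true claim.
    deceptive_claims, log_probs = deceiver.sample(true_claims)  # hypothetical API

    # b) Reward each deceptive claim with the deception score.
    rewards = torch.tensor([
        deception_score(verifier_confidence(claim), orig, claim)
        for orig, claim in zip(true_claims, deceptive_claims)
    ])

    # c) REINFORCE: maximize expected reward (minimize its negative).
    advantage = rewards - rewards.mean()
    deceiver_loss = -(advantage * log_probs).mean()
    deceiver_loss.backward()
    deceiver_opt.step()
    deceiver_opt.zero_grad()

    # d) Verifier trains on deceptive claims (label 0) plus an equal number of
    #    true claims (label 1), each paired with the original evidence.
    verifier_step(deceptive_claims + true_claims,
                  true_evidence + true_evidence,
                  [0.0] * len(deceptive_claims) + [1.0] * len(true_claims))
```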

Step 6: Evaluation

Test the final Verifier model on: a) The original FEVER test set. b) A new set of adversarially generated claims (using the trained Deceiver). c) A human-curated set of challenging claims (if available). Compare performance with baseline models (e.g., BERT fine-tuned on original FEVER only).
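
A sketch of the comparison; `predict`, `baseline_model`, `fever_test`, and `adversarial_test` are hypothetical placeholders for the fine-tuned models and evaluation splits, with labels assumed to be 0/1:

```python
# Step 6 sketch: report accuracy of each model on each test condition.
def accuracy(model, examples):
    correct = sum(
        int(predict(model, ex["claim"], ex["evidence"]) == ex["label"])
        for ex in examples
    )
    return correct / len(examples)

for model_name, model in [("BERT (FEVER only)", baseline_model),
                          ("AFV Verifier", verifier)]:
    for split_name, split in [("FEVER test", fever_test),
                              ("adversarial", adversarial_test)]:
        print(f"{model_name} on {split_name}: {accuracy(model, split):.3f}")
```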

Step 7: Analysis

Analyze the types of deceptive claims that are most successful in fooling the Verifier. Categorize them based on the deception techniques used (e.g., paraphrasing, fact mixing). Examine the Verifier's attention patterns on these challenging examples.
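
A sketch for the attention inspection mentioned above, averaging the final-layer attention from the [CLS] token of the Verifier (one of several reasonable ways to surface which tokens the model attends to):

```python
# Step 7 sketch: top tokens by [CLS] attention in the Verifier's last layer.
import torch

@torch.no_grad()
def top_cls_attention(claim: str, evidence: str, k: int = 10):
    batch = tokenizer(claim, evidence, return_tensors="pt", truncation=True)
    outputs = verifier(**batch, output_attentions=True)
    last_layer = outputs.attentions[-1]            # (1, num_heads, seq, seq)
    cls_attn = last_layer[0, :, 0, :].mean(dim=0)  # average heads, [CLS] row
    tokens = tokenizer.convert_ids_to_tokens(batch["input_ids"][0])
    ranked = sorted(zip(tokens, cls_attn.tolist()), key=lambda x: -x[1])
    return ranked[:k]
```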

Test Case Examples

Baseline Prompt Input

Claim: The film 'Jaws' was directed by Steven Spielberg. Evidence: Jaws is a 1975 American thriller film directed by Steven Spielberg and based on Peter Benchley's 1974 novel of the same name.

Baseline Prompt Expected Output

True

Baseline Prompt Input (Adversarial)

Claim: The film 'Jaws' was produced by Steven Spielberg. Evidence: Jaws is a 1975 American thriller film directed by Steven Spielberg and based on Peter Benchley's 1974 novel of the same name.

Baseline Prompt Expected Output (Adversarial)

True (Incorrect)

Proposed Prompt Input

Claim: The film 'Jaws' was produced by Steven Spielberg. Evidence: Jaws is a 1975 American thriller film directed by Steven Spielberg and based on Peter Benchley's 1974 novel of the same name.

Proposed Prompt Expected Output

False

Explanation

The baseline model might be fooled by the subtle change from 'directed' to 'produced', while our adversarially-trained Verifier should be able to detect this nuanced difference and correctly classify the claim as false.

Fallback Plan

If the proposed AFV framework doesn't significantly improve robustness against adversarial claims, we can pivot to an analysis paper. We would focus on understanding why certain types of deceptive claims are particularly challenging for fact verification systems. This could involve: 1) Categorizing the successful deceptive claims based on the techniques used (e.g., subtle word substitutions, context manipulation). 2) Analyzing the attention patterns of the Verifier on both successful and unsuccessful deceptive claims to identify potential weaknesses in the model's reasoning. 3) Conducting ablation studies on the components of the deception score to understand which aspects contribute most to the model's performance. 4) Exploring the trade-off between robustness to adversarial claims and performance on standard fact verification tasks. This analysis could provide valuable insights into the limitations of current fact verification systems and guide future research in developing more robust models.

