Paper ID

45653ad43124f02dc2cf2db3357be1d1d78ddb18


Title

Multi-Modal Fact Verification: Integrating Text, Image, and Audio for Robust Misinformation Detection


Introduction

Problem Statement

Current fact verification systems primarily rely on textual information, but many claims involve visual or auditory elements that are challenging to verify using text alone. This limitation hinders the effectiveness of fact-checking in real-world scenarios where misinformation often combines multiple modalities.

Motivation

Existing benchmarks such as FEVER and FEVEROUS focus on text-based fact-checking, and prior work has explored image-text verification, but comprehensive multi-modal fact-checking remains largely unexplored. Real-world misinformation often combines text, images, and audio, so a system that integrates and reasons across multiple modalities should be both more robust and more broadly applicable. By leveraging the strengths of large language models in processing and reasoning across different modalities, we aim to create a more effective fact verification system.


Proposed Method

We propose a Multi-Modal Fact Verification (MMFV) framework that fuses information from text, images, and audio to verify claims. The system consists of three main components: (1) a multi-modal encoder that processes each modality with a pre-trained model (e.g., BERT for text, ViT for images, wav2vec 2.0 for audio) and then fuses the resulting representations; (2) a cross-modal attention mechanism that lets each modality attend to relevant information in the others; and (3) a verification head that takes the fused representation and predicts the veracity of the claim. We will use a large language model (LLM) as the backbone for our system, leveraging its ability to process and reason across different modalities through prompting.
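
As a concrete illustration of components (1)-(3), the following is a minimal PyTorch sketch of the fusion stage, assuming the pre-trained encoders are frozen and their pooled outputs are precomputed; all module names and dimensions are illustrative rather than final design choices.

```python
import torch
import torch.nn as nn

class MMFVFusion(nn.Module):
    """Sketch of the fusion architecture: per-modality projections,
    cross-modal attention, and a verification head."""

    def __init__(self, text_dim=768, image_dim=768, audio_dim=768,
                 hidden=512, num_classes=3):
        super().__init__()
        # Project each modality's pooled encoder output into a shared space.
        self.proj = nn.ModuleDict({
            "text": nn.Linear(text_dim, hidden),
            "image": nn.Linear(image_dim, hidden),
            "audio": nn.Linear(audio_dim, hidden),
        })
        # Cross-modal attention: each modality attends to the others.
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        # Verification head over the fused representation.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),  # true / false / partially true
        )

    def forward(self, text_emb, image_emb, audio_emb):
        # Stack the projected modality embeddings as a length-3 sequence.
        tokens = torch.stack([
            self.proj["text"](text_emb),
            self.proj["image"](image_emb),
            self.proj["audio"](audio_emb),
        ], dim=1)                                  # (batch, 3, hidden)
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.head(fused.mean(dim=1))        # (batch, num_classes)
```

In the prompting-based variant described above, this trained fusion module would instead be realized by passing verbalized evidence from each modality to the LLM; the sketch covers only the trained-fusion reading of the architecture.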


Experiments Plan

Step-by-Step Experiment Plan

Step 1: Dataset Creation

Create a new Multi-Modal Fact Verification dataset by collecting claims from social media that involve text, images, and/or audio. Annotate these claims for veracity (true, false, or partially true). Aim for a dataset of at least 1000 multi-modal claims.
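
A record schema along these lines could be used for the annotations; the field names below are hypothetical, since the plan only fixes the modalities and the three-way veracity label.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MMFVClaim:
    """One annotated record in the proposed dataset (hypothetical schema)."""
    claim_id: str
    claim_text: str
    image_path: Optional[str]   # None if the claim has no image
    audio_path: Optional[str]   # None if the claim has no audio
    label: str                  # "true" | "false" | "partially_true"
    source_url: str             # provenance of the social-media post
```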

Step 2: Data Preprocessing

Process the collected data into a format suitable for LLM input. For text, use standard tokenization. For images, use CLIP to generate image embeddings. For audio, use wav2vec 2.0 to generate audio embeddings. Store these preprocessed inputs alongside the original data.
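
A sketch of the embedding step using the Hugging Face transformers library is shown below; the checkpoint names are reasonable defaults, not prescribed choices.

```python
import torch
from PIL import Image
from transformers import (CLIPModel, CLIPProcessor,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

# Pre-trained checkpoints; any CLIP / wav2vec 2.0 variant would do.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
w2v_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

@torch.no_grad()
def embed_image(path: str) -> torch.Tensor:
    """CLIP image embedding, shape (512,) for the base checkpoint."""
    inputs = clip_proc(images=Image.open(path), return_tensors="pt")
    return clip.get_image_features(**inputs).squeeze(0)

@torch.no_grad()
def embed_audio(waveform, sampling_rate=16_000) -> torch.Tensor:
    """Mean-pooled wav2vec 2.0 embedding, shape (768,) for the base model."""
    inputs = w2v_extractor(waveform, sampling_rate=sampling_rate,
                           return_tensors="pt")
    hidden = w2v(**inputs).last_hidden_state   # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0)
```

The resulting tensors can be serialized next to the raw files (e.g., with torch.save) so later steps never need to re-run the encoders.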

Step 3: Baseline Models

Implement three baseline models: (1) Text-only model using GPT-3.5, (2) Image-text model using CLIP + GPT-3.5, (3) Simple concatenation of all modalities using GPT-3.5.
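
For the text-only baseline (1), a minimal sketch using the OpenAI Python SDK might look as follows; the prompt wording is illustrative. Baselines (2) and (3) would first need the image and audio signals verbalized (e.g., as captions or transcripts), since GPT-3.5 accepts only text.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def verify_text_only(claim: str) -> str:
    """Baseline 1: text-only verification with GPT-3.5."""
    prompt = (f"Claim: {claim}\n"
              "Verify this claim. Answer with exactly one of: "
              "true, false, partially true.")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()
```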

Step 4: MMFV Model Implementation

Implement the MMFV model using GPT-4 as the backbone. Create prompts that instruct the model to consider all available modalities and their interactions. Use few-shot examples in the prompt to guide the model.
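
One way to assemble the few-shot prompt is sketched below. It assumes the image and audio evidence has been verbalized into short descriptions beforehand, and the message format is an assumption rather than part of the proposal.

```python
from openai import OpenAI

client = OpenAI()

def build_mmfv_prompt(claim: str, image_desc: str, audio_desc: str,
                      few_shot: list[dict]) -> list[dict]:
    """Assemble a few-shot MMFV chat prompt (format is hypothetical)."""
    messages = [{"role": "system",
                 "content": ("You are a fact-verification assistant. Weigh the "
                             "textual claim against the image and audio evidence "
                             "and answer true, false, or partially true.")}]
    for ex in few_shot:  # labeled examples drawn from the train split
        messages.append({"role": "user", "content": ex["input"]})
        messages.append({"role": "assistant", "content": ex["label"]})
    messages.append({"role": "user", "content":
                     f"Claim: {claim}\nImage evidence: {image_desc}\n"
                     f"Audio evidence: {audio_desc}\nVerdict:"})
    return messages

# Example call:
# resp = client.chat.completions.create(
#     model="gpt-4",
#     messages=build_mmfv_prompt(claim, img_desc, aud_desc, few_shot))
```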

Step 5: Training and Evaluation

Split the dataset into train, validation, and test sets. Use the train set for few-shot prompting. Evaluate all models on the test set using accuracy, F1 score, and a new 'cross-modal consistency' metric that measures how well the system's predictions align across different modalities for the same claim.
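
Accuracy and F1 can come from scikit-learn as usual. Since the cross-modal consistency metric is new, the definition below is an assumption: one plausible instantiation is the mean pairwise agreement between predictions derived from each modality in isolation.

```python
from itertools import combinations

def cross_modal_consistency(preds_by_modality: dict[str, list[str]]) -> float:
    """Mean pairwise agreement between per-modality predictions over the
    same claims; assumes at least two modalities and aligned label lists."""
    names = list(preds_by_modality)
    n = len(next(iter(preds_by_modality.values())))
    pair_agreement = [
        sum(a == b for a, b in zip(preds_by_modality[m1],
                                   preds_by_modality[m2])) / n
        for m1, m2 in combinations(names, 2)
    ]
    return sum(pair_agreement) / len(pair_agreement)
```

For example, with text/image/audio predictions of ["true", "false"], ["true", "true"], and ["true", "false"], the three pairwise agreements are 0.5, 1.0, and 0.5, giving a consistency of 2/3.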

Step 6: Analysis

Perform error analysis to identify strengths and weaknesses of the MMFV model compared to baselines. Analyze performance across different types of claims and modality combinations.

Test Case Examples

Baseline Prompt Input (Text-only)

Claim: The video shows a politician giving a speech about climate change. Verify if this claim is true or false.

Baseline Prompt Expected Output (Text-only)

I'm sorry, but I cannot verify the claim based solely on the text provided. To accurately verify if the video shows a politician giving a speech about climate change, I would need to see the video itself or have more detailed information about its content.

Proposed Prompt Input (MMFV)

Claim: The video shows a politician giving a speech about climate change. [IMAGE_EMBEDDING] [AUDIO_EMBEDDING] Analyze the text, image, and audio to verify if this claim is true or false.

Proposed Prompt Expected Output (MMFV)

After analyzing the provided text, image embedding, and audio embedding, I can verify that the claim is true. The image embedding shows a person standing at a podium, which is consistent with giving a speech. The audio embedding contains patterns typical of human speech. The content of the speech, as evidenced by the audio analysis, includes keywords related to climate change such as 'global warming', 'emissions', and 'renewable energy'. Therefore, based on the multi-modal evidence, I conclude that the video indeed shows a politician giving a speech about climate change.

Explanation

The MMFV model successfully integrates information from all three modalities to verify the claim, while the text-only baseline is unable to make a determination due to lack of visual and auditory information.

Fallback Plan

If the proposed MMFV method doesn't significantly outperform the baselines, we can pivot to an analysis paper exploring the challenges of multi-modal fact verification. We would conduct ablation studies to understand the contribution of each modality and the effectiveness of different fusion techniques. We could also investigate cases where multi-modal information leads to incorrect conclusions, analyzing whether this is due to limitations in the model's reasoning capabilities or inherent ambiguities in multi-modal data. Additionally, we could explore alternative architectures, such as using separate LLMs for each modality and then combining their outputs, or fine-tuning smaller open-source models on our dataset to compare with the prompting-based approach.


References

  1. Reinforcement Retrieval Leveraging Fine-grained Feedback for Fact Checking News Claims with Black-Box LLM (2024)
  2. Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents (2025)
  3. Improving Large-Scale Fact-Checking using Decomposable Attention Models and Lexical Tagging (2018)
  4. KG-GPT: A General Framework for Reasoning on Knowledge Graphs Using Large Language Models (2023)
  5. FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information (2021)
  6. Evaluating Verifiability in Generative Search Engines (2023)
  7. FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality (2025)
  8. If We May De-Presuppose: Robustly Verifying Claims through Presupposition-Free Question Decomposition (2025)
  9. FacTool: Factuality Detection in Generative AI - A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios (2023)
  10. Language Models Hallucinate, but May Excel at Fact Verification (2023)