Paper ID

0b7cc0e510ef05ad394a36d9cee9ddf5f2ae912f

Title

SciFactFusion: A Multi-Modal Architecture for Robust Scientific Claim Verification

Problem Statement

Verifying the accuracy of scientific claims often requires synthesizing information from text, images, and data visualizations, which current language models struggle with. Most existing fact-checking systems focus solely on textual information or treat different modalities separately, limiting their ability to comprehensively assess complex scientific claims.

Motivation

Existing methods for scientific claim verification are limited by their focus on single modalities or their inability to effectively integrate information across modalities. By developing a unified multi-modal representation that can reason across text, images, and data, we can create more robust scientific claim verification systems. This approach is inspired by how humans verify scientific claims, often relying on a combination of textual descriptions, visual evidence, and data analysis.

Proposed Method

We introduce SciFactFusion, a novel architecture for multi-modal scientific claim verification. The model consists of three main components: (1) A text encoder based on a scientific domain-adapted language model, (2) An image encoder using a vision transformer pretrained on scientific figures and data visualizations, and (3) A multi-modal fusion transformer that learns to align and reason across the text and visual modalities. We use a combination of masked language modeling, image-text matching, and graph-to-text generation tasks for pretraining. For fine-tuning, we employ a contrastive learning objective where the model must distinguish between accurate and subtly altered scientific claims when presented with multi-modal evidence. We also incorporate a novel attention mechanism called 'evidence routing' that learns to selectively attend to relevant parts of the multi-modal input when verifying specific aspects of a claim.
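Below is a minimal sketch of the 'evidence routing' idea, assuming it is realized as claim-conditioned gating over evidence tokens followed by cross-attention. The class name, dimensions, and gating form are illustrative assumptions, not the final design.

```python
import torch
import torch.nn as nn

class EvidenceRouter(nn.Module):
    """Sketch of evidence routing: a claim-conditioned gate down-weights irrelevant
    evidence tokens (text or visual) before cross-attention fuses claim and evidence."""

    def __init__(self, hidden_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Scalar relevance gate per evidence token, conditioned on a claim summary.
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, 1))

    def forward(self, claim_tokens, evidence_tokens):
        # claim_tokens: (B, Lc, H); evidence_tokens: (B, Le, H) from the two encoders
        claim_summary = claim_tokens.mean(dim=1, keepdim=True)            # (B, 1, H)
        gate_in = torch.cat([evidence_tokens,
                             claim_summary.expand(-1, evidence_tokens.size(1), -1)], dim=-1)
        relevance = torch.sigmoid(self.gate(gate_in))                     # (B, Le, 1)
        routed_evidence = evidence_tokens * relevance                     # soft routing
        fused, _ = self.cross_attn(claim_tokens, routed_evidence, routed_evidence)
        return fused, relevance.squeeze(-1)
```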

Step-by-Step Experiment Plan

Step 1: Dataset Preparation

Create MultiSciCheck, a new benchmark for multi-modal scientific claim verification. Collect claims paired with evidence from scientific papers, including text, images, and data visualizations. Ensure a diverse range of scientific domains and claim types. For each claim, create subtle alterations to serve as negative examples.
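One way a MultiSciCheck instance could be structured is sketched below; the field names, label scheme, and placeholder values are illustrative assumptions rather than a fixed schema.

```python
# Illustrative MultiSciCheck record layout; field names and label set are assumptions.
example_record = {
    "claim_id": "msc-000123",
    "claim": "The rate of sea level rise has accelerated over recent decades.",
    "label": "SUPPORTED",                  # e.g., SUPPORTED / REFUTED / NOT_ENOUGH_INFO
    "domain": "climate science",
    "evidence": {
        "text": ["<paragraph from the source paper describing measured sea level trends>"],
        "figures": ["figures/msc-000123_fig2.png"],    # graphs, plots, diagrams
        "figure_captions": ["Global mean sea level over time."],
    },
    # Subtle alteration of the claim, used as a hard negative for contrastive training.
    "altered_claim": "The rate of sea level rise has remained constant over recent decades.",
}
```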

Step 2: Model Implementation

Implement the SciFactFusion architecture using the Hugging Face Transformers library. Use a pretrained scientific language model (e.g., SciBERT) for the text encoder and a vision transformer (e.g., ViT) for the image encoder. Implement the multi-modal fusion transformer and the evidence routing mechanism.
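A skeleton of the implementation under the stated component choices (SciBERT text encoder, ViT image encoder, transformer-based fusion) is sketched below; the hidden size, layer count, and classification-head layout are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, ViTModel

class SciFactFusionSketch(nn.Module):
    """Skeleton of the proposed architecture; dimensions and head layout are illustrative."""

    def __init__(self, num_labels: int = 3, hidden_dim: int = 768, fusion_layers: int = 4):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        fusion_layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=fusion_layers)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, input_ids, attention_mask, pixel_values):
        text_feats = self.text_encoder(input_ids=input_ids,
                                       attention_mask=attention_mask).last_hidden_state
        image_feats = self.image_encoder(pixel_values=pixel_values).last_hidden_state
        fused = self.fusion(torch.cat([text_feats, image_feats], dim=1))
        return self.classifier(fused[:, 0])   # first (CLS) position yields the verdict
```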

Step 3: Pretraining

Pretrain the model on a large corpus of scientific papers and their associated figures. Use masked language modeling for text, image-text matching for figures, and graph-to-text generation for data visualizations. Utilize black-box LLM APIs (e.g., GPT-3.5) to generate pretraining data if necessary.
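A minimal sketch of how the three pretraining objectives could be combined, assuming a simple weighted sum; the weights are illustrative hyperparameters, not values specified in the plan.

```python
# Sketch of the multi-task pretraining objective as a weighted sum of the three task losses.
def pretraining_loss(mlm_loss, itm_loss, graph2text_loss,
                     w_mlm: float = 1.0, w_itm: float = 1.0, w_g2t: float = 0.5):
    """mlm_loss: masked language modeling on paper text.
    itm_loss: binary image-text matching on figure/caption pairs.
    graph2text_loss: sequence loss for generating text from data visualizations."""
    return w_mlm * mlm_loss + w_itm * itm_loss + w_g2t * graph2text_loss
```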

Step 4: Fine-tuning

Fine-tune the model on the MultiSciCheck dataset using the contrastive learning objective, training the evidence routing mechanism jointly during this phase. Use a batch size of 32 and a learning rate of 1e-5 with the AdamW optimizer. Train for 10 epochs or until convergence.
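A sketch of the contrastive fine-tuning objective, assuming a margin-based formulation in which the accurate claim must score higher than its subtly altered counterpart on the same evidence; the exact loss form and logit convention are assumptions.

```python
import torch.nn.functional as F

def contrastive_claim_loss(original_logits, altered_logits, margin: float = 1.0):
    """Margin-based sketch: the verification score of an accurate claim should exceed
    that of its subtly altered counterpart given the same multi-modal evidence."""
    score_orig = original_logits[:, 0]   # 'SUPPORTED' logit for the accurate claim
    score_alt = altered_logits[:, 0]     # 'SUPPORTED' logit for the altered claim
    return F.relu(margin - (score_orig - score_alt)).mean()

# Optimization as stated in the plan: AdamW, batch size 32, learning rate 1e-5,
# training for 10 epochs or until convergence, e.g.
#   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```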

Step 5: Baseline Implementation

Implement baseline models for comparison: (1) Text-only model using SciBERT, (2) Image-only model using ViT, (3) Simple concatenation of text and image features, (4) Existing multi-modal models like CLIP or VisualBERT adapted for scientific claim verification.
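A sketch of the feature-concatenation baseline (baseline 3), assuming mean-pooled encoder outputs and a small classification head with no cross-modal attention; layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConcatBaseline(nn.Module):
    """Baseline (3): mean-pooled SciBERT and ViT features concatenated and classified."""

    def __init__(self, text_encoder, image_encoder, hidden_dim: int = 768, num_labels: int = 3):
        super().__init__()
        self.text_encoder, self.image_encoder = text_encoder, image_encoder
        self.head = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, num_labels))

    def forward(self, input_ids, attention_mask, pixel_values):
        t = self.text_encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state.mean(dim=1)
        v = self.image_encoder(pixel_values=pixel_values).last_hidden_state.mean(dim=1)
        return self.head(torch.cat([t, v], dim=-1))
```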

Step 6: Evaluation

Evaluate SciFactFusion and baselines on the MultiSciCheck test set. Use metrics such as accuracy, F1 score, and AUC-ROC for claim verification. Additionally, evaluate evidence selection precision, and assess explanation generation quality using ROUGE and BERTScore.
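A sketch of the claim-verification metrics using scikit-learn, assuming a three-way label scheme with per-class probability scores; explanation quality would be scored separately (e.g., with the rouge-score and bert-score packages).

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def verification_metrics(y_true, y_pred, y_scores):
    """Claim-verification metrics on the MultiSciCheck test set.
    y_scores: per-class probabilities of shape (n_samples, n_classes);
    AUC-ROC is computed one-vs-rest under the assumed three-way label scheme."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "auc_roc": roc_auc_score(y_true, y_scores, multi_class="ovr"),
    }
```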

Step 7: Ablation Studies

Conduct ablation studies to assess the impact of different components: (1) Remove evidence routing, (2) Use different pretraining tasks, (3) Vary the architecture of the multi-modal fusion transformer.

Step 8: Targeted Tests

Perform a series of targeted tests to assess specific capabilities: (1) Reconciling contradictory information across modalities, (2) Identifying cherry-picked data in visualizations, (3) Handling claims that require integrating information from multiple figures or data points.

Step 9: Error Analysis

Analyze cases where SciFactFusion fails and categorize error types. Use this information to identify areas for improvement and potential limitations of the approach.

Step 10: Reporting Results

Compile all results, ablation studies, and analyses into a comprehensive report. Prepare visualizations of the model architecture, performance comparisons, and example outputs for clear communication of findings.

Test Case Examples

Baseline Prompt Input (Text-only Model)

Claim: The rate of sea level rise has remained constant over the past century. Evidence: [A paragraph describing sea level measurements and a graph showing sea level changes over time]

Baseline Prompt Expected Output (Text-only Model)

The text mentions only some variation in sea level rise rates, so the claim cannot be confidently refuted from the text alone; without analyzing the graph, the model misses the clear acceleration trend in recent decades and fails to identify the claim as false.

Proposed Prompt Input (SciFactFusion)

Claim: The rate of sea level rise has remained constant over the past century. Evidence: [Same paragraph and graph as above]

Proposed Prompt Expected Output (SciFactFusion)

The claim is false. The textual evidence suggests some variation in sea level rise rates, but the graph clearly shows an acceleration in the rate of sea level rise, particularly in recent decades. The model correctly integrates information from both the text and the visualization to arrive at an accurate conclusion.

Explanation

SciFactFusion successfully integrates information from both the textual description and the visual graph to correctly identify the acceleration in sea level rise, which the text-only model missed due to its inability to process the graphical evidence.

Fallback Plan

If SciFactFusion doesn't significantly outperform baselines, we can pivot to an analysis paper focusing on the challenges of multi-modal scientific claim verification. We would conduct a thorough error analysis to identify specific types of claims or evidence that prove challenging for our model and existing approaches. This could involve categorizing errors based on the modalities involved, the complexity of the reasoning required, or the specific scientific domains. We could also investigate how different components of our model (text encoder, image encoder, fusion transformer) contribute to successes and failures. Additionally, we might explore how the model's performance varies across different scientific disciplines or types of visual evidence (e.g., photographs vs. graphs vs. diagrams). This analysis could provide valuable insights into the current limitations of multi-modal models in scientific contexts and guide future research directions in this area.