Paper ID

a76209fea4627974b5e12d8b4942268eb17bc7df


Title

Self-Supervised Relevance Distillation for Improved Retrieval-Augmented Generation


Introduction

Problem Statement

Retrieval-augmented generation systems often struggle with noisy or irrelevant retrieved information, leading to decreased performance and increased hallucination, especially when dealing with diverse tasks and domains. This problem is particularly acute when systems need to adapt to new domains or tasks without task-specific supervision.

Motivation

Current approaches often rely on supervised relevance labels or simple heuristics such as TF-IDF scoring to filter retrieved passages. Some recent works have explored using language models to assess relevance, but they typically require task-specific fine-tuning. Inspired by the human ability to quickly identify relevant information in a large context, we propose a method that learns to distill relevant information from retrieved passages without task-specific supervision. This enables unsupervised adaptation to new domains and tasks: the relevance model continuously refines its notion of relevance through interaction with the retrieval and generation processes.


Proposed Method

We introduce Self-Supervised Relevance Distillation (SSRD), a novel approach to refining retrieved information. SSRD consists of two key components.

(1) A relevance distillation model that learns to extract the most pertinent information from retrieved passages. This model is trained with a novel self-supervised objective: given a set of retrieved passages and a generated answer, it learns to reconstruct the answer using only a small subset of the retrieved information. The reconstruction loss encourages the model to identify the most relevant parts of the retrieved passages.

(2) A contrastive learning module that further improves the relevance model by contrasting true retrieved passages with artificially constructed irrelevant ones. To generate these contrasts, we use the language model to paraphrase retrieved passages while changing key details, creating 'near-miss' distractors.

During inference, SSRD processes the retrieved passages to produce a condensed, highly relevant context for the language model to use in generation.
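As a purely illustrative sketch of how these components fit together, the interface below assumes a scorer/selector split; all class and method names are our own and not part of an existing implementation.

    # Illustrative SSRD interface; class and method names are assumptions.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ScoredPassage:
        text: str
        relevance: float  # higher = more useful for reconstructing the answer

    class SSRD:
        def score(self, question: str, passages: List[str], answer: str) -> List[ScoredPassage]:
            """Relevance distillation: score each passage by how well it
            supports reconstructing the generated answer (Steps 4-5)."""
            raise NotImplementedError

        def select(self, scored: List[ScoredPassage], k: int = 2) -> List[str]:
            """Condense the context by keeping only the top-k passages."""
            ranked = sorted(scored, key=lambda p: p.relevance, reverse=True)
            return [p.text for p in ranked[:k]]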


Experiments Plan

Step-by-Step Experiment Plan

Step 1: Data Preparation

Collect datasets for diverse tasks: (a) Open-domain QA: Natural Questions, (b) Multi-document summarization: Multi-News, (c) Task-oriented dialogue: MultiWOZ. Split each dataset into train, validation, and test sets.
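A minimal data-preparation sketch using the Hugging Face datasets library follows; the dataset identifiers are assumptions and should be verified against the hub.

    # Minimal sketch of Step 1 with the Hugging Face `datasets` library.
    # Dataset identifiers are assumptions; verify them on the hub.
    from datasets import load_dataset

    nq = load_dataset("natural_questions", split="train")  # open-domain QA
    multi_news = load_dataset("multi_news")                 # multi-doc summarization
    multiwoz = load_dataset("multi_woz_v22")                # task-oriented dialogue

    # Carve out a validation split where a dataset does not ship one.
    splits = nq.train_test_split(test_size=0.1, seed=42)
    train_set, val_set = splits["train"], splits["test"]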

Step 2: Retrieval System Setup

Implement a basic retrieval system using BM25 or Dense Passage Retrieval (DPR) to retrieve relevant passages for each input query or document.
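A minimal BM25 sketch using the rank_bm25 package is shown below; for the DPR variant, a dense index (e.g., FAISS over DPR passage embeddings) would take its place.

    # Minimal BM25 retriever sketch (Step 2) using the `rank_bm25` package.
    from rank_bm25 import BM25Okapi

    corpus = ["Paris is the capital and most populous city of France.",
              "France is a country located in Western Europe."]
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

    def retrieve(query, k=5):
        # Score all documents against the tokenized query, return the top-k.
        scores = bm25.get_scores(query.lower().split())
        ranked = sorted(zip(corpus, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in ranked[:k]]

    print(retrieve("What is the capital of France?", k=1))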

Step 3: Baseline Implementation

Implement standard RAG baselines using the GPT-3.5 and GPT-4 APIs. For each input, retrieve the top-k passages and use them as context for generation.
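A baseline sketch using the OpenAI Python client follows; the model name and prompt template are illustrative choices, not fixed by the plan.

    # Baseline RAG sketch (Step 3). Prompt wording is illustrative.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def rag_baseline(question, passages, model="gpt-3.5-turbo"):
        context = "\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": f"Question: {question}\nContext:\n{context}\n"
                                  "Answer the question using the context."}],
        )
        return response.choices[0].message.content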

Step 4: SSRD Model Implementation

Implement the SSRD model using the GPT-3.5 or GPT-4 API. The model takes retrieved passages and a generated answer as input and outputs a relevance score for each passage.
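One way to realize this is a scoring prompt over the API, as in the sketch below; the prompt wording is an assumption that mirrors the test-case example later in this document.

    # Prompt-based relevance scoring sketch (Step 4); prompt format is
    # illustrative and mirrors the test-case example below.
    def ssrd_score(client, question, passages, answer, model="gpt-4"):
        numbered = "\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
        prompt = (f"Question: {question}\n"
                  f"Retrieved Passages:\n{numbered}\n"
                  f"Generated Answer: {answer}\n"
                  "Score the relevance of each passage to answering the "
                  "question on a 1-10 scale, one line per passage.")
        response = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}])
        return response.choices[0].message.content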

Step 5: Self-Supervised Training

Train the SSRD model using the reconstruction objective. For each training example: (a) Retrieve passages, (b) Generate an initial answer using the baseline RAG model, (c) Use SSRD to select a subset of passages, (d) Reconstruct the answer using only the selected passages, (e) Compute reconstruction loss and update the SSRD model.
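Because GPT-3.5/GPT-4 weights cannot be updated through the API, the training sketch below assumes the reconstruction signal supervises a small trainable relevance scorer (e.g., a cross-encoder); reconstruct_logprob is a hypothetical helper returning the log-probability of the answer given a passage subset, e.g., from a frozen LM's token log-probabilities.

    # Reconstruction-objective training sketch (Step 5). Assumptions: `scorer`
    # is a small trainable model returning one logit per passage, and
    # `reconstruct_logprob` is a hypothetical helper (frozen LM log-probs).
    import torch
    import torch.nn.functional as F

    def training_step(scorer, optimizer, question, passages, answer):
        # Per-passage reconstruction signal: how well can the answer be
        # regenerated from each passage alone? No gradient needed here.
        with torch.no_grad():
            targets = torch.tensor([reconstruct_logprob(question, [p], answer)
                                    for p in passages])
        # Passages that reconstruct the answer best receive the most weight.
        soft_labels = torch.softmax(targets, dim=0)

        logits = scorer(question, passages)  # (num_passages,) trainable scores
        # Distill the reconstruction signal into the scorer.
        loss = F.cross_entropy(logits.unsqueeze(0), soft_labels.unsqueeze(0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()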

Step 6: Contrastive Learning

Implement the contrastive learning module. For each positive example (true retrieved passage), generate negative examples by paraphrasing the passage and changing key details. Train the SSRD model to distinguish between positive and negative examples.
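A contrastive sketch under the same trainable-scorer assumption; make_distractors is a hypothetical helper that prompts the LM to paraphrase a passage while altering key details.

    # Contrastive learning sketch (Step 6): rank the true passage above
    # paraphrased 'near-miss' distractors. `make_distractors` is hypothetical.
    import torch
    import torch.nn.functional as F

    def contrastive_step(scorer, optimizer, question, positive, num_neg=3):
        negatives = make_distractors(positive, n=num_neg)
        candidates = [positive] + negatives
        logits = scorer(question, candidates)  # (1 + num_neg,) scores
        # InfoNCE-style loss: the positive passage sits at index 0.
        loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()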

Step 7: Inference Pipeline

Implement the full SSRD inference pipeline: (a) Retrieve passages, (b) Use SSRD to select and rerank passages, (c) Use selected passages as context for final answer generation.
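Putting the pieces together, a hedged end-to-end sketch that reuses the helpers from the earlier steps; parse_scores is a hypothetical helper that converts the scoring prompt's text output into per-passage floats.

    # End-to-end inference sketch (Step 7), reusing `retrieve`, `rag_baseline`,
    # and `ssrd_score` from earlier sketches; `parse_scores` is hypothetical.
    def ssrd_pipeline(client, question, k_retrieve=10, k_keep=2):
        passages = retrieve(question, k=k_retrieve)           # (a) retrieve
        draft = rag_baseline(question, passages)              # initial answer
        raw = ssrd_score(client, question, passages, draft)   # (b) score/rerank
        scores = parse_scores(raw)                            # floats per passage
        ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
        context = [p for p, _ in ranked[:k_keep]]             # condensed context
        return rag_baseline(question, context)                # (c) final answer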

Step 8: Evaluation

Evaluate the SSRD model against baselines on all three tasks. Use task-specific metrics: (a) QA: Exact Match and F1 scores, (b) Summarization: ROUGE scores, (c) Dialogue: Task Completion Rate and BLEU scores. Also conduct human evaluation for relevance and factual consistency on a subset of examples.
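For the QA metrics, a sketch of the standard SQuAD-style Exact Match and token-level F1 follows; ROUGE and BLEU would come from existing packages such as rouge-score and sacrebleu.

    # SQuAD-style Exact Match and token-level F1 for the QA evaluation.
    import re
    import string
    from collections import Counter

    def normalize(s):
        # Lowercase, strip punctuation and articles, collapse whitespace.
        s = s.lower()
        s = "".join(ch for ch in s if ch not in set(string.punctuation))
        s = re.sub(r"\b(a|an|the)\b", " ", s)
        return " ".join(s.split())

    def exact_match(pred, gold):
        return float(normalize(pred) == normalize(gold))

    def f1(pred, gold):
        p, g = normalize(pred).split(), normalize(gold).split()
        common = Counter(p) & Counter(g)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(p), overlap / len(g)
        return 2 * precision * recall / (precision + recall)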

Step 9: Analysis

Analyze model performance across different tasks and domains. Investigate how the SSRD model adapts to new domains by testing on out-of-distribution datasets.

Test Case Examples

Baseline Prompt Input

Question: What is the capital of France? Context: Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres. The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.

Baseline Prompt Expected Output

The capital of France is Paris.

Proposed Prompt Input (SSRD Step 1: Relevance Scoring)

Question: What is the capital of France? Retrieved Passages: [1] Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres. [2] The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017. [3] France is a country located in Western Europe. It shares borders with Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco, Andorra, and Spain. Generated Answer: The capital of France is Paris. Score the relevance of each passage to answering the question.

Proposed Prompt Expected Output (SSRD Step 1: Relevance Scoring)

Relevance scores: [1] 10/10 - This passage directly states that Paris is the capital of France, which is the exact information needed to answer the question. [2] 5/10 - This passage mentions Paris as a seat of government, which is related to being a capital, but doesn't directly state it's the capital of France. [3] 1/10 - This passage provides general information about France but doesn't mention anything about its capital.

Proposed Prompt Input (SSRD Step 2: Final Answer Generation)

Question: What is the capital of France? Relevant Context: Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres. Generate a concise answer to the question based on this context.

Proposed Prompt Expected Output (SSRD Step 2: Final Answer Generation)

The capital of France is Paris.

Explanation

The SSRD method improves over the baseline by explicitly scoring the relevance of each retrieved passage and selecting the most relevant one for the final answer generation. This helps in focusing on the most pertinent information and potentially reducing noise or irrelevant details in the generation process.

Fallback Plan

If the proposed SSRD method doesn't show significant improvements over the baselines, we can pivot the project in several directions. First, we could conduct a detailed error analysis to understand where and why SSRD is failing. This could involve categorizing errors (e.g., relevance misjudgments, factual inconsistencies) and analyzing patterns across different tasks and domains. Second, we could explore variations of the self-supervised training objective, such as incorporating multiple correct answers or using different reconstruction targets. Third, we could investigate the impact of different contrastive learning strategies, including more sophisticated methods for generating negative examples. Finally, if the self-supervised approach doesn't yield improvements, we could explore a hybrid approach that combines lightweight supervision with self-supervised learning, potentially using a small amount of human-labeled data to guide the relevance model. These analyses and variations could provide valuable insights into the challenges of unsupervised relevance learning and inform future research directions in this area.

