Paper ID

0a4b8b161931799d5c6bc3ecf07c53bae0e9e502


Title

Contextual Reputation Scoring for Quality Filtering of High School Newspaper Articles


Introduction

Problem Statement

Current quality filtering methods for language models often fail to capture the nuanced context and diverse perspectives present in high school newspaper articles, leading to suboptimal selection of training data and potentially biased model outputs.

Motivation

Existing approaches typically use simple heuristics or pre-trained classifiers to assess text quality, which may not adequately represent the unique characteristics of student journalism. High school newspapers offer a rich source of diverse perspectives and writing styles that could enhance language model training. However, their quality varies widely and requires careful filtering that considers both content and context. Our proposed Contextual Reputation Scoring (CRS) system aims to address these limitations by combining multi-faceted quality assessment with localized reputation modeling, potentially improving the selection of high-quality, diverse training data for language models.


Proposed Method

We propose a Contextual Reputation Scoring (CRS) system that combines multi-faceted quality assessment with localized reputation modeling. The CRS pipeline involves:

  1. Fine-tuning a language model on a curated dataset of exemplary high school journalism to learn domain-specific quality indicators.
  2. Developing a graph-based reputation system that models relationships between schools, regions, and historical article quality.
  3. Implementing a context-aware encoder that captures local cultural nuances and writing styles.
  4. Creating an ensemble scoring mechanism that integrates content quality assessment, reputation scores, and contextual relevance.
  5. Employing active learning to continuously refine the scoring system based on expert feedback.
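As a concrete (if simplified) sketch, the final scoring step of this pipeline might look like the following; the component names, example weights, and normalization are illustrative assumptions, not values fixed by the proposal:

```python
from dataclasses import dataclass

@dataclass
class ArticleScores:
    quality: float     # step 1: fine-tuned quality model
    reputation: float  # step 2: graph-based reputation system
    context: float     # step 3: context-aware encoder relevance

def crs_score(s: ArticleScores, w=(0.5, 0.3, 0.2)) -> float:
    """Step 4: ensemble score as a weighted average of the components."""
    return (w[0] * s.quality + w[1] * s.reputation + w[2] * s.context) / sum(w)

score = crs_score(ArticleScores(quality=0.8, reputation=0.6, context=0.7))
```

The active-learning component (step 5) would then periodically re-estimate these weights from expert feedback.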


Experiments Plan

Step-by-Step Experiment Plan

Step 1: Data Collection

Gather a large corpus of high school newspaper articles from diverse sources. Aim for at least 10,000 articles from 100+ schools across different regions. Use web scraping tools to collect articles from school newspaper websites and aggregate platforms.
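A minimal, dependency-free sketch of the extraction side of such a scraper, using Python's standard-library HTML parser on a toy page (a real pipeline would fetch live pages, e.g. with requests, use per-site selectors, and respect robots.txt):

```python
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    """Collects the text inside <p> tags -- a stand-in for a production
    scraper with per-site extraction rules."""
    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data

# In practice the HTML would come from fetching each newspaper URL;
# here we parse a toy page.
page = ("<html><body><h1>The Eagle</h1>"
        "<p>Students rallied on Friday.</p>"
        "<p>The council voted 5-2.</p></body></html>")
extractor = ArticleExtractor()
extractor.feed(page)
article_text = " ".join(extractor.paragraphs)
```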

Step 2: Baseline Model Preparation

Implement three baseline quality filtering methods: 1) simple heuristics (e.g., article length, readability scores), 2) a pre-trained text classification model (e.g., BERT fine-tuned on general web content quality), and 3) GPT-based zero-shot classification.
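The heuristic baseline (method 1) could be sketched as follows; the Flesch Reading Ease formula is standard, but the syllable heuristic, the 150-word threshold, and the equal weighting of length and readability are arbitrary assumptions:

```python
import re

def count_syllables(word):
    """Rough syllable count: number of vowel groups, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Standard Flesch Reading Ease; higher scores mean easier text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

def heuristic_quality(text, min_words=150):
    """Toy filter in [0, 1] combining length and readability equally."""
    n = len(re.findall(r"[A-Za-z']+", text))
    length_score = min(1.0, n / min_words)
    readability = max(0.0, min(1.0, flesch_reading_ease(text) / 100))
    return 0.5 * length_score + 0.5 * readability
```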

Step 3: Fine-tune Quality Assessment Model

Curate a small dataset (500-1,000 articles) of high-quality high school journalism, annotated by journalism educators, and fine-tune a BERT-based model on it to learn domain-specific quality indicators. For the remaining articles, prompt GPT-3.5 and GPT-4 with instructions like 'Rate the quality of this high school newspaper article on a scale of 1-10:' to generate pseudo-labels.
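Pseudo-labeling with an instruction-following model requires parsing free-form replies back into numeric ratings; a defensive sketch (the regex and the return-None-to-re-query convention are assumptions, not part of the plan):

```python
import re
from typing import Optional

PROMPT = ("Rate the quality of this high school newspaper article "
          "on a scale of 1-10:\n\n{article}")

def parse_rating(reply: str) -> Optional[float]:
    """Extract the first rating in [1, 10] from a free-form model reply;
    None signals an unusable reply (e.g. a refusal) to re-query."""
    match = re.search(r"\b(?:10|[1-9])(?:\.\d+)?\b", reply)
    if match is None:
        return None
    value = float(match.group(0))
    return value if 1.0 <= value <= 10.0 else None
```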

Step 4: Develop Graph-based Reputation System

Create a graph database representing schools, regions, and articles. Calculate initial reputation scores from average article quality and school prestige (e.g., journalism awards), then implement a PageRank-like algorithm to propagate reputation through the graph.
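The propagation step might be sketched as a personalized-PageRank-style iteration; the bipartite school/region graph, damping factor, and update rule below are illustrative choices, not details fixed by the plan:

```python
def propagate_reputation(edges, base, damping=0.85, iters=50):
    """Iteratively propagate reputation: each node keeps (1 - damping) of
    its base score and receives damping * score / out-degree from each
    node linking to it (a personalized-PageRank-style update).

    edges: dict node -> list of neighbour nodes
    base:  dict node -> initial reputation (e.g. average article quality)
    """
    scores = dict(base)
    for _ in range(iters):
        new = {node: (1 - damping) * base[node] for node in base}
        for src, nbrs in edges.items():
            if not nbrs:
                continue
            share = damping * scores[src] / len(nbrs)
            for dst in nbrs:
                new[dst] += share
        scores = new
    return scores

# Toy graph: two schools linked to their region and back.
edges = {
    "school_a": ["region_1"],
    "school_b": ["region_1"],
    "region_1": ["school_a", "school_b"],
}
base = {"school_a": 0.9, "school_b": 0.4, "region_1": 0.0}
reputation = propagate_reputation(edges, base)
```

Note that the region inherits reputation from its schools and feeds it back, so a strong school lifts neighbouring schools slightly, which is the intended "localized reputation" effect.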

Step 5: Implement Context-aware Encoder

Fine-tune a BERT model on the entire corpus of high school articles, masked by school and region, to learn local writing styles and cultural nuances. Use this model to encode articles for contextual relevance scoring.
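Before masked fine-tuning, school and region mentions can be replaced with placeholder tokens so the encoder models local style rather than memorizing identities; a minimal sketch (the token names and regex-based matching are assumptions, and a real pipeline would add the tokens to the BERT vocabulary):

```python
import re

def mask_local_entities(text, school_names, region_names,
                        school_token="[SCHOOL]", region_token="[REGION]"):
    """Replace school/region mentions with placeholder tokens.
    Longer names are matched first to avoid partial replacements."""
    for name in sorted(school_names, key=len, reverse=True):
        text = re.sub(re.escape(name), school_token, text, flags=re.IGNORECASE)
    for name in sorted(region_names, key=len, reverse=True):
        text = re.sub(re.escape(name), region_token, text, flags=re.IGNORECASE)
    return text

masked = mask_local_entities("Lincoln High students in Portland rallied.",
                             ["Lincoln High"], ["Portland"])
```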

Step 6: Create Ensemble Scoring Mechanism

Combine scores from the fine-tuned quality assessment model, graph-based reputation system, and context-aware encoder using a weighted average. Tune weights using a small held-out set of expert-rated articles.
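Weight tuning on the held-out set can be done with a simple grid search maximizing correlation with expert ratings; the step size, Pearson objective, and toy data below are assumptions for illustration:

```python
def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for degenerate (constant) inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy)

def tune_weights(component_scores, expert_ratings, step=0.1):
    """Grid-search non-negative weights summing to 1 over the three
    component scores, maximizing correlation with expert ratings."""
    best_corr, best_w = -2.0, (1.0, 0.0, 0.0)
    steps = int(round(1 / step))
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            w = (i * step, j * step, 1.0 - (i + j) * step)
            ensemble = [w[0] * q + w[1] * r + w[2] * c
                        for q, r, c in component_scores]
            corr = pearson(ensemble, expert_ratings)
            if corr > best_corr:
                best_corr, best_w = corr, w
    return best_w, best_corr

# Toy held-out set: (quality, reputation, context) per article + expert rating.
components = [(0.9, 0.5, 0.4), (0.2, 0.6, 0.5), (0.7, 0.4, 0.6), (0.4, 0.5, 0.3)]
expert = [9.0, 3.0, 8.0, 5.0]
best_w, best_corr = tune_weights(components, expert)
```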

Step 7: Implement Active Learning Loop

Set up an interface for expert feedback on a sample of articles. Use this feedback to periodically retrain the quality assessment model and adjust reputation scores.
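One simple way to choose which articles to surface for expert review is disagreement sampling: prioritize articles where the three component scores diverge most, since these are where the ensemble is least trustworthy. A sketch (the max-minus-min spread criterion is an assumption):

```python
def select_for_review(articles, component_scores, k=5):
    """Pick the k articles whose component scores disagree most
    (largest max-minus-min spread) for expert annotation.

    component_scores: (quality, reputation, context) tuples,
    aligned with the articles list.
    """
    spread = lambda s: max(s) - min(s)
    ranked = sorted(zip(articles, component_scores),
                    key=lambda pair: spread(pair[1]), reverse=True)
    return [article for article, _ in ranked[:k]]

articles = ["a1", "a2", "a3"]
scores = [(0.9, 0.1, 0.5), (0.5, 0.5, 0.5), (0.7, 0.6, 0.8)]
```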

Step 8: Evaluation

Compare CRS against baseline methods on a test set of 1000 expert-rated articles. Metrics include correlation with expert ratings, diversity of selected articles (measured by topic and style variance), and downstream performance on tasks like summarization and style transfer using a fine-tuned T5 model.
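Correlation with expert ratings can be computed with Spearman's rank coefficient; a dependency-free sketch assuming no tied scores (which holds for continuous ensemble outputs):

```python
def spearman(xs, ys):
    """Spearman rank correlation via the classic 1 - 6*sum(d^2)/(n(n^2-1))
    formula; valid when there are no ties."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```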

Step 9: Ablation Studies

Conduct ablation studies by removing each component of the CRS system to assess its impact on overall performance.
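The ablations can be organized as a small harness that drops one component at a time and reports the metric delta; the combine/evaluate interfaces and toy data below are illustrative assumptions:

```python
def ablation_study(score_lists, combine, evaluate):
    """Drop each component in turn and report the metric delta vs. the
    full system (positive delta = removing the component hurts).

    score_lists: dict component name -> per-article score list
    combine:     fn(dict of score lists) -> ensemble score list
    evaluate:    fn(ensemble score list) -> scalar metric
    """
    full = evaluate(combine(score_lists))
    deltas = {}
    for name in score_lists:
        reduced = {k: v for k, v in score_lists.items() if k != name}
        deltas[name] = full - evaluate(combine(reduced))
    return full, deltas

# Toy example: mean-combine two components, metric = first article's score.
scores = {"quality": [1.0, 0.0], "reputation": [0.5, 0.5]}
mean_combine = lambda d: [sum(vals) / len(d) for vals in zip(*d.values())]
full, deltas = ablation_study(scores, mean_combine, lambda ens: ens[0])
```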

Step 10: Analysis and Reporting

Analyze results, focusing on improvements in quality assessment accuracy, diversity of selected articles, and impact on downstream tasks. Prepare a comprehensive report and visualization of findings.

Test Case Examples

Baseline Prompt Input

Please rate the quality of this high school newspaper article on a scale of 1-10: [Article text]

Baseline Prompt Expected Output

7

Proposed Prompt Input

Analyze this high school newspaper article:
1. Assess overall quality (1-10)
2. Identify key strengths and weaknesses
3. Consider the school's reputation and regional context
4. Evaluate writing style and cultural relevance
[Article text]

Proposed Prompt Expected Output

  1. Overall quality: 8/10
  2. Strengths: Well-researched, balanced perspective, clear writing. Weaknesses: Slightly verbose introduction, one unsupported claim.
  3. School context: Reputable journalism program, consistent high-quality output. Region known for environmental activism, article aligns with local interests.
  4. Writing style: Engaging, age-appropriate vocabulary. Culturally relevant: Addresses local environmental concerns, mentions local landmarks and figures.

Explanation

The proposed method provides a more comprehensive analysis, considering multiple factors beyond just overall quality. It takes into account the school's reputation, regional context, and cultural relevance, which are crucial for accurately assessing high school journalism.

Fallback Plan

If the proposed CRS system doesn't significantly outperform baselines, we can pivot to an analysis paper exploring the challenges of quality assessment in student journalism. We would conduct in-depth error analysis to understand where CRS fails, potentially revealing insights about the unique characteristics of high school newspapers. We could also explore the relationship between article quality and factors like school resources, geographic location, and student demographics. Additionally, we might investigate how different components of CRS (e.g., reputation scores, contextual relevance) correlate with various aspects of article quality, providing valuable insights for future research in this area.

