Skip to main content
Cornell University
We gratefully acknowledge support from the Simons Foundation, member institutions, and all contributors. Donate
arxiv logo > cs > arXiv:2509.25494

Help | Advanced Search

arXiv logo
Cornell University Logo

quick links

  • Login
  • Help Pages
  • About

Computer Science > Information Retrieval

arXiv:2509.25494 (cs)
[Submitted on 29 Sep 2025]

Title:On-Premise AI for the Newsroom: Evaluating Small Language Models for Investigative Document Search

Authors:Nick Hagar, Nicholas Diakopoulos, Jeremy Gilbert
View a PDF of the paper titled On-Premise AI for the Newsroom: Evaluating Small Language Models for Investigative Document Search, by Nick Hagar and 2 other authors
View PDF HTML (experimental)
Abstract:Investigative journalists routinely confront large document collections. Large language models (LLMs) with retrieval-augmented generation (RAG) capabilities promise to accelerate the process of document discovery, but newsroom adoption remains limited due to hallucination risks, verification burden, and data privacy concerns. We present a journalist-centered approach to LLM-powered document search that prioritizes transparency and editorial control through a five-stage pipeline -- corpus summarization, search planning, parallel thread execution, quality evaluation, and synthesis -- using small, locally-deployable language models that preserve data security and maintain complete auditability through explicit citation chains. Evaluating three quantized models (Gemma 3 12B, Qwen 3 14B, and GPT-OSS 20B) on two corpora, we find substantial variation in reliability. All models achieved high citation validity and ran effectively on standard desktop hardware (e.g., 24 GB of memory), demonstrating feasibility for resource-constrained newsrooms. However, systematic challenges emerged, including error propagation through multi-stage synthesis and dramatic performance variation based on training data overlap with corpus content. These findings suggest that effective newsroom AI deployment requires careful model selection and system design, alongside human oversight for maintaining standards of accuracy and accountability.
Comments: Accepted to Computation + Journalism Symposium 2025
Subjects: Information Retrieval (cs.IR)
Cite as: arXiv:2509.25494 [cs.IR]
  (or arXiv:2509.25494v1 [cs.IR] for this version)
  https://doi.org/10.48550/arXiv.2509.25494
arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Nick Hagar [view email]
[v1] Mon, 29 Sep 2025 20:50:40 UTC (73 KB)
Full-text links:

Access Paper:

    View a PDF of the paper titled On-Premise AI for the Newsroom: Evaluating Small Language Models for Investigative Document Search, by Nick Hagar and 2 other authors
  • View PDF
  • HTML (experimental)
  • TeX Source
license icon view license
Current browse context:
cs.IR
< prev   |   next >
new | recent | 2025-09
Change to browse by:
cs

References & Citations

  • NASA ADS
  • Google Scholar
  • Semantic Scholar
export BibTeX citation Loading...

BibTeX formatted citation

×
Data provided by:

Bookmark

BibSonomy logo Reddit logo

Bibliographic and Citation Tools

Bibliographic Explorer (What is the Explorer?)
Connected Papers (What is Connected Papers?)
Litmaps (What is Litmaps?)
scite Smart Citations (What are Smart Citations?)

Code, Data and Media Associated with this Article

alphaXiv (What is alphaXiv?)
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub (What is DagsHub?)
Gotit.pub (What is GotitPub?)
Hugging Face (What is Huggingface?)
Papers with Code (What is Papers with Code?)
ScienceCast (What is ScienceCast?)

Demos

Replicate (What is Replicate?)
Hugging Face Spaces (What is Spaces?)
TXYZ.AI (What is TXYZ.AI?)

Recommenders and Search Tools

Influence Flower (What are Influence Flowers?)
CORE Recommender (What is CORE?)
  • Author
  • Venue
  • Institution
  • Topic

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)
  • About
  • Help
  • contact arXivClick here to contact arXiv Contact
  • subscribe to arXiv mailingsClick here to subscribe Subscribe
  • Copyright
  • Privacy Policy
  • Web Accessibility Assistance
  • arXiv Operational Status
    Get status notifications via email or slack