Case Study

Lenfest AI Collaborative and Fellowship Program: Dewey, the Archivist

An overview of The Philadelphia Inquirer’s AI archive research assistant and its open-source code

September 19, 2025

The Philadelphia Inquirer’s newsroom relies heavily on its deep archives to provide historical context for investigative reporting, politics, and culture. But fragmented systems created a daily frustration: newer stories live in the CMS, while older coverage resides in a separate print archive with different logins and search interfaces. Reporters must guess date ranges, retry queries with slight keyword changes, and reconcile mismatched metadata — all under deadline pressure.

The Inquirer set out to test whether a generative AI “archive research assistant” could help automate this process. The goal was not to replace reporters’ judgment, but to free them from hours of tedious searching, give back creative time, and model a solution that other newsrooms with deep archives could adopt.

Their bet: A conversational AI assistant could compress archive research from days to hours, surface more relevant context, and prepare reporters to pursue stories with sharper historical grounding.

Product design & development

Approach: The Inquirer designed an AI-powered assistant (nicknamed Dewey) that retrieves and summarizes archival content, links directly to the correct source system, and provides transparent citations.

What it does:

Accepts natural language questions across decades of content
Retrieves relevant passages from both CMS and digitized print archives
Synthesizes answers with inline citations and clickable sources
Applies deterministic filters (dates and authors) for precision
Displays inferred parameters (time range, author) so reporters can adjust
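
As an illustration of how inferred parameters might become a deterministic filter, here is a minimal Python sketch. It assumes the filter is expressed as an OData-style string of the kind Azure AI Search accepts. The field names (published_date, author), the InferredFilters class, and the sample author are hypothetical and are not taken from Dewey's code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class InferredFilters:
    """Parameters inferred from a question and shown to the reporter for adjustment."""
    start: Optional[date] = None
    end: Optional[date] = None
    author: Optional[str] = None


def to_odata_filter(f: InferredFilters) -> str:
    """Build an OData-style filter string. Field names are hypothetical."""
    clauses = []
    if f.start is not None:
        clauses.append(f"published_date ge {f.start.isoformat()}T00:00:00Z")
    if f.end is not None:
        clauses.append(f"published_date le {f.end.isoformat()}T23:59:59Z")
    if f.author:
        # OData string literals escape a single quote by doubling it.
        safe = f.author.replace("'", "''")
        clauses.append(f"author eq '{safe}'")
    return " and ".join(clauses)


if __name__ == "__main__":
    filters = InferredFilters(start=date(1990, 1, 1), end=date(1999, 12, 31),
                              author="Pat O'Neil")
    print(to_odata_filter(filters))
```

Because the filter is built from explicit fields rather than left to the model, a reporter who changes the displayed date range or author changes the query in a predictable way.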

How it works (plain-English stack):

Interface: Chat-style assistant, familiar UI for reporters
Retrieval: Retrieval-Augmented Generation (RAG) pipeline with Azure AI Search (vector + hybrid semantic re-rank, plus recency weighting)
Indexing: Content chunked at ~512 tokens with ~20% overlap; metadata structured in JSON-like blocks
Filters: Deterministic date and author constraints derived via OpenAI function-calling
Data: Approximately 127,000 web articles (rich JSON) and 200,000 digitized print articles (dating back to 1978, parsed via spaCy)
Prototype: MVP built in around 2 weeks on Microsoft’s Azure demo repo, then customized heavily for newsroom use
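
The indexing step above (chunks of about 512 tokens with about 20% overlap) can be sketched as a sliding window. This is an illustrative sketch only: it treats any list of items as tokens, whereas a production pipeline would use the embedding model's tokenizer, and the function name and defaults are assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], chunk_size: int = 512,
                 overlap_ratio: float = 0.2) -> List[Sequence[T]]:
    """Split tokens into windows of chunk_size that share overlap_ratio of their length."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be positive and overlap_ratio in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)  # 102 tokens for 512 at 20%
    step = chunk_size - overlap                # each window starts 410 tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    words = ("lorem ipsum " * 600).split()  # stand-in for a tokenized article
    pieces = chunk_tokens(words)
    print(len(pieces), [len(p) for p in pieces])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.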

Process: Reporters, product staff, and engineers collaborated in iterative cycles. Transparency (citations, exposed filters) and speed-to-value were prioritized.

Team: AI Fellow (technical lead), newsroom and archives leaders, and the product team, with support from OpenAI and Microsoft, especially the Microsoft Artificial Intelligence Development Acceleration Program (MAIDAP).

Key learnings

Successes

Time saved: Multi-decade research compressed from days to hours. For simple questions like “who was the mayor in XXXX”, Dewey took about as long as a Google search, but reporters still preferred its conversational answers and summaries. For longer-term tasks like “Summarize a decade of coverage on XXX”, reporters said it could save several days of research.
Transparency: Inline citations and visible filters built user trust
Adoption: Pilot users (around 50 staff, with 15–20 reporters using it regularly) became strong advocates
Extended value: Marketing and ad sales used Dewey to summarize sector/client coverage

Challenges

Accuracy: Early hallucinations and name collisions, mitigated with deterministic filters and clickable verification
Temporal understanding: Solved by explicitly providing today’s date and linguistically inferring date ranges from user input
Retrieval scaling: With approximately 350,000 articles indexed, tuning relevance remains ongoing; knowledge graphs explored for future improvements
Change management: Reporters required training to adapt prompts and workflows; transparent controls helped build confidence
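
To illustrate the temporal-understanding fix, the sketch below resolves a few relative time phrases against an explicitly supplied current date. It is a rule-based stand-in, not Dewey's method: the case study says Dewey derives these constraints through OpenAI function-calling. The function names and the three supported phrase patterns are assumptions chosen for the example.

```python
import re
from datetime import date
from typing import Optional, Tuple


def _years_ago(today: date, years: int) -> date:
    """Same calendar day N years earlier, falling back to Feb 28 for leap days."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        return today.replace(year=today.year - years, day=28)


def infer_date_range(question: str, today: date) -> Optional[Tuple[date, date]]:
    """Return (start, end) for a few common phrasings, or None if no range is found."""
    q = question.lower()
    m = re.search(r"\b(?:last|past) (\d+) years?\b", q)
    if m:  # e.g. "the last 5 years"
        return _years_ago(today, int(m.group(1))), today
    m = re.search(r"\b(\d{3}0)s\b", q)
    if m:  # e.g. "the 1990s"
        decade = int(m.group(1))
        return date(decade, 1, 1), min(date(decade + 9, 12, 31), today)
    m = re.search(r"\bsince (\d{4})\b", q)
    if m:  # e.g. "since 2010"
        return date(int(m.group(1)), 1, 1), today
    return None


if __name__ == "__main__":
    today = date(2025, 9, 19)  # passed in explicitly, never guessed
    print(infer_date_range("Summarize transit coverage from the last 5 years", today))
```

Passing the current date in explicitly is the key point: a phrase like "the last 5 years" has no fixed meaning unless the system is told what day it is.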

What’s next

Short-term roadmap:

Roll out to full newsroom (around 200 staff) with training modules
Expand pipelines to earlier archives and experiment with image indexing
Improve evaluation metrics (moving from hand-labeled NDCG to LLM-as-judge frameworks)
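
For readers unfamiliar with NDCG (normalized discounted cumulative gain), the sketch below computes it from hand-assigned relevance grades for a ranked result list. It uses the common linear-gain form with a log2 position discount and, as a simplifying assumption, derives the ideal ordering from the retrieved results only. The grading scale and function names are illustrative, not the Inquirer's evaluation code.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: each grade is discounted by log2 of its rank + 1."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))


def ndcg(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the same grades sorted best-first."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical grades for the top 5 results of one query (3 = highly relevant).
    graded = [3, 0, 2, 1, 0]
    print(round(ndcg(graded, k=5), 3))
```

A score of 1.0 means the results were already in the best possible order. An LLM-as-judge setup would replace the hand-assigned grades with model-assigned ones while keeping the same scoring arithmetic.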

Key questions:

How to balance speed and accuracy in high-stakes reporting?
Can Dewey scale across larger archives without overwhelming users?
What business models (education, licensing, subscriber add-ons) could extend value?

Looking ahead: Open-sourcing the core prompts and modular architecture so other newsrooms can stand up RAG pipelines faster.

Why this matters

The Inquirer’s project shows how AI can strengthen the core editorial mission of local journalism by unlocking archives that were once siloed and underused. By giving reporters faster access to trustworthy historical context, Dewey turns the archive into an active asset — and offers a replicable model for any publisher with deep historical reporting.