Vision indexing · Dual-consumer RAG · Quality gates

The Archive

55,000+ pages of gaming history, indexed so both humans and AI can query primary sources.

The corpus

277 issues spanning Gamest, Famitsu, and Dengeki. Cover art, strategy guides, developer interviews, hardware ads, launch coverage. The browser shows every issue indexed, filterable by magazine and era. Each cover is a doorway into 100-250 pages of timestamped gaming history.

Search and retrieval

Every page is searchable across 11 extracted fields: games, developers, people, content types, platforms, historical tags, and article-worthiness notes. Results surface the issue, page number, a generated description, and the tags that made it article-worthy. 24,222 pages flagged with human-verified notes on why they matter. The same index that serves human research grounds the AI synthesis pipeline.
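
The actual storage layer and field names aren't specified here, so the following is only a sketch of what multi-field page search could look like, using SQLite's FTS5 over a subset of the extracted fields described above:

```python
import sqlite3

# Hypothetical schema: these column names mirror a subset of the extracted
# fields described above; the real index may use a different storage layer.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE VIRTUAL TABLE pages USING fts5(
        issue, page, description, games, developers, people,
        content_types, platforms, historical_tags, notes
    )
""")
conn.execute(
    "INSERT INTO pages VALUES (?,?,?,?,?,?,?,?,?,?)",
    ("Gamest 1993-12", "45",
     "Launch coverage of Virtua Fighter's arcade debut",
     "Virtua Fighter", "Sega AM2", "Yu Suzuki",
     "launch coverage", "arcade", "3D fighting origins",
     "First contemporary coverage of polygonal fighters"),
)

def search(query: str):
    """Match the query against every indexed field; surface the issue,
    page number, and generated description for each hit."""
    return conn.execute(
        "SELECT issue, page, description FROM pages WHERE pages MATCH ? "
        "ORDER BY rank", (query,)
    ).fetchall()
```

A query like `search('"Virtua Fighter"')` matches across all fields at once, which is what lets the same index serve both a human filtering by tag and an AI pipeline retrieving grounding context.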

The pages

An individual page viewer shows the original scan with its generated description beneath. Strategy guides, hardware ads, developer features; each page carries its own context: what publication, what year, what it documents. Pages can be paired for multi-page spreads.

Ask it anything

Imagine asking your LLM when a certain game released, how many articles covered Virtua Fighter's launch, or to pull every advertisement for a specific title across all indexed magazines. The answers come from the actual historical pages, scanned and indexed, rather than from the web. Your own primary source: 55,995 pages of contemporary coverage, ads, and editorial from the years these games shipped.
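
A hedged sketch of how such answers could be grounded: retrieval hits from the index get assembled into the prompt, so the model answers from the scanned pages rather than the open web. The `hits` shape and the instruction wording are assumptions, not the project's actual code:

```python
# Hypothetical glue between the search index and an LLM. Each hit is
# assumed to be an (issue, page, description) tuple from the index.
def build_prompt(question: str, hits) -> str:
    """Assemble retrieved pages into a context block so the model answers
    from the indexed primary sources and can cite issue and page."""
    context = "\n".join(
        f"[{issue} p.{page}] {desc}" for issue, page, desc in hits
    )
    return (
        "Answer using ONLY the magazine pages below. "
        "Cite issue and page for every claim.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```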

What the pipeline taught

Processing 55,000 pages exposed failure modes that only appear at scale. Early attempts used a single orchestrator agent to manage an entire issue (250 pages at once). It would silently stop mid-run: not crash, not error, just stop. No trace, no diagnostic, pages simply not indexed. The fix was structural: 14-page batches with explicit completion checks at every boundary. An agent that can only fail on 14 pages is an agent that can be retried without consequence.

LLM compliance is the problem that keeps returning. A model given a strict extraction schema, a rules document, and a workflow will follow all three, until it doesn't. Game names get garbled. Required fields get invented from context. Array fields come back as plain strings. A batch that looks clean at 20 pages has hallucinated entries by page 180. The quality system (7 audit types, a 50-character description detector, per-issue average length thresholds) exists because the model cannot be trusted to self-police across an entire magazine run.

The most concrete measure of this: 109 issues had to be re-indexed after a quality audit surfaced systematic degradation. The Gamest mass-indexing run produced issues where average description length was 63-68 characters and 38-106 pages per issue had descriptions under 50 characters: not just thin, but placeholder-quality outputs that passed the pipeline silently. Dengeki PlayStation issues came back with garbled Japanese game titles: the model had stopped transliterating correctly and was emitting corrupted romaji. Each re-index pass was a session of 1-2 issues, and re-indexing introduced its own failure: batch scripts that created a duplicate magazine entry instead of updating the existing one, requiring a post-session deduplication query before the database was clean.

Nine successive normalization scripts for Gamest alone, each one written to handle a new edge case or failure mode that the previous version missed. That number is an honest record of how many times the pipeline was believed to be finished.
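
The batching discipline described above can be sketched as follows. The batch size and the 50-character threshold come from the text; the function names and result shape are illustrative assumptions, not the project's actual code:

```python
BATCH_SIZE = 14       # small enough that a failed batch is cheap to retry
MIN_DESC_CHARS = 50   # descriptions below this read as placeholder output

def index_issue(pages, index_batch, max_retries=3):
    """Index an issue in fixed-size batches with an explicit completion
    check at every boundary, retrying a batch instead of trusting the
    model to self-police. `index_batch` stands in for one LLM call that
    returns a dict per page with at least 'page' and 'description' keys."""
    results = {}
    for start in range(0, len(pages), BATCH_SIZE):
        batch = pages[start:start + BATCH_SIZE]
        for _attempt in range(max_retries):
            out = index_batch(batch)
            # Completion check: every page came back, none placeholder-thin.
            if len(out) == len(batch) and all(
                len(r["description"]) >= MIN_DESC_CHARS for r in out
            ):
                results.update({r["page"]: r for r in out})
                break
        else:
            # Loud failure beats a silent mid-run stop.
            raise RuntimeError(f"batch starting at page {batch[0]} failed audit")
    return results
```

The point of the structure is that a failure surfaces at the 14-page boundary where it happened, instead of being discovered 180 pages later by a quality audit.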

Under the hood

Data ingestion · Vision indexing · Quality gates · Drift detection · Hybrid LLM routing · Full-text search