PageIndex.ai: A persuasive “vectorless RAG” idea—especially for real PDFs

For the last two years, “RAG” has mostly meant the same pipeline: split documents into chunks, embed them, retrieve top-k by similarity, then ask a model to answer from those snippets. It works well enough for lots of workloads, but it also fails in familiar ways: it grabs text that sounds related but isn’t actually the right place to look, it fractures context at chunk boundaries, and it struggles with cross-references (“see Appendix G”, “as described in Section 4.2”). PageIndex.ai proposes a different core metaphor: don’t search for similar text—navigate the document like a person would. 

PageIndex describes itself as a “vectorless, reasoning-based RAG engine” that transforms documents into a hierarchical tree representation (think: an “intelligent table of contents”) and then uses multi-step reasoning to traverse that structure and retrieve coherent sections rather than fixed-length chunks. The pitch is that the model reasons about where the answer should live (“this sounds like risk factors”, “this belongs to debt notes”, “this is likely in the appendix”), and the retrieval trail remains traceable back to specific locations. 
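To make that concrete, here is a minimal sketch of what such a tree might look like and how it could be rendered into a table-of-contents prompt for a model to reason over. The node fields, IDs, and layout are my own assumptions for illustration, not PageIndex's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SectionNode:
    """One node of the document tree: a heading plus everything filed under it."""
    title: str
    node_id: str
    page_range: Tuple[int, int]               # (first_page, last_page), 1-indexed
    text: str = ""                            # body text of this section only
    children: List["SectionNode"] = field(default_factory=list)

def render_toc(node: SectionNode, depth: int = 0) -> str:
    """Render the tree as an indented 'intelligent table of contents' that a
    model can read when deciding where an answer is likely to live."""
    line = f"{'  ' * depth}[{node.node_id}] {node.title} (pp. {node.page_range[0]}-{node.page_range[1]})"
    return "\n".join([line] + [render_toc(child, depth + 1) for child in node.children])

# A miniature annual-report tree, purely for illustration.
report = SectionNode("Annual Report 2024", "0", (1, 120), children=[
    SectionNode("Risk Factors", "1", (10, 28)),
    SectionNode("Notes to Financial Statements", "2", (60, 110), children=[
        SectionNode("Note 7: Long-Term Debt", "2.1", (74, 79)),
    ]),
    SectionNode("Appendix G: Derivative Instruments", "3", (111, 118)),
])
print(render_toc(report))
```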

Conceptually, that is the right direction for long, structured documents. Chunking is a blunt instrument. If your source material is annual reports, contracts, policy manuals, technical documentation, or “PDFs that were clearly authored with a table of contents in mind”, the document’s own structure is often a stronger signal than any embedding space. PageIndex leans into that by making structure first-class: build the tree, reason over the tree, then retrieve the right section(s). 

The most compelling part is not “no vectors” as an ideology, but the practical benefits that follow from structure-aware navigation:

First, it’s easier to preserve coherence. Instead of returning five disjoint paragraph chunks that merely share keywords with the query, a tree-guided approach can retrieve a full, semantically complete section (or a small set of adjacent sections) that actually contains the answer and its qualifiers. That matters in domains where footnotes, exceptions, and definitions are the real content. 
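Continuing the same sketch (and again assuming my own field names, not the product's API), the retrieval unit becomes "this whole section, subsections included" rather than "the five chunks that scored highest":

```python
def collect_section(node: SectionNode) -> str:
    """Return the full text of a selected section, subsections included, so that
    definitions, exceptions, and footnotes travel together with the answer."""
    parts = [f"## {node.title}", node.text]
    parts += [collect_section(child) for child in node.children]
    return "\n\n".join(p for p in parts if p)

# A tree-guided retriever hands the model collect_section(debt_note) in one piece,
# not five keyword-matched paragraphs pulled from different parts of the filing.
```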

Second, cross-references become a feature rather than a bug. Human readers follow references; most chunk-RAG systems do not. PageIndex’s framing explicitly targets that: retrieval is meant to be explainable, and the system is optimized for the “where in the document is this handled?” style of question. 
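A navigating retriever can treat those references as edges to follow. Here is a rough sketch, continuing the tree above, with a deliberately naive pattern matcher standing in for whatever reference detection a real system would use:

```python
import re
from typing import Optional

# Naive detector for phrases like "see Appendix G" or "as described in Section 4.2".
REF_PATTERN = re.compile(r"(?:see|described in)\s+(Appendix\s+[A-Z]|Section\s+[\d.]+)",
                         re.IGNORECASE)

def find_references(text: str) -> list:
    """Return cross-reference targets mentioned in a section's body text."""
    return REF_PATTERN.findall(text)

def resolve_reference(root: SectionNode, ref: str) -> Optional[SectionNode]:
    """Walk the tree for a node whose title mentions the referenced label."""
    if ref.lower() in root.title.lower():
        return root
    for child in root.children:
        hit = resolve_reference(child, ref)
        if hit is not None:
            return hit
    return None

# for ref in find_references(selected_section.text):
#     target = resolve_reference(report, ref)   # e.g. "Appendix G" -> the appendix node
#     # ...pull the referenced section into the same answer context as the one citing it
```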

Third, the product story is reasonably coherent. There is a hosted chat experience, an API for integration, and an MCP server so you can wire “read and reason over whole PDFs” into agentic workflows in tools like Claude Desktop or Cursor. The MCP angle is particularly pragmatic: it makes long-document reasoning available as a tool in ecosystems that already speak MCP. 

So why not declare victory and move on? Because “reasoning-based retrieval” shifts the trade-offs. In a typical vector RAG setup, retrieval is cheap and deterministic-ish; the model work happens after retrieval. In a reasoning-first setup, you often spend model capacity during retrieval itself—tree search, iterative refinement, selection, sometimes summarization—before you even answer. PageIndex’s docs and pricing hint at that through distinct components like tree generation and OCR, and per-query pricing for the chat API. 
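To see where that capacity goes, here is a deliberately simplified traversal loop, continuing the earlier sketch. The prompt format, selection policy, and step budget are all assumptions of mine, and the `llm` callable stands in for whichever model client you actually use:

```python
from typing import Callable, Tuple

def tree_search(root: SectionNode, question: str, llm: Callable[[str], str],
                max_steps: int = 4) -> Tuple[SectionNode, int]:
    """Descend the tree one level at a time, asking the model at each level
    which child section most likely contains the answer. Every iteration is a
    model call spent *before* any answer is generated."""
    node, calls = root, 0
    while node.children and calls < max_steps:
        prompt = (
            f"Question: {question}\n\n"
            f"Table of contents for the current section:\n{render_toc(node)}\n\n"
            "Reply with the single node_id most likely to contain the answer."
        )
        choice = llm(prompt).strip()                      # one model call per level
        calls += 1
        match = next((c for c in node.children if c.node_id == choice), None)
        if match is None:                                 # unresolvable reply: stop descending
            break
        node = match
    return node, calls

# section, llm_calls = tree_search(report, "What rate do the 2029 notes carry?", my_llm)
# Those llm_calls are spent before the answering prompt is even built.
```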

This leads to three practical questions any serious evaluation should answer:

  1. Latency and cost under load. If you run this against hundreds of pages across many documents, does the multi-step traversal stay fast enough for interactive use? PageIndex offers a Pro plan with usage-based pricing for queries, plus usage-based OCR and per-page tree generation (even if “unlimited” is framed as availability rather than “free”). You’ll want to model your own document volumes and query patterns; a back-of-envelope sketch follows this list.
  2. Robustness on messy inputs. A structure-driven approach is only as good as the structure it can reconstruct. For native PDFs with a real ToC, that’s great. For scans, slide decks exported to PDF, or reports with broken headings, the system relies more heavily on OCR and tree generation quality. PageIndex positions OCR that “preserves global structure” as part of the toolkit, which is encouraging, but you still need to test with your ugliest real-world PDFs. 
  3. Benchmark claims and generalization. PageIndex and the associated Mafin 2.5 evaluation emphasize 98.7% accuracy on FinanceBench. That is a striking number, and the code/results are published. But FinanceBench is a specific regime (financial QA over reports), and the evaluation is produced by the same organization that builds the system—useful, but not the same as independent replication. The right way to read it is: “this approach can be extremely strong on structured financial documents,” not “this replaces all RAG everywhere.” 
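For the first question, a rough cost model is often enough to decide whether a full load test is worth running. Every number below is a placeholder of mine; swap in your own measurements and your provider's current pricing:

```python
tree_depth = 4                # traversal steps per query (assumed)
tokens_per_step = 2_500       # ToC prompt + reasoning tokens per step (assumed)
answer_tokens = 6_000         # final answering prompt over the retrieved section (assumed)
usd_per_1k_tokens = 0.003     # blended model price (assumed)
seconds_per_call = 1.2        # observed round-trip per model call (assumed)
queries_per_day = 5_000

tokens_per_query = tree_depth * tokens_per_step + answer_tokens
usd_per_query = tokens_per_query / 1_000 * usd_per_1k_tokens
seconds_per_query = (tree_depth + 1) * seconds_per_call   # traversal calls + final answer

print(f"~{tokens_per_query:,} tokens, ~${usd_per_query:.3f}, ~{seconds_per_query:.1f}s per query")
print(f"~${usd_per_query * queries_per_day:,.2f}/day at {queries_per_day:,} queries")
```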

My overall take: PageIndex is worth paying attention to because it attacks a genuine failure mode of mainstream RAG, the mismatch between semantic similarity and actual relevance in structured documents. For document-heavy workflows where traceability matters (finance, legal, compliance, technical operations), “navigate then retrieve” feels like the correct primitive. For short, unstructured corpora, or for ultra-low-latency applications, classic embedding retrieval may remain the better default.

If you want to test it quickly, here is a compact evaluation plan that tends to separate “nice demo” from “production fit”:

  1. Pick three document types you actually use: one clean (ToC-rich), one typical, one ugly (scanned or inconsistent formatting).
  2. Ask 20 questions per document: ten that require finding the right section (not just keyword overlap), five that require following a reference, and five that require careful qualifiers (“except”, “unless”, “as of”).
  3. Measure answer correctness, citation/trace accuracy, time-to-first-answer, and total cost per question.
  4. Repeat with a baseline chunk-RAG pipeline and compare.

If PageIndex wins decisively on the reference-following and qualifier questions without blowing up latency or cost, you have a strong case for adopting it in that domain.
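Here is a minimal harness sketch for that comparison. The two answer functions and the grading step are placeholders you would wire up to PageIndex's API and your baseline pipeline respectively; none of the names below come from either system:

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class EvalResult:
    system: str
    correct: int = 0
    cited_right_place: int = 0
    total_latency_s: float = 0.0
    total_cost_usd: float = 0.0

def run_eval(system: str,
             answer_fn: Callable[[str], Dict],      # question -> {"answer", "citation", "cost_usd"}
             questions: List[Dict],                 # each: {"q", "expected", "expected_location"}
             grade_fn: Callable[[str, str], bool],  # (answer, expected) -> correct?
             ) -> EvalResult:
    """Run one retrieval system over the question set, tracking the four metrics above."""
    result = EvalResult(system=system)
    for item in questions:
        start = time.perf_counter()
        out = answer_fn(item["q"])
        result.total_latency_s += time.perf_counter() - start
        result.total_cost_usd += out.get("cost_usd", 0.0)
        result.correct += grade_fn(out["answer"], item["expected"])
        result.cited_right_place += item["expected_location"] in out.get("citation", "")
    return result

# results = [run_eval("pageindex", pageindex_answer, questions, grade),
#            run_eval("chunk-rag", baseline_answer, questions, grade)]
# for r in results:
#     n = len(questions)
#     print(f"{r.system}: {r.correct}/{n} correct, {r.cited_right_place}/{n} traced correctly, "
#           f"{r.total_latency_s / n:.1f}s and ${r.total_cost_usd / n:.3f} per question on average")
```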

PageIndex doesn’t “kill RAG”. It proposes a more document-native retrieval strategy for the cases where RAG hurts the most: long PDFs that were meant to be read with structure. That is a valuable niche—and in enterprise document work, it’s a very large niche.