CasePilot — How a Reasoning Framework Beats a Bigger Model
Why agentic AI inside a verification discipline produces output a court will accept — and chat AI cannot.
A family-law case can produce three thousand pages of evidence: bank statements, sworn declarations, WhatsApp threads, school records, deposition transcripts, scanned receipts that someone photographed at an angle. Buried in that stack are the four or five facts that decide the case. Surfacing them is forty hours of paralegal work. And every chat-based AI tool a litigant tries fabricates citations, miscounts deposits, and forgets what it read three pages ago. Those failures are not bugs that the next model release will fix. They are properties of asking a single language model to read a corpus, do the math, write the brief, and check its own work. CasePilot is built on a different premise: the model writes, but deterministic code retrieves, calculates, and verifies. What follows is why that architecture exists and what it produces.
The problem isn't the model. It's the architecture.
If you give a leading language model a case file and ask it to find the discrepancies between a parent's stated and actual income, four things will happen, reliably, in production.
It will hallucinate citations. Stanford's HAI group has documented this at rates between seventeen and thirty-three percent for legal queries, depending on the tool. A confident-sounding paragraph will quote a passage that does not exist on the page it cites, or cite a case that was never decided.
It will miscalculate. Language models are pattern-completion engines, not arithmetic engines. Asked to total twelve months of bank deposits from a PDF, they will produce a number that looks right and is wrong by hundreds or thousands of dollars. The error is silent because the prose around it is fluent.
It will lose the middle. The "lost in the middle" effect is well-studied: large-context models retrieve far more reliably from the beginning and end of long inputs than from the center. In a case file, that means the email that proves the contradiction can be the email the model glanced past.
And it will never tell you what it didn't find. A chat tool's failure mode is to paper over gaps with confident prose. Legal reasoning lives or dies on knowing what you don't know, and a system optimized for fluency will not declare its own ignorance.
You cannot prompt-engineer your way out of these. They are structural properties of running a single neural network across a long, messy corpus. Fixing them requires changing the architecture.
Software engineers solved this two years ago.
In 2024, programmers were asking the same question lawyers are asking now: how do you let a language model work productively across a large body of text it cannot fit in its head? A modern codebase has thousands of files and millions of lines. No single model call can reason about all of it.
The answer that emerged is the agentic IDE. Cursor, Claude Code, and the rest of the modern wave do not stuff the codebase into a prompt. They give the model tools: a file reader, a search index, a test runner, a type checker, a Python interpreter. The model plans the change, calls the tools to gather evidence, writes a candidate edit, runs the tests, and either ships or revises. Deterministic systems handle retrieval, computation, and verification. The neural component does what it is good at: planning and writing prose-shaped artifacts. Everything else happens in code that either succeeds or fails in a way you can inspect.
Litigation is the same problem in a different costume. A case file is a codebase made of evidence. Retrieval is search. Arithmetic is the test runner. Citation-checking is the type checker. The architecture that works for thousands of files of code works for thousands of pages of evidence, and for the same reasons.
CasePilot is that architecture, applied to family law.
What CasePilot actually is
CasePilot is structured around a five-stage reasoning framework. Each stage produces an artifact the next consumes. Each stage is independently auditable. The language model participates in some stages and is forbidden from others.
Stage 1. Ingest and structure. Every page, table, and date is parsed at upload and indexed three ways: as semantically searchable text, as structured rows in a SQL store, and as a metadata catalog of parties, dates, and document types. The SQL store is what later makes it possible to answer numerical questions without the model ever reading a number off a page.
Stage 2. Decompose the question. Before any retrieval runs, the system turns the user's request into a tree of specific, retrievable sub-questions. The user can edit the tree. Once approved, it is the contract for the rest of the run.
Stage 3. Gather evidence. Each leaf question is routed by type. Quotation questions go through hybrid semantic search and return verbatim chunks. Numerical questions go through deterministic SQL against the structured tables, and Python performs the math. The model never reads numbers off a PDF.
Stage 4. Reconcile evidence. The gathered evidence is cross-referenced and labeled: corroborated, contradicted, single-sourced, or missing. Gaps become first-class outputs, not absences.
Stage 5. Synthesize and validate. The model drafts the final document from the reconciled evidence only. A deterministic Python validator then checks every quote, every number, and every inline computation against the source. A draft that fails validation is returned to the model with the specific failures. After three failed attempts, the run is marked failed and the document is never shown to the user.
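The Stage 5 loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CasePilot's actual validator: the names Citation, Draft, validate, and synthesize are hypothetical, the drafter is any callable standing in for the model, and only the quote check is shown (numbers and inline computations would get analogous checks). The three-attempt limit and the "no document on failure" behavior come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

MAX_ATTEMPTS = 3  # from the text: three failed attempts fail the run


@dataclass
class Citation:
    quote: str      # text the draft claims is verbatim
    source_id: str  # hypothetical key into the page store


@dataclass
class Draft:
    text: str
    citations: List[Citation]


def validate(draft: Draft, pages: Dict[str, str]) -> List[str]:
    """Return one failure message per citation that cannot be verified."""
    failures = []
    for c in draft.citations:
        page = pages.get(c.source_id)
        if page is None:
            failures.append(f"unknown source {c.source_id!r}")
        elif c.quote.encode("utf-8") not in page.encode("utf-8"):
            # Byte-level containment: no fuzzy matching, no paraphrase.
            failures.append(f"quote not found in {c.source_id!r}: {c.quote!r}")
    return failures


def synthesize(drafter: Callable[[List[str]], Draft],
               pages: Dict[str, str]) -> Optional[Draft]:
    """Draft, validate, and retry; return None if every attempt fails."""
    failures: List[str] = []
    for _ in range(MAX_ATTEMPTS):
        draft = drafter(failures)          # drafter sees the prior failures
        failures = validate(draft, pages)
        if not failures:
            return draft
    return None  # run marked failed; nothing reaches the user
```

The design point is the return type: the caller gets a validated draft or nothing at all, so there is no code path on which an unverified draft is displayed.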
A more technical specification of these stages, including the data structures and prompts, is published as the engineer-grade framework specification. The piece I want to draw out here is the architectural shape, because the shape is what produces the properties.
The deterministic fact layer
The single most consequential decision in the architecture is that numbers and dates never pass through the language model. A W-2's box 1 figure is extracted at ingestion as structured data. When a user asks "what was the respondent's stated 2024 W-2 income," the system runs a SQL query against that structured store, returns the value, and logs the computation. The model is told the answer; it is not asked to find it. The same is true for date arithmetic, for sums of bank deposits, for the difference between two declared incomes. Math is the responsibility of Python. Citation-matching is the responsibility of a byte-level string check. The neural layer does not get a vote on either.
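As a rough sketch of what "the model is told the answer" can look like, the lookup below uses Python's standard sqlite3 module. Everything specific here is an assumption for illustration: the table name extracted_fields, its columns, the document id, and the dollar figure are invented, and the real schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per extracted form field, amounts in cents.
conn.execute(
    "CREATE TABLE extracted_fields ("
    " doc_id TEXT, form TEXT, tax_year INTEGER, party TEXT,"
    " field TEXT, value_cents INTEGER, page INTEGER)"
)
# Made-up sample row standing in for what ingestion would write.
conn.execute(
    "INSERT INTO extracted_fields VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("doc-017", "W-2", 2024, "respondent", "box_1", 8_450_000, 1),
)


def stated_w2_income(conn, party, year):
    """Return (total_cents or None, log) for a party's W-2 box 1 figures."""
    sql = (
        "SELECT doc_id, page, value_cents FROM extracted_fields "
        "WHERE form = 'W-2' AND field = 'box_1' "
        "AND party = ? AND tax_year = ?"
    )
    rows = conn.execute(sql, (party, year)).fetchall()
    # The log keeps the query, its parameters, and the source rows,
    # so the figure can be reproduced without involving the model.
    log = {"sql": sql, "params": (party, year), "rows": rows}
    # Several W-2s in one year (two employers) are summed; no rows is a gap.
    total = sum(r[2] for r in rows) if rows else None
    return total, log


total_cents, audit_log = stated_w2_income(conn, "respondent", 2024)
```

Returning None for an empty result, rather than zero, keeps "no W-2 in the corpus" distinct from "a W-2 reporting zero income."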
This is the deterministic fact layer. It is how a system avoids the most embarrassing class of legal-output errors, the kind where a memo's totals are off by four thousand dollars and the entire credibility of the document collapses on the second page.
In shipped code today, the deterministic fact layer covers the financial-audit workflow. When a user commissions an income reconciliation or a deposit-totaling run, the numbers come from a Python sum over a deterministic search of the indexed text (find_quotes followed by regex extraction), not from the language model reading the PDF. Full typed-row population of core.extracted_tables, where every figure on every form lives as a structured database row at ingestion and is queryable directly by SQL, is the next milestone (Sprint AM scope). The architectural commitment is the same; the surface area grows as more document types get typed extractors. The honest framing is this: in financial workflows, the model does not read the numbers today; in every other workflow, it will stop reading them after Sprint AM.
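The search-then-regex-then-sum path can be approximated as follows. This is a sketch under assumptions, not the shipped code: the find_quotes below is a simplified stand-in with a guessed signature (the real function is not shown in this article), the dollar-amount pattern is one plausible choice, and the statement lines are invented sample data.

```python
import re
from decimal import Decimal
from typing import Dict, List, Tuple

# A dollar amount such as $1,200.00 or $350; commas are stripped later.
AMOUNT = re.compile(r"\$([\d,]*\d(?:\.\d{2})?)")


def find_quotes(pages: Dict[str, str], needle: str) -> List[Tuple[str, str]]:
    """Stand-in search: (page_id, line) for each line containing needle."""
    needle = needle.lower()
    return [
        (page_id, line)
        for page_id, text in pages.items()
        for line in text.splitlines()
        if needle in line.lower()
    ]


def total_deposits(pages: Dict[str, str]):
    """Sum the deposit lines and return (total, the rows that produced it)."""
    rows = []
    for page_id, line in find_quotes(pages, "deposit"):
        match = AMOUNT.search(line)
        if match:
            amount = Decimal(match.group(1).replace(",", ""))
            rows.append((page_id, line, amount))
    # Decimal avoids binary floating-point rounding on currency.
    total = sum((amount for _, _, amount in rows), Decimal("0"))
    return total, rows


# Invented sample pages for illustration only.
pages = {
    "stmt-01": "01/05 DEPOSIT $1,200.00\n01/09 ATM WITHDRAWAL $80.00\n01/19 Deposit $350.50",
    "stmt-02": "02/03 DEPOSIT $2,000.00",
}
total, rows = total_deposits(pages)
```

Because the function returns the matched lines alongside the total, a reviewer can re-add the figures by hand and see exactly which lines were counted and which were not.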
The symbolic-neural split
The same principle generalizes. The architecture treats the language model as a semantic parser, not a reasoner. It is good at turning "find the contradictions in his income reporting" into a structured plan, and at turning a reconciled evidence map into clean prose. It is bad at executing logic, doing math, and checking its own citations. So those jobs go to symbolic engines: SQL for numerical retrieval, deterministic search for textual retrieval, a constraint-checker for logical consistency, a byte-level validator for citation fidelity.
The dividing line is the trust boundary. Above it: probabilistic, fluent, useful for shaping language. Below it: deterministic, slow, useful for being right. Every claim that ends up in a CasePilot output has crossed from the first region into the second.
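One way to picture the split is as a routing table that pairs each leaf question with a symbolic engine and has no entry for the model. The sketch below is illustrative only: Kind, Node, ENGINES, and the engine names are hypothetical, and the two question types are the ones named in this article.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterator, List, Optional, Tuple


class Kind(Enum):
    QUOTATION = "quotation"   # answered with verbatim text
    NUMERICAL = "numerical"   # answered with a computed value


# Hypothetical engine names; the point is that none of them is the model.
ENGINES = {
    Kind.QUOTATION: "deterministic_search",
    Kind.NUMERICAL: "sql_then_python",
}


@dataclass
class Node:
    question: str
    kind: Optional[Kind] = None                 # set on leaves only
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> Iterator[Node]:
    """Yield the leaf questions of the tree, left to right."""
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


def plan(root: Node) -> List[Tuple[str, str]]:
    """Pair every leaf question with the engine that will answer it."""
    routed = []
    for leaf in leaves(root):
        if leaf.kind not in ENGINES:
            # No silent fallback to the model: an unroutable leaf is an error.
            raise ValueError(f"no symbolic engine for: {leaf.question!r}")
        routed.append((leaf.question, ENGINES[leaf.kind]))
    return routed


tree = Node("Find contradictions in his income reporting", children=[
    Node("What income did he state in his declaration?", Kind.QUOTATION),
    Node("What do his 2024 bank deposits total?", Kind.NUMERICAL),
])
routes = plan(tree)
```

Raising on an untyped leaf, instead of defaulting to a model call, is the code-level form of the trust boundary: a question either has a deterministic route or it does not run.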
Investigation, not chat
The last architectural choice worth naming is that CasePilot is not a question-answering tool. It is an investigation tool. A user does not ask a question and get a paragraph. They commission a finding. The system plans the investigation, gathers the evidence, reconciles what it finds, and produces a structured artifact: a memo, a contradictions report, a financial audit, a timeline. Each of those artifacts is built to be reviewed by an attorney, with every assertion clickable and every number reproducible.
This matters because a chat interface is, structurally, the wrong shape for legal work. A case is a theory of the facts, supported by evidence, with explicit acknowledgment of what is unknown. The artifact CasePilot produces matches the shape of the work.
What this looks like for the user
Every claim in a CasePilot draft is clickable. One tap opens the source page with the relevant quote highlighted. An attorney reviewing a fifteen-page memo can spot-check any sentence and either accept it or know exactly which document to verify against.
Every number has a logged Python computation. The total of bank deposits for a given year is not a figure a model produced; it is a query result, with the source rows attached.
When the system does not know something, it says so. A [GAP: prior-year W-2 not in corpus] marker is more useful than a confident sentence that papers over the absence.
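A gap can only be reported if "missing" is a value the system is able to produce. The sketch below shows one way the four reconciliation labels and the gap marker could be derived; the Evidence type, the label rules, and the render format for non-gap lines are assumptions for illustration, and real reconciliation would need value normalization that is omitted here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Evidence:
    source_id: str  # hypothetical document identifier
    value: str      # the answer this source gives, already normalized


def label(evidence: List[Evidence]) -> str:
    """Classify the evidence gathered for one sub-question."""
    if not evidence:
        return "missing"
    if len({e.value for e in evidence}) > 1:
        return "contradicted"      # sources disagree on the answer
    if len({e.source_id for e in evidence}) == 1:
        return "single-sourced"    # one document, nothing to cross-check
    return "corroborated"          # independent sources agree


def render(question: str, evidence: List[Evidence]) -> str:
    """Produce the line a draft would carry for this sub-question."""
    status = label(evidence)
    if status == "missing":
        return f"[GAP: {question} not in corpus]"
    sources = ", ".join(sorted({e.source_id for e in evidence}))
    return f"{question}: {status} ({sources})"
```

For example, render("prior-year W-2", []) yields the marker shown above, because an empty evidence list has its own label instead of being skipped.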
A draft that cannot be supported by the evidence is never shown. The validator can fail the run. This is the property that makes the output trustworthy: the alternative to a defensible memo is no memo, not a flawed memo dressed up to look defensible.
Every export is marked [DRAFT — ATTORNEY REVIEW REQUIRED]. The system does not replace the lawyer. It produces the kind of artifact a careful paralegal would produce, in a fraction of the time, with citations the lawyer can audit.
Why this becomes table stakes
Right now the legal-AI market is split between research giants chasing Am Law 100 firms at twelve hundred dollars a seat, and a long tail of chat tools selling ChatGPT with a legal coat of paint. Both categories rely, in the end, on the language model being right. Both categories will keep producing hallucinated citations and arithmetic errors at industrial rates.
Courts have already sanctioned attorneys for filing briefs built on a chat tool's fabricated citations; Mata v. Avianca in 2023 is the best-known example. What the market has not yet absorbed is the consequence. "The model said so" is not a defense, and as these incidents accumulate, the standard for what a legal-AI tool must produce will move sharply toward what CasePilot already produces: traceable citations, reproducible numbers, auditable refusals to fabricate. Chat-based legal AI is a transitional artifact. The reasoning-framework architecture is what comes after.
A note on why this exists
CasePilot started because I lived inside a family-law case for a year and watched off-the-shelf AI tools confidently produce documents I could not safely use. The issue was the architecture, not the models or the prompts. Building the right architecture turned out to be a real engineering project, with real tradeoffs, and the only honest way to talk about it is to show the tradeoffs.
The promise CasePilot makes is narrower than perfection. I will not invent citations. I will not do hidden arithmetic. I will not ship a claim I cannot trace to a source. I will tell you what I don't know. That is a smaller promise than "trust the AI," and it is the only promise a legal tool has any business making.
Assaf Mevorach, Founder, CasePilot · Portland, Oregon · 2026