Mortgage document OCR and LLM extraction — zero-touch data entry, calibrated to never guess

The work nobody wants but everybody pays for

If you run mortgage operations, you already know where the hours go. Payslips, bank statements, tax notices — every application is a stack of PDFs and phone-camera scans, and someone is sitting there re-typing the numbers into your system one field at a time. Throughput is capped by headcount: hire faster than you can train and quality slips; train carefully and the queue backs up.

The cruel part is which fields get fudged. The numbers that drive underwriting and compliance — income, balances, the figures a regulator might one day ask you to defend — are exactly the ones easiest to mis-key. A transposed digit in a salary field doesn’t announce itself; it flows straight into the credit decision, and unwinding it later costs far more than the minute it took to type wrong. So the brief looked simple and wasn’t: take the data entry off people entirely, but never at the price of a wrong number reaching a credit file. The fields had to be right, and when something went into the system you had to be able to say why.

The constraint that shaped everything

We took the asymmetry as the design axis, because in this domain the two ways to be wrong aren’t equal. A field the system declines to fill costs nothing but a quick automated message back to the borrower; a field it fills wrong costs you a re-opened file, a compliance question, and the trust of whoever signed off on it. One is a round-trip, the other is the thing that gets you audited.

That ruled out the obvious approach — run OCR, drop the output into the form, let a reviewer catch the mistakes. It looks like automation but it just moves the proofreading downstream, and proofreading is the part people are worst at: a reviewer staring at a pre-filled form anchors on the value already there, and the ones they miss are exactly the plausible-looking wrong reads, which are also the most dangerous. We built for the opposite — the system fills a field only when it can back it up, and when it can’t, it doesn’t park the gap for a human, it closes the gap itself. That asymmetry is wired into the metric the system is tuned against, so every threshold and tie-break resolves the same way: when in doubt, ask, don’t write.

Extraction and decision are separate stages

The first structural decision was to keep extraction and decision apart, because fusing them is how confident wrong answers get made. OCR and an LLM produce candidate values; they don’t write anything into the system. Each candidate is tagged with the method that produced it — which engine or pass found it — and a confidence signal, and a field can carry several candidates from several methods, kept side by side rather than collapsed early into one “answer.” That’s what makes the later checks possible: if you reduce three reads to a single value at extraction time you’ve thrown away the disagreement, and the disagreement is the most useful thing you have — two methods saying one thing and a third saying another isn’t noise to average out, it’s a flag that the field needs confirmation. So extraction surfaces labelled evidence, and a separate gate is the only thing allowed to turn evidence into a written field. The split also paid off operationally: we could change the extraction side without touching the rule that decides what’s safe to write, keeping the decision logic small and testable on its own — which matters for a component whose whole job is to be conservative.
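One way to picture the evidence the extraction stage hands over is sketched below. It is a minimal illustration, not the production schema: the `Candidate` type, the `group_by_field` helper, the method labels and the 0-to-1 confidence scale are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One extracted read of one field. Evidence only; nothing is written here."""
    field: str         # e.g. "gross_monthly_income"
    raw_value: str     # the value exactly as read, before any normalization
    method: str        # which engine or pass produced it (hypothetical labels)
    confidence: float  # assumed to be a score in [0, 1]


def group_by_field(candidates):
    """Keep every read side by side, keyed by field, without picking a winner."""
    grouped = defaultdict(list)
    for cand in candidates:
        grouped[cand.field].append(cand)
    return dict(grouped)


reads = [
    Candidate("gross_monthly_income", "$4,500.00", "ocr_table", 0.97),
    Candidate("gross_monthly_income", "4500", "llm_vision", 0.95),
    Candidate("gross_monthly_income", "4800", "llm_text", 0.61),
]
evidence = group_by_field(reads)
# All three reads survive, including the one that disagrees.
```

Nothing in this stage chooses between 4500 and 4800. The dissenting read is kept so the gate can see it.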

The decision gate: two agreeing sources, normalized, high confidence

The gate asks two questions per field: do independent sources agree, and is confidence high enough to act. A field is written automatically only when at least two independent extractions agree and every signal backing it is high. Anything short of that is not written — it’s routed to the borrower loop instead. The system never guesses to fill a blank, and it never promotes a single uncorroborated read into a credit file.

The word doing the work is independent. Two passes of the same model over the same image are not two sources; they fail together and agree for the wrong reasons. Real corroboration comes from genuinely different paths to the same fact — a different extraction method, a different region of the document, a structured field cross-checked against a figure read from prose. Counting non-independent reads as agreement manufactures false confidence that looks exactly like the real thing, which is the subtle failure we designed against.
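A minimal sketch of such a gate follows, under stated assumptions: values are already canonical, each read carries a `path` label naming its route to the fact, any dissenting read sends the field to the borrower, and 0.90 counts as "high". The names `Read`, `decide` and `HIGH_CONFIDENCE` are hypothetical.

```python
from dataclasses import dataclass

HIGH_CONFIDENCE = 0.90  # hypothetical threshold; the real bar is tuned on the eval set


@dataclass(frozen=True)
class Read:
    value: str         # assumed to be normalized already
    path: str          # the route to the fact: method plus document region
    confidence: float  # assumed to be a score in [0, 1]


def decide(reads):
    """Return ("write", value) or ("ask_borrower", None). Never guesses."""
    if not reads:
        return ("ask_borrower", None)
    # Any disagreement is treated as a flag, not averaged away.
    if len({r.value for r in reads}) != 1:
        return ("ask_borrower", None)
    # Every signal backing the value has to be high.
    if any(r.confidence < HIGH_CONFIDENCE for r in reads):
        return ("ask_borrower", None)
    # Two reads along the same path count as one source.
    if len({r.path for r in reads}) < 2:
        return ("ask_borrower", None)
    return ("write", reads[0].value)


two_paths = [Read("4500.00", "payslip_table", 0.97),
             Read("4500.00", "bank_statement_credit", 0.95)]
same_path = [Read("4500.00", "payslip_table", 0.97),
             Read("4500.00", "payslip_table", 0.96)]
dissent = two_paths + [Read("4800.00", "payslip_prose", 0.93)]
```

Here `decide(two_paths)` writes the field, while `decide(same_path)` and `decide(dissent)` both route to the borrower: the first because two passes over one path are a single source, the second because a third read disagrees.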

Comparison only means something after the values are canonicalized. Dates, currency and names are each normalized by type before anything is compared, so “agreement” means the two sources actually say the same thing rather than happening to match, or fail to match, as raw strings. A date read as 01/02/2026 (day-first) and 2026-02-01 is agreement once both reduce to a canonical date, and disagreement only if the underlying day genuinely differs; currency reduces to a normalized numeric so $4,500.00 and 4500 collapse to one value; names fold for case, spacing and ordering so a payslip header and a bank statement don’t look like a conflict over formatting. Skip this step and raw strings fail in both directions: formatting noise produces false disagreement, which sends the borrower questions nobody needed to ask, and two strings that match character for character but mean different things (01/02/2026 written day-first on one document and month-first on another) produce false agreement. The second is worse than no agreement at all, because a real mismatch passes the gate wearing the costume of a match.
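The per-type canonicalization can be sketched roughly as follows. This is an illustration under explicit assumptions: slash dates are day-first, amounts use a dot as the decimal separator and commas for thousands, and names are compared as an order-insensitive set of tokens. The function names are hypothetical.

```python
import re
from datetime import date, datetime
from decimal import Decimal

# Assumption: ISO dates, plus slash dates written day-first.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_date(raw: str) -> date:
    """Reduce a date string to a calendar day, or raise if no format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def normalize_currency(raw: str) -> Decimal:
    """Strip symbols and thousands separators, keep two decimal places.

    Assumes '.' is the decimal separator; other locales need their own rule.
    """
    cleaned = re.sub(r"[^\d.\-]", "", raw)
    return Decimal(cleaned).quantize(Decimal("0.01"))


def normalize_name(raw: str) -> tuple:
    """Fold case, punctuation, spacing and token order."""
    tokens = re.sub(r"[^\w\s]", " ", raw.casefold()).split()
    return tuple(sorted(tokens))


same_day = normalize_date("01/02/2026") == normalize_date("2026-02-01")
same_amount = normalize_currency("$4,500.00") == normalize_currency("4500")
same_person = normalize_name("DOE, Jane") == normalize_name("Jane  Doe")
```

Raising on an unrecognized date, rather than returning a best guess, keeps the normalizer consistent with the rest of the design: a value that cannot be canonicalized cannot agree with anything, so it falls through to the borrower.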

When it can’t be sure, it asks the borrower — not a staffer

Anything that doesn’t clear the gate — fewer than two agreeing high-confidence reads, or a value that’s simply not in the documents at all — triggers the loop that replaces the review queue. Instead of parking the field for someone to chase, the system generates a targeted request to the borrower: confirm this one value, or upload the one document still missing. The request is scoped to the specific gap, not a generic “your application is incomplete,” so the borrower is answering a precise question about their own data rather than re-reading a checklist.

There are two distinct triggers, and they generate different asks. A field that’s present but uncertain (read, but not corroborated to the bar) produces a confirmation request: here’s what we read, is this right. A field with no source at all, because the whole document is missing, produces a document request: we still need your most recent payslip. Treating these as one thing produces bad messages; the borrower who already sent the payslip shouldn’t be told to send it again because one figure on it read soft.

When the reply comes back the pipeline completes the field and carries on. A confirmed value clears the gate because the borrower is, by definition, authoritative for their own data, and a freshly uploaded document re-enters extraction and runs the same gate as everything else. Uncertainty therefore resolves itself without a staffer in the loop: the only person who ever touches an uncertain field is the borrower confirming their own number. That’s the line that lets us say zero re-keying and mean it. There’s no review queue downstream of the gate, because the gate’s overflow goes to the customer, not to a desk.
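The two triggers and their different asks might be modelled like this. It is a sketch only: the `UncertainRead` and `MissingDocument` types, the `build_request` function and the message wording are hypothetical, chosen to mirror the two cases described above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class UncertainRead:
    """The document arrived, but the value did not clear the gate."""
    field_label: str
    read_value: str
    document_type: str


@dataclass(frozen=True)
class MissingDocument:
    """There is no source for the field at all."""
    document_type: str


Gap = Union[UncertainRead, MissingDocument]


def build_request(gap: Gap) -> dict:
    """Turn one specific gap into one specific question for the borrower."""
    if isinstance(gap, UncertainRead):
        return {
            "kind": "confirm_value",
            "message": (f"We read your {gap.field_label} as {gap.read_value} "
                        f"on your {gap.document_type}. Is this right?"),
        }
    if isinstance(gap, MissingDocument):
        return {
            "kind": "upload_document",
            "message": f"We still need your most recent {gap.document_type}.",
        }
    raise TypeError(f"unknown gap type: {type(gap).__name__}")


confirm = build_request(UncertainRead("gross monthly income", "4,500.00", "payslip"))
upload = build_request(MissingDocument("payslip"))
```

Keeping the two gap types separate in the data model is what stops a soft read from turning into a request for a document the borrower has already sent.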

Provenance travels with every field

An inferred value is never shown as fact. Every field carries where it came from — which method read it, what corroborated it, whether it was confirmed by the borrower — so any number can be traced back to the document and the path that produced it. We treated “an inferred value presented as a known fact” as a defect class in its own right, because in lending that’s exactly the quiet error that survives until an audit. This isn’t paperwork; it’s what makes the asymmetry enforceable after the fact. When someone asks why a field holds the value it does, the answer is in the record: two methods agreed, here’s the normalized value they agreed on — or the borrower confirmed it on this date. It also means that when a model or prompt changes, we can re-run the gate over historical inputs and see exactly which fields would have decided differently, turning “did this change break anything” from a guess into a query.
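As a rough sketch of what "the answer is in the record" could look like, the example below stores the gate's inputs next to its decision and replays a changed gate over them. The `FieldRecord` layout, the `changed_decisions` helper, the stricter 0.95 threshold and the sample records are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldRecord:
    field_id: str
    value: Optional[str]                  # None when nothing was auto-written
    decision: str                         # "write" or "ask_borrower"
    methods: tuple                        # methods that corroborated the value
    borrower_confirmed_on: Optional[str]  # ISO date, or None
    inputs: tuple                         # reads the gate saw: (value, method, confidence)


def changed_decisions(records, gate):
    """Replay a gate over stored inputs and list the fields that flip."""
    changed = []
    for rec in records:
        new_decision = gate(rec.inputs)
        if new_decision != rec.decision:
            changed.append((rec.field_id, rec.decision, new_decision))
    return changed


def stricter_gate(inputs):
    """Hypothetical candidate gate: same rule, higher confidence bar."""
    values = {v for v, _, _ in inputs}
    methods = {m for _, m, _ in inputs}
    all_high = all(c >= 0.95 for _, _, c in inputs)
    ok = len(values) == 1 and len(methods) >= 2 and all_high
    return "write" if ok else "ask_borrower"


history = [
    FieldRecord("app-1/income", "4500.00", "write", ("ocr_table", "llm_vision"), None,
                (("4500.00", "ocr_table", 0.97), ("4500.00", "llm_vision", 0.96))),
    FieldRecord("app-2/income", "3900.00", "write", ("ocr_table", "llm_vision"), None,
                (("3900.00", "ocr_table", 0.92), ("3900.00", "llm_vision", 0.91))),
]
changed = changed_decisions(history, stricter_gate)
```

With these made-up records, only `app-2/income` flips from a write to a borrower question, which is the kind of per-field answer the replay is meant to give.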

Precision is gated in CI

Extraction is measured against a labelled evaluation set — real documents with known-correct outputs — and the metric that gates the build is precision on the auto-filled slice, not raw accuracy across everything. Accuracy rewards a system for getting easy fields right; it says nothing about whether the fields it chose to act on were safe to act on. Precision on the gated slice measures the only thing that matters here: of the values the system wrote without asking, how many were correct. We hold that line deliberately, for the same asymmetry reason that drives the rest of the design — better to ask the borrower one more question than to write one wrong number.
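The difference between the two metrics is easy to show on a toy eval set. The sketch below is illustrative only: `EvalRow`, `gated_precision`, `ci_exit_code` and the 0.999 floor are assumed names and values, and the rows are invented.

```python
from dataclasses import dataclass
from typing import Optional

PRECISION_FLOOR = 0.999  # hypothetical bar for the auto-filled slice


@dataclass(frozen=True)
class EvalRow:
    field_id: str
    expected: str            # the labelled, known-correct value
    written: Optional[str]   # None means the pipeline asked instead of writing


def gated_precision(rows):
    """Of the values written without asking, the fraction that were correct."""
    written = [r for r in rows if r.written is not None]
    if not written:
        return None  # nothing was auto-filled, so precision is undefined
    correct = sum(1 for r in written if r.written == r.expected)
    return correct / len(written)


def ci_exit_code(rows, floor=PRECISION_FLOOR):
    """Return 0 to pass the build, 1 to fail it."""
    precision = gated_precision(rows)
    if precision is None:
        # Assumption: no writes means no wrong writes, so this check passes.
        return 0
    return 0 if precision >= floor else 1


cautious = [
    EvalRow("income", "4500.00", "4500.00"),
    EvalRow("balance", "1200.50", "1200.50"),
    EvalRow("employer", "acme ltd", "acme ltd"),
    EvalRow("start_date", "2026-02-01", None),  # asked the borrower
]
reckless = [
    EvalRow("income", "4500.00", "4500.00"),
    EvalRow("balance", "1200.50", "1200.50"),
    EvalRow("start_date", "2026-02-01", "2026-01-02"),  # wrote a wrong value
]
```

The cautious run fills three of four fields and scores 1.0 on the gated slice, so it passes. The reckless run fills all three of its fields, gets one wrong, scores two thirds and fails, even though it asked fewer questions.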

The eval set is wired in as a regression gate, not a one-time benchmark. A change to a model, a prompt, or the parsing layer triggers the full run, and a commit that drops precision on the gated slice below the line fails and does not ship. Prompt changes are code changes and break things the same way; teams that miss those breakages usually miss them because they don’t measure. We built the baseline before turning on automation: a held-out set of real cases scored field by field, with the auto-fill slice required to be clean before anything went live. That baseline is also the contract for every future change: you don’t get to lower it quietly. The same eval also adjudicates model selection. Some reads need a multimodal model that sees the page image, while plainer structured passes run fine on a cheaper text-only one, so measured precision on the gated slice, not taste, decides which model earns which job, instead of defaulting to the largest model for every page and paying for it.
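Model selection by measurement can be reduced to a very small rule: among the models that clear the bar for a given job, take the cheapest. The sketch below assumes that rule. The model names, costs and precision figures are invented for illustration and are not results from the system described here.

```python
def pick_model(measurements, floor):
    """Pick the cheapest model whose gated precision clears the floor.

    measurements: list of (model_name, cost_per_page, gated_precision).
    Returns the model name, or None when no model is good enough.
    """
    eligible = [m for m in measurements if m[2] >= floor]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m[1])[0]


# Made-up numbers, per job type: (name, cost per page, precision on gated slice).
MEASURED = {
    "structured_table": [("text-small", 0.002, 0.999), ("vision-large", 0.020, 0.999)],
    "scanned_page": [("text-small", 0.002, 0.950), ("vision-large", 0.020, 0.998)],
}

assignment = {job: pick_model(rows, floor=0.995) for job, rows in MEASURED.items()}
```

With these sample figures the text-only model earns the structured pass and the multimodal one earns the scanned page. Returning `None` when nothing clears the floor keeps the failure visible instead of silently falling back to a weaker model.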

PII and access stay locked down

Captured documents are full of PII — names, account numbers, income — so they never leave the processing boundary and stay out of version control by policy; the fixtures we test against are scrubbed or gitignored so real borrower data is never committed alongside the code. That shaped the test setup as much as the pipeline: the eval baseline has to prove precision on realistic documents without those documents becoming a second copy of sensitive data sitting in a repo.

Field-level access is designed against the failure OWASP calls Broken Access Control: a field is shown only to a request that owns it, and that ownership is enforced server-side, not in the UI. Hiding a value in the front end is not access control; a field tied to one borrower’s application has to be unreadable by a request scoped to another, checked on the server on every read. The automated pipeline gets no exemption from this. An autonomous component that writes and reads fields is still a privileged client that must pass the same checks as any user, because “the system is confident” is not the same question as “the system is allowed.”
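A server-side ownership check of this kind can be sketched in a few lines. The example is a simplification under assumptions: an in-memory dict stands in for the field store, and `Principal`, `AccessDenied` and `read_field` are hypothetical names. The pipeline is modelled as just another principal, scoped to the application it is processing.

```python
from dataclasses import dataclass


class AccessDenied(Exception):
    """Raised when a caller is not scoped to the application it asks about."""


@dataclass(frozen=True)
class Principal:
    principal_id: str
    application_ids: frozenset  # applications this caller may touch


def read_field(principal, application_id, field_name, store):
    """Check ownership on every read, before touching the store."""
    if application_id not in principal.application_ids:
        raise AccessDenied(f"{principal.principal_id} is not scoped to {application_id}")
    return store[(application_id, field_name)]


# Stand-in for the real field store.
STORE = {
    ("app-1", "gross_monthly_income"): "4500.00",
    ("app-2", "gross_monthly_income"): "3900.00",
}

borrower_one = Principal("borrower-1", frozenset({"app-1"}))
pipeline_job = Principal("pipeline-job-77", frozenset({"app-2"}))

own_value = read_field(borrower_one, "app-1", "gross_monthly_income", STORE)
pipeline_value = read_field(pipeline_job, "app-2", "gross_monthly_income", STORE)
```

A call such as `read_field(pipeline_job, "app-1", ...)` raises `AccessDenied`: the pipeline's confidence in a value plays no part in whether it is allowed to read it.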

The result

The repetitive entry is gone — not moved to a reviewer, gone. The fields the system auto-fills are the ones it can defend: two agreeing independent sources, normalized so the agreement is real, high confidence, provenance attached. Everything else gets confirmed by the borrower automatically, so a missing payslip or a soft read resolves itself without a staffer touching it. Precision on the auto-filled slice is held by a gate in CI, so the bar can’t erode quietly across model and prompt changes. Nothing lands in a file that the pipeline couldn’t back up, which is the part that lets operations sleep at night — and the part that holds up when an auditor asks where a number came from.