Technical

From upload to answer

The Donna team

20 May 2026 · 7 min read

The moment a contract lands in a Space, a race starts: how fast can that PDF become something you can ask questions of, safely? Here is the pipeline that runs in that minute, and the two design decisions that make it boring, which for infrastructure is the highest compliment available.

Five stages, every document

[Figure: five stages in sequence. Scan (malware gate), Read (text + layout), Chunk (passages that remember), Embed (meaning as vectors), Index (searchable matter). A durable checkpoint follows every stage. Malicious: blocked, file destroyed. Transient hiccup: resumes from the last checkpoint, silently.]
The ingestion pipeline
Scan, read, chunk, embed, index. A durable checkpoint lands after every stage, so failures resume instead of restarting, and a malicious file never gets past the gate.

Scan. Nothing touches the matter until it clears a malware gate. A file that fails is not quarantined for an admin to poke at later: it is blocked and destroyed, its hash is remembered so the same payload cannot be retried, and the event is audited. The uploader finds out immediately.
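As a rough illustration of "its hash remembered", here is a minimal sketch of a gate that refuses known-bad content by digest. Everything in it is an assumption for illustration: the `ScanGate` name, the use of SHA-256, the in-memory set and list, and the `is_malicious` callback standing in for a real scanner. It is not Donna's implementation.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ScanGate:
    """Hypothetical malware gate; names and storage are illustrative only."""
    blocked_hashes: set = field(default_factory=set)
    audit_log: list = field(default_factory=list)

    def admit(self, data: bytes, is_malicious) -> bool:
        """Return True if the file may proceed to the Read stage."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.blocked_hashes:
            # Known-bad payload: reject it without scanning again.
            self.audit_log.append({"sha256": digest, "event": "retry_blocked"})
            return False
        if is_malicious(data):
            # Remember the hash so the same bytes cannot be retried.
            self.blocked_hashes.add(digest)
            self.audit_log.append({"sha256": digest, "event": "malware_blocked"})
            return False
        return True
```

Keying the blocklist on a content digest rather than a filename means renaming the file does not get it a second attempt.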

Read. Legal documents arrive as beautiful typography and terrible data: scans, faxes of scans, tables that were once tabs. This stage recovers text and layout together, because layout is meaning in legal drafting. Crucially, we keep the geometry: where every token sits on every page. That geometry is what later lets a citation highlight the exact region it relies on.
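To make the geometry point concrete, here is a small sketch of how per-token positions could be turned into a highlight region. The `Token` fields and the one-box-per-page simplification are assumptions; a real highlighter would likely draw one box per line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token record: text plus its box on a page."""
    text: str
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def highlight_regions(tokens):
    """Return one bounding box per page that encloses the given tokens."""
    regions = {}
    for t in tokens:
        if t.page not in regions:
            regions[t.page] = (t.x0, t.y0, t.x1, t.y1)
        else:
            x0, y0, x1, y1 = regions[t.page]
            # Grow the box just enough to include this token.
            regions[t.page] = (min(x0, t.x0), min(y0, t.y0),
                               max(x1, t.x1), max(y1, t.y1))
    return regions
```

If the tokens behind a cited passage carry their coordinates, the highlight is a cheap computation at answer time rather than a second pass over the PDF.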

Chunk. Documents are split into passages sized for retrieval, and each passage remembers its provenance: the document, the version, the page, the position. A passage is never just text; it is text with a return address.
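A passage "with a return address" might look like the sketch below. The field names, the 800-character default and the whitespace-based splitting are all assumptions chosen for brevity; the article does not say how passages are actually sized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    """Hypothetical passage record: text plus its provenance."""
    text: str
    document_id: str
    version: int
    page: int
    start: int  # character offset of the passage within the page text


def chunk_page(page_text, document_id, version, page, max_chars=800):
    """Split one page into passages of at most max_chars, breaking on spaces."""
    passages = []
    start, n = 0, len(page_text)
    while start < n:
        # Skip whitespace so `start` points at the first real character.
        while start < n and page_text[start].isspace():
            start += 1
        if start >= n:
            break
        end = min(start + max_chars, n)
        if end < n:
            # Prefer to break at the last space so words stay whole.
            space = page_text.rfind(" ", start, end)
            if space > start:
                end = space
        text = page_text[start:end].rstrip()
        passages.append(Passage(text, document_id, version, page, start))
        start = end
    return passages
```

Because each `Passage` records where it came from, anything retrieved later can be traced back to a specific version, page and offset.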

Embed and index. Passages get vector representations for semantic search and land in the matter’s index, where semantic search sits alongside keyword search. From this point the document is askable: retrieval serves both the answer and the citation.
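The article does not say how semantic and keyword results are combined. One common way to merge two ranked lists is reciprocal rank fusion, sketched here purely as an illustration of the idea; the function name, the passage ids and the constant `k=60` are assumptions, not a description of Donna's retrieval.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of passage ids into one ranked list.

    Each list contributes 1 / (k + rank) per passage; higher totals rank first.
    """
    scores = {}
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] = scores.get(passage_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, then by id so ties are deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))
```

A passage that both searches rank well rises above one that only a single search found, which is the practical benefit of keeping both kinds of search over the same passages.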

Decision one: checkpoint everything

Pipelines fail. A model endpoint has a bad minute, a huge brief times out. The question is what failure costs. Every stage in Donna’s pipeline commits a durable checkpoint, so a failure resumes from the last completed stage rather than restarting a ninety-page document from zero. Transient failures are not even surfaced to the user: the document simply arrives a little later, because the system retried and resumed on its own. Only failures that need a human decision, such as a corrupt file or a password-protected PDF, become visible.

Resume, not restart
A transient failure costs one stage, not one document. Users mostly never learn it happened.
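Resume-from-checkpoint can be sketched in a few lines. This is a simplified illustration under stated assumptions: a plain dict stands in for durable storage, `TransientError` and `run_pipeline` are hypothetical names, and retries happen immediately rather than with backoff.

```python
STAGES = ["scan", "read", "chunk", "embed", "index"]


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def run_pipeline(doc_id, handlers, checkpoints, max_attempts=3):
    """Run the remaining stages for doc_id, resuming after the last checkpoint.

    `checkpoints` maps doc_id to the number of completed stages.
    """
    done = checkpoints.get(doc_id, 0)
    for i in range(done, len(STAGES)):
        stage = STAGES[i]
        for attempt in range(1, max_attempts + 1):
            try:
                handlers[stage](doc_id)
                break
            except TransientError:
                if attempt == max_attempts:
                    raise  # give up for now; the checkpoint is kept
        # Commit progress only after the stage has fully succeeded.
        checkpoints[doc_id] = i + 1
    return checkpoints.get(doc_id, done)
```

Calling `run_pipeline` again after a failure skips every stage that already committed, so the cost of a fault is bounded by the stage it interrupted.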

Decision two: classify failure honestly

A single “processing failed” state is a design shrug. Donna classifies failure by what should happen next: malicious files are terminal and audited, invalid files say what is wrong with them, documents paused on quota continue after a top-up, and transient faults retry themselves. Each class gets a different behaviour and different words, because “try again” is bad advice for three of the four.
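The four classes map naturally onto an enum with a next step and a message per class. The sketch below is illustrative only: the action labels and the user-facing wording are invented here, not taken from the product.

```python
from enum import Enum


class FailureClass(Enum):
    MALICIOUS = "malicious"
    INVALID = "invalid"
    QUOTA = "quota"
    TRANSIENT = "transient"


# Hypothetical mapping: what happens next, and what the user is told.
NEXT_STEP = {
    FailureClass.MALICIOUS: ("terminal", "This file was blocked and destroyed."),
    FailureClass.INVALID: ("needs_user", "This file cannot be processed: {reason}."),
    FailureClass.QUOTA: ("paused", "Processing will continue after a top-up."),
    FailureClass.TRANSIENT: ("auto_retry", None),  # never shown to the user
}


def describe(failure, reason=""):
    """Return (action, user_message) for a classified failure."""
    action, template = NEXT_STEP[failure]
    message = template.format(reason=reason) if template else None
    return action, message
```

Only one of the four actions is a retry, which is the argument against a single catch-all error state.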

Built for the spike

Matters do not upload documents at a polite steady rate. A disclosure arrives as a hundred files in one drag. The pipeline scales out on queue depth: burst arrives, workers multiply, backlog drains, workers retire. Duplicate uploads are detected by content, so the same contract sent by three parties is processed once and known three times.
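Both ideas in this section fit in a short sketch: a worker count derived from queue depth, and deduplication keyed on content rather than filename. The throughput figure, the worker cap, SHA-256 as the content key and the class and function names are all assumptions for illustration.

```python
import hashlib
import math


def desired_workers(queue_depth, docs_per_worker=10, min_workers=0, max_workers=50):
    """Scale the worker count with backlog, within fixed bounds."""
    needed = math.ceil(queue_depth / docs_per_worker)
    return max(min_workers, min(max_workers, needed))


class ContentIndex:
    """Hypothetical registry: process each distinct content once."""

    def __init__(self):
        self._uploads = {}  # sha256 hex digest -> list of uploader names

    def register(self, data: bytes, uploader: str) -> bool:
        """Record the upload; return True only if the content needs processing."""
        digest = hashlib.sha256(data).hexdigest()
        first_time = digest not in self._uploads
        self._uploads.setdefault(digest, []).append(uploader)
        return first_time

    def uploaders(self, data: bytes):
        """List everyone who has uploaded exactly these bytes."""
        return list(self._uploads.get(hashlib.sha256(data).hexdigest(), []))
```

With these defaults a burst of a hundred files asks for ten workers, and an empty queue asks for none. The same contract arriving from three parties is processed on the first `register` call and merely recorded on the next two.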

The result is a pipeline nobody thinks about, which is the point. Documents go in; a minute later the matter knows more. Everything Donna does downstream (answers, agents, timelines) stands on this unglamorous minute.
