Extract | Structured Data Extraction from Any Document | Reducto
Extract

Structured Data Extraction from Any Document

Extract pulls structured fields from any document using a schema you define. One call in, typed JSON out.

Helping everyone from startups to Fortune 10 enterprises unlock their data.

  • Harvey
  • Scale AI
  • Newfront
  • Medallion
  • Vanta
  • Legora
  • Rogo
  • Levelpath
  • JLL
  • Vise
  • Laurel
  • Toast
  • Mercor
  • Zip
  • Anterior
  • Supio
Extract

Define a schema, get structured JSON back

Definition
Extract returns specific fields from any document as schema-typed JSON. Define a schema, get back values matching it. Under the hood, Extract runs Parse to read the document, then uses an LLM to locate and pull the values you asked for, with optional citations on every one.
Who it's for
Teams that know what fields they need from each document and want typed, predictable output without writing per-template parsers.
The problem it solves
Off-the-shelf LLMs hallucinate fields and drift across runs. Extract grounds every value to the page it came from and constrains output to your schema, so results are consistent and auditable.
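The define-a-schema workflow can be sketched as follows. This is an illustrative payload only: the key names (`document_url`, `schema`, `generate_citations`) and the helper function are assumptions for the sake of example, not the verbatim Reducto API.

```python
# Hypothetical invoice schema for Extract. The payload shape
# (document_url, schema, generate_citations) is an illustrative
# assumption, not the verbatim API contract.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice ID printed in the header"},
        "invoice_date": {"type": "string", "description": "Issue date, ISO 8601"},
        "total_amount": {"type": "number", "description": "Grand total including tax"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                },
            },
        },
    },
    "required": ["invoice_number", "total_amount"],
}

def build_extract_payload(document_url: str, schema: dict, citations: bool = True) -> dict:
    """Assemble a request body for a POST /extract call (shape assumed)."""
    return {
        "document_url": document_url,
        "schema": schema,
        "generate_citations": citations,
    }

payload = build_extract_payload("https://example.com/invoice.pdf", invoice_schema)
```

The field descriptions do real work here: they are what guides the model to the right values, so the more specific they are, the more consistent the output.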
Extract in the platform

How Extract connects to the rest of the platform

/parse · Parse
Structured content from any document is needed for LLM or RAG use.
Structured chunks with typed blocks, bounding boxes, and confidence scores.
Returns the full document when no fixed schema is defined yet.
/extract · Extract
The fields to pull are defined and typed JSON is needed.
Schema-typed JSON with optional citations on every value.
/split · Split
One file contains multiple logical documents or sections.
Page ranges for each section, with confidence scores.
Separates sections so each maps to one schema response.
/classify · Classify
Files need to be routed by type before processing.
Best-matching category with per-criterion confidence.
Picks the right extraction schema per file.
/edit · Edit
A PDF form needs filling or a DOCX needs updating.
A downloadable edited file, plus a reusable form schema.
Writes extracted values back into a document.
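One common way these pieces connect is Classify picking the extraction schema per file. A minimal routing sketch, assuming a hypothetical classification result shape (`category`, `confidence`) and made-up schemas:

```python
# Hypothetical routing: use a /classify result to pick the schema for the
# subsequent /extract call. Category names, schemas, and the result shape
# are illustrative assumptions.
SCHEMAS = {
    "invoice": {"type": "object", "properties": {"total_amount": {"type": "number"}}},
    "contract": {"type": "object", "properties": {"effective_date": {"type": "string"}}},
}

def route_to_schema(classify_result: dict) -> dict:
    """Map the best-matching category from Classify to an Extract schema."""
    category = classify_result["category"]
    try:
        return SCHEMAS[category]
    except KeyError:
        raise ValueError(f"No extraction schema registered for {category!r}")

schema = route_to_schema({"category": "invoice", "confidence": 0.97})
```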

Try Extract on your own documents. Open it in Studio.

Where AI teams ship Extract

Extract the data you need

If your workflow ends with writing fields to a database, Extract is the step that gets them there accurately.

Invoice & AP automation

Pull header fields, taxes, and every line item into typed JSON. Citations let AP teams verify amounts quickly.

Contract & clause data

Effective date, expiration, parties, governing law, renewal terms. Define the fields once and Extract handles layout variations.

Financial statements & filings

Pull totals, holdings, and transactions from 10-Ks, brokerage statements, and fund factsheets. Deep Extract handles complex tables.

KYC, claims, and onboarding

Identity, employer, address, claim numbers, dates of loss. Citations on every value make audit straightforward.

Long arrays & transaction lists

Bank statements, ledgers, claim line items: Deep Extract verifies every field with an agentic loop so nothing is missed across long documents.

Extract across multiple files

Combine fields from several documents into a single schema response for data rooms, claim packets, and onboarding.
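A multi-file request of this kind might look like the sketch below. The docs describe passing a list of inputs or `jobid://` references; the `input` key, the job ID value, and the schema here are illustrative assumptions.

```python
# Sketch of a multi-file Extract request for a claim packet: several
# documents contribute fields to one schema response. Key names and the
# jobid value are assumptions based on the docs' description.
claim_schema = {
    "type": "object",
    "properties": {
        "claim_number": {"type": "string"},
        "date_of_loss": {"type": "string"},
        "claimant_name": {"type": "string"},
    },
}

payload = {
    "input": [
        "https://example.com/claim-form.pdf",
        "jobid://prev-parse-job",  # reuse an earlier Parse job instead of re-parsing
    ],
    "schema": claim_schema,
}
```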


Why Extract

Why teams switch to Extract

  1. Schema-typed, every time

    Output shape matches your schema. Enums normalize values, so downstream code never has to translate “Invoice” vs “INVOICE.”

  2. Citations on every value

    The citations option wraps each field with page, bbox, source text, and confidence for both the extract and parse stages.

  3. Complete extraction on long docs

    Deep Extract uses an agentic loop to verify outputs across long documents, so hundreds of line items are captured accurately.

  4. Deep Extract for complex documents

    An agent harness that extracts, verifies against the source, and re-extracts until results meet your accuracy criteria. Built for long documents with thousands of rows across hundreds of pages.

  5. Reuse parsed work via jobid://

    Try a different schema on the same doc, or merge fields across many docs, without re-parsing. Pass a job ID or a list as input.

  6. Schema or schemaless

    Ship a schema for predictable production output. Pass a natural-language prompt for prototyping.
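Two of the points above, enum normalization and per-value citations, can be made concrete in a small sketch. The citation object's keys (`page`, `bbox`, `text`, `confidence`) follow the description above but are assumptions, not a verbatim API contract, and the data is hand-written sample data.

```python
# 1. An enum field constrains output so "Invoice" / "INVOICE" / "inv"
#    all normalize to one canonical value downstream code can match on.
doc_type_field = {
    "type": "string",
    "enum": ["invoice", "receipt", "credit_note"],
    "description": "Normalized document type",
}

# 2. A cited value: every field carries where it came from. This is
#    sample data in an assumed shape, not real API output.
cited_total = {
    "value": 1249.50,
    "citations": [
        {"page": 3, "bbox": [72, 540, 310, 562], "text": "Total due: $1,249.50", "confidence": 0.98}
    ],
}

def audit_trail(field: dict) -> list:
    """Return (page, source_text) pairs so a reviewer can verify a field."""
    return [(c["page"], c["text"]) for c in field["citations"]]
```

In an AP or claims review UI, `audit_trail` is the piece that lets a human jump straight to the page a number came from.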

How Extract works

How Extract works in four steps

  1. Send a file + schema

    Upload a file or point at a URL. Define the fields you want in a schema.

    POST /extract
  2. Parse runs underneath

    OCR, layout detection, and table reconstruction produce structured content for the extractor to read.

    jobid:// available
  3. LLM locates each field

    Field names and descriptions guide the model. Array extract handles long lists and Deep Extract iterates for accuracy.

    schema → values
  4. You get typed JSON

    Output matches your schema with optional citations on every value.

    { value, citations }
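The four steps above, end to end in miniature: a request payload and a sample response in the `{ value, citations }` shape the steps describe. The response here is hand-written sample data, not real API output, and the key names are assumptions.

```python
# Request: file + schema (shape assumed for illustration).
request = {
    "document_url": "https://example.com/contract.pdf",
    "schema": {
        "type": "object",
        "properties": {
            "effective_date": {"type": "string"},
            "governing_law": {"type": "string"},
        },
    },
}

# Response: each field is { value, citations } per the steps above
# (sample data in an assumed shape).
sample_response = {
    "effective_date": {"value": "2024-03-01",
                       "citations": [{"page": 1, "text": "effective as of March 1, 2024"}]},
    "governing_law": {"value": "Delaware",
                      "citations": [{"page": 12, "text": "governed by the laws of Delaware"}]},
}

def plain_values(response: dict) -> dict:
    """Strip citations, keeping field -> value for a database write."""
    return {field: body["value"] for field, body in response.items()}

row = plain_values(sample_response)
```

Keeping the citations around until the final write is what makes the pipeline auditable; `plain_values` is the last step before the database.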
Built for production

Enterprise-ready from day one

  • SOC 2 Type II
  • HIPAA
  • Zero Data Retention
  • VPC · On-prem · Air-gapped
  • EU · AU regional endpoints
  • 99.9%+ uptime SLA
  • Enterprise support
Visit the Trust Center



Document work starts here

Define the fields. Get cited values back.

Drop a PDF in Studio or hit the API with one call. No setup, no credit card.
