
Extraction Trouble? Here Are 5 Pitfalls to Avoid when Configuring Your JSON Schema

Even with a great document parsing model, a malformed or ambiguous schema will derail your extraction outputs, leading to missing fields, weird formatting, and a lot of head-scratching. We've helped tons of teams fine-tune their document extraction setups, and these 5 schema pitfalls come up again and again.

Let’s walk through them with examples for each (so you don’t have to learn the hard way).
If you’re looking for general extraction API information, check out our documentation.


1. You left field key descriptions blank

Leaving schema field descriptions blank is one of the most common mistakes we see. Descriptions help the model disambiguate similar fields and improve extraction accuracy; think of each description as a “prompt” of sorts that guides the model.

Fix it: Always provide a short, natural-language description of what each field means and how it should be extracted.

❌ Don't do this:

"properties": {
    "account_num": { "type": "string" }
}

✅ Do this:

"properties": {
    "account_num": {
        "type": "string",
        "description": "A unique identifier assigned to a specific financial account. Typically appears in the top section of a statement, next to or beneath the account holder’s name."
    }
}
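
To catch this before a schema ever reaches an extraction call, you can lint it for empty descriptions. Here's a minimal sketch in Python; find_missing_descriptions and the extraction_schema.json filename are our own, not part of any SDK.

import json

def find_missing_descriptions(schema: dict, path: str = "") -> list[str]:
    """Return the paths of all schema properties that lack a description."""
    missing = []
    for key, field in schema.get("properties", {}).items():
        field_path = f"{path}.{key}" if path else key
        if not str(field.get("description", "")).strip():
            missing.append(field_path)
        # Recurse into nested objects and array items
        if field.get("type") == "object":
            missing += find_missing_descriptions(field, field_path)
        elif field.get("type") == "array":
            missing += find_missing_descriptions(field.get("items", {}), field_path + "[]")
    return missing

with open("extraction_schema.json") as f:
    schema = json.load(f)

for field_path in find_missing_descriptions(schema):
    print(f"Missing description: {field_path}")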

2. Your field key names are too disconnected

If your key is id_32 but the content says “Invoice Date,” you're setting your model up to fail. Key names should be descriptive and semantically tied to the document: use invoice_date, not id_32. It also helps to list keys in a logical order, such as top to bottom or left to right as they appear on the page.

Fix it: Use keys that match the document’s content as closely as possible. If you’re extracting from a table, reusing the column/row headers as key names is very helpful; see the sketch after the examples below.

❌ Don't do this:

"properties": {
    "id_32": {
        "type": "string",
        "description": "Date the invoice was fulfilled"
    }
}

✅ Do this:

"properties": {
    "invoice_date": {
        "type": "string",
        "description": "Date the invoice was fulfilled"
    }
}
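
For table extraction specifically, you can derive key names directly from the column headers so they stay semantically tied to the document. A small sketch; header_to_key and properties_from_headers are our own helpers, not part of any SDK.

import re

def header_to_key(header: str) -> str:
    """Turn a column header like 'Invoice Date' into a key like 'invoice_date'."""
    return re.sub(r"[^0-9a-z]+", "_", header.strip().lower()).strip("_")

def properties_from_headers(headers: list[str]) -> dict:
    """Build a properties block whose keys mirror the table's column headers."""
    return {
        header_to_key(h): {
            "type": "string",
            "description": f"Value from the '{h}' column of the table.",
        }
        for h in headers
    }

print(properties_from_headers(["Invoice Date", "Amount Due (USD)"]))
# Keys come out as 'invoice_date' and 'amount_due_usd'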

3. There are only a couple of potential outputs, but you didn’t add an enum type

If a field can only be one of a few values (e.g. "Yes", "No", "N/A"), but your schema allows free text, you're opening the door to inconsistencies like “Y”, “nope”, or “—”. A helpful tip is that you can also include enum descriptions in your field description if the names themselves aren’t self-explanatory.

Fix it: Use the optional enum field to define the acceptable outputs. This closes off the room for error that free text leaves open.

❌ Don't do this:

"properties": {
    "currency_type": {
        "type": "string",
        "description": "International currency codes"
    }
}

✅ Do this:

"properties": {
    "currency_type": {
        "type": "string",
        "enum": ["USD", "EUR", "JPY", "CAD", "AUD", "Other"],
        "description": "International currency codes"
    }
}
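
Building on the tip above: when the enum values themselves aren't self-explanatory, spell them out in the field description. A sketch with a hypothetical payment_terms field (not from any real schema):

# Hypothetical payment_terms field: the enum constrains the output, and the
# description explains what each allowed value means.
payment_terms_field = {
    "payment_terms": {
        "type": "string",
        "enum": ["NET30", "NET60", "DUE_ON_RECEIPT", "Other"],
        "description": (
            "Payment terms stated on the invoice. "
            "NET30 = due within 30 days, "
            "NET60 = due within 60 days, "
            "DUE_ON_RECEIPT = due immediately, "
            "Other = any terms not listed above."
        ),
    }
}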

4. You’re embedding math into the schema prompts

LLMs are notorious for being bad at math. Prompting for values that don't exist in the document can lead to mixed results, so keep keys and values to what is on the page. For example, compute a “Yearly cost” by extracting a “Monthly cost” value and multiplying by 12 downstream, not by prompting for “Monthly cost * 12”.

Fix it: Extract raw values in one pass, then handle calculations downstream. Keep extraction calls focused on what’s directly visible in the document.

❌ Don't do this:

"properties": {
    "monthly_cost": {
        "type": "string",
        "description": "The total monthly cost for this service."
    },
    "annual_cost": { // This doesn't exist in the document
        "type": "string",
        "description": "Calculate the total yearly cost for this service with 'Total Monthly Price' * 12."
    }
}

✅ Do this:

"properties": {
    "monthly_cost": {
        "type": "string",
        "description": "The total monthly cost for this service."
    }
}

...then compute the derived value downstream:

# monthly_cost is extracted as a string, so convert it before multiplying
total_annual_price = float(extract_result.json()["result"][0]["monthly_cost"]) * 12
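
In practice, extracted cost fields often come back as currency-formatted strings (e.g. "$1,250.00"), so a bare float() may not be enough. Here's a minimal sketch of that downstream step; parse_money is our own helper, and the currency formatting is an assumption about your documents rather than anything the API guarantees.

import re

def parse_money(raw: str) -> float:
    """Strip currency symbols and thousands separators, e.g. '$1,250.00' -> 1250.0."""
    return float(re.sub(r"[^0-9.\-]", "", raw))

# extract_result is the extraction response used above; the ["result"][0]
# shape follows the earlier snippet.
monthly_cost = parse_money(extract_result.json()["result"][0]["monthly_cost"])
total_annual_price = monthly_cost * 12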

5. Your system prompt is lacking

A strong schema needs an equally strong system prompt. A good system prompt gives guidance to the model about the document as a whole. What kind of document is it? Is there specific jargon? Are certain tables located near each other?

Fix it: Explain what kind of document you're parsing, call out anything unique about it, and note any nuances the model should be aware of.

❌ Don't do this:

"system_prompt": null // No prompt!

✅ Do this:

"system_prompt": "Be precise and thorough. This is a semi-structured invoice typically spanning 1–3 pages, with sections like a header, vendor info, a line-item table, and a summary total. Use visual layout cues such as bold labels, column alignment, and section dividers to interpret structure. Extract values like invoice number, dates, and itemized charges as they appear, normalizing formats where needed and returning null for missing or ambiguous fields…"
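
If you handle several document types, it can help to assemble system prompts from a small amount of structured metadata instead of hand-writing each one. This is our own sketch; build_system_prompt and its fields are hypothetical, not a Reducto feature.

# Hypothetical helper: assemble a system prompt from notes about the document.
def build_system_prompt(doc_type: str, layout_notes: str, jargon: dict[str, str]) -> str:
    glossary = " ".join(f"'{term}' means {meaning}." for term, meaning in jargon.items())
    return (
        f"Be precise and thorough. This is a {doc_type}. {layout_notes} {glossary} "
        "Extract values as they appear, normalizing formats where needed and "
        "returning null for missing or ambiguous fields."
    )

system_prompt = build_system_prompt(
    doc_type="semi-structured invoice typically spanning 1-3 pages",
    layout_notes="It has a header, vendor info, a line-item table, and a summary total.",
    jargon={"PO #": "the purchase order number issued by the buyer"},
)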

Next Steps

Your schema is more than just a list of fields: a well-designed schema dramatically boosts extraction accuracy, reduces unexpected outputs, and makes your pipeline easier to debug. Once your extraction is robust, the benefits trickle down to the rest of your implementation.

Our Reducto Playground makes it easy to test out different schemas, visualize the outputs, and use AI prompting to improve your schema. Once you've extracted the relevant data, you're in the perfect position to start building extraction pipelines, combining all of our APIs (parse, split, extract) into fully functional ingestion workflows!
