RD-TableBench is an open benchmark to help teams evaluate extraction performance for complex tables. The benchmark includes a variety of challenging scenarios including scanned tables, handwriting, language detection, merged cells, and more.
The full dataset including labels is available here.
We also benchmarked the extraction performance of various models with RD-TableBench here. All result data points are available in the RD-TableBench Demo.
Reducto employed a team of PhD-level human labelers who manually annotated 1000 complex table images from a diverse set of publicly available documents. While other approaches benchmarked may have been trained on some of this data, it was unseen to Reducto's models during both training and validation.
The dataset was selected to include examples spanning a range of structures, text densities, and languages. The following graphs show the distribution of table size (by number of cells) and language across the dataset.
For the initial release, we evaluated the following tools/methods: Reducto, Azure Document Intelligence, AWS Textract Tables, GPT4o, Google Cloud Document AI, Unstructured, Chunkr, and LlamaParse.
All of the tools, except AWS Textract Tables, are general document parsing solutions that first detect the presence of tables and then provide parsed outputs in a specific format. We used the highest-quality settings for each tool where applicable (e.g., Unstructured and Chunkr were both run in their more expensive High Res modes, and LlamaParse was invoked in Pro mode). Invocation and grading code is available here.
All tools were invoked by passing the table centered in a PDF with whitespace padding. For tools without direct PDF processing ability (GPT4o), we converted the PDFs to images using the poppler library.
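For illustration, here is a minimal sketch of that conversion step using the pdf2image Python wrapper around poppler; the helper name, DPI, and output format are our own choices for the example, not necessarily what the benchmark's invocation code does.

```python
# Minimal sketch: rasterize a single-table PDF for tools that only accept images.
# Assumes the pdf2image package (a wrapper around poppler's pdftoppm) is installed.
from pdf2image import convert_from_path

def pdf_to_png(pdf_path: str, out_path: str, dpi: int = 300) -> str:
    """Render the first page of a padded single-table PDF to a PNG file."""
    pages = convert_from_path(pdf_path, dpi=dpi)
    pages[0].save(out_path, "PNG")
    return out_path
```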
To benchmark effectively, we needed a metric to determine the distance between two tables. A table can be represented as a 2D string array, with each cell corresponding to a table cell. Merged cells are represented by repeating their values across every cell they occupy. For example:
Hello | this | is
a | test

becomes

Hello | this | is
a | a | test
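As a concrete illustration, the merged-cell example above could be represented as a nested Python list; the variable name is ours and this is only a sketch of the representation described above.

```python
# The merged-cell example above as a 2D string array (nested list):
# the merged cell "a" is repeated across every column position it spans.
ground_truth = [
    ["Hello", "this", "is"],
    ["a",     "a",    "test"],
]
```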
A simple check for an exact match would penalize minor deviations heavily, so we adopted a more flexible approach using the Needleman-Wunsch algorithm.
We treat table comparison as a hierarchical alignment problem, similar to DNA sequence alignment, employing a two-level strategy: cell-level comparison of individual cell contents, and row-level alignment of entire rows across the two tables.
This hierarchical approach captures both structural similarity (how rows align) and content similarity (within corresponding rows).
We use Levenshtein distance to capture partial matches between individual cells, rather than relying on binary exact matches. This metric counts the minimum number of edits required to transform one string into another.
The distance is normalized to a similarity score between 0 and 1, where 1 indicates identical cell contents.
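A minimal sketch of the cell-level score, assuming the edit distance is normalized by the length of the longer string (the released grading code may use a different normalization):

```python
# Cell-level similarity: Levenshtein edit distance normalized by the longer
# string's length. The normalization scheme here is an assumption.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[len(b)]

def cell_similarity(a: str, b: str) -> float:
    """1.0 for identical cells, 0.0 for completely different ones."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```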
At the row level, Needleman-Wunsch is adapted to use the cell-level similarity scores. Missing rows at the beginning or end of tables are not penalized (free end gaps), accommodating natural subtable cropping.
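The sketch below shows one way such a row-level alignment with free end gaps could look, reusing the cell_similarity helper from the previous snippet. The gap penalty, the positional row-similarity shortcut, and the final normalization are assumptions for illustration, not the benchmark's actual implementation.

```python
# Hedged sketch: row-level Needleman-Wunsch with free end gaps, so unmatched
# rows at either end of either table cost nothing (accommodating cropping).
GAP_PENALTY = -1.0  # assumed cost of skipping a row in the interior

def row_similarity(row_a: list[str], row_b: list[str]) -> float:
    """Average cell similarity over positions, padding the shorter row."""
    width = max(len(row_a), len(row_b))
    if width == 0:
        return 1.0
    padded_a = row_a + [""] * (width - len(row_a))
    padded_b = row_b + [""] * (width - len(row_b))
    return sum(cell_similarity(a, b) for a, b in zip(padded_a, padded_b)) / width

def table_similarity(pred: list[list[str]], truth: list[list[str]]) -> float:
    """Align rows of two tables; leading and trailing gaps are free."""
    n, m = len(pred), len(truth)
    if n == 0 or m == 0:
        return 0.0
    # score[i][j] = best score aligning pred[:i] with truth[:j].
    # Leading gaps are free, so the first row and column stay at 0.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + row_similarity(pred[i - 1], truth[j - 1]),
                score[i - 1][j] + GAP_PENALTY,  # drop a predicted row
                score[i][j - 1] + GAP_PENALTY,  # drop a ground-truth row
            )
    # Free end gaps: take the best score along the last row or last column.
    best = max(max(score[n]), max(row[m] for row in score))
    # Normalize by the larger row count so identical tables score 1.0 (assumed).
    return max(0.0, best / max(n, m))
```

Zero-initializing the first row and column, and taking the maximum over the last row and column, is the standard way to make end gaps free in Needleman-Wunsch style alignment.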
This approach effectively handles partial cell matches, merged cells, and rows missing from the beginning or end of a table.
The final similarity score, normalized between 0 and 1, indicates how closely the extracted table matches the ground-truth annotation, with 1 representing an exact match.
You can explore the full implementation and scoring system here.
PubTabNet and FinTabNet offer large datasets for table evaluation. However, these datasets are each drawn from a homogeneous corpus and rely on labels generated programmatically from file metadata.
Our goal with RD-TableBench is to provide a more diverse set of real-world examples, ensuring accuracy with manual annotations.
RD-TableBench is intended solely for evaluation and testing purposes, but we recognize that releasing the benchmark risks the use of this data in future model training. To maintain scoring integrity, only a subset of our evaluation framework is being released.
Grading results for the various parsing options are available here.