The data layer for document intelligence

Enterprise AI runs on documents. We build the datasets that make it possible.

Production-grade document AI datasets don't exist. Current options are synthetic, dated, and blind to the long tail.

Shipping real-world document AI taught us that the bottleneck is high-quality data. That experience shapes our datasets.

/ DATASETS

Featured Datasets

98% accuracy with massive coverage across 100+ languages, 20+ domains, and every document type in the wild. Document Understanding covers layout and parsing; Document Action covers complex workflows.

Document Understanding

Parsing

Setting the gold standard for real-world document understanding. End-to-end parsing covering layout, reading order, OCR in 50+ languages, table-to-HTML conversion, forms, formulas, and charts.

20K+ Documents, 3M+ Elements, 50+ Languages, 20+ Domains

/ PRODUCT

Complete, living data products

Our datasets are stress-tested with in-house models and continuously improved. Complete data products built with the same rigor as the models they train.

Core Data

Expert-created and rigorously sourced. Domain expertise throughout, from annotation to QA/QC, continuously refined.

Expansion

Synthetic expansion rigorously developed on top of core data. With rich metadata for building your own splits and crafting solid training recipes.

Insights

Interactive reports to explore how we built it. Sourcing, distributions, annotation logic. ML learnings from in-house training to inform your experiments.

Iterative

Accuracy, schemas, and coverage all improve continuously after delivery. An ongoing partnership around your evolving needs.

/ CONTACT

Get samples or build a dataset with us

Our library goes beyond what's listed here. Whether you need an off-the-shelf dataset or a custom build, we're ready to help.

How we work with you

01

Tell us what you need

We start with a short call to understand your exact needs, then identify the best dataset for you and share samples.

02

Simple licensing

Straightforward data license for your specific use cases. We skip the procurement headache.

03

Start training

Get access to production-grade data in days, not months. Your team starts building immediately.

Bespoke Partnerships

For novel tasks where our off-the-shelf data doesn't fit, we partner with labs to create it. We build the exact recipe you need.

  • Custom recipes designed for your specific capability
  • Deep collaboration with your researchers
  • Scale from pilot to production volume seamlessly

FLOATING-POINT
