βΉοΈ Document Types¶
Types of documents¶
Itβs possible to categorize documents into three distinct types: structured, semi-structured, and unstructured. Each type comes with its own extraction challenges and typically benefits from different AI approaches.
Info
Identify the document type first, then choose an appropriate AI service and configuration to maximize accuracy and speed.
Structured documents¶
Structured documents present information in a fixed, predictable layout with clearly defined fields (often with boxes, labels, and consistent positions). Because their format is standardized, they lend themselves to high-precision extraction with minimal ambiguity.
- Characteristics:
- Fixed templates and layouts
- Clearly labeled fields and stable positions
- Minimal visual or textual variability between instances
- Common examples:
- Application forms, standardized claim forms, registration forms
- Typical challenges:
- Low-quality scans or skewed images can reduce detection accuracy
- Handwritten entries may require specialized OCR or handwriting models
- Tips:
- Use high-quality scans or digital originals
- If handwriting is present, enable selection marks/handwriting options where available
Semi-structured documents¶
Semi-structured documents donβt follow a single rigid template, but they do follow common patterns. Core fields are consistent conceptually (e.g., βInvoice Number,β βTotal,β βDateβ), yet their positions, labels, and formats can vary by issuer or layout. Extraction for this class typically benefits from pretrained models that learn the βconceptsβ rather than fixed coordinates.
- Characteristics:
- Repeating concepts with variable positions and labels
- Moderate layout diversity across sources/vendors
- Tables and line items are common
- Common examples:
- Invoices, receipts, purchase orders, statements
- Typical challenges:
- Field labels vary (e.g., βInvoice #β, βInv No.β, βBill No.β)
- Line items may appear in different table shapes or across multiple pages
- Tips:
- Use services trained for invoices/receipts to handle variability
- For tables, enable table detection and verify column mapping
Unstructured documents¶
Unstructured documents contain rich free-form text with little to no consistent layout or fixed fields. Valuable information may be embedded in paragraphs, clauses, or mixed content. These require layout-aware OCR coupled with semantic understanding or targeted rules.
- Characteristics:
- Free-form text; minimal consistent structure
- Information is context-dependent and dispersed
- High variability between documents
- Common examples:
- Contracts, policies, letters, emails, reports
- Typical challenges:
- Key data resides in narrative text rather than labeled fields
- Requires semantic parsing, entity recognition, or custom rules
- Tips:
- Start with robust OCR/layout extraction
- Apply targeted extraction rules, NLP, or verification to pinpoint entities and clauses