Tesseract OCR¶
Overview¶
The Tesseract OCR Service in AIForged uses the open-source Tesseract engine to extract plain text from images and PDFs. It returns raw text (no layout preservation) and stores the output in the document's Result property for downstream rules, regex, analytics, or integration. Optionally, you can save lines and words for verification.
Info
Use this service when you need fast, cost-effective plain text extraction from scans or images. For structured fields and tables, consider a Document Intelligence service.
Permissions Required¶
Members must belong to one of the following AIForged user group roles to add and configure this service:
- Owner
- Administrator
- Developer
Info
Tip: Role membership is managed in Organisations > Roles. Assign members to roles to grant agent and service administration access.
Supported Content Types¶
- JPEG
- PNG
- TIFF
Info
If your content is in a different format, use the AIForged PDF Converter to generate a compatible file.
Possible Use Cases¶
- Extract text from scanned documents (receipts, invoices, contracts, letters).
- Run RegEx patterns on extracted text to detect and structure key information.
- Index documents for search and retrieval.
- Prepare datasets by standardizing OCR output for analytics and downstream processing.
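As an illustration of the RegEx use case above, the following minimal sketch pulls a few fields out of plain OCR text. The sample text, field names, and patterns are all hypothetical; in practice the input would be the text stored in the document's Result property, and the patterns would need tuning to your own documents.

```python
import re

# Hypothetical OCR output; in AIForged this would be the text stored in the
# document's Result property.
ocr_text = """ACME Supplies Ltd
Invoice No: INV-20431
Date: 2024-03-15
Total Due: 1,249.50
"""

# Illustrative patterns only (assumed field layout, not an AIForged API).
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*No[:.]?\s*([A-Z]+-\d+)", re.IGNORECASE),
    "date": re.compile(r"Date[:.]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"Total\s*Due[:.]?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(text):
    """Return the first match for each pattern, or None when nothing matches."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


fields = extract_fields(ocr_text)
print(fields)
```

Returning None for missing fields, rather than raising, makes it easier to flag documents whose OCR output needs review.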
Service Setup¶
Follow these steps to add and configure the Tesseract OCR Service to your agent:
- Open the Agent View
Navigate to the agent where you want to add the service.
- Add the Tesseract OCR Service
Click the Add Service button.
- Select Service Type
Choose Tesseract OCR Service from the available service types.
- Configure the Service Wizard
Open the Service Configuration Wizard and complete the following steps:
    - Step 1: General Settings
    Configure core OCR settings such as service name and description. Default settings are sufficient for most use cases.
    - Step 2: Verification
    Verification is typically not required for plain text OCR unless paired with downstream validation.
Service Configuration Settings¶
Most users can proceed with default settings. Advanced configuration is available for custom workflows.
Setting | Type | Required? | Description |
---|---|---|---|
ArchivingStrategy | Optional | No | Number of days before documents are automatically deleted. |
BatchSize | Hidden | - | Processing batch size. |
CheckElectronic | Optional | No | If a searchable text layer is detected, use it instead of OCR. |
DocumentProcessedStatus | Optional | No | Status applied after successful processing. |
Enabled | Hidden | - | Enable or disable the service. |
ExecuteBeforeProcess | Optional | No | If configured as a child service, execute before the parent service. |
ExecuteAfterProcess | Optional | No | If configured as a child service, execute after the parent service. |
IsVerification | Optional | No | Save OCR lines and words for human verification. |
Language | Optional | No | Provide a language hint (e.g., eng, deu) to improve accuracy; multiple languages can be combined. |
MinSize | Optional | No | Minimum document size threshold to trigger OCR. |
Password | Optional | No | Authentication/password handling; can be set per document using Custom Code. |
RemoveComments | Optional | No | Remove human comments/annotations before processing. |
Info
If AccessKey or BaseURL fields appear in your environment, leave them empty. Tesseract in AIForged does not require provider keys or endpoints.
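The Tesseract engine itself combines multiple languages by joining their codes with a plus sign (for example `eng+deu`). The sketch below builds such a string from a list of codes. It assumes the AIForged Language setting accepts the same format as the underlying engine, which is not confirmed here; the helper name `build_language_hint` is hypothetical, and the shape check is deliberately loose and does not verify that a language pack is installed.

```python
import re

# Loose shape check for Tesseract-style codes such as "eng" or "chi_sim".
# This is an assumption for illustration, not a list of installed packs.
_CODE = re.compile(r"^[a-z]{3}(_[a-z]+)*$")


def build_language_hint(codes):
    """Join language codes with '+', dropping duplicates and keeping order."""
    cleaned = []
    for code in codes:
        code = code.strip().lower()
        if not _CODE.match(code):
            raise ValueError(f"unexpected language code: {code!r}")
        if code not in cleaned:
            cleaned.append(code)
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


print(build_language_hint(["eng", "deu"]))  # prints eng+deu
```

Listing the document's primary language first is a sensible default; adding languages that never appear in your documents tends to slow recognition without improving it.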
Add and Process Documents¶
To upload and process documents using the Tesseract OCR Service:
- Open Service
When you open the Tesseract OCR Service, you will be presented with the documents currently queued or processed in the Inbox.
- Upload Documents
Click the Upload button or drag and drop files over the document grid.
- Select Category (Optional)
If you know the category for the document, select it. Otherwise, leave as No selection.
- Process Documents
After uploading, select the documents to process and click Process Checked.
Info
Tip: For new services, process a small batch first to verify OCR results before scaling up.
View Processed Documents¶
- Use the usage filter to select Outbox in the Tesseract OCR Service.
- Open any processed document to view the extracted text in the Result property.
Troubleshooting Tips¶
- Text missing or incorrect
    - Improve scan quality and resolution (aim for 300 DPI), deskew pages, and reduce noise.
    - Set the Language setting to match the document's primary language (e.g., eng, fra).
    - Use the AIForged PDF Converter to normalize unusual inputs.
- Poor results on complex layouts
    - Tesseract produces plain text; complex multi-column layouts may not read in order.
    - If layout structure is critical, consider a Document Intelligence service.
- Slow performance on large files
    - Process in smaller batches and split very large PDFs/TIFFs into page ranges using the AIForged Document Splitter.
- Upload or processing errors
    - Re-upload a clean copy to rule out file corruption.
    - If the PDF is password protected, set the password via Custom Code pre-processing so AIForged can unlock the file.
    - Review service settings and try again.
Best Practices¶
- Scan at 300 DPI or higher with good contrast; avoid shadows and skew.
- Always set the correct Language (and additional languages if applicable) to improve recognition.
- Pre-process images when needed: deskew, denoise, and increase contrast before uploading.
- Use categories to keep processing organized and facilitate downstream rules.
- Start with a representative pilot batch to validate accuracy before large-scale runs.
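To check the 300 DPI guideline before uploading, you can estimate a scan's effective resolution from its pixel dimensions and the physical page size. The sketch below is plain arithmetic; the function names are hypothetical, and it assumes you already know the page size in inches (US Letter, 8.5 x 11 in, is used as the example).

```python
def effective_dpi(pixels, inches):
    """Pixels per inch along one dimension of the scanned page."""
    if inches <= 0:
        raise ValueError("page size must be positive")
    return pixels / inches


def meets_target(width_px, height_px, width_in, height_in, target_dpi=300):
    """True when the lower of the two per-axis resolutions reaches the target."""
    lowest = min(effective_dpi(width_px, width_in),
                 effective_dpi(height_px, height_in))
    return lowest >= target_dpi


# A US Letter page (8.5 x 11 in) scanned at 2550 x 3300 px is 300 DPI.
print(meets_target(2550, 3300, 8.5, 11))  # prints True
# The same page at 1275 x 1650 px is only 150 DPI.
print(meets_target(1275, 1650, 8.5, 11))  # prints False
```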
Known Limitations¶
- Handwriting: Tesseract is optimized for printed text. Handwriting accuracy is generally low; test representative samples before relying on results.
- Layout: Output is plain text; reading order may not reflect complex page layouts (columns, sidebars, rotated blocks).
- Image quality: Very small fonts, heavy noise, blur, or significant skew reduce accuracy. Curved or warped text is challenging.
- Languages: Accuracy depends on the selected language packs; choose the correct language(s) for best results.
- Password-protected PDFs: Cannot be processed directly.
Info
Tip: Use the AIForged Custom Code utility as a pre-processor to set the password per document, so that AIForged can unlock documents before processing.
Info
Tip: For very large PDFs or TIFFs, consider chunking by page range to avoid size limits and to keep processing responsive. Use the AIForged Document Splitter for this task.
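When planning how to chunk a large file, it can help to work out the page ranges up front. The sketch below only computes 1-based, inclusive ranges for a given page count and chunk size; it does not call the AIForged Document Splitter, whose configuration is not covered here, and the function name `page_ranges` is hypothetical.

```python
def page_ranges(total_pages, chunk_size):
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


print(page_ranges(23, 10))  # prints [(1, 10), (11, 20), (21, 23)]
```

The last range is shortened to fit the page count, so no chunk refers to pages that do not exist.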