Tags: AI, OCR, Document Processing, Python, FastAPI

AI Document Processing: OCR, Classification & Workflow Automation

How to build a production AI document processing pipeline that ingests PDFs, runs OCR, classifies documents by type, extracts structured fields, and pushes data to your ERP.

Softotic Engineering | 18 February 2025 | 3 min read

Document-heavy industries — logistics, finance, legal, healthcare — spend significant human effort on data entry from PDFs, scanned forms, and images. AI document processing pipelines can automate 90%+ of this work. Here's how we build them at Softotic.

The Problem: Unstructured Documents at Scale

Manual document processing suffers from:

  • High error rates (4–8% is typical for manual data entry)
  • Slow throughput — humans process ~50–100 documents/hour
  • Inability to scale during peak periods
  • No audit trail of extracted values

Pipeline Architecture

A complete document processing pipeline has six stages; the final stage combines validation with the ERP push:

code
[Ingestion] → [Pre-processing] → [OCR] → [Classification] → [Extraction] → [Validation] → [ERP Push]
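
One way to picture the chain above is as a list of callables that each take and return a shared context dictionary. This is a minimal sketch rather than Softotic's implementation: `run_pipeline`, the `needs_review` flag, and the toy stages are hypothetical names chosen for illustration.

```python
from typing import Callable

# A stage takes the running context and returns it, possibly with new keys.
Stage = Callable[[dict], dict]

def run_pipeline(context: dict, stages: list[Stage]) -> dict:
    """Run stages in order; stop early once a stage flags the document for review."""
    for stage in stages:
        context = stage(context)
        context.setdefault("completed", []).append(stage.__name__)
        if context.get("needs_review"):
            break
    return context

# Toy stages standing in for the real OCR and classification steps.
def ocr(context: dict) -> dict:
    return {**context, "text": "INVOICE #A-100"}

def classify(context: dict) -> dict:
    confident = "INVOICE" in context["text"]
    return {
        **context,
        "doc_class": "invoice" if confident else None,
        "needs_review": not confident,
    }

result = run_pipeline({"document_id": "doc-1"}, [ocr, classify])
```

Keeping each stage a plain function makes it straightforward to run stages as separate background jobs later, since the context dictionary is the only thing passed between them.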

Stage 1: Document Ingestion

Documents arrive via:

  • Email attachments (via IMAP listener or email webhook)
  • API upload (POST /documents with multipart form)
  • FTP/SFTP directory watch
  • WhatsApp/messaging (webhook)

Normalise to PDF: convert DOCX, TIFF, and JPEG inputs to PDF with a document-conversion tool before they enter the rest of the pipeline.
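
Before converting, the pipeline has to know what it received, and file extensions on email attachments are not always trustworthy. The sketch below checks the leading bytes of an upload instead. The helper names `sniff_format` and `needs_conversion` are hypothetical, and the actual conversion step is left out.

```python
# Leading "magic" bytes for the input formats named above.
MAGIC_PREFIXES = {
    "pdf": (b"%PDF-",),
    "jpeg": (b"\xff\xd8\xff",),
    "tiff": (b"II*\x00", b"MM\x00*"),  # little- and big-endian TIFF
    "zip": (b"PK\x03\x04",),  # DOCX is a ZIP container
}

def sniff_format(data: bytes) -> str:
    """Return a coarse format label based on the file's first bytes."""
    for label, prefixes in MAGIC_PREFIXES.items():
        if data.startswith(prefixes):
            return label
    return "unknown"

def needs_conversion(data: bytes) -> bool:
    """True when the upload is a recognised non-PDF format that must be converted."""
    return sniff_format(data) in {"jpeg", "tiff", "zip"}
```

A ZIP signature only shows that the file might be a DOCX, so a real ingestion service would still need to inspect the archive contents and reject anything labelled `unknown`.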

Stage 2: Pre-Processing

  • Deskew and denoise scanned images using OpenCV
  • Split multi-page documents into individual page images
  • Resize to optimal resolution for OCR (300 DPI for thermal prints, 200 DPI for standard scans)
python
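import cv2
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Denoise and binarise a page image (deskewing is not covered here)."""
    # Scans may already be single-channel; only convert colour (BGR) input.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
    # Non-local means denoising removes scanner speckle before thresholding.
    denoised = cv2.fastNlMeansDenoising(gray)
    # Otsu's method picks the threshold automatically, so the 0 is ignored.
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary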
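
The resize step mentioned in the list comes down to one scale factor, target DPI divided by source DPI. As a small worked example (the function name `target_size` is hypothetical):

```python
def target_size(
    width_px: int, height_px: int, source_dpi: float, target_dpi: float
) -> tuple[int, int]:
    """Pixel dimensions after resampling from source_dpi to target_dpi."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    scale = target_dpi / source_dpi
    return round(width_px * scale), round(height_px * scale)

# An A4 page scanned at 150 DPI is roughly 1240 x 1754 px.
new_width, new_height = target_size(1240, 1754, source_dpi=150, target_dpi=300)
```

The tuple is returned in (width, height) order, which is the order `cv2.resize` expects for its `dsize` argument.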

Stage 3: OCR

Use a hybrid approach for best accuracy:

  • AWS Textract for structured forms and tables (handles 2-column layouts, checkboxes)
  • Tesseract as fallback (for offline or cost-sensitive pipelines)
  • Azure Form Recognizer (since renamed Azure AI Document Intelligence) for specific form templates with pre-built models
python
import boto3

textract = boto3.client("textract")

def run_ocr(document_bytes: bytes) -> list[dict]:
    """Run synchronous Textract analysis and return the response's Blocks list."""
    response = textract.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["TABLES", "FORMS"],
    )
    return response["Blocks"]
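
Textract returns a flat list of blocks (pages, lines, words, table cells and so on), so the classifier and regex stages usually want that list reduced to plain text first. A minimal sketch, assuming only the `BlockType`, `Text`, and `Confidence` keys of each block; the helper names and the sample data are hypothetical.

```python
def lines_from_blocks(blocks: list[dict], min_confidence: float = 0.0) -> list[str]:
    """Collect the text of LINE blocks at or above min_confidence (0-100 scale)."""
    return [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]

def mean_confidence(blocks: list[dict]) -> float:
    """Average confidence over LINE blocks; 0.0 when there are none."""
    scores = [
        b["Confidence"]
        for b in blocks
        if b.get("BlockType") == "LINE" and "Confidence" in b
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Hand-written sample shaped like a Textract response, not real output.
sample_blocks = [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "INVOICE", "Confidence": 99.0},
    {"BlockType": "LINE", "Text": "Total Due: $120.50", "Confidence": 91.0},
    {"BlockType": "WORD", "Text": "INVOICE", "Confidence": 99.0},
]
text = "\n".join(lines_from_blocks(sample_blocks))
```

The mean line confidence is one possible signal to store alongside the document, for example to prioritise pages in the review queue.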

Stage 4: Document Classification

Train a multi-class classifier on document layout and keyword features:

  • Inputs: OCR text, page count, presence of keywords (e.g., "INVOICE", "BILL OF LADING")
  • Model: Fine-tuned DistilBERT or simpler TF-IDF + LogisticRegression for high-volume low-cost classification
  • Classes: Invoice, Delivery Note, Customs Declaration, Contract, ID Document, etc.

Confidence threshold: if the classifier's top-class probability is below 0.85, route the document to the human review queue.
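
The routing rule can be sketched independently of the model. The function below assumes the classifier's output has already been turned into a class-to-probability mapping (for a scikit-learn model, by pairing `classes_` with one row of `predict_proba`); `route` and the destination labels are hypothetical names.

```python
REVIEW_THRESHOLD = 0.85  # threshold stated in the text

def route(
    probabilities: dict[str, float], threshold: float = REVIEW_THRESHOLD
) -> tuple[str, str]:
    """Return (destination, predicted_class) for one document."""
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    # The predicted class is the one with the highest probability.
    predicted = max(probabilities, key=probabilities.get)
    confident = probabilities[predicted] >= threshold
    return ("extraction" if confident else "human_review"), predicted
```

For example, `route({"invoice": 0.93, "contract": 0.07})` sends the document on to extraction, while a 0.60 / 0.40 split goes to review.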

Stage 5: Field Extraction

Per document class, extract structured fields using:

  • Template matching: regex for IDs, amounts, dates in known positions
  • ML-based extraction: LayoutLM or Donut models, fine-tuned on labelled samples, for templates that fixed rules do not cover
python
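import re
from typing import Optional

def _first_group(pattern: str, text: str) -> Optional[str]:
    """Return the first capture group of a case-insensitive match, or None."""
    match = re.search(pattern, text, re.I)
    return match.group(1) if match else None

def extract_invoice_fields(text: str) -> dict:
    # Each value is the captured string, or None when the pattern is absent.
    return {
        "invoice_number": _first_group(r"Invoice\s*#?\s*([A-Z0-9-]+)", text),
        "amount_due": _first_group(r"Total\s*Due\s*:?\s*[\$£]?([\d,]+\.?\d*)", text),
        "due_date": _first_group(r"Due\s*Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})", text),
    }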
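
The patterns capture raw strings such as `1,250.00` or `05/03/2025`, which still need converting to typed values before validation. A minimal sketch of that step follows; `parse_amount` and `parse_date` are hypothetical helpers, and the day-first date order is an assumption that depends on where the documents come from.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional

def parse_amount(raw: Optional[str]) -> Optional[Decimal]:
    """Convert a captured amount such as '1,250.00' to Decimal, or None if unusable."""
    if not raw:
        return None
    try:
        return Decimal(raw.replace(",", ""))
    except InvalidOperation:
        return None

def parse_date(raw: Optional[str], day_first: bool = True) -> Optional[date]:
    """Parse dates like 05/03/2025 or 5-3-25; day_first is a locale assumption."""
    if not raw:
        return None
    cleaned = raw.replace("-", "/")
    order = "%d/%m" if day_first else "%m/%d"
    # Try a four-digit year first, then fall back to a two-digit year.
    for year in ("%Y", "%y"):
        try:
            return datetime.strptime(cleaned, f"{order}/{year}").date()
        except ValueError:
            continue
    return None
```

Using `Decimal` rather than `float` avoids rounding surprises when amounts are later compared with ERP records.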

Stage 6: Validation & ERP Push

Validate extracted fields against business rules:

  • Amount is a valid number
  • Due date is not in the past (for invoices)
  • Supplier ID exists in ERP

If validation passes, push to ERP via REST webhook. If not, route to review UI.
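
The three rules above can be sketched as a function that returns a list of violations. This is an illustration under stated assumptions: the date rule is applied to the due date, `known_suppliers` is an in-memory set standing in for a real ERP lookup, and `validate_invoice` is a hypothetical name.

```python
from datetime import date
from decimal import Decimal
from typing import Optional

def validate_invoice(
    amount: Optional[Decimal],
    due_date: Optional[date],
    supplier_id: Optional[str],
    known_suppliers: set[str],
    today: Optional[date] = None,
) -> list[str]:
    """Return rule violations; an empty list means the record can be pushed."""
    today = today or date.today()
    errors = []
    # Rule 1: the amount parsed to a real (finite) number.
    if amount is None or not amount.is_finite():
        errors.append("amount is not a valid number")
    # Rule 2: the due date is present and not in the past.
    if due_date is None or due_date < today:
        errors.append("due date is missing or in the past")
    # Rule 3: the supplier exists (stand-in for an ERP query).
    if supplier_id not in known_suppliers:
        errors.append("supplier ID not found in ERP")
    return errors
```

Returning every violation instead of stopping at the first gives the review UI something specific to show the operator.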

Infrastructure Stack

  • FastAPI — async REST API for document ingestion
  • Redis Queue (RQ) — background job processing
  • PostgreSQL — document metadata and extracted fields storage
  • S3 — original document storage
  • Docker — containerised deployment

Handling the Review Queue

Low-confidence extractions go to a human review UI where operators:

  • See the original document alongside extracted fields
  • Correct any errors
  • Approve and push to ERP
  • These corrections feed back into model fine-tuning
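
To make the feedback loop concrete, the sketch below turns one reviewed document into training records by keeping only the fields the operator changed. The record layout and the name `correction_records` are hypothetical; how the records reach a fine-tuning job is out of scope here.

```python
def correction_records(document_id: str, extracted: dict, corrected: dict) -> list[dict]:
    """List the fields an operator changed, as candidate training examples."""
    return [
        {
            "document_id": document_id,
            "field": field,
            "predicted": extracted.get(field),
            "label": value,
        }
        for field, value in corrected.items()
        if extracted.get(field) != value
    ]

# Example: OCR read a zero as the letter O in the invoice number.
records = correction_records(
    "doc-42",
    extracted={"invoice_number": "INV-O01", "amount_due": "120.50"},
    corrected={"invoice_number": "INV-001", "amount_due": "120.50"},
)
```

Storing the original prediction next to the corrected label also gives the audit trail that manual processing lacks.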

Conclusion

AI document processing delivers ROI within months for high-volume operations. The key investment is in the extraction and validation layers — getting those right is what separates a 60% accurate prototype from a 96%+ production system.
