OCR Key-Value Extraction from PDFs: What Works in 2025

By Codexal • · Updated

Many business workflows depend on extracting key-value pairs (KVP)—like Name → John, Invoice No → 1234—from PDFs. Below is a concise, practical guide covering OCR engines, layout analysis, modern KIE models, and cloud services, plus tips to maximize accuracy. Need a production-ready pipeline? Explore Codexal AI Services and Integrations & DevOps.

Approaches to Key-Value Extraction

  1. Template / rule-based: fixed forms where keys appear in known locations; fast and reliable for stable layouts.
  2. OCR + layout analysis: run OCR, detect blocks/lines, then pair nearest label → value using coordinates and heuristics (e.g., right-of / below). Tesseract PSM and page-layout analysis matter here.
  3. ML KIE models: models like LayoutLM (OCR-based) or Donut (OCR-free) learn to output fields directly, robust to layout variance.
  4. Cloud services: Azure Document Intelligence & Google Document AI provide ready KVP extraction and table/checkbox parsing via API.

Popular Tools & Models

  • Tesseract OCR: open-source engine with Page Segmentation Modes (--psm) that strongly affect accuracy and layout grouping.
    Tip: try --psm 6 for uniform blocks or --psm 4/11/12/13 depending on layout.
  • PaddleOCR: production-ready OCR with KIE components/datasets like FUNSD; active docs and tooling.
  • LayoutLM family: widely used for KIE (needs OCR text + positions).
  • Donut (OCR-free Document Understanding Transformer): parses fields end-to-end without external OCR.
  • pdfplumber (born-digital PDFs): layout-aware text & table extraction; watch out for ligatures/encodings.

Cloud APIs for KVP

Azure Document Intelligence (General document) extracts key-value pairs, tables and selection marks in one call. Google Document AI Form Parser also returns form fields as KVP and tables, with JSON coordinates for post-processing.

Accuracy Tips (PDFs & Scans)

  • Preprocess scans: deskew, denoise, increase contrast; ensure 300+ DPI.
  • Choose PSM wisely: PSM impacts word grouping/lines; experiment and validate.
  • Handle ligatures/encodings: PDFs may encode fi/fl oddly; normalize text or use char maps before matching.
  • Use coordinates: pair label→value by proximity (same line/right/below) with thresholds.
  • Hybrid pipeline: fall back to templates when layout confidence drops; log fields with low OCR confidence for review.

Build with Codexal

Codexal delivers end-to-end IDP (Intelligent Document Processing): OCR, KIE models, cloud integrations, and dashboards. See AI, Web & Mobile Apps, and Analytics. Contact us via Contact.

Need a pilot? We can compare Tesseract/PaddleOCR vs. Azure/Google on your forms, then fine-tune a LayoutLM/Donut model if needed.