OCR Key-Value Extraction from PDFs: What Works in 2025
Many business workflows depend on extracting key-value pairs (KVP)—like Name → John, Invoice No → 1234—from PDFs. Below is a concise, practical guide covering OCR engines, layout analysis, modern KIE models, and cloud services, plus tips to maximize accuracy. Need a production-ready pipeline? Explore Codexal AI Services and Integrations & DevOps.
Approaches to Key-Value Extraction
- Template / rule-based: fixed forms where keys appear in known locations; fast and reliable for stable layouts.
- OCR + layout analysis: run OCR, detect blocks/lines, then pair nearest label → value using coordinates and heuristics (e.g., right-of / below). Tesseract PSM and page-layout analysis matter here.
- ML KIE models: models like LayoutLM (OCR-based) or Donut (OCR-free) learn to output fields directly, robust to layout variance.
- Cloud services: Azure Document Intelligence & Google Document AI provide ready KVP extraction and table/checkbox parsing via API.
Popular Tools & Models
- Tesseract OCR: open-source engine with Page Segmentation Modes (--psm) that strongly affect accuracy and layout grouping.
Tip: try --psm 6 for uniform blocks or --psm 4/11/12/13 depending on layout. - PaddleOCR: production-ready OCR with KIE components/datasets like FUNSD; active docs and tooling.
- LayoutLM family: widely used for KIE (needs OCR text + positions).
- Donut (OCR-free Document Understanding Transformer): parses fields end-to-end without external OCR.
- pdfplumber (born-digital PDFs): layout-aware text & table extraction; watch out for ligatures/encodings.
Cloud APIs for KVP
Azure Document Intelligence (General document) extracts key-value pairs, tables and selection marks in one call. Google Document AI Form Parser also returns form fields as KVP and tables, with JSON coordinates for post-processing.
Accuracy Tips (PDFs & Scans)
- Preprocess scans: deskew, denoise, increase contrast; ensure 300+ DPI.
- Choose PSM wisely: PSM impacts word grouping/lines; experiment and validate.
- Handle ligatures/encodings: PDFs may encode fi/fl oddly; normalize text or use char maps before matching.
- Use coordinates: pair label→value by proximity (same line/right/below) with thresholds.
- Hybrid pipeline: fall back to templates when layout confidence drops; log fields with low OCR confidence for review.
Build with Codexal
Codexal delivers end-to-end IDP (Intelligent Document Processing): OCR, KIE models, cloud integrations, and dashboards. See AI, Web & Mobile Apps, and Analytics. Contact us via Contact.