Technology
Tesseract
Tesseract is the industry-standard open-source OCR engine supporting text extraction for over 100 languages.
Originally developed by Hewlett-Packard (1985 to 1994) and maintained by Google since 2006, Tesseract is a highly versatile Optical Character Recognition (OCR) engine. The current 5.x releases utilize a Long Short-Term Memory (LSTM) neural network to achieve superior accuracy across diverse document layouts. It processes standard image formats (PNG, JPEG, TIFF) and outputs results in multiple formats: plain text, hOCR (HTML), and searchable PDFs. Developers integrate its capabilities via the libtesseract C++ API or popular wrappers like pytesseract for Python. It remains the primary choice for high-volume digitization projects and automated data entry pipelines.
Related technologies
Recent Talks & Demos
Showing 1-8 of 8