Technology
Tesseract OCR
Tesseract OCR is a widely used open-source Optical Character Recognition engine (Apache 2.0 license), originally developed by HP and later sponsored by Google, that recognizes over 100 languages.
Tesseract is an OCR engine developed at Hewlett-Packard between 1985 and 1994 and released as open source in 2005. Google sponsored its development from 2006 to 2018, significantly advancing its capabilities. The current stable version, Tesseract 5, incorporates an LSTM (Long Short-Term Memory) neural network for line recognition, a major upgrade from the legacy engine, which matched character patterns. It runs as a command-line program and as a library, supports over 100 languages, and offers multiple output formats (e.g., plain text, PDF, TSV). Developers commonly use it through wrappers such as Pytesseract, which integrates it into Python code.
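As a rough illustration of the command-line usage described above, the following Python sketch builds and runs a tesseract invocation that requests several output formats at once. It calls the command-line program directly through the standard library rather than through Pytesseract. The function names build_tesseract_command and run_ocr and the file names are hypothetical and chosen for illustration only; the sketch assumes the tesseract binary is on PATH and that trained data for the requested language is installed.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", formats=("txt",)):
    """Build the argument list for the tesseract command-line program.

    The trailing names (txt, pdf, tsv) are tesseract config files that
    select the output format; each one produces output_base.<format>.
    """
    # "--oem 1" asks for the LSTM engine only; drop it to use the default.
    cmd = ["tesseract", image_path, output_base, "-l", lang, "--oem", "1"]
    cmd.extend(formats)
    return cmd


def run_ocr(image_path, output_base, lang="eng", formats=("txt",)):
    """Run tesseract and return the paths of the files it should write."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    cmd = build_tesseract_command(image_path, output_base, lang, formats)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return [f"{output_base}.{fmt}" for fmt in formats]


if __name__ == "__main__":
    # Hypothetical input file; replace with a real scanned image.
    print(run_ocr("scan.png", "scan_out", lang="eng", formats=("txt", "tsv")))
```

Keeping command construction separate from execution makes the invocation easy to inspect or log before anything runs. A wrapper such as Pytesseract hides this step behind Python functions, which is usually more convenient when the recognized text is needed in memory rather than on disk.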