Technology

Tesseract

Tesseract is a widely used open-source OCR engine that supports text extraction in over 100 languages.

Originally developed at Hewlett-Packard between 1985 and 1994 and released as open source in 2005, Tesseract is an Optical Character Recognition (OCR) engine whose development Google sponsored from 2006 to 2018; it is now maintained by its open-source community. Since version 4, and in the current 5.x releases, the default recognition engine is a Long Short-Term Memory (LSTM) neural network that recognizes whole text lines, which improves accuracy over the older character-pattern engine. It reads common image formats (PNG, JPEG, TIFF) and writes plain text, hOCR (HTML), and searchable PDF, among other formats. Developers integrate it through the libtesseract C and C++ APIs or through wrappers such as pytesseract for Python. It is a common choice for high-volume digitization projects and automated data-entry pipelines.
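As a rough illustration of the Python route, the sketch below uses pytesseract to produce the three output formats named above from a single image. It assumes the Tesseract binary, pytesseract, and Pillow are installed; the function names `ocr_image` and `save_outputs` and the file name `scan.png` are hypothetical, chosen only for this example.

```python
from pathlib import Path


def ocr_image(image_path, lang="eng"):
    """Return plain-text, hOCR, and searchable-PDF output for one image as bytes."""
    # Imported lazily so the module loads even where the OCR stack is missing.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return {
            # image_to_string returns str; encode so every value is bytes.
            "txt": pytesseract.image_to_string(image, lang=lang).encode("utf-8"),
            # image_to_pdf_or_hocr returns bytes for both extensions.
            "hocr": pytesseract.image_to_pdf_or_hocr(image, lang=lang, extension="hocr"),
            "pdf": pytesseract.image_to_pdf_or_hocr(image, lang=lang, extension="pdf"),
        }


def save_outputs(outputs, output_base):
    """Write each output to '<output_base>.<extension>' and return the paths."""
    written = []
    for extension, data in outputs.items():
        target = Path(f"{output_base}.{extension}")
        target.write_bytes(data)
        written.append(target)
    return written
```

A call such as `save_outputs(ocr_image("scan.png"), "scan")` would then write `scan.txt`, `scan.hocr`, and `scan.pdf`. The `lang` value must match a language data file that is installed alongside Tesseract.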

https://github.com/tesseract-ocr/tesseract
8 projects · 9 cities
