Tesseract OCR

Tesseract OCR is an open-source optical character recognition engine (Apache 2.0 license), originally developed at HP and later sponsored by Google, that recognizes over 100 languages.

Tesseract is an OCR engine developed at Hewlett-Packard between 1985 and 1994 and released as open source in 2005. Google sponsored its development from 2006 to 2018. Since version 4, Tesseract has included an LSTM (Long Short-Term Memory) neural network engine that recognizes whole text lines, alongside the legacy engine that matches character patterns; Tesseract 5 is the current stable series. It runs as a command-line program and as a library (libtesseract), supports over 100 languages through installable trained-data files, and can write several output formats (e.g., plain text, PDF, TSV). Python developers commonly use it through the Pytesseract wrapper.
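As a rough illustration of the command-line usage described above, the sketch below builds and runs a single tesseract call from Python using only the standard library. The helper names (build_tesseract_command, run_ocr) and the file names are hypothetical, and the sketch assumes the tesseract executable and the requested language data are already installed. It is one possible way to drive the CLI, not the project's recommended integration.

```python
import shutil
import subprocess
from pathlib import Path

# Output "config" names the tesseract CLI accepts for common formats.
KNOWN_FORMATS = ("txt", "pdf", "tsv", "hocr")


def build_tesseract_command(image_path, output_base, lang="eng", formats=("txt",)):
    """Return the argv list for one tesseract CLI call (nothing is executed)."""
    unknown = [f for f in formats if f not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    # Pattern: tesseract <image> <outputbase> -l <lang> --oem 1 <format>...
    # --oem 1 selects the LSTM engine only.
    return ["tesseract", str(image_path), str(output_base),
            "-l", lang, "--oem", "1", *formats]


def run_ocr(image_path, output_base, lang="eng", formats=("txt",)):
    """Run tesseract and return the paths of the files it is expected to write."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = build_tesseract_command(image_path, output_base, lang, formats)
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(cmd, check=True, capture_output=True, text=True)
    # tesseract appends the format name as the file extension.
    return [Path(f"{output_base}.{fmt}") for fmt in formats]
```

For example, run_ocr("scan.png", "out", lang="eng", formats=("txt", "pdf")) would be expected to write out.txt and out.pdf next to the working directory. Several languages can be combined in the lang argument with a plus sign, such as "eng+deu". Pytesseract wraps the same executable, so it needs the same installation.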

https://tesseract-ocr.github.io/

3 projects · 3 cities

Related technologies

GPT-4 (528), Python (618), ADB (1), Android Debug Bridge (1), BERT (179), BLOOM (115), Cursor (59), FFmpeg (14), GPT-3 (191), Llama-2 (227), OpenAI API (509), OpenCV (22), PaddleOCR (2), PaLM 2 (116), PP-DocLayout-L (1), RoBERTa (118), Tampermonkey (1)

Recent Talks & Demos

Cursor Personal Knowledge Management Tool

Nairobi Sep 25

Cursor Tampermonkey

PaddlePaddle: Structuring Legal Docs