Grobid
Grobid (GeneRation Of BIbliographic Data) is a machine learning library that extracts, parses, and restructures raw scientific PDF documents into structured TEI/XML.
Grobid is an open-source machine learning library (Apache License 2.0) specialized in converting unstructured PDF-based scientific and technical publications into structured TEI/XML. The system uses a cascade of sequence labeling models, including Conditional Random Fields (CRF) and deep learning architectures such as BERT-CRF, to identify and parse over 55 fine-grained structures, including the title, authors, affiliations, abstract, full-text body, and reference citations. It reports high accuracy, for example an F1-score of around 0.87 for reference extraction on independent datasets. Major institutions such as ResearchGate, Academia.edu, and the European Patent Office use Grobid for large-scale document ingestion, which demonstrates its production readiness and scalability (e.g., processing 4,000 PDFs in approximately 26 minutes, or roughly 2.5 PDFs per second).
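To make the PDF-to-TEI/XML flow concrete, here is a minimal Python sketch that uses only the standard library. It assumes a Grobid service is already running locally on its default port (8070) and exposes the `/api/processFulltextDocument` endpoint, which accepts the PDF in a multipart field named `input`. The helper names `pdf_to_tei` and `summarize_tei` are hypothetical and chosen for illustration, and the element paths reflect the usual layout of Grobid's TEI output, which may vary between versions.

```python
import uuid
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by TEI documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Assumption: a Grobid service is running locally on its default port.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def pdf_to_tei(pdf_path, url=GROBID_URL, timeout=120):
    """POST one PDF to a Grobid service and return the raw TEI/XML bytes."""
    boundary = uuid.uuid4().hex
    with open(pdf_path, "rb") as fh:
        pdf_bytes = fh.read()
    # Build the multipart/form-data body by hand; the PDF goes in the "input" field.
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="input"; filename="paper.pdf"\r\n',
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def _text(element):
    """Join the text fragments of an element, separated by single spaces."""
    return " ".join(t.strip() for t in element.itertext() if t.strip())


def summarize_tei(tei_xml):
    """Extract the title, header authors, and reference count from TEI/XML."""
    root = ET.fromstring(tei_xml)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    # Header authors are assumed to sit under sourceDesc/biblStruct/analytic.
    author_path = (
        ".//tei:sourceDesc/tei:biblStruct/tei:analytic/tei:author/tei:persName"
    )
    authors = [_text(p) for p in root.findall(author_path, TEI_NS)]
    # Each parsed bibliography entry is a biblStruct inside listBibl.
    references = root.findall(".//tei:listBibl/tei:biblStruct", TEI_NS)
    return {
        "title": _text(title) if title is not None else None,
        "authors": authors,
        "reference_count": len(references),
    }
```

With a service running, `summarize_tei(pdf_to_tei("paper.pdf"))` would return a small dictionary for a hypothetical `paper.pdf`. For bulk ingestion at the scale mentioned above, a dedicated client with concurrency and retry handling is likely a better fit than this single-request sketch.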