Technology
Oscar
OSCAR is an open-source high-performance data pipeline designed to classify and filter massive multilingual web corpora for AI training.
The Open Super-large Crawled Aggregated coRpus (OSCAR) project provides the foundational raw data needed to pre-train large-scale deep learning models. By utilizing a custom Rust-based pipeline called Ungoliant, the system processes petabytes of Common Crawl data into a clean, deduplicated format covering 166 different languages. This infrastructure allows researchers to bypass the high cost of data acquisition: offering high-quality datasets that drive advancements in natural language processing and ensure low-resource languages are represented in the global AI ecosystem.
Related technologies
Recent Talks & Demos
Showing 1-1 of 1