SlimPajama

SlimPajama is a 627-billion-token, globally deduplicated dataset derived from RedPajama, intended as dense, compute-efficient data for LLM pre-training.

SlimPajama-627B is a dataset for efficient LLM pre-training. Developed by Cerebras, it was created by cleaning and globally deduplicating the original 1.21-trillion-token RedPajama corpus. Filtering out low-value documents and removing duplicates cut 49.6% of the bytes, leaving a 627-billion-token dataset with higher information density. The corpus still spans diverse sources such as code, books, and Wikipedia, so training on it is meant to preserve model quality and generalization while spending less of the compute budget on redundant text.

https://huggingface.co/datasets/cerebras/SlimPajama-627B
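
To make "cleaning and globally deduplicating" concrete, here is a minimal sketch of the idea. It is not Cerebras's pipeline: the description above does not specify the method, so exact-match hashing after light normalization is swapped in for simplicity, and the 200-character cutoff and the names slim, normalize, and bytes_removed are assumptions chosen for illustration.

```python
import hashlib
import re

MIN_CHARS = 200  # assumed cutoff for "low-value" short documents (hypothetical)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so trivial variants match."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def slim(docs, min_chars=MIN_CHARS):
    """Yield documents that pass the length filter and do not repeat an earlier one."""
    seen = set()  # a single set shared by every source makes the deduplication "global"
    for doc in docs:
        norm = normalize(doc)
        if len(norm) < min_chars:
            continue  # too short after normalization: treated as low value
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same normalized content was already kept
        seen.add(digest)
        yield doc


def bytes_removed(before, after):
    """Fraction of UTF-8 bytes dropped between two document lists."""
    total = sum(len(d.encode("utf-8")) for d in before)
    kept = sum(len(d.encode("utf-8")) for d in after)
    return 1 - kept / total
```

Calling list(slim(corpus)) on a list of strings returns the surviving documents in their original order, and bytes_removed(corpus, survivors) gives the same kind of figure as the 49.6% quoted above. A production pipeline at this scale would more likely need near-duplicate detection and a distributed index instead of an in-memory set.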

Related technologies

DeepSpeed, Mamba, State Space Model, Transformers

Recent Talks & Demos

Mamba Long Context Training

Austin Apr 11

Mamba DeepSpeed