Technology

TinyStories

TinyStories is a synthetic dataset, generated with GPT-3.5 and GPT-4, that trains Small Language Models (SLMs) with fewer than 10 million parameters to produce fluent, coherent English text and demonstrate basic reasoning.

TinyStories is a synthetic dataset, generated with GPT-3.5 and GPT-4, that challenged assumptions about language model scaling. It comprises millions of short stories restricted to the vocabulary a typical 3-4 year old would understand; this constraint is the key. Researchers trained Small Language Models (SLMs), some with fewer than 10 million parameters or only a single transformer block, on this data. These SLMs consistently generate fluent, grammatically sound, multi-paragraph stories and show early reasoning capabilities. The project demonstrates that language coherence emerges at a far smaller scale than previously assumed, enabling efficient, low-resource AI development.
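To give a feel for how small "fewer than 10 million parameters" really is, here is a minimal sketch that counts the weight-matrix parameters of a one-block, GPT-style transformer. The dimensions used (vocabulary of 10,000, hidden size 256, feed-forward size 1,024) are illustrative assumptions, not the exact configurations from the TinyStories paper:

```python
def transformer_params(vocab_size: int, d_model: int, n_layers: int, d_ff: int) -> int:
    """Rough parameter count for a GPT-style decoder (weight matrices only;
    biases, layer norms, and positional embeddings are small and omitted)."""
    embedding = vocab_size * d_model          # token embedding table
    attention = 4 * d_model * d_model         # Q, K, V, and output projections
    mlp = 2 * d_model * d_ff                  # up- and down-projections
    return embedding + n_layers * (attention + mlp)

# Hypothetical one-block model in the TinyStories size range:
n = transformer_params(vocab_size=10_000, d_model=256, n_layers=1, d_ff=1024)
print(n)            # roughly 3.3 million parameters
print(n < 10_000_000)
```

Even doubling the hidden size keeps such a model well under the scale of mainstream LLMs, which is what makes the TinyStories result striking.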

https://huggingface.co/datasets/roneneldan/TinyStories
1 project · 1 city

Related technologies

Recent Talks & Demos
