Datasets
14 resources
Common Crawl
Common Crawl
Petabyte-scale open web crawl dataset, with new crawls published roughly monthly. Filtered Common Crawl data is a major component of the pretraining corpora for many large language models, including GPT-3 and LLaMA.
Web Crawl, NLP, Pretraining, Large Scale
FineWeb
Hugging Face
Hugging Face's 15-trillion-token web dataset derived from Common Crawl with aggressive deduplication and filtering. In Hugging Face's reported ablations, models trained on it outperform models trained on other open web datasets on aggregate benchmark scores.
Web Crawl, Pretraining, High Quality, Web
LAION-5B
LAION
Large-scale open dataset of 5.85 billion image-text pairs scraped from the web. Subsets of it were used to train Stable Diffusion and other vision-language models.
Image-Text, Multimodal, Pretraining, Image