Datasets

14 resources

Datasets

COCO (Common Objects in Context)

Microsoft

A large-scale object detection, segmentation, and captioning dataset with over 330K images, 1.5 million object instances, and 80 object categories.

Computer Vision, Object Detection, Image Segmentation, Image Captioning

Website GitHub Docs

Datasets

COCO Dataset

Microsoft COCO

Common Objects in Context (COCO) is a large-scale object detection, segmentation, and captioning dataset. The dataset contains over 330K images with 80 object categories.

Computer Vision, Object Detection, Segmentation, Image Captioning

Website Docs

Datasets

Common Crawl

Petabyte-scale open web crawl dataset, with new crawls released roughly monthly. A major source of pretraining text for many large language models, including GPT-3 and LLaMA.

Web Crawl, NLP, Pretraining, Large Scale

Website

Datasets

Dolma

Allen Institute for AI

Open dataset of 3 trillion tokens used to train OLMo. Combines web text, code, scientific papers, books, and Wikipedia with full data provenance and filtering details.

Pretraining, Open Source, Diverse, Large Scale

Website GitHub

Datasets

FineWeb

Hugging Face

A 15-trillion-token web dataset derived from Common Crawl with aggressive deduplication and filtering. Its authors report that models trained on it outperform models trained on other open web datasets on benchmark evaluations.

Web Crawl, Pretraining, High Quality, Web

Website

Datasets

Hugging Face Datasets Hub

Hugging Face

The largest hub for open machine learning datasets. Browse, search, and load 100,000+ datasets for NLP, vision, audio, and more with the datasets library.

NLP, Computer Vision, Audio, Datasets

Website Docs
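
As a rough illustration of loading a Hub dataset with the datasets library, the sketch below streams the first few rows of a dataset without downloading it in full. The helper name `peek_dataset` and the example repository id `rajpurkar/squad` are assumptions made for this sketch, not something the listing specifies. Running it requires the `datasets` package and network access.

```python
def peek_dataset(name, split="train", n=3):
    """Return the first n rows of a Hub dataset as a list of dicts.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so the module can be loaded without the library installed.
    from datasets import load_dataset

    # streaming=True yields rows on demand instead of downloading every file.
    stream = load_dataset(name, split=split, streaming=True)
    return list(stream.take(n))


if __name__ == "__main__":
    # Assumed repository id; replace with any dataset found on the Hub.
    for row in peek_dataset("rajpurkar/squad"):
        print(row["question"])
```

Streaming is a reasonable default for a first look at very large datasets, since nothing beyond the requested rows is fetched.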

Datasets

LAION-5B

LAION

Large-scale open dataset of 5.85 billion image-text pairs scraped from the internet, used to train Stable Diffusion and other vision-language models.

Image-Text, Multimodal, Pretraining, Image

Website

Datasets

nuScenes

Motional

A large-scale autonomous driving dataset of 1,000 scenes, each 20 seconds long, recorded with 6 cameras, 5 radars, and 1 lidar and annotated with 3D bounding boxes.

Autonomous Driving, Lidar, Object Detection, Sensor Fusion

Website GitHub Docs

Datasets

OpenAssistant Conversations

LAION

Human-generated, human-annotated assistant-style conversation corpus with 161,443 messages in 35 languages for training RLHF-based models.

Instruction Tuning, RLHF, Conversations, Instruction

Website GitHub

Datasets

RedPajama-Data

Together AI

Open reproduction of the LLaMA training dataset (1.2 trillion tokens) across 7 data sources. Enables fully open LLM pretraining research.

Pretraining, Open Source, LLaMA, Large Scale

Website GitHub

Datasets

Roboflow 100

Roboflow

A curated collection of 100 diverse computer vision datasets spanning multiple domains like agriculture, retail, and industrial settings.

Computer Vision, Object Detection, Multi-domain, Robotics

Website Docs

Datasets

Stanford Alpaca Dataset

Stanford

Self-instruct dataset of 52,000 instruction-following examples generated with OpenAI's text-davinci-003 (a GPT-3.5-series model). It was used to fine-tune the original Alpaca model and helped spark the open instruction-tuning movement.

Instruction Tuning, Self-instruct, Fine-tuning, Stanford

Website GitHub

Datasets

The Stack

BigCode

Large dataset of permissively licensed source code from GitHub (6.4TB) covering 358 programming languages. Used for training code LLMs like StarCoder.

Code, Programming, Pretraining, GitHub

Website GitHub

Datasets

The Stanford Question Answering Dataset (SQuAD)

Stanford University

A reading comprehension dataset consisting of questions posed on Wikipedia articles, where the answer is a segment of text from the corresponding article.

NLP, Question Answering, Reading Comprehension, Benchmark

Website GitHub
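
To make the "answer is a segment of text from the article" idea concrete, here is a minimal sketch of an extractive question-answering record. The class name `QAExample`, its field names, and the sample passage are hypothetical and chosen for illustration. The actual SQuAD files use their own JSON layout.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    """Hypothetical record for extractive question answering."""
    context: str        # passage the question is asked about
    question: str
    answer_text: str    # answer copied verbatim from the passage
    answer_start: int   # character offset of the answer in the passage


def answer_span_is_valid(ex: QAExample) -> bool:
    """Check that the answer really is a slice of the context."""
    end = ex.answer_start + len(ex.answer_text)
    return ex.context[ex.answer_start:end] == ex.answer_text


# Made-up passage, not taken from the dataset.
context = "The Eiffel Tower is located in Paris."
example = QAExample(
    context=context,
    question="Where is the Eiffel Tower located?",
    answer_text="Paris",
    answer_start=context.find("Paris"),
)

print(answer_span_is_valid(example))
```

Storing the character offset alongside the answer text lets a loader detect records whose answer no longer matches the passage, for example after whitespace normalization.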