Getting Started
Overview
DataComp is a competition about designing multimodal datasets.
As a participant, your task is to create a pre-training dataset of image-text pairs that yields a CLIP model with high accuracy on downstream tasks.
Unlike traditional benchmarks, in DataComp the model architecture and hyperparameters are fixed, and your task is to innovate on the dataset design.
As part of the benchmark, we provide CommonPool, a large collection of 12.8B image-text pairs crawled from the public internet.
Our benchmark offers two tracks: one where participants must use only samples from the pools we provide (filtering), and another where participants can use external data, including samples from our pool (Bring your own data, BYOD).
DataComp is designed to accommodate various levels of computational resources: each track is broken down into four scales, spanning several orders of magnitude of compute.
Our codebase is available at github.com/mlfoundations/datacomp.
Install dependencies
To start, clone the repository and install the dependencies.
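For example (the datacomp directory name below is simply the default created by git clone):
git clone https://github.com/mlfoundations/datacomp.git
cd datacomp
Then create the conda environment: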
conda env create -f environment.yml
To activate the environment:
conda activate datacomp
Downloading CommonPool
To download CommonPool, run the following command, replacing $scale with the competition scale (i.e. small, medium, large, or xlarge) and $data_dir with the output directory where you want the data to be stored.
python download_upstream.py --scale $scale --data_dir $data_dir
There are four scales in our competition:
small: 12.8M pool size, 12.8M examples seen
medium: 128M pool size, 128M examples seen
large: 1.28B pool size, 1.28B examples seen
xlarge: 12.8B pool size, 12.8B examples seen
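For example, to download the small-scale pool to a local directory (the directory path below is purely illustrative):
python download_upstream.py --scale small --data_dir ./datacomp_data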
The data is stored in shards, which are tar files containing the images and captions, designed to be consumed with webdataset.
Once the download finishes, the data will be available at $data_dir/shards.
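As a minimal sketch of how the shards can be read, the snippet below iterates over image-caption pairs with the webdataset library. The shard filename range and the per-sample keys (jpg/png for images, txt for captions) are assumptions based on common webdataset conventions, so check your downloaded shards for the exact names.

import webdataset as wds

# Illustrative shard pattern; adjust the range to match the files in $data_dir/shards.
shard_pattern = "./datacomp_data/shards/{00000000..00000127}.tar"

dataset = (
    wds.WebDataset(shard_pattern)
    .decode("pil")               # decode image bytes into PIL images
    .to_tuple("jpg;png", "txt")  # yield (image, caption) pairs
)

for image, caption in dataset:
    print(image.size, caption[:80])
    break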
We offer options for selecting subsets of the downloaded pool; see the repository documentation for details.
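For illustration only, a simple subset selection could look like the sketch below, which keeps the uids whose image-text similarity exceeds a threshold using the per-sample metadata. The metadata directory layout and the column names (uid, clip_b32_similarity_score) are assumptions, and the exact subset-file format expected by the training pipeline is described in the repository.

import glob
import numpy as np
import pandas as pd

metadata_dir = "./datacomp_data/metadata"  # assumed location of the metadata parquet files
threshold = 0.3                            # illustrative similarity cutoff

selected = []
for path in sorted(glob.glob(f"{metadata_dir}/*.parquet")):
    # Column names are assumptions; inspect the parquet schema to confirm them.
    df = pd.read_parquet(path, columns=["uid", "clip_b32_similarity_score"])
    selected.append(df.loc[df["clip_b32_similarity_score"] > threshold, "uid"].to_numpy())

subset_uids = np.concatenate(selected)
np.save("subset_uids.npy", subset_uids)  # see the repo for the exact subset format expected downstream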
Training
To train, run the following command:
torchrun --nproc_per_node $num_gpus train.py --scale $scale --data_dir $data_dir --output_dir $output_dir --exp_name $exp_name
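For example, to train at the small scale on a machine with 8 GPUs (the output directory and experiment name below are illustrative):
torchrun --nproc_per_node 8 train.py --scale small --data_dir ./datacomp_data --output_dir ./outputs --exp_name baseline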
Evaluating
To evaluate, run the following command:
python evaluate.py --train_output_dir $train_output_dir/$exp_name
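For example, if the training run above wrote its outputs to ./outputs/baseline:
python evaluate.py --train_output_dir ./outputs/baseline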
Submitting
See our submission instructions. Good luck!