DataComp-LM


Welcome to DataComp, the machine learning benchmark where the models are fixed and the challenge is to find the best possible data!

Prior competitions in machine learning have focused on finding the best model given a fixed set of training and test data. However, many recent advances (GPT-4, Gemini, Llama, Mistral) are due in part to large and diverse language datasets. DataComp centers the role that data plays by fixing the training code and encouraging researchers to innovate by proposing new training sets. We provide an experimental testbed centered around a new candidate pool of 240T tokens from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources, then evaluate them by running our standardized language model training code followed by an evaluation on 53 downstream tasks. Our benchmark consists of multiple scales, with various candidate pool sizes and associated compute budgets, spanning model sizes from 412M to 7B parameters. This multi-scale design facilitates the study of scaling trends and makes the benchmark accessible to researchers with varying resources.
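To make the workflow concrete, here is a minimal sketch of what a filtering-track submission might look like: apply a document-level filter to the candidate pool and write out the resulting training set. The file names, the `keep_document` heuristic, and the word-count threshold are all illustrative assumptions, not part of the official DCLM tooling.

```python
import json

def keep_document(doc: dict, min_words: int = 50) -> bool:
    """Toy quality filter: keep documents with at least `min_words` words."""
    return len(doc.get("text", "").split()) >= min_words

def filter_pool(input_path: str, output_path: str) -> None:
    """Read candidate documents (JSONL), apply the filter, write the training set."""
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            doc = json.loads(line)
            if keep_document(doc):
                dst.write(json.dumps(doc) + "\n")

if __name__ == "__main__":
    filter_pool("dclm_pool_shard.jsonl", "filtered_shard.jsonl")
```

The resulting dataset is then fed to the fixed training code and scored on the downstream evaluations, so the only lever a participant controls is the data that survives this step.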

Paper   Code   Data

FAQ

Can I include a piece of data more than once in training?
Yes! For the DCLM filtering track, you can do this by simply including a document multiple times in your curated dataset, as sketched below.
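A hypothetical sketch of what "including a document multiple times" amounts to: the document is just repeated when the training set is written out. The document names and repeat counts here are made up for illustration.

```python
documents = ["doc_a", "doc_b", "doc_c"]
repeats = {"doc_b": 3}  # up-weight doc_b by emitting it three times

training_set = []
for doc in documents:
    training_set.extend([doc] * repeats.get(doc, 1))

print(training_set)  # ['doc_a', 'doc_b', 'doc_b', 'doc_b', 'doc_c']
```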

Can I change the HTML extraction method?
Yes! But if you want to participate in the filtering track, please use the exact same WARC files as used in DCLM-Pool.
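For example, a custom extraction step can run over the same WARC files while swapping in a different HTML-to-text converter. The sketch below uses warcio and resiliparse purely as example libraries; it is not the required or official extraction pipeline.

```python
from warcio.archiveiterator import ArchiveIterator
from resiliparse.extract.html2text import extract_plain_text

def extract_texts(warc_path: str):
    """Yield plain text extracted from HTML response records in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            yield extract_plain_text(html, main_content=True)
```

The key constraint is that the input stays fixed to the DCLM-Pool WARC files; only the extraction logic applied to those records may change.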

Can we use the same filtering algorithm to enter multiple tracks/scales?
Yes! We encourage participation in both tracks and multiple scales.

Is any data forbidden from use in the Mixing Track?
The only data that is explicitly forbidden is the test documents from the evaluation tasks and any data that cannot be released publicly. Additionally, we require that external data meet our own safety standards; data that does not may be excluded.