
The Pile (dataset)

The Pile is an 886 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. It is composed of 22 component sub-datasets. As of 2024, the Pile and Common Crawl were the two main training datasets used to train AI models.

Training on copyrighted works or derivatives
The Books3 component of the dataset contains copyrighted material compiled from Bibliotik, a pirate website. In July 2023, the Danish anti-piracy group Rights Alliance took down Books3 through DMCA notices. Tens of thousands of YouTube videos had their subtitles scraped directly from YouTube and included in the Pile, which YouTube argued was against its terms of service.
Common Pile v0.1
In June 2025, EleutherAI released Common Pile v0.1, a training dataset containing only works whose licenses permit their use for training AI models. The release was produced in partnership with Poolside, Hugging Face, and the US Library of Congress, along with over two dozen researchers at 14 institutions, including the University of Toronto, MIT, CMU, the Vector Institute, and the Allen Institute for AI. The intent was to show what is possible when AI systems are trained ethically, with respect for copyrighted works. The team found that the process of gathering the data could not be fully automated and was at times painstaking, with humans verifying and annotating every entry, and that the resulting models could achieve impressive results, though they were still not comparable with frontier models.