The Pile is an 886 GB diverse, open-source dataset of English text created to train large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. It comprises 22 component sub-datasets. As of 2024, the Pile and Common Crawl were the two main datasets used to train AI models.