
The Pile (dataset)

The Pile is an 886 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. It is composed of 22 component sub-datasets. As of 2024, the Pile and Common Crawl were the two main training datasets used to train AI models.

Training on copyrighted works or derivatives
The Books3 component of the dataset contains copyrighted material compiled from Bibliotik, a pirate website. In July 2023, the Danish anti-piracy group Rights Alliance took down Books3 through DMCA notices. Tens of thousands of YouTube videos had their subtitles scraped directly from YouTube and included in the Pile, which YouTube argued was against its terms of service.
Common Pile v0.1
In June 2025, EleutherAI released Common Pile v0.1, a training dataset containing only works whose licenses permit their use for training AI models. The release was produced in partnership with Poolside, Hugging Face, and the US Library of Congress, along with over two dozen researchers at 14 institutions, including the University of Toronto, MIT, CMU, the Vector Institute, and the Allen Institute for AI. The intent was to show what is possible when AI systems are trained ethically, with respect for copyrighted works. The team found that the process of gathering the data could not be fully automated and was at times painstaking, with humans verifying and annotating every entry, and that the resulting models could achieve impressive results, though they were still not comparable with frontier models.