HUGE dataset released for open source use

together.ai RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models — Together AI

Releasing a new version of the RedPajama dataset, with 30 trillion filtered and deduplicated tokens (100+ trillion raw) from 84 CommonCrawl dumps covering 5 languages, along with 40+ pre-computed data quality annotations that can be used for further filtering and weighting.

30T tokens, 20.5T in English, allegedly high quality, can't wait to see people start putting it to use!

Related GitHub repo: https://github.com/togethercomputer/RedPajama-Data
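
For anyone wondering what "further filtering and weighting" with the quality annotations might look like in practice, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the record layout, the signal names (`word_count`, `perplexity`), and the thresholds are made up and are not the dataset's actual annotation names or recommended values.

```python
# Hypothetical documents, each carrying precomputed quality signals.
# The field names and values are invented for this example and do not
# reflect the real RedPajama-Data-v2 schema.
docs = [
    {"id": "a", "signals": {"word_count": 850, "perplexity": 200.0}},
    {"id": "b", "signals": {"word_count": 12, "perplexity": 150.0}},
    {"id": "c", "signals": {"word_count": 400, "perplexity": 900.0}},
    {"id": "d", "signals": {"word_count": 120, "perplexity": 450.0}},
]


def keep(doc, min_words=50, max_perplexity=500.0):
    """Filtering: drop documents that are too short or too 'surprising'."""
    s = doc["signals"]
    return s["word_count"] >= min_words and s["perplexity"] <= max_perplexity


def weight(doc, max_perplexity=500.0):
    """Weighting: lower perplexity gives a higher sampling weight in [0, 1]."""
    p = doc["signals"]["perplexity"]
    return max(0.0, 1.0 - p / max_perplexity)


# Apply the filter first, then compute a sampling weight for each survivor.
filtered = [d for d in docs if keep(d)]
weights = {d["id"]: weight(d) for d in filtered}

for doc_id, w in weights.items():
    print(doc_id, round(w, 2))
```

The point is just that the annotations let you pick your own thresholds and weights instead of relying on a single fixed quality filter.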
