Version 2.0 of the HPLT Monolingual Datasets is now published. These collections are available under the Creative Commons CC0 license and bring significant improvements over the previous release (version 1.2). As with 1.2, the release comes in two variants: deduplicated (21 TB) and cleaned (15 TB). The cleaned variant contains the same documents as the deduplicated one, minus those filtered out by our cleaning heuristics. We recommend the cleaned variant unless you want to try your own cleaning pipelines.
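For readers who want to poke at the data directly, below is a minimal sketch of streaming documents from a single shard. It assumes the shards are zstd-compressed JSONL files with a `text` field per document; the shard path and field name are illustrative, so check the release documentation for the actual layout.

```python
# Minimal sketch: stream one shard of the dataset without decompressing
# it to disk. Assumes zstd-compressed JSONL shards carrying a "text"
# field per document; the shard path below is hypothetical.
import io
import json

import zstandard as zstd  # pip install zstandard

def iter_documents(path):
    """Yield one parsed JSON document per line of a .jsonl.zst shard."""
    with open(path, "rb") as raw:
        reader = zstd.ZstdDecompressor().stream_reader(raw)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)

# Print the opening characters of the first document, then stop.
for doc in iter_documents("eng_Latn/1.jsonl.zst"):  # hypothetical path
    print(doc["text"][:200])
    break
```

Streaming the decompressed bytes line by line keeps memory usage flat, which matters at this scale: a single shard can be far larger than RAM once decompressed.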
As with previous releases, the version 2.0 datasets are hosted in the Sigma2 NIRD Data Lake, and the text extraction pipeline was run on the LUMI supercomputer.
HPLT Monolingual Datasets version 2.0 (the deduplicated variant) feature about 7.6 trillion whitespace-separated words and about 52 trillion characters extracted from 21 billion documents, compared to 5.6 trillion words and 42 trillion characters extracted from 5 billion documents in version 1.2. All in all, you can expect less noise and boilerplate, fewer duplicates, more unique documents, and generally better-quality texts to train language models on.
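As a quick sanity check on those figures, the growth factors work out roughly as follows (a back-of-the-envelope sketch using the approximate counts quoted above):

```python
# Rough growth factors of version 2.0 over 1.2, computed from the
# approximate counts quoted above.
v20 = {"words": 7.6e12, "chars": 52e12, "docs": 21e9}
v12 = {"words": 5.6e12, "chars": 42e12, "docs": 5e9}
for key in v20:
    print(f"{key}: {v20[key] / v12[key]:.1f}x")
# words: 1.4x, chars: 1.2x, docs: 4.2x
```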
These data are released under this licensing scheme:
* Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please notify us.
* Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.
* It is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.