HPLTDatasets v1.1

DEPRECATED RELEASE
GET FILES FROM RELEASE 1.2

This version elaborates on the previous one, version 1.0, which consisted mostly of raw plain-text data. De-duplication is now applied to both the monolingual and the bilingual datasets, and the bilingual datasets undergo further cleaning. The texts also come with some metadata, which is currently unused but can be employed by end users to conduct their own filtering.
Please check this report for a full description of the original data release (version 1.0), but note that the corpus statistics are very different due to deduplication and further cleaning.

Monolingual

Data release 1.1 (October 2023)

There are 75 languages in this release (7.6 TB in total size) provided as JSONL files compressed with zstd. For convenience, data is split into multiple shards, a few GB each. The number of shards per language depends on the size of the specific corpus.

The format is JSONL: each line is a valid JSON value holding one full document together with its metadata. The example below is wrapped across several lines for readability:

{"id":1, "document_lang":"en",
"scores":["0.76","0.76","0.76"],
"langs":["en","en","en"],
"text":"this is paragraph1\nthis is paragraph2\nthis is paragraph3",
"url":"url1", "collection":"collection-1"
}
{"id":2, "document_lang":"en",
"scores":["0.65",...],
"langs":["en",...],
"text":"another paragraph\n...",
...

In each document, the paragraphs are concatenated into the text field using newline separators. langs and scores are lists containing one entry per paragraph, giving the identified language and the monocleaner score of each paragraph, in the same order.
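The per-paragraph metadata described above can be used for custom filtering. The following is a minimal sketch, assuming the field names shown in the example document and that scores are stored as strings; the function names, the sample line and the threshold are illustrative and not part of the release. Decompressing the zstd shards is left out, so the reader takes any iterable of already decompressed text lines.

```python
import json
from typing import Iterable, Iterator, List, Optional, Tuple


def read_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL: one JSON document per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def split_paragraphs(doc: dict) -> List[Tuple[str, str, float]]:
    """Return (paragraph, language, score) triples for one document."""
    texts = doc["text"].split("\n")
    langs, scores = doc["langs"], doc["scores"]
    if not (len(texts) == len(langs) == len(scores)):
        raise ValueError("text, langs and scores disagree on paragraph count")
    # Scores appear as strings in the example, so convert them to floats.
    return [(t, l, float(s)) for t, l, s in zip(texts, langs, scores)]


def filter_document(doc: dict, min_score: float) -> Optional[dict]:
    """Keep paragraphs in the document language scoring at least min_score.

    Returns None when no paragraph survives.
    """
    keep = [
        i
        for i, (_, lang, score) in enumerate(split_paragraphs(doc))
        if lang == doc["document_lang"] and score >= min_score
    ]
    if not keep:
        return None
    texts = doc["text"].split("\n")
    filtered = dict(doc)
    filtered["text"] = "\n".join(texts[i] for i in keep)
    filtered["langs"] = [doc["langs"][i] for i in keep]
    filtered["scores"] = [doc["scores"][i] for i in keep]
    return filtered


# Hypothetical sample line (not taken from the corpus).
sample_line = (
    '{"id":1, "document_lang":"en", "scores":["0.76","0.31","0.80"], '
    '"langs":["en","en","fi"], '
    '"text":"this is paragraph1\\nthis is paragraph2\\nthis is paragraph3", '
    '"url":"url1", "collection":"collection-1"}'
)
docs = list(read_documents([sample_line]))
filtered = filter_document(docs[0], min_score=0.5)
```

Here the second paragraph is dropped for its low score and the third because its language differs from document_lang. The threshold 0.5 is arbitrary; a suitable value depends on the language and the intended use.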

How to download it?

The simplest way to download the data is to use wget -i with the language-specific mapping files containing full URLs for all shards of this particular language, for example:

wget -i https://data.hplt-project.org/one/monotext/deduplicated/eu_map.txt

Full download

To download the full 7.6 TB in one go, use:

wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/one/monotext/deduplicated/hplt_monolingual_map_all_1.2.txt
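The flags in this command control where files land on disk: -x forces the creation of directories, -nH omits the host name directory, and --cut-dirs=2 drops the first two directory components of the remote path. The sketch below approximates the resulting local path for a given URL. The helper name and the shard URL are hypothetical, used only for illustration, and corner cases such as query strings or URLs ending in a slash are ignored.

```python
from urllib.parse import urlparse


def wget_local_path(url: str, cut_dirs: int = 2) -> str:
    """Approximate the path written by `wget -nH -x --cut-dirs=N`."""
    # Drop scheme and host (-nH), keep the remote directory layout (-x).
    path = urlparse(url).path.lstrip("/")
    *dirs, filename = path.split("/")
    # Remove the first `cut_dirs` directory components (--cut-dirs).
    return "/".join(dirs[cut_dirs:] + [filename])


# Hypothetical shard URL; real shard names come from the mapping file.
example_url = (
    "https://data.hplt-project.org/one/monotext/deduplicated/eu/eu_1.jsonl.zst"
)
local_path = wget_local_path(example_url)
```

With this hypothetical URL, the components "one" and "monotext" are removed, so the file would be stored under deduplicated/eu/ relative to the working directory.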

Source: CC/IA

RAW:

Docs: 4.36M

Words: 6.09B

DEDUPLICATED:

Docs: 1.37M

Words: 1.87B

Source: CC/IA

RAW:

Docs: 197.49M

Words: 216.32B

DEDUPLICATED:

Docs: 46.64M

Words: 51.53B

Source: CC/IA

RAW:

Docs: 10.45M

Words: 10.68B

DEDUPLICATED:

Docs: 3.00M

Words: 2.94B

Source: CC/IA

RAW:

Docs: 3.91M

Words: 4.71B

DEDUPLICATED:

Docs: 1.26M

Words: 1.39B

Source: CC/IA

RAW:

Docs: 58.27M

Words: 66.37B

DEDUPLICATED:

Docs: 13.34M

Words: 15.48B

Source: CC/IA

RAW:

Docs: 23.35M

Words: 46.18B

DEDUPLICATED:

Docs: 5.97M

Words: 8.23B

Source: CC/IA

RAW:

Docs: 24.17M

Words: 24.53B

DEDUPLICATED:

Docs: 7.79M

Words: 7.88B

Source: CC/IA

RAW:

Docs: 184.09M

Words: 192.48B

DEDUPLICATED:

Docs: 38.56M

Words: 41.04B

Source: CC/IA

RAW:

Docs: 725.32K

Words: 623.99M

DEDUPLICATED:

Docs: 285.05K

Words: 243.81M

Source: CC/IA

RAW:

Docs: 686.47M

Words: 139.54B

DEDUPLICATED:

Docs: 23.58M

Words: 22.10B

Source: CC/IA

RAW:

Docs: 1.02B

Words: 899.14B

DEDUPLICATED:

Docs: 226.47M

Words: 193.11B

Source: CC/IA

RAW:

Docs: 134.47M

Words: 248.08B

DEDUPLICATED:

Docs: 30.63M

Words: 52.24B

Source: CC/IA

RAW:

Docs: 12.68B

Words: 10.27T

DEDUPLICATED:

Docs: 1.78B

Words: 2.86T

Source: CC/IA

RAW:

Docs: 400.65K

Words: 295.87M

DEDUPLICATED:

Docs: 177.13K

Words: 147.06M

Source: CC/IA

RAW:

Docs: 671.98M

Words: 869.38B

DEDUPLICATED:

Docs: 201.06M

Words: 250.08B

Source: CC/IA

RAW:

Docs: 21.18M

Words: 21.75B

DEDUPLICATED:

Docs: 5.84M

Words: 5.85B

Source: CC/IA

RAW:

Docs: 2.29M

Words: 1.55B

DEDUPLICATED:

Docs: 1.01M

Words: 717.14M

Source: CC/IA

RAW:

Docs: 189.04M

Words: 318.13B

DEDUPLICATED:

Docs: 42.28M

Words: 70.38B

Source: CC/IA

RAW:

Docs: 89.06M

Words: 80.59B

DEDUPLICATED:

Docs: 19.51M

Words: 16.85B

Source: CC/IA

RAW:

Docs: 660.32M

Words: 791.82B

DEDUPLICATED:

Docs: 175.76M

Words: 213.84B

Source: CC/IA

RAW:

Docs: 2.71M

Words: 1.43B

DEDUPLICATED:

Docs: 931.55K

Words: 520.10M

Source: CC/IA

RAW:

Docs: 4.60M

Words: 5.02B

DEDUPLICATED:

Docs: 1.79M

Words: 1.29B

Source: CC/IA

RAW:

Docs: 915.29K

Words: 1.06B

DEDUPLICATED:

Docs: 454.73K

Words: 473.46M

Serbo-Croatian (hbs)

Source: CC/IA

RAW:

Docs: 60.64M

Words: 69.45B

DEDUPLICATED:

Docs: 17.84M

Words: 19.40B

Source: CC/IA

RAW:

Docs: 46.22M

Words: 61.85B

DEDUPLICATED:

Docs: 11.24M

Words: 15.45B

Source: CC/IA

RAW:

Docs: 33.67M

Words: 41.21B

DEDUPLICATED:

Docs: 11.42M

Words: 13.65B

Source: CC/IA

RAW:

Docs: 137.33M

Words: 137.25B

DEDUPLICATED:

Docs: 28.49M

Words: 26.54B

Source: CC/IA

RAW:

Docs: 3.99M

Words: 4.06B

DEDUPLICATED:

Docs: 1.36M

Words: 1.31B

Source: CC/IA

RAW:

Docs: 125.73M

Words: 208.16B

DEDUPLICATED:

Docs: 45.77M

Words: 69.22B

Source: CC/IA

RAW:

Docs: 3.86M

Words: 3.48B

DEDUPLICATED:

Docs: 1.44M

Words: 1.34B

Source: CC/IA

RAW:

Docs: 337.44M

Words: 405.66B

DEDUPLICATED:

Docs: 96.50M

Words: 114.44B

Source: CC/IA

RAW:

Docs: 679.79M

Words: 305.03B

DEDUPLICATED:

Docs: 218.85M

Words: 91.41B

Source: CC/IA

RAW:

Docs: 6.46M

Words: 7.19B

DEDUPLICATED:

Docs: 1.67M

Words: 1.68B

Source: CC/IA

RAW:

Docs: 3.46M

Words: 2.53B

DEDUPLICATED:

Docs: 1.43M

Words: 1.01B

Source: CC/IA

RAW:

Docs: 1.93M

Words: 2.10B

DEDUPLICATED:

Docs: 557.82K

Words: 577.95M

Source: CC/IA

RAW:

Docs: 248.06M

Words: 161.42B

DEDUPLICATED:

Docs: 44.46M

Words: 29.00B

Source: CC/IA

RAW:

Docs: 333.58K

Words: 263.05M

DEDUPLICATED:

Docs: 188.17K

Words: 146.44M

Source: CC/IA

RAW:

Docs: 20.54M

Words: 14.38B

DEDUPLICATED:

Docs: 4.81M

Words: 3.32B

Source: CC/IA

RAW:

Docs: 32.35M

Words: 33.12B

DEDUPLICATED:

Docs: 7.40M

Words: 7.78B

Source: CC/IA

RAW:

Docs: 21.15M

Words: 27.37B

DEDUPLICATED:

Docs: 5.12M

Words: 5.39B

Source: CC/IA

RAW:

Docs: 3.41M

Words: 3.24B

DEDUPLICATED:

Docs: 1.25M

Words: 1.11B

Source: CC/IA

RAW:

Docs: 2.19M

Words: 2.67B

DEDUPLICATED:

Docs: 1.13M

Words: 1.05B

Source: CC/IA

RAW:

Docs: 2.50M

Words: 2.49B

DEDUPLICATED:

Docs: 1.06M

Words: 1.09B

Source: CC/IA

RAW:

Docs: 1.64M

Words: 1.91B

DEDUPLICATED:

Docs: 857.21K

Words: 911.64M

Source: CC/IA

RAW:

Docs: 28.32M

Words: 56.68B

DEDUPLICATED:

Docs: 8.36M

Words: 16.50B

Source: CC/IA

RAW:

Docs: 926.22K

Words: 1.39B

DEDUPLICATED:

Docs: 484.42K

Words: 818.61M

Source: CC/IA

RAW:

Docs: 2.46M

Words: 4.41B

DEDUPLICATED:

Docs: 826.11K

Words: 1.25B

Norwegian Bokmål (nb)

Source: CC/IA

RAW:

Docs: 61.09M

Words: 55.51B

DEDUPLICATED:

Docs: 14.58M

Words: 13.90B

Source: CC/IA

RAW:

Docs: 2.11M

Words: 1.68B

DEDUPLICATED:

Docs: 1.36M

Words: 1.06B

Source: CC/IA

RAW:

Docs: 234.44M

Words: 250.08B

DEDUPLICATED:

Docs: 66.62M

Words: 71.75B

Norwegian Nynorsk (nn)

Source: CC/IA

RAW:

Docs: 1.85M

Words: 1.64B

DEDUPLICATED:

Docs: 752.53K

Words: 640.11M

Source: CC/IA

RAW:

Docs: 2.40M

Words: 1.34B

DEDUPLICATED:

Docs: 888.47K

Words: 532.71M

Source: CC/IA

RAW:

Docs: 346.33M

Words: 366.18B

DEDUPLICATED:

Docs: 82.92M

Words: 85.41B

Source: CC/IA

RAW:

Docs: 313.15K

Words: 356.65M

DEDUPLICATED:

Docs: 142.66K

Words: 167.53M

Source: CC/IA

RAW:

Docs: 448.20M

Words: 607.05B

DEDUPLICATED:

Docs: 103.82M

Words: 132.09B

Source: CC/IA

RAW:

Docs: 103.69M

Words: 144.82B

DEDUPLICATED:

Docs: 24.94M

Words: 32.80B

Source: CC/IA

RAW:

Docs: 1.53B

Words: 1.79T

DEDUPLICATED:

Docs: 397.27M

Words: 492.53B

Source: CC/IA

RAW:

Docs: 1.37M

Words: 2.35B

DEDUPLICATED:

Docs: 563.99K

Words: 917.69M

Source: CC/IA

RAW:

Docs: 90.43M

Words: 94.65B

DEDUPLICATED:

Docs: 13.99M

Words: 13.95B

Source: CC/IA

RAW:

Docs: 19.17M

Words: 22.85B

DEDUPLICATED:

Docs: 5.82M

Words: 7.04B

Source: CC/IA

RAW:

Docs: 676.04K

Words: 544.50M

DEDUPLICATED:

Docs: 374.71K

Words: 299.68M

Source: CC/IA

RAW:

Docs: 9.18M

Words: 11.50B

DEDUPLICATED:

Docs: 3.22M

Words: 3.78B

Source: CC/IA

RAW:

Docs: 96.13M

Words: 95.99B

DEDUPLICATED:

Docs: 29.96M

Words: 28.26B

Source: CC/IA

RAW:

Docs: 2.17M

Words: 2.10B

DEDUPLICATED:

Docs: 983.51K

Words: 910.98M

Source: CC/IA

RAW:

Docs: 5.46M

Words: 9.30B

DEDUPLICATED:

Docs: 2.47M

Words: 3.87B

Source: CC/IA

RAW:

Docs: 3.51M

Words: 2.43B

DEDUPLICATED:

Docs: 1.61M

Words: 1.06B

Source: CC/IA

RAW:

Docs: 95.27M

Words: 78.63B

DEDUPLICATED:

Docs: 29.48M

Words: 21.99B

Source: CC/IA

RAW:

Docs: 4.97M

Words: 7.07B

DEDUPLICATED:

Docs: 1.20M

Words: 1.63B

Source: CC/IA

RAW:

Docs: 215.38M

Words: 238.30B

DEDUPLICATED:

Docs: 59.43M

Words: 64.92B

Source: CC/IA

RAW:

Docs: 368.35K

Words: 261.38M

DEDUPLICATED:

Docs: 172.47K

Words: 134.18M

Source: CC/IA

RAW:

Docs: 47.12M

Words: 52.95B

DEDUPLICATED:

Docs: 17.86M

Words: 18.19B

Source: CC/IA

RAW:

Docs: 6.09M

Words: 6.59B

DEDUPLICATED:

Docs: 2.23M

Words: 2.02B

Source: CC/IA

RAW:

Docs: 1.37M

Words: 1.11B

DEDUPLICATED:

Docs: 633.22K

Words: 556.45M

Source: CC/IA

RAW:

Docs: 174.17M

Words: 287.20B

DEDUPLICATED:

Docs: 40.10M

Words: 64.97B

Source: CC/IA

RAW:

Docs: 6.91B

Words: 1.79T

DEDUPLICATED:

Docs: 1.20B

Words: 334.58B
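
The RAW and DEDUPLICATED figures above use K, M, B and T suffixes. To compare them, for example to see what fraction of documents survives deduplication, the abbreviated values can be converted to numbers. This is a minimal sketch; the helper names are illustrative, and because the published figures are rounded, the resulting ratios are approximate.

```python
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_count(value: str) -> float:
    """Convert an abbreviated count such as '4.36M' to a number."""
    value = value.strip()
    if value and value[-1].upper() in SUFFIXES:
        return float(value[:-1]) * SUFFIXES[value[-1].upper()]
    return float(value)


def retention(raw: str, deduplicated: str) -> float:
    """Fraction of the raw count that remains after deduplication."""
    return parse_count(deduplicated) / parse_count(raw)


# Document counts of the first entry in the list above.
kept = retention("4.36M", "1.37M")
print(f"{kept:.1%} of documents retained")
```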

Bilingual

Data release 1.1 (October 2023)

There are 18 language pairs in this release. The parallel corpus contains over 96 million clean and unique sentence pairs and covers over 1.4 billion English tokens. The corpora are provided in TMX and TXT compressed formats derived from the raw files published in the previous release. These corpora have been highly curated, de-duplicated and filtered using the full Bitextor pipeline.
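
TMX is an XML format in which each translation unit (tu) holds one variant (tuv) per language, with the sentence in a seg element. The following is a minimal sketch of streaming sentence pairs out of a decompressed TMX file with the Python standard library. The sample document is hypothetical and reduced to the bare TMX structure; any additional properties or attributes in the released files are not covered here.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_translation_units(source):
    """Yield one {language: sentence} dict per <tu> element.

    `source` is a file name or a binary file object with TMX content.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            pair = {}
            for tuv in elem.iter("tuv"):
                lang = tuv.get(XML_LANG)
                seg = tuv.find("seg")
                if lang is not None and seg is not None:
                    pair[lang] = "".join(seg.itertext())
            yield pair
            elem.clear()  # release memory held by the processed unit


# Hypothetical sample, not taken from the corpus.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning</seg></tuv>
      <tuv xml:lang="eu"><seg>Egun on</seg></tuv>
    </tu>
  </body>
</tmx>
"""

pairs = list(iter_translation_units(io.BytesIO(SAMPLE_TMX)))
```

Using iterparse rather than loading the whole tree keeps memory use low, which matters for the larger language pairs.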

Arabic (ar) - English (en)

Source: CC/IA

Stats:

Src lang tokens: 29.29M

Translation units: 14.65M

Bosnian (bs) - English (en)

Source: CC/IA

Stats:

Src lang tokens: 2.64M

Translation units: 240.01K

Catalan (ca) - English (en)

Source: CC/IA

Stats:

Src lang tokens: 149.86M

Translation units: 8.91M

English (en) - Estonian (et)

Source: CC/IA

Stats:

Src lang tokens: 95.94M

Translation units: 6.09M

English (en) - Basque (eu)

Source: CC/IA

Stats:

Src lang tokens: 1.22M

Translation units: 610.69K

English (en) - Finnish (fi)

Source: CC/IA

Stats:

Src lang tokens: 50.35M

Translation units: 25.18M

English (en) - Irish (ga)

Source: CC/IA

Stats:

Src lang tokens: 16.33M

Translation units: 994.75K

English (en) - Galician (gl)

Source: CC/IA

Stats:

Src lang tokens: 2.13M

Translation units: 1.06M

English (en) - Hindi (hi)

Source: CC/IA

Stats:

Src lang tokens: 165.14M

Translation units: 12.04M

English (en) - Croatian (hr)

Source: CC/IA

Stats:

Src lang tokens: 138.36M

Translation units: 9.31M

English (en) - Icelandic (is)

Source: CC/IA

Stats:

Src lang tokens: 4.30M

Translation units: 2.15M

English (en) - Macedonian (mk)

Source: CC/IA

Stats:

Src lang tokens: 2.28M

Translation units: 1.14M

English (en) - Maltese (mt)

Source: CC/IA

Stats:

Src lang tokens: 1.71M

Translation units: 854.82K

English (en) - Norwegian Nynorsk (nn)

Source: CC/IA

Stats:

Src lang tokens: 265.08K

Translation units: 132.54K

English (en) - Albanian (sq)

Source: CC/IA

Stats:

Src lang tokens: 3.31M

Translation units: 1.66M

English (en) - Serbian (sr)

Source: CC/IA

Stats:

Src lang tokens: 55.12M

Translation units: 3.90M

English (en) - Swahili (sw)

Source: CC/IA

Stats:

Src lang tokens: 20.04M

Translation units: 1.71M

English (en) - Traditional Chinese (zh-hant)

Source: CC/IA

Stats:

Src lang tokens: 84.11M

Translation units: 5.31M

License

These data are released under this licensing scheme:

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.

*It is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.