HPLT Datasets v1.2

This version elaborates upon the earlier release, which consisted mostly of raw plain-text data. The previous version, v1.1, has been deprecated.

What's new in v1.2


For further information about how these datasets were produced or if you use them, please read and cite "A New Massive Multilingual Dataset for High-Performance Language Technologies".

Monolingual

Data release 1.2 (December 2023)

There are 75 languages in this release (22 TB of raw files, 11 TB of deduped files and 8.4 TB of clean files) provided as JSONL files compressed with zstd. For convenience, data is split into multiple shards, a few GB each. The number of shards per language depends on the size of the specific corpus.

The format is JSONL: each line is a valid JSON value holding one full document with its metadata. For example (wrapped across several lines here for readability):

{"id":1, "document_lang":"en",
"scores":["0.76","0.76","0.76"],
"langs":["en","en","en"],
"text":"this is paragraph1\nthis is paragraph2\nthis is paragraph3",
"url":"url1", "collection":"collection-1"
}
{"id":2, "document_lang":"en",
"scores":["0.65",...],
"langs":["en",...],
"text":"another paragraph\n...",
...

Within each document, the paragraphs are joined with newline separators. langs and scores are lists with one entry per paragraph, giving the identified language and the monocleaner score of that paragraph.
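The per-paragraph alignment described above can be sketched in Python as follows. This is a minimal illustration, not part of the official tooling: the helper name iter_paragraphs is hypothetical, the input is assumed to be one line read from an already decompressed shard (for example after running the zstd command-line tool), and scores are assumed to be stored as strings as in the sample above.

```python
import json


def iter_paragraphs(jsonl_line):
    """Yield (paragraph, lang, score) tuples for one JSONL document line."""
    doc = json.loads(jsonl_line)
    paragraphs = doc["text"].split("\n")
    langs = doc["langs"]
    # Scores appear as strings in the sample document, so convert to float.
    scores = [float(s) for s in doc["scores"]]
    if not (len(paragraphs) == len(langs) == len(scores)):
        raise ValueError("paragraph, langs and scores counts differ")
    yield from zip(paragraphs, langs, scores)


# The first sample document from above, written on a single line.
line = (
    '{"id":1, "document_lang":"en", "scores":["0.76","0.76","0.76"], '
    '"langs":["en","en","en"], '
    '"text":"this is paragraph1\\nthis is paragraph2\\nthis is paragraph3", '
    '"url":"url1", "collection":"collection-1"}'
)

rows = list(iter_paragraphs(line))
for paragraph, lang, score in rows:
    print(lang, score, paragraph)
```

Raising an error on mismatched list lengths is a design choice of this sketch; it makes a malformed document visible instead of silently truncating it.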

How to download it?

The simplest way to download the data is to run wget -i on a language-specific mapping file, which lists the full URLs of all shards for that language. For example:

wget -i https://data.hplt-project.org/one/monotext/raw/eu_map.txt

wget -i https://data.hplt-project.org/one/monotext/deduplicated/eu_map.txt

wget -i https://data.hplt-project.org/one/monotext/cleaned/eu_map.txt
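To build these mapping-file URLs for other languages, a small Python sketch may help. It assumes the URL pattern seen in the three eu examples above holds for every language code, and that a mapping file contains one shard URL per line (which is what wget -i expects). The names map_url and parse_map, and the shard URLs in the sample text, are hypothetical.

```python
BASE_URL = "https://data.hplt-project.org/one/monotext"
VARIANTS = ("raw", "deduplicated", "cleaned")


def map_url(lang, variant):
    """Return the mapping-file URL for a language code and data variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    # Pattern inferred from the eu examples; assumed to hold for other codes.
    return f"{BASE_URL}/{variant}/{lang}_map.txt"


def parse_map(text):
    """Return the shard URLs in a mapping file, one per non-empty line."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


for variant in VARIANTS:
    print(map_url("eu", variant))

# Hypothetical mapping-file content, only to show the parsing step.
sample_map = "https://example.org/shard_1.jsonl.zst\n\nhttps://example.org/shard_2.jsonl.zst\n"
shard_urls = parse_map(sample_map)
```

The resulting URLs can be passed to wget -i exactly as in the commands above.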

Full download

To download all available files of the raw, deduplicated or cleaned version in one go, use the corresponding command:

wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/one/monotext/hplt_monolingual_map_all_1.txt

wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/one/monotext/deduplicated/hplt_monolingual_map_all_1.2.txt

wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/one/monotext/cleaned/hplt_monolingual_map_cleaned_1.2.txt
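The options in these commands control where files land on disk: -x forces a directory hierarchy, -nH drops the host name directory, and --cut-dirs=2 skips the first two path components (here one/monotext). The Python sketch below only mimics that path mapping for illustration. The function name wget_local_path is hypothetical, and so is the shard URL, since actual shard file names come from the mapping files.

```python
import posixpath
from urllib.parse import urlsplit


def wget_local_path(url, cut_dirs=2):
    """Approximate the local path produced by wget -nH -x --cut-dirs=N."""
    parts = urlsplit(url).path.strip("/").split("/")
    dirs, filename = parts[:-1], parts[-1]
    # -nH removes the host directory; --cut-dirs drops leading directories.
    return posixpath.join(*dirs[cut_dirs:], filename)


# Hypothetical shard URL, used only to illustrate the mapping.
example_url = "https://data.hplt-project.org/one/monotext/cleaned/eu/eu_1.jsonl.zst"
local_path = wget_local_path(example_url)
print(local_path)
```

Under these assumptions, files from different variants and languages end up in separate subdirectories of the current working directory.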

Source: CC/IA

RAW:

Docs: 4.36M

Words: 6.09B

DEDUPLICATED:

Docs: 1.37M

Words: 1.60B

CLEANED:

Docs: 747.23K

Words: 829.49M

Source: CC/IA

RAW:

Docs: 197.49M

Words: 216.32B

DEDUPLICATED:

Docs: 46.64M

Words: 49.46B

CLEANED:

Docs: 26.80M

Words: 31.85B

Source: CC/IA

RAW:

Docs: 10.45M

Words: 10.68B

DEDUPLICATED:

Docs: 3.00M

Words: 2.88B

CLEANED:

Docs: 1.10M

Words: 1.13B

Source: CC/IA

RAW:

Docs: 3.91M

Words: 4.71B

DEDUPLICATED:

Docs: 1.26M

Words: 1.93B

CLEANED:

Docs: 356.53K

Words: 394.19M

Source: CC/IA

RAW:

Docs: 58.27M

Words: 66.37B

DEDUPLICATED:

Docs: 13.34M

Words: 15.29B

CLEANED:

Docs: 6.50M

Words: 8.76B

Source: CC/IA

RAW:

Docs: 23.35M

Words: 46.18B

DEDUPLICATED:

Docs: 5.97M

Words: 4.86B

CLEANED:

Docs: 2.88M

Words: 2.77B

Source: CC/IA

RAW:

Docs: 24.17M

Words: 24.53B

DEDUPLICATED:

Docs: 7.79M

Words: 7.88B

CLEANED:

Docs: 4.54M

Words: 5.76B

Source: CC/IA

RAW:

Docs: 184.09M

Words: 192.48B

DEDUPLICATED:

Docs: 38.56M

Words: 36.38B

CLEANED:

Docs: 16.99M

Words: 19.11B

Source: CC/IA

RAW:

Docs: 725.32K

Words: 623.99M

DEDUPLICATED:

Docs: 285.05K

Words: 234.14M

CLEANED:

Docs: 111.25K

Words: 124.06M

Source: CC/IA

RAW:

Docs: 686.47M

Words: 139.54B

DEDUPLICATED:

Docs: 23.58M

Words: 22.10B

CLEANED:

Docs: 8.18M

Words: 9.37B

Source: CC/IA

RAW:

Docs: 1.02B

Words: 899.14B

DEDUPLICATED:

Docs: 226.47M

Words: 190.94B

CLEANED:

Docs: 101.41M

Words: 110.98B

Source: CC/IA

RAW:

Docs: 134.47M

Words: 248.08B

DEDUPLICATED:

Docs: 30.63M

Words: 49.89B

CLEANED:

Docs: 15.83M

Words: 33.76B

Source: CC/IA

RAW:

Docs: 12.68B

Words: 10.27T

DEDUPLICATED:

Docs: 1.78B

Words: 2.86T

CLEANED:

Docs: 1.02B

Words: 2.31T

Source: CC/IA

RAW:

Docs: 400.65K

Words: 295.87M

DEDUPLICATED:

Docs: 177.13K

Words: 152.63M

CLEANED:

Docs: 67.81K

Words: 101.70M

Source: CC/IA

RAW:

Docs: 671.98M

Words: 869.38B

DEDUPLICATED:

Docs: 201.06M

Words: 239.29B

CLEANED:

Docs: 129.29M

Words: 181.23B

Source: CC/IA

RAW:

Docs: 21.18M

Words: 21.75B

DEDUPLICATED:

Docs: 5.84M

Words: 6.57B

CLEANED:

Docs: 1.48M

Words: 1.74B

Source: CC/IA

RAW:

Docs: 2.29M

Words: 1.55B

DEDUPLICATED:

Docs: 1.01M

Words: 660.30M

CLEANED:

Docs: 343.95K

Words: 324.64M

Source: CC/IA

RAW:

Docs: 189.04M

Words: 318.13B

DEDUPLICATED:

Docs: 42.28M

Words: 57.52B

CLEANED:

Docs: 30.90M

Words: 47.58B

Source: CC/IA

RAW:

Docs: 89.06M

Words: 80.59B

DEDUPLICATED:

Docs: 19.51M

Words: 19.88B

CLEANED:

Docs: 7.15M

Words: 9.04B

Source: CC/IA

RAW:

Docs: 660.32M

Words: 791.82B

DEDUPLICATED:

Docs: 175.76M

Words: 173.76B

CLEANED:

Docs: 99.59M

Words: 122.88B

Source: CC/IA

RAW:

Docs: 2.71M

Words: 1.43B

DEDUPLICATED:

Docs: 931.55K

Words: 519.91M

CLEANED:

Docs: 115.53K

Words: 130.68M

Source: CC/IA

RAW:

Docs: 4.60M

Words: 5.02B

DEDUPLICATED:

Docs: 1.79M

Words: 1.29B

CLEANED:

Docs: 731.36K

Words: 847.40M

Source: CC/IA

RAW:

Docs: 915.29K

Words: 1.06B

DEDUPLICATED:

Docs: 454.73K

Words: 430.50M

CLEANED:

Docs: 264.82K

Words: 303.63M

Serbo-Croatian (hbs)

Source: CC/IA

RAW:

Docs: 60.64M

Words: 69.45B

DEDUPLICATED:

Docs: 17.84M

Words: 17.94B

CLEANED:

Docs: 8.68M

Words: 10.03B

Source: CC/IA

RAW:

Docs: 46.22M

Words: 61.85B

DEDUPLICATED:

Docs: 11.24M

Words: 14.45B

CLEANED:

Docs: 4.98M

Words: 7.49B

Source: CC/IA

RAW:

Docs: 33.67M

Words: 41.21B

DEDUPLICATED:

Docs: 11.42M

Words: 14.13B

CLEANED:

Docs: 5.77M

Words: 7.54B

Source: CC/IA

RAW:

Docs: 137.33M

Words: 137.25B

DEDUPLICATED:

Docs: 28.49M

Words: 28.05B

CLEANED:

Docs: 11.71M

Words: 14.39B

Source: CC/IA

RAW:

Docs: 3.99M

Words: 4.06B

DEDUPLICATED:

Docs: 1.36M

Words: 1.29B

CLEANED:

Docs: 621.47K

Words: 589.95M

Source: CC/IA

RAW:

Docs: 125.73M

Words: 208.16B

DEDUPLICATED:

Docs: 45.77M

Words: 54.22B

CLEANED:

Docs: 31.42M

Words: 42.08B

Source: CC/IA

RAW:

Docs: 3.86M

Words: 3.48B

DEDUPLICATED:

Docs: 1.44M

Words: 1.56B

CLEANED:

Docs: 481.33K

Words: 562.01M

Source: CC/IA

RAW:

Docs: 337.44M

Words: 405.66B

DEDUPLICATED:

Docs: 96.50M

Words: 115.25B

CLEANED:

Docs: 53.53M

Words: 74.45B

Source: CC/IA

RAW:

Docs: 679.79M

Words: 305.03B

DEDUPLICATED:

Docs: 218.85M

Words: 77.44B

CLEANED:

Docs: 190.41M

Words: 63.23B

Source: CC/IA

RAW:

Docs: 6.46M

Words: 7.19B

DEDUPLICATED:

Docs: 1.67M

Words: 1.61B

CLEANED:

Docs: 533.07K

Words: 573.88M

Source: CC/IA

RAW:

Docs: 3.46M

Words: 2.53B

DEDUPLICATED:

Docs: 1.43M

Words: 1.03B

CLEANED:

Docs: 406.35K

Words: 471.76M

Source: CC/IA

RAW:

Docs: 1.93M

Words: 2.10B

DEDUPLICATED:

Docs: 557.82K

Words: 492.00M

CLEANED:

Docs: 228.22K

Words: 235.58M

Source: CC/IA

RAW:

Docs: 248.06M

Words: 161.42B

DEDUPLICATED:

Docs: 44.46M

Words: 34.34B

CLEANED:

Docs: 31.85M

Words: 25.52B

Source: CC/IA

RAW:

Docs: 333.58K

Words: 263.05M

DEDUPLICATED:

Docs: 188.17K

Words: 152.93M

CLEANED:

Docs: 88.32K

Words: 101.62M

Source: CC/IA

RAW:

Docs: 20.54M

Words: 14.38B

DEDUPLICATED:

Docs: 4.81M

Words: 3.81B

CLEANED:

Docs: 301.70K

Words: 294.13M

Source: CC/IA

RAW:

Docs: 32.35M

Words: 33.12B

DEDUPLICATED:

Docs: 7.40M

Words: 7.33B

CLEANED:

Docs: 2.72M

Words: 2.95B

Source: CC/IA

RAW:

Docs: 21.15M

Words: 27.37B

DEDUPLICATED:

Docs: 5.12M

Words: 5.85B

CLEANED:

Docs: 1.54M

Words: 1.59B

Source: CC/IA

RAW:

Docs: 3.41M

Words: 3.24B

DEDUPLICATED:

Docs: 1.25M

Words: 1.07B

CLEANED:

Docs: 734.69K

Words: 736.55M

Source: CC/IA

RAW:

Docs: 2.19M

Words: 2.67B

DEDUPLICATED:

Docs: 1.13M

Words: 917.42M

CLEANED:

Docs: 469.98K

Words: 517.83M

Source: CC/IA

RAW:

Docs: 2.50M

Words: 2.49B

DEDUPLICATED:

Docs: 1.06M

Words: 1.05B

CLEANED:

Docs: 594.90K

Words: 803.21M

Source: CC/IA

RAW:

Docs: 1.64M

Words: 1.91B

DEDUPLICATED:

Docs: 857.21K

Words: 812.46M

CLEANED:

Docs: 453.69K

Words: 519.55M

Source: CC/IA

RAW:

Docs: 28.32M

Words: 56.68B

DEDUPLICATED:

Docs: 8.36M

Words: 13.36B

CLEANED:

Docs: 4.87M

Words: 9.03B

Source: CC/IA

RAW:

Docs: 926.22K

Words: 1.39B

DEDUPLICATED:

Docs: 484.42K

Words: 818.61M

CLEANED:

Docs: 111.12K

Words: 102.42M

Source: CC/IA

RAW:

Docs: 2.46M

Words: 4.41B

DEDUPLICATED:

Docs: 826.11K

Words: 1.14B

CLEANED:

Docs: 239.47K

Words: 357.11M

Norwegian Bokmål (nb)

Source: CC/IA

RAW:

Docs: 61.09M

Words: 55.51B

DEDUPLICATED:

Docs: 14.58M

Words: 16.26B

CLEANED:

Docs: 6.12M

Words: 8.30B

Source: CC/IA

RAW:

Docs: 2.11M

Words: 1.68B

DEDUPLICATED:

Docs: 1.36M

Words: 966.77M

CLEANED:

Docs: 863.35K

Words: 694.40M

Source: CC/IA

RAW:

Docs: 234.44M

Words: 250.08B

DEDUPLICATED:

Docs: 66.62M

Words: 55.94B

CLEANED:

Docs: 31.75M

Words: 33.30B

Source: CC/IA

RAW:

Docs: 1.85M

Words: 1.64B

DEDUPLICATED:

Docs: 752.53K

Words: 615.68M

CLEANED:

Docs: 228.48K

Words: 298.57M

Source: CC/IA

RAW:

Docs: 2.40M

Words: 1.34B

DEDUPLICATED:

Docs: 888.47K

Words: 523.18M

CLEANED:

Docs: 152.78K

Words: 184.77M

Source: CC/IA

RAW:

Docs: 346.33M

Words: 366.18B

DEDUPLICATED:

Docs: 82.92M

Words: 76.27B

CLEANED:

Docs: 39.38M

Words: 44.17B

Source: CC/IA

RAW:

Docs: 313.15K

Words: 356.65M

DEDUPLICATED:

Docs: 142.66K

Words: 172.14M

CLEANED:

Docs: 88.21K

Words: 113.19M

Source: CC/IA

RAW:

Docs: 448.20M

Words: 607.05B

DEDUPLICATED:

Docs: 103.82M

Words: 121.75B

CLEANED:

Docs: 58.24M

Words: 81.41B

Source: CC/IA

RAW:

Docs: 103.69M

Words: 144.82B

DEDUPLICATED:

Docs: 24.94M

Words: 28.92B

CLEANED:

Docs: 14.47M

Words: 19.49B

Source: CC/IA

RAW:

Docs: 1.53B

Words: 1.79T

DEDUPLICATED:

Docs: 397.27M

Words: 413.51B

CLEANED:

Docs: 224.20M

Words: 284.58B

Source: CC/IA

RAW:

Docs: 1.37M

Words: 2.35B

DEDUPLICATED:

Docs: 563.99K

Words: 734.99M

CLEANED:

Docs: 322.51K

Words: 568.03M

Source: CC/IA

RAW:

Docs: 90.43M

Words: 94.65B

DEDUPLICATED:

Docs: 13.99M

Words: 14.16B

CLEANED:

Docs: 4.62M

Words: 4.98B

Source: CC/IA

RAW:

Docs: 19.17M

Words: 22.85B

DEDUPLICATED:

Docs: 5.82M

Words: 6.72B

CLEANED:

Docs: 2.20M

Words: 2.51B

Source: CC/IA

RAW:

Docs: 676.04K

Words: 544.50M

DEDUPLICATED:

Docs: 374.71K

Words: 253.71M

CLEANED:

Docs: 283.71K

Words: 211.80M

Source: CC/IA

RAW:

Docs: 9.18M

Words: 11.50B

DEDUPLICATED:

Docs: 3.22M

Words: 3.66B

CLEANED:

Docs: 1.24M

Words: 1.34B

Source: CC/IA

RAW:

Docs: 96.13M

Words: 95.99B

DEDUPLICATED:

Docs: 29.96M

Words: 29.78B

CLEANED:

Docs: 13.67M

Words: 16.91B

Source: CC/IA

RAW:

Docs: 2.17M

Words: 2.10B

DEDUPLICATED:

Docs: 983.51K

Words: 830.99M

CLEANED:

Docs: 698.57K

Words: 668.17M

Source: CC/IA

RAW:

Docs: 5.46M

Words: 9.30B

DEDUPLICATED:

Docs: 2.47M

Words: 2.94B

CLEANED:

Docs: 1.24M

Words: 1.91B

Source: CC/IA

RAW:

Docs: 3.51M

Words: 2.43B

DEDUPLICATED:

Docs: 1.61M

Words: 1.03B

CLEANED:

Docs: 415.60K

Words: 437.74M

Source: CC/IA

RAW:

Docs: 95.27M

Words: 78.63B

DEDUPLICATED:

Docs: 29.48M

Words: 16.43B

CLEANED:

Docs: 8.19M

Words: 4.33B

Source: CC/IA

RAW:

Docs: 4.97M

Words: 7.07B

DEDUPLICATED:

Docs: 1.20M

Words: 1.63B

CLEANED:

Docs: 585.24K

Words: 911.06M

Source: CC/IA

RAW:

Docs: 215.38M

Words: 238.30B

DEDUPLICATED:

Docs: 59.43M

Words: 64.92B

CLEANED:

Docs: 27.05M

Words: 42.65B

Source: CC/IA

RAW:

Docs: 368.35K

Words: 261.38M

DEDUPLICATED:

Docs: 172.47K

Words: 134.18M

CLEANED:

Docs: 65.15K

Words: 74.86M

Source: CC/IA

RAW:

Docs: 47.12M

Words: 52.95B

DEDUPLICATED:

Docs: 17.86M

Words: 18.19B

CLEANED:

Docs: 9.31M

Words: 10.57B

Source: CC/IA

RAW:

Docs: 6.09M

Words: 6.59B

DEDUPLICATED:

Docs: 2.23M

Words: 2.02B

CLEANED:

Docs: 1.44M

Words: 1.42B

Source: CC/IA

RAW:

Docs: 1.37M

Words: 1.11B

DEDUPLICATED:

Docs: 633.22K

Words: 556.45M

CLEANED:

Docs: 290.29K

Words: 367.25M

Source: CC/IA

RAW:

Docs: 174.17M

Words: 287.20B

DEDUPLICATED:

Docs: 40.10M

Words: 59.20B

CLEANED:

Docs: 31.50M

Words: 49.36B

Source: CC/IA

RAW:

Docs: 6.91B

Words: 1.79T

DEDUPLICATED:

Docs: 1.20B

Words: 482.86B

CLEANED:

Docs: 1.08B

Words: 432.88B
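The counts above use K, M, B and T suffixes. To compare stages, for instance the share of raw documents that survives cleaning, they first have to be turned into plain numbers. A minimal Python sketch follows, assuming the suffixes mean thousand, million, billion and trillion; the helper names parse_count and retained are hypothetical. The example values are the Serbo-Croatian (hbs) document counts listed above.

```python
SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_count(text):
    """Convert a count such as '60.64M' or '747.23K' to an integer."""
    text = text.strip()
    if text and text[-1].upper() in SUFFIXES:
        return round(float(text[:-1]) * SUFFIXES[text[-1].upper()])
    return round(float(text))


def retained(raw, cleaned):
    """Return the fraction of the raw count that remains after cleaning."""
    return parse_count(cleaned) / parse_count(raw)


# Serbo-Croatian (hbs): 60.64M raw documents, 8.68M cleaned documents.
share = retained("60.64M", "8.68M")
print(f"{share:.1%} of raw documents remain after cleaning")
```

Because the published figures are rounded, ratios computed this way are approximate.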

Bilingual

Data release 1.2 (December 2023)

There are 18 language pairs in this release. The parallel corpus contains over 96 million clean and unique sentence pairs and covers over 1.4 billion English tokens. The corpora are provided in raw, TMX and TXT formats, all compressed. They have been highly curated, deduplicated and filtered using the full Bitextor pipeline. An anonymized (ROAM) version of the TMX is also provided.

Source: CC/IA

TMX:

Words: 240M

Translation units: 15M

ROAM:

Words: 173.70M

Translation units: 11.87M

RAW:

Words: 33.20B

Translation units: 1.55B

Source: CC/IA

TMX:

Words: 2.8M

Translation units: 241K

ROAM:

Words: 280.86K

Translation units: 212.18K

RAW:

Words: 521.63M

Translation units: 27.00M

Source: CC/IA

TMX:

Words: 142M

Translation units: 9.0M

ROAM:

Words: 125.90M

Translation units: 8.05M

RAW:

Words: 8.03B

Translation units: 402.49M

Source: CC/IA

TMX:

Words: 96M

Translation units: 6.1M

ROAM:

Words: 86.59M

Translation units: 5.51M

RAW:

Words: 15.48B

Translation units: 865.43M

Source: CC/IA

TMX:

Words: 10M

Translation units: 611K

ROAM:

Words: 8.41M

Translation units: 521.10K

RAW:

Words: 400.26M

Translation units: 20.83M

Source: CC/IA

TMX:

Words: 339M

Translation units: 26M

ROAM:

Words: 303M

Translation units: 23M

RAW:

Words: 65.31B

Translation units: 3.83B

Source: CC/IA

TMX:

Words: 17M

Translation units: 995K

ROAM:

Words: 15.16M

Translation units: 920.01K

RAW:

Words: 2.01B

Translation units: 101.00M

Source: CC/IA

TMX:

Words: 14M

Translation units: 1.1M

ROAM:

Words: 11.63M

Translation units: 930.40K

RAW:

Words: 1.02B

Translation units: 56.10M

Source: CC/IA

TMX:

Words: 166M

Translation units: 13M

ROAM:

Words: 137.38M

Translation units: 10.49M

RAW:

Words: 19.25B

Translation units: 1.04B

Source: CC/IA

TMX:

Words: 139M

Translation units: 9.4M

ROAM:

Words: 125.99M

Translation units: 8.61M

RAW:

Words: 16.57B

Translation units: 895.79M

Source: CC/IA

TMX:

Words: 30M

Translation units: 2.2M

ROAM:

Words: 26.27M

Translation units: 1.91M

RAW:

Words: 3.27B

Translation units: 170.42M

Source: CC/IA

TMX:

Words: 19M

Translation units: 1.1M

ROAM:

Words: 15.55M

Translation units: 995.10K

RAW:

Words: 1.87B

Translation units: 91.29M

Source: CC/IA

TMX:

Words: 19M

Translation units: 855K

ROAM:

Words: 17.14M

Translation units: 752.06K

RAW:

Words: 2.82B

Translation units: 135.10M

Source: CC/IA

TMX:

Words: 2.1M

Translation units: 133K

ROAM:

Words: 1.58M

Translation units: 108.30K

RAW:

Words: 496.50M

Translation units: 28.70M

Source: CC/IA

TMX:

Words: 26M

Translation units: 1.7M

ROAM:

Words: 20.50M

Translation units: 1.35M

RAW:

Words: 5.82B

Translation units: 253.10M

Source: CC/IA

TMX:

Words: 56M

Translation units: 4.0M

ROAM:

Words: 49.63M

Translation units: 3.62M

RAW:

Words: 14.25B

Translation units: 247.56M

Source: CC/IA

TMX:

Words: 21M

Translation units: 1.8M

ROAM:

Words: 15.03M

Translation units: 1.38M

RAW:

Words: 5.75B

Translation units: 247.56M

English (en) - Traditional Chinese (zh-hant)

Source: CC/IA

TMX:

Words: 85M

Translation units: 5.4M

ROAM:

Words: 73M

Translation units: 4.6M

RAW:

Words: 9.16B

Translation units: 530.12M

License

These data are released under this licensing scheme:

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.

*It is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.