This version elaborates upon version 1.0, which consisted mostly of raw plain-text data. The previous version, v1.1, has been deprecated.
For further information about how these datasets were produced, and if you use them, please read and cite "A New Massive Multilingual Dataset for High-Performance Language Technologies".
Data release 1.2 (December 2023)
There are 75 languages in this release (22 TB of raw files, 11 TB of deduplicated files and 8.4 TB of cleaned files), provided as JSONL files compressed with zstd. For convenience, the data is split into multiple shards of a few GB each. The number of shards per language depends on the size of the specific corpus.
The format is JSONL, where each line is a valid JSON value and a full document with metadata. For example:
{"id":1, "document_lang":"en", "scores":["0.76","0.76","0.76"], "langs":["en","en","en"], "text":"this is paragraph1\nthis is paragraph2\nthis is paragraph3", "url":"url1", "collection":"collection-1" }
{"id":2, "document_lang":"en", "scores":["0.65",...], "langs":["en",...], "text":"another paragraph\n...", ...
Within each document, paragraphs are joined with newline separators. langs and scores are lists containing one entry per paragraph, giving the identified language and the monocleaner score of that paragraph.
The simplest way to download the data is to use wget -i with the language-specific mapping files, which contain the full URLs of all shards for a given language. For example:
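As an illustration of this layout, the following minimal Python sketch splits one JSONL line back into its paragraphs and pairs each paragraph with its language and score. The function names iter_paragraphs and keep_paragraphs are hypothetical helpers, not part of any HPLT tooling, and the sketch assumes the line has already been decompressed (for example with the zstd command-line tool). Scores appear as strings in the example above, so they are converted to floats here.

```python
import json

# Sample line copied from the format example in this document.
SAMPLE_LINE = r'{"id":1, "document_lang":"en", "scores":["0.76","0.76","0.76"], "langs":["en","en","en"], "text":"this is paragraph1\nthis is paragraph2\nthis is paragraph3", "url":"url1", "collection":"collection-1" }'


def iter_paragraphs(jsonl_line):
    """Yield one dict per paragraph of a document (hypothetical helper)."""
    doc = json.loads(jsonl_line)
    paragraphs = doc["text"].split("\n")
    langs = doc["langs"]
    scores = doc["scores"]
    # The three sequences are expected to be aligned, one entry per paragraph.
    if not (len(paragraphs) == len(langs) == len(scores)):
        raise ValueError("text, langs and scores are not aligned")
    for text, lang, score in zip(paragraphs, langs, scores):
        yield {"text": text, "lang": lang, "score": float(score)}


def keep_paragraphs(jsonl_line, lang, min_score):
    """Return the paragraphs in a given language at or above a score threshold."""
    return [
        p["text"]
        for p in iter_paragraphs(jsonl_line)
        if p["lang"] == lang and p["score"] >= min_score
    ]


if __name__ == "__main__":
    for paragraph in iter_paragraphs(SAMPLE_LINE):
        print(paragraph)
```

The threshold passed to keep_paragraphs is entirely up to the user; no particular cut-off is implied by the release.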
wget -i https://data.hplt-project.org/one/monotext/raw/eu_map.txt
wget -i https://data.hplt-project.org/one/monotext/deduplicated/eu_map.txt
wget -i https://data.hplt-project.org/one/monotext/cleaned/eu_map.txt
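For scripted downloads, the URL pattern of the three commands above can be generated programmatically. The sketch below is only an illustration: map_url and shard_urls are hypothetical helper names, and it assumes that a mapping file lists one shard URL per line, which is the form wget -i reads.

```python
BASE_URL = "https://data.hplt-project.org/one/monotext"
VARIANTS = ("raw", "deduplicated", "cleaned")


def map_url(lang, variant):
    """Build the mapping-file URL for a language code and data variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"{BASE_URL}/{variant}/{lang}_map.txt"


def shard_urls(map_text):
    """Split the text of a mapping file into shard URLs (assumes one URL per line)."""
    return [line.strip() for line in map_text.splitlines() if line.strip()]


if __name__ == "__main__":
    for variant in VARIANTS:
        print(map_url("eu", variant))
```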
To download all available files of the raw, deduplicated or cleaned version with a single command, use the corresponding line below:
wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/one/monotext/hplt_monolingual_map_all_1.txt
wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/one/monotext/deduplicated/hplt_monolingual_map_all_1.2.txt
wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/one/monotext/cleaned/hplt_monolingual_map_cleaned_1.2.txt
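The options in these commands control where files land on disk: -x forces wget to recreate the directory hierarchy, -nH omits the host-name directory, and --cut-dirs=2 drops the first two remote directory components. The following sketch mimics that mapping so the resulting layout can be predicted before downloading. The function local_path is a hypothetical helper, and the shard URL in the usage example is made up for illustration; real shard names come from the mapping files.

```python
from urllib.parse import urlsplit


def local_path(url, cut_dirs=2, no_host=True):
    """Approximate the path wget -x writes to, given -nH and --cut-dirs."""
    parts = urlsplit(url)
    *dirs, filename = parts.path.lstrip("/").split("/")
    kept = dirs[cut_dirs:]                      # --cut-dirs drops leading directories
    prefix = [] if no_host else [parts.netloc]  # -nH drops the host-name directory
    return "/".join(prefix + kept + [filename])


if __name__ == "__main__":
    # Hypothetical shard URL, used only to show the effect of the options.
    example = "https://data.hplt-project.org/one/monotext/cleaned/eu/eu_1.jsonl.zst"
    print(local_path(example))
```

With the options used above, the two leading components one/monotext are removed, so files from different variants and languages end up in separate local directories.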
Afrikaans (af)
Source: CC/IA
RAW:
Docs: 4.36M
Words: 6.09B
DEDUPLICATED:
Docs: 1.37M
Words: 1.60B
CLEANED:
Docs: 747.23K
Words: 829.49M
Arabic (ar)
Source: CC/IA
RAW:
Docs: 197.49M
Words: 216.32B
DEDUPLICATED:
Docs: 46.64M
Words: 49.46B
CLEANED:
Docs: 26.80M
Words: 31.85B
Azerbaijani (az)
Source: CC/IA
RAW:
Docs: 10.45M
Words: 10.68B
DEDUPLICATED:
Docs: 3.00M
Words: 2.88B
CLEANED:
Docs: 1.10M
Words: 1.13B
Belarusian (be)
Source: CC/IA
RAW:
Docs: 3.91M
Words: 4.71B
DEDUPLICATED:
Docs: 1.26M
Words: 1.93B
CLEANED:
Docs: 356.53K
Words: 394.19M
Bulgarian (bg)
Source: CC/IA
RAW:
Docs: 58.27M
Words: 66.37B
DEDUPLICATED:
Docs: 13.34M
Words: 15.29B
CLEANED:
Docs: 6.50M
Words: 8.76B
Bangla (bn)
Source: CC/IA
RAW:
Docs: 23.35M
Words: 46.18B
DEDUPLICATED:
Docs: 5.97M
Words: 4.86B
CLEANED:
Docs: 2.88M
Words: 2.77B
Catalan (ca)
Source: CC/IA
RAW:
Docs: 24.17M
Words: 24.53B
DEDUPLICATED:
Docs: 7.79M
Words: 7.88B
CLEANED:
Docs: 4.54M
Words: 5.76B
Czech (cs)
Source: CC/IA
RAW:
Docs: 184.09M
Words: 192.48B
DEDUPLICATED:
Docs: 38.56M
Words: 36.38B
CLEANED:
Docs: 16.99M
Words: 19.11B
Source: CC/IA
RAW:
Docs: 725.32K
Words: 623.99M
DEDUPLICATED:
Docs: 285.05K
Words: 234.14M
CLEANED:
Docs: 111.25K
Words: 124.06M
Danish (da)
Source: CC/IA
RAW:
Docs: 686.47M
Words: 139.54B
DEDUPLICATED:
Docs: 23.58M
Words: 22.10B
CLEANED:
Docs: 8.18M
Words: 9.37B
German (de)
Source: CC/IA
RAW:
Docs: 1.02B
Words: 899.14B
DEDUPLICATED:
Docs: 226.47M
Words: 190.94B
CLEANED:
Docs: 101.41M
Words: 110.98B
Greek (el)
Source: CC/IA
RAW:
Docs: 134.47M
Words: 248.08B
DEDUPLICATED:
Docs: 30.63M
Words: 49.89B
CLEANED:
Docs: 15.83M
Words: 33.76B
English (en)
Source: CC/IA
RAW:
Docs: 12.68B
Words: 10.27T
DEDUPLICATED:
Docs: 1.78B
Words: 2.86T
CLEANED:
Docs: 1.02B
Words: 2.31T
Esperanto (eo)
Source: CC/IA
RAW:
Docs: 400.65K
Words: 295.87M
DEDUPLICATED:
Docs: 177.13K
Words: 152.63M
CLEANED:
Docs: 67.81K
Words: 101.70M
Spanish (es)
Source: CC/IA
RAW:
Docs: 671.98M
Words: 869.38B
DEDUPLICATED:
Docs: 201.06M
Words: 239.29B
CLEANED:
Docs: 129.29M
Words: 181.23B
Estonian (et)
Source: CC/IA
RAW:
Docs: 21.18M
Words: 21.75B
DEDUPLICATED:
Docs: 5.84M
Words: 6.57B
CLEANED:
Docs: 1.48M
Words: 1.74B
Basque (eu)
Source: CC/IA
RAW:
Docs: 2.29M
Words: 1.55B
DEDUPLICATED:
Docs: 1.01M
Words: 660.30M
CLEANED:
Docs: 343.95K
Words: 324.64M
Persian (fa)
Source: CC/IA
RAW:
Docs: 189.04M
Words: 318.13B
DEDUPLICATED:
Docs: 42.28M
Words: 57.52B
CLEANED:
Docs: 30.90M
Words: 47.58B
Finnish (fi)
Source: CC/IA
RAW:
Docs: 89.06M
Words: 80.59B
DEDUPLICATED:
Docs: 19.51M
Words: 19.88B
CLEANED:
Docs: 7.15M
Words: 9.04B
French (fr)
Source: CC/IA
RAW:
Docs: 660.32M
Words: 791.82B
DEDUPLICATED:
Docs: 175.76M
Words: 173.76B
CLEANED:
Docs: 99.59M
Words: 122.88B
Irish (ga)
Source: CC/IA
RAW:
Docs: 2.71M
Words: 1.43B
DEDUPLICATED:
Docs: 931.55K
Words: 519.91M
CLEANED:
Docs: 115.53K
Words: 130.68M
Galician (gl)
Source: CC/IA
RAW:
Docs: 4.60M
Words: 5.02B
DEDUPLICATED:
Docs: 1.79M
Words: 1.29B
CLEANED:
Docs: 731.36K
Words: 847.40M
Gujarati (gu)
Source: CC/IA
RAW:
Docs: 915.29K
Words: 1.06B
DEDUPLICATED:
Docs: 454.73K
Words: 430.50M
CLEANED:
Docs: 264.82K
Words: 303.63M
Serbo-Croatian (hbs)
Source: CC/IA
RAW:
Docs: 60.64M
Words: 69.45B
DEDUPLICATED:
Docs: 17.84M
Words: 17.94B
CLEANED:
Docs: 8.68M
Words: 10.03B
Hebrew (he)
Source: CC/IA
RAW:
Docs: 46.22M
Words: 61.85B
DEDUPLICATED:
Docs: 11.24M
Words: 14.45B
CLEANED:
Docs: 4.98M
Words: 7.49B
Hindi (hi)
Source: CC/IA
RAW:
Docs: 33.67M
Words: 41.21B
DEDUPLICATED:
Docs: 11.42M
Words: 14.13B
CLEANED:
Docs: 5.77M
Words: 7.54B
Hungarian (hu)
Source: CC/IA
RAW:
Docs: 137.33M
Words: 137.25B
DEDUPLICATED:
Docs: 28.49M
Words: 28.05B
CLEANED:
Docs: 11.71M
Words: 14.39B
Armenian (hy)
Source: CC/IA
RAW:
Docs: 3.99M
Words: 4.06B
DEDUPLICATED:
Docs: 1.36M
Words: 1.29B
CLEANED:
Docs: 621.47K
Words: 589.95M
Indonesian (id)
Source: CC/IA
RAW:
Docs: 125.73M
Words: 208.16B
DEDUPLICATED:
Docs: 45.77M
Words: 54.22B
CLEANED:
Docs: 31.42M
Words: 42.08B
Icelandic (is)
Source: CC/IA
RAW:
Docs: 3.86M
Words: 3.48B
DEDUPLICATED:
Docs: 1.44M
Words: 1.56B
CLEANED:
Docs: 481.33K
Words: 562.01M
Italian (it)
Source: CC/IA
RAW:
Docs: 337.44M
Words: 405.66B
DEDUPLICATED:
Docs: 96.50M
Words: 115.25B
CLEANED:
Docs: 53.53M
Words: 74.45B
Japanese (ja)
Source: CC/IA
RAW:
Docs: 679.79M
Words: 305.03B
DEDUPLICATED:
Docs: 218.85M
Words: 77.44B
CLEANED:
Docs: 190.41M
Words: 63.23B
Georgian (ka)
Source: CC/IA
RAW:
Docs: 6.46M
Words: 7.19B
DEDUPLICATED:
Docs: 1.67M
Words: 1.61B
CLEANED:
Docs: 533.07K
Words: 573.88M
Source: CC/IA
RAW:
Docs: 3.46M
Words: 2.53B
DEDUPLICATED:
Docs: 1.43M
Words: 1.03B
CLEANED:
Docs: 406.35K
Words: 471.76M
Kannada (kn)
Source: CC/IA
RAW:
Docs: 1.93M
Words: 2.10B
DEDUPLICATED:
Docs: 557.82K
Words: 492.00M
CLEANED:
Docs: 228.22K
Words: 235.58M
Korean (ko)
Source: CC/IA
RAW:
Docs: 248.06M
Words: 161.42B
DEDUPLICATED:
Docs: 44.46M
Words: 34.34B
CLEANED:
Docs: 31.85M
Words: 25.52B
Source: CC/IA
RAW:
Docs: 333.58K
Words: 263.05M
DEDUPLICATED:
Docs: 188.17K
Words: 152.93M
CLEANED:
Docs: 88.32K
Words: 101.62M
Source: CC/IA
RAW:
Docs: 20.54M
Words: 14.38B
DEDUPLICATED:
Docs: 4.81M
Words: 3.81B
CLEANED:
Docs: 301.70K
Words: 294.13M
Lithuanian (lt)
Source: CC/IA
RAW:
Docs: 32.35M
Words: 33.12B
DEDUPLICATED:
Docs: 7.40M
Words: 7.33B
CLEANED:
Docs: 2.72M
Words: 2.95B
Latvian (lv)
Source: CC/IA
RAW:
Docs: 21.15M
Words: 27.37B
DEDUPLICATED:
Docs: 5.12M
Words: 5.85B
CLEANED:
Docs: 1.54M
Words: 1.59B
Macedonian (mk)
Source: CC/IA
RAW:
Docs: 3.41M
Words: 3.24B
DEDUPLICATED:
Docs: 1.25M
Words: 1.07B
CLEANED:
Docs: 734.69K
Words: 736.55M
Malayalam (ml)
Source: CC/IA
RAW:
Docs: 2.19M
Words: 2.67B
DEDUPLICATED:
Docs: 1.13M
Words: 917.42M
CLEANED:
Docs: 469.98K
Words: 517.83M
Mongolian (mn)
Source: CC/IA
RAW:
Docs: 2.50M
Words: 2.49B
DEDUPLICATED:
Docs: 1.06M
Words: 1.05B
CLEANED:
Docs: 594.90K
Words: 803.21M
Marathi (mr)
Source: CC/IA
RAW:
Docs: 1.64M
Words: 1.91B
DEDUPLICATED:
Docs: 857.21K
Words: 812.46M
CLEANED:
Docs: 453.69K
Words: 519.55M
Malay (ms)
Source: CC/IA
RAW:
Docs: 28.32M
Words: 56.68B
DEDUPLICATED:
Docs: 8.36M
Words: 13.36B
CLEANED:
Docs: 4.87M
Words: 9.03B
Maltese (mt)
Source: CC/IA
RAW:
Docs: 926.22K
Words: 1.39B
DEDUPLICATED:
Docs: 484.42K
Words: 818.61M
CLEANED:
Docs: 111.12K
Words: 102.42M
Burmese (my)
Source: CC/IA
RAW:
Docs: 2.46M
Words: 4.41B
DEDUPLICATED:
Docs: 826.11K
Words: 1.14B
CLEANED:
Docs: 239.47K
Words: 357.11M
Norwegian Bokmål (nb)
Source: CC/IA
RAW:
Docs: 61.09M
Words: 55.51B
DEDUPLICATED:
Docs: 14.58M
Words: 16.26B
CLEANED:
Docs: 6.12M
Words: 8.30B
Source: CC/IA
RAW:
Docs: 2.11M
Words: 1.68B
DEDUPLICATED:
Docs: 1.36M
Words: 966.77M
CLEANED:
Docs: 863.35K
Words: 694.40M
Dutch (nl)
Source: CC/IA
RAW:
Docs: 234.44M
Words: 250.08B
DEDUPLICATED:
Docs: 66.62M
Words: 55.94B
CLEANED:
Docs: 31.75M
Words: 33.30B
Norwegian Nynorsk (nn)
Source: CC/IA
RAW:
Docs: 1.85M
Words: 1.64B
DEDUPLICATED:
Docs: 752.53K
Words: 615.68M
CLEANED:
Docs: 228.48K
Words: 298.57M
Punjabi (pa)
Source: CC/IA
RAW:
Docs: 2.40M
Words: 1.34B
DEDUPLICATED:
Docs: 888.47K
Words: 523.18M
CLEANED:
Docs: 152.78K
Words: 184.77M
Polish (pl)
Source: CC/IA
RAW:
Docs: 346.33M
Words: 366.18B
DEDUPLICATED:
Docs: 82.92M
Words: 76.27B
CLEANED:
Docs: 39.38M
Words: 44.17B
Source: CC/IA
RAW:
Docs: 313.15K
Words: 356.65M
DEDUPLICATED:
Docs: 142.66K
Words: 172.14M
CLEANED:
Docs: 88.21K
Words: 113.19M
Portuguese (pt)
Source: CC/IA
RAW:
Docs: 448.20M
Words: 607.05B
DEDUPLICATED:
Docs: 103.82M
Words: 121.75B
CLEANED:
Docs: 58.24M
Words: 81.41B
Romanian (ro)
Source: CC/IA
RAW:
Docs: 103.69M
Words: 144.82B
DEDUPLICATED:
Docs: 24.94M
Words: 28.92B
CLEANED:
Docs: 14.47M
Words: 19.49B
Russian (ru)
Source: CC/IA
RAW:
Docs: 1.53B
Words: 1.79T
DEDUPLICATED:
Docs: 397.27M
Words: 413.51B
CLEANED:
Docs: 224.20M
Words: 284.58B
Sinhala (si)
Source: CC/IA
RAW:
Docs: 1.37M
Words: 2.35B
DEDUPLICATED:
Docs: 563.99K
Words: 734.99M
CLEANED:
Docs: 322.51K
Words: 568.03M
Slovak (sk)
Source: CC/IA
RAW:
Docs: 90.43M
Words: 94.65B
DEDUPLICATED:
Docs: 13.99M
Words: 14.16B
CLEANED:
Docs: 4.62M
Words: 4.98B
Slovenian (sl)
Source: CC/IA
RAW:
Docs: 19.17M
Words: 22.85B
DEDUPLICATED:
Docs: 5.82M
Words: 6.72B
CLEANED:
Docs: 2.20M
Words: 2.51B
Source: CC/IA
RAW:
Docs: 676.04K
Words: 544.50M
DEDUPLICATED:
Docs: 374.71K
Words: 253.71M
CLEANED:
Docs: 283.71K
Words: 211.80M
Albanian (sq)
Source: CC/IA
RAW:
Docs: 9.18M
Words: 11.50B
DEDUPLICATED:
Docs: 3.22M
Words: 3.66B
CLEANED:
Docs: 1.24M
Words: 1.34B
Swedish (sv)
Source: CC/IA
RAW:
Docs: 96.13M
Words: 95.99B
DEDUPLICATED:
Docs: 29.96M
Words: 29.78B
CLEANED:
Docs: 13.67M
Words: 16.91B
Swahili (sw)
Source: CC/IA
RAW:
Docs: 2.17M
Words: 2.10B
DEDUPLICATED:
Docs: 983.51K
Words: 830.99M
CLEANED:
Docs: 698.57K
Words: 668.17M
Tamil (ta)
Source: CC/IA
RAW:
Docs: 5.46M
Words: 9.30B
DEDUPLICATED:
Docs: 2.47M
Words: 2.94B
CLEANED:
Docs: 1.24M
Words: 1.91B
Source: CC/IA
RAW:
Docs: 3.51M
Words: 2.43B
DEDUPLICATED:
Docs: 1.61M
Words: 1.03B
CLEANED:
Docs: 415.60K
Words: 437.74M
Thai (th)
Source: CC/IA
RAW:
Docs: 95.27M
Words: 78.63B
DEDUPLICATED:
Docs: 29.48M
Words: 16.43B
CLEANED:
Docs: 8.19M
Words: 4.33B
Filipino (tl)
Source: CC/IA
RAW:
Docs: 4.97M
Words: 7.07B
DEDUPLICATED:
Docs: 1.20M
Words: 1.63B
CLEANED:
Docs: 585.24K
Words: 911.06M
Turkish (tr)
Source: CC/IA
RAW:
Docs: 215.38M
Words: 238.30B
DEDUPLICATED:
Docs: 59.43M
Words: 64.92B
CLEANED:
Docs: 27.05M
Words: 42.65B
Source: CC/IA
RAW:
Docs: 368.35K
Words: 261.38M
DEDUPLICATED:
Docs: 172.47K
Words: 134.18M
CLEANED:
Docs: 65.15K
Words: 74.86M
Ukrainian (uk)
Source: CC/IA
RAW:
Docs: 47.12M
Words: 52.95B
DEDUPLICATED:
Docs: 17.86M
Words: 18.19B
CLEANED:
Docs: 9.31M
Words: 10.57B
Urdu (ur)
Source: CC/IA
RAW:
Docs: 6.09M
Words: 6.59B
DEDUPLICATED:
Docs: 2.23M
Words: 2.02B
CLEANED:
Docs: 1.44M
Words: 1.42B
Source: CC/IA
RAW:
Docs: 1.37M
Words: 1.11B
DEDUPLICATED:
Docs: 633.22K
Words: 556.45M
CLEANED:
Docs: 290.29K
Words: 367.25M
Vietnamese (vi)
Source: CC/IA
RAW:
Docs: 174.17M
Words: 287.20B
DEDUPLICATED:
Docs: 40.10M
Words: 59.20B
CLEANED:
Docs: 31.50M
Words: 49.36B
Chinese (zh)
Source: CC/IA
RAW:
Docs: 6.91B
Words: 1.79T
DEDUPLICATED:
Docs: 1.20B
Words: 482.86B
CLEANED:
Docs: 1.08B
Words: 432.88B
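The figures in the listing above use K, M, B and T suffixes. To compare stages numerically, for instance to see what share of raw documents survives cleaning, the abbreviated values can be converted to integers. The sketch below assumes the suffixes mean thousand, million, billion (10^9) and trillion (10^12); parse_count and retention are hypothetical helper names. Because the published figures are rounded, the results are approximate.

```python
from decimal import Decimal

# Assumed meaning of the suffixes used in the statistics.
SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_count(text):
    """Convert an abbreviated figure such as '4.36M' into an integer."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        # Decimal avoids binary floating-point rounding of values like 4.36.
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(Decimal(text))


def retention(raw, cleaned):
    """Fraction of the raw figure that remains in the cleaned figure."""
    return parse_count(cleaned) / parse_count(raw)


if __name__ == "__main__":
    # Afrikaans document counts from the listing: RAW 4.36M, CLEANED 747.23K.
    print(f"{retention('4.36M', '747.23K'):.1%}")
```

For the Afrikaans document counts, this gives a retention of roughly 17%.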
Data release 1.2 (December 2023)
There are 18 language pairs in this release. The parallel corpus contains over 96 million clean and unique sentence pairs and covers over 1.4 billion English tokens. The corpora are provided in raw, TMX and TXT formats, all compressed. They have been highly curated, deduplicated and filtered using the full Bitextor pipeline. An anonymized (ROAM) version of the TMX is also provided.
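TMX is an XML format in which each translation unit (tu) holds one variant (tuv) per language, with the text in a seg element. The sketch below streams translation units from such a file using only the Python standard library. It is an illustration under assumptions: iter_translation_units is a hypothetical helper, the sample document is made up, and the released TMX files may carry additional headers and properties that this sketch ignores. It also assumes the file has already been decompressed.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample in the general TMX layout, for illustration only.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning</seg></tuv>
      <tuv xml:lang="eu"><seg>Egun on</seg></tuv>
    </tu>
  </body>
</tmx>"""


def iter_translation_units(stream):
    """Yield one {language: segment text} dict per <tu> element."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                unit[tuv.get(XML_LANG)] = "".join(seg.itertext())
        yield unit
        elem.clear()  # release the parsed unit to keep memory use low


if __name__ == "__main__":
    for unit in iter_translation_units(io.BytesIO(SAMPLE_TMX)):
        print(unit)
```

Streaming with iterparse rather than loading the whole tree is a design choice aimed at large files; for small files, parsing the document in one go would work equally well.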
Arabic (ar) - English (en)
Source: CC/IA
TMX:
Words: 240M
Translation units: 15M
ROAM:
Words: 173.70M
Translation units: 11.87M
RAW:
Words: 33.20B
Translation units: 1.55B
Bosnian (bs) - English (en)
Source: CC/IA
TMX:
Words: 2.8M
Translation units: 241K
ROAM:
Words: 280.86K
Translation units: 212.18K
RAW:
Words: 521.63M
Translation units: 27.00M
Catalan (ca) - English (en)
Source: CC/IA
TMX:
Words: 142M
Translation units: 9.0M
ROAM:
Words: 125.90M
Translation units: 8.05M
RAW:
Words: 8.03B
Translation units: 402.49M
English (en) - Estonian (et)
Source: CC/IA
TMX:
Words: 96M
Translation units: 6.1M
ROAM:
Words: 86.59M
Translation units: 5.51M
RAW:
Words: 15.48B
Translation units: 865.43M
English (en) - Basque (eu)
Source: CC/IA
TMX:
Words: 10M
Translation units: 611K
ROAM:
Words: 8.41M
Translation units: 521.10K
RAW:
Words: 400.26M
Translation units: 20.83M
English (en) - Finnish (fi)
Source: CC/IA
TMX:
Words: 339M
Translation units: 26M
ROAM:
Words: 303M
Translation units: 23M
RAW:
Words: 65.31B
Translation units: 3.83B
English (en) - Irish (ga)
Source: CC/IA
TMX:
Words: 17M
Translation units: 995K
ROAM:
Words: 15.16M
Translation units: 920.01K
RAW:
Words: 2.01B
Translation units: 101.00M
English (en) - Galician (gl)
Source: CC/IA
TMX:
Words: 14M
Translation units: 1.1M
ROAM:
Words: 11.63M
Translation units: 930.40K
RAW:
Words: 1.02B
Translation units: 56.10M
English (en) - Hindi (hi)
Source: CC/IA
TMX:
Words: 166M
Translation units: 13M
ROAM:
Words: 137.38M
Translation units: 10.49M
RAW:
Words: 19.25B
Translation units: 1.04B
English (en) - Croatian (hr)
Source: CC/IA
TMX:
Words: 139M
Translation units: 9.4M
ROAM:
Words: 125.99M
Translation units: 8.61M
RAW:
Words: 16.57B
Translation units: 895.79M
English (en) - Icelandic (is)
Source: CC/IA
TMX:
Words: 30M
Translation units: 2.2M
ROAM:
Words: 26.27M
Translation units: 1.91M
RAW:
Words: 3.27B
Translation units: 170.42M
English (en) - Macedonian (mk)
Source: CC/IA
TMX:
Words: 19M
Translation units: 1.1M
ROAM:
Words: 15.55M
Translation units: 995.10K
RAW:
Words: 1.87B
Translation units: 91.29M
English (en) - Maltese (mt)
Source: CC/IA
TMX:
Words: 19M
Translation units: 855K
ROAM:
Words: 17.14M
Translation units: 752.06K
RAW:
Words: 2.82B
Translation units: 135.10M
English (en) - Norwegian Nynorsk (nn)
Source: CC/IA
TMX:
Words: 2.1M
Translation units: 133K
ROAM:
Words: 1.58M
Translation units: 108.30K
RAW:
Words: 496.50M
Translation units: 28.70M
English (en) - Albanian (sq)
Source: CC/IA
TMX:
Words: 26M
Translation units: 1.7M
ROAM:
Words: 20.50M
Translation units: 1.35M
RAW:
Words: 5.82B
Translation units: 253.10M
English (en) - Serbian (sr)
Source: CC/IA
TMX:
Words: 56M
Translation units: 4.0M
ROAM:
Words: 49.63M
Translation units: 3.62M
RAW:
Words: 14.25B
Translation units: 247.56M
English (en) - Swahili (sw)
Source: CC/IA
TMX:
Words: 21M
Translation units: 1.8M
ROAM:
Words: 15.03M
Translation units: 1.38M
RAW:
Words: 5.75B
Translation units: 247.56M
English (en) - Traditional Chinese (zh-hant)
Source: CC/IA
TMX:
Words: 85M
Translation units: 5.4M
ROAM:
Words: 73M
Translation units: 4.6M
RAW:
Words: 9.16B
Translation units: 530.12M
These data are released under this licensing scheme:
Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.
It is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.