Monolingual
Data release 1.1 (October 2023)
There are 75 languages in this release (7.6 TB in total size) provided as JSONL files compressed with zstd. For convenience, data is split into multiple shards, a few GB each. The number of shards per language depends on the size of the specific corpus.
Each line of a JSONL file is a valid JSON object holding one full document together with its metadata. For example:
{"id":1, "document_lang":"en", "scores":["0.76","0.76","0.76"], "langs":["en","en","en"], "text":"this is paragraph1\nthis is paragraph2\nthis is paragraph3", "url":"url1", "collection":"collection-1" }
{"id":2, "document_lang":"en", "scores":["0.65",...], "langs":["en",...], "text":"another paragraph\n...", ...
In each document, the paragraphs are joined with newline separators. langs and scores are lists with one entry per paragraph, giving the identified language and the monocleaner score of that paragraph.
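As an illustration of how the three per-paragraph fields line up, here is a minimal Python sketch that splits each document's text on newlines and pairs every paragraph with its language and score. The function name iter_paragraphs is hypothetical, the sample record is the one shown above, and converting the score strings to floats is an assumption about how you may want to use them.

```python
import json


def iter_paragraphs(jsonl_lines):
    """Yield (doc_id, paragraph, lang, score) for every paragraph of every document."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        doc = json.loads(line)
        paragraphs = doc["text"].split("\n")
        langs, scores = doc["langs"], doc["scores"]
        # The three sequences are expected to have one entry per paragraph.
        if not (len(paragraphs) == len(langs) == len(scores)):
            raise ValueError(f"document {doc['id']}: per-paragraph fields differ in length")
        for paragraph, lang, score in zip(paragraphs, langs, scores):
            # Scores are stored as strings in the example record.
            yield doc["id"], paragraph, lang, float(score)


# Sample record copied from the format description (not real corpus data).
sample = [
    '{"id":1, "document_lang":"en", "scores":["0.76","0.76","0.76"], '
    '"langs":["en","en","en"], '
    '"text":"this is paragraph1\\nthis is paragraph2\\nthis is paragraph3", '
    '"url":"url1", "collection":"collection-1" }'
]

rows = list(iter_paragraphs(sample))
for row in rows:
    print(row)
```

Because the function accepts any iterable of lines, a decompressed shard can be streamed into it, for instance by piping the output of zstdcat into a script that passes sys.stdin to iter_paragraphs.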
How to download it?
The simplest way to download the data is to run wget -i on a language-specific mapping file, which lists the full URLs of all shards for that language. For example:
wget -i https://data.hplt-project.org/one/monotext/deduplicated/eu_map.txt
Full download
To download the entire release (about 7.6 TB) with a single command, use
wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/one/monotext/deduplicated/hplt_monolingual_map_all_1.2.txt
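The flags in this command control where files land on disk: -x recreates the remote directory hierarchy, -nH leaves out the host-name directory, and --cut-dirs=2 drops the first two remote directory components. The Python sketch below only mimics that path computation so you can preview the resulting layout. The function name wget_local_path and the shard URL are hypothetical; real shard names come from the mapping files.

```python
from urllib.parse import urlparse


def wget_local_path(url, cut_dirs=2):
    """Approximate the local path used by `wget -nH -x --cut-dirs=N` for a URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    dirs, filename = parts[:-1], parts[-1]
    # -nH: no host directory; --cut-dirs: skip the leading remote directories.
    return "/".join(dirs[cut_dirs:] + [filename])


# Hypothetical shard URL, used only to show the effect of the flags.
example = "https://data.hplt-project.org/one/monotext/deduplicated/xx/xx_1.jsonl.zst"
print(wget_local_path(example))  # deduplicated/xx/xx_1.jsonl.zst
```

With these flags the one/monotext prefix is removed, so all shards end up under a local deduplicated/ directory.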
Source: CC/IA
RAW:
Docs: 4.36M
Words: 6.09B
DEDUPLICATED:
Docs: 1.37M
Words: 1.87B
Source: CC/IA
RAW:
Docs: 197.49M
Words: 216.32B
DEDUPLICATED:
Docs: 46.64M
Words: 51.53B
Source: CC/IA
RAW:
Docs: 10.45M
Words: 10.68B
DEDUPLICATED:
Docs: 3.00M
Words: 2.94B
Source: CC/IA
RAW:
Docs: 3.91M
Words: 4.71B
DEDUPLICATED:
Docs: 1.26M
Words: 1.39B
Source: CC/IA
RAW:
Docs: 58.27M
Words: 66.37B
DEDUPLICATED:
Docs: 13.34M
Words: 15.48B
Source: CC/IA
RAW:
Docs: 23.35M
Words: 46.18B
DEDUPLICATED:
Docs: 5.97M
Words: 8.23B
Source: CC/IA
RAW:
Docs: 24.17M
Words: 24.53B
DEDUPLICATED:
Docs: 7.79M
Words: 7.88B
Source: CC/IA
RAW:
Docs: 184.09M
Words: 192.48B
DEDUPLICATED:
Docs: 38.56M
Words: 41.04B
Source: CC/IA
RAW:
Docs: 725.32K
Words: 623.99M
DEDUPLICATED:
Docs: 285.05K
Words: 243.81M
Source: CC/IA
RAW:
Docs: 686.47M
Words: 139.54B
DEDUPLICATED:
Docs: 23.58M
Words: 22.10B
Source: CC/IA
RAW:
Docs: 1.02B
Words: 899.14B
DEDUPLICATED:
Docs: 226.47M
Words: 193.11B
Source: CC/IA
RAW:
Docs: 134.47M
Words: 248.08B
DEDUPLICATED:
Docs: 30.63M
Words: 52.24B
Source: CC/IA
RAW:
Docs: 12.68B
Words: 10.27T
DEDUPLICATED:
Docs: 1.78B
Words: 2.86T
Source: CC/IA
RAW:
Docs: 400.65K
Words: 295.87M
DEDUPLICATED:
Docs: 177.13K
Words: 147.06M
Source: CC/IA
RAW:
Docs: 671.98M
Words: 869.38B
DEDUPLICATED:
Docs: 201.06M
Words: 250.08B
Source: CC/IA
RAW:
Docs: 21.18M
Words: 21.75B
DEDUPLICATED:
Docs: 5.84M
Words: 5.85B
Source: CC/IA
RAW:
Docs: 2.29M
Words: 1.55B
DEDUPLICATED:
Docs: 1.01M
Words: 717.14M
Source: CC/IA
RAW:
Docs: 189.04M
Words: 318.13B
DEDUPLICATED:
Docs: 42.28M
Words: 70.38B
Source: CC/IA
RAW:
Docs: 89.06M
Words: 80.59B
DEDUPLICATED:
Docs: 19.51M
Words: 16.85B
Source: CC/IA
RAW:
Docs: 660.32M
Words: 791.82B
DEDUPLICATED:
Docs: 175.76M
Words: 213.84B
Source: CC/IA
RAW:
Docs: 2.71M
Words: 1.43B
DEDUPLICATED:
Docs: 931.55K
Words: 520.10M
Source: CC/IA
RAW:
Docs: 4.60M
Words: 5.02B
DEDUPLICATED:
Docs: 1.79M
Words: 1.29B
Source: CC/IA
RAW:
Docs: 915.29K
Words: 1.06B
DEDUPLICATED:
Docs: 454.73K
Words: 473.46M
Source: CC/IA
RAW:
Docs: 60.64M
Words: 69.45B
DEDUPLICATED:
Docs: 17.84M
Words: 19.40B
Source: CC/IA
RAW:
Docs: 46.22M
Words: 61.85B
DEDUPLICATED:
Docs: 11.24M
Words: 15.45B
Source: CC/IA
RAW:
Docs: 33.67M
Words: 41.21B
DEDUPLICATED:
Docs: 11.42M
Words: 13.65B
Source: CC/IA
RAW:
Docs: 137.33M
Words: 137.25B
DEDUPLICATED:
Docs: 28.49M
Words: 26.54B
Source: CC/IA
RAW:
Docs: 3.99M
Words: 4.06B
DEDUPLICATED:
Docs: 1.36M
Words: 1.31B
Source: CC/IA
RAW:
Docs: 125.73M
Words: 208.16B
DEDUPLICATED:
Docs: 45.77M
Words: 69.22B
Source: CC/IA
RAW:
Docs: 3.86M
Words: 3.48B
DEDUPLICATED:
Docs: 1.44M
Words: 1.34B
Source: CC/IA
RAW:
Docs: 337.44M
Words: 405.66B
DEDUPLICATED:
Docs: 96.50M
Words: 114.44B
Source: CC/IA
RAW:
Docs: 679.79M
Words: 305.03B
DEDUPLICATED:
Docs: 218.85M
Words: 91.41B
Source: CC/IA
RAW:
Docs: 6.46M
Words: 7.19B
DEDUPLICATED:
Docs: 1.67M
Words: 1.68B
Source: CC/IA
RAW:
Docs: 3.46M
Words: 2.53B
DEDUPLICATED:
Docs: 1.43M
Words: 1.01B
Source: CC/IA
RAW:
Docs: 1.93M
Words: 2.10B
DEDUPLICATED:
Docs: 557.82K
Words: 577.95M
Source: CC/IA
RAW:
Docs: 248.06M
Words: 161.42B
DEDUPLICATED:
Docs: 44.46M
Words: 29.00B
Source: CC/IA
RAW:
Docs: 333.58K
Words: 263.05M
DEDUPLICATED:
Docs: 188.17K
Words: 146.44M
Source: CC/IA
RAW:
Docs: 20.54M
Words: 14.38B
DEDUPLICATED:
Docs: 4.81M
Words: 3.32B
Source: CC/IA
RAW:
Docs: 32.35M
Words: 33.12B
DEDUPLICATED:
Docs: 7.40M
Words: 7.78B
Source: CC/IA
RAW:
Docs: 21.15M
Words: 27.37B
DEDUPLICATED:
Docs: 5.12M
Words: 5.39B
Source: CC/IA
RAW:
Docs: 3.41M
Words: 3.24B
DEDUPLICATED:
Docs: 1.25M
Words: 1.11B
Source: CC/IA
RAW:
Docs: 2.19M
Words: 2.67B
DEDUPLICATED:
Docs: 1.13M
Words: 1.05B
Source: CC/IA
RAW:
Docs: 2.50M
Words: 2.49B
DEDUPLICATED:
Docs: 1.06M
Words: 1.09B
Source: CC/IA
RAW:
Docs: 1.64M
Words: 1.91B
DEDUPLICATED:
Docs: 857.21K
Words: 911.64M
Source: CC/IA
RAW:
Docs: 28.32M
Words: 56.68B
DEDUPLICATED:
Docs: 8.36M
Words: 16.50B
Source: CC/IA
RAW:
Docs: 926.22K
Words: 1.39B
DEDUPLICATED:
Docs: 484.42K
Words: 818.61M
Source: CC/IA
RAW:
Docs: 2.46M
Words: 4.41B
DEDUPLICATED:
Docs: 826.11K
Words: 1.25B
Source: CC/IA
RAW:
Docs: 61.09M
Words: 55.51B
DEDUPLICATED:
Docs: 14.58M
Words: 13.90B
Source: CC/IA
RAW:
Docs: 2.11M
Words: 1.68B
DEDUPLICATED:
Docs: 1.36M
Words: 1.06B
Source: CC/IA
RAW:
Docs: 234.44M
Words: 250.08B
DEDUPLICATED:
Docs: 66.62M
Words: 71.75B
Source: CC/IA
RAW:
Docs: 1.85M
Words: 1.64B
DEDUPLICATED:
Docs: 752.53K
Words: 640.11M
Source: CC/IA
RAW:
Docs: 2.40M
Words: 1.34B
DEDUPLICATED:
Docs: 888.47K
Words: 532.71M
Source: CC/IA
RAW:
Docs: 346.33M
Words: 366.18B
DEDUPLICATED:
Docs: 82.92M
Words: 85.41B
Source: CC/IA
RAW:
Docs: 313.15K
Words: 356.65M
DEDUPLICATED:
Docs: 142.66K
Words: 167.53M
Source: CC/IA
RAW:
Docs: 448.20M
Words: 607.05B
DEDUPLICATED:
Docs: 103.82M
Words: 132.09B
Source: CC/IA
RAW:
Docs: 103.69M
Words: 144.82B
DEDUPLICATED:
Docs: 24.94M
Words: 32.80B
Source: CC/IA
RAW:
Docs: 1.53B
Words: 1.79T
DEDUPLICATED:
Docs: 397.27M
Words: 492.53B
Source: CC/IA
RAW:
Docs: 1.37M
Words: 2.35B
DEDUPLICATED:
Docs: 563.99K
Words: 917.69M
Source: CC/IA
RAW:
Docs: 90.43M
Words: 94.65B
DEDUPLICATED:
Docs: 13.99M
Words: 13.95B
Source: CC/IA
RAW:
Docs: 19.17M
Words: 22.85B
DEDUPLICATED:
Docs: 5.82M
Words: 7.04B
Source: CC/IA
RAW:
Docs: 676.04K
Words: 544.50M
DEDUPLICATED:
Docs: 374.71K
Words: 299.68M
Source: CC/IA
RAW:
Docs: 9.18M
Words: 11.50B
DEDUPLICATED:
Docs: 3.22M
Words: 3.78B
Source: CC/IA
RAW:
Docs: 96.13M
Words: 95.99B
DEDUPLICATED:
Docs: 29.96M
Words: 28.26B
Source: CC/IA
RAW:
Docs: 2.17M
Words: 2.10B
DEDUPLICATED:
Docs: 983.51K
Words: 910.98M
Source: CC/IA
RAW:
Docs: 5.46M
Words: 9.30B
DEDUPLICATED:
Docs: 2.47M
Words: 3.87B
Source: CC/IA
RAW:
Docs: 3.51M
Words: 2.43B
DEDUPLICATED:
Docs: 1.61M
Words: 1.06B
Source: CC/IA
RAW:
Docs: 95.27M
Words: 78.63B
DEDUPLICATED:
Docs: 29.48M
Words: 21.99B
Source: CC/IA
RAW:
Docs: 4.97M
Words: 7.07B
DEDUPLICATED:
Docs: 1.20M
Words: 1.63B
Source: CC/IA
RAW:
Docs: 215.38M
Words: 238.30B
DEDUPLICATED:
Docs: 59.43M
Words: 64.92B
Source: CC/IA
RAW:
Docs: 368.35K
Words: 261.38M
DEDUPLICATED:
Docs: 172.47K
Words: 134.18M
Source: CC/IA
RAW:
Docs: 47.12M
Words: 52.95B
DEDUPLICATED:
Docs: 17.86M
Words: 18.19B
Source: CC/IA
RAW:
Docs: 6.09M
Words: 6.59B
DEDUPLICATED:
Docs: 2.23M
Words: 2.02B
Source: CC/IA
RAW:
Docs: 1.37M
Words: 1.11B
DEDUPLICATED:
Docs: 633.22K
Words: 556.45M
Source: CC/IA
RAW:
Docs: 174.17M
Words: 287.20B
DEDUPLICATED:
Docs: 40.10M
Words: 64.97B
Source: CC/IA
RAW:
Docs: 6.91B
Words: 1.79T
DEDUPLICATED:
Docs: 1.20B
Words: 334.58B
Bilingual
Data release 1.1 (October 2023)
There are 18 language pairs in this release. The parallel corpus contains over 96 million clean and unique sentence pairs and covers over 1.4 billion English tokens. The corpora are provided as compressed TMX and TXT files derived from the raw files published in the previous release. They have been curated, de-duplicated and filtered using the full Bitextor pipeline.
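For readers unfamiliar with TMX, the following minimal Python sketch shows how translation units can be read from a decompressed TMX document with the standard library. It assumes only the generic TMX structure (tu elements containing tuv elements with an xml:lang attribute and a seg child). The sample text and the language codes in it are made up for illustration, and the released files may carry additional metadata that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_translation_units(tmx_root):
    """Yield one {language: segment text} dict per <tu> element."""
    for tu in tmx_root.iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline markup.
            unit[tuv.get(XML_LANG)] = "".join(seg.itertext()) if seg is not None else ""
        yield unit


# Made-up sample, not taken from the released corpora.
sample_tmx = """<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="en"><seg>Hello world</seg></tuv>
  <tuv xml:lang="fi"><seg>Hei maailma</seg></tuv>
</tu>
</body></tmx>"""

units = list(iter_translation_units(ET.fromstring(sample_tmx)))
print(units)
```

For large files, ET.iterparse can be used instead of loading the whole document, clearing each tu element after it has been processed.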
Source: CC/IA
Stats:
Src lang tokens: 29.29M
Translation units: 14.65M
Source: CC/IA
Stats:
Src lang tokens: 2.64M
Translation units: 240.01K
Source: CC/IA
Stats:
Src lang tokens: 149.86M
Translation units: 8.91M
Source: CC/IA
Stats:
Src lang tokens: 95.94M
Translation units: 6.09M
Source: CC/IA
Stats:
Src lang tokens: 1.22M
Translation units: 610.69K
Source: CC/IA
Stats:
Src lang tokens: 50.35M
Translation units: 25.18M
Source: CC/IA
Stats:
Src lang tokens: 16.33M
Translation units: 994.75K
Source: CC/IA
Stats:
Src lang tokens: 2.13M
Translation units: 1.06M
Source: CC/IA
Stats:
Src lang tokens: 165.14M
Translation units: 12.04M
Source: CC/IA
Stats:
Src lang tokens: 138.36M
Translation units: 9.31M
Source: CC/IA
Stats:
Src lang tokens: 4.30M
Translation units: 2.15M
Source: CC/IA
Stats:
Src lang tokens: 2.28M
Translation units: 1.14M
Source: CC/IA
Stats:
Src lang tokens: 1.71M
Translation units: 854.82K
Source: CC/IA
Stats:
Src lang tokens: 265.08K
Translation units: 132.54K
Source: CC/IA
Stats:
Src lang tokens: 3.31M
Translation units: 1.66M
Source: CC/IA
Stats:
Src lang tokens: 55.12M
Translation units: 3.90M
Source: CC/IA
Stats:
Src lang tokens: 20.04M
Translation units: 1.71M
Source: CC/IA
Stats:
Src lang tokens: 84.11M
Translation units: 5.31M