High Performance
Language Technologies

HPLT Datasets v1.2

This version elaborates on version 1.0, which consisted mostly of raw plain-text data. The previous version, v1.1, has been deprecated.

What's new in v1.2

  • We fixed a bug found in the de-duplication algorithm for monolingual data in version 1.1. Both monolingual and bilingual datasets are now correctly deduplicated, and we recovered a lot of monolingual data!
  • Further cleaning has been applied to the monolingual datasets. Full documents have been filtered following 5 criteria: the URL is in the UT1 blocklist of adult sites; the average number of words per segment is less than 5; the document contains fewer than 200 characters; the document contains fewer than 5 segments (lines); and fewer than 20% of the segments in the document share the language identified at document level.
  • The bilingual datasets have now been anonymized using a blend of regular expressions and NER on the English side, as implemented in BiROAMer.
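
The five document-level cleaning criteria listed above can be pictured with a short Python sketch. This is not the project's actual cleaning code: the function name keep_document, the blocked_hosts set standing in for the UT1 blocklist, whitespace-based word counting, and the choice to drop a document as soon as any single criterion fires are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Thresholds taken from the release notes above.
MIN_AVG_WORDS_PER_SEGMENT = 5
MIN_CHARACTERS = 200
MIN_SEGMENTS = 5
MIN_LANG_SHARE = 0.20


def keep_document(doc: dict, blocked_hosts: set) -> bool:
    """Return True if a document passes all five checks.

    Assumption: a document is dropped as soon as one criterion fires.
    `blocked_hosts` is a hypothetical stand-in for the UT1 adult blocklist.
    """
    segments = doc["text"].split("\n")
    langs = doc["langs"]

    # 1. URL host is on the blocklist.
    host = urlparse(doc["url"]).hostname or ""
    if host in blocked_hosts:
        return False

    # 2. Average words per segment (assumption: whitespace tokenization).
    avg_words = sum(len(s.split()) for s in segments) / len(segments)
    if avg_words < MIN_AVG_WORDS_PER_SEGMENT:
        return False

    # 3. Too few characters in the whole document.
    if len(doc["text"]) < MIN_CHARACTERS:
        return False

    # 4. Too few segments (lines).
    if len(segments) < MIN_SEGMENTS:
        return False

    # 5. Too few segments agree with the document-level language.
    if not langs:
        return False
    matching = sum(1 for lang in langs if lang == doc["document_lang"])
    if matching / len(langs) < MIN_LANG_SHARE:
        return False

    return True
```

Whitespace splitting is a poor proxy for word counts in languages written without spaces, so a real pipeline would need language-aware tokenization for this check.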

For further information about how these datasets were produced or if you use them, please read and cite "A New Massive Multilingual Dataset for High-Performance Language Technologies".

Monolingual

Data release 1.2 (December 2023)

There are 75 languages in this release (22 TB of raw files, 11 TB of deduped files and 8.4 TB of clean files) provided as JSONL files compressed with zstd. For convenience, data is split into multiple shards, a few GB each. The number of shards per language depends on the size of the specific corpus.

The format is JSONL, where each line is a valid JSON value holding a full document with its metadata. For example (wrapped here for readability):

{"id":1, "document_lang":"en",
"scores":["0.76","0.76","0.76"],
"langs":["en","en","en"],
"text":"this is paragraph1\nthis is paragraph2\nthis is paragraph3",
"url":"url1", "collection":"collection-1"
}
{"id":2, "document_lang":"en",
"scores":["0.65",...],
"langs":["en",...],
"text":"another paragraph\n...",
...

In each document, the paragraphs are joined with newline separators. langs and scores are lists containing one entry per paragraph, giving the identified language and the monocleaner score of each one.
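
As a minimal sketch of how this layout can be consumed, the Python below parses JSONL lines and pairs each paragraph with its language and score. The function names iter_documents and iter_paragraphs are hypothetical, and the sample line simply reuses the field values from the example above; real shards would first need to be decompressed.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed document per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_paragraphs(doc: dict) -> Iterator[Tuple[str, str, float]]:
    """Yield (paragraph, language, score) triples for one document."""
    paragraphs = doc["text"].split("\n")
    for paragraph, lang, score in zip(paragraphs, doc["langs"], doc["scores"]):
        # Scores appear as strings in the example, so convert them.
        yield paragraph, lang, float(score)


# One document on a single line, as it would appear in a shard.
sample = (
    '{"id":1, "document_lang":"en", '
    '"scores":["0.76","0.76","0.76"], "langs":["en","en","en"], '
    '"text":"this is paragraph1\\nthis is paragraph2\\nthis is paragraph3", '
    '"url":"url1", "collection":"collection-1"}'
)

for doc in iter_documents([sample]):
    for paragraph, lang, score in iter_paragraphs(doc):
        print(doc["id"], lang, score, paragraph)
```

To process a downloaded shard, one option is to decompress it on the fly with the zstd command-line tool (zstd -dc) and pass sys.stdin to iter_documents, which avoids holding a whole shard in memory.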

How to download it?

The simplest way to download the data is to use wget -i with a language-specific mapping file, which contains the full URLs of all shards for that language. For example, for Basque (eu):

wget -i https://data.hplt-project.org/one/monotext/eu_map.txt

wget -i https://data.hplt-project.org/one/monotext/deduplicated/eu_map.txt

wget -i https://data.hplt-project.org/one/monotext/cleaned/eu_map.txt
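
For scripted downloads, the same mapping files can be located and read from Python. The sketch below assumes that every language follows the URL pattern of the three Basque examples above and that a mapping file is plain text with one URL per line; the names map_url, parse_map and fetch_shard_urls are hypothetical helpers, not part of any HPLT tooling.

```python
from urllib.request import urlopen

BASE_URL = "https://data.hplt-project.org/one/monotext"

# Sub-path per variant, inferred from the example commands above.
VARIANTS = {"raw": "", "deduplicated": "deduplicated/", "cleaned": "cleaned/"}


def map_url(lang: str, variant: str = "cleaned") -> str:
    """Build the mapping-file URL for a language code and data variant."""
    return f"{BASE_URL}/{VARIANTS[variant]}{lang}_map.txt"


def parse_map(text: str) -> list:
    """Return the shard URLs listed in a mapping file, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_shard_urls(lang: str, variant: str = "cleaned") -> list:
    """Download a mapping file and return its shard URLs (needs network)."""
    with urlopen(map_url(lang, variant)) as response:
        return parse_map(response.read().decode("utf-8"))
```

Calling fetch_shard_urls("eu", "cleaned") would return the list that wget -i iterates over, which makes it possible to pick a subset of shards before downloading.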

Full download

If you want to download all the available files of the raw, deduplicated or cleaned version in one go, use the corresponding command:

wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/one/monotext/hplt_monolingual_map_all_1.txt

wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/one/monotext/deduplicated/hplt_monolingual_map_all_1.2.txt

wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/one/monotext/cleaned/hplt_monolingual_map_cleaned_1.2.txt

Afrikaans (af)

Creative Commons CC0 license
HPLT Analytics report
raw 5.0G
dedup 2.2G
cleaned 1.1G

Source: CC/IA

RAW:

Docs: 4.36M

Words: 6.09B

DEDUPLICATED:

Docs: 1.37M

Words: 1.60B

CLEANED:

Docs: 747.23k

Words: 829.49M

Arabic (ar)

Creative Commons CC0 license
HPLT Analytics report
raw 174G
dedup 78G
cleaned 52G

Source: CC/IA

RAW:

Docs: 197.49M

Words: 216.32B

DEDUPLICATED:

Docs: 46.64M

Words: 49.46B

CLEANED:

Docs: 26.80M

Words: 31.85B

Azerbaijani (az)

Creative Commons CC0 license
HPLT Analytics report
raw 8.8G
dedup 4.0G
cleaned 1.7G

Source: CC/IA

RAW:

Docs: 10.45M

Words: 10.68B

DEDUPLICATED:

Docs: 3.00M

Words: 2.88B

CLEANED:

Docs: 1.10M

Words: 1.13B

Belarusian (be)

Creative Commons CC0 license
HPLT Analytics report
raw 5.7G
dedup 3.5G
cleaned 865M

Source: CC/IA

RAW:

Docs: 3.91M

Words: 4.71B

DEDUPLICATED:

Docs: 1.26M

Words: 1.93B

CLEANED:

Docs: 356.53k

Words: 394.19M

Bulgarian (bg)

Creative Commons CC0 license
HPLT Analytics report
raw 50G
dedup 24G
cleaned 15G

Source: CC/IA

RAW:

Docs: 58.27M

Words: 66.37B

DEDUPLICATED:

Docs: 13.34M

Words: 15.29B

CLEANED:

Docs: 6.50M

Words: 8.76B

Bangla (bn)

Creative Commons CC0 license
HPLT Analytics report
raw 14G
dedup 7.8G
cleaned 5.0G

Source: CC/IA

RAW:

Docs: 23.35M

Words: 46.18B

DEDUPLICATED:

Docs: 5.97M

Words: 4.86B

CLEANED:

Docs: 2.88M

Words: 2.77B

Catalan (ca)

Creative Commons CC0 license
HPLT Analytics report
raw 17G
dedup 11T
cleaned 7.3G

Source: CC/IA

RAW:

Docs: 24.17M

Words: 24.53B

DEDUPLICATED:

Docs: 7.79M

Words: 7.88B

CLEANED:

Docs: 4.54M

Words: 5.76B

Czech (cs)

Creative Commons CC0 license
HPLT Analytics report
raw 128G
dedup 55G
cleaned 30G

Source: CC/IA

RAW:

Docs: 184.09M

Words: 192.48B

DEDUPLICATED:

Docs: 38.56M

Words: 36.38B

CLEANED:

Docs: 16.99M

Words: 19.11B

Welsh (cy)

Creative Commons CC0 license
HPLT Analytics report
raw 494M
dedup 321M
cleaned 174M

Source: CC/IA

RAW:

Docs: 725.32k

Words: 623.99M

DEDUPLICATED:

Docs: 285.05k

Words: 234.14M

CLEANED:

Docs: 111.25k

Words: 124.06M

Danish (da)

Creative Commons CC0 license
HPLT Analytics report
raw 93G
dedup 32G
cleaned 14G

Source: CC/IA

RAW:

Docs: 686.47M

Words: 139.54B

DEDUPLICATED:

Docs: 23.58M

Words: 22.10B

CLEANED:

Docs: 8.18M

Words: 9.37B

German (de)

Creative Commons CC0 license
HPLT Analytics report
raw 750G
dedup 11T
cleaned 190G

Source: CC/IA

RAW:

Docs: 1.02B

Words: 899.14B

DEDUPLICATED:

Docs: 226.47M

Words: 190.94B

CLEANED:

Docs: 101.41M

Words: 110.98B

Greek (el)

Creative Commons CC0 license
HPLT Analytics report
raw 179G
dedup 68G
cleaned 45G

Source: CC/IA

RAW:

Docs: 134.47M

Words: 248.08B

DEDUPLICATED:

Docs: 30.63M

Words: 49.89B

CLEANED:

Docs: 15.83M

Words: 33.76B

English (en)

Creative Commons CC0 license
raw 8.7T
dedup 3.9T
cleaned 3.1T

Source: CC/IA

RAW:

Docs: 12.68B

Words: 10.27T

DEDUPLICATED:

Docs: 1.78B

Words: 2.86T

CLEANED:

Docs: 1.02B

Words: 2.31T

Esperanto (eo)

Creative Commons CC0 license
HPLT Analytics report
raw 364M
dedup 273M
cleaned 187M

Source: CC/IA

RAW:

Docs: 400.65k

Words: 295.87M

DEDUPLICATED:

Docs: 177.13k

Words: 152.63M

CLEANED:

Docs: 67.81k

Words: 101.70M

Spanish (es)

Creative Commons CC0 license
HPLT Analytics report
raw 684G
dedup 324G
cleaned 255G

Source: CC/IA

RAW:

Docs: 671.98M

Words: 869.38B

DEDUPLICATED:

Docs: 201.06M

Words: 239.29B

CLEANED:

Docs: 129.29M

Words: 181.23B

Estonian (et)

Creative Commons CC0 license
HPLT Analytics report
raw 19G
dedup 11G
cleaned 3.0G

Source: CC/IA

RAW:

Docs: 21.18M

Words: 21.75B

DEDUPLICATED:

Docs: 5.84M

Words: 6.57B

CLEANED:

Docs: 1.48M

Words: 1.74B

Basque (eu)

Creative Commons CC0 license
HPLT Analytics report
raw 1.4G
dedup 1.1G
cleaned 561M

Source: CC/IA

RAW:

Docs: 2.29M

Words: 1.55B

DEDUPLICATED:

Docs: 1.01M

Words: 660.30M

CLEANED:

Docs: 343.95k

Words: 324.64M

Persian (fa)

Creative Commons CC0 license
HPLT Analytics report
raw 172G
dedup 78G
cleaned 66G

Source: CC/IA

RAW:

Docs: 189.04M

Words: 318.13B

DEDUPLICATED:

Docs: 42.28M

Words: 57.52B

CLEANED:

Docs: 30.90M

Words: 47.58B

Finnish (fi)

Creative Commons CC0 license
HPLT Analytics report
raw 70G
dedup 85K
cleaned 16G

Source: CC/IA

RAW:

Docs: 89.06M

Words: 80.59B

DEDUPLICATED:

Docs: 19.51M

Words: 19.88B

CLEANED:

Docs: 7.15M

Words: 9.04B

French (fr)

Creative Commons CC0 license
HPLT Analytics report
raw 520G
dedup 249G
cleaned 177G

Source: CC/IA

RAW:

Docs: 660.32M

Words: 791.82B

DEDUPLICATED:

Docs: 175.76M

Words: 173.76B

CLEANED:

Docs: 99.59M

Words: 122.88B

Irish (ga)

Creative Commons CC0 license
HPLT Analytics report
raw 936M
dedup 517M
cleaned 168M

Source: CC/IA

RAW:

Docs: 2.71M

Words: 1.43B

DEDUPLICATED:

Docs: 931.55k

Words: 519.91M

CLEANED:

Docs: 115.53k

Words: 130.68M

Galician (gl)

Creative Commons CC0 license
HPLT Analytics report
raw 2.7G
dedup 1.7G
cleaned 1.1G

Source: CC/IA

RAW:

Docs: 4.60M

Words: 5.02B

DEDUPLICATED:

Docs: 1.79M

Words: 1.29B

CLEANED:

Docs: 731.36k

Words: 847.40M

Gujarati (gu)

Creative Commons CC0 license
HPLT Analytics report
raw 976M
dedup 718M
cleaned 536M

Source: CC/IA

RAW:

Docs: 915.29k

Words: 1.06B

DEDUPLICATED:

Docs: 454.73k

Words: 430.50M

CLEANED:

Docs: 264.82k

Words: 303.63M

Serbo-Croatian (hbs)

Creative Commons CC0 license
HPLT Analytics report
raw 52G
dedup 27G
cleaned 16G

Source: CC/IA

RAW:

Docs: 60.64M

Words: 69.45B

DEDUPLICATED:

Docs: 17.84M

Words: 17.94B

CLEANED:

Docs: 8.68M

Words: 10.03B

Hebrew (he)

Creative Commons CC0 license
HPLT Analytics report
raw 45G
dedup 22G
cleaned 11G

Source: CC/IA

RAW:

Docs: 46.22M

Words: 61.85B

DEDUPLICATED:

Docs: 11.24M

Words: 14.45B

CLEANED:

Docs: 4.98M

Words: 7.49B

Hindi (hi)

Creative Commons CC0 license
HPLT Analytics report
raw 38G
dedup 21G
cleaned 12G

Source: CC/IA

RAW:

Docs: 33.67M

Words: 41.21B

DEDUPLICATED:

Docs: 11.42M

Words: 14.13B

CLEANED:

Docs: 5.77M

Words: 7.54B

Hungarian (hu)

Creative Commons CC0 license
HPLT Analytics report
raw 103G
dedup 45G
cleaned 24G

Source: CC/IA

RAW:

Docs: 137.33M

Words: 137.25B

DEDUPLICATED:

Docs: 28.49M

Words: 28.05B

CLEANED:

Docs: 11.71M

Words: 14.39B

Armenian (hy)

Creative Commons CC0 license
HPLT Analytics report
raw 3.9G
dedup 2.3G
cleaned 1.1G

Source: CC/IA

RAW:

Docs: 3.99M

Words: 4.06B

DEDUPLICATED:

Docs: 1.36M

Words: 1.29B

CLEANED:

Docs: 621.47k

Words: 589.95M

Indonesian (id)

Creative Commons CC0 license
HPLT Analytics report
raw 136G
dedup 74G
cleaned 58G

Source: CC/IA

RAW:

Docs: 125.73M

Words: 208.16B

DEDUPLICATED:

Docs: 45.77M

Words: 54.22B

CLEANED:

Docs: 31.42M

Words: 42.08B

Icelandic (is)

Creative Commons CC0 license
HPLT Analytics report
raw 3.4G
dedup 2.5G
cleaned 837M

Source: CC/IA

RAW:

Docs: 3.86M

Words: 3.48B

DEDUPLICATED:

Docs: 1.44M

Words: 1.56B

CLEANED:

Docs: 481.33k

Words: 562.01M

Italian (it)

Creative Commons CC0 license
HPLT Analytics report
raw 301G
dedup 160G
cleaned 103G

Source: CC/IA

RAW:

Docs: 337.44M

Words: 405.66B

DEDUPLICATED:

Docs: 96.50M

Words: 115.25B

CLEANED:

Docs: 53.53M

Words: 74.45B

Japanese (ja)

Creative Commons CC0 license
HPLT Analytics report
raw 635G
dedup 384G
cleaned 354G

Source: CC/IA

RAW:

Docs: 679.79M

Words: 305.03B

DEDUPLICATED:

Docs: 218.85M

Words: 77.44B

CLEANED:

Docs: 190.41M

Words: 63.23B

Georgian (ka)

Creative Commons CC0 license
HPLT Analytics report
raw 6.3G
dedup 3.2G
cleaned 1.3G

Source: CC/IA

RAW:

Docs: 6.46M

Words: 7.19B

DEDUPLICATED:

Docs: 1.67M

Words: 1.61B

CLEANED:

Docs: 533.07k

Words: 573.88M

Kazakh (kk)

Creative Commons CC0 license
HPLT Analytics report
raw 2.6G
dedup 1.9G
cleaned 1.1G

Source: CC/IA

RAW:

Docs: 3.46M

Words: 2.53B

DEDUPLICATED:

Docs: 1.43M

Words: 1.03B

CLEANED:

Docs: 406.35k

Words: 471.76M

Kannada (kn)

Creative Commons CC0 license
HPLT Analytics report
raw 1.9G
dedup 872M
cleaned 510M

Source: CC/IA

RAW:

Docs: 1.93M

Words: 2.10B

DEDUPLICATED:

Docs: 557.82k

Words: 492.00M

CLEANED:

Docs: 228.22k

Words: 235.58M

Korean (ko)

Creative Commons CC0 license
HPLT Analytics report
raw 132G
dedup 63G
cleaned 48G

Source: CC/IA

RAW:

Docs: 248.06M

Words: 161.42B

DEDUPLICATED:

Docs: 44.46M

Words: 34.34B

CLEANED:

Docs: 31.85M

Words: 25.52B

Kyrgyz (ky)

Creative Commons CC0 license
HPLT Analytics report
raw 344M
dedup 296M
cleaned 221M

Source: CC/IA

RAW:

Docs: 333.58k

Words: 263.05M

DEDUPLICATED:

Docs: 188.17k

Words: 152.93M

CLEANED:

Docs: 88.32k

Words: 101.62M

Latin (la)

Creative Commons CC0 license
HPLT Analytics report
raw 8.0G
dedup 4.1G
cleaned 394M

Source: CC/IA

RAW:

Docs: 20.54M

Words: 14.38B

DEDUPLICATED:

Docs: 4.81M

Words: 3.81B

CLEANED:

Docs: 301.70k

Words: 294.13M

Lithuanian (lt)

Creative Commons CC0 license
HPLT Analytics report
raw 22G
dedup 12G
cleaned 4.8G

Source: CC/IA

RAW:

Docs: 32.35M

Words: 33.12B

DEDUPLICATED:

Docs: 7.40M

Words: 7.33B

CLEANED:

Docs: 2.72M

Words: 2.95B

Latvian (lv)

Creative Commons CC0 license
HPLT Analytics report
raw 18G
dedup 9.3G
cleaned 2.5G

Source: CC/IA

RAW:

Docs: 21.15M

Words: 27.37B

DEDUPLICATED:

Docs: 5.12M

Words: 5.85B

CLEANED:

Docs: 1.54M

Words: 1.59B

Macedonian (mk)

Creative Commons CC0 license
HPLT Analytics report
raw 2.7G
dedup 1.5G
cleaned 1.1G

Source: CC/IA

RAW:

Docs: 3.41M

Words: 3.24B

DEDUPLICATED:

Docs: 1.25M

Words: 1.07B

CLEANED:

Docs: 734.69k

Words: 736.55M

Malayalam (ml)

Creative Commons CC0 license
HPLT Analytics report
raw 2.6G
dedup 1.9G
cleaned 1.2G

Source: CC/IA

RAW:

Docs: 2.19M

Words: 2.67B

DEDUPLICATED:

Docs: 1.13M

Words: 917.42M

CLEANED:

Docs: 469.98k

Words: 517.83M

Mongolian (mn)

Creative Commons CC0 license
HPLT Analytics report
raw 2.2G
dedup 1.6G
cleaned 1.4G

Source: CC/IA

RAW:

Docs: 2.50M

Words: 2.49B

DEDUPLICATED:

Docs: 1.06M

Words: 1.05B

CLEANED:

Docs: 594.90k

Words: 803.21M

Marathi (mr)

Creative Commons CC0 license
HPLT Analytics report
raw 2.0G
dedup 1.5G
cleaned 1.1G

Source: CC/IA

RAW:

Docs: 1.64M

Words: 1.91B

DEDUPLICATED:

Docs: 857.21k

Words: 812.46M

CLEANED:

Docs: 453.69k

Words: 519.55M

Malay (ms)

Creative Commons CC0 license
HPLT Analytics report
raw 40G
dedup 18G
cleaned 12G

Source: CC/IA

RAW:

Docs: 28.32M

Words: 56.68B

DEDUPLICATED:

Docs: 8.36M

Words: 13.36B

CLEANED:

Docs: 4.87M

Words: 9.03B

Maltese (mt)

Creative Commons CC0 license
HPLT Analytics report
raw 1.8G
dedup 1.4G
cleaned 154M

Source: CC/IA

RAW:

Docs: 926.22k

Words: 1.39B

DEDUPLICATED:

Docs: 484.42k

Words: 818.61M

CLEANED:

Docs: 111.12k

Words: 102.42M

Burmese (my)

Creative Commons CC0 license
HPLT Analytics report
raw 3.7G
dedup 2.0G
cleaned 1.1G

Source: CC/IA

RAW:

Docs: 2.46M

Words: 4.41B

DEDUPLICATED:

Docs: 826.11k

Words: 1.14B

CLEANED:

Docs: 239.47k

Words: 357.11M

Norwegian Bokmål (nb)

Creative Commons CC0 license
HPLT Analytics report
raw 46G
dedup 24G
cleaned 13G

Source: CC/IA

RAW:

Docs: 61.09M

Words: 55.51B

DEDUPLICATED:

Docs: 14.58M

Words: 16.26B

CLEANED:

Docs: 6.12M

Words: 8.30B

Nepali (ne)

Creative Commons CC0 license
HPLT Analytics report
raw 1.9G
dedup 1.7G
cleaned 8.4T

Source: CC/IA

RAW:

Docs: 2.11M

Words: 1.68B

DEDUPLICATED:

Docs: 1.36M

Words: 966.77M

CLEANED:

Docs: 863.35k

Words: 694.40M

Dutch (nl)

Creative Commons CC0 license
HPLT Analytics report
raw 159G
dedup 83G
cleaned 50G

Source: CC/IA

RAW:

Docs: 234.44M

Words: 250.08B

DEDUPLICATED:

Docs: 66.62M

Words: 55.94B

CLEANED:

Docs: 31.75M

Words: 33.30B

Norwegian Nynorsk (nn)

Creative Commons CC0 license
HPLT Analytics report
raw 1.4G
dedup 886M
cleaned 414M

Source: CC/IA

RAW:

Docs: 1.85M

Words: 1.64B

DEDUPLICATED:

Docs: 752.53k

Words: 615.68M

CLEANED:

Docs: 228.48k

Words: 298.57M

Punjabi (pa)

Creative Commons CC0 license
HPLT Analytics report
raw 1.1G
dedup 624M
cleaned 295M

Source: CC/IA

RAW:

Docs: 2.40M

Words: 1.34B

DEDUPLICATED:

Docs: 888.47k

Words: 523.18M

CLEANED:

Docs: 152.78k

Words: 184.77M

Polish (pl)

Creative Commons CC0 license
HPLT Analytics report
raw 249G
dedup 11T
cleaned 69G

Source: CC/IA

RAW:

Docs: 346.33M

Words: 366.18B

DEDUPLICATED:

Docs: 82.92M

Words: 76.27B

CLEANED:

Docs: 39.38M

Words: 44.17B

Pashto (ps)

Creative Commons CC0 license
HPLT Analytics report
raw 283M
dedup 85K
cleaned 146M

Source: CC/IA

RAW:

Docs: 313.15k

Words: 356.65M

DEDUPLICATED:

Docs: 142.66k

Words: 172.14M

CLEANED:

Docs: 88.21k

Words: 113.19M

Portuguese (pt)

Creative Commons CC0 license
HPLT Analytics report
raw 384G
dedup 163G
cleaned 116G

Source: CC/IA

RAW:

Docs: 448.20M

Words: 607.05B

DEDUPLICATED:

Docs: 103.82M

Words: 121.75B

CLEANED:

Docs: 58.24M

Words: 81.41B

Romanian (ro)

Creative Commons CC0 license
HPLT Analytics report
raw 86G
dedup 39G
cleaned 27G

Source: CC/IA

RAW:

Docs: 103.69M

Words: 144.82B

DEDUPLICATED:

Docs: 24.94M

Words: 28.92B

CLEANED:

Docs: 14.47M

Words: 19.49B

Russian (ru)

Creative Commons CC0 license
HPLT Analytics report
raw 1.7T
dedup 842G
cleaned 622G

Source: CC/IA

RAW:

Docs: 1.53B

Words: 1.79T

DEDUPLICATED:

Docs: 397.27M

Words: 413.51B

CLEANED:

Docs: 224.20M

Words: 284.58B

Sinhala (si)

Creative Commons CC0 license
HPLT Analytics report
raw 2.2G
dedup 1.2G
cleaned 1010M

Source: CC/IA

RAW:

Docs: 1.37M

Words: 2.35B

DEDUPLICATED:

Docs: 563.99k

Words: 734.99M

CLEANED:

Docs: 322.51k

Words: 568.03M

Slovak (sk)

Creative Commons CC0 license
HPLT Analytics report
raw 58G
dedup 22G
cleaned 7.7G

Source: CC/IA

RAW:

Docs: 90.43M

Words: 94.65B

DEDUPLICATED:

Docs: 13.99M

Words: 14.16B

CLEANED:

Docs: 4.62M

Words: 4.98B

Slovenian (sl)

Creative Commons CC0 license
HPLT Analytics report
raw 18G
dedup 9.7G
cleaned 3.7G

Source: CC/IA

RAW:

Docs: 19.17M

Words: 22.85B

DEDUPLICATED:

Docs: 5.82M

Words: 6.72B

CLEANED:

Docs: 2.20M

Words: 2.51B

Somali (so)

Creative Commons CC0 license
HPLT Analytics report
raw 466M
dedup 322M
cleaned 272M

Source: CC/IA

RAW:

Docs: 676.04k

Words: 544.50M

DEDUPLICATED:

Docs: 374.71k

Words: 253.71M

CLEANED:

Docs: 283.71k

Words: 211.80M

Albanian (sq)

Creative Commons CC0 license
HPLT Analytics report
raw 8.8G
dedup 4.6G
cleaned 1.7G

Source: CC/IA

RAW:

Docs: 9.18M

Words: 11.50B

DEDUPLICATED:

Docs: 3.22M

Words: 3.66B

CLEANED:

Docs: 1.24M

Words: 1.34B

Swedish (sv)

Creative Commons CC0 license
HPLT Analytics report
raw 77G
dedup 43G
cleaned 25G

Source: CC/IA

RAW:

Docs: 96.13M

Words: 95.99B

DEDUPLICATED:

Docs: 29.96M

Words: 29.78B

CLEANED:

Docs: 13.67M

Words: 16.91B

Swahili (sw)

Creative Commons CC0 license
HPLT Analytics report
raw 1.6G
dedup 1006M
cleaned 803M

Source: CC/IA

RAW:

Docs: 2.17M

Words: 2.10B

DEDUPLICATED:

Docs: 983.51k

Words: 830.99M

CLEANED:

Docs: 698.57k

Words: 668.17M

Tamil (ta)

Creative Commons CC0 license
HPLT Analytics report
raw 12G
dedup 5.8G
cleaned 4.3G

Source: CC/IA

RAW:

Docs: 5.46M

Words: 9.30B

DEDUPLICATED:

Docs: 2.47M

Words: 2.94B

CLEANED:

Docs: 1.24M

Words: 1.91B

Telugu (te)

Creative Commons CC0 license
HPLT Analytics report
raw 2.4G
dedup 11T
cleaned 925M

Source: CC/IA

RAW:

Docs: 3.51M

Words: 2.43B

DEDUPLICATED:

Docs: 1.61M

Words: 1.03B

CLEANED:

Docs: 415.60k

Words: 437.74M

Thai (th)

Creative Commons CC0 license
HPLT Analytics report
raw 83G
dedup 48G
cleaned 16G

Source: CC/IA

RAW:

Docs: 95.27M

Words: 78.63B

DEDUPLICATED:

Docs: 29.48M

Words: 16.43B

CLEANED:

Docs: 8.19M

Words: 4.33B

Filipino (tl)

Creative Commons CC0 license
HPLT Analytics report
raw 5.2G
dedup 2.1G
cleaned 1.2G

Source: CC/IA

RAW:

Docs: 4.97M

Words: 7.07B

DEDUPLICATED:

Docs: 1.20M

Words: 1.63B

CLEANED:

Docs: 585.24k

Words: 911.06M

Turkish (tr)

Creative Commons CC0 license
HPLT Analytics report
raw 154G
dedup 79G
cleaned 47G

Source: CC/IA

RAW:

Docs: 215.38M

Words: 238.30B

DEDUPLICATED:

Docs: 59.43M

Words: 64.92B

CLEANED:

Docs: 27.05M

Words: 42.65B

Tatar (tt)

Creative Commons CC0 license
HPLT Analytics report
raw 322M
dedup 251M
cleaned 163M

Source: CC/IA

RAW:

Docs: 368.35k

Words: 261.38M

DEDUPLICATED:

Docs: 172.47k

Words: 134.18M

CLEANED:

Docs: 65.15k

Words: 74.86M

Ukrainian (uk)

Creative Commons CC0 license
HPLT Analytics report
raw 54G
dedup 34G
cleaned 23G

Source: CC/IA

RAW:

Docs: 47.12M

Words: 52.95B

DEDUPLICATED:

Docs: 17.86M

Words: 18.19B

CLEANED:

Docs: 9.31M

Words: 10.57B

Urdu (ur)

Creative Commons CC0 license
HPLT Analytics report
raw 4.2G
dedup 2.6G
cleaned 1.8G

Source: CC/IA

RAW:

Docs: 6.09M

Words: 6.59B

DEDUPLICATED:

Docs: 2.23M

Words: 2.02B

CLEANED:

Docs: 1.44M

Words: 1.42B

Uzbek (uz)

Creative Commons CC0 license
HPLT Analytics report
raw 1.4G
dedup 1016M
cleaned 745M

Source: CC/IA

RAW:

Docs: 1.37M

Words: 1.11B

DEDUPLICATED:

Docs: 633.22k

Words: 556.45M

CLEANED:

Docs: 290.29k

Words: 367.25M

Vietnamese (vi)

Creative Commons CC0 license
HPLT Analytics report
raw 148G
dedup 65G
cleaned 54G

Source: CC/IA

RAW:

Docs: 174.17M

Words: 287.20B

DEDUPLICATED:

Docs: 40.10M

Words: 59.20B

CLEANED:

Docs: 31.50M

Words: 49.36B

Chinese (zh)

Creative Commons CC0 license
HPLT Analytics report
raw 5.3T
dedup 3.0T
cleaned 2.9T

Source: CC/IA

RAW:

Docs: 6.91B

Words: 1.79T

DEDUPLICATED:

Docs: 1.20B

Words: 482.86B

CLEANED:

Docs: 1.08B

Words: 432.88B

Bilingual

Data release 1.2 (December 2023)

There are 18 language pairs in this release. The parallel corpus contains over 96 million clean and unique sentence pairs and covers over 1.4 billion English tokens. The corpora are provided in raw, TMX and TXT compressed formats. These corpora have been highly curated, de-duplicated and filtered using the full Bitextor pipeline. Besides this, an anonymized (ROAM) version of the TMX is also provided.
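
As an illustration of working with the TMX files, the sketch below streams translation units with Python's standard XML parser. It relies only on the generic TMX structure (tu, tuv and seg elements, with the language in the xml:lang attribute); any HPLT-specific properties stored in the files are ignored, and the function name iter_translation_units and the tiny SAMPLE_TMX document are made up for this example.

```python
import io
import xml.etree.ElementTree as ET
from typing import Dict, Iterator

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_translation_units(source) -> Iterator[Dict[str, str]]:
    """Yield one {language: segment text} dict per <tu> element.

    `source` is a file name or a binary file object. Parsing is
    incremental, so large files are not loaded into memory at once.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                unit[lang] = "".join(seg.itertext())
        yield unit
        elem.clear()  # free the subtree that was just processed


# Hypothetical two-language sample, not taken from the corpus.
SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello</seg></tuv>
      <tuv xml:lang="fi"><seg>Hei</seg></tuv>
    </tu>
  </body>
</tmx>
"""

for unit in iter_translation_units(io.BytesIO(SAMPLE_TMX)):
    print(unit)
```

Since the files are distributed compressed, they need to be decompressed first, or opened through a decompressing file object, before being passed to the parser.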

MultiHPLT: Beyond English-centric parallel corpora!

MultiHPLT consists of parallel corpora derived from the web crawls collected in the HPLT project and further processed into a multi-parallel corpus by pivoting via English. Link: https://opus.nlpl.eu/MultiHPLT/corpus/version/MultiHPLT

Arabic (ar) - English (en)

Creative Commons CC0 license
HPLT Analytics report
moses 1.51 GB
tmx 7.4G
roam 1.2G
raw 70G

Source: CC/IA

TMX:

Words: 240M

Translation units: 15M

ROAM:

Words: 173.70M

Translation units: 11.87M

RAW:

Words: 33.20B

Translation units: 1.55B

62 downloads

Bosnian (bs) - English (en)

Creative Commons CC0 license
HPLT Analytics report
moses 14.98 MB
tmx 57M
roam 13M
raw 1.1G

Source: CC/IA

TMX:

Words: 2.8M

Translation units: 241K

ROAM:

Words: 280.86k

Translation units: 212.18k

RAW:

Words: 521.63M

Translation units: 27.00M

56 downloads

Catalan (ca) - English (en)

Creative Commons CC0 license
HPLT Analytics report
moses 825.38 MB
tmx 3.4G
roam 714M
raw 18G

Source: CC/IA

TMX:

Words: 142M

Translation units: 9.0M

ROAM:

Words: 125.90M

Translation units: 8.05M

RAW:

Words: 8.03B

Translation units: 402.49M

23 downloads

English (en) - Estonian (et)

Creative Commons CC0 license
HPLT Analytics report
moses 541.46 MB
tmx 2.6G
roam 493M
raw 37G

Source: CC/IA

TMX:

Words: 96M

Translation units: 6.1M

ROAM:

Words: 86.59M

Translation units: 5.51M

RAW:

Words: 15.48B

Translation units: 865.43M

13 downloads

English (en) - Basque (eu)

Creative Commons CC0 license
HPLT Analytics report
moses 57.91 MB
tmx 169M
roam 49M
raw 852M

Source: CC/IA

TMX:

Words: 10M

Translation units: 611K

ROAM:

Words: 8.41M

Translation units: 521.10k

RAW:

Words: 400.26M

Translation units: 20.83M

13 downloads

English (en) - Finnish (fi)

Creative Commons CC0 license
HPLT Analytics report
moses 2.08 GB
tmx 17G
roam 1.9G
raw 147G

Source: CC/IA

TMX:

Words: 339M

Translation units: 26M

ROAM:

Words: 303M

Translation units: 23M

RAW:

Words: 65.31B

Translation units: 3.83B

8 downloads

English (en) - Irish (ga)

Creative Commons CC0 license
HPLT Analytics report
moses 87.00 MB
tmx 593M
roam 84M
raw 3.6G

Source: CC/IA

TMX:

Words: 17M

Translation units: 995K

ROAM:

Words: 15.16M

Translation units: 920.01k

RAW:

Words: 2.01B

Translation units: 101.00M

35 downloads

English (en) - Galician (gl)

Creative Commons CC0 license
HPLT Analytics report
moses 84.24 MB
tmx 267M
roam 69M
raw 2.1G

Source: CC/IA

TMX:

Words: 14M

Translation units: 1.1M

ROAM:

Words: 11.63M

Translation units: 930.40k

RAW:

Words: 1.02B

Translation units: 56.10M

10 downloads

English (en) - Hindi (hi)

Creative Commons CC0 license
HPLT Analytics report
moses 1.17 GB
tmx 4.5G
roam 1010M
raw 46G

Source: CC/IA

TMX:

Words: 166M

Translation units: 13M

ROAM:

Words: 137.38M

Translation units: 10.49M

RAW:

Words: 19.25B

Translation units: 1.04B

10 downloads

English (en) - Croatian (hr)

Creative Commons CC0 license
HPLT Analytics report
moses 833.38 MB
tmx 4.3G
roam 769M
raw 40G

Source: CC/IA

TMX:

Words: 139M

Translation units: 9.4M

ROAM:

Words: 125.99M

Translation units: 8.61M

RAW:

Words: 16.57B

Translation units: 895.79M

8 downloads

English (en) - Icelandic (is)

Creative Commons CC0 license
HPLT Analytics report
moses 146.87 MB
tmx 811M
roam 130M
raw 7.5G

Source: CC/IA

TMX:

Words: 30M

Translation units: 2.2M

ROAM:

Words: 26.27M

Translation units: 1.91M

RAW:

Words: 3.27B

Translation units: 170.42M

4 downloads

English (en) - Macedonian (mk)

Creative Commons CC0 license
HPLT Analytics report
moses 122.18 MB
tmx 614M
roam 108M
raw 4.5G

Source: CC/IA

TMX:

Words: 19M

Translation units: 1.1M

ROAM:

Words: 15.55M

Translation units: 995.10k

RAW:

Words: 1.87B

Translation units: 91.29M

5 downloads

English (en) - Maltese (mt)

Creative Commons CC0 license
HPLT Analytics report
moses 103.87 MB
tmx 365M
roam 97M
raw 6.5G

Source: CC/IA

TMX:

Words: 19M

Translation units: 855K

ROAM:

Words: 17.14M

Translation units: 752.06k

RAW:

Words: 2.82B

Translation units: 135.10M

2 downloads

English (en) - Norwegian Nynorsk (nn)

Creative Commons CC0 license
HPLT Analytics report
moses 11.85 MB
tmx 32M
roam 8.9M
raw 945M

Source: CC/IA

TMX:

Words: 2.1M

Translation units: 133K

ROAM:

Words: 1.58M

Translation units: 108.30k

RAW:

Words: 496.50M

Translation units: 28.70M

3 downloads

English (en) - Albanian (sq)

Creative Commons CC0 license
HPLT Analytics report
moses 151.32 MB
tmx 688M
roam 120M
raw 13G

Source: CC/IA

TMX:

Words: 26M

Translation units: 1.7M

ROAM:

Words: 20.50M

Translation units: 1.35M

RAW:

Words: 5.82B

Translation units: 253.10M

6 downloads

English (en) - Serbian (sr)

Creative Commons CC0 license
HPLT Analytics report
moses 319.21 MB
tmx 2.1G
roam 292M
raw 35G

Source: CC/IA

TMX:

Words: 56M

Translation units: 4.0M

ROAM:

Words: 49.63M

Translation units: 3.62M

RAW:

Words: 14.25B

Translation units: 247.56M

2 downloads

English (en) - Swahili (sw)

Creative Commons CC0 license
HPLT Analytics report
moses 99.89 MB
tmx 476M
roam 70M
raw 11G

Source: CC/IA

TMX:

Words: 21M

Translation units: 1.8M

ROAM:

Words: 15.03M

Translation units: 1.38M

RAW:

Words: 5.75B

Translation units: 247.56M

8 downloads

English (en) - Traditional Chinese (zh-hant)

Creative Commons CC0 license
HPLT Analytics report
moses 508.09 MB
tmx 2.1G
roam 450M
raw 25G

Source: CC/IA

TMX:

Words: 85M

Translation units: 5.4M

ROAM:

Words: 73M

Translation units: 4.6M

RAW:

Words: 9.16B

Translation units: 530.12M

51 downloads

License

These data are released under this licensing scheme:

  • We do not own any of the text from which these text data have been extracted.*
  • We license the actual packaging of these text data under the Creative Commons CC0 license ("no rights reserved").

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • You can reach us at hplt-datasets@ufal.mff.cuni.cz

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.

*It is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.

Ⓒ HPLT 2025

This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]

The contents of this publication are the sole responsibility of the HPLT consortium and do not necessarily reflect the opinion of the European Union.
