Monolingual
Data release 1 (September 2023)
Here, we publish monolingual corpora compiled from large web crawls provided by the Internet Archive and CommonCrawl projects.
There are 75 languages in this release (22 TB in total size) provided as JSONL files compressed with zstd. For convenience, data is split into multiple shards, a few GB each. The number of shards per language depends on the size of the specific corpus.
Each line of a JSONL file is a valid JSON value that holds one full document together with its metadata. For example:
{"id":1, "document_lang":"en", "scores":["0.76","0.76","0.76"], "langs":["en","en","en"], "text":"this is paragraph1\nthis is paragraph2\nthis is paragraph3", "url":"url1", "collection":"collection-1"}
{"id":2, "document_lang":"en", "scores":["0.65",...], "langs":["en",...], "text":"another paragraph\n...", ...
Within each document, the paragraphs are concatenated into the text field using newline separators. langs and scores are lists with one entry per paragraph, giving the identified language and the monocleaner score of that paragraph.
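The per-paragraph layout described above can be unpacked with a few lines of Python. This is a minimal sketch using only the standard library; the helper names iter_paragraphs and keep_paragraphs and the threshold values are illustrative assumptions, not part of the release. It assumes scores are stored as strings, as in the example record.

```python
import json

def iter_paragraphs(line):
    """Yield (paragraph, lang, score) triples for one JSONL line."""
    doc = json.loads(line)
    paragraphs = doc["text"].split("\n")
    # langs and scores hold one entry per paragraph; scores are strings
    # in the example record, so they are converted to float here.
    for para, lang, score in zip(paragraphs, doc["langs"], doc["scores"]):
        yield para, lang, float(score)

def keep_paragraphs(line, lang, min_score):
    """Return the paragraphs in `lang` scoring at least `min_score` (hypothetical filter)."""
    kept = [p for p, l, s in iter_paragraphs(line) if l == lang and s >= min_score]
    return "\n".join(kept)

# The first example record from the text, written as a single JSONL line.
sample = (
    '{"id":1, "document_lang":"en", "scores":["0.76","0.76","0.76"], '
    '"langs":["en","en","en"], '
    '"text":"this is paragraph1\\nthis is paragraph2\\nthis is paragraph3", '
    '"url":"url1", "collection":"collection-1"}'
)

for para, lang, score in iter_paragraphs(sample):
    print(lang, score, para)
```

Since the shards are zstd-compressed, one option is to decompress on the fly with the zstd command-line tool (zstd -dc) and feed the resulting lines to a function like iter_paragraphs.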
How to download it?
The simplest way to download the data is to use wget -i with a language-specific mapping file, which contains the full URLs of all shards for that language. For example:
wget -i https://data.hplt-project.org/one/monotext/eu_map.txt
Full download
To download all 22 terabytes with a single command, use
wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/one/monotext/hplt_monolingual_map_all_1.txt
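To see where files end up with the flags above, the following sketch mimics how wget maps a URL to a local path when -nH (no host directory), -x (force directories) and --cut-dirs=2 (drop the first two remote directory components) are combined. The shard URL and the function names are hypothetical and only serve as illustration; the real URLs are the ones listed in the mapping files.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def shard_urls(map_text):
    """Return the non-empty lines of a mapping file, one URL per line."""
    return [line.strip() for line in map_text.splitlines() if line.strip()]

def local_path(url, cut_dirs=2):
    """Approximate the local path wget -nH -x --cut-dirs=N would create."""
    parts = PurePosixPath(urlparse(url).path).parts
    dirs = [p for p in parts[:-1] if p != "/"]  # remote directory components
    filename = parts[-1]
    # -nH drops the host name; --cut-dirs drops the leading directories.
    return str(PurePosixPath(*dirs[cut_dirs:], filename))

# Hypothetical mapping-file content, not taken from the release.
example_map = "https://example.org/one/monotext/eu/eu_1.jsonl.zst\n"

for url in shard_urls(example_map):
    print(local_path(url))  # eu/eu_1.jsonl.zst
```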
Source: CC/IA, Docs: 4.36M
Source: CC/IA, Docs: 9.18M
Source: CC/IA, Docs: 197.00M
Source: CC/IA, Docs: 3.99M
Source: CC/IA, Docs: 10.40M
Source: CC/IA, Docs: 2.29M
Source: CC/IA, Docs: 3.91M
Source: CC/IA, Docs: 23.30M
Source: CC/IA, Docs: 58.30M
Source: CC/IA, Docs: 2.46M
Source: CC/IA, Docs: 24.20M
Source: CC/IA, Docs: 6.91B
Source: CC/IA, Docs: 184.00M
Source: CC/IA, Docs: 686.00M
Source: CC/IA, Docs: 235.00M
Source: CC/IA, Docs: 12.70B
Source: CC/IA, Docs: 401.00K
Source: CC/IA, Docs: 21.20M
Source: CC/IA, Docs: 89.10M
Source: CC/IA, Docs: 660.00M
Source: CC/IA, Docs: 4.60M
Source: CC/IA, Docs: 6.46M
Source: CC/IA, Docs: 1.02B
Source: CC/IA, Docs: 134.00M
Source: CC/IA, Docs: 915.00K
Source: CC/IA, Docs: 46.20M
Source: CC/IA, Docs: 33.70M
Source: CC/IA, Docs: 137.00M
Source: CC/IA, Docs: 3.86M
Source: CC/IA, Docs: 126.00M
Source: CC/IA, Docs: 2.71M
Source: CC/IA, Docs: 337.00M
Source: CC/IA, Docs: 680.00M
Source: CC/IA, Docs: 1.93M
Source: CC/IA, Docs: 3.46M
Source: CC/IA, Docs: 248.00M
Source: CC/IA, Docs: 334.00K
Source: CC/IA, Docs: 20.50M
Source: CC/IA, Docs: 21.20M
Source: CC/IA, Docs: 32.40M
Source: CC/IA, Docs: 3.41M
Source: CC/IA, Docs: 28.30M
Source: CC/IA, Docs: 2.19M
Source: CC/IA, Docs: 925.00K
Source: CC/IA, Docs: 1.64M
Source: CC/IA, Docs: 2.50M
Source: CC/IA, Docs: 2.11M
Source: CC/IA, Docs: 61.10M
Source: CC/IA, Docs: 1.85M
Source: CC/IA, Docs: 313.00K
Source: CC/IA, Docs: 189.00M
Source: CC/IA, Docs: 346.00M
Source: CC/IA, Docs: 448.00M
Source: CC/IA, Docs: 2.40M
Source: CC/IA, Docs: 104.00M
Source: CC/IA, Docs: 1.53B
Source: CC/IA, Docs: 60.60M
Source: CC/IA, Docs: 1.37M
Source: CC/IA, Docs: 90.40M
Source: CC/IA, Docs: 19.20M
Source: CC/IA, Docs: 676.00K
Source: CC/IA, Docs: 672.00M
Source: CC/IA, Docs: 2.17M
Source: CC/IA, Docs: 96.10M
Source: CC/IA, Docs: 4.97M
Source: CC/IA, Docs: 5.46M
Source: CC/IA, Docs: 368.00K
Source: CC/IA, Docs: 3.51M
Source: CC/IA, Docs: 95.30M
Source: CC/IA, Docs: 215.00M
Source: CC/IA, Docs: 47.10M
Source: CC/IA, Docs: 6.09M
Source: CC/IA, Docs: 1.37M
Source: CC/IA, Docs: 174.00M
Source: CC/IA, Docs: 725.00K
Bilingual
Data release 1 (September 2023)
Here, we publish bilingual corpora compiled from large web crawls provided by the Internet Archive and CommonCrawl projects. See more details at the HPLT project website.
There are 14 language pairs in this release (25 Gigabytes in total size). The corpora are provided as raw, gzip-compressed TSV files. Raw means that the data comes directly from the sentence alignment step and only obvious noise has been filtered out, so it is likely to contain a high proportion of non-parallel sentences.
The format is a raw TSV file with aligned sentences and metadata. For example:
http://statsborger.no/en/udi-strips-norwegian-citizenship-of-my-norwegian-children/ TAB http://statsborger.no/er-geir-sine-barn-ulovelige-innvandrere/ TAB Will we ever be able to move back to Norway as a family? TAB Vil me nokon sinne ha lov til å flytta tilbake til Norge som ein familie? TAB 1941213a3984df16 TAB c956c856dce68bc9 TAB ba89f95e2977f9fe TAB 91.54 TAB wide17
The structure corresponds to the following fields, where TAB denotes a tab character:
URL_source TAB URL_target TAB source_sentence TAB target_sentence TAB source_hash TAB target_hash TAB bifixer_hash TAB bifixer_score TAB collection
The bifixer hash and score can be used for deduplication, the source and target hashes are useful for recrawling these sentences, and the collection names the source of the crawled text.
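As an illustration of how these fields might be used, the sketch below parses one TSV line into named fields and deduplicates sentence pairs by bifixer_hash, keeping the pair with the highest bifixer_score in each group. The keep-the-best-score policy, the Pair type and the function names are assumptions made for this example; the release itself does not prescribe a deduplication procedure.

```python
from typing import NamedTuple

class Pair(NamedTuple):
    url_source: str
    url_target: str
    source_sentence: str
    target_sentence: str
    source_hash: str
    target_hash: str
    bifixer_hash: str
    bifixer_score: float
    collection: str

def parse_line(line):
    """Split one raw TSV line into the nine fields listed above."""
    parts = [p.strip() for p in line.rstrip("\n").split("\t")]
    if len(parts) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(parts)}")
    return Pair(*parts[:7], float(parts[7]), parts[8])

def deduplicate(pairs):
    """Keep one pair per bifixer_hash: the one with the highest score (assumed policy)."""
    best = {}
    for pair in pairs:
        current = best.get(pair.bifixer_hash)
        if current is None or pair.bifixer_score > current.bifixer_score:
            best[pair.bifixer_hash] = pair
    return list(best.values())

# The example row from the text, joined with real tab characters.
sample = "\t".join([
    "http://statsborger.no/en/udi-strips-norwegian-citizenship-of-my-norwegian-children/",
    "http://statsborger.no/er-geir-sine-barn-ulovelige-innvandrere/",
    "Will we ever be able to move back to Norway as a family?",
    "Vil me nokon sinne ha lov til å flytta tilbake til Norge som ein familie?",
    "1941213a3984df16", "c956c856dce68bc9", "ba89f95e2977f9fe", "91.54", "wide17",
])

pair = parse_line(sample)
print(pair.bifixer_hash, pair.bifixer_score, pair.collection)
```

For real files, the lines could be read with gzip.open(path, "rt", encoding="utf-8") from the standard library and passed to parse_line one at a time.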
Source: CC/IA for all 14 language pairs.