HPLT Datasets v1

DEPRECATED RELEASE
GET FILES FROM RELEASE 1.2

This release provides mostly raw plain text data with some essential pre-processing but without sophisticated language identification, boilerplate removal, fine-grained filtering, cleaning, or de-duplication. The texts do come with some metadata, which is currently unused but which end users can employ for their own filtering.

Please read the release report to get further details on how the datasets were produced.

Further data curation has been addressed in release 1.2.

Monolingual

Data release 1 (September 2023)

Here, we publish monolingual corpora compiled from large web crawls provided by the Internet Archive and CommonCrawl projects.

There are 75 languages in this release (22 TB in total size) provided as JSONL files compressed with zstd. For convenience, data is split into multiple shards, a few GB each. The number of shards per language depends on the size of the specific corpus.

The format is JSONL: each line is a valid JSON object holding one full document together with its metadata. For example (records are wrapped across several lines here for readability):

{"id":1, "document_lang":"en",
"scores":["0.76","0.76","0.76"],
"langs":["en","en","en"],
"text":"this is paragraph1\nthis is paragraph2\nthis is paragraph3",
"url":"url1", "collection":"collection-1"
}
{"id":2, "document_lang":"en",
"scores":["0.65",...],
"langs":["en",...],
"text":"another paragraph\n...",
...

Within each document, the paragraphs are concatenated with newline separators. langs and scores are lists with one entry per paragraph, giving the language identified for that paragraph and its monocleaner score.
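Since the per-paragraph metadata is what end users are expected to filter on, here is a minimal Python sketch of how the three parallel fields might be read together. It assumes the field names shown in the example above and that scores are stored as strings, as they appear there; the function names and the threshold are illustrative, not part of the release.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def iter_paragraphs(lines: Iterable[str]) -> Iterator[Tuple[dict, str, str, float]]:
    """Yield (document, paragraph, lang, score) for every paragraph of every document."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        doc = json.loads(line)
        paragraphs = doc["text"].split("\n")
        # langs and scores are parallel to the paragraphs; scores are strings
        # in the example above, so they are converted to float here.
        for para, lang, score in zip(paragraphs, doc["langs"], doc["scores"]):
            yield doc, para, lang, float(score)


def keep_paragraphs(lines: Iterable[str], lang: str, min_score: float) -> List[str]:
    """Return the paragraphs identified as `lang` whose score is at least `min_score`."""
    return [
        para
        for _doc, para, para_lang, score in iter_paragraphs(lines)
        if para_lang == lang and score >= min_score
    ]
```

The functions take any iterable of decoded lines, so a shard could be streamed in after decompressing it with, for instance, zstd -dc, without loading the whole file into memory.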

How to download it?

The simplest way to download the data is to use wget -i with a language-specific mapping file, which contains the full URLs of all shards for that language. For example:

wget -i https://data.hplt-project.org/one/monotext/eu_map.txt
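For scripted downloads, the same mapping files can be handled from Python. The sketch below is only an illustration: the URL pattern in map_url is an assumption generalized from the single eu_map.txt example above, and all three helper names are hypothetical.

```python
import posixpath
from typing import List
from urllib.parse import urlparse

BASE_URL = "https://data.hplt-project.org/one/monotext"


def map_url(lang_code: str) -> str:
    """Build the mapping-file URL for a language code.

    ASSUMPTION: every language follows the <code>_map.txt pattern of the
    Basque example; this is not confirmed for all languages.
    """
    return f"{BASE_URL}/{lang_code}_map.txt"


def shard_urls(map_text: str) -> List[str]:
    """Return the shard URLs listed in a mapping file, one per non-empty line."""
    return [line.strip() for line in map_text.splitlines() if line.strip()]


def local_name(url: str) -> str:
    """Return the file name a shard would get when saved to the current directory."""
    return posixpath.basename(urlparse(url).path)
```

The text of a mapping file could be obtained with urllib.request.urlopen(map_url("eu")).read().decode() and passed to shard_urls to decide which shards to fetch.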

Full download

To download all 22 terabytes at once, use:

wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/one/monotext/hplt_monolingual_map_all_1.txt

Language (code)         Size    Source  Docs      Words
Afrikaans (af)          5.4G    CC/IA   4.36M     NaN
Albanian (sq)           9.6G    CC/IA   9.18M     NaN
Arabic (ar)             186G    CC/IA   197.00M   NaN
Armenian (hy)           4.1G    CC/IA   3.99M     NaN
Azerbaijani (az)        9.5G    CC/IA   10.40M    NaN
Basque (eu)             1.5G    CC/IA   2.29M     NaN
Belarusian (be)         6.1G    CC/IA   3.91M     NaN
Bengali (bn)            15G     CC/IA   23.30M    NaN
Bulgarian (bg)          54G     CC/IA   58.30M    NaN
Burmese (my)            4.1G    CC/IA   2.46M     NaN
Catalan (ca)            18G     CC/IA   24.20M    NaN
Chinese (zh)            5.8T    CC/IA   6.91B     NaN
Czech (cs)              137G    CC/IA   184.00M   NaN
Danish (da)             100G    CC/IA   686.00M   NaN
Dutch (nl)              170G    CC/IA   235.00M   NaN
English (en)            9.5T    CC/IA   12.70B    NaN
Esperanto (eo)          382M    CC/IA   401.00K   NaN
Estonian (et)           20G     CC/IA   21.20M    NaN
Finnish (fi)            75G     CC/IA   89.10M    NaN
French (fr)             558G    CC/IA   660.00M   NaN
Galician (gl)           2.9G    CC/IA   4.60M     NaN
Georgian (ka)           6.8G    CC/IA   6.46M     NaN
German (de)             805G    CC/IA   1.02B     NaN
Greek (el)              192G    CC/IA   134.00M   NaN
Gujarati (gu)           1.1G    CC/IA   915.00K   NaN
Hebrew (he)             49G     CC/IA   46.20M    NaN
Hindi (hi)              41G     CC/IA   33.70M    NaN
Hungarian (hu)          111G    CC/IA   137.00M   NaN
Icelandic (is)          3.7G    CC/IA   3.86M     NaN
Indonesian (id)         146G    CC/IA   126.00M   NaN
Irish (ga)              982M    CC/IA   2.71M     NaN
Italian (it)            323G    CC/IA   337.00M   NaN
Japanese (ja)           681G    CC/IA   680.00M   NaN
Kannada (kn)            2.0G    CC/IA   1.93M     NaN
Kazakh (kk)             2.7G    CC/IA   3.46M     NaN
Korean (ko)             141G    CC/IA   248.00M   NaN
Kyrgyz (ky)             360M    CC/IA   334.00K   NaN
Latin (la)              8.5G    CC/IA   20.50M    NaN
Latvian (lv)            19G     CC/IA   21.20M    NaN
Lithuanian (lt)         24G     CC/IA   32.40M    NaN
Macedonian (mk)         2.8G    CC/IA   3.41M     NaN
Malay (ms)              43G     CC/IA   28.30M    NaN
Malayalam (ml)          2.8G    CC/IA   2.19M     NaN
Maltese (mt)            2.0G    CC/IA   925.00K   NaN
Marathi (mr)            2.1G    CC/IA   1.64M     NaN
Mongolian (mn)          2.3G    CC/IA   2.50M     NaN
Nepali (ne)             2.1G    CC/IA   2.11M     NaN
Norwegian Bokmål (nb)   49G     CC/IA   61.10M    NaN
Norwegian Nynorsk (nn)  1.5G    CC/IA   1.85M     NaN
Pashto (ps)             296M    CC/IA   313.00K   NaN
Persian (fa)            184G    CC/IA   189.00M   NaN
Polish (pl)             267G    CC/IA   346.00M   NaN
Portuguese (pt)         413G    CC/IA   448.00M   NaN
Punjabi (pa)            1.2G    CC/IA   2.40M     NaN
Romanian (ro)           92G     CC/IA   104.00M   NaN
Russian (ru)            1.8T    CC/IA   1.53B     NaN
Serbo-Croatian (hbs)    56G     CC/IA   60.60M    NaN
Sinhala (si)            2.4G    CC/IA   1.37M     NaN
Slovak (sk)             62G     CC/IA   90.40M    NaN
Slovenian (sl)          19G     CC/IA   19.20M    NaN
Somali (so)             488M    CC/IA   676.00K   NaN
Spanish (es)            734G    CC/IA   672.00M   NaN
Swahili (sw)            1.7G    CC/IA   2.17M     NaN
Swedish (sv)            83G     CC/IA   96.10M    NaN
Tagalog (tl)            5.7G    CC/IA   4.97M     NaN
Tamil (ta)              12G     CC/IA   5.46M     NaN
Tatar (tt)              338M    CC/IA   368.00K   NaN
Telugu (te)             2.6G    CC/IA   3.51M     NaN
Thai (th)               90G     CC/IA   95.30M    NaN
Turkish (tr)            166G    CC/IA   215.00M   NaN
Ukrainian (uk)          58G     CC/IA   47.10M    NaN
Urdu (ur)               4.6G    CC/IA   6.09M     NaN
Uzbek (uz)              1.5G    CC/IA   1.37M     NaN
Vietnamese (vi)         159G    CC/IA   174.00M   NaN
Welsh (cy)              517M    CC/IA   725.00K   NaN

Bilingual

Data release 1 (September 2023)

Here, we publish bilingual corpora compiled from large web crawls provided by the Internet Archive and CommonCrawl projects. See more details at the HPLT project website.

There are 14 language pairs in this release (25 gigabytes in total size). The corpora are provided as raw, gzip-compressed TSV files. "Raw" means that the data comes directly from the sentence alignment step and only obvious noise has been filtered out, so it is likely to contain a high proportion of non-parallel sentences.

The format is a raw TSV file with aligned sentences and metadata. For example:

http://statsborger.no/en/udi-strips-norwegian-citizenship-of-my-norwegian-children/ TAB http://statsborger.no/er-geir-sine-barn-ulovelige-innvandrere/ TAB Will we ever be able to move back to Norway as a family? TAB Vil me nokon sinne ha lov til å flytta tilbake til Norge som ein familie? TAB 1941213a3984df16 TAB c956c856dce68bc9 TAB ba89f95e2977f9fe TAB 91.54 TAB wide17

Here TAB stands for a tab character, and the nine fields correspond to:

URL_source TAB URL_target TAB source_sentence TAB target_sentence TAB source_hash TAB target_hash TAB bifixer_hash TAB bifixer_score TAB collection

The bifixer hash and score can be used for deduplication, the other two hashes are useful for recrawling these sentences, and the collection identifies the crawl the texts came from.
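As one possible way to use these fields, the Python sketch below parses a line into the nine columns listed above and keeps, for each bifixer hash, only the segment with the highest bifixer score. The Segment type and both function names are hypothetical, and keeping the highest-scoring segment is one reasonable deduplication policy rather than the project's prescribed procedure.

```python
from typing import Dict, Iterable, List, NamedTuple


class Segment(NamedTuple):
    url_source: str
    url_target: str
    source_sentence: str
    target_sentence: str
    source_hash: str
    target_hash: str
    bifixer_hash: str
    bifixer_score: float
    collection: str


def parse_line(line: str) -> Segment:
    """Split one raw TSV line into the nine documented fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    return Segment(*fields[:7], float(fields[7]), fields[8])


def deduplicate(lines: Iterable[str]) -> List[Segment]:
    """Keep one segment per bifixer_hash: the one with the highest bifixer_score."""
    best: Dict[str, Segment] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        seg = parse_line(line)
        current = best.get(seg.bifixer_hash)
        if current is None or seg.bifixer_score > current.bifixer_score:
            best[seg.bifixer_hash] = seg
    return list(best.values())
```

Because this keeps one entry per distinct hash in memory, it suits the smaller language pairs; for the larger files, sorting by hash on disk first would be a more economical design.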

Language pair               Size    Source  Segments  Words
English-Albanian            957M    CC/IA   NaN       NaN
English-Arabic              4.2G    CC/IA   NaN       NaN
English-Basque              267M    CC/IA   NaN       NaN
English-Bosnian             81M     CC/IA   NaN       NaN
English-Croatian            8.4G    CC/IA   NaN       NaN
English-Estonian            5.8G    CC/IA   NaN       NaN
English-Galician            437M    CC/IA   NaN       NaN
English-Hindi               9.8G    CC/IA   NaN       NaN
English-Icelandic           1.6G    CC/IA   NaN       NaN
English-Irish               753M    CC/IA   NaN       NaN
English-Macedonian          1.3G    CC/IA   NaN       NaN
English-Maltese             947M    CC/IA   NaN       NaN
English-Norwegian Nynorsk   60M     CC/IA   NaN       NaN
English-Serbian             291M    CC/IA   NaN       NaN

License

These data are released under this licensing scheme:

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • You can reach us at hplt-datasets@ufal.mff.cuni.cz

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.

*It is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.