Version 2.0 of the HPLT Monolingual Datasets is now published. These collections are available under the Creative Commons CC0 license and bring significant improvements compared to previous releases (version 1.2). Similarly to 1.2, the release comes in two variants: deduplicated (21 TB in size) and cleaned (15 TB in size). The cleaned variant contains the same documents as deduplicated minus those filtered out by our cleaning heuristics. The cleaned variant is recommended unless you want to try your own cleaning pipelines.
As with the previous releases, the version 2.0 datasets are hosted on the Sigma2 NIRD Data Lake, and the text extraction pipeline was run on the LUMI supercomputer.
HPLT Monolingual Datasets version 2.0 (the deduplicated variant) features about 7.6 trillion whitespace-separated words and about 52 trillion characters extracted from 21 billion documents, compared to 5.6 trillion words and 42 trillion characters extracted from 5 billion documents in version 1.2. All in all, you can expect less noise and boilerplate, fewer duplicates, more unique documents, and generally better-quality texts to train language models on.
These automated reports provide useful statistics about the cleaned variant of the HPLT v2.0 datasets. They are the result of running the HPLT Analytics Tool on the data, and they let you inspect dataset properties before committing to a full download.
The output format is JSONL, where each line is a valid JSON object, providing a full document with all its metadata and text content. An example is provided here:
{"f":"./CC-MAIN-20170116095124-00401-ip-10-171-10-70.ec2.internal.warc.2.gz","o":37461524,"s":8114,"rs":27035, "u":"http://blogtailors.blogspot.com/2010/08/saida-de-emergencia-procura-tradutores.html", "c":"text/html","ts":"2017-01-24T09:05:15Z", "collection":"cc17", "lang":["por_Latn","slk_Latn","kmb_Latn"],"prob":[1,0,0], "text":"A editora Saída de Emergência procura revisores e tradutores profissionais em regime freelancer com formação superior linguística e experiência na área de revisão e tradução literária.\nPreza-se a capacidade de cumprimento de prazos bem definidos. Os interessados poderão enviar candidatura paraa geral@saidademergencia.com.\nquarta-feira, 4 de agosto de 2010\nSaída de Emergência procura tradutores e revisores em regime freelancer\nA editora Saída de Emergência procura revisores e tradutores profissionais em regime freelancer com formação superior linguística eexperiência na área de revisão e tradução literária.", "seg_langs":["por_Latn","por_Latn","por_Latn","por_Latn","por_Latn"], "robotstxt":"allowed", "id":"92a22c9672a52ae5af0da0457a184151", "filter":"keep", "pii":[[296,323]], "doc_scores":[6.4,10,10,10,10,10,8,0,0]}
{"f":"./path/to/80716-00468.warc.gz","o":579437,"s":1100,"rs":44535, ... "text":"More texts\n...", ...
In each document's text field, the segments are concatenated using newline separators. The first seven fields are inherited from the warc2text HTML extraction from the WARCs, explained here, with two exceptions: the p field is replaced by text, and the l field is replaced by lang and prob, which describe the top three identified languages for the document and their prediction probabilities. The rest of the output is explained here.
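As an illustration of the record layout described above, the following minimal Python sketch parses one JSONL line, splits the text field into segments on newlines, and picks the language with the highest probability. The record used here is a shortened, made-up stand-in for a real document, the function name parse_document is hypothetical, and the pairing of segments with seg_langs assumes one label per segment, as the example above suggests.

```python
import json

# Shortened, made-up record that mimics the field layout of the example above.
sample_line = json.dumps({
    "u": "http://example.org/page.html",
    "lang": ["por_Latn", "slk_Latn", "kmb_Latn"],
    "prob": [1, 0, 0],
    "text": "First segment.\nSecond segment.",
    "seg_langs": ["por_Latn", "por_Latn"],
    "filter": "keep",
})


def parse_document(line):
    """Parse one JSONL line into a small summary dictionary."""
    doc = json.loads(line)
    # Segments are joined with newline separators inside the text field.
    segments = doc["text"].split("\n")
    # lang and prob are parallel lists; take the pair with the highest probability.
    top_lang, top_prob = max(zip(doc["lang"], doc["prob"]), key=lambda pair: pair[1])
    return {
        "url": doc["u"],
        "top_lang": top_lang,
        "top_prob": top_prob,
        # Assumption: seg_langs holds one label per segment, in order.
        "segments": list(zip(segments, doc["seg_langs"])),
    }


if __name__ == "__main__":
    summary = parse_document(sample_line)
    print(summary["top_lang"], len(summary["segments"]))
```

For real shards the lines would first have to be read from the Zstandard-compressed .jsonl.zst files, which this sketch deliberately leaves out.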
The simplest way to download the data is to use wget -i with the language-specific mapping files containing full URLs for all shards of this particular language, for example:
wget -i https://data.hplt-project.org/two/deduplicated/epo_Latn_map.txt
wget -i https://data.hplt-project.org/two/cleaned/epo_Latn_map.txt
If you want to download all the available files from the deduplicated or cleaned variant in one go, use:
wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/two/deduplicated/hplt_monolingual_map_deduplicated_2.0.txt
wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/two/cleaned/hplt_monolingual_map_cleaned_2.0.txt
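If you prefer to script the downloads, the URL pattern of the language-specific mapping files shown above can be sketched in Python. This is only an illustration: the helper names map_url and parse_map_text are hypothetical, and the sketch assumes that a mapping file is plain text with one shard URL per line, which is the format that wget -i expects.

```python
BASE_URL = "https://data.hplt-project.org/two"
VARIANTS = ("deduplicated", "cleaned")


def map_url(lang_code, variant="cleaned"):
    """Build the mapping-file URL for a language code such as 'epo_Latn'."""
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    return f"{BASE_URL}/{variant}/{lang_code}_map.txt"


def parse_map_text(text):
    """Return the shard URLs listed in a mapping file.

    Assumption: one URL per line; blank lines are ignored.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(map_url("epo_Latn", "deduplicated"))
    print(map_url("epo_Latn"))
```

The resulting URL can be passed to wget -i exactly as in the commands above, or fetched with any HTTP client and fed to parse_map_text to download shards selectively.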
Every .zst file in the deduplicated and cleaned variants is accompanied by a corresponding .zst.md5 file containing its MD5 checksum. See, for example, this directory with the cleaned Norwegian data: https://data.hplt-project.org/two/cleaned/nob_Latn/
The integrity of a file can be checked with, e.g., md5sum -c 1.jsonl.zst.md5
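Where md5sum is not available, the same check can be approximated in Python with the standard library. This is a minimal sketch, not part of the HPLT tooling: the function names are hypothetical, and it assumes that each .md5 file follows the usual md5sum layout (hex digest, whitespace, file name) and sits next to the file it describes.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5_file(md5_path):
    """Check a data file against its checksum file.

    Assumption: the checksum file contains '<hex digest>  <file name>',
    as written by md5sum, and the data file lives in the same directory.
    """
    md5_path = Path(md5_path)
    expected, name = md5_path.read_text().split(maxsplit=1)
    # md5sum marks binary mode with a leading '*' before the file name.
    target = md5_path.parent / name.strip().lstrip("*")
    return md5_of_file(target) == expected.lower()
```

Calling verify_md5_file("1.jsonl.zst.md5") in the download directory would then return True when the checksum matches.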
There are 193 languages in the HPLT dataset catalogue in version 2.0. For each language and variant (deduplicated and cleaned), counts of documents, words, characters and segments are provided. If you find any problem, please contact us!
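The catalogue reports counts in abbreviated form (for example 11.67k, 3.35M, 1.59B or 2.48T). To compare the two variants numerically, a small helper along the following lines can expand these values and compute how much of the deduplicated data survives cleaning. The function names are hypothetical, and because the published figures are rounded, the results are approximate.

```python
# Multipliers for the suffixes used in the catalogue counts.
SUFFIXES = {"k": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_count(text):
    """Convert an abbreviated count such as '818.88k' into a float."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def retention(deduplicated, cleaned):
    """Fraction of the deduplicated count that remains in the cleaned variant."""
    return parse_count(cleaned) / parse_count(deduplicated)


if __name__ == "__main__":
    # Esperanto document counts from the catalogue: 2.15M deduplicated, 818.88k cleaned.
    print(f"{retention('2.15M', '818.88k'):.1%}")
```

For Esperanto this gives a document retention of roughly 38%, whereas for very low-resource languages the cleaned variant can be a much smaller fraction of the deduplicated one.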
You can find stratified samples for this release in the following links:
A description of the samples can be found here.
Find interesting hierarchical treemaps to get a glimpse of the language family distribution across HPLT data!
These data are released under this licensing scheme:
Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.
*It is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.
Achinese (Arabic) (ace-Arab)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 11.67k
Words: 3.35M
Chars: 25.69M
Segments: 825.08k
CLEANED:
Docs: 16
Words: 8.36k
Chars: 49.74k
Segments: 117
16 downloads
Achinese (Latin) (ace-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.26M
Words: 172.75M
Chars: 1.59B
Segments: 44.95M
CLEANED:
Docs: 12.93k
Words: 8.20M
Chars: 50.85M
Segments: 206.19k
11 downloads
Afrikaans (Latin) (afr-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 9.59M
Words: 2.43B
Chars: 18.35B
Segments: 251.52M
CLEANED:
Docs: 1.46M
Words: 1.00B
Chars: 5.95B
Segments: 37.74M
6 downloads
Albanian (Latin) (als-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 11.29M
Words: 4.37B
Chars: 28.42B
Segments: 304.29M
CLEANED:
Docs: 5.39M
Words: 2.71B
Chars: 16.10B
Segments: 95.10M
5 downloads
Amharic (Ethiopic) (amh-Ethi)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 13.02M
Words: 7.21B
Chars: 65.58B
Segments: 725.19M
CLEANED:
Docs: 295.54k
Words: 195.89M
Chars: 1.03B
Segments: 7.01M
8 downloads
Arabic (Arabic) (ara-Arab)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 191.75M
Words: 73.48B
Chars: 434.47B
Segments: 5.64B
CLEANED:
Docs: 82.67M
Words: 48.14B
Chars: 279.59B
Segments: 2.20B
27 downloads
Assamese (Bangla) (asm-Beng)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 263.14k
Words: 94.13M
Chars: 617.90M
Segments: 4.76M
CLEANED:
Docs: 175.71k
Words: 73.44M
Chars: 475.83M
Segments: 2.68M
3 downloads
Asturian (Latin) (ast-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 12.53M
Words: 1.99B
Chars: 12.78B
Segments: 361.09M
CLEANED:
Docs: 273.24k
Words: 194.99M
Chars: 1.24B
Segments: 7.43M
4 downloads
Awadhi (Devanagari) (awa-Deva)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.12M
Words: 319.73M
Chars: 1.91B
Segments: 64.96M
CLEANED:
Docs: 7.28k
Words: 6.05M
Chars: 28.78M
Segments: 131.47k
2 downloads
Aymara (Latin) (ayr-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 3.56M
Words: 402.05M
Chars: 2.40B
Segments: 84.87M
CLEANED:
Docs: 9.22k
Words: 3.07M
Chars: 25.09M
Segments: 188.53k
3 downloads
South Azerbaijani (Arabic) (azb-Arab)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 253.48k
Words: 69.60M
Chars: 445.30M
Segments: 7.60M
CLEANED:
Docs: 66.11k
Words: 39.58M
Chars: 260.26M
Segments: 2.39M
5 downloads
Azerbaijani (Latin) (azj-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 16.09M
Words: 4.42B
Chars: 34.65B
Segments: 409.54M
CLEANED:
Docs: 6.48M
Words: 2.57B
Chars: 19.63B
Segments: 126.61M
3 downloads
Bashkir (Cyrillic) (bak-Cyrl)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 778.47k
Words: 144.56M
Chars: 1.02B
Segments: 14.89M
CLEANED:
Docs: 170.82k
Words: 75.33M
Chars: 558.67M
Segments: 3.14M
1 download
Bambara (Latin) (bam-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.45M
Words: 32.23M
Chars: 230.47M
Segments: 8.57M
CLEANED:
Docs: 5.72k
Words: 3.98M
Chars: 20.74M
Segments: 91.72k
3 downloads
Balinese (Latin) (ban-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 336.98k
Words: 74.92M
Chars: 628.80M
Segments: 15.37M
CLEANED:
Docs: 10.70k
Words: 11.34M
Chars: 77.26M
Segments: 601.14k
2 downloads
Belarusian (Cyrillic) (bel-Cyrl)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 5.75M
Words: 2.27B
Chars: 18.85B
Segments: 167.78M
CLEANED:
Docs: 2.32M
Words: 1.21B
Chars: 8.54B
Segments: 48.84M
13 downloads
Bemba (Latin) (bem-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 759.41k
Words: 116.22M
Chars: 612.21M
Segments: 34.05M
CLEANED:
Docs: 6.14k
Words: 4.52M
Chars: 32.33M
Segments: 133.54k
Bangla (Bangla) (ben-Beng)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 17.34M
Words: 6.41B
Chars: 41.25B
Segments: 493.78M
CLEANED:
Docs: 11.04M
Words: 4.64B
Chars: 30.17B
Segments: 176.01M
2 downloads
Bhojpuri (Devanagari) (bho-Deva)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 226.96k
Words: 39.34M
Chars: 213.88M
Segments: 4.15M
CLEANED:
Docs: 28.64k
Words: 13.47M
Chars: 68.68M
Segments: 458.26k
4 downloads
Banjar (Arabic) (bjn-Arab)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 14.07k
Words: 2.16M
Chars: 18.62M
Segments: 591.75k
CLEANED:
Docs: 1.11k
Words: 548.24k
Chars: 3.32M
Segments: 19.53k
2 downloads
Banjar (Latin) (bjn-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.31M
Words: 209.29M
Chars: 1.67B
Segments: 49.15M
CLEANED:
Docs: 18.76k
Words: 8.05M
Chars: 55.99M
Segments: 366.34k
Tibetan (Tibetan) (bod-Tibt)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 171.77k
Words: 15.53M
Chars: 680.83M
Segments: 2.59M
CLEANED:
Docs: 27.44k
Words: 5.78M
Chars: 268.56M
Segments: 464.99k
9 downloads
Bosnian (Latin) (bos-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 35.57M
Words: 13.58B
Chars: 89.09B
Segments: 860.56M
CLEANED:
Docs: 14.61M
Words: 7.26B
Chars: 46.09B
Segments: 268.16M
3 downloads
Buginese (Latin) (bug-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 572.44k
Words: 79.03M
Chars: 620.28M
Segments: 17.32M
CLEANED:
Docs: 2.02k
Words: 2.70M
Chars: 19.31M
Segments: 38.55k
Bulgarian (Cyrillic) (bul-Cyrl)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 56.66M
Words: 20.84B
Chars: 134.99B
Segments: 1.48B
CLEANED:
Docs: 28.09M
Words: 15.30B
Chars: 96.96B
Segments: 681.41M
2 downloads
Catalan (Latin) (cat-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 34.44M
Words: 12.69B
Chars: 78.07B
Segments: 724.77M
CLEANED:
Docs: 18.55M
Words: 10.02B
Chars: 60.21B
Segments: 383.34M
7 downloads
Cebuano (Latin) (ceb-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.24M
Words: 188.62M
Chars: 1.20B
Segments: 21.13M
CLEANED:
Docs: 138.84k
Words: 85.89M
Chars: 515.83M
Segments: 2.86M
Czech (Latin) (ces-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 168.65M
Words: 62.92B
Chars: 412.04B
Segments: 5.40B
CLEANED:
Docs: 75.29M
Words: 42.08B
Chars: 274.01B
Segments: 1.93B
6 downloads
Chokwe (Latin) (cjk-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 27.76k
Words: 3.79M
Chars: 27.55M
Segments: 820.23k
CLEANED:
Docs: 1.20k
Words: 964.70k
Chars: 7.43M
Segments: 36.70k
Central Kurdish (Arabic) (ckb-Arab)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.26M
Words: 442.62M
Chars: 2.86B
Segments: 22.55M
CLEANED:
Docs: 273.75k
Words: 142.65M
Chars: 913.08M
Segments: 5.23M
1 download
Crimean Tatar (Latin) (crh-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.94M
Words: 303.68M
Chars: 2.01B
Segments: 59.37M
CLEANED:
Docs: 122.74k
Words: 36.76M
Chars: 281.20M
Segments: 1.38M
Welsh (Latin) (cym-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.58M
Words: 793.12M
Chars: 5.35B
Segments: 79.45M
CLEANED:
Docs: 758.13k
Words: 409.04M
Chars: 2.40B
Segments: 15.57M
3 downloads
Danish (Latin) (dan-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 101.01M
Words: 31.83B
Chars: 205.93B
Segments: 2.80B
CLEANED:
Docs: 33.84M
Words: 21.20B
Chars: 133.41B
Segments: 873.02M
12 downloads
German (Latin) (deu-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 897.17M
Words: 344.15B
Chars: 2.48T
Segments: 28.01B
CLEANED:
Docs: 482.05M
Words: 251.48B
Chars: 1.78T
Segments: 11.13B
7 downloads
Dinka (Latin) (dik-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 72.72k
Words: 14.17M
Chars: 72.43M
Segments: 1.48M
CLEANED:
Docs: 2.33k
Words: 2.29M
Chars: 11.54M
Segments: 34.65k
Dyula (Latin) (dyu-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 22.24k
Words: 2.07M
Chars: 11.85M
Segments: 198.56k
CLEANED:
Docs: 1.39k
Words: 1.19M
Chars: 5.55M
Segments: 24.56k
Dzongkha (Tibetan) (dzo-Tibt)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 60.26k
Words: 8.12M
Chars: 96.26M
Segments: 1.85M
CLEANED:
Docs: 1.63k
Words: 422.24k
Chars: 7.38M
Segments: 39.97k
1 download
Greek (Greek) (ell-Grek)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 126.93M
Words: 55.53B
Chars: 373.82B
Segments: 3.68B
CLEANED:
Docs: 70.33M
Words: 42.70B
Chars: 283.60B
Segments: 1.85B
6 downloads
English (Latin) (eng-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 7.72B
Words: 3.75T
Chars: 22.79T
Segments: 220.10B
CLEANED:
Docs: 4.39B
Words: 2.86T
Chars: 17.09T
Segments: 116.52B
60 downloads
Esperanto (Latin) (epo-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.15M
Words: 681.98M
Chars: 4.33B
Segments: 51.27M
CLEANED:
Docs: 818.88k
Words: 471.60M
Chars: 2.98B
Segments: 20.35M
1 download
Estonian (Latin) (est-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 27.01M
Words: 8.05B
Chars: 62.78B
Segments: 822.02M
CLEANED:
Docs: 8.45M
Words: 4.74B
Chars: 36.03B
Segments: 264.42M
13 downloads
Basque (Latin) (eus-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 8.36M
Words: 2.30B
Chars: 16.77B
Segments: 156.66M
CLEANED:
Docs: 1.97M
Words: 776.64M
Chars: 6.05B
Segments: 37.62M
3 downloads
Ewe (Latin) (ewe-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 422.52k
Words: 27.85M
Chars: 291.48M
Segments: 5.78M
CLEANED:
Docs: 3.77k
Words: 4.31M
Chars: 21.32M
Segments: 143.40k
1 download
Faroese (Latin) (fao-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 772.73k
Words: 166.00M
Chars: 1.18B
Segments: 19.69M
CLEANED:
Docs: 239.92k
Words: 93.45M
Chars: 582.04M
Segments: 4.53M
8 downloads
Fijian (Latin) (fij-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 246.03k
Words: 24.93M
Chars: 212.08M
Segments: 3.05M
CLEANED:
Docs: 8.91k
Words: 7.26M
Chars: 37.70M
Segments: 178.92k
1 download
Finnish (Latin) (fin-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 79.98M
Words: 28.26B
Chars: 235.64B
Segments: 3.12B
CLEANED:
Docs: 34.82M
Words: 18.45B
Chars: 155.71B
Segments: 976.62M
16 downloads
Fon (Latin) (fon-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 41.82k
Words: 9.08M
Chars: 57.06M
Segments: 522.76k
CLEANED:
Docs: 1.23k
Words: 1.23M
Chars: 5.34M
Segments: 14.76k
1 download
French (Latin) (fra-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 685.35M
Words: 303.58B
Chars: 1.87T
Segments: 18.88B
CLEANED:
Docs: 401.83M
Words: 237.04B
Chars: 1.46T
Segments: 10.56B
17 downloads
Friulian (Latin) (fur-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 7.78M
Words: 475.04M
Chars: 2.85B
Segments: 133.33M
CLEANED:
Docs: 36.67k
Words: 20.82M
Chars: 114.77M
Segments: 730.04k
Nigerian Fulfulde (Latin) (fuv-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 171.74k
Words: 56.25M
Chars: 498.80M
Segments: 8.17M
CLEANED:
Docs: 7.76k
Words: 5.14M
Chars: 29.91M
Segments: 133.98k
1 download
Oromo (Latin) (gaz-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 563.06k
Words: 120.65M
Chars: 1.21B
Segments: 14.06M
CLEANED:
Docs: 49.14k
Words: 28.88M
Chars: 219.26M
Segments: 973.63k
3 downloads
Scottish Gaelic (Latin) (gla-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.42M
Words: 302.57M
Chars: 2.12B
Segments: 43.72M
CLEANED:
Docs: 137.41k
Words: 80.66M
Chars: 483.76M
Segments: 3.31M
3 downloads
Irish (Latin) (gle-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.76M
Words: 573.32M
Chars: 3.88B
Segments: 57.52M
CLEANED:
Docs: 490.79k
Words: 295.71M
Chars: 1.75B
Segments: 10.99M
3 downloads
Galician (Latin) (glg-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 17.32M
Words: 3.52B
Chars: 24.37B
Segments: 635.06M
CLEANED:
Docs: 3.02M
Words: 1.64B
Chars: 10.11B
Segments: 61.18M
2 downloads
Guarani (Latin) (grn-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 7.38M
Words: 1.02B
Chars: 6.95B
Segments: 169.01M
CLEANED:
Docs: 73.42k
Words: 30.72M
Chars: 218.70M
Segments: 1.71M
Gujarati (Gujarati) (guj-Gujr)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.52M
Words: 738.53M
Chars: 4.58B
Segments: 51.48M
CLEANED:
Docs: 1.13M
Words: 576.82M
Chars: 3.39B
Segments: 20.64M
2 downloads
Haitian Creole (Latin) (hat-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 6.27M
Words: 944.63M
Chars: 5.69B
Segments: 162.56M
CLEANED:
Docs: 212.69k
Words: 122.29M
Chars: 639.12M
Segments: 4.64M
Hausa (Latin) (hau-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.63M
Words: 300.87M
Chars: 1.69B
Segments: 31.88M
CLEANED:
Docs: 315.87k
Words: 152.62M
Chars: 853.83M
Segments: 5.69M
Hebrew (Hebrew) (heb-Hebr)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 40.69M
Words: 16.09B
Chars: 93.70B
Segments: 1.43B
CLEANED:
Docs: 17.12M
Words: 9.97B
Chars: 56.84B
Segments: 466.63M
2 downloads
Hindi (Devanagari) (hin-Deva)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 26.80M
Words: 13.76B
Chars: 74.08B
Segments: 751.52M
CLEANED:
Docs: 13.65M
Words: 8.64B
Chars: 43.97B
Segments: 267.41M
10 downloads
Chhattisgarhi (Devanagari) (hne-Deva)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 914.27k
Words: 59.78M
Chars: 324.99M
Segments: 22.51M
CLEANED:
Docs: 2.81k
Words: 2.20M
Chars: 10.60M
Segments: 55.00k
1 download
Croatian (Latin) (hrv-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 41.23M
Words: 14.20B
Chars: 92.86B
Segments: 1.13B
CLEANED:
Docs: 12.30M
Words: 7.31B
Chars: 48.01B
Segments: 297.13M
6 downloads
Hungarian (Latin) (hun-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 116.86M
Words: 44.00B
Chars: 324.50B
Segments: 4.16B
CLEANED:
Docs: 51.87M
Words: 30.52B
Chars: 225.25B
Segments: 1.42B
1 download
Armenian (Armenian) (hye-Armn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 6.44M
Words: 1.72B
Chars: 12.97B
Segments: 123.72M
CLEANED:
Docs: 3.60M
Words: 1.40B
Chars: 10.72B
Segments: 65.24M
1 download
Igbo (Latin) (ibo-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.41M
Words: 121.57M
Chars: 823.31M
Segments: 18.86M
CLEANED:
Docs: 56.29k
Words: 38.29M
Chars: 205.21M
Segments: 1.41M
3 downloads
Iloko (Latin) (ilo-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.55M
Words: 164.05M
Chars: 1.10B
Segments: 40.06M
CLEANED:
Docs: 48.75k
Words: 24.78M
Chars: 156.84M
Segments: 1.12M
Indonesian (Latin) (ind-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 169.44M
Words: 78.71B
Chars: 551.63B
Segments: 4.74B
CLEANED:
Docs: 98.14M
Words: 54.62B
Chars: 384.32B
Segments: 2.39B
4 downloads
Icelandic (Latin) (isl-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 6.02M
Words: 2.13B
Chars: 13.37B
Segments: 153.03M
CLEANED:
Docs: 2.84M
Words: 1.54B
Chars: 9.60B
Segments: 69.64M
3 downloads
Italian (Latin) (ita-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 381.65M
Words: 170.20B
Chars: 1.13T
Segments: 10.21B
CLEANED:
Docs: 221.75M
Words: 127.41B
Chars: 820.82B
Segments: 5.13B
8 downloads
Javanese (Latin) (jav-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.28M
Words: 311.46M
Chars: 2.44B
Segments: 31.44M
CLEANED:
Docs: 195.97k
Words: 137.82M
Chars: 937.71M
Segments: 6.43M
3 downloads
Japanese (Japanese) (jpn-Jpan)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.16B
Words: 106.81B
Chars: 1.63T
Segments: 51.70B
CLEANED:
Docs: 417.71M
Words: 42.36B
Chars: 901.53B
Segments: 23.27B
19 downloads
Kabyle (Latin) (kab-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.35M
Words: 257.65M
Chars: 3.26B
Segments: 61.52M
CLEANED:
Docs: 15.10k
Words: 9.22M
Chars: 54.21M
Segments: 345.22k
2 downloads
Kachin (Latin) (kac-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 101.29k
Words: 39.46M
Chars: 375.99M
Segments: 9.26M
CLEANED:
Docs: 7.59k
Words: 5.96M
Chars: 28.41M
Segments: 159.42k
Kamba (Latin) (kam-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 40.20k
Words: 5.46M
Chars: 32.84M
Segments: 842.30k
CLEANED:
Docs: 1.18k
Words: 674.04k
Chars: 4.65M
Segments: 14.26k
Kannada (Kannada) (kan-Knda)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.51M
Words: 739.80M
Chars: 5.73B
Segments: 71.25M
CLEANED:
Docs: 1.34M
Words: 532.86M
Chars: 4.30B
Segments: 24.93M
1 download
Kashmiri (Arabic) (kas-Arab)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 15.18k
Words: 2.72M
Chars: 21.60M
Segments: 545.00k
CLEANED:
Docs: 949
Words: 678.02k
Chars: 3.47M
Segments: 27.11k
1 download
Kashmiri (Devanagari) (kas-Deva)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 23.44k
Words: 5.18M
Chars: 37.19M
Segments: 938.52k
CLEANED:
Docs: 106
Words: 31.94k
Chars: 185.55k
Segments: 1.36k
Georgian (Georgian) (kat-Geor)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 7.57M
Words: 1.93B
Chars: 15.26B
Segments: 195.10M
CLEANED:
Docs: 3.34M
Words: 1.24B
Chars: 10.16B
Segments: 63.72M
4 downloads
Kazakh (Cyrillic) (kaz-Cyrl)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 5.16M
Words: 2.00B
Chars: 15.35B
Segments: 151.47M
CLEANED:
Docs: 2.64M
Words: 1.41B
Chars: 11.13B
Segments: 81.01M
3 downloads
Kabiyè (Latin) (kbp-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 228.24k
Words: 25.61M
Chars: 132.74M
Segments: 2.95M
CLEANED:
Docs: 7.08k
Words: 4.26M
Chars: 20.91M
Segments: 46.79k
Kabuverdianu (Latin) (kea-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 18.25k
Words: 3.31M
Chars: 17.88M
Segments: 422.15k
CLEANED:
Docs: 1.96k
Words: 1.14M
Chars: 6.15M
Segments: 43.91k
Mongolian (Cyrillic) (khk-Cyrl)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 3.63M
Words: 1.72B
Chars: 11.89B
Segments: 88.34M
CLEANED:
Docs: 2.12M
Words: 1.34B
Chars: 9.33B
Segments: 53.47M
4 downloads
Khmer (Khmer) (khm-Khmr)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.30M
Words: 210.16M
Chars: 3.59B
Segments: 38.34M
CLEANED:
Docs: 700.99k
Words: 113.80M
Chars: 2.12B
Segments: 9.86M
1 download
Kikuyu (Latin) (kik-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 82.34k
Words: 7.70M
Chars: 60.74M
Segments: 1.54M
CLEANED:
Docs: 4.00k
Words: 1.43M
Chars: 9.30M
Segments: 51.93k
Kinyarwanda (Latin) (kin-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.59M
Words: 147.93M
Chars: 1.35B
Segments: 26.79M
CLEANED:
Docs: 92.70k
Words: 50.74M
Chars: 367.20M
Segments: 1.92M
Kyrgyz (Cyrillic) (kir-Cyrl)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.59M
Words: 338.52M
Chars: 2.60B
Segments: 25.49M
CLEANED:
Docs: 676.11k
Words: 246.66M
Chars: 1.93B
Segments: 10.04M
4 downloads
Kimbundu (Latin) (kmb-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 70.13k
Words: 15.85M
Chars: 109.65M
Segments: 4.92M
CLEANED:
Docs: 531
Words: 383.09k
Chars: 2.07M
Segments: 11.80k
Kurdish (Latin) (kmr-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 714.99k
Words: 242.64M
Chars: 1.40B
Segments: 12.67M
CLEANED:
Docs: 364.35k
Words: 195.87M
Chars: 1.12B
Segments: 7.15M
Kanuri (Arabic) (knc-Arab)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.57k
Words: 1.01M
Chars: 4.58M
Segments: 171.98k
CLEANED:
Docs: 245
Words: 262.00k
Chars: 1.30M
Segments: 10.83k
1 download
Kanuri (Latin) (knc-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 41.67k
Words: 8.94M
Chars: 51.16M
Segments: 1.23M
CLEANED:
Docs: 2.47k
Words: 2.41M
Chars: 11.95M
Segments: 10.52k
Kongo (Latin) (kon-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 53.30k
Words: 4.11M
Chars: 25.66M
Segments: 626.53k
CLEANED:
Docs: 2.54k
Words: 1.94M
Chars: 11.28M
Segments: 47.48k
2 downloads
Korean (Hangul) (kor-Hang)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 98.75M
Words: 30.97B
Chars: 144.91B
Segments: 3.48B
CLEANED:
Docs: 38.87M
Words: 19.69B
Chars: 89.27B
Segments: 1.36B
11 downloads
Lao (Lao) (lao-Laoo)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 624.70k
Words: 66.19M
Chars: 931.66M
Segments: 9.36M
CLEANED:
Docs: 29.50k
Words: 5.18M
Chars: 84.71M
Segments: 319.95k
1 download
Ligurian (Latin) (lij-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 659.62k
Words: 157.11M
Chars: 908.36M
Segments: 29.25M
CLEANED:
Docs: 8.37k
Words: 5.59M
Chars: 31.47M
Segments: 157.72k
Limburgish (Latin) (lim-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 11.11M
Words: 1.64B
Chars: 11.19B
Segments: 347.02M
CLEANED:
Docs: 367.93k
Words: 180.62M
Chars: 1.13B
Segments: 7.14M
1 download
Lingala (Latin) (lin-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.11M
Words: 68.92M
Chars: 434.78M
Segments: 17.96M
CLEANED:
Docs: 7.59k
Words: 5.55M
Chars: 32.93M
Segments: 200.34k
Lithuanian (Latin) (lit-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 35.68M
Words: 10.03B
Chars: 76.42B
Segments: 888.10M
CLEANED:
Docs: 13.34M
Words: 6.68B
Chars: 50.41B
Segments: 322.16M
2 downloads
Lombard (Latin) (lmo-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.98M
Words: 626.58M
Chars: 5.19B
Segments: 108.78M
CLEANED:
Docs: 146.16k
Words: 59.64M
Chars: 345.51M
Segments: 2.12M
Latgalian (Latin) (ltg-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 484.07k
Words: 69.01M
Chars: 569.86M
Segments: 21.60M
CLEANED:
Docs: 9.21k
Words: 3.79M
Chars: 26.89M
Segments: 151.38k
1 download
Luxembourgish (Latin) (ltz-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.39M
Words: 419.25M
Chars: 2.61B
Segments: 83.31M
CLEANED:
Docs: 246.93k
Words: 107.22M
Chars: 710.65M
Segments: 5.06M
5 downloads
Luba-Lulua (Latin) (lua-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 104.57k
Words: 7.24M
Chars: 47.21M
Segments: 1.69M
CLEANED:
Docs: 1.08k
Words: 1.37M
Chars: 9.01M
Segments: 38.69k
Ganda (Latin) (lug-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 321.60k
Words: 28.42M
Chars: 267.44M
Segments: 3.56M
CLEANED:
Docs: 21.28k
Words: 9.18M
Chars: 67.99M
Segments: 407.54k
Luo (Latin) (luo-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 190.37k
Words: 19.93M
Chars: 149.68M
Segments: 3.19M
CLEANED:
Docs: 4.15k
Words: 3.73M
Chars: 20.33M
Segments: 84.12k
Mizo (Latin) (lus-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.22M
Words: 221.83M
Chars: 1.31B
Segments: 25.88M
CLEANED:
Docs: 160.38k
Words: 125.20M
Chars: 652.17M
Segments: 3.43M
Latvian (Latin) (lvs-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 23.04M
Words: 6.26B
Chars: 47.62B
Segments: 656.95M
CLEANED:
Docs: 6.77M
Words: 3.46B
Chars: 25.19B
Segments: 173.81M
3 downloads
Magahi (Devanagari) (mag-Deva)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 8.95M
Words: 4.43B
Chars: 50.01B
Segments: 35.61M
CLEANED:
Docs: 328
Words: 890.63k
Chars: 4.28M
Segments: 19.29k
Maithili (Devanagari) (mai-Deva)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 116.18k
Words: 33.60M
Chars: 170.03M
Segments: 2.12M
CLEANED:
Docs: 24.98k
Words: 17.79M
Chars: 96.77M
Segments: 645.53k
1 download
Malayalam (Malayalam) (mal-Mlym)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 4.59M
Words: 1.20B
Chars: 11.46B
Segments: 76.76M
CLEANED:
Docs: 3.10M
Words: 973.66M
Chars: 9.49B
Segments: 48.00M
2 downloads
Marathi (Devanagari) (mar-Deva)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 3.32M
Words: 1.25B
Chars: 8.35B
Segments: 68.56M
CLEANED:
Docs: 2.08M
Words: 980.75M
Chars: 6.62B
Segments: 36.32M
1 download
Minangkabau (Latin) (min-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 901.19k
Words: 137.36M
Chars: 1.07B
Segments: 23.70M
CLEANED:
Docs: 25.04k
Words: 10.98M
Chars: 74.80M
Segments: 600.80k
1 download
Macedonian (Cyrillic) (mkd-Cyrl)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 7.71M
Words: 2.08B
Chars: 13.39B
Segments: 164.74M
CLEANED:
Docs: 3.57M
Words: 1.49B
Chars: 9.44B
Segments: 57.01M
10 downloads
Maltese (Latin) (mlt-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 7.75M
Words: 930.52M
Chars: 6.97B
Segments: 129.16M
CLEANED:
Docs: 367.26k
Words: 195.81M
Chars: 1.44B
Segments: 8.68M
Manipuri (Bangla) (mni-Beng)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 14.18k
Words: 3.78M
Chars: 26.58M
Segments: 612.73k
CLEANED:
Docs: 2.93k
Words: 1.63M
Chars: 11.79M
Segments: 65.76k
1 download
Mossi (Latin) (mos-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 220.45k
Words: 37.80M
Chars: 151.08M
Segments: 6.45M
CLEANED:
Docs: 931
Words: 807.49k
Chars: 3.86M
Segments: 19.10k
Māori (Latin) (mri-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 665.39k
Words: 157.42M
Chars: 829.14M
Segments: 13.43M
CLEANED:
Docs: 108.26k
Words: 86.76M
Chars: 424.40M
Segments: 2.80M
Burmese (Myanmar) (mya-Mymr)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.98M
Words: 787.24M
Chars: 9.81B
Segments: 75.11M
CLEANED:
Docs: 1.37M
Words: 453.18M
Chars: 5.82B
Segments: 30.50M
2 downloads
Dutch (Latin) (nld-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 303.51M
Words: 103.15B
Chars: 661.41B
Segments: 8.06B
CLEANED:
Docs: 138.65M
Words: 71.40B
Chars: 451.22B
Segments: 3.07B
2 downloads
Norwegian Nynorsk (Latin) (nno-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 15.13M
Words: 1.83B
Chars: 11.86B
Segments: 224.17M
CLEANED:
Docs: 1.42M
Words: 860.34M
Chars: 5.41B
Segments: 34.60M
5 downloads
Norwegian Bokmål (Latin) (nob-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 64.72M
Words: 31.42B
Chars: 200.54B
Segments: 2.00B
CLEANED:
Docs: 27.05M
Words: 21.53B
Chars: 133.27B
Segments: 675.97M
5 downloads
Nepali (Devanagari) (npi-Deva)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 4.03M
Words: 1.35B
Chars: 8.70B
Segments: 54.53M
CLEANED:
Docs: 2.78M
Words: 1.13B
Chars: 7.26B
Segments: 37.14M
2 downloads
Northern Sotho (Latin) (nso-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 409.23k
Words: 21.19M
Chars: 142.60M
Segments: 3.67M
CLEANED:
Docs: 6.07k
Words: 5.32M
Chars: 27.50M
Segments: 143.31k
Nuer (Latin) (nus-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 11.18k
Words: 5.20M
Chars: 44.57M
Segments: 275.47k
CLEANED:
Docs: 272
Words: 393.16k
Chars: 1.88M
Segments: 8.51k
Nyanja (Latin) (nya-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 371.76k
Words: 42.92M
Chars: 318.50M
Segments: 7.12M
CLEANED:
Docs: 53.12k
Words: 27.06M
Chars: 202.97M
Segments: 1.34M
Occitan (Latin) (oci-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 3.60M
Words: 536.38M
Chars: 3.32B
Segments: 86.55M
CLEANED:
Docs: 189.91k
Words: 102.72M
Chars: 635.59M
Segments: 4.19M
Odia (Odia) (ory-Orya)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 587.96k
Words: 145.78M
Chars: 947.33M
Segments: 5.59M
CLEANED:
Docs: 412.89k
Words: 120.13M
Chars: 781.95M
Segments: 3.60M
2 downloads
Pangasinan (Latin) (pag-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.42M
Words: 155.36M
Chars: 889.39M
Segments: 46.27M
CLEANED:
Docs: 6.90k
Words: 5.66M
Chars: 33.53M
Segments: 85.83k
Punjabi (Gurmukhi) (pan-Guru)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.05M
Words: 517.32M
Chars: 2.67B
Segments: 34.52M
CLEANED:
Docs: 584.59k
Words: 372.17M
Chars: 1.90B
Segments: 11.74M
3 downloads
Papiamento (Latin) (pap-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 8.51M
Words: 416.98M
Chars: 2.46B
Segments: 89.44M
CLEANED:
Docs: 89.81k
Words: 46.71M
Chars: 254.18M
Segments: 1.39M
Southern Pashto (Arabic) (pbt-Arab)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 769.00k
Words: 334.95M
Chars: 1.59B
Segments: 13.49M
CLEANED:
Docs: 466.47k
Words: 279.44M
Chars: 1.30B
Segments: 8.45M
2 downloads
Persian (Arabic) (pes-Arab)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 196.84M
Words: 124.37B
Chars: 644.49B
Segments: 7.03B
CLEANED:
Docs: 90.50M
Words: 88.55B
Chars: 455.15B
Segments: 3.96B
3 downloads
Malagasy (Latin) (plt-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 492.55k
Words: 152.73M
Chars: 1.05B
Segments: 10.18M
CLEANED:
Docs: 207.84k
Words: 117.08M
Chars: 810.51M
Segments: 4.74M
Polish (Latin) (pol-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 382.38M
Words: 136.50B
Chars: 948.27B
Segments: 12.72B
CLEANED:
Docs: 175.41M
Words: 89.53B
Chars: 631.77B
Segments: 4.46B
7 downloads
Portuguese (Latin) (por-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 470.63M
Words: 203.71B
Chars: 1.26T
Segments: 14.18B
CLEANED:
Docs: 237.81M
Words: 146.27B
Chars: 896.79B
Segments: 6.12B
10 downloads
Dari (Arabic) (prs-Arab)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 12.33M
Words: 5.21B
Chars: 28.74B
Segments: 413.50M
CLEANED:
Docs: 2.84M
Words: 1.84B
Chars: 9.57B
Segments: 69.00M
3 downloads
Ayacucho Quechua (Latin) (quy-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 405.81k
Words: 44.88M
Chars: 327.42M
Segments: 5.30M
CLEANED:
Docs: 36.94k
Words: 17.31M
Chars: 143.45M
Segments: 494.25k
5 downloads
Romanian (Latin) (ron-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 115.84M
Words: 52.11B
Chars: 329.18B
Segments: 3.37B
CLEANED:
Docs: 65.88M
Words: 40.05B
Chars: 250.72B
Segments: 1.70B
3 downloads
Rundi (Latin) (run-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.70M
Words: 232.01M
Chars: 1.69B
Segments: 36.91M
CLEANED:
Docs: 137.30k
Words: 44.44M
Chars: 316.63M
Segments: 1.75M
Russian (Cyrillic) (rus-Cyrl)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.64B
Words: 696.30B
Chars: 5.01T
Segments: 49.90B
CLEANED:
Docs: 884.69M
Words: 540.88B
Chars: 3.91T
Segments: 26.29B
6 downloads
Sango (Latin) (sag-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 864.29k
Words: 136.25M
Chars: 760.17M
Segments: 39.56M
CLEANED:
Docs: 3.16k
Words: 3.61M
Chars: 16.74M
Segments: 51.90k
Sanskrit (Devanagari) (san-Deva)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 200.47k
Words: 95.80M
Chars: 746.25M
Segments: 11.58M
CLEANED:
Docs: 54.91k
Words: 43.80M
Chars: 359.21M
Segments: 3.28M
1 download
Santali (Ol Chiki) (sat-Olck)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 9.40k
Words: 2.88M
Chars: 16.98M
Segments: 217.79k
CLEANED:
Docs: 2.57k
Words: 1.09M
Chars: 6.27M
Segments: 45.80k
1 download
Sicilian (Latin) (scn-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.42M
Words: 433.55M
Chars: 3.24B
Segments: 116.89M
CLEANED:
Docs: 81.97k
Words: 42.39M
Chars: 252.40M
Segments: 1.65M
Shan (Myanmar) (shn-Mymr)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 20.87k
Words: 2.99M
Chars: 33.50M
Segments: 406.32k
CLEANED:
Docs: 6.00k
Words: 1.65M
Chars: 21.22M
Segments: 92.14k
Sinhala (Sinhala) (sin-Sinh)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.04M
Words: 1.03B
Chars: 6.61B
Segments: 54.63M
CLEANED:
Docs: 1.15M
Words: 795.62M
Chars: 4.98B
Segments: 33.71M
7 downloads
Slovak (Latin) (slk-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 68.41M
Words: 20.32B
Chars: 137.70B
Segments: 2.41B
CLEANED:
Docs: 21.83M
Words: 10.63B
Chars: 70.39B
Segments: 494.28M
4 downloads
Slovenian (Latin) (slv-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 30.31M
Words: 9.83B
Chars: 68.36B
Segments: 1.01B
CLEANED:
Docs: 10.28M
Words: 5.43B
Chars: 35.27B
Segments: 238.64M
1 download
Samoan (Latin) (smo-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 295.95k
Words: 85.62M
Chars: 507.10M
Segments: 6.78M
CLEANED:
Docs: 45.86k
Words: 37.09M
Chars: 186.19M
Segments: 1.01M
Shona (Latin) (sna-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.08M
Words: 72.57M
Chars: 631.56M
Segments: 10.82M
CLEANED:
Docs: 61.08k
Words: 23.92M
Chars: 192.68M
Segments: 1.20M
Sindhi (Arabic) (snd-Arab)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 230.24k
Words: 115.44M
Chars: 626.70M
Segments: 5.58M
CLEANED:
Docs: 100.30k
Words: 89.53M
Chars: 428.73M
Segments: 2.83M
1 download
Somali (Latin) (som-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 3.12M
Words: 661.66M
Chars: 5.30B
Segments: 76.87M
CLEANED:
Docs: 966.51k
Words: 388.75M
Chars: 2.57B
Segments: 16.38M
1 download
Southern Sotho (Latin) (sot-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 288.14k
Words: 56.98M
Chars: 332.29M
Segments: 8.03M
CLEANED:
Docs: 43.92k
Words: 31.00M
Chars: 171.54M
Segments: 1.09M
Spanish (Latin) (spa-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 838.93M
Words: 414.23B
Chars: 2.53T
Segments: 22.17B
CLEANED:
Docs: 503.07M
Words: 321.95B
Chars: 1.95T
Segments: 12.12B
7 downloads
Sardinian (Latin) (srd-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 32.04M
Words: 3.71B
Chars: 25.18B
Segments: 675.85M
CLEANED:
Docs: 53.81k
Words: 23.89M
Chars: 148.80M
Segments: 917.09k
Serbian (Cyrillic) (srp-Cyrl)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 9.18M
Words: 3.83B
Chars: 26.69B
Segments: 249.63M
CLEANED:
Docs: 4.12M
Words: 2.52B
Chars: 16.16B
Segments: 93.81M
6 downloads
Swati (Latin) (ssw-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 262.40k
Words: 19.25M
Chars: 219.37M
Segments: 5.11M
CLEANED:
Docs: 2.04k
Words: 994.30k
Chars: 8.82M
Segments: 62.13k
Sundanese (Latin) (sun-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.89M
Words: 505.50M
Chars: 3.45B
Segments: 83.33M
CLEANED:
Docs: 114.75k
Words: 69.63M
Chars: 475.44M
Segments: 3.24M
4 downloads
Swedish (Latin) (swe-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 157.92M
Words: 58.47B
Chars: 374.21B
Segments: 4.81B
CLEANED:
Docs: 66.81M
Words: 40.10B
Chars: 251.18B
Segments: 1.75B
9 downloads
Swahili (Latin) (swh-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 4.42M
Words: 1.15B
Chars: 7.55B
Segments: 95.87M
CLEANED:
Docs: 1.37M
Words: 717.65M
Chars: 4.67B
Segments: 34.31M
2 downloads
Silesian (Latin) (szl-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.25M
Words: 497.28M
Chars: 3.75B
Segments: 73.39M
CLEANED:
Docs: 40.93k
Words: 14.68M
Chars: 103.88M
Segments: 636.57k
Tamil (Tamil) (tam-Taml)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 9.73M
Words: 3.96B
Chars: 34.09B
Segments: 322.60M
CLEANED:
Docs: 6.11M
Words: 2.98B
Chars: 26.24B
Segments: 168.59M
2 downloads
Tamasheq (Latin) (taq-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 253.19k
Words: 62.62M
Chars: 452.93M
Segments: 30.90M
CLEANED:
Docs: 1.75k
Words: 1.54M
Chars: 8.85M
Segments: 13.88k
2 downloads
Tamasheq (Tifinagh) (taq-Tfng)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 101
Words: 21.32k
Chars: 149.82k
Segments: 1.08k
Central Atlas Tamazight (Tifinagh) (tzm-Tfng)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 5.17k
Words: 1.24M
Chars: 15.97M
Segments: 324.49k
Tatar (Cyrillic) (tat-Cyrl)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.90M
Words: 452.35M
Chars: 3.21B
Segments: 36.63M
CLEANED:
Docs: 630.68k
Words: 296.70M
Chars: 2.16B
Segments: 13.45M
1 download
Telugu (Telugu) (tel-Telu)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 3.20M
Words: 1.05B
Chars: 8.06B
Segments: 70.31M
CLEANED:
Docs: 2.06M
Words: 835.42M
Chars: 6.51B
Segments: 39.19M
2 downloads
Tajik (Cyrillic) (tgk-Cyrl)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 3.41M
Words: 878.17M
Chars: 6.17B
Segments: 86.32M
CLEANED:
Docs: 1.26M
Words: 624.76M
Chars: 4.59B
Segments: 24.85M
Filipino (Latin) (tgl-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 8.57M
Words: 3.40B
Chars: 23.06B
Segments: 321.67M
CLEANED:
Docs: 1.87M
Words: 1.35B
Chars: 8.13B
Segments: 52.88M
4 downloads
Thai (Thai) (tha-Thai)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 81.71M
Words: 11.59B
Chars: 155.94B
Segments: 2.18B
CLEANED:
Docs: 17.70M
Words: 3.51B
Chars: 59.99B
Segments: 339.05M
10 downloads
Tigrinya (Ethiopic) (tir-Ethi)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.03M
Words: 194.51M
Chars: 2.31B
Segments: 38.58M
CLEANED:
Docs: 64.69k
Words: 36.72M
Chars: 181.70M
Segments: 1.13M
1 download
Tok Pisin (Latin) (tpi-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.16M
Words: 79.19M
Chars: 453.63M
Segments: 16.46M
CLEANED:
Docs: 13.98k
Words: 12.51M
Chars: 64.54M
Segments: 282.37k
Tswana (Latin) (tsn-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 361.19k
Words: 22.20M
Chars: 168.50M
Segments: 5.43M
CLEANED:
Docs: 6.05k
Words: 5.27M
Chars: 27.68M
Segments: 132.17k
2 downloads
Tsonga (Latin) (tso-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 193.13k
Words: 21.88M
Chars: 151.72M
Segments: 3.37M
CLEANED:
Docs: 11.01k
Words: 8.67M
Chars: 49.30M
Segments: 221.25k
Turkmen (Latin) (tuk-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 806.53k
Words: 230.24M
Chars: 1.65B
Segments: 30.73M
CLEANED:
Docs: 171.04k
Words: 70.68M
Chars: 570.17M
Segments: 3.36M
Tumbuka (Latin) (tum-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 59.60k
Words: 16.12M
Chars: 142.19M
Segments: 2.33M
CLEANED:
Docs: 4.38k
Words: 2.88M
Chars: 21.10M
Segments: 99.01k
Turkish (Latin) (tur-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 236.66M
Words: 73.92B
Chars: 543.97B
Segments: 5.87B
CLEANED:
Docs: 116.57M
Words: 51.67B
Chars: 389.75B
Segments: 2.57B
8 downloads
Akan (Latin) (twi-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 619.40k
Words: 36.55M
Chars: 258.14M
Segments: 6.27M
CLEANED:
Docs: 5.86k
Words: 4.70M
Chars: 24.18M
Segments: 125.61k
Uyghur (Arabic) (uig-Arab)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.09M
Words: 317.32M
Chars: 2.48B
Segments: 21.44M
CLEANED:
Docs: 442.40k
Words: 223.91M
Chars: 1.75B
Segments: 8.98M
5 downloads
Ukrainian (Cyrillic) (ukr-Cyrl)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 81.51M
Words: 31.90B
Chars: 231.82B
Segments: 2.09B
CLEANED:
Docs: 47.40M
Words: 25.23B
Chars: 182.92B
Segments: 1.17B
5 downloads
Umbundu (Latin) (umb-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 62.75k
Words: 6.39M
Chars: 43.81M
Segments: 1.09M
CLEANED:
Docs: 2.47k
Words: 2.43M
Chars: 15.41M
Segments: 59.91k
Urdu (Arabic) (urd-Arab)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 7.16M
Words: 3.28B
Chars: 15.89B
Segments: 229.88M
CLEANED:
Docs: 3.19M
Words: 2.13B
Chars: 10.01B
Segments: 50.63M
6 downloads
Uzbek (Latin) (uzn-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 9.75M
Words: 1.24B
Chars: 10.34B
Segments: 198.89M
CLEANED:
Docs: 706.92k
Words: 351.32M
Chars: 2.85B
Segments: 14.80M
3 downloads
Venetian (Latin) (vec-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 11.52M
Words: 1.61B
Chars: 10.70B
Segments: 425.91M
CLEANED:
Docs: 84.81k
Words: 35.25M
Chars: 218.06M
Segments: 1.58M
1 download
Vietnamese (Latin) (vie-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 174.14M
Words: 105.24B
Chars: 491.19B
Segments: 5.35B
CLEANED:
Docs: 100.75M
Words: 83.20B
Chars: 379.59B
Segments: 3.02B
7 downloads
Waray (Latin) (war-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 286.40k
Words: 23.27M
Chars: 136.08M
Segments: 3.12M
CLEANED:
Docs: 13.87k
Words: 5.89M
Chars: 35.59M
Segments: 200.94k
Wolof (Latin) (wol-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.97M
Words: 38.69M
Chars: 246.24M
Segments: 10.43M
CLEANED:
Docs: 5.68k
Words: 5.46M
Chars: 27.55M
Segments: 161.47k
Xhosa (Latin) (xho-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 3.41M
Words: 494.95M
Chars: 6.62B
Segments: 56.47M
CLEANED:
Docs: 63.09k
Words: 30.34M
Chars: 258.73M
Segments: 1.82M
Yiddish (Hebrew) (ydd-Hebr)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 414.98k
Words: 150.99M
Chars: 911.06M
Segments: 9.48M
CLEANED:
Docs: 128.26k
Words: 77.53M
Chars: 458.62M
Segments: 2.94M
Yoruba (Latin) (yor-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.89M
Words: 246.78M
Chars: 1.50B
Segments: 30.70M
CLEANED:
Docs: 66.13k
Words: 42.81M
Chars: 217.89M
Segments: 1.47M
Cantonese (Traditional) (yue-Hant)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 3.77M
Words: 235.77M
Chars: 2.27B
Segments: 131.06M
CLEANED:
Docs: 61.29k
Words: 3.27M
Chars: 74.36M
Segments: 1.24M
4 downloads
Simplified Chinese (zho-Hans)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 2.71B
Words: 149.31B
Chars: 3.67T
Segments: 76.75B
CLEANED:
Docs: 1.25B
Words: 74.01B
Chars: 2.35T
Segments: 42.45B
23 downloads
Traditional Chinese (zho-Hant)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 321.27M
Words: 17.31B
Chars: 417.51B
Segments: 7.82B
CLEANED:
Docs: 157.11M
Words: 9.51B
Chars: 286.98B
Segments: 4.48B
6 downloads
Malay (Latin) (zsm-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 69.23M
Words: 28.66B
Chars: 193.52B
Segments: 3.11B
CLEANED:
Docs: 18.42M
Words: 11.48B
Chars: 78.45B
Segments: 579.82M
3 downloads
Zulu (Latin) (zul-Latn)
Creative Commons CC0 license
Source: CC/IA
DEDUPLICATED:
Docs: 1.31M
Words: 156.42M
Chars: 1.33B
Segments: 33.14M
CLEANED:
Docs: 113.62k
Words: 44.36M
Chars: 380.92M
Segments: 2.71M
2 downloads
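The counts in this listing are abbreviated with k, M, B and T suffixes. To compare the deduplicated and cleaned variants of a language, the figures first need to be expanded into plain numbers. The following is a minimal sketch of that arithmetic; the helper names `parse_count` and `retained_share` are illustrative and not part of any HPLT tooling, and because the listed figures are rounded, the resulting shares are approximate.

```python
# Multipliers for the suffixes used in the listing (k, M, B, T).
SUFFIXES = {"k": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_count(text: str) -> float:
    """Convert an abbreviated count such as '1.31M' or '272' to a number."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def retained_share(deduplicated: str, cleaned: str) -> float:
    """Fraction of the deduplicated count that remains in the cleaned variant."""
    return parse_count(cleaned) / parse_count(deduplicated)


# Zulu (zul-Latn) document counts, copied from the listing.
zul_docs = retained_share("1.31M", "113.62k")
print(f"zul-Latn docs retained after cleaning: {zul_docs:.1%}")
```

The same two calls work for the Words, Chars and Segments rows of any entry that lists both variants.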
These data are released under this licensing scheme:
Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.
*It is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.