
HPLT Datasets v2


Version 2.0 of the HPLT Monolingual Datasets is now published. These collections are available under the Creative Commons CC0 license and bring significant improvements compared to previous releases (version 1.2). Similarly to 1.2, the release comes in two variants: deduplicated (21 TB in size) and cleaned (15 TB in size). The cleaned variant contains the same documents as deduplicated minus those filtered out by our cleaning heuristics. The cleaned variant is recommended unless you want to try your own cleaning pipelines.

As with the previous releases, the version 2.0 datasets are hosted by the Sigma2 NIRD Data Lake, and the text extraction pipeline was run on the LUMI supercomputer.

HPLT Monolingual Datasets version 2.0 (the deduplicated variant) feature about 7.6 trillion whitespace-separated words and about 52 trillion characters extracted from 21 billion documents, compared to 5.6 trillion words and 42 trillion characters extracted from 5 billion documents in version 1.2. All in all, you can expect less noise and boilerplate, fewer duplicates, more unique documents, and generally better-quality texts to train language models on.

What's new in v2.0

  • The size of the source web collections has increased 2.5x: 4.5 petabytes of compressed web data in total (mostly from Internet Archive, but also from Common Crawl).
  • The text extraction pipeline now uses Trafilatura, which removes boilerplate more effectively and therefore leaves less noise in the data.
  • Language identification now uses a refined version of OpenLID instead of CLD2.

What are the HPLT Analytics reports linked to each language?

These automated reports provide useful statistics about the cleaned variant of the HPLT v2.0 datasets. They are produced by running the HPLT Analytics Tool on the data, and they let you inspect dataset properties before committing to a full download.


Output format

The output format is JSONL, where each line is a valid JSON object, providing a full document with all its metadata and text content. An example is provided here:

{"f":"./CC-MAIN-20170116095124-00401-ip-10-171-10-70.ec2.internal.warc.2.gz","o":37461524,"s":8114,"rs":27035,
    "u":"http://blogtailors.blogspot.com/2010/08/saida-de-emergencia-procura-tradutores.html",
    "c":"text/html","ts":"2017-01-24T09:05:15Z", 
    "collection":"cc17",
    "lang":["por_Latn","slk_Latn","kmb_Latn"],"prob":[1,0,0],
    "text":"A editora Saída de Emergência procura revisores
            e tradutores profissionais em regime freelancer com formação
            superior linguística e experiência na área de revisão e 
            tradução literária.\nPreza-se a capacidade de cumprimento
            de prazos bem definidos. Os interessados poderão enviar 
            candidatura paraa geral@saidademergencia.com.\nquarta-feira, 
            4 de agosto de 2010\nSaída de Emergência procura tradutores 
            e revisores em regime freelancer\nA editora Saída de Emergência
            procura revisores e tradutores profissionais em regime 
            freelancer com formação superior linguística eexperiência na 
            área de revisão e tradução literária.",
    "seg_langs":["por_Latn","por_Latn","por_Latn","por_Latn","por_Latn"],
    "robotstxt":"allowed",
    "id":"92a22c9672a52ae5af0da0457a184151",
    "filter":"keep",
    "pii":[[296,323]],
    "doc_scores":[6.4,10,10,10,10,10,8,0,0]}
{"f":"./path/to/80716-00468.warc.gz","o":579437,"s":1100,"rs":44535,
...
    "text":"More texts\n...",
...

In the text field of each document, the segments are concatenated using newline separators. The first 7 fields are inherited from the warc2text HTML extraction from the WARCs, explained here, with two exceptions: the p field is replaced by text, and the l field is replaced by lang and prob, which give the top three identified languages for the document and their prediction probabilities. The rest of the output is explained here.
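As a minimal sketch of how such a record might be consumed, the Python snippet below parses one JSONL line, reads the top language prediction, and pairs each newline-separated segment with its entry in seg_langs. It assumes that lang and prob are ordered from most to least probable and that seg_langs holds one entry per segment, as the example above suggests. The record itself is hypothetical filler, not real HPLT data.

```python
import json

# A shortened record in the shape of the example above.
# The URL and text are hypothetical placeholders, not real HPLT data.
line = json.dumps({
    "u": "http://example.org/page.html",
    "lang": ["por_Latn", "slk_Latn", "kmb_Latn"],
    "prob": [1, 0, 0],
    "text": "Primeiro segmento.\nSegundo segmento.",
    "seg_langs": ["por_Latn", "por_Latn"],
})


def read_document(jsonl_line):
    """Parse one JSONL line and pair each text segment with its language."""
    doc = json.loads(jsonl_line)
    # Segments are joined with newlines inside the "text" field.
    segments = doc["text"].split("\n")
    # Assumption: the first entry of "lang"/"prob" is the top prediction.
    top_lang, top_prob = doc["lang"][0], doc["prob"][0]
    # Assumption: "seg_langs" has exactly one entry per segment.
    return top_lang, top_prob, list(zip(segments, doc["seg_langs"]))


top_lang, top_prob, pairs = read_document(line)
print(top_lang, top_prob)
for segment, seg_lang in pairs:
    print(seg_lang, segment)
```

Because every line is a complete JSON object, a shard can be processed one line at a time without loading the whole file into memory.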

How to download it

The simplest way to download the data is to use wget -i with the language-specific mapping files containing full URLs for all shards of this particular language, for example:

wget -i https://data.hplt-project.org/two/deduplicated/epo_Latn_map.txt

wget -i https://data.hplt-project.org/two/cleaned/epo_Latn_map.txt
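If you prefer to script the download yourself, the mapping URL can be built from the variant and the language code. The Python sketch below follows the URL pattern of the two commands above; the helper names map_url and parse_map are illustrative, and the assumption that a mapping file lists one shard URL per line comes from how wget -i reads its input, not from a documented format.

```python
BASE_URL = "https://data.hplt-project.org/two"
VARIANTS = ("deduplicated", "cleaned")


def map_url(variant, lang_code):
    """Build the mapping-file URL for one variant and language code."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"{BASE_URL}/{variant}/{lang_code}_map.txt"


def parse_map(map_text):
    """Return the shard URLs from mapping-file text (assumed one URL per line)."""
    return [ln.strip() for ln in map_text.splitlines() if ln.strip()]


print(map_url("cleaned", "epo_Latn"))
```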

Full download

If you want to download all the available files from the deduplicated or cleaned variant in one go, use

wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/two/deduplicated/hplt_monolingual_map_deduplicated_2.0.txt

wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/two/cleaned/hplt_monolingual_map_cleaned_2.0.txt
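The options in these commands control where files land on disk: -nH drops the host name directory, -x forces the remote directory structure to be recreated locally, and --cut-dirs=2 skips the first two remote path components. The Python sketch below mimics that mapping for illustration only. The function name local_path is hypothetical, and the example URL is assembled from a directory and a file name mentioned elsewhere on this page.

```python
from urllib.parse import urlparse


def local_path(url, cut_dirs=2):
    """Approximate where `wget -nH -x --cut-dirs=N` would store a URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    dirs, filename = parts[:-1], parts[-1]
    # -nH: no host directory; --cut-dirs: skip the leading directory components.
    return "/".join(dirs[cut_dirs:] + [filename])


# Illustrative shard URL; the exact file names per language may differ.
print(local_path("https://data.hplt-project.org/two/cleaned/nob_Latn/1.jsonl.zst"))
```

With these options each language ends up in its own directory directly under the current working directory.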

Validating downloads

Every .zst file in the deduplicated and cleaned variants is accompanied by a corresponding .zst.md5 file containing its MD5 checksum. See, for example, this directory with the cleaned Norwegian data: https://data.hplt-project.org/two/cleaned/nob_Latn/

The integrity of a file can be checked with, e.g., md5sum -c 1.jsonl.zst.md5
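Where md5sum is not available, the same check can be approximated in Python. This is a minimal sketch, assuming the .md5 files use the usual md5sum layout (a hex digest followed by the file name), which is what md5sum -c expects. Unlike md5sum, the hypothetical verify helper resolves the data file relative to the directory of the .md5 file rather than the current working directory.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(md5_file):
    """Return True if the file named in an md5sum-style file matches its digest."""
    md5_file = Path(md5_file)
    # Assumed layout: "<hex digest>  <file name>" (optionally "*<file name>").
    expected, name = md5_file.read_text().split(maxsplit=1)
    name = name.strip().lstrip("*")
    return md5_of(md5_file.parent / name) == expected.lower()
```

A call such as verify("1.jsonl.zst.md5") would then return True only when the downloaded shard matches its published checksum.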

Datasets Catalogue

There are 193 languages in the HPLT dataset catalogue in version 2.0. For each language and variant (deduplicated and cleaned), the numbers of documents, words, characters and segments are provided. If you find any problem, please contact us!
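The catalogue shows counts in abbreviated form, such as 11.67k or 1.59B. To compare variants programmatically, these strings can be converted back to integers. The sketch below assumes that the suffixes k, M, B and T stand for thousand, million, billion and trillion. The helper names are illustrative, and since the displayed figures are rounded, the results are approximate.

```python
from decimal import Decimal

# Assumed meaning of the suffixes used in the catalogue cards.
SUFFIXES = {"k": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_count(text):
    """Convert an abbreviated count such as '11.67k' into an integer."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        # Decimal avoids binary floating-point rounding on values like 11.67.
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


def kept_fraction(dedup, cleaned):
    """Approximate share of the deduplicated count that survives cleaning."""
    return parse_count(cleaned) / parse_count(dedup)


# Document counts taken from the Afrikaans card in the catalogue.
print(round(kept_fraction("9.59M", "1.46M"), 3))
```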

NEW: Register Labels as extra metadata.


Register labels for the HPLT v2 data release can now be downloaded and merged with the deduplicated version of 104 out of the 193 available languages. The register classification scheme is further explained in Henriksson et al. (2024).
Please use the following bash script to merge the register label files with the main HPLT files for any given language: https://github.com/hplt-project/HPLT-textpipes/blob/main/tools/merge_labels.sh
Use it like this, for example:
bash merge_labels.sh nob_Latn
It will download the HPLT datasets for the language, verify their MD5 checksums, download the register label files, and then merge them, so that the labels are added to the corresponding JSON lines from the main dataset.

Dataset cards with the tag icon contain register labels.

Samples


You can find stratified samples for this release at the following links:

  • See them in HPLT Analytics
  • Cleaned samples per lang
  • Cleaned samples per group

A description of the samples can be found here.

Language family distribution in HPLT

Find interesting hierarchical treemaps to get a glimpse of the language family distribution across HPLT data!

  • Cleaned chars language family distribution
  • Cleaned docs language family distribution
  • Deduplicated chars language family distribution
  • Deduplicated docs language family distribution

License and takedown

License

These data are released under this licensing scheme:

  • We do not own any of the text from which these text data have been extracted.*
  • We license the actual packaging of these text data under the Creative Commons CC0 license ("no rights reserved").

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • You can reach us at hplt-datasets@ufal.mff.cuni.cz

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.

*It is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.


Achinese (Arabic) (ace-Arab)

Creative Commons CC0 license
hplt analytics logo
dedup 28.54 MB
cleaned 86.48 kB

Source: CC/IA

DEDUPLICATED:

Docs: 11.67k

Words: 3.35M

Chars: 25.69M

Segments: 825.08k

CLEANED:

Docs: 16

Words: 8.36k

Chars: 49.74k

Segments: 117

20 downloads

Achinese (Latin) (ace-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.65 GB
cleaned 51.83 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.26M

Words: 172.75M

Chars: 1.59B

Segments: 44.95M

CLEANED:

Docs: 12.93k

Words: 8.20M

Chars: 50.85M

Segments: 206.19k

13 downloads

Afrikaans (Latin) (afr-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 18.55 GB
cleaned 6.01 GB

Source: CC/IA

DEDUPLICATED:

Docs: 9.59M

Words: 2.43B

Chars: 18.35B

Segments: 251.52M

CLEANED:

Docs: 1.46M

Words: 1.00B

Chars: 5.95B

Segments: 37.74M

11 downloads

Albanian (Latin) (als-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 30.09 GB
cleaned 17.28 GB

Source: CC/IA

DEDUPLICATED:

Docs: 11.29M

Words: 4.37B

Chars: 28.42B

Segments: 304.29M

CLEANED:

Docs: 5.39M

Words: 2.71B

Chars: 16.10B

Segments: 95.10M

6 downloads

Amharic (Ethiopic) (amh-Ethi)

Creative Commons CC0 license
hplt analytics logo
dedup 67.69 GB
cleaned 2.57 GB

Source: CC/IA

DEDUPLICATED:

Docs: 13.02M

Words: 7.21B

Chars: 65.58B

Segments: 725.19M

CLEANED:

Docs: 295.54k

Words: 195.89M

Chars: 1.03B

Segments: 7.01M

9 downloads

Arabic (Arabic) (ara-Arab)

Creative Commons CC0 license
hplt analytics logo
dedup 743.90 GB
cleaned 496.26 GB

Source: CC/IA

DEDUPLICATED:

Docs: 191.75M

Words: 73.48B

Chars: 434.47B

Segments: 5.64B

CLEANED:

Docs: 82.67M

Words: 48.14B

Chars: 279.59B

Segments: 2.20B

35 downloads

Assamese (Bangla) (asm-Beng)

Creative Commons CC0 license
hplt analytics logo
dedup 1.53 GB
cleaned 1.24 GB

Source: CC/IA

DEDUPLICATED:

Docs: 263.14k

Words: 94.13M

Chars: 617.90M

Segments: 4.76M

CLEANED:

Docs: 175.71k

Words: 73.44M

Chars: 475.83M

Segments: 2.68M

4 downloads

Asturian (Latin) (ast-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 13.55 GB
cleaned 1.28 GB

Source: CC/IA

DEDUPLICATED:

Docs: 12.53M

Words: 1.99B

Chars: 12.78B

Segments: 361.09M

CLEANED:

Docs: 273.24k

Words: 194.99M

Chars: 1.24B

Segments: 7.43M

4 downloads

Awadhi (Devanagari) (awa-Deva)

Creative Commons CC0 license
hplt analytics logo
dedup 2.36 GB
cleaned 71.43 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.12M

Words: 319.73M

Chars: 1.91B

Segments: 64.96M

CLEANED:

Docs: 7.28k

Words: 6.05M

Chars: 28.78M

Segments: 131.47k

2 downloads

Aymara (Latin) (ayr-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 2.44 GB
cleaned 26.04 MB

Source: CC/IA

DEDUPLICATED:

Docs: 3.56M

Words: 402.05M

Chars: 2.40B

Segments: 84.87M

CLEANED:

Docs: 9.22k

Words: 3.07M

Chars: 25.09M

Segments: 188.53k

4 downloads

South Azerbaijani (Arabic) (azb-Arab)

Creative Commons CC0 license
hplt analytics logo
dedup 774.63 MB
cleaned 470.19 MB

Source: CC/IA

DEDUPLICATED:

Docs: 253.48k

Words: 69.60M

Chars: 445.30M

Segments: 7.60M

CLEANED:

Docs: 66.11k

Words: 39.58M

Chars: 260.26M

Segments: 2.39M

5 downloads

Azerbaijani (Latin) (azj-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 39.57 GB
cleaned 22.74 GB

Source: CC/IA

DEDUPLICATED:

Docs: 16.09M

Words: 4.42B

Chars: 34.65B

Segments: 409.54M

CLEANED:

Docs: 6.48M

Words: 2.57B

Chars: 19.63B

Segments: 126.61M

3 downloads

Bashkir (Cyrillic) (bak-Cyrl)

Creative Commons CC0 license
hplt analytics logo
dedup 1.76 GB
cleaned 1.01 GB

Source: CC/IA

DEDUPLICATED:

Docs: 778.47k

Words: 144.56M

Chars: 1.02B

Segments: 14.89M

CLEANED:

Docs: 170.82k

Words: 75.33M

Chars: 558.67M

Segments: 3.14M

1 download

Bambara (Latin) (bam-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 241.51 MB
cleaned 22.38 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.45M

Words: 32.23M

Chars: 230.47M

Segments: 8.57M

CLEANED:

Docs: 5.72k

Words: 3.98M

Chars: 20.74M

Segments: 91.72k

4 downloads

Balinese (Latin) (ban-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 641.62 MB
cleaned 77.94 MB

Source: CC/IA

DEDUPLICATED:

Docs: 336.98k

Words: 74.92M

Chars: 628.80M

Segments: 15.37M

CLEANED:

Docs: 10.70k

Words: 11.34M

Chars: 77.26M

Segments: 601.14k

2 downloads

Belarusian (Cyrillic) (bel-Cyrl)

Creative Commons CC0 license
hplt analytics logo
dedup 33.24 GB
cleaned 15.33 GB

Source: CC/IA

DEDUPLICATED:

Docs: 5.75M

Words: 2.27B

Chars: 18.85B

Segments: 167.78M

CLEANED:

Docs: 2.32M

Words: 1.21B

Chars: 8.54B

Segments: 48.84M

13 downloads

Bemba (Latin) (bem-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 627.54 MB
cleaned 32.67 MB

Source: CC/IA

DEDUPLICATED:

Docs: 759.41k

Words: 116.22M

Chars: 612.21M

Segments: 34.05M

CLEANED:

Docs: 6.14k

Words: 4.52M

Chars: 32.33M

Segments: 133.54k

1 download

Bangla (Bangla) (ben-Beng)

Creative Commons CC0 license
hplt analytics logo
dedup 103.48 GB
cleaned 79.30 GB

Source: CC/IA

DEDUPLICATED:

Docs: 17.34M

Words: 6.41B

Chars: 41.25B

Segments: 493.78M

CLEANED:

Docs: 11.04M

Words: 4.64B

Chars: 30.17B

Segments: 176.01M

4 downloads

Bhojpuri (Devanagari) (bho-Deva)

Creative Commons CC0 license
hplt analytics logo
dedup 493.56 MB
cleaned 171.46 MB

Source: CC/IA

DEDUPLICATED:

Docs: 226.96k

Words: 39.34M

Chars: 213.88M

Segments: 4.15M

CLEANED:

Docs: 28.64k

Words: 13.47M

Chars: 68.68M

Segments: 458.26k

4 downloads

Banjar (Arabic) (bjn-Arab)

Creative Commons CC0 license
hplt analytics logo
dedup 30.84 MB
cleaned 5.92 MB

Source: CC/IA

DEDUPLICATED:

Docs: 14.07k

Words: 2.16M

Chars: 18.62M

Segments: 591.75k

CLEANED:

Docs: 1.11k

Words: 548.24k

Chars: 3.32M

Segments: 19.53k

2 downloads

Banjar (Latin) (bjn-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.76 GB
cleaned 56.29 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.31M

Words: 209.29M

Chars: 1.67B

Segments: 49.15M

CLEANED:

Docs: 18.76k

Words: 8.05M

Chars: 55.99M

Segments: 366.34k

Tibetan (Tibetan) (bod-Tibt)

Creative Commons CC0 license
hplt analytics logo
dedup 2.00 GB
cleaned 789.19 MB

Source: CC/IA

DEDUPLICATED:

Docs: 171.77k

Words: 15.53M

Chars: 680.83M

Segments: 2.59M

CLEANED:

Docs: 27.44k

Words: 5.78M

Chars: 268.56M

Segments: 464.99k

9 downloads

Bosnian (Latin) (bos-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 91.82 GB
cleaned 47.52 GB

Source: CC/IA

DEDUPLICATED:

Docs: 35.57M

Words: 13.58B

Chars: 89.09B

Segments: 860.56M

CLEANED:

Docs: 14.61M

Words: 7.26B

Chars: 46.09B

Segments: 268.16M

3 downloads

Buginese (Latin) (bug-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 638.32 MB
cleaned 19.65 MB

Source: CC/IA

DEDUPLICATED:

Docs: 572.44k

Words: 79.03M

Chars: 620.28M

Segments: 17.32M

CLEANED:

Docs: 2.02k

Words: 2.70M

Chars: 19.31M

Segments: 38.55k

Bulgarian (Cyrillic) (bul-Cyrl)

Creative Commons CC0 license
hplt analytics logo
dedup 234.23 GB
cleaned 172.37 GB

Source: CC/IA

DEDUPLICATED:

Docs: 56.66M

Words: 20.84B

Chars: 134.99B

Segments: 1.48B

CLEANED:

Docs: 28.09M

Words: 15.30B

Chars: 96.96B

Segments: 681.41M

8 downloads

Catalan (Latin) (cat-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 80.28 GB
cleaned 61.99 GB

Source: CC/IA

DEDUPLICATED:

Docs: 34.44M

Words: 12.69B

Chars: 78.07B

Segments: 724.77M

CLEANED:

Docs: 18.55M

Words: 10.02B

Chars: 60.21B

Segments: 383.34M

9 downloads

Cebuano (Latin) (ceb-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.21 GB
cleaned 519.98 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.24M

Words: 188.62M

Chars: 1.20B

Segments: 21.13M

CLEANED:

Docs: 138.84k

Words: 85.89M

Chars: 515.83M

Segments: 2.86M

Czech (Latin) (ces-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 453.28 GB
cleaned 303.59 GB

Source: CC/IA

DEDUPLICATED:

Docs: 168.65M

Words: 62.92B

Chars: 412.04B

Segments: 5.40B

CLEANED:

Docs: 75.29M

Words: 42.08B

Chars: 274.01B

Segments: 1.93B

10 downloads

Chokwe (Latin) (cjk-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 28.27 MB
cleaned 7.53 MB

Source: CC/IA

DEDUPLICATED:

Docs: 27.76k

Words: 3.79M

Chars: 27.55M

Segments: 820.23k

CLEANED:

Docs: 1.20k

Words: 964.70k

Chars: 7.43M

Segments: 36.70k

8 downloads

Central Kurdish (Arabic) (ckb-Arab)

Creative Commons CC0 license
hplt analytics logo
dedup 5.13 GB
cleaned 1.66 GB

Source: CC/IA

DEDUPLICATED:

Docs: 1.26M

Words: 442.62M

Chars: 2.86B

Segments: 22.55M

CLEANED:

Docs: 273.75k

Words: 142.65M

Chars: 913.08M

Segments: 5.23M

1 download

Crimean Tatar (Latin) (crh-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 2.17 GB
cleaned 317.86 MB

Source: CC/IA

DEDUPLICATED:

Docs: 2.94M

Words: 303.68M

Chars: 2.01B

Segments: 59.37M

CLEANED:

Docs: 122.74k

Words: 36.76M

Chars: 281.20M

Segments: 1.38M

Welsh (Latin) (cym-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 5.43 GB
cleaned 2.43 GB

Source: CC/IA

DEDUPLICATED:

Docs: 2.58M

Words: 793.12M

Chars: 5.35B

Segments: 79.45M

CLEANED:

Docs: 758.13k

Words: 409.04M

Chars: 2.40B

Segments: 15.57M

3 downloads

Danish (Latin) (dan-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 210.77 GB
cleaned 136.69 GB

Source: CC/IA

DEDUPLICATED:

Docs: 101.01M

Words: 31.83B

Chars: 205.93B

Segments: 2.80B

CLEANED:

Docs: 33.84M

Words: 21.20B

Chars: 133.41B

Segments: 873.02M

14 downloads

German (Latin) (deu-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 2.52 TB
cleaned 1.81 TB

Source: CC/IA

DEDUPLICATED:

Docs: 897.17M

Words: 344.15B

Chars: 2.48T

Segments: 28.01B

CLEANED:

Docs: 482.05M

Words: 251.48B

Chars: 1.78T

Segments: 11.13B

12 downloads

Dinka (Latin) (dik-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 93.55 MB
cleaned 13.16 MB

Source: CC/IA

DEDUPLICATED:

Docs: 72.72k

Words: 14.17M

Chars: 72.43M

Segments: 1.48M

CLEANED:

Docs: 2.33k

Words: 2.29M

Chars: 11.54M

Segments: 34.65k

3 downloads

Dyula (Latin) (dyu-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 12.54 MB
cleaned 6.00 MB

Source: CC/IA

DEDUPLICATED:

Docs: 22.24k

Words: 2.07M

Chars: 11.85M

Segments: 198.56k

CLEANED:

Docs: 1.39k

Words: 1.19M

Chars: 5.55M

Segments: 24.56k

3 downloads

Dzongkha (Tibetan) (dzo-Tibt)

Creative Commons CC0 license
dedup 169.85 MB
cleaned 20.81 MB

Source: CC/IA

DEDUPLICATED:

Docs: 60.26k

Words: 8.12M

Chars: 96.26M

Segments: 1.85M

CLEANED:

Docs: 1.63k

Words: 422.24k

Chars: 7.38M

Segments: 39.97k

1 download

Greek (Greek) (ell-Grek)

Creative Commons CC0 license
hplt analytics logo
dedup 648.09 GB
cleaned 504.80 GB

Source: CC/IA

DEDUPLICATED:

Docs: 126.93M

Words: 55.53B

Chars: 373.82B

Segments: 3.68B

CLEANED:

Docs: 70.33M

Words: 42.70B

Chars: 283.60B

Segments: 1.85B

11 downloads

English (Latin) (eng-Latn)

Creative Commons CC0 license
dedup 22.94 TB
cleaned 17.19 TB

Source: CC/IA

DEDUPLICATED:

Docs: 7.72B

Words: 3.75T

Chars: 22.79T

Segments: 220.10B

CLEANED:

Docs: 4.39B

Words: 2.86T

Chars: 17.09T

Segments: 116.52B

77 downloads

Esperanto (Latin) (epo-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 4.43 GB
cleaned 3.04 GB

Source: CC/IA

DEDUPLICATED:

Docs: 2.15M

Words: 681.98M

Chars: 4.33B

Segments: 51.27M

CLEANED:

Docs: 818.88k

Words: 471.60M

Chars: 2.98B

Segments: 20.35M

2 downloads

Estonian (Latin) (est-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 64.58 GB
cleaned 37.15 GB

Source: CC/IA

DEDUPLICATED:

Docs: 27.01M

Words: 8.05B

Chars: 62.78B

Segments: 822.02M

CLEANED:

Docs: 8.45M

Words: 4.74B

Chars: 36.03B

Segments: 264.42M

14 downloads

Basque (Latin) (eus-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 16.88 GB
cleaned 6.08 GB

Source: CC/IA

DEDUPLICATED:

Docs: 8.36M

Words: 2.30B

Chars: 16.77B

Segments: 156.66M

CLEANED:

Docs: 1.97M

Words: 776.64M

Chars: 6.05B

Segments: 37.62M

3 downloads

Ewe (Latin) (ewe-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 324.38 MB
cleaned 23.36 MB

Source: CC/IA

DEDUPLICATED:

Docs: 422.52k

Words: 27.85M

Chars: 291.48M

Segments: 5.78M

CLEANED:

Docs: 3.77k

Words: 4.31M

Chars: 21.32M

Segments: 143.40k

3 downloads

Faroese (Latin) (fao-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.25 GB
cleaned 625.63 MB

Source: CC/IA

DEDUPLICATED:

Docs: 772.73k

Words: 166.00M

Chars: 1.18B

Segments: 19.69M

CLEANED:

Docs: 239.92k

Words: 93.45M

Chars: 582.04M

Segments: 4.53M

8 downloads

Fijian (Latin) (fij-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 214.56 MB
cleaned 37.98 MB

Source: CC/IA

DEDUPLICATED:

Docs: 246.03k

Words: 24.93M

Chars: 212.08M

Segments: 3.05M

CLEANED:

Docs: 8.91k

Words: 7.26M

Chars: 37.70M

Segments: 178.92k

2 downloads

Finnish (Latin) (fin-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 243.97 GB
cleaned 161.93 GB

Source: CC/IA

DEDUPLICATED:

Docs: 79.98M

Words: 28.26B

Chars: 235.64B

Segments: 3.12B

CLEANED:

Docs: 34.82M

Words: 18.45B

Chars: 155.71B

Segments: 976.62M

21 downloads

Fon (Latin) (fon-Latn)

Creative Commons CC0 license
dedup 73.19 MB
cleaned 6.52 MB

Source: CC/IA

DEDUPLICATED:

Docs: 41.82k

Words: 9.08M

Chars: 57.06M

Segments: 522.76k

CLEANED:

Docs: 1.23k

Words: 1.23M

Chars: 5.34M

Segments: 14.76k

2 downloads

French (Latin) (fra-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.93 TB
cleaned 1.51 TB

Source: CC/IA

DEDUPLICATED:

Docs: 685.35M

Words: 303.58B

Chars: 1.87T

Segments: 18.88B

CLEANED:

Docs: 401.83M

Words: 237.04B

Chars: 1.46T

Segments: 10.56B

24 downloads

Friulian (Latin) (fur-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 3.00 GB
cleaned 118.58 MB

Source: CC/IA

DEDUPLICATED:

Docs: 7.78M

Words: 475.04M

Chars: 2.85B

Segments: 133.33M

CLEANED:

Docs: 36.67k

Words: 20.82M

Chars: 114.77M

Segments: 730.04k

Nigerian Fulfulde (Latin) (fuv-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 512.71 MB
cleaned 31.52 MB

Source: CC/IA

DEDUPLICATED:

Docs: 171.74k

Words: 56.25M

Chars: 498.80M

Segments: 8.17M

CLEANED:

Docs: 7.76k

Words: 5.14M

Chars: 29.91M

Segments: 133.98k

2 downloads

Oromo (Latin) (gaz-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.22 GB
cleaned 223.54 MB

Source: CC/IA

DEDUPLICATED:

Docs: 563.06k

Words: 120.65M

Chars: 1.21B

Segments: 14.06M

CLEANED:

Docs: 49.14k

Words: 28.88M

Chars: 219.26M

Segments: 973.63k

4 downloads

Scottish Gaelic (Latin) (gla-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 2.17 GB
cleaned 497.77 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.42M

Words: 302.57M

Chars: 2.12B

Segments: 43.72M

CLEANED:

Docs: 137.41k

Words: 80.66M

Chars: 483.76M

Segments: 3.31M

3 downloads

Irish (Latin) (gle-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 4.06 GB
cleaned 1.86 GB

Source: CC/IA

DEDUPLICATED:

Docs: 2.76M

Words: 573.32M

Chars: 3.88B

Segments: 57.52M

CLEANED:

Docs: 490.79k

Words: 295.71M

Chars: 1.75B

Segments: 10.99M

5 downloads

Galician (Latin) (glg-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 25.05 GB
cleaned 10.37 GB

Source: CC/IA

DEDUPLICATED:

Docs: 17.32M

Words: 3.52B

Chars: 24.37B

Segments: 635.06M

CLEANED:

Docs: 3.02M

Words: 1.64B

Chars: 10.11B

Segments: 61.18M

2 downloads

Guarani (Latin) (grn-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 7.21 GB
cleaned 227.36 MB

Source: CC/IA

DEDUPLICATED:

Docs: 7.38M

Words: 1.02B

Chars: 6.95B

Segments: 169.01M

CLEANED:

Docs: 73.42k

Words: 30.72M

Chars: 218.70M

Segments: 1.71M

1 download

Gujarati (Gujarati) (guj-Gujr)

Creative Commons CC0 license
hplt analytics logo
dedup 10.77 GB
cleaned 8.60 GB

Source: CC/IA

DEDUPLICATED:

Docs: 2.52M

Words: 738.53M

Chars: 4.58B

Segments: 51.48M

CLEANED:

Docs: 1.13M

Words: 576.82M

Chars: 3.39B

Segments: 20.64M

3 downloads

Haitian Creole (Latin) (hat-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 5.93 GB
cleaned 657.92 MB

Source: CC/IA

DEDUPLICATED:

Docs: 6.27M

Words: 944.63M

Chars: 5.69B

Segments: 162.56M

CLEANED:

Docs: 212.69k

Words: 122.29M

Chars: 639.12M

Segments: 4.64M

1 download

Hausa (Latin) (hau-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.72 GB
cleaned 865.89 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.63M

Words: 300.87M

Chars: 1.69B

Segments: 31.88M

CLEANED:

Docs: 315.87k

Words: 152.62M

Chars: 853.83M

Segments: 5.69M

1 download

Hebrew (Hebrew) (heb-Hebr)

Creative Commons CC0 license
hplt analytics logo
dedup 158.15 GB
cleaned 99.98 GB

Source: CC/IA

DEDUPLICATED:

Docs: 40.69M

Words: 16.09B

Chars: 93.70B

Segments: 1.43B

CLEANED:

Docs: 17.12M

Words: 9.97B

Chars: 56.84B

Segments: 466.63M

3 downloads

Hindi (Devanagari) (hin-Deva)

Creative Commons CC0 license
hplt analytics logo
dedup 176.45 GB
cleaned 109.37 GB

Source: CC/IA

DEDUPLICATED:

Docs: 26.80M

Words: 13.76B

Chars: 74.08B

Segments: 751.52M

CLEANED:

Docs: 13.65M

Words: 8.64B

Chars: 43.97B

Segments: 267.41M

12 downloads

Chhattisgarhi (Devanagari) (hne-Deva)

Creative Commons CC0 license
hplt analytics logo
dedup 445.18 MB
cleaned 25.85 MB

Source: CC/IA

DEDUPLICATED:

Docs: 914.27k

Words: 59.78M

Chars: 324.99M

Segments: 22.51M

CLEANED:

Docs: 2.81k

Words: 2.20M

Chars: 10.60M

Segments: 55.00k

1 download

Croatian (Latin) (hrv-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 95.38 GB
cleaned 49.37 GB

Source: CC/IA

DEDUPLICATED:

Docs: 41.23M

Words: 14.20B

Chars: 92.86B

Segments: 1.13B

CLEANED:

Docs: 12.30M

Words: 7.31B

Chars: 48.01B

Segments: 297.13M

7 downloads

Hungarian (Latin) (hun-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 353.43 GB
cleaned 247.09 GB

Source: CC/IA

DEDUPLICATED:

Docs: 116.86M

Words: 44.00B

Chars: 324.50B

Segments: 4.16B

CLEANED:

Docs: 51.87M

Words: 30.52B

Chars: 225.25B

Segments: 1.42B

1 download

Armenian (Armenian) (hye-Armn)

Creative Commons CC0 license
hplt analytics logo
dedup 23.03 GB
cleaned 19.38 GB

Source: CC/IA

DEDUPLICATED:

Docs: 6.44M

Words: 1.72B

Chars: 12.97B

Segments: 123.72M

CLEANED:

Docs: 3.60M

Words: 1.40B

Chars: 10.72B

Segments: 65.24M

1 download

Igbo (Latin) (ibo-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 884.15 MB
cleaned 241.99 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.41M

Words: 121.57M

Chars: 823.31M

Segments: 18.86M

CLEANED:

Docs: 56.29k

Words: 38.29M

Chars: 205.21M

Segments: 1.41M

4 downloads

Iloko (Latin) (ilo-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.12 GB
cleaned 157.79 MB

Source: CC/IA

DEDUPLICATED:

Docs: 2.55M

Words: 164.05M

Chars: 1.10B

Segments: 40.06M

CLEANED:

Docs: 48.75k

Words: 24.78M

Chars: 156.84M

Segments: 1.12M

Indonesian (Latin) (ind-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 554.43 GB
cleaned 386.04 GB

Source: CC/IA

DEDUPLICATED:

Docs: 169.44M

Words: 78.71B

Chars: 551.63B

Segments: 4.74B

CLEANED:

Docs: 98.14M

Words: 54.62B

Chars: 384.32B

Segments: 2.39B

6 downloads

Icelandic (Latin) (isl-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 14.70 GB
cleaned 10.59 GB

Source: CC/IA

DEDUPLICATED:

Docs: 6.02M

Words: 2.13B

Chars: 13.37B

Segments: 153.03M

CLEANED:

Docs: 2.84M

Words: 1.54B

Chars: 9.60B

Segments: 69.64M

3 downloads

Italian (Latin) (ita-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.14 TB
cleaned 831.84 GB

Source: CC/IA

DEDUPLICATED:

Docs: 381.65M

Words: 170.20B

Chars: 1.13T

Segments: 10.21B

CLEANED:

Docs: 221.75M

Words: 127.41B

Chars: 820.82B

Segments: 5.13B

13 downloads

Javanese (Latin) (jav-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 2.47 GB
cleaned 953.58 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.28M

Words: 311.46M

Chars: 2.44B

Segments: 31.44M

CLEANED:

Docs: 195.97k

Words: 137.82M

Chars: 937.71M

Segments: 6.43M

4 downloads

Japanese (Japanese) (jpn-Jpan)

Creative Commons CC0 license
hplt analytics logo
dedup 3.95 TB
cleaned 2.40 TB

Source: CC/IA

DEDUPLICATED:

Docs: 1.16B

Words: 106.81B

Chars: 1.63T

Segments: 51.70B

CLEANED:

Docs: 417.71M

Words: 42.36B

Chars: 901.53B

Segments: 23.27B

23 downloads

Kabyle (Latin) (kab-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 3.42 GB
cleaned 57.54 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.35M

Words: 257.65M

Chars: 3.26B

Segments: 61.52M

CLEANED:

Docs: 15.10k

Words: 9.22M

Chars: 54.21M

Segments: 345.22k

4 downloads

Kachin (Latin) (kac-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 409.83 MB
cleaned 28.79 MB

Source: CC/IA

DEDUPLICATED:

Docs: 101.29k

Words: 39.46M

Chars: 375.99M

Segments: 9.26M

CLEANED:

Docs: 7.59k

Words: 5.96M

Chars: 28.41M

Segments: 159.42k

Kamba (Latin) (kam-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 34.94 MB
cleaned 5.15 MB

Source: CC/IA

DEDUPLICATED:

Docs: 40.20k

Words: 5.46M

Chars: 32.84M

Segments: 842.30k

CLEANED:

Docs: 1.18k

Words: 674.04k

Chars: 4.65M

Segments: 14.26k

4 downloads

Kannada (Kannada) (kan-Knda)

Creative Commons CC0 license
hplt analytics logo
dedup 14.22 GB
cleaned 11.26 GB

Source: CC/IA

DEDUPLICATED:

Docs: 2.51M

Words: 739.80M

Chars: 5.73B

Segments: 71.25M

CLEANED:

Docs: 1.34M

Words: 532.86M

Chars: 4.30B

Segments: 24.93M

3 downloads

Kashmiri (Arabic) (kas-Arab)

Creative Commons CC0 license
hplt analytics logo
dedup 28.25 MB
cleaned 6.19 MB

Source: CC/IA

DEDUPLICATED:

Docs: 15.18k

Words: 2.72M

Chars: 21.60M

Segments: 545.00k

CLEANED:

Docs: 949

Words: 678.02k

Chars: 3.47M

Segments: 27.11k

1 download

Kashmiri (Devanagari) (kas-Deva)

Creative Commons CC0 license
hplt analytics logo
dedup 53.69 MB
cleaned 441.77 kB

Source: CC/IA

DEDUPLICATED:

Docs: 23.44k

Words: 5.18M

Chars: 37.19M

Segments: 938.52k

CLEANED:

Docs: 106

Words: 31.94k

Chars: 185.55k

Segments: 1.36k

1 download

Georgian (Georgian) (kat-Geor)

Creative Commons CC0 license
hplt analytics logo
dedup 37.20 GB
cleaned 26.59 GB

Source: CC/IA

DEDUPLICATED:

Docs: 7.57M

Words: 1.93B

Chars: 15.26B

Segments: 195.10M

CLEANED:

Docs: 3.34M

Words: 1.24B

Chars: 10.16B

Segments: 63.72M

4 downloads

Kazakh (Cyrillic) (kaz-Cyrl)

Creative Commons CC0 license
hplt analytics logo
dedup 27.43 GB
cleaned 20.24 GB

Source: CC/IA

DEDUPLICATED:

Docs: 5.16M

Words: 2.00B

Chars: 15.35B

Segments: 151.47M

CLEANED:

Docs: 2.64M

Words: 1.41B

Chars: 11.13B

Segments: 81.01M

3 downloads

Kabiyè (Latin) (kbp-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 147.71 MB
cleaned 25.71 MB

Source: CC/IA

DEDUPLICATED:

Docs: 228.24k

Words: 25.61M

Chars: 132.74M

Segments: 2.95M

CLEANED:

Docs: 7.08k

Words: 4.26M

Chars: 20.91M

Segments: 46.79k

4 downloads

Kabuverdianu (Latin) (kea-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 18.51 MB
cleaned 6.29 MB

Source: CC/IA

DEDUPLICATED:

Docs: 18.25k

Words: 3.31M

Chars: 17.88M

Segments: 422.15k

CLEANED:

Docs: 1.96k

Words: 1.14M

Chars: 6.15M

Segments: 43.91k

Mongolian (Cyrillic) (khk-Cyrl)

Creative Commons CC0 license
hplt analytics logo
dedup 15.56 GB
cleaned 12.35 GB

Source: CC/IA

DEDUPLICATED:

Docs: 3.63M

Words: 1.72B

Chars: 11.89B

Segments: 88.34M

CLEANED:

Docs: 2.12M

Words: 1.34B

Chars: 9.33B

Segments: 53.47M

5 downloads

Khmer (Khmer) (khm-Khmr)

Creative Commons CC0 license
hplt analytics logo
dedup 9.71 GB
cleaned 5.89 GB

Source: CC/IA

DEDUPLICATED:

Docs: 2.30M

Words: 210.16M

Chars: 3.59B

Segments: 38.34M

CLEANED:

Docs: 700.99k

Words: 113.80M

Chars: 2.12B

Segments: 9.86M

2 downloads

Kikuyu (Latin) (kik-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 63.67 MB
cleaned 10.46 MB

Source: CC/IA

DEDUPLICATED:

Docs: 82.34k

Words: 7.70M

Chars: 60.74M

Segments: 1.54M

CLEANED:

Docs: 4.00k

Words: 1.43M

Chars: 9.30M

Segments: 51.93k

1 download

Kinyarwanda (Latin) (kin-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.40 GB
cleaned 374.02 MB

Source: CC/IA

DEDUPLICATED:

Docs: 2.59M

Words: 147.93M

Chars: 1.35B

Segments: 26.79M

CLEANED:

Docs: 92.70k

Words: 50.74M

Chars: 367.20M

Segments: 1.92M

1 download

Kyrgyz (Cyrillic) (kir-Cyrl)

Creative Commons CC0 license
hplt analytics logo
dedup 4.63 GB
cleaned 3.52 GB

Source: CC/IA

DEDUPLICATED:

Docs: 1.59M

Words: 338.52M

Chars: 2.60B

Segments: 25.49M

CLEANED:

Docs: 676.11k

Words: 246.66M

Chars: 1.93B

Segments: 10.04M

5 downloads

Kimbundu (Latin) (kmb-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 111.08 MB
cleaned 2.11 MB

Source: CC/IA

DEDUPLICATED:

Docs: 70.13k

Words: 15.85M

Chars: 109.65M

Segments: 4.92M

CLEANED:

Docs: 531

Words: 383.09k

Chars: 2.07M

Segments: 11.80k

3 downloads

Kurdish (Latin) (kmr-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.55 GB
cleaned 1.24 GB

Source: CC/IA

DEDUPLICATED:

Docs: 714.99k

Words: 242.64M

Chars: 1.40B

Segments: 12.67M

CLEANED:

Docs: 364.35k

Words: 195.87M

Chars: 1.12B

Segments: 7.15M

Kanuri (Arabic) (knc-Arab)

Creative Commons CC0 license
hplt analytics logo
dedup 7.80 MB
cleaned 2.34 MB

Source: CC/IA

DEDUPLICATED:

Docs: 2.57k

Words: 1.01M

Chars: 4.58M

Segments: 171.98k

CLEANED:

Docs: 245

Words: 262.00k

Chars: 1.30M

Segments: 10.83k

2 downloads

Kanuri (Latin) (knc-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 55.45 MB
cleaned 13.66 MB

Source: CC/IA

DEDUPLICATED:

Docs: 41.67k

Words: 8.94M

Chars: 51.16M

Segments: 1.23M

CLEANED:

Docs: 2.47k

Words: 2.41M

Chars: 11.95M

Segments: 10.52k

1 download

Kongo (Latin) (kon-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 26.74 MB
cleaned 11.39 MB

Source: CC/IA

DEDUPLICATED:

Docs: 53.30k

Words: 4.11M

Chars: 25.66M

Segments: 626.53k

CLEANED:

Docs: 2.54k

Words: 1.94M

Chars: 11.28M

Segments: 47.48k

3 downloads

Korean (Hangul) (kor-Hang)

Creative Commons CC0 license
hplt analytics logo
dedup 301.47 GB
cleaned 201.89 GB

Source: CC/IA

DEDUPLICATED:

Docs: 98.75M

Words: 30.97B

Chars: 144.91B

Segments: 3.48B

CLEANED:

Docs: 38.87M

Words: 19.69B

Chars: 89.27B

Segments: 1.36B

17 downloads

Lao (Lao) (lao-Laoo)

Creative Commons CC0 license
hplt analytics logo
dedup 2.40 GB
cleaned 232.58 MB

Source: CC/IA

DEDUPLICATED:

Docs: 624.70k

Words: 66.19M

Chars: 931.66M

Segments: 9.36M

CLEANED:

Docs: 29.50k

Words: 5.18M

Chars: 84.71M

Segments: 319.95k

8 downloads

Ligurian (Latin) (lij-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.06 GB
cleaned 35.47 MB

Source: CC/IA

DEDUPLICATED:

Docs: 659.62k

Words: 157.11M

Chars: 908.36M

Segments: 29.25M

CLEANED:

Docs: 8.37k

Words: 5.59M

Chars: 31.47M

Segments: 157.72k

Limburgish (Latin) (lim-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 11.35 GB
cleaned 1.14 GB

Source: CC/IA

DEDUPLICATED:

Docs: 11.11M

Words: 1.64B

Chars: 11.19B

Segments: 347.02M

CLEANED:

Docs: 367.93k

Words: 180.62M

Chars: 1.13B

Segments: 7.14M

1 download

Lingala (Latin) (lin-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 446.06 MB
cleaned 33.61 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.11M

Words: 68.92M

Chars: 434.78M

Segments: 17.96M

CLEANED:

Docs: 7.59k

Words: 5.55M

Chars: 32.93M

Segments: 200.34k

1 download

Lithuanian (Latin) (lit-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 80.67 GB
cleaned 53.57 GB

Source: CC/IA

DEDUPLICATED:

Docs: 35.68M

Words: 10.03B

Chars: 76.42B

Segments: 888.10M

CLEANED:

Docs: 13.34M

Words: 6.68B

Chars: 50.41B

Segments: 322.16M

3 downloads

Lombard (Latin) (lmo-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 5.35 GB
cleaned 359.16 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.98M

Words: 626.58M

Chars: 5.19B

Segments: 108.78M

CLEANED:

Docs: 146.16k

Words: 59.64M

Chars: 345.51M

Segments: 2.12M

Latgalian (Latin) (ltg-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 606.24 MB
cleaned 28.90 MB

Source: CC/IA

DEDUPLICATED:

Docs: 484.07k

Words: 69.01M

Chars: 569.86M

Segments: 21.60M

CLEANED:

Docs: 9.21k

Words: 3.79M

Chars: 26.89M

Segments: 151.38k

2 downloads

Luxembourgish (Latin) (ltz-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 2.68 GB
cleaned 731.01 MB

Source: CC/IA

DEDUPLICATED:

Docs: 2.39M

Words: 419.25M

Chars: 2.61B

Segments: 83.31M

CLEANED:

Docs: 246.93k

Words: 107.22M

Chars: 710.65M

Segments: 5.06M

5 downloads

Luba-Lulua (Latin) (lua-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 47.93 MB
cleaned 9.09 MB

Source: CC/IA

DEDUPLICATED:

Docs: 104.57k

Words: 7.24M

Chars: 47.21M

Segments: 1.69M

CLEANED:

Docs: 1.08k

Words: 1.37M

Chars: 9.01M

Segments: 38.69k

1 download

Ganda (Latin) (lug-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 272.09 MB
cleaned 69.52 MB

Source: CC/IA

DEDUPLICATED:

Docs: 321.60k

Words: 28.42M

Chars: 267.44M

Segments: 3.56M

CLEANED:

Docs: 21.28k

Words: 9.18M

Chars: 67.99M

Segments: 407.54k

2 downloads

Luo (Latin) (luo-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 151.28 MB
cleaned 20.71 MB

Source: CC/IA

DEDUPLICATED:

Docs: 190.37k

Words: 19.93M

Chars: 149.68M

Segments: 3.19M

CLEANED:

Docs: 4.15k

Words: 3.73M

Chars: 20.33M

Segments: 84.12k

1 download

Mizo (Latin) (lus-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.33 GB
cleaned 657.84 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.22M

Words: 221.83M

Chars: 1.31B

Segments: 25.88M

CLEANED:

Docs: 160.38k

Words: 125.20M

Chars: 652.17M

Segments: 3.43M

Latvian (Latin) (lvs-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 51.34 GB
cleaned 27.43 GB

Source: CC/IA

DEDUPLICATED:

Docs: 23.04M

Words: 6.26B

Chars: 47.62B

Segments: 656.95M

CLEANED:

Docs: 6.77M

Words: 3.46B

Chars: 25.19B

Segments: 173.81M

4 downloads

Magahi (Devanagari) (mag-Deva)

Creative Commons CC0 license
hplt analytics logo
dedup 50.14 GB
cleaned 10.58 MB

Source: CC/IA

DEDUPLICATED:

Docs: 8.95M

Words: 4.43B

Chars: 50.01B

Segments: 35.61M

CLEANED:

Docs: 328

Words: 890.63k

Chars: 4.28M

Segments: 19.29k

Maithili (Devanagari) (mai-Deva)

Creative Commons CC0 license
hplt analytics logo
dedup 400.06 MB
cleaned 245.24 MB

Source: CC/IA

DEDUPLICATED:

Docs: 116.18k

Words: 33.60M

Chars: 170.03M

Segments: 2.12M

CLEANED:

Docs: 24.98k

Words: 17.79M

Chars: 96.77M

Segments: 645.53k

2 downloads

Malayalam (Malayalam) (mal-Mlym)

Creative Commons CC0 license
hplt analytics logo
dedup 30.02 GB
cleaned 25.49 GB

Source: CC/IA

DEDUPLICATED:

Docs: 4.59M

Words: 1.20B

Chars: 11.46B

Segments: 76.76M

CLEANED:

Docs: 3.10M

Words: 973.66M

Chars: 9.49B

Segments: 48.00M

4 downloads

Marathi (Devanagari) (mar-Deva)

Creative Commons CC0 license
hplt analytics logo
dedup 21.14 GB
cleaned 17.23 GB

Source: CC/IA

DEDUPLICATED:

Docs: 3.32M

Words: 1.25B

Chars: 8.35B

Segments: 68.56M

CLEANED:

Docs: 2.08M

Words: 980.75M

Chars: 6.62B

Segments: 36.32M

3 downloads

Minangkabau (Latin) (min-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.09 GB
cleaned 75.12 MB

Source: CC/IA

DEDUPLICATED:

Docs: 901.19k

Words: 137.36M

Chars: 1.07B

Segments: 23.70M

CLEANED:

Docs: 25.04k

Words: 10.98M

Chars: 74.80M

Segments: 600.80k

1 download

Macedonian (Cyrillic) (mkd-Cyrl)

Creative Commons CC0 license
hplt analytics logo
dedup 22.94 GB
cleaned 16.86 GB

Source: CC/IA

DEDUPLICATED:

Docs: 7.71M

Words: 2.08B

Chars: 13.39B

Segments: 164.74M

CLEANED:

Docs: 3.57M

Words: 1.49B

Chars: 9.44B

Segments: 57.01M

10 downloads

Maltese (Latin) (mlt-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 7.16 GB
cleaned 1.50 GB

Source: CC/IA

DEDUPLICATED:

Docs: 7.75M

Words: 930.52M

Chars: 6.97B

Segments: 129.16M

CLEANED:

Docs: 367.26k

Words: 195.81M

Chars: 1.44B

Segments: 8.68M

1 download

Manipuri (Bangla) (mni-Beng)

Creative Commons CC0 license
hplt analytics logo
dedup 59.36 MB
cleaned 30.85 MB

Source: CC/IA

DEDUPLICATED:

Docs: 14.18k

Words: 3.78M

Chars: 26.58M

Segments: 612.73k

CLEANED:

Docs: 2.93k

Words: 1.63M

Chars: 11.79M

Segments: 65.76k

2 downloads

Mossi (Latin) (mos-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 180.55 MB
cleaned 4.36 MB

Source: CC/IA

DEDUPLICATED:

Docs: 220.45k

Words: 37.80M

Chars: 151.08M

Segments: 6.45M

CLEANED:

Docs: 931

Words: 807.49k

Chars: 3.86M

Segments: 19.10k

1 download

Māori (Latin) (mri-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 855.78 MB
cleaned 433.29 MB

Source: CC/IA

DEDUPLICATED:

Docs: 665.39k

Words: 157.42M

Chars: 829.14M

Segments: 13.43M

CLEANED:

Docs: 108.26k

Words: 86.76M

Chars: 424.40M

Segments: 2.80M

Burmese (Myanmar) (mya-Mymr)

Creative Commons CC0 license
hplt analytics logo
dedup 26.32 GB
cleaned 16.03 GB

Source: CC/IA

DEDUPLICATED:

Docs: 2.98M

Words: 787.24M

Chars: 9.81B

Segments: 75.11M

CLEANED:

Docs: 1.37M

Words: 453.18M

Chars: 5.82B

Segments: 30.50M

2 downloads

Dutch (Latin) (nld-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 664.77 GB
cleaned 453.07 GB

Source: CC/IA

DEDUPLICATED:

Docs: 303.51M

Words: 103.15B

Chars: 661.41B

Segments: 8.06B

CLEANED:

Docs: 138.65M

Words: 71.40B

Chars: 451.22B

Segments: 3.07B

3 downloads

Norwegian Nynorsk (Latin) (nno-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 12.14 GB
cleaned 5.53 GB

Source: CC/IA

DEDUPLICATED:

Docs: 15.13M

Words: 1.83B

Chars: 11.86B

Segments: 224.17M

CLEANED:

Docs: 1.42M

Words: 860.34M

Chars: 5.41B

Segments: 34.60M

7 downloads

Norwegian Bokmål (Latin) (nob-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 204.81 GB
cleaned 136.22 GB

Source: CC/IA

DEDUPLICATED:

Docs: 64.72M

Words: 31.42B

Chars: 200.54B

Segments: 2.00B

CLEANED:

Docs: 27.05M

Words: 21.53B

Chars: 133.27B

Segments: 675.97M

16 downloads

Nepali (Devanagari) (npi-Deva)

Creative Commons CC0 license
hplt analytics logo
dedup 22.67 GB
cleaned 19.23 GB

Source: CC/IA

DEDUPLICATED:

Docs: 4.03M

Words: 1.35B

Chars: 8.70B

Segments: 54.53M

CLEANED:

Docs: 2.78M

Words: 1.13B

Chars: 7.26B

Segments: 37.14M

3 downloads

Northern Sotho (Latin) (nso-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 149.23 MB
cleaned 28.31 MB

Source: CC/IA

DEDUPLICATED:

Docs: 409.23k

Words: 21.19M

Chars: 142.60M

Segments: 3.67M

CLEANED:

Docs: 6.07k

Words: 5.32M

Chars: 27.50M

Segments: 143.31k

3 downloads

Nuer (Latin) (nus-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 55.15 MB
cleaned 2.24 MB

Source: CC/IA

DEDUPLICATED:

Docs: 11.18k

Words: 5.20M

Chars: 44.57M

Segments: 275.47k

CLEANED:

Docs: 272

Words: 393.16k

Chars: 1.88M

Segments: 8.51k

3 downloads

Nyanja (Latin) (nya-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 322.21 MB
cleaned 204.61 MB

Source: CC/IA

DEDUPLICATED:

Docs: 371.76k

Words: 42.92M

Chars: 318.50M

Segments: 7.12M

CLEANED:

Docs: 53.12k

Words: 27.06M

Chars: 202.97M

Segments: 1.34M

2 downloads

Occitan (Latin) (oci-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 3.42 GB
cleaned 655.41 MB

Source: CC/IA

DEDUPLICATED:

Docs: 3.60M

Words: 536.38M

Chars: 3.32B

Segments: 86.55M

CLEANED:

Docs: 189.91k

Words: 102.72M

Chars: 635.59M

Segments: 4.19M

9 downloads

Odia (Odia) (ory-Orya)

Creative Commons CC0 license
hplt analytics logo
dedup 2.45 GB
cleaned 2.06 GB

Source: CC/IA

DEDUPLICATED:

Docs: 587.96k

Words: 145.78M

Chars: 947.33M

Segments: 5.59M

CLEANED:

Docs: 412.89k

Words: 120.13M

Chars: 781.95M

Segments: 3.60M

3 downloads

Pangasinan (Latin) (pag-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 956.20 MB
cleaned 33.93 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.42M

Words: 155.36M

Chars: 889.39M

Segments: 46.27M

CLEANED:

Docs: 6.90k

Words: 5.66M

Chars: 33.53M

Segments: 85.83k

Punjabi (Gurmukhi) (pan-Guru)

Creative Commons CC0 license
hplt analytics logo
dedup 6.26 GB
cleaned 4.75 GB

Source: CC/IA

DEDUPLICATED:

Docs: 1.05M

Words: 517.32M

Chars: 2.67B

Segments: 34.52M

CLEANED:

Docs: 584.59k

Words: 372.17M

Chars: 1.90B

Segments: 11.74M

5 downloads

Papiamento (Latin) (pap-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 2.51 GB
cleaned 257.64 MB

Source: CC/IA

DEDUPLICATED:

Docs: 8.51M

Words: 416.98M

Chars: 2.46B

Segments: 89.44M

CLEANED:

Docs: 89.81k

Words: 46.71M

Chars: 254.18M

Segments: 1.39M

Southern Pashto (Arabic) (pbt-Arab)

Creative Commons CC0 license
hplt analytics logo
dedup 2.76 GB
cleaned 2.30 GB

Source: CC/IA

DEDUPLICATED:

Docs: 769.00k

Words: 334.95M

Chars: 1.59B

Segments: 13.49M

CLEANED:

Docs: 466.47k

Words: 279.44M

Chars: 1.30B

Segments: 8.45M

3 downloads

Persian (Arabic) (pes-Arab)

Creative Commons CC0 license
hplt analytics logo
dedup 1.12 TB
cleaned 799.58 GB

Source: CC/IA

DEDUPLICATED:

Docs: 196.84M

Words: 124.37B

Chars: 644.49B

Segments: 7.03B

CLEANED:

Docs: 90.50M

Words: 88.55B

Chars: 455.15B

Segments: 3.96B

4 downloads

Malagasy (Latin) (plt-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.07 GB
cleaned 824.05 MB

Source: CC/IA

DEDUPLICATED:

Docs: 492.55k

Words: 152.73M

Chars: 1.05B

Segments: 10.18M

CLEANED:

Docs: 207.84k

Words: 117.08M

Chars: 810.51M

Segments: 4.74M

1 download

Polish (Latin) (pol-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 993.11 GB
cleaned 665.34 GB

Source: CC/IA

DEDUPLICATED:

Docs: 382.38M

Words: 136.50B

Chars: 948.27B

Segments: 12.72B

CLEANED:

Docs: 175.41M

Words: 89.53B

Chars: 631.77B

Segments: 4.46B

33 downloads

Portuguese (Latin) (por-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.29 TB
cleaned 924.45 GB

Source: CC/IA

DEDUPLICATED:

Docs: 470.63M

Words: 203.71B

Chars: 1.26T

Segments: 14.18B

CLEANED:

Docs: 237.81M

Words: 146.27B

Chars: 896.79B

Segments: 6.12B

15 downloads

Dari (Arabic) (prs-Arab)

Creative Commons CC0 license
hplt analytics logo
dedup 46.55 GB
cleaned 16.75 GB

Source: CC/IA

DEDUPLICATED:

Docs: 12.33M

Words: 5.21B

Chars: 28.74B

Segments: 413.50M

CLEANED:

Docs: 2.84M

Words: 1.84B

Chars: 9.57B

Segments: 69.00M

6 downloads

Ayacucho Quechua (Latin) (quy-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 334.54 MB
cleaned 146.11 MB

Source: CC/IA

DEDUPLICATED:

Docs: 405.81k

Words: 44.88M

Chars: 327.42M

Segments: 5.30M

CLEANED:

Docs: 36.94k

Words: 17.31M

Chars: 143.45M

Segments: 494.25k

5 downloads

Romanian (Latin) (ron-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 339.34 GB
cleaned 259.17 GB

Source: CC/IA

DEDUPLICATED:

Docs: 115.84M

Words: 52.11B

Chars: 329.18B

Segments: 3.37B

CLEANED:

Docs: 65.88M

Words: 40.05B

Chars: 250.72B

Segments: 1.70B

4 downloads

Rundi (Latin) (run-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.72 GB
cleaned 322.89 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.70M

Words: 232.01M

Chars: 1.69B

Segments: 36.91M

CLEANED:

Docs: 137.30k

Words: 44.44M

Chars: 316.63M

Segments: 1.75M

1 download

Russian (Cyrillic) (rus-Cyrl)

Creative Commons CC0 license
hplt analytics logo
dedup 8.80 TB
cleaned 7.03 TB

Source: CC/IA

DEDUPLICATED:

Docs: 1.64B

Words: 696.30B

Chars: 5.01T

Segments: 49.90B

CLEANED:

Docs: 884.69M

Words: 540.88B

Chars: 3.91T

Segments: 26.29B

9 downloads

Sango (Latin) (sag-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 775.52 MB
cleaned 17.42 MB

Source: CC/IA

DEDUPLICATED:

Docs: 864.29k

Words: 136.25M

Chars: 760.17M

Segments: 39.56M

CLEANED:

Docs: 3.16k

Words: 3.61M

Chars: 16.74M

Segments: 51.90k

2 downloads

Sanskrit (Devanagari) (san-Deva)

Creative Commons CC0 license
hplt analytics logo
dedup 1.90 GB
cleaned 959.37 MB

Source: CC/IA

DEDUPLICATED:

Docs: 200.47k

Words: 95.80M

Chars: 746.25M

Segments: 11.58M

CLEANED:

Docs: 54.91k

Words: 43.80M

Chars: 359.21M

Segments: 3.28M

2 downloads

Santali (Ol Chiki) (sat-Olck)

Creative Commons CC0 license
hplt analytics logo
dedup 35.51 MB
cleaned 15.27 MB

Source: CC/IA

DEDUPLICATED:

Docs: 9.40k

Words: 2.88M

Chars: 16.98M

Segments: 217.79k

CLEANED:

Docs: 2.57k

Words: 1.09M

Chars: 6.27M

Segments: 45.80k

3 downloads

Sicilian (Latin) (scn-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 3.35 GB
cleaned 260.62 MB

Source: CC/IA

DEDUPLICATED:

Docs: 2.42M

Words: 433.55M

Chars: 3.24B

Segments: 116.89M

CLEANED:

Docs: 81.97k

Words: 42.39M

Chars: 252.40M

Segments: 1.65M

Shan (Myanmar) (shn-Mymr)

Creative Commons CC0 license
hplt analytics logo
dedup 83.39 MB
cleaned 57.98 MB

Source: CC/IA

DEDUPLICATED:

Docs: 20.87k

Words: 2.99M

Chars: 33.50M

Segments: 406.32k

CLEANED:

Docs: 6.00k

Words: 1.65M

Chars: 21.22M

Segments: 92.14k

Sinhala (Sinhala) (sin-Sinh)

Creative Commons CC0 license
hplt analytics logo
dedup 15.97 GB
cleaned 12.60 GB

Source: CC/IA

DEDUPLICATED:

Docs: 2.04M

Words: 1.03B

Chars: 6.61B

Segments: 54.63M

CLEANED:

Docs: 1.15M

Words: 795.62M

Chars: 4.98B

Segments: 33.71M

8 downloads

Slovak (Latin) (slk-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 148.08 GB
cleaned 76.36 GB

Source: CC/IA

DEDUPLICATED:

Docs: 68.41M

Words: 20.32B

Chars: 137.70B

Segments: 2.41B

CLEANED:

Docs: 21.83M

Words: 10.63B

Chars: 70.39B

Segments: 494.28M

6 downloads

Slovenian (Latin) (slv-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 69.96 GB
cleaned 36.17 GB

Source: CC/IA

DEDUPLICATED:

Docs: 30.31M

Words: 9.83B

Chars: 68.36B

Segments: 1.01B

CLEANED:

Docs: 10.28M

Words: 5.43B

Chars: 35.27B

Segments: 238.64M

1 download

Samoan (Latin) (smo-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 520.03 MB
cleaned 189.78 MB

Source: CC/IA

DEDUPLICATED:

Docs: 295.95k

Words: 85.62M

Chars: 507.10M

Segments: 6.78M

CLEANED:

Docs: 45.86k

Words: 37.09M

Chars: 186.19M

Segments: 1.01M

Shona (Latin) (sna-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 641.39 MB
cleaned 193.21 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.08M

Words: 72.57M

Chars: 631.56M

Segments: 10.82M

CLEANED:

Docs: 61.08k

Words: 23.92M

Chars: 192.68M

Segments: 1.20M

Sindhi (Arabic) (snd-Arab)

Creative Commons CC0 license
hplt analytics logo
dedup 1.05 GB
cleaned 757.21 MB

Source: CC/IA

DEDUPLICATED:

Docs: 230.24k

Words: 115.44M

Chars: 626.70M

Segments: 5.58M

CLEANED:

Docs: 100.30k

Words: 89.53M

Chars: 428.73M

Segments: 2.83M

2 downloads

Somali (Latin) (som-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 5.36 GB
cleaned 2.58 GB

Source: CC/IA

DEDUPLICATED:

Docs: 3.12M

Words: 661.66M

Chars: 5.30B

Segments: 76.87M

CLEANED:

Docs: 966.51k

Words: 388.75M

Chars: 2.57B

Segments: 16.38M

4 downloads

Southern Sotho (Latin) (sot-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 336.85 MB
cleaned 172.45 MB

Source: CC/IA

DEDUPLICATED:

Docs: 288.14k

Words: 56.98M

Chars: 332.29M

Segments: 8.03M

CLEANED:

Docs: 43.92k

Words: 31.00M

Chars: 171.54M

Segments: 1.09M

1 download

Spanish (Latin) (spa-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 2.58 TB
cleaned 1.99 TB

Source: CC/IA

DEDUPLICATED:

Docs: 838.93M

Words: 414.23B

Chars: 2.53T

Segments: 22.17B

CLEANED:

Docs: 503.07M

Words: 321.95B

Chars: 1.95T

Segments: 12.12B

16 downloads

Sardinian (Latin) (srd-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 25.52 GB
cleaned 151.51 MB

Source: CC/IA

DEDUPLICATED:

Docs: 32.04M

Words: 3.71B

Chars: 25.18B

Segments: 675.85M

CLEANED:

Docs: 53.81k

Words: 23.89M

Chars: 148.80M

Segments: 917.09k

Serbian (Cyrillic) (srp-Cyrl)

Creative Commons CC0 license
hplt analytics logo
dedup 45.89 GB
cleaned 28.79 GB

Source: CC/IA

DEDUPLICATED:

Docs: 9.18M

Words: 3.83B

Chars: 26.69B

Segments: 249.63M

CLEANED:

Docs: 4.12M

Words: 2.52B

Chars: 16.16B

Segments: 93.81M

6 downloads

Swati (Latin) (ssw-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 228.53 MB
cleaned 8.89 MB

Source: CC/IA

DEDUPLICATED:

Docs: 262.40k

Words: 19.25M

Chars: 219.37M

Segments: 5.11M

CLEANED:

Docs: 2.04k

Words: 994.30k

Chars: 8.82M

Segments: 62.13k

3 downloads

Sundanese (Latin) (sun-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 3.50 GB
cleaned 483.15 MB

Source: CC/IA

DEDUPLICATED:

Docs: 2.89M

Words: 505.50M

Chars: 3.45B

Segments: 83.33M

CLEANED:

Docs: 114.75k

Words: 69.63M

Chars: 475.44M

Segments: 3.24M

4 downloads

Swedish (Latin) (swe-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 388.52 GB
cleaned 261.87 GB

Source: CC/IA

DEDUPLICATED:

Docs: 157.92M

Words: 58.47B

Chars: 374.21B

Segments: 4.81B

CLEANED:

Docs: 66.81M

Words: 40.10B

Chars: 251.18B

Segments: 1.75B

13 downloads

Swahili (Latin) (swh-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 7.62 GB
cleaned 4.70 GB

Source: CC/IA

DEDUPLICATED:

Docs: 4.42M

Words: 1.15B

Chars: 7.55B

Segments: 95.87M

CLEANED:

Docs: 1.37M

Words: 717.65M

Chars: 4.67B

Segments: 34.31M

4 downloads

Silesian (Latin) (szl-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 3.87 GB
cleaned 110.44 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.25M

Words: 497.28M

Chars: 3.75B

Segments: 73.39M

CLEANED:

Docs: 40.93k

Words: 14.68M

Chars: 103.88M

Segments: 636.57k

Tamil (Tamil) (tam-Taml)

Creative Commons CC0 license
hplt analytics logo
dedup 87.18 GB
cleaned 69.44 GB

Source: CC/IA

DEDUPLICATED:

Docs: 9.73M

Words: 3.96B

Chars: 34.09B

Segments: 322.60M

CLEANED:

Docs: 6.11M

Words: 2.98B

Chars: 26.24B

Segments: 168.59M

4 downloads

Tamasheq (Latin) (taq-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 478.55 MB
cleaned 9.89 MB

Source: CC/IA

DEDUPLICATED:

Docs: 253.19k

Words: 62.62M

Chars: 452.93M

Segments: 30.90M

CLEANED:

Docs: 1.75k

Words: 1.54M

Chars: 8.85M

Segments: 13.88k

5 downloads

Tamasheq (Tifinagh) (taq-Tfng)

Creative Commons CC0 license
dedup 378.67 kB

Source: CC/IA

DEDUPLICATED:

Docs: 101

Words: 21.32k

Chars: 149.82k

Segments: 1.08k

Central Atlas Tamazight (Tifinagh) (tzm-Tfng)

Creative Commons CC0 license
dedup 42.41 MB

Source: CC/IA

DEDUPLICATED:

Docs: 5.17k

Words: 1.24M

Chars: 15.97M

Segments: 324.49k

Tatar (Cyrillic) (tat-Cyrl)

Creative Commons CC0 license
hplt analytics logo
dedup 5.67 GB
cleaned 3.91 GB

Source: CC/IA

DEDUPLICATED:

Docs: 1.90M

Words: 452.35M

Chars: 3.21B

Segments: 36.63M

CLEANED:

Docs: 630.68k

Words: 296.70M

Chars: 2.16B

Segments: 13.45M

5 downloads

Telugu (Telugu) (tel-Telu)

Creative Commons CC0 license
hplt analytics logo
dedup 20.26 GB
cleaned 16.89 GB

Source: CC/IA

DEDUPLICATED:

Docs: 3.20M

Words: 1.05B

Chars: 8.06B

Segments: 70.31M

CLEANED:

Docs: 2.06M

Words: 835.42M

Chars: 6.51B

Segments: 39.19M

5 downloads

Tajik (Cyrillic) (tgk-Cyrl)

Creative Commons CC0 license
hplt analytics logo
dedup 10.86 GB
cleaned 8.35 GB

Source: CC/IA

DEDUPLICATED:

Docs: 3.41M

Words: 878.17M

Chars: 6.17B

Segments: 86.32M

CLEANED:

Docs: 1.26M

Words: 624.76M

Chars: 4.59B

Segments: 24.85M

2 downloads

Filipino (Latin) (tgl-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 23.17 GB
cleaned 8.18 GB

Source: CC/IA

DEDUPLICATED:

Docs: 8.57M

Words: 3.40B

Chars: 23.06B

Segments: 321.67M

CLEANED:

Docs: 1.87M

Words: 1.35B

Chars: 8.13B

Segments: 52.88M

6 downloads

Thai (Thai) (tha-Thai)

Creative Commons CC0 license
hplt analytics logo
dedup 395.45 GB
cleaned 163.67 GB

Source: CC/IA

DEDUPLICATED:

Docs: 81.71M

Words: 11.59B

Chars: 155.94B

Segments: 2.18B

CLEANED:

Docs: 17.70M

Words: 3.51B

Chars: 59.99B

Segments: 339.05M

19 downloads

Tigrinya (Ethiopic) (tir-Ethi)

Creative Commons CC0 license
hplt analytics logo
dedup 3.17 GB
cleaned 456.37 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.03M

Words: 194.51M

Chars: 2.31B

Segments: 38.58M

CLEANED:

Docs: 64.69k

Words: 36.72M

Chars: 181.70M

Segments: 1.13M

2 downloads

Tok Pisin (Latin) (tpi-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 458.15 MB
cleaned 64.93 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.16M

Words: 79.19M

Chars: 453.63M

Segments: 16.46M

CLEANED:

Docs: 13.98k

Words: 12.51M

Chars: 64.54M

Segments: 282.37k

Tswana (Latin) (tsn-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 170.41 MB
cleaned 27.86 MB

Source: CC/IA

DEDUPLICATED:

Docs: 361.19k

Words: 22.20M

Chars: 168.50M

Segments: 5.43M

CLEANED:

Docs: 6.05k

Words: 5.27M

Chars: 27.68M

Segments: 132.17k

3 downloads

Tsonga (Latin) (tso-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 154.02 MB
cleaned 50.11 MB

Source: CC/IA

DEDUPLICATED:

Docs: 193.13k

Words: 21.88M

Chars: 151.72M

Segments: 3.37M

CLEANED:

Docs: 11.01k

Words: 8.67M

Chars: 49.30M

Segments: 221.25k

3 downloads

Turkmen (Latin) (tuk-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.77 GB
cleaned 626.47 MB

Source: CC/IA

DEDUPLICATED:

Docs: 806.53k

Words: 230.24M

Chars: 1.65B

Segments: 30.73M

CLEANED:

Docs: 171.04k

Words: 70.68M

Chars: 570.17M

Segments: 3.36M

1 download

Tumbuka (Latin) (tum-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 144.26 MB
cleaned 21.62 MB

Source: CC/IA

DEDUPLICATED:

Docs: 59.60k

Words: 16.12M

Chars: 142.19M

Segments: 2.33M

CLEANED:

Docs: 4.38k

Words: 2.88M

Chars: 21.10M

Segments: 99.01k

3 downloads

Turkish (Latin) (tur-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 591.66 GB
cleaned 426.16 GB

Source: CC/IA

DEDUPLICATED:

Docs: 236.66M

Words: 73.92B

Chars: 543.97B

Segments: 5.87B

CLEANED:

Docs: 116.57M

Words: 51.67B

Chars: 389.75B

Segments: 2.57B

10 downloads

Akan (Latin) (twi-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 268.42 MB
cleaned 27.35 MB

Source: CC/IA

DEDUPLICATED:

Docs: 619.40k

Words: 36.55M

Chars: 258.14M

Segments: 6.27M

CLEANED:

Docs: 5.86k

Words: 4.70M

Chars: 24.18M

Segments: 125.61k

1 download

Uyghur (Arabic) (uig-Arab)

Creative Commons CC0 license
hplt analytics logo
dedup 4.50 GB
cleaned 3.25 GB

Source: CC/IA

DEDUPLICATED:

Docs: 1.09M

Words: 317.32M

Chars: 2.48B

Segments: 21.44M

CLEANED:

Docs: 442.40k

Words: 223.91M

Chars: 1.75B

Segments: 8.98M

6 downloads

Ukrainian (Cyrillic) (ukr-Cyrl)

Creative Commons CC0 license
hplt analytics logo
dedup 412.30 GB
cleaned 330.27 GB

Source: CC/IA

DEDUPLICATED:

Docs: 81.51M

Words: 31.90B

Chars: 231.82B

Segments: 2.09B

CLEANED:

Docs: 47.40M

Words: 25.23B

Chars: 182.92B

Segments: 1.17B

8 downloads

Umbundu (Latin) (umb-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 45.41 MB
cleaned 15.58 MB

Source: CC/IA

DEDUPLICATED:

Docs: 62.75k

Words: 6.39M

Chars: 43.81M

Segments: 1.09M

CLEANED:

Docs: 2.47k

Words: 2.43M

Chars: 15.41M

Segments: 59.91k

1 download

Urdu (Arabic) (urd-Arab)

Creative Commons CC0 license
hplt analytics logo
dedup 26.57 GB
cleaned 17.66 GB

Source: CC/IA

DEDUPLICATED:

Docs: 7.16M

Words: 3.28B

Chars: 15.89B

Segments: 229.88M

CLEANED:

Docs: 3.19M

Words: 2.13B

Chars: 10.01B

Segments: 50.63M

13 downloads

Uzbek (Latin) (uzn-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 10.59 GB
cleaned 2.94 GB

Source: CC/IA

DEDUPLICATED:

Docs: 9.75M

Words: 1.24B

Chars: 10.34B

Segments: 198.89M

CLEANED:

Docs: 706.92k

Words: 351.32M

Chars: 2.85B

Segments: 14.80M

4 downloads

Venetian (Latin) (vec-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 10.87 GB
cleaned 222.45 MB

Source: CC/IA

DEDUPLICATED:

Docs: 11.52M

Words: 1.61B

Chars: 10.70B

Segments: 425.91M

CLEANED:

Docs: 84.81k

Words: 35.25M

Chars: 218.06M

Segments: 1.58M

1 download

Vietnamese (Latin) (vie-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 630.05 GB
cleaned 494.66 GB

Source: CC/IA

DEDUPLICATED:

Docs: 174.14M

Words: 105.24B

Chars: 491.19B

Segments: 5.35B

CLEANED:

Docs: 100.75M

Words: 83.20B

Chars: 379.59B

Segments: 3.02B

11 downloads

Waray (Latin) (war-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 138.17 MB
cleaned 35.80 MB

Source: CC/IA

DEDUPLICATED:

Docs: 286.40k

Words: 23.27M

Chars: 136.08M

Segments: 3.12M

CLEANED:

Docs: 13.87k

Words: 5.89M

Chars: 35.59M

Segments: 200.94k

Wolof (Latin) (wol-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 257.74 MB
cleaned 28.68 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.97M

Words: 38.69M

Chars: 246.24M

Segments: 10.43M

CLEANED:

Docs: 5.68k

Words: 5.46M

Chars: 27.55M

Segments: 161.47k

1 download

Xhosa (Latin) (xho-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 6.72 GB
cleaned 259.51 MB

Source: CC/IA

DEDUPLICATED:

Docs: 3.41M

Words: 494.95M

Chars: 6.62B

Segments: 56.47M

CLEANED:

Docs: 63.09k

Words: 30.34M

Chars: 258.73M

Segments: 1.82M

4 downloads

Yiddish (Hebrew) (ydd-Hebr)

Creative Commons CC0 license
hplt analytics logo
dedup 1.57 GB
cleaned 814.74 MB

Source: CC/IA

DEDUPLICATED:

Docs: 414.98k

Words: 150.99M

Chars: 911.06M

Segments: 9.48M

CLEANED:

Docs: 128.26k

Words: 77.53M

Chars: 458.62M

Segments: 2.94M

Yoruba (Latin) (yor-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 2.02 GB
cleaned 254.85 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.89M

Words: 246.78M

Chars: 1.50B

Segments: 30.70M

CLEANED:

Docs: 66.13k

Words: 42.81M

Chars: 217.89M

Segments: 1.47M

2 downloads

Cantonese (Traditional) (yue-Hant)

Creative Commons CC0 license
hplt analytics logo
dedup 4.15 GB
cleaned 189.86 MB

Source: CC/IA

DEDUPLICATED:

Docs: 3.77M

Words: 235.77M

Chars: 2.27B

Segments: 131.06M

CLEANED:

Docs: 61.29k

Words: 3.27M

Chars: 74.36M

Segments: 1.24M

8 downloads

Simplified Chinese (zho-Hans)

Creative Commons CC0 license
hplt analytics logo
dedup 9.33 TB
cleaned 6.25 TB

Source: CC/IA

DEDUPLICATED:

Docs: 2.71B

Words: 149.31B

Chars: 3.67T

Segments: 76.75B

CLEANED:

Docs: 1.25B

Words: 74.01B

Chars: 2.35T

Segments: 42.45B

31 downloads

Traditional Chinese (zho-Hant)

Creative Commons CC0 license
hplt analytics logo
dedup 1.03 TB
cleaned 743.69 GB

Source: CC/IA

DEDUPLICATED:

Docs: 321.27M

Words: 17.31B

Chars: 417.51B

Segments: 7.82B

CLEANED:

Docs: 157.11M

Words: 9.51B

Chars: 286.98B

Segments: 4.48B

8 downloads

Malay (Latin) (zsm-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 195.62 GB
cleaned 78.84 GB

Source: CC/IA

DEDUPLICATED:

Docs: 69.23M

Words: 28.66B

Chars: 193.52B

Segments: 3.11B

CLEANED:

Docs: 18.42M

Words: 11.48B

Chars: 78.45B

Segments: 579.82M

4 downloads

Zulu (Latin) (zul-Latn)

Creative Commons CC0 license
hplt analytics logo
dedup 1.37 GB
cleaned 382.54 MB

Source: CC/IA

DEDUPLICATED:

Docs: 1.31M

Words: 156.42M

Chars: 1.33B

Segments: 33.14M

CLEANED:

Docs: 113.62k

Words: 44.36M

Chars: 380.92M

Segments: 2.71M

6 downloads
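
The counts in the listing above are rounded and use k/M/B/T suffixes. As a rough illustration (this is not part of the HPLT tooling, and the function names `parse_count` and `retention` are hypothetical), the sketch below converts such figures to integers and estimates what share of the deduplicated variant remains in the cleaned one. Because the published figures are rounded, the resulting ratios are approximate.

```python
from decimal import Decimal

# Suffix multipliers as used in the listing (k = thousand, M = million, ...).
SUFFIXES = {"k": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_count(text: str) -> int:
    """Convert a rounded count such as '113.62k' or '531' to an integer."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        # Decimal avoids binary floating-point error in the multiplication.
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(Decimal(text))


def retention(dedup: str, cleaned: str) -> float:
    """Fraction of the deduplicated count that remains in the cleaned variant."""
    return parse_count(cleaned) / parse_count(dedup)


# Document counts for Zulu (zul-Latn), copied from the listing above.
zulu_docs = retention("1.31M", "113.62k")
print(f"Zulu: {zulu_docs:.1%} of deduplicated documents survive cleaning")
```

The same helper works for the Words, Chars, and Segments rows, which makes it easy to spot languages where cleaning removes most of the data before committing to a full download.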

License

These data are released under this licensing scheme:

  • We do not own any of the text from which these text data have been extracted.*
  • We license the actual packaging of these text data under the Creative Commons CC0 license ("no rights reserved").

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • You can reach us at hplt-datasets@ufal.mff.cuni.cz

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.

*It is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.