HPLT Datasets v2


Version 2.0 of the HPLT Monolingual Datasets is now published. These collections are available under the Creative Commons CC0 license and bring significant improvements compared to previous releases (version 1.2). Similarly to 1.2, the release comes in two variants: deduplicated (21 TB in size) and cleaned (15 TB in size). The cleaned variant contains the same documents as deduplicated minus those filtered out by our cleaning heuristics. The cleaned variant is recommended unless you want to try your own cleaning pipelines.

Similar to the previous releases, version 2.0 datasets are hosted by the Sigma2 NIRD Data Lake, and the text extraction pipeline was run on the LUMI supercomputer.

HPLT Monolingual Datasets version 2.0 (the deduplicated variant) feature about 7.6 trillion whitespace-separated words and about 52 trillion characters extracted from 21 billion documents, compared to 5.6 trillion words and 42 trillion characters extracted from 5 billion documents in version 1.2. All in all, you can expect less noise and boilerplate, fewer duplicates, more unique documents, and generally better-quality texts to train language models on.

What's new in v2.0

What are the HPLT Analytics reports linked to each language?

These automated reports provide useful statistics about the clean version of the HPLT v.2.0 datasets. They are the result of running the HPLT Analytics Tool on them. They allow inspecting dataset properties prior to a full download.

[Figure: dataset pipeline]

Output format

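The output format is JSONL, where each line is a valid JSON object, providing a full document with all its metadata and text content. An example is provided here (line breaks and indentation have been added for readability; in the actual files each document occupies a single line):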

{"f":"./CC-MAIN-20170116095124-00401-ip-10-171-10-70.ec2.internal.warc.2.gz","o":37461524,"s":8114,"rs":27035,
    "u":"http://blogtailors.blogspot.com/2010/08/saida-de-emergencia-procura-tradutores.html",
    "c":"text/html","ts":"2017-01-24T09:05:15Z", 
    "collection":"cc17",
    "lang":["por_Latn","slk_Latn","kmb_Latn"],"prob":[1,0,0],
    "text":"A editora Saída de Emergência procura revisores
            e tradutores profissionais em regime freelancer com formação
            superior linguística e experiência na área de revisão e 
            tradução literária.\nPreza-se a capacidade de cumprimento
            de prazos bem definidos. Os interessados poderão enviar 
            candidatura paraa geral@saidademergencia.com.\nquarta-feira, 
            4 de agosto de 2010\nSaída de Emergência procura tradutores 
            e revisores em regime freelancer\nA editora Saída de Emergência
            procura revisores e tradutores profissionais em regime 
            freelancer com formação superior linguística eexperiência na 
            área de revisão e tradução literária.",
    "seg_langs":["por_Latn","por_Latn","por_Latn","por_Latn","por_Latn"],
    "robotstxt":"allowed",
    "id":"92a22c9672a52ae5af0da0457a184151",
    "filter":"keep",
    "pii":[[296,323]],
    "doc_scores":[6.4,10,10,10,10,10,8,0,0]}
{"f":"./path/to/80716-00468.warc.gz","o":579437,"s":1100,"rs":44535,
...
    "text":"More texts\n...",
...

In each document's text field, the segments are concatenated using newline separators. The first 7 fields are inherited from the warc2text HTML extraction from the WARCs, explained here, with the exception of the p field, which is replaced by text, and the l field, which is replaced by lang and prob; these two describe the top three identified languages for the document and their prediction probabilities. The rest of the output is explained here.
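As a rough illustration of the format described above, the following Python sketch parses one JSONL line, splits the text field back into segments, and pairs each segment with its entry in seg_langs. The one-to-one alignment between segments and seg_langs is an assumption based on the example above (five segments, five labels). The sample record is a made-up, abbreviated document that only reuses field names from the example; the function names are hypothetical.

```python
import json


def segments_with_langs(line: str) -> list[tuple[str, str]]:
    """Return (language, segment) pairs for one JSONL document line."""
    doc = json.loads(line)
    # Segments are joined with newline characters inside the "text" field.
    segments = doc["text"].split("\n")
    seg_langs = doc["seg_langs"]
    # Assumption: one language label per segment, as in the example above.
    if len(segments) != len(seg_langs):
        raise ValueError("seg_langs does not align with the text segments")
    return list(zip(seg_langs, segments))


def top_language(line: str) -> tuple[str, float]:
    """Return the document-level language with the highest probability."""
    doc = json.loads(line)
    best = max(range(len(doc["prob"])), key=lambda i: doc["prob"][i])
    return doc["lang"][best], doc["prob"][best]


# Made-up, abbreviated record for illustration only.
sample = json.dumps({
    "lang": ["por_Latn", "slk_Latn", "kmb_Latn"],
    "prob": [1, 0, 0],
    "text": "Primeiro segmento.\nSegundo segmento.",
    "seg_langs": ["por_Latn", "por_Latn"],
})

for lang, segment in segments_with_langs(sample):
    print(lang, segment)
```

When reading real shards, the same functions could be applied to each line of a decompressed .jsonl.zst file.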

How to download it

The simplest way to download the data is to use wget -i with the language-specific mapping files containing full URLs for all shards of this particular language, for example:

wget -i https://data.hplt-project.org/two/deduplicated/epo_Latn_map.txt

wget -i https://data.hplt-project.org/two/cleaned/epo_Latn_map.txt
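If you prefer to script the selection of languages, the URL pattern of the two commands above can be generated programmatically. The sketch below is a minimal illustration: the helper names are hypothetical, and it assumes that every language follows the same `<variant>/<language code>_map.txt` pattern and that a mapping file lists one URL per line (which is what wget -i expects).

```python
BASE_URL = "https://data.hplt-project.org/two"
VARIANTS = ("deduplicated", "cleaned")


def map_url(lang_code: str, variant: str = "cleaned") -> str:
    """Build the mapping-file URL for a language code such as 'epo_Latn'."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"{BASE_URL}/{variant}/{lang_code}_map.txt"


def shard_urls(map_text: str) -> list[str]:
    """Extract shard URLs from the text of a mapping file (one URL per line)."""
    return [line.strip() for line in map_text.splitlines() if line.strip()]


print(map_url("epo_Latn", "deduplicated"))
print(map_url("epo_Latn"))
```

The resulting URL can be passed to wget -i exactly as in the commands above.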

Full download

If you want to download all the available files from deduplicated or cleaned variants in one click, use

wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/two/deduplicated/hplt_monolingual_map_deduplicated_2.0.txt

wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/two/cleaned/hplt_monolingual_map_cleaned_2.0.txt
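In these commands, -nH stops wget from creating a directory named after the host, -x forces the remote directory structure to be recreated locally, and --cut-dirs=2 drops the first two remote directory components. The Python sketch below mimics that path computation so you can preview where a file would land. The function name is hypothetical and the shard URL is only illustrative, pieced together from the directory and file names mentioned on this page.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def local_path(url: str, cut_dirs: int = 2) -> str:
    """Approximate the local path wget -nH -x --cut-dirs=N would use for a URL."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    directories, filename = parts[:-1], parts[-1]
    # Drop the first `cut_dirs` directory components; the host is never included (-nH).
    kept = directories[cut_dirs:]
    return str(PurePosixPath(*kept, filename))


# Illustrative URL, not taken from an actual mapping file.
print(local_path("https://data.hplt-project.org/two/cleaned/nob_Latn/1.jsonl.zst"))
```

With two directory components cut, the variant name is not part of the local path, so downloading both variants into the same working directory would mix their files; separate target directories avoid that.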

Validating downloads

Every .zst file in the deduplicated and cleaned variants is accompanied by a corresponding .zst.md5 file containing its MD5 checksum. See, for example, this directory with the cleaned Norwegian data: https://data.hplt-project.org/two/cleaned/nob_Latn/

The integrity of a file can be checked with, e.g., md5sum -c 1.jsonl.zst.md5
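Where md5sum is not available, the same check can be sketched in Python with the standard hashlib module. This is a minimal example, not part of the HPLT tooling: the function names are hypothetical, and it assumes the .md5 file uses the usual md5sum layout (hex digest first, then the file name), which is what md5sum -c requires.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(data_path: str, md5_path: str) -> bool:
    """Compare a file's MD5 digest with the one stored in its .md5 companion."""
    # Assumption: md5sum-style content, i.e. "<hex digest>  <file name>".
    expected = Path(md5_path).read_text().split()[0].lower()
    return md5_of_file(data_path) == expected
```

For example, verify_md5("1.jsonl.zst", "1.jsonl.zst.md5") returns True when the downloaded shard matches its published checksum.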

Datasets Catalogue

There are 193 languages in the HPLT dataset catalogue in version 2.0. For each language and variant (deduplicated and cleaned), counts for the number of documents, words, characters and segments are provided. If you find any problem, please contact us!
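The counts in the catalogue are abbreviated (for example 11.67k, 3.35M, 1.59B or 2.48T). If you want to compare or sum them programmatically, a small helper like the one below can expand them. It is a sketch that assumes k, M, B and T stand for thousand, million, billion and trillion; since the published figures are rounded, the results are approximations.

```python
from decimal import Decimal

# Assumed meaning of the suffixes used in the catalogue.
SUFFIXES = {"k": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_count(text: str) -> int:
    """Expand an abbreviated count such as '11.67k' into an integer."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        # Decimal avoids floating-point rounding surprises when scaling.
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


print(parse_count("11.67k"), parse_count("3.35M"), parse_count("16"))
```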

Language family distribution in HPLT


License and takedown

License

These data are released under this licensing scheme:

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • You can reach us at hplt-datasets@ufal.mff.cuni.cz

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.

*It is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.

Source: CC/IA

DEDUPLICATED:

Docs: 11.67k

Words: 3.35M

Chars: 25.69M

Segments: 825.08k

CLEANED:

Docs: 16

Words: 8.36k

Chars: 49.74k

Segments: 117

16 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 1.26M

Words: 172.75M

Chars: 1.59B

Segments: 44.95M

CLEANED:

Docs: 12.93k

Words: 8.20M

Chars: 50.85M

Segments: 206.19k

11 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 9.59M

Words: 2.43B

Chars: 18.35B

Segments: 251.52M

CLEANED:

Docs: 1.46M

Words: 1.00B

Chars: 5.95B

Segments: 37.74M

6 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 11.29M

Words: 4.37B

Chars: 28.42B

Segments: 304.29M

CLEANED:

Docs: 5.39M

Words: 2.71B

Chars: 16.10B

Segments: 95.10M

5 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 13.02M

Words: 7.21B

Chars: 65.58B

Segments: 725.19M

CLEANED:

Docs: 295.54k

Words: 195.89M

Chars: 1.03B

Segments: 7.01M

8 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 191.75M

Words: 73.48B

Chars: 434.47B

Segments: 5.64B

CLEANED:

Docs: 82.67M

Words: 48.14B

Chars: 279.59B

Segments: 2.20B

27 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 263.14k

Words: 94.13M

Chars: 617.90M

Segments: 4.76M

CLEANED:

Docs: 175.71k

Words: 73.44M

Chars: 475.83M

Segments: 2.68M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 12.53M

Words: 1.99B

Chars: 12.78B

Segments: 361.09M

CLEANED:

Docs: 273.24k

Words: 194.99M

Chars: 1.24B

Segments: 7.43M

4 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 1.12M

Words: 319.73M

Chars: 1.91B

Segments: 64.96M

CLEANED:

Docs: 7.28k

Words: 6.05M

Chars: 28.78M

Segments: 131.47k

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 3.56M

Words: 402.05M

Chars: 2.40B

Segments: 84.87M

CLEANED:

Docs: 9.22k

Words: 3.07M

Chars: 25.09M

Segments: 188.53k

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 253.48k

Words: 69.60M

Chars: 445.30M

Segments: 7.60M

CLEANED:

Docs: 66.11k

Words: 39.58M

Chars: 260.26M

Segments: 2.39M

5 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 16.09M

Words: 4.42B

Chars: 34.65B

Segments: 409.54M

CLEANED:

Docs: 6.48M

Words: 2.57B

Chars: 19.63B

Segments: 126.61M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 778.47k

Words: 144.56M

Chars: 1.02B

Segments: 14.89M

CLEANED:

Docs: 170.82k

Words: 75.33M

Chars: 558.67M

Segments: 3.14M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 1.45M

Words: 32.23M

Chars: 230.47M

Segments: 8.57M

CLEANED:

Docs: 5.72k

Words: 3.98M

Chars: 20.74M

Segments: 91.72k

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 336.98k

Words: 74.92M

Chars: 628.80M

Segments: 15.37M

CLEANED:

Docs: 10.70k

Words: 11.34M

Chars: 77.26M

Segments: 601.14k

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 5.75M

Words: 2.27B

Chars: 18.85B

Segments: 167.78M

CLEANED:

Docs: 2.32M

Words: 1.21B

Chars: 8.54B

Segments: 48.84M

13 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 759.41k

Words: 116.22M

Chars: 612.21M

Segments: 34.05M

CLEANED:

Docs: 6.14k

Words: 4.52M

Chars: 32.33M

Segments: 133.54k

Source: CC/IA

DEDUPLICATED:

Docs: 17.34M

Words: 6.41B

Chars: 41.25B

Segments: 493.78M

CLEANED:

Docs: 11.04M

Words: 4.64B

Chars: 30.17B

Segments: 176.01M

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 226.96k

Words: 39.34M

Chars: 213.88M

Segments: 4.15M

CLEANED:

Docs: 28.64k

Words: 13.47M

Chars: 68.68M

Segments: 458.26k

4 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 14.07k

Words: 2.16M

Chars: 18.62M

Segments: 591.75k

CLEANED:

Docs: 1.11k

Words: 548.24k

Chars: 3.32M

Segments: 19.53k

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 1.31M

Words: 209.29M

Chars: 1.67B

Segments: 49.15M

CLEANED:

Docs: 18.76k

Words: 8.05M

Chars: 55.99M

Segments: 366.34k

Source: CC/IA

DEDUPLICATED:

Docs: 171.77k

Words: 15.53M

Chars: 680.83M

Segments: 2.59M

CLEANED:

Docs: 27.44k

Words: 5.78M

Chars: 268.56M

Segments: 464.99k

9 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 35.57M

Words: 13.58B

Chars: 89.09B

Segments: 860.56M

CLEANED:

Docs: 14.61M

Words: 7.26B

Chars: 46.09B

Segments: 268.16M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 572.44k

Words: 79.03M

Chars: 620.28M

Segments: 17.32M

CLEANED:

Docs: 2.02k

Words: 2.70M

Chars: 19.31M

Segments: 38.55k

Source: CC/IA

DEDUPLICATED:

Docs: 56.66M

Words: 20.84B

Chars: 134.99B

Segments: 1.48B

CLEANED:

Docs: 28.09M

Words: 15.30B

Chars: 96.96B

Segments: 681.41M

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 34.44M

Words: 12.69B

Chars: 78.07B

Segments: 724.77M

CLEANED:

Docs: 18.55M

Words: 10.02B

Chars: 60.21B

Segments: 383.34M

7 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 1.24M

Words: 188.62M

Chars: 1.20B

Segments: 21.13M

CLEANED:

Docs: 138.84k

Words: 85.89M

Chars: 515.83M

Segments: 2.86M

Source: CC/IA

DEDUPLICATED:

Docs: 168.65M

Words: 62.92B

Chars: 412.04B

Segments: 5.40B

CLEANED:

Docs: 75.29M

Words: 42.08B

Chars: 274.01B

Segments: 1.93B

6 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 27.76k

Words: 3.79M

Chars: 27.55M

Segments: 820.23k

CLEANED:

Docs: 1.20k

Words: 964.70k

Chars: 7.43M

Segments: 36.70k

Source: CC/IA

DEDUPLICATED:

Docs: 1.26M

Words: 442.62M

Chars: 2.86B

Segments: 22.55M

CLEANED:

Docs: 273.75k

Words: 142.65M

Chars: 913.08M

Segments: 5.23M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 2.94M

Words: 303.68M

Chars: 2.01B

Segments: 59.37M

CLEANED:

Docs: 122.74k

Words: 36.76M

Chars: 281.20M

Segments: 1.38M

Source: CC/IA

DEDUPLICATED:

Docs: 2.58M

Words: 793.12M

Chars: 5.35B

Segments: 79.45M

CLEANED:

Docs: 758.13k

Words: 409.04M

Chars: 2.40B

Segments: 15.57M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 101.01M

Words: 31.83B

Chars: 205.93B

Segments: 2.80B

CLEANED:

Docs: 33.84M

Words: 21.20B

Chars: 133.41B

Segments: 873.02M

12 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 897.17M

Words: 344.15B

Chars: 2.48T

Segments: 28.01B

CLEANED:

Docs: 482.05M

Words: 251.48B

Chars: 1.78T

Segments: 11.13B

7 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 72.72k

Words: 14.17M

Chars: 72.43M

Segments: 1.48M

CLEANED:

Docs: 2.33k

Words: 2.29M

Chars: 11.54M

Segments: 34.65k

Source: CC/IA

DEDUPLICATED:

Docs: 22.24k

Words: 2.07M

Chars: 11.85M

Segments: 198.56k

CLEANED:

Docs: 1.39k

Words: 1.19M

Chars: 5.55M

Segments: 24.56k

Source: CC/IA

DEDUPLICATED:

Docs: 60.26k

Words: 8.12M

Chars: 96.26M

Segments: 1.85M

CLEANED:

Docs: 1.63k

Words: 422.24k

Chars: 7.38M

Segments: 39.97k

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 126.93M

Words: 55.53B

Chars: 373.82B

Segments: 3.68B

CLEANED:

Docs: 70.33M

Words: 42.70B

Chars: 283.60B

Segments: 1.85B

6 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 7.72B

Words: 3.75T

Chars: 22.79T

Segments: 220.10B

CLEANED:

Docs: 4.39B

Words: 2.86T

Chars: 17.09T

Segments: 116.52B

60 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 2.15M

Words: 681.98M

Chars: 4.33B

Segments: 51.27M

CLEANED:

Docs: 818.88k

Words: 471.60M

Chars: 2.98B

Segments: 20.35M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 27.01M

Words: 8.05B

Chars: 62.78B

Segments: 822.02M

CLEANED:

Docs: 8.45M

Words: 4.74B

Chars: 36.03B

Segments: 264.42M

13 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 8.36M

Words: 2.30B

Chars: 16.77B

Segments: 156.66M

CLEANED:

Docs: 1.97M

Words: 776.64M

Chars: 6.05B

Segments: 37.62M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 422.52k

Words: 27.85M

Chars: 291.48M

Segments: 5.78M

CLEANED:

Docs: 3.77k

Words: 4.31M

Chars: 21.32M

Segments: 143.40k

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 772.73k

Words: 166.00M

Chars: 1.18B

Segments: 19.69M

CLEANED:

Docs: 239.92k

Words: 93.45M

Chars: 582.04M

Segments: 4.53M

8 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 246.03k

Words: 24.93M

Chars: 212.08M

Segments: 3.05M

CLEANED:

Docs: 8.91k

Words: 7.26M

Chars: 37.70M

Segments: 178.92k

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 79.98M

Words: 28.26B

Chars: 235.64B

Segments: 3.12B

CLEANED:

Docs: 34.82M

Words: 18.45B

Chars: 155.71B

Segments: 976.62M

16 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 41.82k

Words: 9.08M

Chars: 57.06M

Segments: 522.76k

CLEANED:

Docs: 1.23k

Words: 1.23M

Chars: 5.34M

Segments: 14.76k

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 685.35M

Words: 303.58B

Chars: 1.87T

Segments: 18.88B

CLEANED:

Docs: 401.83M

Words: 237.04B

Chars: 1.46T

Segments: 10.56B

17 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 7.78M

Words: 475.04M

Chars: 2.85B

Segments: 133.33M

CLEANED:

Docs: 36.67k

Words: 20.82M

Chars: 114.77M

Segments: 730.04k

Source: CC/IA

DEDUPLICATED:

Docs: 171.74k

Words: 56.25M

Chars: 498.80M

Segments: 8.17M

CLEANED:

Docs: 7.76k

Words: 5.14M

Chars: 29.91M

Segments: 133.98k

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 563.06k

Words: 120.65M

Chars: 1.21B

Segments: 14.06M

CLEANED:

Docs: 49.14k

Words: 28.88M

Chars: 219.26M

Segments: 973.63k

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 1.42M

Words: 302.57M

Chars: 2.12B

Segments: 43.72M

CLEANED:

Docs: 137.41k

Words: 80.66M

Chars: 483.76M

Segments: 3.31M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 2.76M

Words: 573.32M

Chars: 3.88B

Segments: 57.52M

CLEANED:

Docs: 490.79k

Words: 295.71M

Chars: 1.75B

Segments: 10.99M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 17.32M

Words: 3.52B

Chars: 24.37B

Segments: 635.06M

CLEANED:

Docs: 3.02M

Words: 1.64B

Chars: 10.11B

Segments: 61.18M

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 7.38M

Words: 1.02B

Chars: 6.95B

Segments: 169.01M

CLEANED:

Docs: 73.42k

Words: 30.72M

Chars: 218.70M

Segments: 1.71M

Source: CC/IA

DEDUPLICATED:

Docs: 2.52M

Words: 738.53M

Chars: 4.58B

Segments: 51.48M

CLEANED:

Docs: 1.13M

Words: 576.82M

Chars: 3.39B

Segments: 20.64M

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 6.27M

Words: 944.63M

Chars: 5.69B

Segments: 162.56M

CLEANED:

Docs: 212.69k

Words: 122.29M

Chars: 639.12M

Segments: 4.64M

Source: CC/IA

DEDUPLICATED:

Docs: 1.63M

Words: 300.87M

Chars: 1.69B

Segments: 31.88M

CLEANED:

Docs: 315.87k

Words: 152.62M

Chars: 853.83M

Segments: 5.69M

Source: CC/IA

DEDUPLICATED:

Docs: 40.69M

Words: 16.09B

Chars: 93.70B

Segments: 1.43B

CLEANED:

Docs: 17.12M

Words: 9.97B

Chars: 56.84B

Segments: 466.63M

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 26.80M

Words: 13.76B

Chars: 74.08B

Segments: 751.52M

CLEANED:

Docs: 13.65M

Words: 8.64B

Chars: 43.97B

Segments: 267.41M

10 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 914.27k

Words: 59.78M

Chars: 324.99M

Segments: 22.51M

CLEANED:

Docs: 2.81k

Words: 2.20M

Chars: 10.60M

Segments: 55.00k

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 41.23M

Words: 14.20B

Chars: 92.86B

Segments: 1.13B

CLEANED:

Docs: 12.30M

Words: 7.31B

Chars: 48.01B

Segments: 297.13M

6 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 116.86M

Words: 44.00B

Chars: 324.50B

Segments: 4.16B

CLEANED:

Docs: 51.87M

Words: 30.52B

Chars: 225.25B

Segments: 1.42B

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 6.44M

Words: 1.72B

Chars: 12.97B

Segments: 123.72M

CLEANED:

Docs: 3.60M

Words: 1.40B

Chars: 10.72B

Segments: 65.24M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 1.41M

Words: 121.57M

Chars: 823.31M

Segments: 18.86M

CLEANED:

Docs: 56.29k

Words: 38.29M

Chars: 205.21M

Segments: 1.41M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 2.55M

Words: 164.05M

Chars: 1.10B

Segments: 40.06M

CLEANED:

Docs: 48.75k

Words: 24.78M

Chars: 156.84M

Segments: 1.12M

Source: CC/IA

DEDUPLICATED:

Docs: 169.44M

Words: 78.71B

Chars: 551.63B

Segments: 4.74B

CLEANED:

Docs: 98.14M

Words: 54.62B

Chars: 384.32B

Segments: 2.39B

4 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 6.02M

Words: 2.13B

Chars: 13.37B

Segments: 153.03M

CLEANED:

Docs: 2.84M

Words: 1.54B

Chars: 9.60B

Segments: 69.64M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 381.65M

Words: 170.20B

Chars: 1.13T

Segments: 10.21B

CLEANED:

Docs: 221.75M

Words: 127.41B

Chars: 820.82B

Segments: 5.13B

8 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 1.28M

Words: 311.46M

Chars: 2.44B

Segments: 31.44M

CLEANED:

Docs: 195.97k

Words: 137.82M

Chars: 937.71M

Segments: 6.43M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 1.16B

Words: 106.81B

Chars: 1.63T

Segments: 51.70B

CLEANED:

Docs: 417.71M

Words: 42.36B

Chars: 901.53B

Segments: 23.27B

19 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 1.35M

Words: 257.65M

Chars: 3.26B

Segments: 61.52M

CLEANED:

Docs: 15.10k

Words: 9.22M

Chars: 54.21M

Segments: 345.22k

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 101.29k

Words: 39.46M

Chars: 375.99M

Segments: 9.26M

CLEANED:

Docs: 7.59k

Words: 5.96M

Chars: 28.41M

Segments: 159.42k

Source: CC/IA

DEDUPLICATED:

Docs: 40.20k

Words: 5.46M

Chars: 32.84M

Segments: 842.30k

CLEANED:

Docs: 1.18k

Words: 674.04k

Chars: 4.65M

Segments: 14.26k

Source: CC/IA

DEDUPLICATED:

Docs: 2.51M

Words: 739.80M

Chars: 5.73B

Segments: 71.25M

CLEANED:

Docs: 1.34M

Words: 532.86M

Chars: 4.30B

Segments: 24.93M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 15.18k

Words: 2.72M

Chars: 21.60M

Segments: 545.00k

CLEANED:

Docs: 949

Words: 678.02k

Chars: 3.47M

Segments: 27.11k

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 23.44k

Words: 5.18M

Chars: 37.19M

Segments: 938.52k

CLEANED:

Docs: 106

Words: 31.94k

Chars: 185.55k

Segments: 1.36k

Source: CC/IA

DEDUPLICATED:

Docs: 7.57M

Words: 1.93B

Chars: 15.26B

Segments: 195.10M

CLEANED:

Docs: 3.34M

Words: 1.24B

Chars: 10.16B

Segments: 63.72M

4 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 5.16M

Words: 2.00B

Chars: 15.35B

Segments: 151.47M

CLEANED:

Docs: 2.64M

Words: 1.41B

Chars: 11.13B

Segments: 81.01M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 228.24k

Words: 25.61M

Chars: 132.74M

Segments: 2.95M

CLEANED:

Docs: 7.08k

Words: 4.26M

Chars: 20.91M

Segments: 46.79k

Source: CC/IA

DEDUPLICATED:

Docs: 18.25k

Words: 3.31M

Chars: 17.88M

Segments: 422.15k

CLEANED:

Docs: 1.96k

Words: 1.14M

Chars: 6.15M

Segments: 43.91k

Source: CC/IA

DEDUPLICATED:

Docs: 3.63M

Words: 1.72B

Chars: 11.89B

Segments: 88.34M

CLEANED:

Docs: 2.12M

Words: 1.34B

Chars: 9.33B

Segments: 53.47M

4 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 2.30M

Words: 210.16M

Chars: 3.59B

Segments: 38.34M

CLEANED:

Docs: 700.99k

Words: 113.80M

Chars: 2.12B

Segments: 9.86M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 82.34k

Words: 7.70M

Chars: 60.74M

Segments: 1.54M

CLEANED:

Docs: 4.00k

Words: 1.43M

Chars: 9.30M

Segments: 51.93k

Source: CC/IA

DEDUPLICATED:

Docs: 2.59M

Words: 147.93M

Chars: 1.35B

Segments: 26.79M

CLEANED:

Docs: 92.70k

Words: 50.74M

Chars: 367.20M

Segments: 1.92M

Source: CC/IA

DEDUPLICATED:

Docs: 1.59M

Words: 338.52M

Chars: 2.60B

Segments: 25.49M

CLEANED:

Docs: 676.11k

Words: 246.66M

Chars: 1.93B

Segments: 10.04M

4 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 70.13k

Words: 15.85M

Chars: 109.65M

Segments: 4.92M

CLEANED:

Docs: 531

Words: 383.09k

Chars: 2.07M

Segments: 11.80k

Source: CC/IA

DEDUPLICATED:

Docs: 714.99k

Words: 242.64M

Chars: 1.40B

Segments: 12.67M

CLEANED:

Docs: 364.35k

Words: 195.87M

Chars: 1.12B

Segments: 7.15M

Source: CC/IA

DEDUPLICATED:

Docs: 2.57k

Words: 1.01M

Chars: 4.58M

Segments: 171.98k

CLEANED:

Docs: 245

Words: 262.00k

Chars: 1.30M

Segments: 10.83k

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 41.67k

Words: 8.94M

Chars: 51.16M

Segments: 1.23M

CLEANED:

Docs: 2.47k

Words: 2.41M

Chars: 11.95M

Segments: 10.52k

Source: CC/IA

DEDUPLICATED:

Docs: 53.30k

Words: 4.11M

Chars: 25.66M

Segments: 626.53k

CLEANED:

Docs: 2.54k

Words: 1.94M

Chars: 11.28M

Segments: 47.48k

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 98.75M

Words: 30.97B

Chars: 144.91B

Segments: 3.48B

CLEANED:

Docs: 38.87M

Words: 19.69B

Chars: 89.27B

Segments: 1.36B

11 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 624.70k

Words: 66.19M

Chars: 931.66M

Segments: 9.36M

CLEANED:

Docs: 29.50k

Words: 5.18M

Chars: 84.71M

Segments: 319.95k

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 659.62k

Words: 157.11M

Chars: 908.36M

Segments: 29.25M

CLEANED:

Docs: 8.37k

Words: 5.59M

Chars: 31.47M

Segments: 157.72k

Source: CC/IA

DEDUPLICATED:

Docs: 11.11M

Words: 1.64B

Chars: 11.19B

Segments: 347.02M

CLEANED:

Docs: 367.93k

Words: 180.62M

Chars: 1.13B

Segments: 7.14M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 1.11M

Words: 68.92M

Chars: 434.78M

Segments: 17.96M

CLEANED:

Docs: 7.59k

Words: 5.55M

Chars: 32.93M

Segments: 200.34k

Source: CC/IA

DEDUPLICATED:

Docs: 35.68M

Words: 10.03B

Chars: 76.42B

Segments: 888.10M

CLEANED:

Docs: 13.34M

Words: 6.68B

Chars: 50.41B

Segments: 322.16M

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 1.98M

Words: 626.58M

Chars: 5.19B

Segments: 108.78M

CLEANED:

Docs: 146.16k

Words: 59.64M

Chars: 345.51M

Segments: 2.12M

Source: CC/IA

DEDUPLICATED:

Docs: 484.07k

Words: 69.01M

Chars: 569.86M

Segments: 21.60M

CLEANED:

Docs: 9.21k

Words: 3.79M

Chars: 26.89M

Segments: 151.38k

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 2.39M

Words: 419.25M

Chars: 2.61B

Segments: 83.31M

CLEANED:

Docs: 246.93k

Words: 107.22M

Chars: 710.65M

Segments: 5.06M

5 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 104.57k

Words: 7.24M

Chars: 47.21M

Segments: 1.69M

CLEANED:

Docs: 1.08k

Words: 1.37M

Chars: 9.01M

Segments: 38.69k

Source: CC/IA

DEDUPLICATED:

Docs: 321.60k

Words: 28.42M

Chars: 267.44M

Segments: 3.56M

CLEANED:

Docs: 21.28k

Words: 9.18M

Chars: 67.99M

Segments: 407.54k

Source: CC/IA

DEDUPLICATED:

Docs: 190.37k

Words: 19.93M

Chars: 149.68M

Segments: 3.19M

CLEANED:

Docs: 4.15k

Words: 3.73M

Chars: 20.33M

Segments: 84.12k

Source: CC/IA

DEDUPLICATED:

Docs: 1.22M

Words: 221.83M

Chars: 1.31B

Segments: 25.88M

CLEANED:

Docs: 160.38k

Words: 125.20M

Chars: 652.17M

Segments: 3.43M

Source: CC/IA

DEDUPLICATED:

Docs: 23.04M

Words: 6.26B

Chars: 47.62B

Segments: 656.95M

CLEANED:

Docs: 6.77M

Words: 3.46B

Chars: 25.19B

Segments: 173.81M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 8.95M

Words: 4.43B

Chars: 50.01B

Segments: 35.61M

CLEANED:

Docs: 328

Words: 890.63k

Chars: 4.28M

Segments: 19.29k

Source: CC/IA

DEDUPLICATED:

Docs: 116.18k

Words: 33.60M

Chars: 170.03M

Segments: 2.12M

CLEANED:

Docs: 24.98k

Words: 17.79M

Chars: 96.77M

Segments: 645.53k

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 4.59M

Words: 1.20B

Chars: 11.46B

Segments: 76.76M

CLEANED:

Docs: 3.10M

Words: 973.66M

Chars: 9.49B

Segments: 48.00M

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 3.32M

Words: 1.25B

Chars: 8.35B

Segments: 68.56M

CLEANED:

Docs: 2.08M

Words: 980.75M

Chars: 6.62B

Segments: 36.32M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 901.19k

Words: 137.36M

Chars: 1.07B

Segments: 23.70M

CLEANED:

Docs: 25.04k

Words: 10.98M

Chars: 74.80M

Segments: 600.80k

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 7.71M

Words: 2.08B

Chars: 13.39B

Segments: 164.74M

CLEANED:

Docs: 3.57M

Words: 1.49B

Chars: 9.44B

Segments: 57.01M

10 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 7.75M

Words: 930.52M

Chars: 6.97B

Segments: 129.16M

CLEANED:

Docs: 367.26k

Words: 195.81M

Chars: 1.44B

Segments: 8.68M

Source: CC/IA

DEDUPLICATED:

Docs: 14.18k

Words: 3.78M

Chars: 26.58M

Segments: 612.73k

CLEANED:

Docs: 2.93k

Words: 1.63M

Chars: 11.79M

Segments: 65.76k

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 220.45k

Words: 37.80M

Chars: 151.08M

Segments: 6.45M

CLEANED:

Docs: 931

Words: 807.49k

Chars: 3.86M

Segments: 19.10k

Source: CC/IA

DEDUPLICATED:

Docs: 665.39k

Words: 157.42M

Chars: 829.14M

Segments: 13.43M

CLEANED:

Docs: 108.26k

Words: 86.76M

Chars: 424.40M

Segments: 2.80M

Source: CC/IA

DEDUPLICATED:

Docs: 2.98M

Words: 787.24M

Chars: 9.81B

Segments: 75.11M

CLEANED:

Docs: 1.37M

Words: 453.18M

Chars: 5.82B

Segments: 30.50M

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 303.51M

Words: 103.15B

Chars: 661.41B

Segments: 8.06B

CLEANED:

Docs: 138.65M

Words: 71.40B

Chars: 451.22B

Segments: 3.07B

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 15.13M

Words: 1.83B

Chars: 11.86B

Segments: 224.17M

CLEANED:

Docs: 1.42M

Words: 860.34M

Chars: 5.41B

Segments: 34.60M

5 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 64.72M

Words: 31.42B

Chars: 200.54B

Segments: 2.00B

CLEANED:

Docs: 27.05M

Words: 21.53B

Chars: 133.27B

Segments: 675.97M

5 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 4.03M

Words: 1.35B

Chars: 8.70B

Segments: 54.53M

CLEANED:

Docs: 2.78M

Words: 1.13B

Chars: 7.26B

Segments: 37.14M

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 409.23k

Words: 21.19M

Chars: 142.60M

Segments: 3.67M

CLEANED:

Docs: 6.07k

Words: 5.32M

Chars: 27.50M

Segments: 143.31k

Source: CC/IA

DEDUPLICATED:

Docs: 11.18k

Words: 5.20M

Chars: 44.57M

Segments: 275.47k

CLEANED:

Docs: 272

Words: 393.16k

Chars: 1.88M

Segments: 8.51k

Source: CC/IA

DEDUPLICATED:

Docs: 371.76k

Words: 42.92M

Chars: 318.50M

Segments: 7.12M

CLEANED:

Docs: 53.12k

Words: 27.06M

Chars: 202.97M

Segments: 1.34M

Source: CC/IA

DEDUPLICATED:

Docs: 3.60M

Words: 536.38M

Chars: 3.32B

Segments: 86.55M

CLEANED:

Docs: 189.91k

Words: 102.72M

Chars: 635.59M

Segments: 4.19M

Source: CC/IA

DEDUPLICATED:

Docs: 587.96k

Words: 145.78M

Chars: 947.33M

Segments: 5.59M

CLEANED:

Docs: 412.89k

Words: 120.13M

Chars: 781.95M

Segments: 3.60M

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 1.42M

Words: 155.36M

Chars: 889.39M

Segments: 46.27M

CLEANED:

Docs: 6.90k

Words: 5.66M

Chars: 33.53M

Segments: 85.83k

Source: CC/IA

DEDUPLICATED:

Docs: 1.05M

Words: 517.32M

Chars: 2.67B

Segments: 34.52M

CLEANED:

Docs: 584.59k

Words: 372.17M

Chars: 1.90B

Segments: 11.74M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 8.51M

Words: 416.98M

Chars: 2.46B

Segments: 89.44M

CLEANED:

Docs: 89.81k

Words: 46.71M

Chars: 254.18M

Segments: 1.39M

Source: CC/IA

DEDUPLICATED:

Docs: 769.00k

Words: 334.95M

Chars: 1.59B

Segments: 13.49M

CLEANED:

Docs: 466.47k

Words: 279.44M

Chars: 1.30B

Segments: 8.45M

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 196.84M

Words: 124.37B

Chars: 644.49B

Segments: 7.03B

CLEANED:

Docs: 90.50M

Words: 88.55B

Chars: 455.15B

Segments: 3.96B

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 492.55k

Words: 152.73M

Chars: 1.05B

Segments: 10.18M

CLEANED:

Docs: 207.84k

Words: 117.08M

Chars: 810.51M

Segments: 4.74M

Source: CC/IA

DEDUPLICATED:

Docs: 382.38M

Words: 136.50B

Chars: 948.27B

Segments: 12.72B

CLEANED:

Docs: 175.41M

Words: 89.53B

Chars: 631.77B

Segments: 4.46B

7 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 470.63M

Words: 203.71B

Chars: 1.26T

Segments: 14.18B

CLEANED:

Docs: 237.81M

Words: 146.27B

Chars: 896.79B

Segments: 6.12B

10 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 12.33M

Words: 5.21B

Chars: 28.74B

Segments: 413.50M

CLEANED:

Docs: 2.84M

Words: 1.84B

Chars: 9.57B

Segments: 69.00M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 405.81k

Words: 44.88M

Chars: 327.42M

Segments: 5.30M

CLEANED:

Docs: 36.94k

Words: 17.31M

Chars: 143.45M

Segments: 494.25k

5 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 115.84M

Words: 52.11B

Chars: 329.18B

Segments: 3.37B

CLEANED:

Docs: 65.88M

Words: 40.05B

Chars: 250.72B

Segments: 1.70B

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 1.70M

Words: 232.01M

Chars: 1.69B

Segments: 36.91M

CLEANED:

Docs: 137.30k

Words: 44.44M

Chars: 316.63M

Segments: 1.75M

Source: CC/IA

DEDUPLICATED:

Docs: 1.64B

Words: 696.30B

Chars: 5.01T

Segments: 49.90B

CLEANED:

Docs: 884.69M

Words: 540.88B

Chars: 3.91T

Segments: 26.29B

6 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 864.29k

Words: 136.25M

Chars: 760.17M

Segments: 39.56M

CLEANED:

Docs: 3.16k

Words: 3.61M

Chars: 16.74M

Segments: 51.90k

Source: CC/IA

DEDUPLICATED:

Docs: 200.47k

Words: 95.80M

Chars: 746.25M

Segments: 11.58M

CLEANED:

Docs: 54.91k

Words: 43.80M

Chars: 359.21M

Segments: 3.28M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 9.40k

Words: 2.88M

Chars: 16.98M

Segments: 217.79k

CLEANED:

Docs: 2.57k

Words: 1.09M

Chars: 6.27M

Segments: 45.80k

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 2.42M

Words: 433.55M

Chars: 3.24B

Segments: 116.89M

CLEANED:

Docs: 81.97k

Words: 42.39M

Chars: 252.40M

Segments: 1.65M

Source: CC/IA

DEDUPLICATED:

Docs: 20.87k

Words: 2.99M

Chars: 33.50M

Segments: 406.32k

CLEANED:

Docs: 6.00k

Words: 1.65M

Chars: 21.22M

Segments: 92.14k

Source: CC/IA

DEDUPLICATED:

Docs: 2.04M

Words: 1.03B

Chars: 6.61B

Segments: 54.63M

CLEANED:

Docs: 1.15M

Words: 795.62M

Chars: 4.98B

Segments: 33.71M

7 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 68.41M

Words: 20.32B

Chars: 137.70B

Segments: 2.41B

CLEANED:

Docs: 21.83M

Words: 10.63B

Chars: 70.39B

Segments: 494.28M

4 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 30.31M

Words: 9.83B

Chars: 68.36B

Segments: 1.01B

CLEANED:

Docs: 10.28M

Words: 5.43B

Chars: 35.27B

Segments: 238.64M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 295.95k

Words: 85.62M

Chars: 507.10M

Segments: 6.78M

CLEANED:

Docs: 45.86k

Words: 37.09M

Chars: 186.19M

Segments: 1.01M

Source: CC/IA

DEDUPLICATED:

Docs: 1.08M

Words: 72.57M

Chars: 631.56M

Segments: 10.82M

CLEANED:

Docs: 61.08k

Words: 23.92M

Chars: 192.68M

Segments: 1.20M

Source: CC/IA

DEDUPLICATED:

Docs: 230.24k

Words: 115.44M

Chars: 626.70M

Segments: 5.58M

CLEANED:

Docs: 100.30k

Words: 89.53M

Chars: 428.73M

Segments: 2.83M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 3.12M

Words: 661.66M

Chars: 5.30B

Segments: 76.87M

CLEANED:

Docs: 966.51k

Words: 388.75M

Chars: 2.57B

Segments: 16.38M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 288.14k

Words: 56.98M

Chars: 332.29M

Segments: 8.03M

CLEANED:

Docs: 43.92k

Words: 31.00M

Chars: 171.54M

Segments: 1.09M

Source: CC/IA

DEDUPLICATED:

Docs: 838.93M

Words: 414.23B

Chars: 2.53T

Segments: 22.17B

CLEANED:

Docs: 503.07M

Words: 321.95B

Chars: 1.95T

Segments: 12.12B

7 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 32.04M

Words: 3.71B

Chars: 25.18B

Segments: 675.85M

CLEANED:

Docs: 53.81k

Words: 23.89M

Chars: 148.80M

Segments: 917.09k

Source: CC/IA

DEDUPLICATED:

Docs: 9.18M

Words: 3.83B

Chars: 26.69B

Segments: 249.63M

CLEANED:

Docs: 4.12M

Words: 2.52B

Chars: 16.16B

Segments: 93.81M

6 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 262.40k

Words: 19.25M

Chars: 219.37M

Segments: 5.11M

CLEANED:

Docs: 2.04k

Words: 994.30k

Chars: 8.82M

Segments: 62.13k

Source: CC/IA

DEDUPLICATED:

Docs: 2.89M

Words: 505.50M

Chars: 3.45B

Segments: 83.33M

CLEANED:

Docs: 114.75k

Words: 69.63M

Chars: 475.44M

Segments: 3.24M

4 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 157.92M

Words: 58.47B

Chars: 374.21B

Segments: 4.81B

CLEANED:

Docs: 66.81M

Words: 40.10B

Chars: 251.18B

Segments: 1.75B

9 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 4.42M

Words: 1.15B

Chars: 7.55B

Segments: 95.87M

CLEANED:

Docs: 1.37M

Words: 717.65M

Chars: 4.67B

Segments: 34.31M

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 1.25M

Words: 497.28M

Chars: 3.75B

Segments: 73.39M

CLEANED:

Docs: 40.93k

Words: 14.68M

Chars: 103.88M

Segments: 636.57k

Source: CC/IA

DEDUPLICATED:

Docs: 9.73M

Words: 3.96B

Chars: 34.09B

Segments: 322.60M

CLEANED:

Docs: 6.11M

Words: 2.98B

Chars: 26.24B

Segments: 168.59M

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 253.19k

Words: 62.62M

Chars: 452.93M

Segments: 30.90M

CLEANED:

Docs: 1.75k

Words: 1.54M

Chars: 8.85M

Segments: 13.88k

2 downloads

Tamasheq (Tifinagh) (taq_Tfng)

Creative Commons CC0 license

Source: CC/IA

DEDUPLICATED:

Docs: 101

Words: 21.32k

Chars: 149.82k

Segments: 1.08k

Central Atlas Tamazight (Tifinagh) (tzm_Tfng)

Creative Commons CC0 license

Source: CC/IA

DEDUPLICATED:

Docs: 5.17k

Words: 1.24M

Chars: 15.97M

Segments: 324.49k

Source: CC/IA

DEDUPLICATED:

Docs: 1.90M

Words: 452.35M

Chars: 3.21B

Segments: 36.63M

CLEANED:

Docs: 630.68k

Words: 296.70M

Chars: 2.16B

Segments: 13.45M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 3.20M

Words: 1.05B

Chars: 8.06B

Segments: 70.31M

CLEANED:

Docs: 2.06M

Words: 835.42M

Chars: 6.51B

Segments: 39.19M

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 3.41M

Words: 878.17M

Chars: 6.17B

Segments: 86.32M

CLEANED:

Docs: 1.26M

Words: 624.76M

Chars: 4.59B

Segments: 24.85M

Source: CC/IA

DEDUPLICATED:

Docs: 8.57M

Words: 3.40B

Chars: 23.06B

Segments: 321.67M

CLEANED:

Docs: 1.87M

Words: 1.35B

Chars: 8.13B

Segments: 52.88M

4 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 81.71M

Words: 11.59B

Chars: 155.94B

Segments: 2.18B

CLEANED:

Docs: 17.70M

Words: 3.51B

Chars: 59.99B

Segments: 339.05M

10 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 1.03M

Words: 194.51M

Chars: 2.31B

Segments: 38.58M

CLEANED:

Docs: 64.69k

Words: 36.72M

Chars: 181.70M

Segments: 1.13M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 1.16M

Words: 79.19M

Chars: 453.63M

Segments: 16.46M

CLEANED:

Docs: 13.98k

Words: 12.51M

Chars: 64.54M

Segments: 282.37k

Source: CC/IA

DEDUPLICATED:

Docs: 361.19k

Words: 22.20M

Chars: 168.50M

Segments: 5.43M

CLEANED:

Docs: 6.05k

Words: 5.27M

Chars: 27.68M

Segments: 132.17k

2 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 193.13k

Words: 21.88M

Chars: 151.72M

Segments: 3.37M

CLEANED:

Docs: 11.01k

Words: 8.67M

Chars: 49.30M

Segments: 221.25k

Source: CC/IA

DEDUPLICATED:

Docs: 806.53k

Words: 230.24M

Chars: 1.65B

Segments: 30.73M

CLEANED:

Docs: 171.04k

Words: 70.68M

Chars: 570.17M

Segments: 3.36M

Source: CC/IA

DEDUPLICATED:

Docs: 59.60k

Words: 16.12M

Chars: 142.19M

Segments: 2.33M

CLEANED:

Docs: 4.38k

Words: 2.88M

Chars: 21.10M

Segments: 99.01k

Source: CC/IA

DEDUPLICATED:

Docs: 236.66M

Words: 73.92B

Chars: 543.97B

Segments: 5.87B

CLEANED:

Docs: 116.57M

Words: 51.67B

Chars: 389.75B

Segments: 2.57B

8 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 619.40k

Words: 36.55M

Chars: 258.14M

Segments: 6.27M

CLEANED:

Docs: 5.86k

Words: 4.70M

Chars: 24.18M

Segments: 125.61k

Source: CC/IA

DEDUPLICATED:

Docs: 1.09M

Words: 317.32M

Chars: 2.48B

Segments: 21.44M

CLEANED:

Docs: 442.40k

Words: 223.91M

Chars: 1.75B

Segments: 8.98M

5 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 81.51M

Words: 31.90B

Chars: 231.82B

Segments: 2.09B

CLEANED:

Docs: 47.40M

Words: 25.23B

Chars: 182.92B

Segments: 1.17B

5 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 62.75k

Words: 6.39M

Chars: 43.81M

Segments: 1.09M

CLEANED:

Docs: 2.47k

Words: 2.43M

Chars: 15.41M

Segments: 59.91k

Source: CC/IA

DEDUPLICATED:

Docs: 7.16M

Words: 3.28B

Chars: 15.89B

Segments: 229.88M

CLEANED:

Docs: 3.19M

Words: 2.13B

Chars: 10.01B

Segments: 50.63M

6 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 9.75M

Words: 1.24B

Chars: 10.34B

Segments: 198.89M

CLEANED:

Docs: 706.92k

Words: 351.32M

Chars: 2.85B

Segments: 14.80M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 11.52M

Words: 1.61B

Chars: 10.70B

Segments: 425.91M

CLEANED:

Docs: 84.81k

Words: 35.25M

Chars: 218.06M

Segments: 1.58M

1 download

Source: CC/IA

DEDUPLICATED:

Docs: 174.14M

Words: 105.24B

Chars: 491.19B

Segments: 5.35B

CLEANED:

Docs: 100.75M

Words: 83.20B

Chars: 379.59B

Segments: 3.02B

7 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 286.40k

Words: 23.27M

Chars: 136.08M

Segments: 3.12M

CLEANED:

Docs: 13.87k

Words: 5.89M

Chars: 35.59M

Segments: 200.94k

Source: CC/IA

DEDUPLICATED:

Docs: 1.97M

Words: 38.69M

Chars: 246.24M

Segments: 10.43M

CLEANED:

Docs: 5.68k

Words: 5.46M

Chars: 27.55M

Segments: 161.47k

Source: CC/IA

DEDUPLICATED:

Docs: 3.41M

Words: 494.95M

Chars: 6.62B

Segments: 56.47M

CLEANED:

Docs: 63.09k

Words: 30.34M

Chars: 258.73M

Segments: 1.82M

Source: CC/IA

DEDUPLICATED:

Docs: 414.98k

Words: 150.99M

Chars: 911.06M

Segments: 9.48M

CLEANED:

Docs: 128.26k

Words: 77.53M

Chars: 458.62M

Segments: 2.94M

Source: CC/IA

DEDUPLICATED:

Docs: 1.89M

Words: 246.78M

Chars: 1.50B

Segments: 30.70M

CLEANED:

Docs: 66.13k

Words: 42.81M

Chars: 217.89M

Segments: 1.47M

Source: CC/IA

DEDUPLICATED:

Docs: 3.77M

Words: 235.77M

Chars: 2.27B

Segments: 131.06M

CLEANED:

Docs: 61.29k

Words: 3.27M

Chars: 74.36M

Segments: 1.24M

4 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 2.71B

Words: 149.31B

Chars: 3.67T

Segments: 76.75B

CLEANED:

Docs: 1.25B

Words: 74.01B

Chars: 2.35T

Segments: 42.45B

23 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 321.27M

Words: 17.31B

Chars: 417.51B

Segments: 7.82B

CLEANED:

Docs: 157.11M

Words: 9.51B

Chars: 286.98B

Segments: 4.48B

6 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 69.23M

Words: 28.66B

Chars: 193.52B

Segments: 3.11B

CLEANED:

Docs: 18.42M

Words: 11.48B

Chars: 78.45B

Segments: 579.82M

3 downloads

Source: CC/IA

DEDUPLICATED:

Docs: 1.31M

Words: 156.42M

Chars: 1.33B

Segments: 33.14M

CLEANED:

Docs: 113.62k

Words: 44.36M

Chars: 380.92M

Segments: 2.71M

2 downloads
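The counts in the listing above are abbreviated with k, M, B and T suffixes. As a minimal sketch, assuming these suffixes are decimal multipliers (thousand, million, billion, trillion), the snippet below converts them to numbers and estimates what share of the deduplicated variant survives cleaning for one entry. The names `parse_count` and `retention` are hypothetical helpers introduced for illustration and are not part of any HPLT tooling.

```python
# Assumed decimal multipliers for the abbreviated counts in the listing.
SUFFIXES = {"k": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_count(text: str) -> float:
    """Convert a string such as '113.62k' or '101' to a plain number."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def retention(dedup: str, cleaned: str) -> float:
    """Fraction of the deduplicated count that remains in the cleaned variant."""
    return parse_count(cleaned) / parse_count(dedup)


# Figures copied from the last entry in the listing above.
docs_kept = retention("1.31M", "113.62k")
words_kept = retention("156.42M", "44.36M")
print(f"docs kept: {docs_kept:.1%}, words kept: {words_kept:.1%}")
```

Because the published figures are rounded to two decimals, ratios computed this way are approximate and best used for rough comparison between languages.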

License

These data are released under this licensing scheme:

Notice and take down policy

Notice: If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • Contact us at hplt-datasets@ufal.mff.cuni.cz

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.

*It is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.
