
High Performance
Language Technologies


DATASETS AVAILABLE

A space that combines petabytes of natural language data with large-scale model training

Lots of monolingual and multilingual data consistently formatted and curated

Efficient and high-quality language and translation models

Sustainable and reusable workflows using high-performance computing

HPLT's factsheet

fair

Our data and models will be shared through FAIR repositories, catalogues and marketplaces for easy discovery, access, replication and exploitation.

transparent

Our models will be reproducible with information and evaluation metrics shown in publicly available dashboards and leaderboards.

high-quality

Consistent cleaning, anonymization, bias-reduction, and metadata routines will enhance the quality and ethical properties of texts.

efficient

Our models will make use of NLP-aware supercomputing power in HPC centres to produce efficient models and pipelines.

Contributed Datasets

We would like to thank the following institutions for their contributed datasets:

  • Institute of the Estonian Language contributed several versions of the Estonian National Corpus in a format suitable for running the HPLT cleaning tools. We redistribute both the contributed datasets and the HPLT-cleaned versions under the original CC BY license.

Estonian National Corpus 19, 21 and 23 (original, under CC BY): 16.43M docs, 3.25B words

Estonian National Corpus 19, 21 and 23 (HPLT cleaning applied): 11.50M docs, 2.95B words

Success stories

Dataset

HPLT curation: Institute of the Estonian Language Corpus

Institute of the Estonian Language contributed several versions of the Estonian National Corpus in a format suitable for running the HPLT cleaning tools. We redistribute both the contributed datasets and the HPLT-cleaned versions under the original CC BY license.

Dataset

CulturaY: a refiltered HPLT dataset

From the team that brought you CulturaX, we present CulturaY, another substantial multilingual dataset of 15TB (uncompressed) / 3TB (zstd-compressed) that applies the same dataset cleaning methodology to the HPLT v1.1 dataset. Please note that HPLT v1.2 has also been released and is an alternative version with different cleaning methodologies. This data was used in part to train our SOTA Vietnamese model: Vistral-7B-Chat. https://huggingface.co/datasets/ontocord/CulturaY

Dataset

CulturaP: a more permissive HPLT dataset

From the team that brought you CulturaX and CulturaY, we present CulturaP, a filtered subset of the multilingual dataset CulturaY that we believe is more likely to be copyright-permissive and usable. CulturaY is in turn based on the HPLT v1.1 dataset. Ultimately, this dataset is based on Common Crawl and the Internet Archive. https://huggingface.co/datasets/ontocord/CulturaP

Dataset

HuggingFace-friendly HPLT Datasets

This repository contains the means to access the datasets created by the HPLT project. These large-scale web-crawled corpora based on Common Crawl and the Internet Archive are accessible in 75 languages. The full dump is available as well as deduplicated and further cleaned versions, depending on the config that you use (see the usage example below). https://huggingface.co/datasets/HPLT/hplt_monolingual_v1_2
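
The snippet below is a minimal sketch of how such a config-based load might look with the Hugging Face datasets library. The repository name is taken from the link above, but the language config ("et"), split name and record fields are assumptions; consult the dataset card for the exact identifiers of the full, deduplicated and cleaned variants.

# Minimal sketch (assumptions marked): stream one language of the HPLT
# monolingual v1.2 release with the Hugging Face `datasets` library.
from datasets import load_dataset

# "et" and split="train" are assumed names -- check the dataset card for
# the exact config names of the full, deduplicated and cleaned variants.
ds = load_dataset("HPLT/hplt_monolingual_v1_2", "et", split="train", streaming=True)

for i, doc in enumerate(ds):
    print(doc)   # each record is one crawled document with its text and metadata
    if i == 2:   # peek at the first few documents only
        break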

LLM Models

HPLT-based NORA.LLM used by Schibsted Media

The Norwegian media group Schibsted Media uses the NORA.LLM language models in its NLP pipelines for Norwegian. These models were trained on HPLT datasets (among other data).


KEY ASPECTS OF HPLT


What languages will HPLT cover?

We aim to cover around 80 languages: those for which we count at least 100 million words in large web-crawl collections. For more information, please see the language table in the About section.


Is HPLT planning to deliver new data sets?

Yes! We will explore 7PB from the Internet Archive collections and 5PB from Common Crawl, and we hope to deliver lots of new data. But not only that: we will also reprocess available datasets to enhance their quality.


Will HPLT train GPT, BERT or T5-like large language models?

Yes. We intend to train hundreds to thousands of large language models of different flavours. HPLT will give the NLP community access to a landscape of efficient and high-quality language models for a variety of languages.


Do HPLT's goals include machine translation models as well?

Sure. Efficient machine translation models at scale are one of the ambitions of this project. We want to release models that run on CPU, are easily reproducible, and are of the highest possible quality.


Can one contribute to HPLT?

Of course. If you have a dataset that you would like to contribute or that needs reprocessing by HPLT, please contact us. We will also need people to help us review corpora and model outputs for particular tasks. Just get in touch with us!


Can I use the data you are producing?

Yes, we will publish several growing versions of both monolingual and parallel plain-text datasets. They will be available to all. However, since we do not own the original data, it is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.


Why HPLT?

Because we want to make language modelling better, with consistent open datasets and reproducible, efficient models. Because we want HPC centres to be suitable for NLP processing at scale. And because we still need transparent large language models and machine translation models for many languages, to open research and business opportunities for them. If you are interested in the name, please visit the About section.

Stay up-to-date with us! Get information about new releases, content and more!
