Lots of monolingual and multilingual data, consistently formatted and curated
Efficient and high-quality language and translation models
Sustainable and reusable workflows using high-performance computing
Our data and models will be shared through FAIR repositories, catalogues and marketplaces for easy discovery, access, replication and exploitation.
Our models will be reproducible with information and evaluation metrics shown in publicly available dashboards and leaderboards.
Applying consistent cleaning, anonymization, bias-reduction, and metadata routines will enhance the quality and ethical properties of the texts.
We will make use of NLP-aware supercomputing power in HPC centres to produce efficient models and pipelines.
We would like to thank the contributing institutions for their datasets.
After a two-year pandemic hiatus, the NLPL network and Horizon Europe project High-Performance Language Technologies (HPLT) join f...
6-8 February, 2023
The 1st edition of the Workshop on Open Community-Driven Machine Translation (CrowdMT 2023) will be held in Tampere, Finland, on J...
15 June, 2023
From June 17th to 25th, 2023, the HPLT consortium will hold a hackathon around a set of topics related to corpora curation: language iden...
17-25 June, 2023
The purpose of the trainer is to provide the user with a flexible way of scheduling various sources of input data, as well as augm...
GitHub
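As a rough illustration of the data-scheduling idea described above, the sketch below mixes several training files according to sampling weights and applies a simple casing augmentation on the fly. It is a minimal, hypothetical example: the file names, weights, and the `uppercase_augment` helper are invented for illustration and do not reflect the tool's actual configuration or API.

```python
import random

# Hypothetical illustration of weighted scheduling over several input corpora,
# with a simple on-the-fly augmentation step; not the tool's real API.
sources = {
    "clean_parallel.tsv": 0.7,   # high-quality data, sampled most often
    "web_crawled.tsv": 0.2,      # noisier data, sampled less often
    "back_translated.tsv": 0.1,  # synthetic data, sampled least often
}

def uppercase_augment(line, probability=0.05):
    """Occasionally upper-case a line so the model sees varied casing."""
    return line.upper() if random.random() < probability else line

def schedule(sources, n_lines):
    """Yield n_lines training lines, drawing each line from a source
    chosen according to the configured sampling weights."""
    files = {path: open(path, encoding="utf-8") for path in sources}
    paths, weights = zip(*sources.items())
    try:
        for _ in range(n_lines):
            path = random.choices(paths, weights=weights, k=1)[0]
            line = files[path].readline()
            if not line:                      # restart a source when exhausted
                files[path].seek(0)
                line = files[path].readline()
            yield uppercase_augment(line.rstrip("\n"))
    finally:
        for f in files.values():
            f.close()
```

A caller would simply iterate over `schedule(sources, n_lines)` and feed the yielded lines to training.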
OpusCleaner is a machine translation/language model data cleaner and training scheduler. The training scheduler has moved to OpusT...
GitHub
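The sketch below illustrates the kind of rule-based filtering such a cleaner applies to parallel data: dropping empty, untranslated, overlong, or badly length-mismatched segment pairs. It is only an illustration under stated assumptions; the thresholds and the `keep` function are invented here and are not OpusCleaner's actual filters or configuration format.

```python
# Rough illustration of rule-based bitext filtering; thresholds and the
# keep() function are invented for this sketch and do not reflect
# OpusCleaner's actual filters.
def keep(src, tgt, max_ratio=2.0, max_len=200):
    """Decide whether a (source, target) sentence pair looks usable."""
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:                  # drop empty segments
        return False
    if src == tgt:                          # drop untranslated copies
        return False
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if max(src_len, tgt_len) > max_len:     # drop overly long segments
        return False
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_ratio               # drop mismatched lengths

def clean(pairs):
    """Filter an iterable of (source, target) sentence pairs."""
    return (pair for pair in pairs if keep(*pair))
```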
This tool provides a full range of analytics automatically computed on either monolingual or bilingual data sets to help making in...
GitHub
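To make the idea of automatically computed analytics concrete, the following minimal sketch derives a few basic descriptive statistics from a plain-text corpus. The specific metrics and their names are chosen for illustration only and are not necessarily those the tool reports.

```python
from collections import Counter

# Minimal, assumption-based sketch of corpus-level statistics; the metric
# names below are illustrative and not necessarily those the tool reports.
def corpus_stats(path):
    """Compute basic descriptive statistics for a plain-text corpus."""
    n_segments = n_tokens = 0
    vocabulary = Counter()
    with open(path, encoding="utf-8") as corpus:
        for line in corpus:
            tokens = line.split()
            n_segments += 1
            n_tokens += len(tokens)
            vocabulary.update(tokens)
    return {
        "segments": n_segments,
        "tokens": n_tokens,
        "avg_segment_length": n_tokens / n_segments if n_segments else 0.0,
        "vocabulary_size": len(vocabulary),
        "type_token_ratio": len(vocabulary) / n_tokens if n_tokens else 0.0,
        "top_tokens": vocabulary.most_common(10),
    }
```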
We aim to cover around 80 languages: those for which we count at least 100 million words in large web crawl collections. For more information, please see the language table in the About section.
Yes! We will explore 7 PB from the Internet Archive collections and 5 PB from Common Crawl, and we hope to deliver lots of new data. But not only that: we will also reprocess available datasets to enhance their quality.
That is exactly the plan! We intend to train hundreds to thousands of large language models of different flavours. HPLT will give the NLP community access to a landscape of efficient and high-quality language models for a variety of languages.
Sure. Efficient machine translation models at scale are one of the ambitions of this project. We want to release models that run on CPU, are easily reproducible, and are of the highest possible quality.
Of course. If you find a dataset that you would like to contribute, or one that needs reprocessing by HPLT, please contact us. We will also need people to help us review corpora and model outputs for particular tasks. Just get in touch with us!
Yes, we will publish several growing versions of both monolingual and parallel plain-text datasets. They will be available to all. However, since we do not own the original data, it is your responsibility to ensure that any use of the data complies with the applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation (GDPR), as amended.
Because we want to put language modelling on a firmer footing with consistent open datasets and reproducible, efficient models. Because we want HPC centres to be suitable for NLP processing at scale. And because many languages still lack transparent large language models and machine translation models that would open research and business opportunities for them. If you are interested in the name, please visit the About section.