EinaCat Machine Translation Model for the Life Sciences Domain
Model description
This model was trained from scratch using the MTUOC training framework and the Marian-NMT toolkit. A general Catalan-English model was first created using data from HPLT and NLLB, which comprised 16,037,694 sentence pairs after cleaning, rescoring, and language verification.
Afterwards, we curated a domain-specific corpus consisting of Wikipedia articles, Universitat Oberta de Catalunya teaching materials, and journals from the Hemeroteca Científica Catalana of the Institut d’Estudis Catalans. This specialised dataset, containing 460,512 sentence pairs, was used to fine-tune the general model for the Life Sciences domain.
Intended uses and limitations
This model can be used for machine translation from Catalan to English, specifically optimised for texts in the Life Sciences domain.
How to use
You can download this model and run it with MTUOC-server, a dedicated serving framework for translation models that includes Marian NMT. MTUOC-server handles preprocessing, tokenization, and translation requests.
The server is available in the MTUOC-server GitHub repository, where installation instructions, usage examples, and configuration guides are provided. Once the server is running, you can send sentences in Catalan and receive English translations generated by the model.
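As a minimal illustration, a client could be written in Python as sketched below. The port, endpoint path, and JSON field names are assumptions made for this example; check the MTUOC-server documentation for the actual request format.

```python
# Minimal client sketch for a locally running MTUOC-server instance.
# NOTE: the port, endpoint path, and JSON field names ("src", "tgt") are
# assumptions for illustration; see the MTUOC-server docs for the real API.
import requests

SERVER_URL = "http://localhost:8000/translate"  # assumed address of the running server


def translate(sentence_ca: str) -> str:
    """Send one Catalan sentence and return the English translation."""
    payload = {"id": 1, "src": sentence_ca}
    response = requests.post(SERVER_URL, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()["tgt"]


if __name__ == "__main__":
    print(translate("La mitocòndria és l'orgànul responsable de la respiració cel·lular."))
```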
Limitations and bias
At the time of submission, no measures have been taken to estimate the bias or toxicity embedded in the model. We are aware that the model may be biased; future research on this topic may lead to updates of this model card.
Training
Training data
The general model was trained on a combination of datasets, including OPUS and HPLT, which were cleaned, rescored using SBERT, and language-verified. Catalan segments were adapted to the new orthographic standard of the IEC, producing 16,037,694 parallel segments.
The model was fine-tuned with domain-specific corpora from:
- Hemeroteca Científica Catalana (IEC)
- Wikipedia
- Universitat Oberta de Catalunya teaching materials
- Aina subcorpora for Life Sciences and Medicine
Total: 460,512 sentence pairs.
Training procedure
Data preparation:
All datasets were deduplicated and filtered to remove sentence pairs with a cosine similarity below 0.75.
Similarity scores were computed using an MTUOC rescoring system for the Wikipedia, OPUS, and HPLT datasets, and LaBSE for the others.
All Catalan segments were then normalised to follow the IEC's updated grammar guidelines using an MTUOC script.
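For illustration only, the sketch below shows this kind of similarity filtering using the LaBSE model from the sentence-transformers library. The file names are hypothetical, and the per-pair encoding is a simplification of the actual MTUOC rescoring scripts.

```python
# Sketch of similarity-based filtering: keep pairs with cosine similarity >= 0.75.
# File names are illustrative; the real pipeline uses the MTUOC rescoring scripts
# (and LaBSE only for some of the corpora).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")
THRESHOLD = 0.75


def filter_corpus(src_path: str, tgt_path: str, out_path: str) -> None:
    """Write only the sentence pairs whose embeddings are similar enough."""
    with open(src_path, encoding="utf-8") as fs, \
         open(tgt_path, encoding="utf-8") as ft, \
         open(out_path, "w", encoding="utf-8") as fo:
        for src, tgt in zip(fs, ft):
            src, tgt = src.strip(), tgt.strip()
            emb = model.encode([src, tgt], convert_to_tensor=True,
                               normalize_embeddings=True)
            if util.cos_sim(emb[0], emb[1]).item() >= THRESHOLD:
                fo.write(f"{src}\t{tgt}\n")


filter_corpus("corpus.ca", "corpus.en", "corpus.filtered.tsv")
```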
Tokenization:
All data was tokenized using SentencePiece with a shared encoder/decoder vocabulary of 64k tokens.
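A minimal sketch of how such a shared 64k vocabulary can be built with the sentencepiece Python package follows. The file names and the unigram model type are assumptions, not necessarily the exact MTUOC configuration.

```python
# Train a single SentencePiece model on concatenated Catalan+English data so that
# encoder and decoder share one 64k-token vocabulary.
# File names and model_type are illustrative assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.ca-en.concat.txt",   # Catalan and English sides concatenated
    model_prefix="spm.caen.64k",
    vocab_size=64000,
    model_type="unigram",             # assumption; BPE would also be common
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="spm.caen.64k.model")
print(sp.encode("La cèl·lula eucariota conté un nucli.", out_type=str))
```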
Hyperparameters:
| Parameter | Value |
|---|---|
| Embedding size | 512 |
| Feed-forward network dim | 2048 |
| Number of attention heads | 8 |
| Encoder layers | 6 |
| Decoder layers | 6 |
| Learning rate schedule | warmup (16k) + inverse-sqrt decay |
| Base learning rate | 0.0003 |
| Clip norm | 5 |
| Dropout | 0.1 |
| Label smoothing | 0.1 |
| Tied embeddings | True |
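For orientation, the sketch below shows how these hyperparameters map onto Marian command-line options, launched from Python. The data and vocabulary paths are assumptions, and the actual MTUOC configuration may set additional options (batching, validation, GPU devices, etc.).

```python
# Sketch: launch Marian training with the hyperparameters listed above.
# Paths and anything not in the table are assumptions; the real MTUOC
# configuration may differ and will set further options.
import subprocess

marian_cmd = [
    "marian",
    "--type", "transformer",
    "--train-sets", "train.ca", "train.en",            # assumed file names
    "--vocabs", "spm.caen.64k.spm", "spm.caen.64k.spm",  # shared vocabulary
    "--dim-emb", "512",
    "--transformer-dim-ffn", "2048",
    "--transformer-heads", "8",
    "--enc-depth", "6",
    "--dec-depth", "6",
    "--learn-rate", "0.0003",
    "--lr-warmup", "16000",
    "--lr-decay-inv-sqrt", "16000",
    "--clip-norm", "5",
    "--transformer-dropout", "0.1",
    "--label-smoothing", "0.1",
    "--tied-embeddings-all",
    "--model", "model.caen.npz",
]
subprocess.run(marian_cmd, check=True)
```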
Evaluation
Metrics: BLEU, chrF2, TER on a domain-specific evaluation corpus.
Results:
| System | BLEU ↑ | chrF2 ↑ | TER ↓ |
|---|---|---|---|
| Marian GEN | 55.3 | 75.5 | 32.2 |
| Marian GEN 2 | 65.3 | 81.9 | 24.4 |
| Marian FT2 24 | 70.0 | 84.1 | 21.4 |
| Google Translate | 71.6 | 85.6 | 20.0 |
The fine-tuned Marian FT2 24 model shows a clear improvement over the best general model, Marian GEN 2 (+4.7 BLEU), confirming the effectiveness of domain-specific fine-tuning, while remaining slightly below Google Translate on this domain.
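For reference, metrics of this kind can be computed with the sacrebleu library as sketched below. The file names are assumptions; note that sacreBLEU's chrF metric uses β = 2 by default, i.e. chrF2.

```python
# Score system output against the reference test set with sacreBLEU.
# File names are illustrative assumptions.
from sacrebleu.metrics import BLEU, CHRF, TER

with open("test.hyp.en", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("test.ref.en", encoding="utf-8") as f:
    references = [[line.strip() for line in f]]  # a single reference stream

print(BLEU().corpus_score(hypotheses, references))
print(CHRF().corpus_score(hypotheses, references))  # default beta=2, i.e. chrF2
print(TER().corpus_score(hypotheses, references))
```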
Additional information
Authors: Antoni Oliver and Gemma Segués, Universitat Oberta de Catalunya.
Contact: [email protected], [email protected]
Copyright: © 2025 Antoni Oliver and Gemma Segués, Universitat Oberta de Catalunya.
Funding: Supported by the Institut d’Estudis Catalans through the project Eines d’intel·ligència artificial per al foment de la comunicació científica en català (artificial intelligence tools to foster scientific communication in Catalan).