EinaCat Machine Translation Model for the Life Sciences Domain

Model description

This model was trained from scratch using the MTUOC training framework and the Marian-NMT toolkit. A general Catalan-English model was first created using data from HPLT and NLLB, which comprised 16,037,694 sentence pairs after cleaning, rescoring, and language verification.

Afterwards, we curated a domain-specific corpus consisting of Wikipedia articles, Universitat Oberta de Catalunya teaching materials, and magazines from the Hemeroteca Científica Catalana of the Institut d’Estudis Catalans. This specialised dataset, containing 460,512 sentence pairs, was used to fine-tune the general model for the Life Sciences domain.

Intended uses and limitations

This model can be used for machine translation from Catalan to English, specifically optimised for texts in the Life Sciences domain.

How to use

You can download this model and run it with MTUOC-server, a dedicated serving framework for translation models that includes Marian NMT. MTUOC-server handles preprocessing, tokenization, and translation requests.

The server is available at MTUOC-server GitHub, where installation instructions, usage examples, and configuration guides are provided. Once the server is running, you can send sentences in Catalan and receive English translations generated by the model.
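As an illustration only, the minimal client below sends a sentence to a locally running MTUOC-server over HTTP. The port, endpoint path, and JSON field names (`/translate`, `src`, `tgt`) are placeholders, not the confirmed MTUOC-server API; consult the MTUOC-server documentation for the actual request format and supported protocols.

```python
# Minimal client sketch. The endpoint path, port, and JSON field names below
# are placeholders, not the confirmed MTUOC-server API; check the
# MTUOC-server documentation for the real protocol.
import requests

SERVER_URL = "http://localhost:8000/translate"  # assumed address and route

def translate(sentence: str) -> str:
    """Send one Catalan sentence and return the English translation."""
    response = requests.post(SERVER_URL, json={"src": sentence}, timeout=30)
    response.raise_for_status()
    return response.json().get("tgt", "")

if __name__ == "__main__":
    print(translate("La mitocòndria és l'orgànul responsable de la respiració cel·lular."))
```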

Limitations and bias

At the time of submission, no measures had been taken to estimate the bias or toxicity embedded in the model. We are aware that the model may be biased, and future research on this topic may lead to updates of this model card.

Training

Training data

The general model was trained on a combination of datasets including Opus and HPLT, which were cleaned, rescored using SBERT, and language-verified. Catalan segments were adapted to the new orthographic standard from the IEC, producing 16,037,694 parallel segments.

The model was fine-tuned with domain-specific corpora from:

  • Hemeroteca Científica Catalana (IEC)
  • Wikipedia
  • Universitat Oberta de Catalunya teaching materials
  • Aina subcorpora for Life Sciences and Medicine

Total: 460,512 sentence pairs.

Training procedure

Data preparation:
All datasets were deduplicated and filtered to remove sentence pairs with a cosine similarity below 0.75. Similarity scores were computed using the MTUOC rescoring system for the Wikipedia, Opus, and HPLT datasets, and LaBSE for the others.
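The sketch below shows this kind of similarity-based filtering using LaBSE through the sentence-transformers library; the actual MTUOC rescoring scripts may differ in batching, normalisation, and thresholding details.

```python
# Sketch of cosine-similarity filtering of parallel pairs with LaBSE.
# The threshold matches the value reported above; everything else is illustrative.
from sentence_transformers import SentenceTransformer, util

THRESHOLD = 0.75  # pairs below this cosine similarity are discarded

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_pairs(pairs):
    """Keep only (Catalan, English) pairs whose LaBSE embeddings are similar enough."""
    ca = [src for src, _ in pairs]
    en = [tgt for _, tgt in pairs]
    emb_ca = model.encode(ca, convert_to_tensor=True, normalize_embeddings=True)
    emb_en = model.encode(en, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(emb_ca, emb_en).diagonal()
    return [pair for pair, score in zip(pairs, scores) if score >= THRESHOLD]

kept = filter_pairs([
    ("Els ribosomes sintetitzen proteïnes.", "Ribosomes synthesise proteins."),
    ("Bon dia a tothom.", "The weather forecast for tomorrow."),  # likely filtered out
])
```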

All Catalan segments were then normalised to follow the IEC's updated grammar guidelines using an MTUOC script.

Tokenization:
All data was tokenized using SentencePiece with a shared encoder/decoder vocabulary of 64k tokens.
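For reference, this is roughly how a shared 64k SentencePiece vocabulary can be built over the concatenated Catalan and English training text; the file names and the unigram model type are assumptions, not the exact setup used here.

```python
# Sketch of training and applying a shared SentencePiece model.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.ca-en.txt",      # assumed: source and target sides concatenated
    model_prefix="spm_caen_64k",
    vocab_size=64000,             # shared encoder/decoder vocabulary
    model_type="unigram",         # assumed model type
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="spm_caen_64k.model")
print(sp.encode("La fotosíntesi converteix l'energia lumínica en energia química.", out_type=str))
```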

Hyperparameters:

| Parameter | Value |
| --- | --- |
| Embedding size | 512 |
| Feed-forward network dim | 2048 |
| Number of attention heads | 8 |
| Encoder layers | 6 |
| Decoder layers | 6 |
| Learning rate schedule | warmup (16k steps) + inverse-sqrt decay |
| Base learning rate | 0.0003 |
| Clip norm | 5 |
| Dropout | 0.1 |
| Label smoothing | 0.1 |
| Tied embeddings | True |
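For clarity, the schedule below is one common formulation of linear warmup followed by inverse-square-root decay with the values from the table; Marian's exact implementation of these options may differ in detail.

```python
# Illustrative warmup + inverse-sqrt learning rate schedule (not Marian's code).
import math

BASE_LR = 3e-4
WARMUP_STEPS = 16_000

def learning_rate(step: int) -> float:
    """Linear warmup to BASE_LR over WARMUP_STEPS, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * math.sqrt(WARMUP_STEPS / step)

print(learning_rate(8_000), learning_rate(16_000), learning_rate(64_000))
```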

Evaluation

Metrics: BLEU, chrF2, and TER, computed on a domain-specific evaluation corpus.
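These metrics can be reproduced with sacrebleu as sketched below; `hypotheses.en` and `references.en` are placeholder file names for the system output and the reference translations of the evaluation corpus.

```python
# Sketch of corpus-level scoring with sacrebleu (file names are placeholders).
import sacrebleu

with open("hypotheses.en", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("references.en", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs])   # beta=2 by default, i.e. chrF2
ter = sacrebleu.corpus_ter(hyps, [refs])

print(f"BLEU: {bleu.score:.1f}  chrF2: {chrf.score:.1f}  TER: {ter.score:.1f}")
```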

Results:

| System | BLEU | chrF2 | TER |
| --- | --- | --- | --- |
| Marian GEN | 55.3 | 75.5 | 32.2 |
| Marian GEN 2 | 65.3 | 81.9 | 24.4 |
| Marian FT2 24 | 70.0 | 84.1 | 21.4 |
| Google Translate | 71.6 | 85.6 | 20.0 |

The Marian FT2 24 model shows a significant improvement over the general model (+4.6 BLEU), confirming the effectiveness of domain-specific fine-tuning, while remaining slightly below Google Translate for this domain.

Additional information

Authors: Antoni Oliver and Gemma Segués, Universitat Oberta de Catalunya.
Contact: [email protected], [email protected]
Copyright: © 2025 Antoni Oliver and Gemma Segués, Universitat Oberta de Catalunya.
Funding: Supported by Institut d’Estudis Catalans through the project Eines d’intel·ligència artificial per al foment de la comunicació científica en català.
