THIVLVC: Latin ByT5 Lemmatizer

THIVLVC is a state-of-the-art Latin lemmatizer based on the ByT5 (base) architecture. It was developed at LISN (CNRS) to provide a high-performance, unified model for diverse Latin corpora.
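
Because ByT5 operates directly on UTF-8 bytes rather than on a subword vocabulary, a single model can absorb the orthographic variation found across classical, medieval, and late Latin without out-of-vocabulary issues. As a minimal sketch of what byte-level tokenization looks like in practice (the IDs shown follow ByT5's fixed byte-to-ID mapping):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Zual/THIVLVC")
# ByT5 has no learned vocabulary: each UTF-8 byte maps to byte_value + 3,
# since IDs 0-2 are reserved for <pad>, </s>, and <unk>; </s> is appended.
print(tokenizer("amor").input_ids)
# [100, 112, 114, 117, 1]  ->  bytes of "amor" (97, 109, 111, 114) + 3, then </s> (1)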

Performance Analysis

The following table compares THIVLVC's lemmatization accuracy against widely used baselines on the five Universal Dependencies (UD) Latin treebanks.

Benchmark            THIVLVC    UDPipe 2.0    Trankit (XLM-R)    Stanza (v1.5)    GreTa (T5)
Perseus (Poetry)     93.48%     91.04%        70.34%             91.44%           91.14%
UDante (Medieval)    85.85%     84.80%        -                  78.08%           -
PROIEL (Classical)   97.29%     96.65%        97.21%             90.88%           -
ITTB (Scholastic)    98.64%     99.03%        99.13%             96.50%           -
LLCT (Late Latin)    88.92%     97.40%        96.20%             97.10%           -

THIVLVC achieves state-of-the-art results on three of the five benchmarks: Perseus (classical poetry), UDante (medieval prose), and PROIEL (biblical/classical texts). It is particularly effective on complex literary and medieval material; on ITTB and LLCT, the established pipelines remain ahead.

Usage

Important: For best results, especially on short sentences or fragments, use beam search (num_beams=5).

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Zual/THIVLVC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def lemmatize(text):
    inputs = tokenizer(text, return_tensors="pt")
    # Using beam search (num_beams=5) for better accuracy
    outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
print(lemmatize("Amorem canat")) 
# Expected Output: "amor cano"
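
For larger inputs, multiple sentences can be batched through the same pipeline. A minimal sketch reusing the model and tokenizer loaded above (the padding call is standard transformers usage, not specific to this model):

# Batched lemmatization, reusing `model` and `tokenizer` from above
sentences = ["Amorem canat", "Arma virumque cano"]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))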

This model was produced by Luc Pommeret at LISN (CNRS, Université Paris-Saclay).

Model size: 0.6B parameters (F32, Safetensors)