# THIVLVC: Latin ByT5 Lemmatizer
THIVLVC is a state-of-the-art Latin lemmatizer based on the ByT5 (base) architecture. It was developed at LISN (CNRS) to provide a high-performance, unified model for diverse Latin corpora.
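ByT5 is tokenizer-free: it operates directly on UTF-8 bytes rather than subword pieces, which suits Latin's rich inflection and variable orthography. A minimal sketch of ByT5's byte-to-id scheme (ids 0–2 are reserved for pad/eos/unk, so byte `b` maps to id `b + 3`); the helper names here are illustrative, not part of the model's API:

```python
# ByT5 vocabulary sketch: ids 0-2 are special (pad=0, eos=1, unk=2),
# and each UTF-8 byte b is mapped to id b + 3.
SPECIAL_OFFSET = 3
EOS_ID = 1

def byt5_encode(text: str) -> list[int]:
    """Encode text as ByT5-style byte ids, appending the EOS id."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]

def byt5_decode(ids: list[int]) -> str:
    """Invert the mapping, dropping special ids (< 3)."""
    return bytes(i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET).decode("utf-8")

ids = byt5_encode("amo")
print(ids)                 # [100, 112, 114, 1]
print(byt5_decode(ids))    # amo
```

Because every byte gets its own id, sequences are longer than with subword tokenizers, but out-of-vocabulary forms cannot occur.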
## Performance Analysis
The following table compares THIVLVC against major industry standards across the five Universal Dependencies (UD) Latin benchmarks.
| Benchmark | THIVLVC | UDPipe 2.0 | Trankit (XLM-R) | Stanza (v1.5) | GreTa (T5) |
|---|---|---|---|---|---|
| Perseus (Poetry) | 93.48% | 91.04% | 70.34% | 91.44% | 91.14% |
| UDante (Medieval) | 85.85% | 84.80% | - | 78.08% | - |
| PROIEL (Classical) | 97.29% | 96.65% | 97.21% | 90.88% | - |
| ITTB (Scholastic) | 98.64% | 99.03% | 99.13% | 96.50% | - |
| LLCT (Late Latin) | 88.92% | 97.40% | 96.2% | 97.10% | - |
THIVLVC achieves state-of-the-art results on three of the five benchmarks: Perseus (poetry), UDante (medieval), and PROIEL (classical). It is particularly effective on complex literary and medieval texts.
## Usage
**Important:** For best results, especially on short sentences or fragments, use beam search (`num_beams=5`).
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Zual/THIVLVC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def lemmatize(text):
    inputs = tokenizer(text, return_tensors="pt")
    # Beam search (num_beams=5) improves accuracy, notably on short inputs
    outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
print(lemmatize("Amorem canat"))
# Expected output: "amor cano"
```
This model was produced by Luc Pommeret at LISN (CNRS, Université Paris-Saclay).