THIVLVC: Latin ByT5 Lemmatizer

THIVLVC is a state-of-the-art Latin lemmatizer based on the ByT5 (base) architecture. It was developed at LISN (CNRS) to provide a high-performance, unified model for diverse Latin corpora.
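
Because ByT5 operates directly on UTF-8 bytes rather than on a subword vocabulary, a single model can absorb the orthographic variation found across classical, medieval, and late Latin without out-of-vocabulary issues. As a minimal sketch of what byte-level tokenization looks like in practice (the IDs shown follow ByT5's fixed byte-to-ID mapping):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Zual/THIVLVC")
# ByT5 has no learned vocabulary: each UTF-8 byte maps to byte_value + 3,
# since IDs 0-2 are reserved for <pad>, </s>, and <unk>; </s> is appended.
print(tokenizer("amor").input_ids)
# [100, 112, 114, 117, 1]  ->  bytes of "amor" (97, 109, 111, 114) + 3, then </s> (1)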

Performance Analysis

The following table compares THIVLVC's lemmatization accuracy against widely used baselines on the five Universal Dependencies (UD) Latin treebanks.

Benchmark            THIVLVC    UDPipe 2.0    Trankit (XLM-R)    Stanza (v1.5)    GreTa (T5)
Perseus (Poetry)     93.48%     91.04%        70.34%             91.44%           91.14%
UDante (Medieval)    85.85%     84.80%        -                  78.08%           -
PROIEL (Classical)   97.29%     96.65%        97.21%             90.88%           -
ITTB (Scholastic)    98.64%     99.03%        99.13%             96.50%           -
LLCT (Late Latin)    88.92%     97.40%        96.20%             97.10%           -

THIVLVC achieves state-of-the-art results on three of the five benchmarks: Perseus (classical poetry), UDante (medieval prose), and PROIEL (biblical/classical texts). It is particularly effective on complex literary and medieval material; on ITTB and LLCT, the established pipelines remain ahead.

Usage

Important: For best results, especially on short sentences or fragments, use beam search (num_beams=5).

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Zual/THIVLVC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def lemmatize(text):
    inputs = tokenizer(text, return_tensors="pt")
    # Using beam search (num_beams=5) for better accuracy
    outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
print(lemmatize("Amorem canat")) 
# Expected Output: "amor cano"
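
For larger inputs, multiple sentences can be batched through the same pipeline. A minimal sketch reusing the model and tokenizer loaded above (the padding call is standard transformers usage, not specific to this model):

# Batched lemmatization, reusing `model` and `tokenizer` from above
sentences = ["Amorem canat", "Arma virumque cano"]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))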

This model was produced by Luc Pommeret at LISN (CNRS, Université Paris-Saclay).

Model size: 0.6B parameters (F32, Safetensors)