🏷️ lfm2.5-1.2b-unesco-tagger-v3

Fine-tuned model for extracting UNESCO Thesaurus keywords from documents.

  • Model Version: v3 (600 training examples)
  • Performance: F1=0.361, Precision=0.364, Recall=0.368
  • Training Data: unesco-data-ai/unesco-thesaurus-sft

📋 Model Description

This model is fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct to automatically tag documents with keywords from the UNESCO Thesaurus.

✨ Use Cases:

  • 📚 Document classification and indexing
  • 🗂️ Metadata extraction for digital libraries
  • 🔍 Knowledge organization and discovery
  • 🏛️ Automated tagging for UNESCO/UNESDOC documents

🚀 Usage

Basic Example

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "unesco-data-ai/lfm2.5-1.2b-unesco-tagger-v3",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "LiquidAI/LFM2.5-1.2B-Instruct",
    trust_remote_code=True
)

# Prepare input
text = """The UNESCO Recommendation on the Ethics of Artificial Intelligence
addresses ethical issues related to AI systems throughout their life cycle,
including research, design, development, deployment, and use."""

prompt = f"Extract UNESCO Thesaurus keywords from this text:\n\n{text}"
messages = [{"role": "user", "content": prompt}]

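# Tokenize with the chat template; return_dict=True yields input_ids and attention_mask
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

print(response)
# Example output (sampling is enabled, so results can vary):
# ["Artificial intelligence", "Ethics of science", "Human rights", ...]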

📝 Input Format

Prompt template:

Extract UNESCO Thesaurus keywords from this text:

{document_text}

Output format: JSON array of keywords

["Keyword1", "Keyword2", "Keyword3"]
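Since the output is a JSON array, the response string can be turned into a Python list before further processing. The following is a minimal sketch, not part of the model's own tooling: the helper name `parse_keywords` and the fallback of searching for the first bracketed span are assumptions made for illustration.

```python
import json
import re


def parse_keywords(response: str) -> list[str]:
    """Return the keyword list found in a model response, or [] if none parses."""
    candidates = [response.strip()]
    # Assumption: if the model adds text around the array, try the bracketed span.
    match = re.search(r"\[.*\]", response, flags=re.DOTALL)
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, list):
            # Keep only non-empty strings and trim surrounding whitespace.
            return [item.strip() for item in parsed if isinstance(item, str) and item.strip()]
    return []


keywords = parse_keywords('["Keyword1", "Keyword2", "Keyword3"]')
print(keywords)
```

Returning an empty list on unparseable output lets a caller skip a bad response instead of failing on it.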

🌍 Real-World Example

📄 Document: UNESCO Recommendation on the Ethics of Artificial Intelligence (44 pages, ~103,000 characters)

⚙️ Method: Document chunked into 41 segments, keywords extracted and ranked by frequency

🎯 Keywords extracted:

{
  "keywords": [
    "Artificial intelligence",
    "Ethics of science",
    "Computer applications",
    "Human rights",
    "Computer science",
    "Automation",
    "Cognition",
    "Access to information",
    "Transparency",
    "Evaluation"
  ]
}

📖 Handling Long Documents

For documents longer than ~3000 characters:

  1. ✂️ Chunk the document with overlapping segments (e.g., 3000 chars with 500 overlap)
  2. 🔄 Process each chunk separately
  3. 📊 Aggregate results by keyword frequency
  4. 🏆 Return top N keywords (e.g., top 10)

Example implementation:

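from collections import Counter

def tag_long_document(text, model, tokenizer, chunk_size=3000, overlap=500, max_keywords=10):
    # Guard against a non-advancing window (would loop forever)
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    # Split into overlapping chunks; each starts (chunk_size - overlap) after the previous one
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap

    # Extract keywords from each chunk
    all_keywords = []
    for chunk in chunks:
        keywords = extract_keywords(chunk, model, tokenizer)  # Your extraction function
        all_keywords.extend(keywords)

    # Rank by frequency, counting case-insensitively but keeping the first-seen casing
    counts = Counter(kw.lower() for kw in all_keywords)
    display = {}
    for kw in all_keywords:
        display.setdefault(kw.lower(), kw)
    return [display[key] for key, _ in counts.most_common(max_keywords)]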
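The `extract_keywords` call above is left to the reader. One possible version, sketched here under assumptions, reuses the prompt template and generation settings from the basic example. The function body, the `PROMPT` constant, and the choice to return an empty list for unparseable output are illustrative and not taken from the model's repository.

```python
import json

# Prompt template from the Input Format section
PROMPT = "Extract UNESCO Thesaurus keywords from this text:\n\n{text}"


def extract_keywords(chunk, model, tokenizer, max_new_tokens=128):
    """Tag one chunk and return its keywords as a list of strings."""
    messages = [{"role": "user", "content": PROMPT.format(text=chunk)}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
    )
    outputs = model.generate(
        **inputs, max_new_tokens=max_new_tokens, temperature=0.1, do_sample=True
    )
    # Decode only the tokens generated after the prompt
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    try:
        keywords = json.loads(response)
    except json.JSONDecodeError:
        return []  # Assumption: skip chunks whose output is not valid JSON
    if not isinstance(keywords, list):
        return []
    return [kw for kw in keywords if isinstance(kw, str)]
```

With `model` and `tokenizer` loaded as in the basic example, `extract_keywords(text, model, tokenizer)` would return a plain list that `tag_long_document` can aggregate.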

✅ Validation

For best results, validate extracted keywords against the UNESCO Thesaurus vocabulary and discard any that do not match.
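A minimal sketch of such a check is shown below. It assumes the thesaurus labels are already available as a Python set. The three-term `vocabulary` is only a placeholder, "AI governance" is assumed for illustration to be absent from it, and the `validate_keywords` helper is hypothetical rather than part of any published tooling.

```python
def validate_keywords(keywords, vocabulary):
    """Split keywords into (valid, rejected), matching labels case-insensitively."""
    canonical = {label.casefold(): label for label in vocabulary}
    valid, rejected = [], []
    for kw in keywords:
        label = canonical.get(kw.strip().casefold())
        if label is None:
            rejected.append(kw)
        elif label not in valid:
            # Store the vocabulary's own casing and skip repeats
            valid.append(label)
    return valid, rejected


# Placeholder vocabulary: a real run would load every preferred label
# from an export of the UNESCO Thesaurus.
vocabulary = {"Artificial intelligence", "Ethics of science", "Human rights"}

valid, rejected = validate_keywords(
    ["artificial intelligence", "Human rights", "AI governance"], vocabulary
)
print(valid)
print(rejected)
```

Returning the rejected terms separately makes it easy to log what the model produced outside the vocabulary.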

🎓 Training Details

Dataset Statistics

  • Total Examples: 600 (480 train / 60 validation / 60 test)
  • Data Source: Ollama synthetic generation (lfm2.5-thinking model)
  • Quality: 99.7% generation success rate
  • Average Text Length: 272 words per example
  • Average Keywords: 5.5 per example
  • Dataset: unesco-data-ai/unesco-thesaurus-sft

Training Configuration

| Parameter | Value |
|---|---|
| 🧠 Base model | LiquidAI/LFM2.5-1.2B-Instruct |
| 📚 Method | Supervised Fine-Tuning (SFT) |
| 🛠️ Library | TRL 0.27.1 |
| ⚡ Hardware | HuggingFace Jobs (a10g-large GPU, 22GB VRAM) |
| ⏱️ Training Time | ~1.5 hours |
| 📊 Epochs | 3 |
| 🎯 Batch Size | 1 (gradient accumulation: 16) |
| 📈 Learning Rate | 2e-5 |
| 📏 Max Sequence Length | 1024 tokens |
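As a rough worked example, the batch settings above imply the following effective batch size and optimizer step count. This assumes a single GPU, no sequence packing, and the 480-example training split from the dataset statistics. The card itself does not report step counts.

```python
import math

train_examples = 480   # training split size from Dataset Statistics
per_device_batch = 1   # Batch Size
grad_accum_steps = 16  # gradient accumulation
epochs = 3

# Gradients from 16 single-example batches are combined per optimizer update
effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)
```

Under these assumptions the run performs about 90 optimizer updates in total.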

📦 Framework Versions

  • TRL: 0.27.1
  • Transformers: 5.0.0
  • PyTorch: 2.10.0
  • Datasets: 4.5.0
  • Tokenizers: 0.22.2

📊 Performance

Evaluated on 10 test examples:

| Metric | Score | Interpretation |
|---|---|---|
| F1 Score | 0.361 | Moderate performance; functional but needs improvement |
| Precision | 0.364 | 36% of predicted keywords are correct |
| Recall | 0.368 | 37% of ground truth keywords are captured |
| Valid Predictions | 10/10 | Model generates valid JSON output 100% of the time |
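The card does not say how these scores are aggregated. One common approach, sketched here as an assumption, computes set-based precision, recall, and F1 per example and then averages each metric across examples. Under that scheme the averaged F1 can fall below both the averaged precision and the averaged recall. The function names and the toy predictions are illustrative only.

```python
def prf(predicted, gold):
    """Set-based precision, recall and F1 for one example (case-insensitive)."""
    pred = {k.casefold() for k in predicted}
    true = {k.casefold() for k in gold}
    hits = len(pred & true)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def macro_average(pairs):
    """Average per-example precision, recall and F1 over (predicted, gold) pairs."""
    scores = [prf(predicted, gold) for predicted, gold in pairs]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Toy data: one example is precise but incomplete, the other complete but noisy
pairs = [
    (["A"], ["A", "B", "C", "D", "E"]),
    (["V", "W", "X", "Y", "Z"], ["V"]),
]
precision, recall, f1 = macro_average(pairs)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy case the averaged precision and recall are both 0.6 while the averaged F1 is about 0.333, because each example scores well on only one of the two metrics.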

Strengths

  • 100% valid predictions: Produced well-formed JSON output on all 10 evaluated examples
  • Semantic understanding: Captures the general domain correctly
  • Reasonable keyword count: Generates 3-8 keywords per input
  • No hallucinations observed: All keywords predicted on the 10 evaluated examples are valid UNESCO terms

Known Issues

  • ⚠️ Specificity gap: Often predicts general terms instead of specific ones
  • ⚠️ Moderate recall: Misses ~63% of ground truth keywords
  • ⚠️ Some duplicates: Occasionally repeats keywords
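Repeated keywords can be removed in post-processing. The sketch below is one simple way to do it, assuming duplicates differ at most in letter case or whitespace. The helper name `dedupe_keywords` is hypothetical.

```python
def dedupe_keywords(keywords):
    """Drop repeated keywords, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for kw in keywords:
        # Normalize case and whitespace only for comparison
        key = " ".join(kw.split()).casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(kw.strip())
    return unique


print(dedupe_keywords(["Human rights", "human rights ", "Ethics of science"]))
```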

🔧 Graph-Based Post-Processing

This model can be enhanced with graph-based refinement to remove parent-child redundancy:

# Tag with graph-based post-processing
python scripts/tag_document.py \
  --file document.pdf \
  --use-graph \
  --refine-strategy balanced \
  --max-keywords 10

Benefits:

  • Removes redundant parent-child terms (e.g., keeps "Child psychology" and removes "Educational psychology")
  • Preserves sibling terms (e.g., keeps both "Educational psychology" and "Educational philosophy")
  • Improves precision by focusing on specific terms
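The idea behind parent-child pruning can be sketched as follows. This is not the implementation in `scripts/tag_document.py`, whose strategies are not described here. It assumes each term has at most one broader term, and the `broader` mapping is a hypothetical toy hierarchy built around the examples above, not real thesaurus data.

```python
def drop_redundant_ancestors(keywords, broader):
    """Remove any keyword that is an ancestor of another selected keyword."""
    selected = set(keywords)
    redundant = set()
    for kw in keywords:
        parent = broader.get(kw)
        visited = set()  # guards against cycles in the mapping
        while parent is not None and parent not in visited:
            visited.add(parent)
            if parent in selected:
                redundant.add(parent)
            parent = broader.get(parent)
    return [kw for kw in keywords if kw not in redundant]


# Hypothetical hierarchy (term -> broader term), for illustration only
broader = {
    "Child psychology": "Educational psychology",
    "Educational psychology": "Educational sciences",
    "Educational philosophy": "Educational sciences",
}

refined = drop_redundant_ancestors(
    ["Child psychology", "Educational psychology", "Educational philosophy"], broader
)
print(refined)
```

Siblings survive because neither is an ancestor of the other. A real thesaurus can assign several broader terms to one concept, which this single-parent sketch does not handle.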

⚠️ Limitations

  • Training data averaged 272 words per example (synthetic data)
  • Works best with chunk sizes of 2000-4000 characters
  • May generate keywords not in the UNESCO Thesaurus (validation recommended)
  • Optimized for English text only
  • Not recommended for production use without additional training (target F1 > 0.50)

📖 Citation

@misc{unesco-tagger-v3-2026,
    title = {UNESCO Thesaurus Keyword Tagger v3},
    author = {UNESCO Data & AI},
    year = {2026},
    publisher = {Hugging Face},
    howpublished = {\url{https://huggingface.co/unesco-data-ai/lfm2.5-1.2b-unesco-tagger-v3}}
}

🔗 Resources

📜 License

See LICENSE for details.
