🏷️ lfm2.5-1.2b-unesco-tagger-v3
Fine-tuned model for extracting UNESCO Thesaurus keywords from documents.
- Model Version: v3 (600 training examples)
- Performance: F1=0.361, Precision=0.364, Recall=0.368
- Training Data: unesco-data-ai/unesco-thesaurus-sft
📋 Model Description
This model is fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct to automatically tag documents with keywords from the UNESCO Thesaurus.
✨ Use Cases:
- 📚 Document classification and indexing
- 🗂️ Metadata extraction for digital libraries
- 🔍 Knowledge organization and discovery
- 🏛️ Automated tagging for UNESCO/UNESDOC documents
🚀 Usage
Basic Example
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model; the tokenizer comes from the base model
model = AutoModelForCausalLM.from_pretrained(
    "unesco-data-ai/lfm2.5-1.2b-unesco-tagger-v3",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "LiquidAI/LFM2.5-1.2B-Instruct",
    trust_remote_code=True,
)

# Prepare input
text = """The UNESCO Recommendation on the Ethics of Artificial Intelligence
addresses ethical issues related to AI systems throughout their life cycle,
including research, design, development, deployment, and use."""
prompt = f"Extract UNESCO Thesaurus keywords from this text:\n\n{text}"
messages = [{"role": "user", "content": prompt}]

# return_dict=True yields input_ids and attention_mask, which are unpacked into generate()
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(response)
# Example output: ["Artificial intelligence", "Ethics of science", "Human rights", ...]
📝 Input Format
Prompt template:
Extract UNESCO Thesaurus keywords from this text:

{document_text}
Output format: JSON array of keywords
["Keyword1", "Keyword2", "Keyword3"]
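The prompt template and output format above can be wrapped in two small helpers. This is a minimal sketch rather than part of the released code: `build_prompt` and `parse_keywords` are hypothetical names, and the tolerant slicing between the first `[` and the last `]` is an assumption about how stray text around the array might be handled.

```python
import json

PROMPT_TEMPLATE = "Extract UNESCO Thesaurus keywords from this text:\n\n{document_text}"


def build_prompt(document_text: str) -> str:
    """Fill the prompt template with the document text."""
    return PROMPT_TEMPLATE.format(document_text=document_text)


def parse_keywords(response: str) -> list[str]:
    """Parse a model response into a list of keyword strings.

    Returns an empty list if no JSON array can be recovered.
    """
    # Tolerate stray text around the array: slice from the first '[' to the last ']'.
    start, end = response.find("["), response.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        parsed = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only non-empty strings, trimmed of surrounding whitespace.
    return [kw.strip() for kw in parsed if isinstance(kw, str) and kw.strip()]


print(parse_keywords('["Keyword1", "Keyword2", "Keyword3"]'))
# ['Keyword1', 'Keyword2', 'Keyword3']
```

Returning an empty list on malformed output keeps downstream aggregation simple, at the cost of silently dropping a bad response; logging the raw response in that branch may be preferable in practice.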
🌍 Real-World Example
📄 Document: UNESCO Recommendation on the Ethics of Artificial Intelligence (44 pages, ~103,000 characters)
⚙️ Method: Document chunked into 41 segments, keywords extracted and ranked by frequency
🎯 Keywords extracted:
{
"keywords": [
"Artificial intelligence",
"Ethics of science",
"Computer applications",
"Human rights",
"Computer science",
"Automation",
"Cognition",
"Access to information",
"Transparency",
"Evaluation"
]
}
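The segment count reported for this example can be checked with a little arithmetic. The card does not state which chunk settings were used for this run, so the sketch below assumes a sliding window of 3000 characters with a 500-character overlap and treats the document length as exactly 103,000 characters; `count_chunks` is a hypothetical helper.

```python
import math


def count_chunks(n_chars: int, chunk_size: int = 3000, overlap: int = 500) -> int:
    """Number of chunks a sliding window produces over n_chars characters."""
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_size:
        return 1
    stride = chunk_size - overlap  # each new chunk starts this many characters later
    return math.ceil((n_chars - chunk_size) / stride) + 1


print(count_chunks(103_000))  # 41
```

Under these assumptions the result matches the 41 segments listed above, since each chunk after the first advances by 2500 characters.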
📖 Handling Long Documents
For documents longer than ~3000 characters:
- ✂️ Chunk the document with overlapping segments (e.g., 3000 chars with 500 overlap)
- 🔄 Process each chunk separately
- 📊 Aggregate results by keyword frequency
- 🏆 Return top N keywords (e.g., top 10)
Example implementation:
from collections import Counter

def tag_long_document(text, model, tokenizer, chunk_size=3000, overlap=500, max_keywords=10):
    # Split into overlapping chunks (requires overlap < chunk_size)
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # the last chunk reached the end of the document
        start = end - overlap

    # Extract keywords from each chunk
    all_keywords = []
    for chunk in chunks:
        keywords = extract_keywords(chunk, model, tokenizer)  # Your extraction function
        all_keywords.extend(keywords)

    # Rank by frequency (keywords are lowercased so case variants are counted together)
    counts = Counter(kw.lower() for kw in all_keywords)
    top = [kw for kw, _ in counts.most_common(max_keywords)]
    return top
✅ Validation
For best results, validate extracted keywords against the UNESCO Thesaurus vocabulary, since the model may generate terms that are not in it.
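One way to sketch such a validation step, assuming the thesaurus terms are already available as a collection of strings. `validate_keywords` is a hypothetical helper, the three-term vocabulary is a toy stand-in for the real term list, and "Robot ethics" is simply a placeholder for an out-of-vocabulary prediction.

```python
def validate_keywords(keywords, vocabulary):
    """Split keywords into (valid, rejected) using a case-insensitive lookup.

    Valid keywords are returned in the vocabulary's canonical casing.
    """
    canonical = {term.casefold(): term for term in vocabulary}
    valid, rejected = [], []
    for kw in keywords:
        match = canonical.get(kw.strip().casefold())
        if match is None:
            rejected.append(kw)
        elif match not in valid:  # skip repeats of an already accepted term
            valid.append(match)
    return valid, rejected


# Toy vocabulary for illustration only; load the real term list from the UNESCO Thesaurus.
vocabulary = {"Artificial intelligence", "Ethics of science", "Human rights"}
valid, rejected = validate_keywords(
    ["artificial intelligence", "Human rights", "Robot ethics"], vocabulary
)
print(valid)     # ['Artificial intelligence', 'Human rights']
print(rejected)  # ['Robot ethics']
```

Mapping back to canonical casing is also useful after frequency ranking, which typically works on lowercased keywords.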
🎓 Training Details
Dataset Statistics
- Total Examples: 600 (480 train / 60 validation / 60 test)
- Data Source: Ollama synthetic generation (lfm2.5-thinking model)
- Quality: 99.7% generation success rate
- Average Text Length: 272 words per example
- Average Keywords: 5.5 per example
- Dataset: unesco-data-ai/unesco-thesaurus-sft
Training Configuration
| Parameter | Value |
|---|---|
| 🧠 Base model | LiquidAI/LFM2.5-1.2B-Instruct |
| 📚 Method | Supervised Fine-Tuning (SFT) |
| 🛠️ Library | TRL 0.27.1 |
| ⚡ Hardware | Hugging Face Jobs (a10g-large GPU, 22GB VRAM) |
| ⏱️ Training Time | ~1.5 hours |
| 📊 Epochs | 3 |
| 🎯 Batch Size | 1 (gradient accumulation: 16) |
| 📈 Learning Rate | 2e-5 |
| 📏 Max Sequence Length | 1024 tokens |
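The batch settings in the table imply an effective batch size and a total number of optimizer steps, which can be worked out as follows. This is a rough sketch that assumes a single GPU and that all 480 training examples are used each epoch; `optimizer_steps` is a hypothetical helper, and the exact count in a real run depends on how the trainer handles a final partial batch.

```python
import math


def optimizer_steps(train_examples, per_device_batch_size, gradient_accumulation,
                    epochs, num_devices=1):
    """Return (effective_batch_size, steps_per_epoch, total_steps)."""
    effective_batch = per_device_batch_size * gradient_accumulation * num_devices
    steps_per_epoch = math.ceil(train_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


# 480 training examples, batch size 1, gradient accumulation 16, 3 epochs
print(optimizer_steps(480, 1, 16, 3))  # (16, 30, 90)
```

With only about 90 weight updates in total, small changes to the learning rate or epoch count can have a noticeable effect on the result.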
📦 Framework Versions
- TRL: 0.27.1
- Transformers: 5.0.0
- PyTorch: 2.10.0
- Datasets: 4.5.0
- Tokenizers: 0.22.2
📊 Performance
Evaluated on 10 test examples:
| Metric | Score | Interpretation |
|---|---|---|
| F1 Score | 0.361 | Moderate performance; functional but needs improvement |
| Precision | 0.364 | 36% of predicted keywords are correct |
| Recall | 0.368 | 37% of ground truth keywords are captured |
| Valid Predictions | 10/10 | Model generates valid JSON output 100% of the time |
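The card does not say how predicted and reference keywords were matched or how scores were averaged. As an illustration only, the sketch below computes per-example precision, recall, and F1 with exact, case-insensitive set matching; `keyword_prf` is a hypothetical helper and the keyword lists are made up for the example.

```python
def keyword_prf(predicted, reference):
    """Precision, recall and F1 for one example, using case-insensitive set matching."""
    pred = {kw.strip().casefold() for kw in predicted}
    ref = {kw.strip().casefold() for kw in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = keyword_prf(
    ["Artificial intelligence", "Ethics of science", "Automation"],
    ["Artificial intelligence", "Human rights", "Ethics of science", "Transparency"],
)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```

Averaging per-example F1 scores generally gives a different number than taking the harmonic mean of the averaged precision and recall. The harmonic mean of 0.364 and 0.368 is about 0.366, slightly above the reported 0.361, which would be consistent with per-example averaging, though the card does not confirm this.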
Strengths
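- ✅ 100% valid predictions: Produced well-formed JSON output for all 10 evaluated examples
- ✅ Semantic understanding: Captures the general domain of a document correctly
- ✅ Reasonable keyword count: Typically generates 3-8 keywords per input
- ✅ No hallucinations observed: All keywords predicted in the evaluation were valid UNESCO terms (out-of-vocabulary terms remain possible on other inputs; see Limitations)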
Known Issues
- ⚠️ Specificity gap: Often predicts general terms instead of specific ones
- ⚠️ Moderate recall: Misses ~63% of ground truth keywords
- ⚠️ Some duplicates: Occasionally repeats keywords
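The duplicate issue is straightforward to handle after generation. A minimal sketch, assuming keywords that differ only in case or surrounding whitespace count as duplicates; `dedupe_keywords` is a hypothetical helper.

```python
def dedupe_keywords(keywords):
    """Remove repeated keywords, keeping the first occurrence and its order."""
    seen = set()
    unique = []
    for kw in keywords:
        cleaned = kw.strip()
        key = cleaned.casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


print(dedupe_keywords(["Human rights", "Ethics of science", "human rights "]))
# ['Human rights', 'Ethics of science']
```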
🔧 Graph-Based Post-Processing
This model can be enhanced with graph-based refinement to remove parent-child redundancy:
# Tag with graph-based post-processing
python scripts/tag_document.py \
  --file document.pdf \
  --use-graph \
  --refine-strategy balanced \
  --max-keywords 10
Benefits:
- Removes redundant parent-child terms (e.g., keeps "Child psychology" and removes "Educational psychology")
- Preserves sibling terms (e.g., keeps both "Educational psychology" and "Educational philosophy")
- Improves precision by focusing on specific terms
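The parent-child rule described above can be illustrated with a small sketch. This is not the implementation behind `scripts/tag_document.py`, whose refinement strategies are not documented here. It assumes each term has at most one broader term, supplied as a `parent_of` mapping, whereas a real thesaurus may allow several; `remove_redundant_parents` is a hypothetical helper and the hierarchy is a toy mapping, not taken from the actual thesaurus.

```python
def remove_redundant_parents(keywords, parent_of):
    """Drop every keyword that is an ancestor of another keyword in the list."""
    selected = set(keywords)
    redundant = set()
    for kw in keywords:
        node = parent_of.get(kw)
        visited = set()  # guards against cycles in a malformed mapping
        while node is not None and node not in visited:
            visited.add(node)
            if node in selected:
                redundant.add(node)
            node = parent_of.get(node)
    return [kw for kw in keywords if kw not in redundant]


# Toy hierarchy for illustration only (child -> parent).
parent_of = {
    "Child psychology": "Educational psychology",
    "Educational psychology": "Educational sciences",
    "Educational philosophy": "Educational sciences",
}

print(remove_redundant_parents(["Child psychology", "Educational psychology"], parent_of))
# ['Child psychology']
print(remove_redundant_parents(["Educational psychology", "Educational philosophy"], parent_of))
# ['Educational psychology', 'Educational philosophy']
```

The first call drops the parent because its more specific child is present; the second keeps both terms because siblings do not make each other redundant.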
⚠️ Limitations
- Training data averaged 272 words per example (synthetic data)
- Works best with chunk sizes of 2000-4000 characters
- May generate keywords not in the UNESCO Thesaurus (validation recommended)
- Optimized for English text only
- Additional training is recommended before production use (target F1 > 0.50)
📖 Citation
@misc{unesco-tagger-v3-2026,
  title = {UNESCO Thesaurus Keyword Tagger v3},
  author = {{UNESCO Data \& AI}},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/unesco-data-ai/lfm2.5-1.2b-unesco-tagger-v3}}
}
🔗 Resources
- Model: unesco-data-ai/lfm2.5-1.2b-unesco-tagger-v3
- Dataset: unesco-data-ai/unesco-thesaurus-sft
- Training Logs: unesco-data-ai/trackio
- Base Model: LiquidAI/LFM2.5-1.2B-Instruct
📜 License
See LICENSE for details.