Qwen3-4B Telugu Transliteration

Model Summary

pavanmantha/qwen3-4b-telugu-transliteration is a full fine-tune of Qwen3-4B-Instruct for the task of Telugu-to-Roman (Latin script) transliteration. Given a sentence written in Telugu script, the model outputs its phonetically accurate Roman-alphabet equivalent. It was trained for 1 epoch on ~35k instruction-formatted Telugu–Romanized pairs using DDP across 3× NVIDIA L40S GPUs.


Model Details

Model Description

  • Developed by: Pavan Kumar Mantha (pavanmantha)
  • Model type: Causal Language Model — full fine-tune (no LoRA/PEFT)
  • Language(s): Telugu (te) → Roman/Latin (en script)
  • License: Apache 2.0 (inherited from Qwen3 base)
  • Finetuned from: Qwen/Qwen3-4B-Instruct-2507
  • Total parameters: 4.02B (all trainable)
  • Model dtype: bfloat16

Uses

Direct Use

This model is intended for converting Telugu script text into its Roman (Latin alphabet) transliteration. Useful for:

  • Search and indexing pipelines that need phonetic normalization
  • Text-to-speech preprocessing
  • Keyboard input systems for Telugu speakers
  • Cross-script NLP research

Downstream Use

Can be integrated into larger NLP pipelines as a preprocessing step — for example, feeding transliterated output into downstream models that work better with Latin-script text.

Out-of-Scope Use

  • This model is not a translation model. It performs transliteration (phonetic conversion), not semantic translation from Telugu to English.
  • Not suited for zero-shot tasks unrelated to transliteration.
  • May produce degraded output for heavily domain-specific or rare vocabulary not represented in the training data.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "pavanmantha/qwen3-4b-telugu-transliteration"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def transliterate(telugu_text: str) -> str:
    messages = [
        {
            "role": "system",
            "content": "You are a Telugu transliteration expert. Convert the given Telugu text written in Telugu script into its Roman (Latin alphabet) transliteration accurately.",
        },
        {
            "role": "user",
            "content": telugu_text,
        },
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,
            temperature=None,
            top_p=None,
        )
    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)

# Example
text = "మానవ పరిణామ ప్రక్రియ యొక్క అవలోకనాన్ని అందించండి."
print(transliterate(text))
# → "manava parinama prakriya yokka avalokananni andinchandi."

Training Details

Training Data

  • Dataset: pavanmantha/telugu_transliteration_40k
  • Total samples: 43,614
  • Train split: 34,891 samples (80%)
  • Validation split: 8,723 samples (20%), stratified with seed=42
  • Columns: text (Telugu script), text_transliterated (Roman script)

Sample:

| text | text_transliterated |
| --- | --- |
| మానవ పరిణామ ప్రక్రియ యొక్క అవలోకనాన్ని అందించండి. | manava parinama prakriya yokka avalokananni andinchandi. |
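The 80/20 split with `seed=42` can be reproduced with a sketch like the following. This is a minimal, non-stratified illustration in plain Python; the actual run presumably used the `datasets` library's split utilities, and the `rows` list here is a stand-in for the real dataset rows.

```python
import random

def split_dataset(rows, val_fraction=0.2, seed=42):
    """Shuffle deterministically, then carve off a validation split."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_val = round(len(rows) * val_fraction)
    val_idx = indices[:n_val]
    val_set = set(val_idx)
    train = [rows[i] for i in indices if i not in val_set]
    val = [rows[i] for i in val_idx]
    return train, val

# With 43,614 rows this yields the card's 34,891 / 8,723 split.
rows = list(range(43_614))
train, val = split_dataset(rows)
print(len(train), len(val))  # → 34891 8723
```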

Training Procedure

Each sample was formatted as a 3-turn chat using Qwen3's chat template:

System   : "You are a Telugu transliteration expert..."
User     : <Telugu script text>
Assistant: <Roman transliteration>

Labels were masked to -100 for the system and user prompt tokens, so the cross-entropy loss is computed only on the assistant response (the transliterated output). On average, 20.06% of tokens per sample were unmasked (e.g., 63 of 314 tokens for a representative sample).
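The masking step can be sketched as below. `prompt_len` (the token count of the rendered system + user turns) is a hypothetical helper value for illustration, not a name from the training script.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, masking everything before the assistant
    response so loss is computed only on the transliteration tokens."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Toy example: 5 prompt tokens followed by 3 assistant tokens.
ids = [11, 22, 33, 44, 55, 66, 77, 88]
print(mask_prompt_labels(ids, 5))
# → [-100, -100, -100, -100, -100, 66, 77, 88]
```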

Preprocessing

  • Tokenizer: Qwen3 BPE tokenizer, padding_side="right"
  • Max sequence length: 512 tokens
  • Padding: dynamic, padded to multiple of 8
  • Label padding: -100
  • Tokenization workers: 4
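A minimal sketch of the dynamic padding described above (right-pad to the smallest multiple of 8, with labels padded to -100). The real run presumably used a Transformers data collator; this pure-Python version only illustrates the logic.

```python
def pad_batch(batch, pad_token_id=0, multiple_of=8):
    """Right-pad input_ids with the pad token and labels with -100,
    up to the smallest multiple of `multiple_of` >= the longest sequence."""
    longest = max(len(ex["input_ids"]) for ex in batch)
    target = -(-longest // multiple_of) * multiple_of  # ceiling division
    for ex in batch:
        pad = target - len(ex["input_ids"])
        ex["input_ids"] = ex["input_ids"] + [pad_token_id] * pad
        ex["labels"] = ex["labels"] + [-100] * pad
    return batch

batch = [
    {"input_ids": [1, 2, 3], "labels": [-100, 2, 3]},
    {"input_ids": [4, 5, 6, 7, 8], "labels": [-100, -100, 6, 7, 8]},
]
padded = pad_batch(batch)
print([len(ex["input_ids"]) for ex in padded])  # → [8, 8]
```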

Training Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Training regime | bf16 mixed precision |
| Learning rate | 2e-6 |
| LR scheduler | Linear with warmup |
| Warmup steps | 50 |
| Per-device batch size | 4 |
| Gradient accumulation steps | 8 |
| Effective batch size | 96 (4 × 8 × 3 GPUs) |
| Epochs | 1 |
| Steps per epoch | 363 |
| Total steps | 364 |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Gradient checkpointing | ✅ (use_reentrant=False) |
| Optimizer | AdamW (HF Trainer default) |
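The table above maps onto a `TrainingArguments` configuration roughly as follows. This is a sketch reconstructed from the card, not the actual training script; the output path, logging cadence, and eval interval are assumptions.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qwen3-4b-telugu-transliteration",  # assumed path
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch = 4 × 8 × 3 GPUs = 96
    learning_rate=2e-6,
    lr_scheduler_type="linear",
    warmup_steps=50,
    weight_decay=0.01,
    max_grad_norm=1.0,
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    eval_strategy="steps",
    eval_steps=500,   # matches the eval cadence noted under Metrics
    logging_steps=50,  # assumed, consistent with the logged metric steps
)
```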

Training Metrics

All values extracted from the TensorBoard event log.

| Step | Train Loss | Grad Norm | Learning Rate | Epoch |
| --- | --- | --- | --- | --- |
| 50 | 2.4207 | 4.3750 | 2.00e-6 | 0.138 |
| 100 | 1.1400 | 1.2578 | 2.00e-6 | 0.275 |
| 150 | 0.7542 | 0.8164 | 1.50e-6 | 0.413 |
| 200 | 0.6720 | 0.7148 | 1.50e-6 | 0.550 |
| 250 | 0.6414 | 0.6563 | 1.00e-6 | 0.688 |
| 300 | 0.6273 | 0.5820 | 5.00e-7 | 0.825 |
| 350 | 0.6220 | 0.6758 | ~0 | 0.963 |
| 364 (final) | | | | 1.000 |

Final training summary (step 364):

| Metric | Value |
| --- | --- |
| Train loss (epoch avg) | 0.9693 |
| Train runtime | 1,433.7 s (~23.9 min) |
| Samples/second | 24.34 |
| Steps/second | 0.254 |
| Total FLOPs | 2.57 × 10¹⁷ |
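As a quick sanity check, the throughput figures in the summary are internally consistent with the step count, sample count, and runtime:

```python
runtime_s = 1433.7     # train runtime in seconds
total_steps = 364
train_samples = 34_891

print(round(train_samples / runtime_s, 2))  # samples/second → 24.34
print(round(total_steps / runtime_s, 3))    # steps/second   → 0.254
```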

The loss dropped rapidly from 2.42 → 0.62 in the first 350 steps, demonstrating fast and stable convergence for this transliteration task.


Evaluation

Testing Data

The validation set consists of 8,723 held-out samples from the same pavanmantha/telugu_transliteration_40k dataset, split with seed=42.

Metrics

The primary training objective is cross-entropy loss (on unmasked assistant tokens only). Evaluation loss was tracked every 500 steps via eval_strategy="steps". Additional task-specific metrics (CER, WER, exact match) can be computed post-hoc using the inference snippet above.
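Character error rate can be computed post-hoc with a standard Levenshtein edit distance over model outputs versus reference transliterations. A self-contained sketch follows; the reference/hypothesis strings are illustrative, not drawn from the validation set.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

ref = "manava parinama prakriya"
hyp = "manava parinaama prakriya"  # one inserted character
print(round(cer(ref, hyp), 4))  # → 0.0417
```

The same loop, aggregated over the 8,723 validation pairs produced by the inference snippet above, gives corpus-level CER; exact match is simply `reference == hypothesis`.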


Bias, Risks, and Limitations

  • The model is trained on a single dataset and may not generalize well to specialized domains (legal, medical, technical Telugu).
  • Transliteration conventions vary across regions and systems (e.g., ISO 15919 vs. informal conventions); the model reflects the conventions present in the training data.
  • Rare or archaic Telugu vocabulary may be poorly handled.
  • The model carries any biases inherent in the underlying Qwen3-4B-Instruct base model.

Recommendations

Users should validate outputs against a known-good reference for high-stakes applications. For best results, inputs should be clean, well-formed Telugu script sentences within the general domain of the training data.


Environmental Impact

  • Hardware: 3× NVIDIA L40S (46 GB each)
  • Training time: ~23.9 minutes (1,433.7 seconds)
  • Cloud provider: Private compute cluster
  • Compute region: India
  • Estimated CO₂: Minimal — single short epoch on 3 GPUs

Carbon emissions can be estimated using the Machine Learning Impact calculator.


Technical Specifications

Model Architecture

  • Architecture: Qwen3ForCausalLM (decoder-only transformer)
  • Parameters: 4.02B
  • Dtype: bfloat16
  • Attention: Standard scaled dot-product (Flash Attention 2 not used in this run)
  • Training strategy: Full fine-tune with DDP (DistributedDataParallel) across 3 GPUs
  • use_cache: Disabled during training

Compute Infrastructure

  • GPUs: 3× NVIDIA L40S, 46 GB VRAM each
  • Driver: 550.127.05 | CUDA 12.8
  • Framework: PyTorch + HuggingFace Transformers + Accelerate
  • Launcher: torchrun --nproc_per_node=3

Software Versions

| Library | Role |
| --- | --- |
| transformers | Model, Trainer, TrainingArguments |
| torch | DDP, autocast, gradient checkpointing |
| accelerate | Distributed backend |
| datasets | Data loading and tokenization |
| tokenizers | Qwen3 BPE tokenizer |

Citation

If you use this model, please cite it; consider also citing the Qwen3 base model and the training dataset:

@misc{qwen3-4b-telugu-transliteration,
  author       = {Pavan Kumar Mantha},
  title        = {Qwen3-4B Telugu Transliteration},
  year         = {2025},
  publisher    = {HuggingFace},
  url          = {https://huggingface.co/pavanmantha/qwen3-4b-telugu-transliteration}
}

Model Card Authors

Pavan Kumar Mantha — Distinguished AI Architect, PhD researcher in Generative AI (IIITDM Kurnool), MTech Data Science (BITS Pilani).

Model Card Contact

Hugging Face profile
