Qwen3-4B Telugu Transliteration

Model Summary

pavanmantha/qwen3-4b-telugu-transliteration is a full fine-tune of Qwen3-4B-Instruct for the task of Telugu-to-Roman (Latin script) transliteration. Given a sentence written in Telugu script, the model outputs its phonetically accurate Roman-alphabet equivalent. It was trained for 1 epoch on ~35k instruction-formatted Telugu–Romanized pairs using DDP across 3× NVIDIA L40S GPUs.


Model Details

Model Description

  • Developed by: Pavan Kumar Mantha (pavanmantha)
  • Model type: Causal Language Model — full fine-tune (no LoRA/PEFT)
  • Language(s): Telugu (te) → Roman/Latin (en script)
  • License: Apache 2.0 (inherited from Qwen3 base)
  • Finetuned from: Qwen/Qwen3-4B-Instruct-2507
  • Total parameters: 4.02B (all trainable)
  • Model dtype: bfloat16

Uses

Direct Use

This model is intended for converting Telugu script text into its Roman (Latin alphabet) transliteration. Useful for:

  • Search and indexing pipelines that need phonetic normalization
  • Text-to-speech preprocessing
  • Keyboard input systems for Telugu speakers
  • Cross-script NLP research

Downstream Use

Can be integrated into larger NLP pipelines as a preprocessing step — for example, feeding transliterated output into downstream models that work better with Latin-script text.

Out-of-Scope Use

  • This model is not a translation model. It performs transliteration (phonetic conversion), not semantic translation from Telugu to English.
  • Not suited for zero-shot tasks unrelated to transliteration.
  • May produce degraded output for heavily domain-specific or rare vocabulary not represented in the training data.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "pavanmantha/qwen3-4b-telugu-transliteration"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def transliterate(telugu_text: str) -> str:
    messages = [
        {
            "role": "system",
            "content": "You are a Telugu transliteration expert. Convert the given Telugu text written in Telugu script into its Roman (Latin alphabet) transliteration accurately.",
        },
        {
            "role": "user",
            "content": telugu_text,
        },
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,
            temperature=None,
            top_p=None,
        )
    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)

# Example
text = "మానవ పరిణామ ప్రక్రియ యొక్క అవలోకనాన్ని అందించండి."
print(transliterate(text))
# → "manava parinama prakriya yokka avalokananni andinchandi."

Training Details

Training Data

  • Dataset: pavanmantha/telugu_transliteration_40k
  • Total samples: 43,614
  • Train split: 34,891 samples (80%)
  • Validation split: 8,723 samples (20%), stratified with seed=42
  • Columns: text (Telugu script), text_transliterated (Roman script)

Sample:

| text | text_transliterated |
| --- | --- |
| మానవ పరిణామ ప్రక్రియ యొక్క అవలోకనాన్ని అందించండి. | manava parinama prakriya yokka avalokananni andinchandi. |
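The 80/20 split with `seed=42` can be reproduced with a sketch like the following. This is a minimal, non-stratified illustration in plain Python; the actual run presumably used the `datasets` library's split utilities, and the `rows` list here is a stand-in for the real dataset rows.

```python
import random

def split_dataset(rows, val_fraction=0.2, seed=42):
    """Shuffle deterministically, then carve off a validation split."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_val = round(len(rows) * val_fraction)
    val_idx = indices[:n_val]
    val_set = set(val_idx)
    train = [rows[i] for i in indices if i not in val_set]
    val = [rows[i] for i in val_idx]
    return train, val

# With 43,614 rows this yields the card's 34,891 / 8,723 split.
rows = list(range(43_614))
train, val = split_dataset(rows)
print(len(train), len(val))  # → 34891 8723
```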

Training Procedure

Each sample was formatted as a 3-turn chat using Qwen3's chat template:

System   : "You are a Telugu transliteration expert..."
User     : <Telugu script text>
Assistant: <Roman transliteration>

Labels were masked to -100 for the system and user prompt tokens, so the cross-entropy loss is computed only on the assistant response (the transliterated output). On average, 20.06% of tokens per sample were unmasked (e.g., 63 of 314 tokens for a representative sample).
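The masking step can be sketched as below. `prompt_len` (the token count of the rendered system + user turns) is a hypothetical helper value for illustration, not a name from the training script.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, masking everything before the assistant
    response so loss is computed only on the transliteration tokens."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Toy example: 5 prompt tokens followed by 3 assistant tokens.
ids = [11, 22, 33, 44, 55, 66, 77, 88]
print(mask_prompt_labels(ids, 5))
# → [-100, -100, -100, -100, -100, 66, 77, 88]
```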

Preprocessing

  • Tokenizer: Qwen3 BPE tokenizer, padding_side="right"
  • Max sequence length: 512 tokens
  • Padding: dynamic, padded to multiple of 8
  • Label padding: -100
  • Tokenization workers: 4
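A minimal sketch of the dynamic padding described above (right-pad to the smallest multiple of 8, with labels padded to -100). The real run presumably used a Transformers data collator; this pure-Python version only illustrates the logic.

```python
def pad_batch(batch, pad_token_id=0, multiple_of=8):
    """Right-pad input_ids with the pad token and labels with -100,
    up to the smallest multiple of `multiple_of` >= the longest sequence."""
    longest = max(len(ex["input_ids"]) for ex in batch)
    target = -(-longest // multiple_of) * multiple_of  # ceiling division
    for ex in batch:
        pad = target - len(ex["input_ids"])
        ex["input_ids"] = ex["input_ids"] + [pad_token_id] * pad
        ex["labels"] = ex["labels"] + [-100] * pad
    return batch

batch = [
    {"input_ids": [1, 2, 3], "labels": [-100, 2, 3]},
    {"input_ids": [4, 5, 6, 7, 8], "labels": [-100, -100, 6, 7, 8]},
]
padded = pad_batch(batch)
print([len(ex["input_ids"]) for ex in padded])  # → [8, 8]
```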

Training Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Training regime | bf16 mixed precision |
| Learning rate | 2e-6 |
| LR scheduler | Linear with warmup |
| Warmup steps | 50 |
| Per-device batch size | 4 |
| Gradient accumulation steps | 8 |
| Effective batch size | 96 (4 × 8 × 3 GPUs) |
| Epochs | 1 |
| Steps per epoch | 363 |
| Total steps | 364 |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Gradient checkpointing | ✅ (use_reentrant=False) |
| Optimizer | AdamW (HF Trainer default) |
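The table above maps onto a `TrainingArguments` configuration roughly as follows. This is a sketch reconstructed from the card, not the actual training script; the output path, logging cadence, and eval interval are assumptions.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qwen3-4b-telugu-transliteration",  # assumed path
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch = 4 × 8 × 3 GPUs = 96
    learning_rate=2e-6,
    lr_scheduler_type="linear",
    warmup_steps=50,
    weight_decay=0.01,
    max_grad_norm=1.0,
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    eval_strategy="steps",
    eval_steps=500,   # matches the eval cadence noted under Metrics
    logging_steps=50,  # assumed, consistent with the logged metric steps
)
```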

Training Metrics

All values extracted from the TensorBoard event log.

| Step | Train Loss | Grad Norm | Learning Rate | Epoch |
| --- | --- | --- | --- | --- |
| 50 | 2.4207 | 4.3750 | 2.00e-6 | 0.138 |
| 100 | 1.1400 | 1.2578 | 2.00e-6 | 0.275 |
| 150 | 0.7542 | 0.8164 | 1.50e-6 | 0.413 |
| 200 | 0.6720 | 0.7148 | 1.50e-6 | 0.550 |
| 250 | 0.6414 | 0.6563 | 1.00e-6 | 0.688 |
| 300 | 0.6273 | 0.5820 | 5.00e-7 | 0.825 |
| 350 | 0.6220 | 0.6758 | ~0 | 0.963 |
| 364 (final) | | | | 1.000 |

Final training summary (step 364):

| Metric | Value |
| --- | --- |
| Train loss (epoch avg) | 0.9693 |
| Train runtime | 1,433.7 s (~23.9 min) |
| Samples/second | 24.34 |
| Steps/second | 0.254 |
| Total FLOPs | 2.57 × 10¹⁷ |
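As a quick sanity check, the throughput figures in the summary are internally consistent with the step count, sample count, and runtime:

```python
runtime_s = 1433.7     # train runtime in seconds
total_steps = 364
train_samples = 34_891

print(round(train_samples / runtime_s, 2))  # samples/second → 24.34
print(round(total_steps / runtime_s, 3))    # steps/second   → 0.254
```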

The loss dropped rapidly from 2.42 → 0.62 in the first 350 steps, demonstrating fast and stable convergence for this transliteration task.


Evaluation

Testing Data

The validation set consists of 8,723 held-out samples from the same pavanmantha/telugu_transliteration_40k dataset, split with seed=42.

Metrics

The primary training objective is cross-entropy loss (on unmasked assistant tokens only). Evaluation loss was tracked every 500 steps via eval_strategy="steps". Additional task-specific metrics (CER, WER, exact match) can be computed post-hoc using the inference snippet above.
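Character error rate can be computed post-hoc with a standard Levenshtein edit distance over model outputs versus reference transliterations. A self-contained sketch follows; the reference/hypothesis strings are illustrative, not drawn from the validation set.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

ref = "manava parinama prakriya"
hyp = "manava parinaama prakriya"  # one inserted character
print(round(cer(ref, hyp), 4))  # → 0.0417
```

The same loop, aggregated over the 8,723 validation pairs produced by the inference snippet above, gives corpus-level CER; exact match is simply `reference == hypothesis`.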


Bias, Risks, and Limitations

  • The model is trained on a single dataset and may not generalize well to specialized domains (legal, medical, technical Telugu).
  • Transliteration conventions vary across regions and systems (e.g., ISO 15919 vs. informal conventions); the model reflects the conventions present in the training data.
  • Rare or archaic Telugu vocabulary may be poorly handled.
  • The model carries any biases inherent in the underlying Qwen3-4B-Instruct base model.

Recommendations

Users should validate outputs against a known-good reference for high-stakes applications. For best results, inputs should be clean, well-formed Telugu script sentences within the general domain of the training data.


Environmental Impact

  • Hardware: 3× NVIDIA L40S (46 GB each)
  • Training time: ~23.9 minutes (1,433.7 seconds)
  • Cloud provider: Private compute cluster
  • Compute region: India
  • Estimated CO₂: Minimal — single short epoch on 3 GPUs

Carbon emissions can be estimated using the Machine Learning Impact calculator.


Technical Specifications

Model Architecture

  • Architecture: Qwen3ForCausalLM (decoder-only transformer)
  • Parameters: 4.02B
  • Dtype: bfloat16
  • Attention: Standard scaled dot-product (Flash Attention 2 not used in this run)
  • Training strategy: Full fine-tune with DDP (DistributedDataParallel) across 3 GPUs
  • use_cache: Disabled during training

Compute Infrastructure

  • GPUs: 3× NVIDIA L40S, 46 GB VRAM each
  • Driver: 550.127.05 | CUDA 12.8
  • Framework: PyTorch + HuggingFace Transformers + Accelerate
  • Launcher: torchrun --nproc_per_node=3

Software Versions

| Library | Role |
| --- | --- |
| transformers | Model, Trainer, TrainingArguments |
| torch | DDP, autocast, gradient checkpointing |
| accelerate | Distributed backend |
| datasets | Data loading and tokenization |
| tokenizers | Qwen3 BPE tokenizer |

Citation

If you use this model, please cite it; consider also citing the Qwen3 base model and the training dataset:

@misc{qwen3-4b-telugu-transliteration,
  author       = {Pavan Kumar Mantha},
  title        = {Qwen3-4B Telugu Transliteration},
  year         = {2025},
  publisher    = {HuggingFace},
  url          = {https://huggingface.co/pavanmantha/qwen3-4b-telugu-transliteration}
}

Model Card Authors

Pavan Kumar Mantha — Distinguished AI Architect, PhD researcher in Generative AI (IIITDM Kurnool), MTech Data Science (BITS Pilani).

Model Card Contact

Hugging Face profile
