# Small100 · Singlish → Sinhala Transliteration
SMaLL-100 fine-tuned for Singlish (romanised Sinhala) → Sinhala-script transliteration, using a two-phase LoRA training strategy.
## Training Strategy
| Phase | Dataset | Steps | LR |
|---|---|---|---|
| Phase 1 | Phonetic corpus (400,000 pairs) | 9,000 | 3e-4 |
| Phase 2 | Ad-hoc curated (10,000 × 10) | 4,000 | 6e-5 |
LoRA config: `r=32`, `alpha=64`, `dropout=0.05`

Target modules: `q_proj`, `k_proj`, `v_proj`, `out_proj`
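As a quick refresher on what those hyperparameters mean: LoRA keeps each target projection frozen and learns two low-rank factors `A` and `B`, adding `(alpha / r) * B @ A` to the frozen weight. A minimal NumPy sketch with the card's `r=32`, `alpha=64` (the `d_model` dimension here is illustrative, not the model's actual size):

```python
import numpy as np

d_model, r, alpha = 1024, 32, 64   # r and alpha from the card; d_model is a placeholder
rng = np.random.default_rng(0)

W = rng.standard_normal((d_model, d_model))    # frozen pretrained projection (e.g. q_proj)
A = rng.standard_normal((r, d_model)) * 0.01   # trainable down-projection
B = np.zeros((d_model, r))                     # trainable up-projection, zero-initialised

# Effective weight after merging the adapter into the base matrix
W_eff = W + (alpha / r) * B @ A

# With B zero-initialised, the adapter starts as an exact no-op
assert np.allclose(W_eff, W)
```

The scaling factor `alpha / r = 2` means the adapter's contribution is amplified relative to its raw low-rank product, which is the standard LoRA convention.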
## Results — IndoNLP 2025 (Test 1 + Test 2 combined)
| Metric | Phase 1 | Final |
|---|---|---|
| BLEU-char | 83.5015 | 83.0713 |
| WER | 0.2880 | 0.3385 |
| CER | 0.1017 | 0.0994 |
| ExactMatch | 0.1359 | 0.0772 |
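For reference, CER in tables like this is typically total character-level edit distance divided by total reference characters (WER is the same at the word level). A minimal pure-Python sketch of that computation — an illustration, not the official IndoNLP scorer:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two sequences (insert/delete/substitute, cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(hyps: list[str], refs: list[str]) -> float:
    """Corpus-level CER: total edits over total reference characters."""
    edits = sum(levenshtein(h, r) for h, r in zip(hyps, refs))
    chars = sum(len(r) for r in refs)
    return edits / chars

print(cer(["kitten"], ["sitting"]))  # 3 edits / 7 chars ≈ 0.4286
```

By this definition the final model's CER of 0.0994 means roughly one character edit per ten reference characters.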
## Inference
```python
import sys
import requests

# Download the custom SMaLL-100 tokenizer implementation shipped with the repo
tok_url = 'https://huggingface.co/Pudamya/small100-singlish-sinhala-transliteration-2phase/resolve/main/tokenization_small100.py'
with open('tokenization_small100.py', 'wb') as f:
    f.write(requests.get(tok_url).content)
sys.path.insert(0, '.')

from tokenization_small100 import SMALL100Tokenizer
from transformers import M2M100ForConditionalGeneration

repo = 'Pudamya/small100-singlish-sinhala-transliteration-2phase'
tokenizer = SMALL100Tokenizer.from_pretrained(repo)
tokenizer.src_lang = 'en'
tokenizer.tgt_lang = 'si'

model = M2M100ForConditionalGeneration.from_pretrained(repo)
model.eval()

# Batch transliteration with beam search
inputs = ['mage nama pudamya', 'oya kohomada', 'api yamu']
enc = tokenizer(inputs, return_tensors='pt', padding=True, truncation=True, max_length=128)
out = model.generate(**enc, num_beams=5, max_length=128, length_penalty=0.9)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```
## Evaluation Plots
See the `eval/` folder for training loss curves and metric bar charts.
## Base model

Fine-tuned from [alirezamsh/small100](https://huggingface.co/alirezamsh/small100).