# SMaLL-100 · Singlish → Sinhala Transliteration

Fine-tuned SMaLL-100 for Singlish (Romanised Sinhala) → Sinhala script transliteration using a two-phase LoRA training strategy.

## Training Strategy

| Phase | Dataset | Steps | LR |
|---|---|---|---|
| Phase 1 | Phonetic corpus (400,000 pairs) | 9,000 | 3e-4 |
| Phase 2 | Ad-hoc curated (10,000 × 10) | 4,000 | 6e-5 |

LoRA config: r=32, alpha=64, dropout=0.05
Target modules: `q_proj`, `k_proj`, `v_proj`, `out_proj`
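With r=32, each targeted projection gets a pair of low-rank adapter matrices. As a rough sanity check on adapter size, the per-module trainable parameter count can be computed directly; the hidden size of 1024 below is an assumption about SMaLL-100's `d_model`, not stated in this card, and the total depends on how many attention blocks the model has:

```python
def lora_params_per_module(d_model: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_model x d_model projection:
    an A matrix (r x d_model) plus a B matrix (d_model x r)."""
    return 2 * r * d_model

# With r=32 and an assumed hidden size of 1024:
print(lora_params_per_module(1024, 32))  # 65536 per target module
```

Multiply by the four target modules per attention block and the number of attention blocks to estimate the total adapter size.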

## Results — IndoNLP 2025 (Test 1 + Test 2 combined)

| Metric | Phase 1 | Final |
|---|---|---|
| BLEU-char | 83.5015 | 83.0713 |
| WER | 0.2880 | 0.3385 |
| CER | 0.1017 | 0.0994 |
| Exact Match | 0.1359 | 0.0772 |
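The WER and CER figures above are standard edit-distance rates (Levenshtein distance over words or characters, divided by reference length). For checking outputs locally without a metrics library, a minimal reference implementation:

```python
def levenshtein(a, b):
    # Standard dynamic-programming edit distance over two sequences.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    # Character error rate: edit distance over reference character count.
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    # Word error rate: same distance over whitespace-separated tokens.
    return levenshtein(ref.split(), hyp.split()) / max(len(ref.split()), 1)
```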

## Inference

```python
import sys
import requests

# SMaLL-100 ships a custom tokenizer class; download it next to this script.
tok_url = 'https://huggingface.co/Pudamya/small100-singlish-sinhala-transliteration-2phase/resolve/main/tokenization_small100.py'
resp = requests.get(tok_url)
resp.raise_for_status()
with open('tokenization_small100.py', 'wb') as f:
    f.write(resp.content)

sys.path.insert(0, '.')
from tokenization_small100 import SMALL100Tokenizer
from transformers import M2M100ForConditionalGeneration

repo = 'Pudamya/small100-singlish-sinhala-transliteration-2phase'
tokenizer = SMALL100Tokenizer.from_pretrained(repo)
tokenizer.src_lang = 'en'  # Romanised Sinhala input
tokenizer.tgt_lang = 'si'  # Sinhala-script output

model = M2M100ForConditionalGeneration.from_pretrained(repo)
model.eval()

inputs = ['mage nama pudamya', 'oya kohomada', 'api yamu']
enc = tokenizer(inputs, return_tensors='pt', padding=True, truncation=True, max_length=128)
out = model.generate(**enc, num_beams=5, max_length=128, length_penalty=0.9)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```
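For longer input lists, padding everything into one tensor wastes memory; feeding the model in fixed-size chunks keeps batches bounded. A minimal chunking helper (the batch size of 16 is illustrative, not a tuned value):

```python
def chunked(items, batch_size=32):
    """Yield successive fixed-size batches from a list of input sentences."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Usage with the tokenizer and model loaded above:
# for batch in chunked(inputs, 16):
#     enc = tokenizer(batch, return_tensors='pt', padding=True,
#                     truncation=True, max_length=128)
#     out = model.generate(**enc, num_beams=5, max_length=128, length_penalty=0.9)
#     print(tokenizer.batch_decode(out, skip_special_tokens=True))
```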

## Evaluation Plots

See the eval/ folder for training loss curves and metric bar charts.
