# Quran Phoneme-Level Tokenizer
A custom tokenizer for phoneme-level transcription of Quranic Arabic, including diacritics (harakat).
It is designed for phoneme-level fine-tuning of Whisper or other speech-to-text models.
## Features
- Handles Quran transliteration in Buckwalter format (e.g., `bi,{ll~ahi,r~aHiym`).
- Preserves diacritics (fatha, kasra, damma, shadda, sukun).
- Outputs phoneme-level tokens suitable for speech recognition fine-tuning.
- Includes the special tokens `<pad>`, `<s>`, `</s>`, and `<unk>` (see the build sketch below).
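The snippet below is a minimal, hypothetical sketch of how a tokenizer with these properties could be assembled with the `tokenizers` library: one vocabulary entry per whitespace-separated phoneme, plus the special tokens listed above. The toy vocabulary and phoneme symbols are illustrative only; the published tokenizer ships its own, much larger phoneme/harakat vocabulary.

```python
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import PreTrainedTokenizerFast

# Toy vocabulary for illustration; the real tokenizer covers the full
# Buckwalter consonant and harakat inventory.
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3,
         "b_i": 4, "s_sukun": 5, "m_i": 6}

# One token per whitespace-separated phoneme symbol.
core = Tokenizer(models.WordLevel(vocab=vocab, unk_token="<unk>"))
core.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=core,
    pad_token="<pad>",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
)

print(tokenizer("b_i s_sukun m_i")["input_ids"])  # [4, 5, 6]
```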
## How to use
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("bahriddin/quran-phoneme-tokenizer")

# Encode phoneme text
phoneme_text = "b_i s_sukun m_i"
inputs = tokenizer(phoneme_text)

# Decode back to the phoneme string
decoded = tokenizer.decode(inputs["input_ids"])
print(decoded)
```
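For fine-tuning you typically feed padded batches rather than single strings. A brief sketch, assuming PyTorch is installed (the `<pad>` special token is what makes padding possible):

```python
# Pad a batch of phoneme strings to a common length and return tensors.
batch = ["b_i s_sukun m_i", "b_i"]
encoded = tokenizer(batch, padding=True, return_tensors="pt")
print(encoded["input_ids"].shape)   # (2, longest sequence in the batch)
print(encoded["attention_mask"])    # 0 marks padded positions
```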