You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

Quran Phoneme-Level Tokenizer

This is a custom tokenizer for Quranic Arabic phoneme-level transcription, including diacritics (harakat).
It is designed for Whisper phoneme-level fine-tuning or other speech-to-text models.


Features

  • Handles Quran transliteration in Buckwalter format (e.g., bi, {ll~ahi, r~aHiym).
  • Preserves diacritics (fatha, kasra, damma, shadda, sukun).
  • Outputs phoneme-level tokens suitable for speech recognition fine-tuning.
  • Includes special tokens: <pad>, <s>, </s>, <unk>.

How to use

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("bahriddin/quran-phoneme-tokenizer")

# Encode phoneme text
phoneme_text = "b_i s_sukun m_i"
inputs = tokenizer(phoneme_text)

# Decode
decoded = tokenizer.decode(inputs["input_ids"])
print(decoded)
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support