| ---
|
| license: cc-by-4.0
|
| ---
|
| # Whisper-Large-v2-hindi
|
|
|
| This model is a version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) fine-tuned on the following Hindi speech datasets:
|
| | Dataset | Hours (Hindi) | License | Source |
| |----------------------------------------|------------|-----------------------------------|------------------------------------------------------------------------|
| | **Shrutilipi** | ~1,558 h | CC BY 4.0 | [ai4bharat/Shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) |
| | **IIT Madras SpringLab** | ~900 h | CC BY 4.0 | [SpringLab](https://asr.iitm.ac.in/dataset) |
| | **Common Voice 11.0 (Mozilla)** | ~20 h | CC0 1.0 (public domain) | [mozilla-foundation/common_voice_11_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) |
| | **IndicSUPERB** | 150 h | Apache License 2.0 | [AI4Bharat/IndicSUPERB](https://github.com/AI4Bharat/IndicSUPERB) |
| | **snow-mountain** | 67.6 h | CC BY-SA 4.0 | [bridgeconn/snow-mountain](https://huggingface.co/datasets/bridgeconn/snow-mountain/) |
| | **yodas** | ~200 h | CC BY 3.0 | [espnet/yodas](https://huggingface.co/datasets/espnet/yodas) |
| | **IndicVoices-R_Hindi** | 75 h | CC BY 4.0 | [SPRINGLab/IndicVoices-R_Hindi](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi) |
| | **Lahaja** | 12.5 h | CC BY 4.0 | [ai4bharat/lahaja](https://ai4bharat.iitm.ac.in/datasets/lahaja) |
| | **fleurs** | 30.0 h | CC BY 4.0 | [google/fleurs](https://huggingface.co/datasets/google/fleurs) |
|
|
|
| The model was trained on roughly 3,000 hours of Hindi speech and is optimized for Hindi automatic speech recognition (ASR), with a particular focus on high-accuracy transcription.
|
|
|
| ## How to use
|
| The Whisper model is designed to work on audio samples of up to 30 s in duration. With a chunking algorithm, however, it can transcribe audio of arbitrary length. The Transformers `pipeline` method supports this: chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can also run batched inference, and it can predict sequence-level timestamps when `return_timestamps=True` is passed:
|
|
|
| ```python
|
| >>> import torch
|
| >>> from transformers import pipeline
|
| >>> from datasets import load_dataset
|
|
|
| >>> device = "cuda:0" if torch.cuda.is_available() else "cpu"
|
|
|
| >>> asr_pipe = pipeline(
|
| ...     "automatic-speech-recognition",
|
| ...     model="collabora/whisper-large-v2-hindi",
|
| ...     chunk_length_s=30,
|
| ...     device=device,
|
| ... )
|
|
|
| >>> ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation")
|
| >>> sample = ds[0]["audio"]
|
| >>> prediction = asr_pipe(sample.copy(), return_timestamps=True)
|
| >>> print(prediction)
|
| {'text': ' हमने उस उम्मीदवार को चुना।', 'chunks': [{'timestamp': (0.0, 4.42), 'text': ' हमने उस उम्मीदवार को चुना।'}]}
|
| ```
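To make the chunking step concrete, here is a minimal, standard-library-only sketch of how a long recording can be cut into overlapping 30 s windows. It is an illustration, not the Transformers implementation: `chunk_spans` and its `stride_s` parameter are hypothetical names, and the 5 s overlap on each side of a window is an assumption chosen for the example.

```python
def chunk_spans(total_s: float, chunk_s: float = 30.0, stride_s: float = 5.0):
    """Return (start, end) times in seconds of overlapping windows.

    Hypothetical helper: each window is at most `chunk_s` long and
    consecutive windows overlap by 2 * stride_s seconds.
    """
    if chunk_s <= 2 * stride_s:
        raise ValueError("chunk_s must be larger than twice stride_s")

    step = chunk_s - 2 * stride_s  # how far each new window advances
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:  # the last window reaches the end of the audio
            break
        start += step
    return spans


# A 70 s recording is covered by three overlapping windows.
for span in chunk_spans(70.0):
    print(span)
```

The overlapping regions give the model context on both sides of a chunk boundary, which is what allows the per-chunk transcripts to be stitched back together.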
|
|
|
| ## Intended Use
|
| - The model is designed for high-quality transcription of Hindi speech.
|
| - It is also suitable for academic use in ASR-related tasks.
|
|
|
| ## Limitations
|
| - May not perform well on noisy or low-quality audio.
|
| - Focused primarily on Hindi.
|
|
|
| ## Model Performance
|
| Whisper's text normalization is counterproductive for Hindi because it strips diacritics such as vowel signs, the virama, and the nukta, which destroys the meaning of a sentence. For example, consider the Hindi phrase:
|
| ```
|
| 'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
|
| ```
|
|
|
| After Whisper normalization:
|
| ```
|
| 'कषतरफल बढन स उतप दन बढ'
|
| ```
|
|
|
| We therefore use [indic-normalization](https://github.com/anoopkunchukuttan/indic_nlp_library/blob/4cead0ae6c78fe9a19a51ef679f586206df9c476/indicnlp/normalize/indic_normalize.py#L325) for evaluation. It leaves the sentence intact and produces the following output:
|
| ```
|
| 'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
|
| ```
|
|
|
| `openai-whisper/large-v2` baseline results on `google/fleurs -- hindi`:
|
| ```
|
| Word Error Rate (WER) with whisper norm: 21.45 %
|
| Word Error Rate (WER) with indic norm: 38.46 %
|
| ```
|
|
|
| The fine-tuned model achieves the following results on the held-out test set `google/fleurs -- hindi`:
|
| ```
|
| Word Error Rate (WER) with whisper norm: 5.33 %
|
| Word Error Rate (WER) with indic norm: 13.06 %
|
| ```
|
|
|
| Indic normalization retains diacritics and conjunct characters in Hindi text. This yields a higher Word Error Rate (WER) than Whisper's default normalization, which hides errors in diacritics, but the resulting score reflects the accuracy of the transcription more faithfully.
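The gap between the two WER figures can be reproduced on a single sentence. The sketch below is a rough illustration under stated assumptions: `strip_marks` is a hypothetical, simplified stand-in for a normalizer that removes diacritics (it is not Whisper's actual normalizer), `wer` is a plain word-level edit distance, and the hypothesis sentence with a wrong final vowel sign is invented for the example.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Simplified stand-in for a diacritic-stripping normalizer (assumption).

    Nonspacing marks (category Mn) are dropped; other marks, symbols and
    punctuation become spaces; runs of whitespace are then collapsed.
    """
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        cat = unicodedata.category(ch)
        if cat == "Mn":
            continue
        out.append(" " if cat[0] in "MSP" else ch)
    return " ".join("".join(out).split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


ref = "क्षेत्रफल बढ़ने से उत्पादन बढ़ा।"
hyp = "क्षेत्रफल बढ़ने से उत्पादन बढ़ी।"  # invented error: wrong final vowel sign

print(strip_marks(ref))                         # diacritics are gone
print(wer(ref, hyp))                            # 0.2: the error is counted
print(wer(strip_marks(ref), strip_marks(hyp)))  # 0.0: the error is hidden
```

Once the vowel signs are stripped, the two sentences become identical, so the mistake no longer contributes to the WER. This is why a normalizer that preserves diacritics reports a higher but more informative error rate.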
|
|
|
| ## Acknowledgments
|
|
|
| We thank the contributors and organizations behind the datasets:
|
|
|
| - [AI4Bharat](https://ai4bharat.iitm.ac.in/datasets/shrutilipi) for the Shrutilipi dataset.
|
|
|
| - [IIT Madras SpringLab](https://asr.iitm.ac.in/dataset) for their springx-hindi dataset.
|
|
|
| - [IndicNLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library) by Anoop Kunchukuttan for providing normalization tools that were crucial for evaluation.
|
|
|
|
|
| ## BibTeX entry and citation info
|
|
|
| #### Model Citation
|
| ```bibtex
|
| @misc{whisper-large-v2-hindi,
|
| title = {Whisper-Large-v2 Fine-Tuned on Hindi},
|
| author = {Collabora Ltd.},
|
| year = {2025},
|
| publisher = {Hugging Face},
|
| note = {Fine-tuned using Shrutilipi and IITM Madras SpringLab datasets},
|
| howpublished = {\url{https://huggingface.co/collabora/whisper-large-v2-hindi/}},
|
| }
|
| ```
|
|
|
| #### IndicNLP Library Citation
|
| ```bibtex
|
| @misc{kunchukuttan2020indicnlp,
|
| author = "Anoop Kunchukuttan",
|
| title = "{The IndicNLP Library}",
|
| year = "2020",
|
| howpublished={\url{https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf}}
|
| }
|
| ```
|
|
|
| #### AI4Bharat - Shrutilipi dataset
|
| ```bibtex
|
| @misc{https://doi.org/10.48550/arxiv.2208.12666,
|
| doi = {10.48550/ARXIV.2208.12666},
|
| url = {https://arxiv.org/abs/2208.12666},
|
| author = {Bhogale, Kaushal Santosh and Raman, Abhigyan and Javed, Tahir and Doddapaneni, Sumanth and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
|
| title = {Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages},
|
| publisher = {arXiv},
|
| year = {2022},
|
| copyright = {arXiv.org perpetual, non-exclusive license}
|
| }
|
| ```
|
| |