	---
	license: cc-by-4.0
	---
	# Whisper-Large-v2-hindi

	This model is a fine-tuned version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2), trained on the following Hindi datasets:
	| Dataset | Hours (Hi) | License | Source |
	|---|---|---|---|
	| Shrutilipi | ~1,558 h | CC BY 4.0 | [ai4bharat/shrutilipi](https://huggingface.co/datasets/ai4bharat/Shrutilipi) |
	| IITM Madras SpringLab | ~900 h | CC BY 4.0 | [SpringLab](https://asr.iitm.ac.in/dataset) |
	| Common Voice 11.0 (Mozilla) | ~20 h | CC0 1.0 (public domain) | [mozilla/commonvoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) |
	| IndicSUPERB | 150 h | Apache License 2.0 | [ai4bharat/indic-superb](https://github.com/AI4Bharat/IndicSUPERB) |
	| snow-mountain | 67.6 h | CC BY-SA 4.0 | [bridgeconn/snow-mountain](https://huggingface.co/datasets/bridgeconn/snow-mountain/) |
	| yodas | ~200 h | CC BY 3.0 | [espnet/yodas](https://huggingface.co/datasets/espnet/yodas) |
	| IndicVoices-R_Hindi | 75 h | CC BY 4.0 | [SPRINGLab/IndicVoices-R_Hindi](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi) |
	| Lahaja | 12.5 h | CC BY 4.0 | [ai4bharat/lahaja](https://ai4bharat.iitm.ac.in/datasets/lahaja) |
	| fleurs | 30.0 h | CC BY 4.0 | [google/fleurs](https://huggingface.co/datasets/google/fleurs) |

	The model was trained on around 3,000 hours of Hindi speech and is optimized for Hindi ASR, with a particular focus on high-accuracy transcription.

	## How to use
	The Whisper model natively processes audio samples of up to 30 s. With a chunking algorithm, it can transcribe audio of arbitrary length; this is available through the Transformers `pipeline` method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can also run batched inference, and it can predict sequence-level timestamps when called with `return_timestamps=True`:

	```python
	>>> import torch
	>>> from transformers import pipeline
	>>> from datasets import load_dataset

	>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

	>>> # Build the ASR pipeline; chunk_length_s=30 enables long-form chunking
	>>> asr_pipe = pipeline(
	...     "automatic-speech-recognition",
	...     model="collabora/whisper-large-v2-hindi",
	...     chunk_length_s=30,
	...     device=device,
	... )

	>>> ds = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="validation")
	>>> sample = ds[0]["audio"]
	>>> prediction = asr_pipe(sample.copy(), return_timestamps=True)
	>>> prediction
	{'text': ' हमने उस उम्मीदवार को चुना।', 'chunks': [{'timestamp': (0.0, 4.42), 'text': ' हमने उस उम्मीदवार को चुना।'}]}
	```
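
The `chunks` list returned with `return_timestamps=True` can be post-processed in plain Python. The following is a minimal sketch, assuming output shaped like the dictionary shown above; `format_timestamp` and `format_chunks` are hypothetical helper names, not part of Transformers. An end time of `None` is handled defensively, on the assumption that a chunk may lack one.

```python
from typing import Optional


def format_timestamp(seconds: Optional[float]) -> str:
    """Render seconds as MM:SS.ss; an unknown time becomes '??:??.??'."""
    if seconds is None:
        return "??:??.??"
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:05.2f}"


def format_chunks(prediction: dict) -> list:
    """Turn the pipeline's 'chunks' list into '[start -> end] text' lines."""
    lines = []
    for chunk in prediction.get("chunks", []):
        start, end = chunk["timestamp"]
        span = f"[{format_timestamp(start)} -> {format_timestamp(end)}]"
        lines.append(f"{span} {chunk['text'].strip()}")
    return lines


# Sample copied from the pipeline output shown above
prediction = {
    "text": " हमने उस उम्मीदवार को चुना।",
    "chunks": [{"timestamp": (0.0, 4.42), "text": " हमने उस उम्मीदवार को चुना।"}],
}

for line in format_chunks(prediction):
    print(line)
# prints: [00:00.00 -> 00:04.42] हमने उस उम्मीदवार को चुना।
```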

	## Intended Use
	- The model is designed for high-quality transcription of Hindi speech.
	- It is suitable for academic use in ASR-related tasks.

	## Limitations
	- May not perform well on noisy or low-quality audio.
	- Focused primarily on Hindi.

	## Model Performance
	Whisper's text normalization is counterproductive for Hindi: it strips the vowel signs (matras) and other combining marks that Hindi words depend on, so the normalized text loses its meaning. For example, consider the Hindi phrase:
	```
	'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
	```

	After Whisper normalization:
	```
	'कषतरफल बढन स उतप दन बढ'
	```
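
The loss shown above can be reproduced with the standard library alone. The sketch below is a simplified stand-in for Whisper's normalizer, not its actual code: it assumes a rule set consistent with the output above, namely dropping nonspacing marks and replacing other marks, symbols, and punctuation with spaces. `strip_marks` is a hypothetical name.

```python
import re
import unicodedata


def strip_marks(text: str) -> str:
    """Drop nonspacing marks; turn other marks, symbols, punctuation into spaces."""
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        category = unicodedata.category(ch)
        if category == "Mn":
            # Nonspacing marks (virama, nukta, many vowel signs) vanish entirely
            continue
        # Spacing marks, symbols, and punctuation become a space
        out.append(" " if category[0] in "MSP" else ch)
    # Collapse the runs of spaces created above
    return re.sub(r"\s+", " ", "".join(out)).strip()


result = strip_marks("क्षेत्रफल बढ़ने से उत्पादन बढ़ा।")
print(result)
# prints: कषतरफल बढन स उतप दन बढ
```

Every conjunct and vowel sign is gone, so distinct Hindi words can collapse into the same string.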

	We therefore use [indic-normalization](https://github.com/anoopkunchukuttan/indic_nlp_library/blob/4cead0ae6c78fe9a19a51ef679f586206df9c476/indicnlp/normalize/indic_normalize.py#L325) for evaluation. Indic normalization leaves the phrase intact:
	```
	'क्षेत्रफल बढ़ने से उत्पादन बढ़ा।'
	```
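
For readers who want to apply the same normalizer, here is a minimal sketch using the IndicNLP Library's normalizer factory. It assumes the `indic-nlp-library` package is installed; `indic_normalize` is a hypothetical wrapper name, and this is not necessarily the exact evaluation code used for the numbers in this card.

```python
def indic_normalize(text: str, lang: str = "hi") -> str:
    """Normalize text with the IndicNLP normalizer for the given language code."""
    # Imported lazily so this module loads even without the optional dependency
    from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

    normalizer = IndicNormalizerFactory().get_normalizer(lang)
    return normalizer.normalize(text)
```

Calling `indic_normalize` on the phrase above should return it with its vowel signs and conjuncts preserved.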

	Baseline results for `openai/whisper-large-v2` on `google/fleurs -- hindi`:
	```
	Word Error Rate (WER) with whisper norm: 21.45 %
	Word Error Rate (WER) with indic norm: 38.46 %
	```

	This fine-tuned model achieves the following results on the held-out test set `google/fleurs -- hindi`:
	```
	Word Error Rate (WER) with whisper norm: 5.33 %
	Word Error Rate (WER) with indic norm: 13.06 %
	```

	Indic normalization retains diacritics and complex characters in Hindi text. This raises the measured Word Error Rate (WER) compared with Whisper's default normalization, but it keeps the compared text semantically intact, so the score reflects transcription accuracy more faithfully.
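
To illustrate why a mark-stripping normalizer lowers WER, the sketch below computes WER as word-level edit distance divided by the reference length, then compares one sentence pair with and without marks. It is not the evaluation script behind the numbers above; `word_error_rate` and `drop_marks` are hypothetical names, `drop_marks` is a crude stand-in for a normalizer that removes combining marks, and the sentence pair is invented for illustration.

```python
import unicodedata


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def drop_marks(text: str) -> str:
    """Remove all combining marks (an assumption, not Whisper's actual code)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.category(c).startswith("M"))


reference = "वह घर से आया"
hypothesis = "वह घर सो आया"  # one wrong vowel sign: से -> सो

print(word_error_rate(reference, hypothesis))                          # 0.25
print(word_error_rate(drop_marks(reference), drop_marks(hypothesis)))  # 0.0
```

Once the marks are removed, the wrong word becomes identical to the reference word, so the error disappears from the score even though the transcription is still wrong.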

	## Acknowledgments

	We thank the contributors and organizations behind the datasets:

	- [AI4Bharat](https://ai4bharat.iitm.ac.in/datasets/shrutilipi) for the Shrutilipi dataset.

	- [IIT Madras SpringLab](https://asr.iitm.ac.in/dataset) for their springx-hindi dataset.

	- [IndicNLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library) by Anoop Kunchukuttan for providing normalization tools that were crucial for evaluation.


	### BibTeX entry and citation info

	#### Model Citation
	```bibtex
	@misc{whisper-large-v2-hindi,
	title = {Whisper-Large-v2 Fine-Tuned on Hindi},
	author = {Collabora Ltd.},
	year = {2025},
	publisher = {Hugging Face},
	note = {Fine-tuned using Shrutilipi and IITM Madras SpringLab datasets},
	howpublished = {\url{https://huggingface.co/collabora/whisper-large-v2-hindi/}},
	}
	```

	#### IndicNLP Library Citation
	```bibtex
	@misc{kunchukuttan2020indicnlp,
	author = "Anoop Kunchukuttan",
	title = "{The IndicNLP Library}",
	year = "2020",
	howpublished={\url{https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf}}
	}
	```

	#### AI4Bharat - Shrutilipi dataset
	```bibtex
	@misc{https://doi.org/10.48550/arxiv.2208.12666,
	doi = {10.48550/ARXIV.2208.12666},
	url = {https://arxiv.org/abs/2208.12666},
	author = {Bhogale, Kaushal Santosh and Raman, Abhigyan and Javed, Tahir and Doddapaneni, Sumanth and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
	title = {Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages},
	publisher = {arXiv},
	year = {2022},
	copyright = {arXiv.org perpetual, non-exclusive license}
	}
	```