mozilla-foundation/common_voice_17_0
Whisper large-v3 fine-tuned for Cantonese (yue) automatic speech recognition. The best checkpoint scores as follows, where CER is the character error rate:

| Metric | Value |
|---|---|
| CER (no punctuation) | 7.79% |
| CER (raw) | 9.72% |
| Eval Loss | 0.2025 |
| Best Step | 3500 |
| Best Epoch | 19.03 |
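
The two CER rows differ only in whether punctuation is removed before scoring. The sketch below shows one way such a pair of numbers could be computed. It is an illustration, not the evaluation script used for this model: the helper names (`strip_punctuation`, `edit_distance`, `cer`) are hypothetical, and the normalization (dropping every character in a Unicode punctuation category, plus whitespace) is an assumption, since the card does not state the exact rules.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop Unicode punctuation (categories P*) and whitespace (assumed normalization)."""
    return "".join(
        ch
        for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance, computed one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


ref = "你好,世界!"  # reference transcript with full-width punctuation
hyp = "你好世界"      # model output without punctuation

print(f"CER (raw): {cer(ref, hyp):.2%}")
print(f"CER (no punctuation): {cer(strip_punctuation(ref), strip_punctuation(hyp)):.2%}")
```

In this toy pair the hypothesis is missing only the two punctuation marks, so the raw CER is 2/6 while the punctuation-free CER is 0. The gap between the two rows in the table can be read the same way: part of the raw error comes from punctuation rather than from misrecognized characters.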

Evaluation metrics at each checkpoint:

| Step | Epoch | Eval Loss | CER (no punctuation) | CER (raw) |
|---|---|---|---|---|
| 500 | 2.03 | 0.8699 | 11.57% | 16.47% |
| 1000 | 5.02 | 0.4013 | 9.15% | 13.10% |
| 1500 | 8.02 | 0.2469 | 8.45% | 10.73% |
| 2000 | 11.01 | 0.2211 | 8.07% | 9.88% |
| 2500 | 14.00 | 0.2110 | 7.95% | 9.93% |
| 3000 | 16.03 | 0.2056 | 7.85% | 9.80% |
| 3500 | 19.03 | 0.2025 | 7.79% | 9.72% |
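
The "Best Step" of 3500 can be checked against this table. The card does not say which metric was used to select the best checkpoint, so the sketch below simply takes the minimum of each column in turn; the `history` list is the table above transcribed by hand, and the variable names are illustrative.

```python
# (step, eval_loss, cer_nopunct_percent, cer_raw_percent), copied from the table
history = [
    (500, 0.8699, 11.57, 16.47),
    (1000, 0.4013, 9.15, 13.10),
    (1500, 0.2469, 8.45, 10.73),
    (2000, 0.2211, 8.07, 9.88),
    (2500, 0.2110, 7.95, 9.93),
    (3000, 0.2056, 7.85, 9.80),
    (3500, 0.2025, 7.79, 9.72),
]

# Pick the checkpoint that minimizes each metric
best_by_loss = min(history, key=lambda row: row[1])[0]
best_by_cer_nopunct = min(history, key=lambda row: row[2])[0]
best_by_cer_raw = min(history, key=lambda row: row[3])[0]

print(best_by_loss, best_by_cer_nopunct, best_by_cer_raw)
```

All three criteria select step 3500, so the choice of selection metric does not matter for this run.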
TensorBoard logs are included in the runs/ directory of this repository.

```bash
# Clone and view locally
git clone https://huggingface.co/awong-dev/whisper-large-v3-yue-test4-combined
tensorboard --logdir whisper-large-v3-yue-test4-combined/runs
```

```python
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL_ID = "awong-dev/whisper-large-v3-yue-test4-combined"
processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)

# Load audio as a (channels, samples) tensor and downmix to mono
audio, sr = torchaudio.load("audio.mp3")
audio = audio.mean(dim=0)

# Whisper expects 16 kHz input
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Convert the waveform to log-mel features (padded or truncated to 30 s)
input_features = processor(
    audio.numpy(), sampling_rate=16000, return_tensors="pt"
).input_features

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
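
The snippet above transcribes a single window of at most 30 seconds, which is the input length Whisper's feature extractor works with. For longer recordings, one simple approach is to split the waveform into consecutive 30-second slices and transcribe each in turn. The helper below is a minimal sketch of that bookkeeping only; `chunk_bounds` is a hypothetical name, not part of `transformers` or `torchaudio`.

```python
def chunk_bounds(num_samples: int, sample_rate: int = 16000, chunk_seconds: float = 30.0):
    """Return (start, end) sample indices for consecutive, non-overlapping chunks."""
    chunk_len = int(sample_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("chunk length must be positive")
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


# A 70-second recording at 16 kHz yields two full chunks and one 10-second remainder
bounds = chunk_bounds(70 * 16000)
print(bounds)
```

Each `(start, end)` pair can be used to slice the mono waveform, and the slice passed through the processor and `model.generate` exactly as in the single-window example. Hard cuts may split a word across two chunks, so overlapping windows, or the `chunk_length_s` option of the `transformers` automatic speech recognition pipeline, may give cleaner results on long audio.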