whisper-large-v3-yue-test4-combined

A Whisper large-v3 model fine-tuned for Cantonese (yue) automatic speech recognition.

Evaluation Results

Metric                  Value
CER (no punctuation)    7.79%
CER (raw)               9.72%
Eval Loss               0.2025
Best Step               3500
Best Epoch              19.03

Training History

Step    Epoch    Eval Loss    CER (no punctuation)    CER (raw)
500     2.03     0.8699       11.57%                  16.47%
1000    5.02     0.4013       9.15%                   13.10%
1500    8.02     0.2469       8.45%                   10.73%
2000    11.01    0.2211       8.07%                   9.88%
2500    14.00    0.2110       7.95%                   9.93%
3000    16.03    0.2056       7.85%                   9.80%
3500    19.03    0.2025       7.79%                   9.72%
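
As a quick sanity check on the table above, the sketch below copies the rows into a list and picks the checkpoint with the lowest no-punctuation CER. The variable names are illustrative only, and selecting on no-punctuation CER is an assumption; the card does not say which metric was used to choose the best step.

```python
# Rows copied from the training history table:
# (step, epoch, eval_loss, cer_no_punct_pct, cer_raw_pct)
history = [
    (500, 2.03, 0.8699, 11.57, 16.47),
    (1000, 5.02, 0.4013, 9.15, 13.10),
    (1500, 8.02, 0.2469, 8.45, 10.73),
    (2000, 11.01, 0.2211, 8.07, 9.88),
    (2500, 14.00, 0.2110, 7.95, 9.93),
    (3000, 16.03, 0.2056, 7.85, 9.80),
    (3500, 19.03, 0.2025, 7.79, 9.72),
]

# Assumed selection rule: lowest CER after punctuation removal.
best = min(history, key=lambda row: row[3])

# Relative CER reduction between the first and the best evaluation.
first = history[0]
relative_drop = (first[3] - best[3]) / first[3]

print(f"best step: {best[0]} (epoch {best[1]})")
print(f"relative CER reduction since step {first[0]}: {relative_drop:.1%}")
```

With these rows the selected step is 3500, which matches the reported best step. It is also the step with the lowest eval loss and the lowest raw CER.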

Training Details

  • Dataset: mozilla-foundation/common_voice_17_0 (yue)
  • Language: Cantonese (yue)
  • Task: Automatic Speech Recognition (ASR)
  • Architecture: Encoder-Decoder (Seq2Seq)
  • Metric: Character Error Rate (CER)
  • Total training steps: 3500
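
For readers unfamiliar with the metric, here is a minimal sketch of how CER can be computed, both on raw text and with punctuation removed. It is not the evaluation script used for this model: the helper names and the normalization (dropping Unicode punctuation and whitespace) are assumptions made for illustration.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def strip_punctuation(text: str) -> str:
    """Drop Unicode punctuation and whitespace (assumed normalization)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def cer(ref: str, hyp: str, remove_punct: bool = False) -> float:
    """Character error rate: edit distance divided by reference length."""
    if remove_punct:
        ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


reference = "你食咗飯未?"
hypothesis = "你食左飯未"  # one wrong character, no question mark

print(cer(reference, hypothesis))                     # raw CER
print(cer(reference, hypothesis, remove_punct=True))  # CER without punctuation
```

In this example the missing question mark counts as an error only in the raw score, which is why the raw CER is the higher of the two figures.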

Training Metrics

TensorBoard logs are included in the runs/ directory of this repository.

# Clone and view locally
git clone https://huggingface.co/awong-dev/whisper-large-v3-yue-test4-combined
tensorboard --logdir whisper-large-v3-yue-test4-combined/runs

Usage

import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "awong-dev/whisper-large-v3-yue-test4-combined"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Load audio as a (channels, samples) tensor and downmix to mono
audio, sr = torchaudio.load("audio.mp3")
audio = audio.mean(dim=0)

# Whisper expects 16 kHz input
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Convert the waveform to log-mel input features
input_features = processor(
    audio.numpy(), sampling_rate=16000, return_tensors="pt"
).input_features

# Generate token IDs and decode them to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
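
The snippet above handles a single window of audio; by default the Whisper feature extractor pads or truncates its input to 30 seconds. For longer recordings, one option is the Transformers `pipeline` API with chunking. The following is a sketch under that assumption, not a configuration tested with this model. The function name `transcribe_long` is hypothetical, and decoding a file path this way requires ffmpeg to be installed.

```python
def transcribe_long(
    path: str,
    model_id: str = "awong-dev/whisper-large-v3-yue-test4-combined",
    chunk_length_s: float = 30.0,
) -> str:
    """Transcribe an audio file of arbitrary length by chunking it."""
    # Imported lazily so that defining the helper does not load Transformers.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=chunk_length_s,  # split long audio into windows
    )
    # The pipeline decodes and resamples the file, then returns a dict
    # whose "text" entry holds the joined transcription.
    return asr(path)["text"]
```

A call such as `transcribe_long("audio.mp3")` downloads the model on first use and returns the transcription as a single string.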