Four Qwen3-ASR (0.6B and 1.7B) Fine-Tunes for Portuguese and Dutch.
Both the 1.7B and 0.6B variants of Alibaba's Qwen3-ASR, fine-tuned for European Portuguese and Dutch and bundled in a single collection.
Collection: https://huggingface.co/collections/yuriyvnv/qwen-asr-for-portuguese-and-dutch-17b-and-06b
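For a quick smoke test, the checkpoints should behave like any Hugging Face ASR model. A minimal sketch, assuming the fine-tunes load through transformers' automatic-speech-recognition pipeline (the post doesn't confirm this), with a model id inferred from the collection naming, so it may not be exact:

```python
# Minimal inference sketch -- pipeline compatibility is an assumption,
# and the model id below is a guess from the collection name.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/Qwen3-ASR-0.6B-PT",  # hypothetical id, check the collection
)
print(asr("sample_pt.wav")["text"])
```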
Headline numbers on the Common Voice 22 test set, zero-shot baseline → fine-tuned:
🇵🇹 Qwen3-ASR-1.7B-PT: 12.91% → 8.50% WER (-34%)
🇵🇹 Qwen3-ASR-0.6B-PT: 18.26% → 11.85% WER (-35%)
🇳🇱 Qwen3-ASR-1.7B-NL: 6.68% → 5.28% WER (-21%)
🇳🇱 Qwen3-ASR-0.6B-NL: 12.46% → 8.31% WER (-33%)
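The percentages in parentheses are relative WER reductions, (baseline − fine-tuned) / baseline, which is easy to verify:

```python
# Relative WER reduction for each model, reproducing the numbers above.
for name, base, ft in [("1.7B-PT", 12.91, 8.50), ("0.6B-PT", 18.26, 11.85),
                       ("1.7B-NL", 6.68, 5.28), ("0.6B-NL", 12.46, 8.31)]:
    print(f"{name}: -{100 * (base - ft) / base:.0f}%")  # -34, -35, -21, -33
```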
The 0.6B variants are the more interesting half of the release. At roughly a third of the parameters, they give up only a few WER points to the 1.7B, which matters for edge hardware, CPU inference, or anywhere inference cost needs to stay low. The Dutch 0.6B in particular lands at 8.31% WER on CV22, competitive with much larger systems.
The Dutch 1.7B started from a strong 6.7% zero-shot, so the absolute gain is smaller: Qwen already handles Dutch well, and the fine-tune mostly sharpens it on Common Voice's casing and punctuation conventions.
Training stuck close to Qwen's official SFT recipe (lr 2e-5, linear schedule, 2% warmup, bf16, gradient checkpointing on a single H100; a sketch of the config follows below). The data is the differentiator: Common Voice 22 train + validation, augmented with synthetic OpenAI-TTS speech and filtered by the WAVe multimodal embedding model, which scores clips at the word level and drops those that don't align well with their transcripts.
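Here is how the stated hyperparameters map onto transformers' TrainingArguments. This is a sketch, not the repo's actual script: batch size, epoch count, and paths aren't given in the post, so those values are placeholders.

```python
# Recipe sketch -- only lr, schedule, warmup, precision, and gradient
# checkpointing come from the post; the rest are placeholder values.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qwen3-asr-ft",        # placeholder path
    learning_rate=2e-5,               # lr 2e-5
    lr_scheduler_type="linear",       # linear schedule
    warmup_ratio=0.02,                # 2% warmup
    bf16=True,                        # bf16 precision
    gradient_checkpointing=True,      # fits training on a single H100
    per_device_train_batch_size=16,   # placeholder, not stated in the post
    num_train_epochs=3,               # placeholder, not stated in the post
)
```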
The full pipeline (synthetic data generation, WAVe filtering, training scripts, evaluation protocol) is open-source:
github.com/yuriyvnv/TTS-Augmented-ASR
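The repo defines the exact evaluation protocol; for orientation, the WER metric itself can be computed with the jiwer library (library choice mine, not necessarily what the repo uses):

```python
# WER = (substitutions + deletions + insertions) / reference word count.
import jiwer

refs = ["de kat zit op de mat"]
hyps = ["de kat zat op de mat"]   # one substitution out of six words
print(f"WER: {jiwer.wer(refs, hyps):.2%}")  # -> WER: 16.67%
```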
@hf-audio
#asr #speech #qwen #multilingual #fine-tuning #commonvoice