Four Qwen3-ASR (0.6B and 1.7B) Fine-Tunes for Portuguese and Dutch.
Both the 1.7B and 0.6B variants of Alibaba's Qwen3-ASR, fine-tuned for European Portuguese and Dutch and bundled in a single collection.
Collection: https://huggingface.co/collections/yuriyvnv/qwen-asr-for-portuguese-and-dutch-17b-and-06b
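For a quick smoke test, the checkpoints should behave like any Hugging Face ASR model. A minimal sketch, assuming the fine-tunes load through transformers' automatic-speech-recognition pipeline (the post doesn't confirm this), with a model id inferred from the collection naming, so it may not be exact:

```python
# Minimal inference sketch -- pipeline compatibility is an assumption,
# and the model id below is a guess from the collection name.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/Qwen3-ASR-0.6B-PT",  # hypothetical id, check the collection
)
print(asr("sample_pt.wav")["text"])
```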
Headline numbers on the Common Voice 22 test set, zero-shot baseline → fine-tuned:
🇵🇹 Qwen3-ASR-1.7B-PT: 12.91% → 8.50% WER (-34%)
🇵🇹 Qwen3-ASR-0.6B-PT: 18.26% → 11.85% WER (-35%)
🇳🇱 Qwen3-ASR-1.7B-NL: 6.68% → 5.28% WER (-21%)
🇳🇱 Qwen3-ASR-0.6B-NL: 12.46% → 8.31% WER (-33%)
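The percentages in parentheses are relative WER reductions, (baseline − fine-tuned) / baseline, which is easy to verify:

```python
# Relative WER reduction for each model, reproducing the numbers above.
for name, base, ft in [("1.7B-PT", 12.91, 8.50), ("0.6B-PT", 18.26, 11.85),
                       ("1.7B-NL", 6.68, 5.28), ("0.6B-NL", 12.46, 8.31)]:
    print(f"{name}: -{100 * (base - ft) / base:.0f}%")  # -34, -35, -21, -33
```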
The 0.6B variants are the more interesting half of the release. At roughly a third of the parameters, they give up only a few WER points to the 1.7B, which matters for edge hardware, CPU inference, or anywhere inference cost needs to stay low. The Dutch 0.6B in particular lands at 8.31% WER on CV22, competitive with much larger systems.
The Dutch 1.7B started from a strong 6.7% zero-shot, so the absolute gain is smaller: Qwen already handles Dutch well, and the fine-tune mostly sharpens it on Common Voice's casing and punctuation conventions.
Training stuck close to Qwen's official SFT recipe (lr 2e-5, linear schedule, 2% warmup, bf16, gradient checkpointing on a single H100; a sketch of the config follows below). The data is the differentiator: Common Voice 22 train + validation, augmented with synthetic OpenAI-TTS speech and filtered by the WAVe multimodal embedding model, which scores clips at the word level and drops those that don't align well with their transcripts.
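Here is how the stated hyperparameters map onto transformers' TrainingArguments. This is a sketch, not the repo's actual script: batch size, epoch count, and paths aren't given in the post, so those values are placeholders.

```python
# Recipe sketch -- only lr, schedule, warmup, precision, and gradient
# checkpointing come from the post; the rest are placeholder values.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qwen3-asr-ft",        # placeholder path
    learning_rate=2e-5,               # lr 2e-5
    lr_scheduler_type="linear",       # linear schedule
    warmup_ratio=0.02,                # 2% warmup
    bf16=True,                        # bf16 precision
    gradient_checkpointing=True,      # fits training on a single H100
    per_device_train_batch_size=16,   # placeholder, not stated in the post
    num_train_epochs=3,               # placeholder, not stated in the post
)
```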
The full pipeline (synthetic data generation, WAVe filtering, training scripts, evaluation protocol) is open-source:
github.com/yuriyvnv/TTS-Augmented-ASR
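The repo defines the exact evaluation protocol; for orientation, the WER metric itself can be computed with the jiwer library (library choice mine, not necessarily what the repo uses):

```python
# WER = (substitutions + deletions + insertions) / reference word count.
import jiwer

refs = ["de kat zit op de mat"]
hyps = ["de kat zat op de mat"]   # one substitution out of six words
print(f"WER: {jiwer.wer(refs, hyps):.2%}")  # -> WER: 16.67%
```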
@hf-audio
#asr #speech #qwen #multilingual #fine-tuning #commonvoice