reazonspeech-nemo-v2

reazonspeech-nemo-v2 is an automatic speech recognition model trained on the ReazonSpeech v2.0 corpus.

This model supports inference on long-form Japanese audio clips of up to several hours.

Model Architecture

The model uses the improved Conformer architecture introduced in the paper "Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition".

The model is a subword-based RNN-T with 619M parameters in total.
The encoder uses Longformer attention with a local context size of 256 and a single global token.
The decoder has a vocabulary of 3000 tokens built with a SentencePiece unigram tokenizer.
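
To make the encoder's attention pattern concrete, here is a minimal sketch of a Longformer-style mask in plain Python. It is an illustration, not the model's actual implementation. It assumes that the local context of 256 is split evenly to the left and right of each position, and that the global token sits at position 0; neither detail is stated above. The function name `local_global_mask` is hypothetical.

```python
def local_global_mask(seq_len, context=256, num_global=1):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumptions (not confirmed by the model card):
      * the local window is symmetric: context // 2 frames on each side
      * the global tokens occupy the first `num_global` positions
    """
    half = context // 2
    # Local band: each position sees neighbours within `half` steps.
    mask = [[abs(i - j) <= half for j in range(seq_len)]
            for i in range(seq_len)]
    # Global tokens attend everywhere and are visible from everywhere.
    for g in range(min(num_global, seq_len)):
        for k in range(seq_len):
            mask[g][k] = True
            mask[k][g] = True
    return mask


if __name__ == "__main__":
    # Small window so the pattern is easy to read.
    for row in local_global_mask(10, context=4, num_global=1):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions, every non-global row allows at most `context + 1 + num_global` positions, so the attention cost grows linearly with the audio length rather than quadratically. This is what makes multi-hour inputs tractable.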

We trained this model for 1 million steps using the AdamW optimizer with a Noam annealing schedule.
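
For readers unfamiliar with it, the Noam schedule can be sketched as below. This follows the commonly used formulation (linear warmup, then inverse-square-root decay). The values of `d_model`, `warmup_steps` and `scale` are hypothetical placeholders, since the hyperparameters actually used for training are not given here.

```python
def noam_lr(step, d_model=512, warmup_steps=10_000, scale=1.0, min_lr=0.0):
    """Learning rate at `step` under a Noam-style schedule.

    All default values are illustrative assumptions, not the values
    used to train reazonspeech-nemo-v2.
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    warmup_term = step * warmup_steps ** -1.5   # rises linearly
    decay_term = step ** -0.5                   # falls as 1 / sqrt(step)
    lr = scale * d_model ** -0.5 * min(warmup_term, decay_term)
    return max(lr, min_lr)


if __name__ == "__main__":
    for s in (1, 1_000, 10_000, 100_000, 1_000_000):
        print(s, noam_lr(s))
```

The rate peaks at `step == warmup_steps`, where the two terms meet, and decays slowly afterwards, which suits long runs such as the 1 million steps mentioned above.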

Usage

We recommend using this model through our reazonspeech library.

from reazonspeech.nemo.asr import load_model, transcribe, audio_from_path

audio = audio_from_path("speech.wav")
model = load_model()
ret = transcribe(model, audio)
print(ret.text)

License

Apache License 2.0

Papers for reazon-research/reazonspeech-nemo-v2

Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition. arXiv:2305.05084, published May 8, 2023.

Longformer: The Long-Document Transformer. arXiv:2004.05150, published Apr 10, 2020.