Adding numpy input and quick starter code
Signed-off-by: taejinp <[email protected]>
README.md
NVIDIA Developer [Nemotron](https://developer.nvidia.com/nemotron)<br>
[NeMo Documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html)<br>

## 🚀 Quick Start: Run Diarization Now

Here is a short example script that loads the model, runs diarization on a WAV file, and prints the results:

```python
from nemo.collections.asr.models import SortformerEncLabelModel

# Load the pre-trained model from Hugging Face (a Hugging Face token may be required)
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2")
diar_model.eval()

# Configure streaming parameters
diar_model.sortformer_modules.chunk_len = 340
diar_model.sortformer_modules.chunk_right_context = 40
diar_model.sortformer_modules.fifo_len = 40
diar_model.sortformer_modules.spkcache_update_period = 300

# Run diarization on one audio file
predicted_segments = diar_model.diarize(audio=["/path/to/your/audio.wav"], batch_size=1)

# Print the speaker-marked segments
for segment in predicted_segments[0]:
    print(segment)
```
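Once you have `predicted_segments`, downstream bookkeeping is straightforward. As an illustration only, assuming each segment is a whitespace-separated `begin end speaker` string (matching the `begin_seconds, end_seconds, speaker_index` format this card describes), a sketch that totals speaking time per speaker:

```python
from collections import defaultdict

def speaker_talk_time(segments):
    """Sum speaking time per speaker from 'begin end speaker' segment strings."""
    totals = defaultdict(float)
    for seg in segments:
        begin, end, speaker = seg.split()
        totals[speaker] += float(end) - float(begin)
    return dict(totals)

# Hypothetical segments, shaped like the diarize() output described in this card
segments = ["0.08 5.44 speaker_0", "5.60 9.12 speaker_1", "9.20 11.00 speaker_0"]
totals = speaker_talk_time(segments)
```
The helper name and the exact string layout are assumptions for this sketch; adapt the parsing to whatever your NeMo version actually returns.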

## How to Use this Model

The model is available in the NeMo Framework[6] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Loading the Model

```python
from nemo.collections.asr.models import SortformerEncLabelModel

# load model from Hugging Face model card directly (You need a Hugging Face token)
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2")
diar_model.eval()
```

### Input Format

Input to Sortformer can be an individual audio file:
```python
audio_input = "/path/to/multispeaker_audio1.wav"
```
or a list of paths to audio files:
```python
audio_input = ["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
```
or a numpy array (a single array or a list of arrays):
```python
import numpy as np

audio_input = np.random.randn(16000 * 10).astype(np.float32)  # 10 seconds at 16 kHz
# or a list of arrays
audio_input = [audio_array1, audio_array2]
diar_model.diarize(audio=audio_input, batch_size=2, sample_rate=16000)
```
Note: When using numpy arrays, you **MUST** pass the correct `sample_rate` to `diar_model.diarize()`. The default `sample_rate` is `16000`.
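If your audio lives in WAV files but you want the numpy input path (for example, to pre-process samples yourself), a minimal loader using only the standard library's `wave` module might look like this. The helper name and the 16-bit mono assumption are mine, not part of the NeMo API:

```python
import wave

import numpy as np

def load_wav_as_float32(path):
    """Read a 16-bit PCM mono WAV file into a float32 array in [-1, 1]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "this sketch expects 16-bit PCM"
        sample_rate = wf.getframerate()
        pcm = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
    return pcm.astype(np.float32) / 32768.0, sample_rate

# Hypothetical usage with the diarizer:
# audio, sr = load_wav_as_float32("/path/to/multispeaker_audio1.wav")
# predicted_segments = diar_model.diarize(audio=audio, batch_size=1, sample_rate=sr)
```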
or a jsonl manifest file:
```python
audio_input = "/path/to/multispeaker_manifest.json"
```
where each line is a dictionary containing the following fields:

For clarity on the metrics used in the table:

* **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.

To set the streaming configuration, use:
```python
diar_model.sortformer_modules.chunk_len = CHUNK_SIZE
diar_model.sortformer_modules.chunk_right_context = RIGHT_CONTEXT
diar_model.sortformer_modules.fifo_len = FIFO_SIZE
diar_model.sortformer_modules._check_streaming_parameters()
```
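Since these parameters are plain attributes on `diar_model.sortformer_modules`, a configuration can be kept as data and applied generically with `setattr`. A small sketch (the preset dict and helper function are hypothetical; the numeric values are the ones shown elsewhere in this card):

```python
from types import SimpleNamespace

# Hypothetical preset table; values taken from the example settings in this card.
STREAMING_PRESETS = {
    "default": {
        "chunk_len": 340,
        "chunk_right_context": 40,
        "fifo_len": 40,
        "spkcache_update_period": 300,
    },
}

def apply_streaming_preset(sortformer_modules, name):
    """Assign each streaming parameter onto the model's sortformer_modules."""
    for key, value in STREAMING_PRESETS[name].items():
        setattr(sortformer_modules, key, value)

# Stand-in object for diar_model.sortformer_modules, just to show the mechanics:
modules = SimpleNamespace()
apply_streaming_preset(modules, "default")
```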

### Getting Diarization Results

To perform speaker diarization and get a list of speaker-marked speech segments in the format `begin_seconds, end_seconds, speaker_index`, simply use:
```python
predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
```
To obtain tensors of speaker activity probabilities, use:
```python
predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
```
Note that if you are feeding a list of numpy arrays, you **MUST** provide the `sample_rate` as an integer:

```python
predicted_segments, predicted_probs = diar_model.diarize(audio=[np_array1, np_array2], batch_size=2, sample_rate=16000, include_tensor_outputs=True)
```
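The probability tensors can be post-processed however you like. As a sketch only, assuming each output is a `(num_frames, num_speakers)` array and assuming a fixed frame step (the `frame_seconds` value below is an illustration, not something this card specifies), a simple thresholding pass could turn probabilities back into segments:

```python
import numpy as np

def probs_to_segments(probs, threshold=0.5, frame_seconds=0.08):
    """Turn a (num_frames, num_speakers) activity-probability array into
    (begin_seconds, end_seconds, speaker_index) tuples."""
    segments = []
    active = probs >= threshold
    for spk in range(active.shape[1]):
        start = None
        for frame in range(active.shape[0]):
            if active[frame, spk] and start is None:
                start = frame
            elif not active[frame, spk] and start is not None:
                segments.append((start * frame_seconds, frame * frame_seconds, spk))
                start = None
        if start is not None:  # speaker still active at the end of the file
            segments.append((start * frame_seconds, active.shape[0] * frame_seconds, spk))
    return sorted(segments)
```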

## 🔬 For more detailed evaluations (DER)

If you need to perform a comprehensive evaluation and calculate the **Diarization Error Rate (DER)** across different parameter settings, use the NeMo example script [e2e_diarize_speech.py](https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py). This script lets you test the streaming behavior of the model by adjusting key parameters such as `chunk_len`, `fifo_len`, and `spkcache_update_period`.

```bash
python ${NEMO_ROOT}/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
    model_path="/path/to/diar_sortformer_4spk_v1.nemo" \
    dataset_manifest="/path/to/diarization_manifest.json" \
    batch_size=1 \
    spkcache_len=188 \
    spkcache_update_period=300 \
    fifo_len=40 \
    chunk_len=340 \
    chunk_right_context=40
```
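For intuition about what the script measures: DER is conventionally the sum of missed speech, false alarm, and speaker-confusion time divided by total reference speech time. A toy frame-level version, simplified to one speaker per frame (so it ignores overlapped speech, unlike a real scorer), can be sketched as:

```python
import numpy as np

def frame_der(reference, hypothesis):
    """Toy frame-level DER for single-speaker-per-frame label arrays.
    0 means silence; any other integer is a speaker label."""
    reference = np.asarray(reference)
    hypothesis = np.asarray(hypothesis)
    ref_speech = reference != 0
    hyp_speech = hypothesis != 0
    missed = np.sum(ref_speech & ~hyp_speech)          # speech scored as silence
    false_alarm = np.sum(~ref_speech & hyp_speech)     # silence scored as speech
    confusion = np.sum(ref_speech & hyp_speech & (reference != hypothesis))
    return (missed + false_alarm + confusion) / np.sum(ref_speech)
```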

### Input