Taejin committed on
Commit 6cd80b8 · 1 Parent(s): 39f2907

Adding numpy input and quick starter code

Signed-off-by: taejinp <[email protected]>

Files changed (1): README.md (+56 -7)
README.md CHANGED
NVIDIA Developer [Nemotron](https://developer.nvidia.com/nemotron)<br>
[NeMo Documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html)<br>

## 🚀 Quick Start: Run Diarization Now
Here is a short example script that loads the model, runs diarization on a WAV file, and prints the results:

```python
from nemo.collections.asr.models import SortformerEncLabelModel

# Load the pretrained streaming Sortformer checkpoint and switch to inference mode
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2")
diar_model.eval()

# Streaming configuration
diar_model.sortformer_modules.chunk_len = 340
diar_model.sortformer_modules.chunk_right_context = 40
diar_model.sortformer_modules.fifo_len = 40
diar_model.sortformer_modules.spkcache_update_period = 300

# Run diarization on a single WAV file
predicted_segments = diar_model.diarize(audio=["/path/to/your/audio.wav"], batch_size=1)

# Print the speaker-marked segments for the first (and only) input file
for segment in predicted_segments[0]:
    print(segment)
```
## How to Use this Model

The model is available in the NeMo Framework[6] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Loading the Model

```python
from nemo.collections.asr.models import SortformerEncLabelModel

# Load the model from the Hugging Face model card directly (a Hugging Face token is required)
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2")
diar_model.eval()
```
### Input Format
Input to Sortformer can be an individual audio file:
```python
audio_input = "/path/to/multispeaker_audio1.wav"
```
or a list of paths to audio files:
```python
audio_input = ["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
```
or a NumPy array (a single array or a list of arrays):
```python
import numpy as np

# A single 10-second waveform at 16 kHz
audio_input = np.random.randn(16000 * 10).astype(np.float32)

# or a list of arrays
audio_input = [audio_array1, audio_array2]
diar_model.diarize(audio=audio_input, batch_size=2, sample_rate=16000)
```
Note: when using NumPy arrays, you **MUST** pass the correct `sample_rate` to `diar_model.diarize()`. The default `sample_rate` is `16000`.
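To feed real recorded audio instead of random samples, a WAV file can be decoded into a float32 NumPy array before being passed to `diarize()`. A minimal sketch using only the standard-library `wave` module; `load_wav_as_float32` is a helper defined here for illustration, not a NeMo API, and it assumes 16-bit mono PCM:

```python
import wave

import numpy as np


def load_wav_as_float32(path):
    """Read a 16-bit mono PCM WAV file into a float32 array in [-1, 1]."""
    with wave.open(path, "rb") as wf:
        assert wf.getnchannels() == 1, "expected mono audio"
        assert wf.getsampwidth() == 2, "expected 16-bit PCM"
        sample_rate = wf.getframerate()
        pcm = wf.readframes(wf.getnframes())
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
    return samples, sample_rate


# audio, sr = load_wav_as_float32("/path/to/multispeaker_audio1.wav")
# diar_model.diarize(audio=audio, batch_size=1, sample_rate=sr)
```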
or a jsonl manifest file:
```python
audio_input = "/path/to/multispeaker_manifest.json"
```
where each line is a dictionary containing the following fields:
 
For clarity on the metrics used in the table:

* **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.

To set the streaming configuration, use:
```python
diar_model.sortformer_modules.chunk_len = CHUNK_SIZE
diar_model.sortformer_modules.chunk_right_context = RIGHT_CONTEXT
diar_model.sortformer_modules.fifo_len = FIFO_SIZE
diar_model.sortformer_modules._check_streaming_parameters()
```
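The knobs above can also be applied in one step. Below is a small illustrative helper (not part of NeMo; the preset numbers simply mirror the example values used elsewhere on this page and are not official recommendations):

```python
# Illustrative helper, not a NeMo API: bundle the streaming knobs into named
# presets and apply them in one call. The values mirror the examples on this
# page; they are not official recommendations.
STREAMING_PRESETS = {
    "example": {
        "chunk_len": 340,
        "chunk_right_context": 40,
        "fifo_len": 40,
        "spkcache_update_period": 300,
        "spkcache_len": 188,
    },
}


def apply_streaming_preset(diar_model, name):
    """Set every attribute from the chosen preset, then let NeMo validate them."""
    for attr, value in STREAMING_PRESETS[name].items():
        setattr(diar_model.sortformer_modules, attr, value)
    diar_model.sortformer_modules._check_streaming_parameters()
```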
 
### Getting Diarization Results
To perform speaker diarization and get a list of speaker-marked speech segments in the format 'begin_seconds, end_seconds, speaker_index', simply use:
```python
predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
```
To obtain tensors of speaker activity probabilities as well, use:
```python
predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
```
Note that if you are feeding a list of NumPy arrays, you **MUST** provide `sample_rate` as an integer:

```python
predicted_segments, predicted_probs = diar_model.diarize(audio=[np_array1, np_array2], batch_size=2, sample_rate=16000)
```
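As a quick sanity check on the returned segments, they can be aggregated into per-speaker speaking time. This sketch assumes each segment is a `'begin_seconds end_seconds speaker_index'` string, matching the format described above; adjust the parsing if your NeMo version returns tuples instead:

```python
from collections import defaultdict


def speaking_time_per_speaker(segments):
    """Sum total speech duration per speaker from 'begin end speaker' strings."""
    totals = defaultdict(float)
    for segment in segments:
        begin, end, speaker = segment.split()
        totals[speaker] += float(end) - float(begin)
    return dict(totals)


# For example, for the first input file:
# speaking_time_per_speaker(predicted_segments[0])
```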
## 🔬 Detailed Evaluation (DER)

If you need to perform a comprehensive evaluation and calculate the **Diarization Error Rate (DER)** across different parameter settings, use the NeMo example script [e2e_diarize_speech.py](https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py).
This script lets you test the streaming behavior of the model by adjusting key parameters such as `chunk_len`, `fifo_len`, and `spkcache_update_period`.
```bash
python ${NEMO_ROOT}/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
    model_path="/path/to/diar_sortformer_4spk_v1.nemo" \
    dataset_manifest="/path/to/diarization_manifest.json" \
    batch_size=1 \
    spkcache_len=188 \
    spkcache_update_period=300 \
    fifo_len=40 \
    chunk_len=340 \
    chunk_right_context=40
```
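To build intuition for what the script reports, here is a deliberately simplified, self-contained frame-level DER computation. It is not the official metric: it scores fixed 10 ms frames, assumes reference and hypothesis already share the same speaker labels (no optimal label mapping), and applies no collar, so a confused speaker counts as both a miss and a false alarm:

```python
import numpy as np

FRAME = 0.01  # 10 ms scoring frames


def _to_frames(segments, speakers, total_dur):
    """Rasterize (begin, end, speaker) segments into a frames x speakers boolean matrix."""
    n_frames = int(round(total_dur / FRAME))
    act = np.zeros((n_frames, len(speakers)), dtype=bool)
    for begin, end, spk in segments:
        act[int(round(begin / FRAME)):int(round(end / FRAME)), speakers.index(spk)] = True
    return act


def simple_der(ref, hyp, total_dur):
    """Frame-level miss + false-alarm rate over reference speech (no collar, no label mapping)."""
    speakers = sorted({spk for _, _, spk in ref} | {spk for _, _, spk in hyp})
    r = _to_frames(ref, speakers, total_dur)
    h = _to_frames(hyp, speakers, total_dur)
    missed = np.sum(r & ~h)       # reference speech frames the hypothesis missed
    false_alarm = np.sum(~r & h)  # hypothesis speech frames with no matching reference
    return (missed + false_alarm) / np.sum(r)
```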

### Input