Taejin committed on
Commit 6cd80b8 · 1 Parent(s): 39f2907

Adding numpy input and quick starter code

Signed-off-by: taejinp <[email protected]>

Files changed (1): README.md (+56 -7)
README.md CHANGED
NVIDIA Developer [Nemotron](https://developer.nvidia.com/nemotron)<br>
[NeMo Documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html)<br>

## 🚀 Quick Start: Run Diarization Now
Here is a short example script that loads the model, runs diarization on a WAV file, and prints the results:

```python
from nemo.collections.asr.models import SortformerEncLabelModel

# Load the pretrained streaming Sortformer checkpoint and switch to inference mode
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2")
diar_model.eval()

# Streaming configuration
diar_model.sortformer_modules.chunk_len = 340
diar_model.sortformer_modules.chunk_right_context = 40
diar_model.sortformer_modules.fifo_len = 40
diar_model.sortformer_modules.spkcache_update_period = 300

# Run diarization on a single WAV file
predicted_segments = diar_model.diarize(audio=["/path/to/your/audio.wav"], batch_size=1)

# Print the speaker-marked segments for the first (and only) input file
for segment in predicted_segments[0]:
    print(segment)
```
## How to Use this Model

The model is available in the NeMo Framework[6] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Loading the Model

```python
from nemo.collections.asr.models import SortformerEncLabelModel

# Load the model from the Hugging Face model card directly (a Hugging Face token is required)
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2")
diar_model.eval()
```
### Input Format
Input to Sortformer can be an individual audio file:
```python
audio_input = "/path/to/multispeaker_audio1.wav"
```
or a list of paths to audio files:
```python
audio_input = ["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
```
or a NumPy array (a single array or a list of arrays):
```python
import numpy as np

# A single 10-second waveform at 16 kHz
audio_input = np.random.randn(16000 * 10).astype(np.float32)

# or a list of arrays
audio_input = [audio_array1, audio_array2]
diar_model.diarize(audio=audio_input, batch_size=2, sample_rate=16000)
```
Note: when using NumPy arrays, you **MUST** pass the correct `sample_rate` to `diar_model.diarize()`. The default `sample_rate` is `16000`.
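To feed real recorded audio instead of random samples, a WAV file can be decoded into a float32 NumPy array before being passed to `diarize()`. A minimal sketch using only the standard-library `wave` module; `load_wav_as_float32` is a helper defined here for illustration, not a NeMo API, and it assumes 16-bit mono PCM:

```python
import wave

import numpy as np


def load_wav_as_float32(path):
    """Read a 16-bit mono PCM WAV file into a float32 array in [-1, 1]."""
    with wave.open(path, "rb") as wf:
        assert wf.getnchannels() == 1, "expected mono audio"
        assert wf.getsampwidth() == 2, "expected 16-bit PCM"
        sample_rate = wf.getframerate()
        pcm = wf.readframes(wf.getnframes())
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
    return samples, sample_rate


# audio, sr = load_wav_as_float32("/path/to/multispeaker_audio1.wav")
# diar_model.diarize(audio=audio, batch_size=1, sample_rate=sr)
```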
or a jsonl manifest file:
```python
audio_input = "/path/to/multispeaker_manifest.json"
```
where each line is a dictionary containing the following fields:
 
For clarity on the metrics used in the table:

* **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.

To set the streaming configuration, use:
```python
diar_model.sortformer_modules.chunk_len = CHUNK_SIZE
diar_model.sortformer_modules.chunk_right_context = RIGHT_CONTEXT
diar_model.sortformer_modules.fifo_len = FIFO_SIZE
diar_model.sortformer_modules._check_streaming_parameters()
```
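The knobs above can also be applied in one step. Below is a small illustrative helper (not part of NeMo; the preset numbers simply mirror the example values used elsewhere on this page and are not official recommendations):

```python
# Illustrative helper, not a NeMo API: bundle the streaming knobs into named
# presets and apply them in one call. The values mirror the examples on this
# page; they are not official recommendations.
STREAMING_PRESETS = {
    "example": {
        "chunk_len": 340,
        "chunk_right_context": 40,
        "fifo_len": 40,
        "spkcache_update_period": 300,
        "spkcache_len": 188,
    },
}


def apply_streaming_preset(diar_model, name):
    """Set every attribute from the chosen preset, then let NeMo validate them."""
    for attr, value in STREAMING_PRESETS[name].items():
        setattr(diar_model.sortformer_modules, attr, value)
    diar_model.sortformer_modules._check_streaming_parameters()
```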
 
### Getting Diarization Results
To perform speaker diarization and get a list of speaker-marked speech segments in the format 'begin_seconds, end_seconds, speaker_index', simply use:
```python
predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
```
To obtain tensors of speaker activity probabilities as well, use:
```python
predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
```
Note that if you are feeding a list of NumPy arrays, you **MUST** provide `sample_rate` as an integer:

```python
predicted_segments, predicted_probs = diar_model.diarize(audio=[np_array1, np_array2], batch_size=2, sample_rate=16000)
```
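As a quick sanity check on the returned segments, they can be aggregated into per-speaker speaking time. This sketch assumes each segment is a `'begin_seconds end_seconds speaker_index'` string, matching the format described above; adjust the parsing if your NeMo version returns tuples instead:

```python
from collections import defaultdict


def speaking_time_per_speaker(segments):
    """Sum total speech duration per speaker from 'begin end speaker' strings."""
    totals = defaultdict(float)
    for segment in segments:
        begin, end, speaker = segment.split()
        totals[speaker] += float(end) - float(begin)
    return dict(totals)


# For example, for the first input file:
# speaking_time_per_speaker(predicted_segments[0])
```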
## 🔬 Detailed Evaluation (DER)

If you need to perform a comprehensive evaluation and calculate the **Diarization Error Rate (DER)** across different parameter settings, use the NeMo example script [e2e_diarize_speech.py](https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py).
This script lets you test the streaming behavior of the model by adjusting key parameters such as `chunk_len`, `fifo_len`, and `spkcache_update_period`.
```bash
python ${NEMO_ROOT}/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
    model_path="/path/to/diar_sortformer_4spk_v1.nemo" \
    dataset_manifest="/path/to/diarization_manifest.json" \
    batch_size=1 \
    spkcache_len=188 \
    spkcache_update_period=300 \
    fifo_len=40 \
    chunk_len=340 \
    chunk_right_context=40
```
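To build intuition for what the script reports, here is a deliberately simplified, self-contained frame-level DER computation. It is not the official metric: it scores fixed 10 ms frames, assumes reference and hypothesis already share the same speaker labels (no optimal label mapping), and applies no collar, so a confused speaker counts as both a miss and a false alarm:

```python
import numpy as np

FRAME = 0.01  # 10 ms scoring frames


def _to_frames(segments, speakers, total_dur):
    """Rasterize (begin, end, speaker) segments into a frames x speakers boolean matrix."""
    n_frames = int(round(total_dur / FRAME))
    act = np.zeros((n_frames, len(speakers)), dtype=bool)
    for begin, end, spk in segments:
        act[int(round(begin / FRAME)):int(round(end / FRAME)), speakers.index(spk)] = True
    return act


def simple_der(ref, hyp, total_dur):
    """Frame-level miss + false-alarm rate over reference speech (no collar, no label mapping)."""
    speakers = sorted({spk for _, _, spk in ref} | {spk for _, _, spk in hyp})
    r = _to_frames(ref, speakers, total_dur)
    h = _to_frames(hyp, speakers, total_dur)
    missed = np.sum(r & ~h)       # reference speech frames the hypothesis missed
    false_alarm = np.sum(~r & h)  # hypothesis speech frames with no matching reference
    return (missed + false_alarm) / np.sum(r)
```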

### Input