Update README.md
README.md
CHANGED

@@ -114,12 +114,12 @@ img {
This model is a streaming version of the Sortformer diarizer. [Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models.

<div align="center">
-<img src="sortformer_intro.png" width="750" />
+<img src="figures/sortformer_intro.png" width="750" />
</div>

The streaming Sortformer approach employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers.
<div align="center">
-<img src="streaming_sortformer_ani.gif" width="1400" />
+<img src="figures/streaming_sortformer_ani.gif" width="1400" />
</div>

Sortformer resolves the permutation problem in diarization by following the arrival-time order of the speech segments from each speaker.
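
To make the arrival-order idea in the hunk above concrete, here is a minimal, self-contained sketch of sorting speaker channels by arrival time. It is illustrative only, not the NeMo implementation; the array shapes and function name are assumptions.

```python
# Minimal sketch of the arrival-time ordering idea behind Sortformer targets.
# Illustrative only (not NeMo's implementation); shapes and names are assumptions.
import numpy as np

def sort_speakers_by_arrival(activity: np.ndarray) -> np.ndarray:
    """Reorder speaker channels so that channel 0 is the first speaker to talk.

    activity: (num_frames, num_speakers) binary speaker-activity matrix.
    Returns the matrix with columns sorted by each speaker's first active frame;
    speakers who never talk are placed last.
    """
    num_frames = activity.shape[0]
    first_active = np.where(
        activity.any(axis=0),
        activity.argmax(axis=0),   # index of the first active frame per speaker
        num_frames,                # silent speakers sort to the end
    )
    order = np.argsort(first_active, kind="stable")
    return activity[:, order]

# Example: speaker 1 talks first, so it becomes output channel 0.
y = np.array([[0, 1, 0, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]])
print(sort_speakers_by_arrival(y))
```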
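The Arrival-Order Speaker Cache mentioned above can be pictured as a set of per-speaker buffers, filled in the order speakers first appear. The following is a rough sketch under assumed tensor shapes, with a placeholder confidence threshold and cache size; it is not the released model's cache logic.

```python
# Rough sketch of what an Arrival-Order Speaker Cache (AOSC) could look like:
# per-speaker buffers of frame-level embeddings, indexed by arrival order.
# Illustrative only; the actual NeMo data structures and thresholds differ.
import torch

class ArrivalOrderSpeakerCache:
    def __init__(self, num_speakers: int = 4, max_frames_per_speaker: int = 200):
        self.buffers = [[] for _ in range(num_speakers)]  # slot k = k-th speaker to appear
        self.max_frames = max_frames_per_speaker          # placeholder cache size

    def update(self, embeddings: torch.Tensor, speaker_probs: torch.Tensor,
               threshold: float = 0.5):
        """embeddings: (T, D) frame embeddings from the pre-encode layer.
        speaker_probs: (T, num_speakers) sigmoid outputs for the same frames."""
        for spk, buf in enumerate(self.buffers):
            active = speaker_probs[:, spk] > threshold
            buf.extend(embeddings[active])
            # keep the cache bounded; a real system would keep the most confident frames
            del buf[:-self.max_frames]

    def as_tensor(self) -> torch.Tensor:
        """Concatenate cached frames of all speakers, in arrival order,
        so they can be prepended to the next chunk."""
        frames = [f for buf in self.buffers for f in buf]
        return torch.stack(frames) if frames else torch.empty(0)
```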
@@ -129,7 +129,7 @@ Sortformer resolves permutation problem in diarization following the arrival-tim
Streaming Sortformer employs the pre-encode layer of the Fast-Conformer to generate the speaker cache. At each step, the speaker cache is filtered to retain only high-quality speaker-cache vectors.

<div align="center">
-<img src="streaming_steps.png" width="1400" />
+<img src="figures/streaming_steps.png" width="1400" />
</div>

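The per-step filtering described in this hunk can be approximated as keeping the highest-quality cached frames for each speaker. The scoring rule and cache size below are assumptions for illustration, not the released model's configuration.

```python
# Sketch of the per-step cache-filtering idea: keep only the highest-confidence
# frames for each speaker so the cache stays small. The quality score and the
# `keep` value are assumptions, not the released model's values.
import torch

def filter_speaker_cache(cache_emb: torch.Tensor,
                         cache_probs: torch.Tensor,
                         spk: int,
                         keep: int = 200) -> torch.Tensor:
    """cache_emb: (N, D) cached frame embeddings for one speaker slot.
    cache_probs: (N, num_speakers) sigmoid outputs recorded for those frames.
    Returns the `keep` embeddings whose scores suggest clean, single-speaker frames."""
    # score frames high when the target speaker is confident and the others are quiet
    others = cache_probs.clone()
    others[:, spk] = 0.0
    quality = cache_probs[:, spk] - others.max(dim=1).values
    keep = min(keep, cache_emb.shape[0])
    top = torch.topk(quality, k=keep).indices.sort().values  # preserve temporal order
    return cache_emb[top]
```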
@@ -138,7 +138,7 @@ Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[2] which is based on [Fas
and two feedforward layers with 4 sigmoid outputs for each frame input at the top layer. More information can be found in the [Sortformer paper](https://arxiv.org/abs/2409.06656)[1].

<div align="center">
-<img src="sortformer-v1-model.png" width="450" />
+<img src="figures/sortformer-v1-model.png" width="450" />
</div>

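As a rough illustration of the prediction head described in this hunk (two feed-forward layers producing 4 per-frame sigmoid outputs), here is a minimal PyTorch sketch; the hidden sizes and class name are placeholders rather than the released model's configuration.

```python
# Minimal sketch of the prediction head: two feed-forward layers producing
# 4 per-frame sigmoid outputs (one activity probability per speaker slot).
# Hidden sizes are placeholders, not the released model's config.
import torch
import torch.nn as nn

class SigmoidDiarHead(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 1024, num_speakers: int = 4):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, num_speakers),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (batch, T, d_model) Transformer outputs -> (batch, T, 4) speaker probabilities."""
        return torch.sigmoid(self.ff(frames))

head = SigmoidDiarHead()
probs = head(torch.randn(1, 100, 512))  # (1, 100, 4), each entry in [0, 1]
```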