MLX-compatible weights for Silero VAD v5, converted from the official JIT model.
Silero VAD v5 is a lightweight (~309K params) voice activity detection model that processes 512-sample chunks (32ms @ 16kHz) with sub-millisecond latency. It outputs a speech probability between 0 and 1 for each chunk, with LSTM state carried across chunks for streaming operation.
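The chunk arithmetic above can be sketched in Python. This is only an illustration of the framing (512 samples at 16 kHz is 32 ms) and of one simple way to turn per-chunk probabilities into time ranges. The helper names, the zero-padding of the final chunk, and the fixed 0.5 threshold are all assumptions made for the example, not the behaviour of the actual model or of `detectSpeech`.

```python
SAMPLE_RATE = 16_000  # Hz, as stated in the model description
CHUNK_SIZE = 512      # samples per chunk, as stated in the model description


def chunk_duration_ms(chunk_size=CHUNK_SIZE, sample_rate=SAMPLE_RATE):
    """Duration of one chunk in milliseconds."""
    return 1000.0 * chunk_size / sample_rate


def split_into_chunks(samples, chunk_size=CHUNK_SIZE):
    """Yield consecutive fixed-size chunks.

    Assumption: the last partial chunk is zero-padded to full length.
    """
    for start in range(0, len(samples), chunk_size):
        chunk = list(samples[start:start + chunk_size])
        chunk.extend([0.0] * (chunk_size - len(chunk)))
        yield chunk


def probs_to_segments(probs, threshold=0.5,
                      chunk_size=CHUNK_SIZE, sample_rate=SAMPLE_RATE):
    """Group consecutive chunks at or above `threshold` into (start_s, end_s) pairs.

    Hypothetical post-processing: a plain threshold with no hysteresis
    and no minimum segment duration.
    """
    seconds_per_chunk = chunk_size / sample_rate
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a speech run begins at this chunk
        elif p < threshold and start is not None:
            segments.append((start * seconds_per_chunk, i * seconds_per_chunk))
            start = None
    if start is not None:  # audio ended while still inside a speech run
        segments.append((start * seconds_per_chunk, len(probs) * seconds_per_chunk))
    return segments
```

In a streaming setting the probabilities would come from the model one chunk at a time, with the LSTM state kept between calls; the grouping step here only needs the resulting sequence of probabilities.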
Architecture: STFT → 4×Conv1d+ReLU encoder → LSTM(128) → Conv1d decoder → sigmoid
import SpeechVAD
// Load model
let vad = try await SileroVADModel.fromPretrained()
// Streaming: process 512-sample chunks
let prob = vad.processChunk(samples) // → 0.0...1.0
// Batch: detect speech segments in complete audio
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for seg in segments {
    print("Speech: \(seg.startTime)s - \(seg.endTime)s")
}
Part of speech-swift.
python3 scripts/convert_silero_vad.py --upload
Converts the official Silero VAD v5 JIT model via torch.hub, transposes Conv1d weights for MLX channels-last format, sums LSTM biases (bias_ih + bias_hh), and saves as safetensors.
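The two tensor transformations described above can be illustrated with NumPy. This is a minimal sketch, not the contents of `scripts/convert_silero_vad.py`: the function names are hypothetical and the tensors are random stand-ins rather than real weights. It assumes the usual layouts, where PyTorch stores Conv1d weights as (out_channels, in_channels, kernel_size) and MLX expects (out_channels, kernel_size, in_channels).

```python
import numpy as np


def conv1d_torch_to_mlx(weight):
    """Reorder (out, in, kernel) to the channels-last (out, kernel, in) layout."""
    return np.transpose(weight, (0, 2, 1))


def fuse_lstm_biases(bias_ih, bias_hh):
    """Combine the two PyTorch LSTM bias vectors into a single vector."""
    return bias_ih + bias_hh


# Random stand-in tensors with the shapes listed in the weight table.
rng = np.random.default_rng(0)
hidden = 128
w_ih = rng.standard_normal((4 * hidden, hidden))
w_hh = rng.standard_normal((4 * hidden, hidden))
b_ih = rng.standard_normal(4 * hidden)
b_hh = rng.standard_normal(4 * hidden)
x = rng.standard_normal(hidden)  # encoder output for one step
h = rng.standard_normal(hidden)  # previous hidden state

# The gate pre-activations add both biases, so summing them ahead of time
# gives the same result as keeping them separate.
gates_separate = w_ih @ x + b_ih + w_hh @ h + b_hh
gates_fused = w_ih @ x + w_hh @ h + fuse_lstm_biases(b_ih, b_hh)
assert np.allclose(gates_separate, gates_fused)

# A PyTorch-layout STFT basis of shape (258, 1, 256) becomes (258, 256, 1).
stft_torch = rng.standard_normal((258, 1, 256))
stft_mlx = conv1d_torch_to_mlx(stft_torch)
print(stft_mlx.shape)  # (258, 256, 1)
```

Fusing the biases is safe because they only ever appear as a sum inside the LSTM gate computation, which is what the `allclose` check demonstrates.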
| JIT Key | MLX Key | Shape |
|---|---|---|
| `_model.stft.forward_basis_buffer` | `stft.weight` | [258, 256, 1] |
| `_model.encoder.{i}.reparam_conv.weight` | `encoder.{i}.weight` | varies |
| `_model.encoder.{i}.reparam_conv.bias` | `encoder.{i}.bias` | varies |
| `_model.decoder.rnn.weight_ih` | `lstm.Wx` | [512, 128] |
| `_model.decoder.rnn.weight_hh` | `lstm.Wh` | [512, 128] |
| `_model.decoder.rnn.bias_ih` + `bias_hh` | `lstm.bias` | [512] |
| `_model.decoder.decoder.2.weight` | `decoder.weight` | [1, 1, 128] |
| `_model.decoder.decoder.2.bias` | `decoder.bias` | [1] |
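The key mapping in the table above could be expressed as a small set of rename rules. The sketch below is one possible way to write it and is not taken from the conversion script; `KEY_RULES` and `rename_key` are hypothetical names. The two LSTM bias keys are deliberately left unmapped here, since they are summed into a single `lstm.bias` tensor rather than renamed one-to-one.

```python
import re

# (pattern on the JIT key, replacement giving the MLX key)
KEY_RULES = [
    (r"^_model\.stft\.forward_basis_buffer$", "stft.weight"),
    (r"^_model\.encoder\.(\d+)\.reparam_conv\.(weight|bias)$", r"encoder.\1.\2"),
    (r"^_model\.decoder\.rnn\.weight_ih$", "lstm.Wx"),
    (r"^_model\.decoder\.rnn\.weight_hh$", "lstm.Wh"),
    (r"^_model\.decoder\.decoder\.2\.(weight|bias)$", r"decoder.\1"),
]


def rename_key(jit_key):
    """Return the MLX key for a JIT key, or None if no rename rule applies."""
    for pattern, replacement in KEY_RULES:
        new_key, count = re.subn(pattern, replacement, jit_key)
        if count:
            return new_key
    return None  # e.g. the bias_ih / bias_hh keys, which are fused separately


for key in [
    "_model.stft.forward_basis_buffer",
    "_model.encoder.2.reparam_conv.weight",
    "_model.decoder.rnn.weight_ih",
    "_model.decoder.decoder.2.bias",
]:
    print(key, "->", rename_key(key))
```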
The original Silero VAD model is released under the MIT License.