# WeSpeaker ResNet34-LM → MLX (Fixed)
MLX-native speaker embedding model for Apple Silicon, converted from Wespeaker/wespeaker-voxceleb-resnet34-LM.
## Why This Exists
The existing mlx-community conversion has two bugs that produce incorrect embeddings (cosine similarity ≈ 0 vs. the ONNX reference):

1. **Conv2d bias:** MLX `nn.Conv2d` defaults to `bias=True`, but WeSpeaker uses `bias=False` (standard ResNet with BatchNorm). This creates 36 extra uninitialized parameters.
2. **Pooling dimension ordering (critical):** The TSTP pooling flattens in the wrong order. PyTorch flattens as `(C, F')`, but the MLX version flattens as `(F', C)`. Both produce shape `(B, 5120)`, so the FC layer accepts either without error, but the values are scrambled.
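The second bug is easy to miss precisely because the shapes agree. A minimal NumPy sketch (with an illustrative stats tensor; `C=256` and `F'=10` are the assumed per-statistic dimensions behind the 5120-dim mean+std vector) shows the two flatten orders colliding in shape but not in values:

```python
import numpy as np

# Hypothetical post-pooling statistics tensor: C=256 channels, F'=10 freq bins.
# (Mean and std are each (C, F'); concatenated they form the 5120-dim vector.)
C, Fp = 256, 10
stats = np.arange(C * Fp, dtype=np.float32).reshape(C, Fp)

correct = stats.reshape(-1)      # PyTorch order: flatten (C, F')
scrambled = stats.T.reshape(-1)  # broken order: flatten (F', C)

print(correct.shape == scrambled.shape)    # True:  the FC layer accepts both
print(np.array_equal(correct, scrambled))  # False: the values are permuted
```

Because the FC layer sees the same input shape either way, the error only surfaces downstream as near-zero cosine similarity against the reference.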
This conversion fixes both issues and is verified against the ONNX reference.
## Verification
| Speaker | mlx-community (broken) | This model |
|---|---|---|
| Speaker A | cosine = -0.07 | cosine = 0.999994 ✓ |
| Speaker B | cosine = 0.06 | cosine = 0.999993 ✓ |
| Speaker C | cosine = -0.00 | cosine = 0.999996 ✓ |
| Speaker D | cosine = 0.02 | cosine = 0.999994 ✓ |
Tested on 4 speakers from a 93-minute Chinese business meeting.
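The cosine scores above compare this model's embedding against the ONNX reference embedding for the same segment. A plain NumPy helper reproduces the metric (the example vectors are placeholders, not real embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical vectors score ~1.0; orthogonal vectors score 0.0
v = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(v, v))
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))
```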
## Performance
| Backend | Latency / segment | 48 segments | Device |
|---|---|---|---|
| ONNX (CPU) | 181ms | 8.7s | M5 Max |
| MLX (Metal) | 17ms | 0.8s | M5 Max |
10.8x faster than ONNX on Apple Silicon.
## Model Details
- Architecture: ResNet34 with Large-Margin finetuning
- Parameters: 6.6M
- Embedding dim: 256
- Training data: VoxCeleb1 + VoxCeleb2
- Input: 80-dim log Mel filterbank features (16kHz, 25ms window, 10ms shift)
- Weights size: 25MB
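As a sanity check on the input spec, the frame count `T` of the `(T, 80)` fbank matrix follows from the 25 ms window and 10 ms shift. This is a sketch assuming simple no-padding framing; the exact count depends on the fbank implementation's edge handling:

```python
SAMPLE_RATE = 16000
WIN = int(0.025 * SAMPLE_RATE)    # 400 samples per 25 ms window
SHIFT = int(0.010 * SAMPLE_RATE)  # 160 samples per 10 ms shift

def num_frames(n_samples: int) -> int:
    """Frames T in the (T, 80) fbank matrix, without padding."""
    return 1 + (n_samples - WIN) // SHIFT

print(num_frames(2 * SAMPLE_RATE))  # 198 frames for a 2-second clip
```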
## Usage
```python
import mlx.core as mx
import numpy as np

from resnet_embedding import ResNet34Embedding

# Load model
model = ResNet34Embedding()
weights = np.load("weights.npz")
for key in weights.files:
    path = key.split(".")
    module = model
    for attr in path[:-1]:
        if attr.isdigit():
            module = module[int(attr)]
        elif attr == "layers":
            module = module.layers
        else:
            module = getattr(module, attr)
    setattr(module, path[-1], mx.array(weights[key]))
model.eval()

# Extract embedding from fbank features of shape (T, 80)
embedding = model(mx.array(fbank[np.newaxis, :, :]))  # -> (1, 256)
```
See `example_usage.py` for a complete example with audio loading and fbank computation.
## Conversion
Converted directly from the official PyTorch weights using `convert.py`:

```bash
pip install torch numpy huggingface_hub
python convert.py --model Wespeaker/wespeaker-voxceleb-resnet34-LM --output weights.npz
```

Key conversion steps:

1. Download the PyTorch `avg_model` checkpoint
2. Transpose Conv2d weights: `(O, I, H, W)` → `(O, H, W, I)`
3. Remap key names for MLX `nn.Sequential`
4. Save as NumPy `.npz`
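The Conv2d transpose in step 2 is a single axis permutation from PyTorch's channels-first kernel layout to MLX's channels-last layout. A sketch with a dummy tensor (the shapes are illustrative, not taken from the model):

```python
import numpy as np

# Dummy PyTorch-layout conv weight: (O, I, H, W)
w_torch = np.random.default_rng(0).standard_normal((64, 32, 3, 3)).astype(np.float32)

# MLX expects channels-last kernels: (O, H, W, I)
w_mlx = w_torch.transpose(0, 2, 3, 1)

print(w_mlx.shape)  # (64, 3, 3, 32)
```

Note that `transpose` permutes axes without copying or reordering the underlying values, so each kernel element keeps its meaning: `w_mlx[o, h, w, i]` equals `w_torch[o, i, h, w]`.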
## Files

| File | Description |
|---|---|
| `weights.npz` | MLX model weights (25MB) |
| `resnet_embedding.py` | Model architecture (MLX) |
| `config.json` | Model configuration |
| `example_usage.py` | Full example with audio loading |
| `convert.py` | PyTorch → MLX conversion script |
## License
Apache 2.0 (same as the original WeSpeaker model).
## Citation

```bibtex
@inproceedings{wang2023wespeaker,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={ICASSP 2023},
  year={2023}
}
```