WeSpeaker ResNet34-LM – MLX (Fixed)

MLX-native speaker embedding model for Apple Silicon, converted from Wespeaker/wespeaker-voxceleb-resnet34-LM.

Why This Exists

The existing mlx-community conversion has two bugs that produce incorrect embeddings (cosine similarity ≈ 0 against the ONNX reference):

  1. Conv2d bias: MLX's nn.Conv2d defaults to bias=True, but WeSpeaker uses bias=False (standard for ResNets whose convolutions are followed by BatchNorm). The default leaves 36 extra, uninitialized bias parameters in the model.

  2. Pooling dimension ordering (critical): the TSTP pooling flattens the channel and frequency dimensions in the wrong order. PyTorch flattens as (C, F'), but the MLX version flattens as (F', C). Both produce shape (B, 5120), so the FC layer accepts the result without error, but the values are scrambled.
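The second bug can be reproduced in a few lines. A minimal NumPy sketch of the mean half of the statistics pooling (the std half behaves the same way), assuming the final feature map has C = 256 channels and F' = 10 frequency bins:

```python
import numpy as np

# Final ResNet feature map, PyTorch layout: (B, C, F', T)
B, C, Fp, T = 1, 256, 10, 8
x = np.random.randn(B, C, Fp, T)

# Reference (WeSpeaker/PyTorch): merge (C, F') before pooling over time
mean_ref = x.reshape(B, C * Fp, T).mean(-1)   # (B, 2560)

# MLX is channels-last, so the same tensor arrives as (B, T, F', C)
x_mlx = x.transpose(0, 3, 2, 1)
pooled = x_mlx.mean(axis=1)                   # (B, F', C)

# Broken: flatten (F', C) directly -- same shape, scrambled order
broken = pooled.reshape(B, Fp * C)
# Fixed: swap to (C, F') first, then flatten
fixed = pooled.transpose(0, 2, 1).reshape(B, C * Fp)

assert np.allclose(fixed, mean_ref)
assert not np.allclose(broken, mean_ref)
```

Because the scrambling happens after all shape checks pass, the only symptom is embeddings that no longer correlate with the reference.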

This conversion fixes both issues and is verified against the ONNX reference.

Verification

| Speaker   | mlx-community (broken) | This model            |
|-----------|------------------------|-----------------------|
| Speaker A | cosine = -0.07         | cosine = 0.999994 ✅  |
| Speaker B | cosine = 0.06          | cosine = 0.999993 ✅  |
| Speaker C | cosine = -0.00         | cosine = 0.999996 ✅  |
| Speaker D | cosine = 0.02          | cosine = 0.999994 ✅  |

Tested on 4 speakers from a 93-minute Chinese business meeting.
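The scores above compare this model's embedding with the ONNX reference embedding for the same audio, using plain cosine similarity. A minimal helper (the vectors here are random placeholders):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a).ravel(), np.asarray(b).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# An embedding compared with itself scores 1.0.
e = np.random.randn(256)
assert abs(cosine(e, e) - 1.0) < 1e-6
```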

Performance

| Backend     | Latency / segment | 48 segments | Device |
|-------------|-------------------|-------------|--------|
| ONNX (CPU)  | 181 ms            | 8.7 s       | M5 Max |
| MLX (Metal) | 17 ms             | 0.8 s       | M5 Max |

10.8x faster than ONNX on Apple Silicon.

Model Details

  • Architecture: ResNet34 with Large-Margin finetuning
  • Parameters: 6.6M
  • Embedding dim: 256
  • Training data: VoxCeleb1 + VoxCeleb2
  • Input: 80-dim log Mel filterbank features (16kHz, 25ms window, 10ms shift)
  • Weights size: 25MB

Usage

```python
import mlx.core as mx
import numpy as np
from resnet_embedding import ResNet34Embedding

# Load model
model = ResNet34Embedding()
weights = np.load("weights.npz")
for key in weights.files:
    # Walk the attribute path, e.g. "layer1.0.conv1.weight";
    # numeric parts index into Sequential containers.
    path = key.split(".")
    module = model
    for attr in path[:-1]:
        module = module[int(attr)] if attr.isdigit() else getattr(module, attr)
    setattr(module, path[-1], mx.array(weights[key]))
model.eval()

# Extract embedding from fbank features of shape (T, 80)
embedding = model(mx.array(fbank[np.newaxis, :, :]))  # → (1, 256)
```

See example_usage.py for a complete example with audio loading and fbank computation.
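For orientation, here is a from-scratch sketch of the 80-dim log Mel features with the window/shift listed above. It is not bit-exact with the Kaldi-style fbank WeSpeaker was trained on (use example_usage.py for matching features); it only illustrates the expected (T, 80) layout and a per-utterance mean normalization. All names are illustrative:

```python
import numpy as np

def log_mel_fbank(wav, sr=16000, n_mels=80, frame_len=0.025, frame_shift=0.010):
    """Minimal log Mel filterbank sketch (not bit-exact with Kaldi fbank)."""
    win = int(sr * frame_len)    # 400 samples at 16 kHz
    hop = int(sr * frame_shift)  # 160 samples at 16 kHz
    n_fft = 512
    n_frames = 1 + (len(wav) - win) // hop
    frames = np.stack([wav[i * hop : i * hop + win] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * np.hanning(win), n_fft)) ** 2  # (T, 257)

    # Triangular Mel filterbank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    feat = np.log(power @ fb.T + 1e-10)   # (T, 80)
    return feat - feat.mean(axis=0)       # per-utterance mean normalization

fbank = log_mel_fbank(np.random.randn(16000).astype(np.float32))  # 1 s of audio
```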

Conversion

Converted directly from the official PyTorch weights using convert.py:

```
pip install torch numpy huggingface_hub
python convert.py --model Wespeaker/wespeaker-voxceleb-resnet34-LM --output weights.npz
```

Key conversion steps:

  1. Download PyTorch avg_model
  2. Transpose Conv2d weights: (O, I, H, W) → (O, H, W, I)
  3. Remap key names for MLX nn.Sequential
  4. Save as NumPy .npz
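Steps 2 and 4 can be sketched as follows. The toy keys and shapes are placeholders, and the key remapping in step 3 depends on the MLX module layout, so it is only hinted at here:

```python
import numpy as np

# Toy state dict in PyTorch layout; real keys come from the avg_model checkpoint.
state = {
    "front.conv1.weight": np.random.randn(32, 1, 3, 3).astype(np.float32),  # (O, I, H, W)
    "front.bn1.weight": np.ones(32, dtype=np.float32),
}

converted = {}
for key, w in state.items():
    if w.ndim == 4:
        # MLX conv kernels are channels-last: (O, I, H, W) -> (O, H, W, I)
        w = w.transpose(0, 2, 3, 1)
    # Step 3 (key remapping for nn.Sequential) would rewrite `key` here.
    converted[key] = w

# Step 4: save as a NumPy .npz archive
np.savez("weights_demo.npz", **converted)
```

BatchNorm and Linear weights keep their layout unchanged; only the 4-D convolution kernels are transposed.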

Files

| File                | Description                        |
|---------------------|------------------------------------|
| weights.npz         | MLX model weights (25 MB)          |
| resnet_embedding.py | Model architecture (MLX)           |
| config.json         | Model configuration                |
| example_usage.py    | Full example with audio loading    |
| convert.py          | PyTorch → MLX conversion script    |

License

Apache 2.0 (same as the original WeSpeaker model).

Citation

```
@inproceedings{wang2023wespeaker,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={ICASSP 2023},
  year={2023}
}
```