WeSpeaker ResNet34-LM – MLX (Fixed)

MLX-native speaker embedding model for Apple Silicon, converted from Wespeaker/wespeaker-voxceleb-resnet34-LM.

Why This Exists

The existing mlx-community conversion has two bugs that produce incorrect embeddings (cosine similarity ≈ 0 against the ONNX reference):

  1. Conv2d bias: MLX's nn.Conv2d defaults to bias=True, but WeSpeaker uses bias=False (standard for ResNets whose convolutions are followed by BatchNorm). The default leaves 36 extra, uninitialized bias parameters in the model.

  2. Pooling dimension ordering (critical): the TSTP pooling flattens the channel and frequency dimensions in the wrong order. PyTorch flattens as (C, F'), but the MLX version flattens as (F', C). Both produce shape (B, 5120), so the FC layer accepts the result without error, but the values are scrambled.
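The second bug can be reproduced in a few lines. A minimal NumPy sketch of the mean half of the statistics pooling (the std half behaves the same way), assuming the final feature map has C = 256 channels and F' = 10 frequency bins:

```python
import numpy as np

# Final ResNet feature map, PyTorch layout: (B, C, F', T)
B, C, Fp, T = 1, 256, 10, 8
x = np.random.randn(B, C, Fp, T)

# Reference (WeSpeaker/PyTorch): merge (C, F') before pooling over time
mean_ref = x.reshape(B, C * Fp, T).mean(-1)   # (B, 2560)

# MLX is channels-last, so the same tensor arrives as (B, T, F', C)
x_mlx = x.transpose(0, 3, 2, 1)
pooled = x_mlx.mean(axis=1)                   # (B, F', C)

# Broken: flatten (F', C) directly -- same shape, scrambled order
broken = pooled.reshape(B, Fp * C)
# Fixed: swap to (C, F') first, then flatten
fixed = pooled.transpose(0, 2, 1).reshape(B, C * Fp)

assert np.allclose(fixed, mean_ref)
assert not np.allclose(broken, mean_ref)
```

Because the scrambling happens after all shape checks pass, the only symptom is embeddings that no longer correlate with the reference.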

This conversion fixes both issues and is verified against the ONNX reference.

Verification

| Speaker   | mlx-community (broken) | This model            |
|-----------|------------------------|-----------------------|
| Speaker A | cosine = -0.07         | cosine = 0.999994 ✅  |
| Speaker B | cosine = 0.06          | cosine = 0.999993 ✅  |
| Speaker C | cosine = -0.00         | cosine = 0.999996 ✅  |
| Speaker D | cosine = 0.02          | cosine = 0.999994 ✅  |

Tested on 4 speakers from a 93-minute Chinese business meeting.
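The scores above compare this model's embedding with the ONNX reference embedding for the same audio, using plain cosine similarity. A minimal helper (the vectors here are random placeholders):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a).ravel(), np.asarray(b).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# An embedding compared with itself scores 1.0.
e = np.random.randn(256)
assert abs(cosine(e, e) - 1.0) < 1e-6
```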

Performance

| Backend     | Latency / segment | 48 segments | Device |
|-------------|-------------------|-------------|--------|
| ONNX (CPU)  | 181 ms            | 8.7 s       | M5 Max |
| MLX (Metal) | 17 ms             | 0.8 s       | M5 Max |

10.8x faster than ONNX on Apple Silicon.

Model Details

  • Architecture: ResNet34 with Large-Margin finetuning
  • Parameters: 6.6M
  • Embedding dim: 256
  • Training data: VoxCeleb1 + VoxCeleb2
  • Input: 80-dim log Mel filterbank features (16kHz, 25ms window, 10ms shift)
  • Weights size: 25MB

Usage

```python
import mlx.core as mx
import numpy as np
from resnet_embedding import ResNet34Embedding

# Load model
model = ResNet34Embedding()
weights = np.load("weights.npz")
for key in weights.files:
    # Walk the attribute path, e.g. "layer1.0.conv1.weight";
    # numeric parts index into Sequential containers.
    path = key.split(".")
    module = model
    for attr in path[:-1]:
        module = module[int(attr)] if attr.isdigit() else getattr(module, attr)
    setattr(module, path[-1], mx.array(weights[key]))
model.eval()

# Extract embedding from fbank features of shape (T, 80)
embedding = model(mx.array(fbank[np.newaxis, :, :]))  # → (1, 256)
```

See example_usage.py for a complete example with audio loading and fbank computation.
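For orientation, here is a from-scratch sketch of the 80-dim log Mel features with the window/shift listed above. It is not bit-exact with the Kaldi-style fbank WeSpeaker was trained on (use example_usage.py for matching features); it only illustrates the expected (T, 80) layout and a per-utterance mean normalization. All names are illustrative:

```python
import numpy as np

def log_mel_fbank(wav, sr=16000, n_mels=80, frame_len=0.025, frame_shift=0.010):
    """Minimal log Mel filterbank sketch (not bit-exact with Kaldi fbank)."""
    win = int(sr * frame_len)    # 400 samples at 16 kHz
    hop = int(sr * frame_shift)  # 160 samples at 16 kHz
    n_fft = 512
    n_frames = 1 + (len(wav) - win) // hop
    frames = np.stack([wav[i * hop : i * hop + win] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * np.hanning(win), n_fft)) ** 2  # (T, 257)

    # Triangular Mel filterbank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    feat = np.log(power @ fb.T + 1e-10)   # (T, 80)
    return feat - feat.mean(axis=0)       # per-utterance mean normalization

fbank = log_mel_fbank(np.random.randn(16000).astype(np.float32))  # 1 s of audio
```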

Conversion

Converted directly from the official PyTorch weights using convert.py:

```
pip install torch numpy huggingface_hub
python convert.py --model Wespeaker/wespeaker-voxceleb-resnet34-LM --output weights.npz
```

Key conversion steps:

  1. Download PyTorch avg_model
  2. Transpose Conv2d weights: (O, I, H, W) → (O, H, W, I)
  3. Remap key names for MLX nn.Sequential
  4. Save as NumPy .npz
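Steps 2 and 4 can be sketched as follows. The toy keys and shapes are placeholders, and the key remapping in step 3 depends on the MLX module layout, so it is only hinted at here:

```python
import numpy as np

# Toy state dict in PyTorch layout; real keys come from the avg_model checkpoint.
state = {
    "front.conv1.weight": np.random.randn(32, 1, 3, 3).astype(np.float32),  # (O, I, H, W)
    "front.bn1.weight": np.ones(32, dtype=np.float32),
}

converted = {}
for key, w in state.items():
    if w.ndim == 4:
        # MLX conv kernels are channels-last: (O, I, H, W) -> (O, H, W, I)
        w = w.transpose(0, 2, 3, 1)
    # Step 3 (key remapping for nn.Sequential) would rewrite `key` here.
    converted[key] = w

# Step 4: save as a NumPy .npz archive
np.savez("weights_demo.npz", **converted)
```

BatchNorm and Linear weights keep their layout unchanged; only the 4-D convolution kernels are transposed.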

Files

| File                | Description                        |
|---------------------|------------------------------------|
| weights.npz         | MLX model weights (25 MB)          |
| resnet_embedding.py | Model architecture (MLX)           |
| config.json         | Model configuration                |
| example_usage.py    | Full example with audio loading    |
| convert.py          | PyTorch → MLX conversion script    |

License

Apache 2.0 (same as the original WeSpeaker model).

Citation

```
@inproceedings{wang2023wespeaker,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={ICASSP 2023},
  year={2023}
}
```