metadata
language:
  - en
  - zh
license: other
license_name: minicpm-model-license
tags:
  - multimodal
  - vision
  - audio
  - tts
  - voice-cloning
  - bitsandbytes
  - 8-bit
  - quantized
base_model: openbmb/MiniCPM-o-4_5
library_name: transformers
pipeline_tag: any-to-any

MiniCPM-o 4.5: INT8 (bitsandbytes)

8-bit bitsandbytes quantization of openbmb/MiniCPM-o-4_5. Selected modules are left unquantized to preserve audio and vision quality.

Highlights

  • Full multimodal capability preserved: text, vision, audio input, TTS output, voice cloning
  • Only LLM transformer layers are quantized; the audio encoder (Whisper), vision encoder (SigLIP), TTS decoder, and projection layers remain in bf16
  • TTS weight_norm layers are explicitly skipped (they crash under bitsandbytes quantization)
  • Tested with 293 unit and integration tests, all passing
  • Benchmark-validated: text quality identical to bf16, audio quality within natural variation

VRAM Requirements

| Precision | VRAM (loaded) | Peak VRAM | Load time |
| --- | --- | --- | --- |
| bf16 (baseline) | 21.0 GB | 21.3 GB | 16.3 s |
| 8-bit (this repo) | 14.5 GB | 14.8 GB | 20.5 s |

Benchmarked on NVIDIA RTX PRO 6000 Blackwell (96 GB). Your load times will vary.
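To put the table in relative terms, the short sketch below computes the fractional change of the 8-bit figures against the bf16 baseline. It uses only the numbers reported above; the dictionary keys and the helper name `relative_change` are illustrative.

```python
# Figures copied from the VRAM table above (GB and seconds).
BF16 = {"loaded_gb": 21.0, "peak_gb": 21.3, "load_s": 16.3}
INT8 = {"loaded_gb": 14.5, "peak_gb": 14.8, "load_s": 20.5}


def relative_change(baseline: float, value: float) -> float:
    """Return the fractional change of `value` relative to `baseline`."""
    return (value - baseline) / baseline


if __name__ == "__main__":
    for key in BF16:
        # Negative means the 8-bit model uses less; positive means more.
        print(f"{key}: {relative_change(BF16[key], INT8[key]):+.1%}")
```

On these figures, the 8-bit model needs roughly 31% less VRAM (6.5 GB) and takes about 26% longer to load.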

Quick Start

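import torch
from transformers import AutoModel, AutoTokenizer

# Pre-quantized checkpoint: no quantization_config is passed here, since the
# 8-bit settings are expected to be read from the repo's saved config.
model_name = "ericleigh007/MiniCPM-o-4_5-BNB-Int8"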

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    init_vision=True,
    init_audio=True,
    init_tts=True,
)
model.eval()
model.init_tts()
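After loading, it can be useful to confirm which parts of the model actually hold 8-bit layers. The following is a minimal sketch, not part of this repo: it walks `named_modules()` and counts layers by class name, assuming quantized layers are instances of bitsandbytes' `Linear8bitLt` and unquantized ones are plain `Linear`. The helper name `summarize_linear_layers` is hypothetical, and the top-level prefix `llm` for the language model is an assumption about the remote code's module naming.

```python
from collections import Counter


def summarize_linear_layers(model):
    """Count 8-bit and regular linear layers per top-level submodule.

    Matching is done on the class name so that this helper does not need
    to import bitsandbytes or torch itself.
    """
    counts = Counter()
    for name, module in model.named_modules():
        kind = type(module).__name__
        if kind not in ("Linear8bitLt", "Linear"):
            continue
        # "llm.model.layers.0.mlp.up_proj" -> "llm"
        top_level = name.split(".")[0] if name else "<root>"
        counts[(top_level, kind)] += 1
    return dict(counts)


# Example, with `model` loaded as in the Quick Start:
# for (top_level, kind), n in sorted(summarize_linear_layers(model).items()):
#     print(f"{top_level:28s} {kind:14s} {n}")
```

If the skip list was applied as described, `Linear8bitLt` entries should appear only under the language model, while the audio, vision, and TTS submodules report regular layers (layers wrapped by weight_norm may show up under a different class name and be omitted from the count).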

Quantization Details

Method: bitsandbytes 8-bit linear

Modules kept in bf16 (not quantized):

| Module | Reason |
| --- | --- |
| lm_head | Output projection; standard practice |
| apm | Whisper audio encoder; small, quality-sensitive |
| tts | TTS decoder; uses weight_norm, incompatible with bitsandbytes |
| vpm | SigLIP vision encoder; small, quality-sensitive |
| resampler | Vision resampler; small |
| audio_projection_layer | Audio-to-LLM projector; small |
| audio_avg_pooler | Audio pooling layer; small |
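For readers who want to reproduce a similar quantization from the base model, the skip list above maps naturally onto the `llm_int8_skip_modules` parameter of `BitsAndBytesConfig` in transformers. The sketch below is an assumption about how such a configuration could look, not the exact export script used for this repo. The `is_skipped` helper is a simplified stand-in for the library's own name matching, included for illustration only.

```python
# Module names taken from the table above.
SKIP_MODULES = [
    "lm_head",
    "apm",
    "tts",
    "vpm",
    "resampler",
    "audio_projection_layer",
    "audio_avg_pooler",
]


def build_quantization_config():
    """Build an 8-bit config that leaves SKIP_MODULES unquantized."""
    # Imported lazily so the rest of this sketch runs without transformers.
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_skip_modules=SKIP_MODULES,
    )


def is_skipped(module_name, skip=SKIP_MODULES):
    """Approximate check: is any dotted component of the name in the skip list?"""
    return any(part in skip for part in module_name.split("."))
```

The resulting config would then be passed as `quantization_config=build_quantization_config()` when loading `openbmb/MiniCPM-o-4_5` with `AutoModel.from_pretrained`. When loading this repo's pre-quantized weights, that step should not be necessary.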

Benchmark Results

Text quality is identical to bf16 (echo prompts return exact matches, math returns correct answers). Audio quality shows minor spectral variation within the range of normal run-to-run differences.
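The text comparison described above can be approximated with a simple exact-match check between bf16 and 8-bit outputs for the same prompts. The function below is a hypothetical sketch of such a check, not the benchmark harness used for the reported results. The name `exact_match_rate` and the whitespace stripping are assumptions.

```python
def exact_match_rate(reference_outputs, candidate_outputs):
    """Return the fraction of candidate outputs that equal the reference.

    Leading and trailing whitespace is ignored; everything else must match
    character for character.
    """
    if len(reference_outputs) != len(candidate_outputs):
        raise ValueError("both lists must have the same length")
    if not reference_outputs:
        raise ValueError("at least one output pair is required")
    matches = sum(
        ref.strip() == cand.strip()
        for ref, cand in zip(reference_outputs, candidate_outputs)
    )
    return matches / len(reference_outputs)


# Example: bf16 outputs as reference, 8-bit outputs as candidate.
# rate = exact_match_rate(bf16_outputs, int8_outputs)
```

A rate of 1.0 on echo prompts corresponds to the "exact matches" result stated above. For this comparison to be meaningful, generation would need to be deterministic (for example, greedy decoding).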

Full benchmark report with spectrograms: OmniChat Quantization Benchmarks

License

Same as the base model: MiniCPM Model License

Credits


Exported on 2026-03-11