language:
  - en
  - zh
license: other
license_name: minicpm-model-license
tags:
  - multimodal
  - vision
  - audio
  - tts
  - voice-cloning
  - bitsandbytes
  - 8-bit
  - quantized
base_model: openbmb/MiniCPM-o-4_5
library_name: transformers
pipeline_tag: any-to-any

MiniCPM-o 4.5 — INT8 (bitsandbytes)

8-bit bitsandbytes quantization of openbmb/MiniCPM-o-4_5 with selective module skipping for audio/vision quality preservation.

Highlights

Full multimodal capability preserved: text, vision, audio input, TTS output, voice cloning
Only LLM transformer layers are quantized — audio encoder (Whisper), vision encoder (SigLIP), TTS decoder, and projection layers remain in bf16
TTS weight_norm layers are explicitly skipped (they crash under bitsandbytes quantization)
Tested with 293 unit + integration tests — all passing
Benchmark-validated: text output matches bf16 exactly on the test prompts, and audio differences stay within normal run-to-run variation

VRAM Requirements

| Precision | VRAM (loaded) | Peak VRAM | Load Time |
|---|---|---|---|
| bf16 (baseline) | 21.0 GB | 21.3 GB | 16.3 s |
| 8-bit (this repo) | 14.5 GB | 14.8 GB | 20.5 s |

Benchmarked on NVIDIA RTX PRO 6000 Blackwell (96 GB). Your load times will vary.
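To put the table in relative terms, the short calculation below derives the savings from the four measurements above. It uses only the figures in the table; the helper name `percent_change` is introduced here for illustration. The reduction is well under 50%, which is consistent with the encoders, TTS decoder, and projection layers staying in bf16.

```python
# Figures copied from the VRAM table above (GB and seconds).
BF16 = {"loaded_gb": 21.0, "peak_gb": 21.3, "load_s": 16.3}
INT8 = {"loaded_gb": 14.5, "peak_gb": 14.8, "load_s": 20.5}


def percent_change(baseline: float, value: float) -> float:
    """Signed percent change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


loaded_saving_gb = BF16["loaded_gb"] - INT8["loaded_gb"]
loaded_change = percent_change(BF16["loaded_gb"], INT8["loaded_gb"])
peak_change = percent_change(BF16["peak_gb"], INT8["peak_gb"])
load_time_change = percent_change(BF16["load_s"], INT8["load_s"])

print(f"Loaded VRAM saved: {loaded_saving_gb:.1f} GB ({loaded_change:+.1f}%)")
print(f"Peak VRAM change:  {peak_change:+.1f}%")
print(f"Load time change:  {load_time_change:+.1f}%")
```

In short, the 8-bit build trades roughly a quarter more load time for about 6.5 GB less VRAM on the benchmark machine.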

Quick Start

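# Requires the bitsandbytes package in addition to transformers and torch.
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "ericleigh007/MiniCPM-o-4_5-BNB-Int8"

# trust_remote_code is needed because the model code ships with the repo.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# The checkpoint is already quantized, so no BitsAndBytesConfig is passed here.
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,  # dtype for the modules that are not quantized
    device_map="auto",
    init_vision=True,  # load the vision encoder
    init_audio=True,   # load the audio encoder
    init_tts=True,     # load the TTS decoder
)
model.eval()      # inference mode
model.init_tts()  # set up TTS for speech output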
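After loading, it can be useful to confirm which parts of the model actually hold 8-bit layers. The sketch below is one way to do that, assuming the loaded model exposes `named_modules()` like any `torch.nn.Module` and that quantized layers appear under the bitsandbytes class name `Linear8bitLt`. The helper `summarize_linear_layers` is hypothetical and not part of this repo.

```python
from collections import Counter


def summarize_linear_layers(model):
    """Count plain vs. 8-bit linear layers per top-level submodule.

    Works with any object that exposes `named_modules()` the way
    torch.nn.Module does. Matching is done on class names, so neither
    torch nor bitsandbytes has to be imported here.
    """
    summary = {}
    for name, module in model.named_modules():
        if not name:  # the root module is reported with an empty name
            continue
        class_name = type(module).__name__
        if class_name not in ("Linear", "Linear8bitLt"):
            continue
        top_level = name.split(".", 1)[0]
        summary.setdefault(top_level, Counter())[class_name] += 1
    return summary


# Usage with the model from Quick Start:
# for top_level, counts in summarize_linear_layers(model).items():
#     print(top_level, dict(counts))
```

If the selective skipping works as described, `Linear8bitLt` entries should show up only under the language model, while the audio, vision, and TTS submodules report plain `Linear` layers or nothing at all.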

Quantization Details

Method: bitsandbytes 8-bit linear quantization (LLM.int8()), applied to the linear layers of the LLM only

Modules kept in bf16 (not quantized):

| Module | Reason |
|---|---|
| `lm_head` | Output projection; skipping it is standard practice |
| `apm` | Whisper audio encoder; small and quality-sensitive |
| `tts` | TTS decoder; uses `weight_norm`, which is incompatible with bitsandbytes |
| `vpm` | SigLIP vision encoder; small and quality-sensitive |
| `resampler` | Vision resampler; small |
| `audio_projection_layer` | Audio-to-LLM projector; small |
| `audio_avg_pooler` | Audio pooling layer; small |
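For readers who want to produce a comparable quantization from the base checkpoint, the sketch below shows how a skip list like the one above could be passed to transformers via `BitsAndBytesConfig(llm_int8_skip_modules=...)`. This is an assumption about the setup, not the exact script used for this repo. The function names `build_quantization_config` and `load_quantized_from_base` are hypothetical, and any other settings the author may have used are not known.

```python
# Module names copied from the table above.
SKIP_MODULES = [
    "lm_head",
    "apm",
    "tts",
    "vpm",
    "resampler",
    "audio_projection_layer",
    "audio_avg_pooler",
]


def build_quantization_config():
    """Build an 8-bit config that asks transformers to leave SKIP_MODULES alone."""
    # Imported lazily so the list above can be reused without transformers.
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_skip_modules=SKIP_MODULES,
    )


def load_quantized_from_base(base_model="openbmb/MiniCPM-o-4_5"):
    """Quantize the base checkpoint at load time (needs a GPU and bitsandbytes)."""
    import torch
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        base_model,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,  # dtype for the skipped modules
        device_map="auto",
        quantization_config=build_quantization_config(),
    )
```

Users of this repo do not need any of this, since the uploaded checkpoint is already quantized.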

Benchmark Results

Text quality is identical to bf16 (echo prompts return exact matches, math returns correct answers). Audio quality shows minor spectral variation within the range of normal run-to-run differences.

Full benchmark report with spectrograms: OmniChat Quantization Benchmarks
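The exact-match text check described above can be reproduced with a few lines. This is a minimal sketch, assuming you have already collected the text outputs of the bf16 and 8-bit models for the same prompts with deterministic decoding (sampling disabled). The function name and the sample strings are illustrative placeholders, not data from the benchmark report.

```python
def exact_match_rate(reference_outputs, candidate_outputs):
    """Fraction of prompts where both models returned the same text.

    Leading and trailing whitespace is ignored; everything else must match.
    """
    if len(reference_outputs) != len(candidate_outputs):
        raise ValueError("both lists must cover the same prompts")
    if not reference_outputs:
        raise ValueError("no outputs to compare")
    matches = sum(
        ref.strip() == cand.strip()
        for ref, cand in zip(reference_outputs, candidate_outputs)
    )
    return matches / len(reference_outputs)


# Placeholder outputs for an echo prompt and a math prompt (made up here).
bf16_outputs = ["Hello, world!", "4"]
int8_outputs = ["Hello, world!", "4"]

rate = exact_match_rate(bf16_outputs, int8_outputs)
print(f"Exact-match rate: {rate:.0%}")
```

A rate of 1.0 corresponds to the "identical to bf16" result reported for text. Audio output needs a spectral comparison instead, because waveforms differ slightly from run to run.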

License

Same as the base model: MiniCPM Model License

Credits

Base model by OpenBMB
Quantization and testing by ericleigh007
Part of the OmniChat project

Exported on 2026-03-11