---
language:
- en
- zh
license: other
license_name: minicpm-model-license
tags:
- multimodal
- vision
- audio
- tts
- voice-cloning
- bitsandbytes
- 8-bit
- quantized
base_model: openbmb/MiniCPM-o-4_5
library_name: transformers
pipeline_tag: any-to-any
---

# MiniCPM-o 4.5: INT8 (bitsandbytes)

> **8-bit bitsandbytes quantization** of [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5)
> with selective module skipping for audio/vision quality preservation.

## Highlights

- Full multimodal capability preserved: **text, vision, audio input, TTS output, voice cloning**
- Only LLM transformer layers are quantized; the audio encoder (Whisper), vision encoder (SigLIP), TTS decoder, and projection layers remain in bf16
- TTS `weight_norm` layers are explicitly skipped (they crash under bitsandbytes quantization)
- **Tested with 293 unit + integration tests**, all passing
- Benchmark-validated: text quality identical to bf16, audio quality within natural variation

## VRAM Requirements

| Precision | VRAM (loaded) | Peak VRAM | Load Time |
|-----------|--------------|-----------|-----------|
| bf16 (baseline) | 21.0 GB | 21.3 GB | 16.3 s |
| **8-bit (this repo)** | **14.5 GB** | **14.8 GB** | **20.5 s** |

Benchmarked on NVIDIA RTX PRO 6000 Blackwell (96 GB). Your load times will vary.
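To put the table in perspective, the short sketch below works out the peak-VRAM saving and checks whether a given card is likely to fit each precision. The figures are copied from the table above. The `fits` helper and its 1 GB default headroom are illustrative assumptions, not part of this repo. Treat the result as a rough guide only: the numbers come from a single GPU, and activation memory grows with input length.

```python
# Figures copied from the VRAM table above (GB and seconds).
MEASURED = {
    "bf16": {"loaded_gb": 21.0, "peak_gb": 21.3, "load_s": 16.3},
    "int8": {"loaded_gb": 14.5, "peak_gb": 14.8, "load_s": 20.5},
}


def vram_saving(baseline="bf16", quantized="int8"):
    """Return (GB saved, percent saved) on peak VRAM, rounded to one decimal."""
    base = MEASURED[baseline]["peak_gb"]
    quant = MEASURED[quantized]["peak_gb"]
    saved = base - quant
    return round(saved, 1), round(100 * saved / base, 1)


def fits(card_gb, precision, headroom_gb=1.0):
    """Hypothetical helper: does measured peak VRAM plus headroom fit on the card?

    headroom_gb is an assumed safety margin, not a measured value.
    """
    return MEASURED[precision]["peak_gb"] + headroom_gb <= card_gb


if __name__ == "__main__":
    gb, pct = vram_saving()
    print(f"8-bit saves {gb} GB of peak VRAM ({pct}% less than bf16)")
    for card in (16.0, 24.0):
        for precision in ("bf16", "int8"):
            print(f"{card:.0f} GB card, {precision}: fits={fits(card, precision)}")
```

With the table's figures, the saving is 6.5 GB (about 30.5%). Under the assumed 1 GB headroom, the 8-bit model fits on a 16 GB card and the bf16 baseline does not.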
## Quick Start

```python
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "ericleigh007/MiniCPM-o-4_5-BNB-Int8"

# trust_remote_code=True runs the model's custom code from the repository.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# No quantization_config is passed here; the call relies on the
# settings saved with the checkpoint.
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,  # dtype for the modules kept unquantized
    device_map="auto",
    init_vision=True,  # enable the vision, audio, and TTS components
    init_audio=True,
    init_tts=True,
)
model.eval()
model.init_tts()  # initialize the TTS decoder
```

## Quantization Details

**Method:** bitsandbytes 8-bit linear

**Modules kept in bf16** (not quantized):

| Module | Reason |
|--------|--------|
| `lm_head` | Output projection; standard practice |
| `apm` | Whisper audio encoder; small, quality-sensitive |
| `tts` | TTS decoder; uses `weight_norm`, incompatible with bitsandbytes |
| `vpm` | SigLIP vision encoder; small, quality-sensitive |
| `resampler` | Vision resampler; small |
| `audio_projection_layer` | Audio-to-LLM projector; small |
| `audio_avg_pooler` | Audio pooling layer; small |

## Benchmark Results

Text quality is **identical** to bf16 (echo prompts return exact matches, math returns correct answers).
Audio quality shows minor spectral variation within the range of normal run-to-run differences.

Full benchmark report with spectrograms: [OmniChat Quantization Benchmarks](https://github.com/ericleigh007/OmniChat)

## License

Same as the base model: [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md)

## Credits

- Base model by [OpenBMB](https://huggingface.co/openbmb)
- Quantization and testing by [ericleigh007](https://huggingface.co/ericleigh007)
- Part of the [OmniChat](https://github.com/ericleigh007/OmniChat) project

---

*Exported on 2026-03-11*