---
language:
  - en
  - zh
license: other
license_name: minicpm-model-license
tags:
  - multimodal
  - vision
  - audio
  - tts
  - voice-cloning
  - bitsandbytes
  - 8-bit
  - quantized
base_model: openbmb/MiniCPM-o-4_5
library_name: transformers
pipeline_tag: any-to-any
---

# MiniCPM-o 4.5 — INT8 (bitsandbytes)

> **8-bit bitsandbytes quantization** of [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5)
> with selective module skipping for audio/vision quality preservation.

## Highlights

- Full multimodal capability preserved: **text, vision, audio input, TTS output, voice cloning**
- Only the LLM transformer layers are quantized; the audio encoder (Whisper), vision encoder (SigLIP),
  TTS decoder, and projection layers remain in bf16
- TTS `weight_norm` layers are explicitly skipped (they crash under bitsandbytes quantization)
- **Tested with 293 unit and integration tests**, all passing
- Benchmark-validated: text output matches bf16 on the benchmark prompts, and audio differences stay within normal run-to-run variation
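
The `weight_norm` skip above depends on knowing which modules carry weight normalization. Here is a minimal sketch of one way to find them, assuming PyTorch's two standard naming schemes: `weight_g`/`weight_v` for the legacy `torch.nn.utils.weight_norm`, and `parametrizations.weight.original0`/`original1` for the parametrization API. The helper name `find_weight_norm_modules` and the example paths are hypothetical and not part of this repo.

```python
def find_weight_norm_modules(param_names):
    """Return module paths whose parameters look weight-normalized.

    `param_names` is any iterable of dotted parameter names, for example
    `[name for name, _ in model.named_parameters()]`.
    """
    suffixes = (
        ".weight_g",  # legacy torch.nn.utils.weight_norm
        ".parametrizations.weight.original0",  # parametrization-based weight_norm
    )
    found = set()
    for name in param_names:
        for suffix in suffixes:
            if name.endswith(suffix):
                # Strip the suffix to recover the owning module's path.
                found.add(name[: -len(suffix)])
    return sorted(found)


# Illustrative names only; real paths depend on the model's module tree.
example = [
    "tts.decoder.conv.weight_g",
    "tts.decoder.conv.weight_v",
    "tts.head.parametrizations.weight.original0",
    "tts.head.parametrizations.weight.original1",
    "llm.layers.0.mlp.up_proj.weight",
]
print(find_weight_norm_modules(example))
```

Any module path this returns is a candidate for the quantization skip list.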

## VRAM Requirements

| Precision | VRAM (loaded) | Peak VRAM | Load Time |
|-----------|--------------|-----------|-----------|
| bf16 (baseline) | 21.0 GB | 21.3 GB | 16.3s |
| **8-bit (this repo)** | **14.5 GB** | **14.8 GB** | **20.5s** |

Benchmarked on NVIDIA RTX PRO 6000 Blackwell (96 GB). Your load times will vary.
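
To put the table in relative terms, the short calculation below derives the percent change from the figures above. It only restates the published numbers; the helper name `percent_change` is chosen here for illustration.

```python
def percent_change(baseline, value):
    """Signed percent change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100


# Figures copied from the table above (GB for VRAM, seconds for load time).
bf16 = {"loaded": 21.0, "peak": 21.3, "load_time": 16.3}
int8 = {"loaded": 14.5, "peak": 14.8, "load_time": 20.5}

for key in bf16:
    print(f"{key}: {percent_change(bf16[key], int8[key]):+.1f}%")
```

On these numbers, 8-bit loading uses roughly 31% less VRAM and takes about 26% longer to load.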

## Quick Start

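```python
import torch
from transformers import AutoModel, AutoTokenizer

# Requires the `bitsandbytes` package to be installed.
# No BitsAndBytesConfig is passed here: the checkpoint is already quantized.
model_name = "ericleigh007/MiniCPM-o-4_5-BNB-Int8"

# trust_remote_code is needed because the model ships custom modeling code.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,  # dtype for the modules kept unquantized
    device_map="auto",
    init_vision=True,  # enable the vision encoder
    init_audio=True,  # enable the audio encoder
    init_tts=True,  # enable the TTS decoder
)
model.eval()
model.init_tts()  # initialize the TTS components
```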

## Quantization Details

**Method:** bitsandbytes 8-bit linear

**Modules kept in bf16** (not quantized):

| Module | Reason |
|--------|--------|
| `lm_head` | Output projection — standard practice |
| `apm` | Whisper audio encoder — small, quality-sensitive |
| `tts` | TTS decoder — uses `weight_norm`, incompatible with bitsandbytes |
| `vpm` | SigLIP vision encoder — small, quality-sensitive |
| `resampler` | Vision resampler — small |
| `audio_projection_layer` | Audio-to-LLM projector — small |
| `audio_avg_pooler` | Audio pooling layer — small |
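
For reference, a skip list like the one in the table can be expressed with the `llm_int8_skip_modules` option of `transformers.BitsAndBytesConfig`. This is a sketch of how such a configuration might look, not necessarily the exact script used to produce this checkpoint. The helper name `build_int8_config` is hypothetical.

```python
# Module names taken from the table above.
SKIP_MODULES = [
    "lm_head",
    "apm",
    "tts",
    "vpm",
    "resampler",
    "audio_projection_layer",
    "audio_avg_pooler",
]


def build_int8_config():
    """Build an 8-bit bitsandbytes config that leaves SKIP_MODULES unquantized."""
    # Imported here so the skip list can be inspected without transformers installed.
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_skip_modules=SKIP_MODULES,
    )
```

The resulting object would be passed as `quantization_config=` to `AutoModel.from_pretrained` when quantizing the base model `openbmb/MiniCPM-o-4_5`. Loading this repo does not need it, as the Quick Start shows.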

## Benchmark Results

Text output is **identical** to bf16 on the benchmark prompts: echo prompts return exact matches and math prompts return correct answers.
Audio output shows minor spectral variation that stays within the range of normal run-to-run differences.

Full benchmark report with spectrograms: [OmniChat Quantization Benchmarks](https://github.com/ericleigh007/OmniChat)
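
The echo-prompt check described above amounts to an exact-match comparison between bf16 and 8-bit outputs. A minimal sketch of such a comparison follows, assuming both models were run on the same prompts with deterministic decoding. The function name `exact_match_rate` and the sample strings are placeholders, not the benchmark's actual code or data.

```python
def exact_match_rate(reference_outputs, candidate_outputs):
    """Fraction of prompts where both models produced the same text."""
    if len(reference_outputs) != len(candidate_outputs):
        raise ValueError("output lists must have the same length")
    if not reference_outputs:
        return 0.0
    matches = sum(
        ref.strip() == cand.strip()  # ignore leading/trailing whitespace
        for ref, cand in zip(reference_outputs, candidate_outputs)
    )
    return matches / len(reference_outputs)


# Placeholder strings for illustration; these are not benchmark data.
bf16_outputs = ["hello world", "4"]
int8_outputs = ["hello world", "4"]
print(exact_match_rate(bf16_outputs, int8_outputs))
```

A rate of 1.0 corresponds to the "identical" result reported for text.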

## License

Same as the base model: [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md)

## Credits

- Base model by [OpenBMB](https://huggingface.co/openbmb)
- Quantization and testing by [ericleigh007](https://huggingface.co/ericleigh007)
- Part of the [OmniChat](https://github.com/ericleigh007/OmniChat) project

---
*Exported on 2026-03-11*