Error when loading model quantized with BitsAndBytesConfig for inference

#67

by deathknight0 - opened Apr 10, 2025


deathknight0 · Apr 10, 2025 · edited Apr 10, 2025

Thank you for the great model!

I am having trouble loading this model with BitsAndBytesConfig for inference. The script I used is the same as the one in the model card under 'Loading the Model Locally', with the added keyword argument 'quantization_config'. For convenience, here is the script that works:

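```python
from transformers import AutoModelForCausalLM

model_path = "path/to/model"  # hypothetical placeholder: local copy of this model

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map='cuda',
    torch_dtype='auto',
    _attn_implementation='flash_attention_2'
).cuda()  # redundant with device_map='cuda', but harmless for an unquantized model
```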

And here is the script that throws an error:

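```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_path = "path/to/model"  # hypothetical placeholder: local copy of this model

# 4-bit NF4 weights with double quantization; matmuls are computed in bfloat16
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map='cuda',
    torch_dtype='auto',
    quantization_config=nf4_config,
    _attn_implementation='flash_attention_2'
).cuda()  # depending on the bitsandbytes version, transformers may reject .cuda() on 4-bit models
```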
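For context on what the nf4_config in the failing script is meant to achieve, here is a rough estimate of the weight storage cost per parameter. It is a sketch that assumes the accounting described in the QLoRA paper: one fp32 absmax constant per block of 64 weights, and, with double quantization, those constants stored in 8 bits with one fp32 constant per 256 of them. The function names and the 4-billion parameter count are hypothetical, chosen only for illustration, and layers that bitsandbytes typically leaves in higher precision (such as the output head and normalization layers) are not counted.

```python
# Rough per-parameter storage cost of NF4-quantized weights, assuming the
# QLoRA-style block sizes: 64 weights per first-level absmax constant and
# 256 first-level constants per second-level constant (assumed defaults).


def bits_per_param(double_quant: bool, block: int = 64, block2: int = 256) -> float:
    """Average bits stored per quantized weight: 4-bit code plus scaling constants."""
    if double_quant:
        # First-level constants are quantized to 8 bits, with one fp32
        # second-level constant per `block2` of them.
        overhead = 8 / block + 32 / (block * block2)
    else:
        # One fp32 absmax constant per block of weights.
        overhead = 32 / block
    return 4 + overhead


def weight_gib(n_params: float, bits: float) -> float:
    """Weight storage in GiB for n_params parameters at `bits` bits each."""
    return n_params * bits / 8 / 2**30


if __name__ == "__main__":
    n = 4e9  # hypothetical parameter count, for illustration only
    for label, bits in [
        ("bf16", 16.0),
        ("nf4", bits_per_param(double_quant=False)),
        ("nf4 + double quant", bits_per_param(double_quant=True)),
    ]:
        print(f"{label:>20}: {bits:.3f} bits/param, ~{weight_gib(n, bits):.2f} GiB")
```

Under these assumptions, double quantization lowers the overhead from 0.5 to about 0.127 bits per parameter, so the quantized weights take a bit more than a quarter of their bf16 size.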

The error message I get:

It seems that the model returns 'None' when quantization_config is passed.

I think there is another thread reporting a similar issue during fine-tuning.

Any help would be much appreciated.
