Error when loading model quantized with BitsAndBytesConfig for inference
Thank you for the great model!
I'm having trouble loading this model with BitsAndBytesConfig for inference. The script I used is the same as the one in the model card under 'Loading the Model Locally', but with the added keyword argument 'quantization_config'. For reference, the script that works is:
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map='cuda',
    torch_dtype='auto',
    _attn_implementation='flash_attention_2'
).cuda()
And the script that throws an error:
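import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization, computing in bfloat16
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
# Same call as the working script, plus quantization_config
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map='cuda',
    torch_dtype='auto',
    quantization_config=nf4_config,
    _attn_implementation='flash_attention_2'
).cuda()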
The error message I get:
It seems the model will return 'None' when quantization_config is passed.
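To check the 'None' observation directly, something like the following might help. It is only a diagnostic sketch, not a fix. The helper names build_load_kwargs and load_and_report are made up for illustration. It also drops the trailing .cuda(), on the assumption that device_map='cuda' already places the weights on the GPU and that some transformers/bitsandbytes version combinations reject .cuda() on quantized models.

```python
# Diagnostic sketch (hypothetical helpers, not part of transformers).
# The two scripts above differ only in quantization_config, so the kwargs
# are built in one place and the result of from_pretrained is inspected.


def build_load_kwargs(quantize):
    """Return the from_pretrained kwargs used in the scripts above."""
    kwargs = {
        "trust_remote_code": True,
        "device_map": "cuda",
        "torch_dtype": "auto",
        "_attn_implementation": "flash_attention_2",
    }
    if quantize:
        # Imported lazily so the unquantized path needs neither package.
        import torch
        from transformers import BitsAndBytesConfig

        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    return kwargs


def load_and_report(model_path, quantize):
    """Load the model without the trailing .cuda() and print what came back."""
    from transformers import AutoModelForCausalLM

    # Assumption: no .cuda() here, since device_map already targets the GPU.
    model = AutoModelForCausalLM.from_pretrained(
        model_path, **build_load_kwargs(quantize)
    )
    print(type(model).__name__, "| is None:", model is None)
    return model
```

Calling load_and_report(model_path, quantize=False) and then load_and_report(model_path, quantize=True) should show whether from_pretrained itself returns None in the quantized case, or whether the failure happens somewhere else, for example inside the model's remote code or at the .cuda() call.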
I think there's another thread with a similar issue when fine-tuning.
Any help would be much appreciated.
