vLLM load error
TypeError: Invalid type of HuggingFace processor. Expected type: <class 'transformers.processing_utils.ProcessorMixin'>, but found type: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>
Same here. I can't get this to load on either vLLM or SGLang. The latest SGLang gives a different error:
return super().__getattribute__(key)
AttributeError: 'Glm4vMoeConfig' object has no attribute 'rope_scaling'
You probably need to update your transformers library. I had the same error, and after updating to 5.0.0rc0 I can run it.
(glm46v) mbelleau@aibeast:/mnt/vault/llm/glm46v$ uv pip freeze | grep -e torch -e vllm -e transformers
torch==2.9.1
torchaudio==2.9.0+cu130
torchvision==0.24.1
transformers==5.0.0rc0
vllm==0.12.0
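A quick way to confirm which transformers release is actually installed in the active environment, rather than trusting the requirements file. This is only a sketch: the helper names `major_version` and `transformers_is_v5` are made up for illustration, and the "major version 5 or newer" threshold is taken from the report above, not from any official compatibility table.

```python
import re
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '5.0.0rc0'."""
    match = re.match(r"(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1))


def transformers_is_v5(version=None) -> bool:
    """True if the given (or installed) transformers version is 5.x or newer.

    Assumption: the thread reports that 5.0.0rc0 resolves the load errors,
    so the check is simply 'major version >= 5'.
    """
    if version is None:
        try:
            version = metadata.version("transformers")
        except metadata.PackageNotFoundError:
            return False  # transformers is not installed in this environment
    return major_version(version) >= 5
```

Calling `transformers_is_v5()` with no argument inspects the current environment; passing a string such as `"4.57.1"` checks that version instead.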
Thanks, that was indeed the issue. Even though I had it in my requirements, it didn't install for some reason. After installing transformers v5 I was able to run the model but had a lot of memory issues.
On my 8x L4 (24 GB each), this worked:
!vllm serve zai-org/GLM-4.6V-FP8 --served-model-name glm-4.6v --host 0.0.0.0 --port 1234 --max-model-len 32000 --tensor-parallel-size 8 --distributed-executor-backend mp --max-num-seqs 4 --enable-expert-parallel --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice --enforce-eager --mm-encoder-tp-mode data --mm-processor-cache-type shm --gpu_memory_utilization 0.8 --kv-cache-dtype fp8_e4m3 --mm-processor-cache-gb 1 --limit-mm-per-prompt '{"image":2, "video":0}'
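Once a server started with the command above is up, a request to its OpenAI-compatible chat endpoint could be built roughly like this. It is a sketch under assumptions: the port (1234) and served model name (glm-4.6v) come from the command above, while the helper name `build_chat_request`, the prompt, and the image URL are placeholders.

```python
import json
import urllib.request


def build_chat_request(prompt, image_url,
                       model="glm-4.6v",
                       base_url="http://localhost:1234"):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending it requires the server to be running (placeholder URL shown):
# req = build_chat_request("Describe this image in one sentence.",
#                          "https://example.com/picture.jpg")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```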
Because vllm==0.12.0 depends on transformers>=4.56.0,<5 and your project depends on transformers==5.0.0rc0, we can conclude that vllm==0.12.0 and your project are incompatible.
And because your project depends on vllm==0.12.0, we can conclude that your project's requirements are unsatisfiable.
Try this installation:
pip install transformers==5.0.0rc1 --upgrade --no-deps
If you are using the vLLM Docker image, build it like so:
# ./vllm/dockerfile
FROM vllm/vllm-openai:nightly
RUN uv pip install transformers==5.0.0rc0 --upgrade --no-deps --system
RUN uv pip install huggingface-hub --upgrade --no-deps --system
docker build . -t vllm/vllm-openai:glm46v
Then run that Docker image with the usual vLLM commands.
This was useful! Thank you for sharing!
Glad it works for you. With --kv-cache-dtype fp8 you can easily fit the AWQ or NVFP4 variants on a single RTX PRO 6000 with tokens to burn. 256k context works on a single GPU :)
Edit: added that this applies to the AWQ/NVFP4 variants on a single RTX PRO 6000 Blackwell system.
How?
- RTX Pro 6000 is 96 GiB while the FP8 version of the model takes 110 GiB
- This model is trained up to 131072 context size not 256k
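The objection is simple arithmetic, sketched below. The figures (96 GiB for one RTX PRO 6000, roughly 110 GiB for the FP8 weights, 8 x 24 GB L4 cards at --gpu_memory_utilization 0.8) are the ones quoted in this thread. The helper names are made up, GB and GiB are treated as interchangeable for a rough check, and KV cache and activation memory are ignored, so passing this check is necessary but not sufficient.

```python
def usable_gib(num_gpus: int, gib_per_gpu: float, utilization: float = 1.0) -> float:
    """Total memory the server may use across all GPUs."""
    return num_gpus * gib_per_gpu * utilization


def weights_fit(weights_gib: float, num_gpus: int, gib_per_gpu: float,
                utilization: float = 1.0) -> bool:
    """Rough check: do the weights alone fit in the memory budget?"""
    return weights_gib <= usable_gib(num_gpus, gib_per_gpu, utilization)


# Single 96 GiB card vs. ~110 GiB of FP8 weights: does not fit.
single_card = weights_fit(110, num_gpus=1, gib_per_gpu=96)

# 8 x 24 GB at 0.8 utilization gives about 153.6 GB: weights fit with room left.
eight_l4 = weights_fit(110, num_gpus=8, gib_per_gpu=24, utilization=0.8)
```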
I meant for the NVFP4 or AWQ variants. I'll update my original comment; I got two threads mixed up.
And for 256k here you go.
--max-model-len 262144 \
--rope-scaling '{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":131072 }'
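The two flags above are tied together: the YaRN factor is the target context length divided by the trained length (262144 / 131072 = 2.0). A small sketch that derives both flags from a target length follows. The function name `yarn_args` is hypothetical, and the flag names are copied from the snippet above rather than checked against any particular vLLM release.

```python
import json


def yarn_args(target_len: int, original_len: int = 131072) -> list:
    """Return CLI arguments for extending context with YaRN rope scaling.

    original_len defaults to 131072, the trained context size mentioned
    earlier in this thread.
    """
    if target_len < original_len:
        raise ValueError("target_len should not be below the trained context length")
    rope_scaling = {
        "rope_type": "yarn",
        "factor": target_len / original_len,
        "original_max_position_embeddings": original_len,
    }
    return [
        "--max-model-len", str(target_len),
        "--rope-scaling", json.dumps(rope_scaling),
    ]


args = yarn_args(262144)
```

The resulting list can be appended to the rest of a `vllm serve` argument list, for example when launching through `subprocess`.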
Ah, I see. However, I suspect there is significant quality degradation: from what I've seen, all of them are quantized with LLM Compressor, and LLM Compressor does not yet include a "calibrate all experts" flag for GLM-4.x. That flag is necessary so that all experts are quantized properly; see: https://github.com/vllm-project/llm-compressor/blob/0.9.0/examples/quantization_w4a4_fp4/README.md?plain=1#L85-L92
Quantizing MoEs
To quantize MoEs, MoE calibration is now handled automatically by the pipeline. An example quantizing Llama4 can be found under llama4_example.py. The pipeline automatically applies the appropriate MoE calibration context which:
- Linearizes the model to enable quantization and execution in vLLM. This is required as the native model definition does not include torch.nn.Linear layers in its MoE blocks, a requirement for LLM Compressor to run quantization.
- Ensures experts are quantized correctly as not all experts are activated during calibration
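To make the last point concrete, here is a toy illustration (not LLM Compressor code) of why a small calibration set can leave some experts without any data. Each calibration token is routed to a few experts, and any expert that is never selected gets no activation statistics. The function name and the routing data are invented for the example.

```python
from collections import Counter


def unseen_experts(routed_tokens, num_experts: int) -> list:
    """Return the expert indices that no calibration token was routed to.

    routed_tokens: one tuple of selected expert indices per token
    (for example, the top-k choices of the router).
    """
    hits = Counter(expert for token in routed_tokens for expert in token)
    return [expert for expert in range(num_experts) if hits[expert] == 0]


# Three calibration tokens, top-2 routing, four experts in the layer.
routing = [(0, 1), (1, 2), (0, 2)]
missing = unseen_experts(routing, num_experts=4)  # expert 3 was never activated
```

Without a mechanism that forces every expert to see calibration data, the experts listed in `missing` would be quantized without any calibration statistics.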
Ah yeah, I try to avoid that because it degrades performance on short contexts, and long-context performance is mediocre even without it; see https://contextarena.ai/ and https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87
This setup is my daily driver for web dev. Long context at my project size and the multimodal capabilities have worked great.
I notice that NVFP4 models are much more performant than AWQ models. The model cyankiwi/GLM-4.6V-AWQ-4bit seems to work fine (at least until my next GPU arrives :)