vLLM load error

by srinivasbilla - opened Dec 8, 2025

Discussion

srinivasbilla

Dec 8, 2025

TypeError: Invalid type of HuggingFace processor. Expected type: <class 'transformers.processing_utils.ProcessorMixin'>, but found type: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>

chriswritescode

Dec 8, 2025

Same here, I can't get this to load on either vLLM or SGLang. The latest SGLang gives a different error:

    return super().__getattribute__(key)
    AttributeError: 'Glm4vMoeConfig' object has no attribute 'rope_scaling'

malaiwah

Dec 9, 2025

You probably need to update your transformers library. I had the same error, and after updating to 5.0.0rc0 it runs fine.

(glm46v) mbelleau@aibeast:/mnt/vault/llm/glm46v$ uv pip freeze | grep -e torch -e vllm -e transformers
torch==2.9.1
torchaudio==2.9.0+cu130
torchvision==0.24.1
transformers==5.0.0rc0
vllm==0.12.0
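
A quick way to confirm which versions are actually installed in the active environment (the failure above came down to a transformers release older than 5.0.0rc0) is a small standard-library check. This is only a sketch: the package list is just the three names from the freeze output above, and `installed_versions` is a made-up helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Return {package name: installed version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not present in the active environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["torch", "transformers", "vllm"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running it inside the same virtual environment that launches the server avoids checking a different interpreter's packages by accident.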

srinivasbilla

Dec 9, 2025

Thanks, that was indeed the issue. Even though I had it in my requirements, it didn't install for some reason. After installing transformers v5 I was able to run the model, but I had a lot of memory issues.
On my 8x L4 (24 GB each), this worked:

!vllm serve zai-org/GLM-4.6V-FP8 --served-model-name glm-4.6v --host 0.0.0.0 --port 1234 --max-model-len 32000 --tensor-parallel-size 8 --distributed-executor-backend mp --max-num-seqs 4 --enable-expert-parallel --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice --enforce-eager --mm-encoder-tp-mode data --mm-processor-cache-type shm --gpu_memory_utilization 0.8 --kv-cache-dtype fp8_e4m3 --mm-processor-cache-gb 1 --limit-mm-per-prompt '{"image":2, "video":0}'
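
For reference, a request against a server started with the command above could look like the following. It is a minimal sketch using only the standard library. It assumes the server is reachable at localhost on port 1234 and exposes vLLM's OpenAI-compatible `/v1/chat/completions` route; the `build_payload` and `ask` helper names are made up for illustration.

```python
import json
import urllib.request

# Values taken from the serve command above: --port 1234, --served-model-name glm-4.6v.
BASE_URL = "http://localhost:1234/v1/chat/completions"  # assumes a local server
MODEL_NAME = "glm-4.6v"


def build_payload(prompt, image_url=None, max_tokens=128):
    """Build an OpenAI-style chat completion request body."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


def ask(prompt, image_url=None):
    """POST the request to the server and return the text of the first choice."""
    request = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(prompt, image_url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Describe this image in one sentence.", image_url=...)` with a reachable image URL sends one text part and one image part, which stays within the `--limit-mm-per-prompt` setting of two images.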

srinivasbilla changed discussion status to closed Dec 9, 2025

mratsim

Dec 9, 2025

Because vllm==0.12.0 depends on transformers>=4.56.0,<5 and your project depends on transformers==5.0.0rc0, we can conclude that vllm==0.12.0 and your project are incompatible.
And because your project depends on vllm==0.12.0, we can conclude that your project's requirements are unsatisfiable.
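
The resolver message above follows from how version ranges treat pre-releases: under PEP 440, `<5` also excludes pre-releases of 5, so 5.0.0rc0 cannot satisfy `>=4.56.0,<5`. The sketch below hand-rolls a simplified check for just this kind of range. It is not a full PEP 440 implementation, and `parse`, `pad`, and `satisfies_vllm_pin` are hypothetical helper names.

```python
import re


def parse(version_string):
    """Split e.g. '5.0.0rc0' into ((5, 0, 0), True). Simplified: only a/b/rc suffixes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)((?:a|b|rc)\d+)?", version_string)
    if match is None:
        raise ValueError(f"unsupported version string: {version_string!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    return release, match.group(2) is not None


def pad(release, length=3):
    """Pad a release tuple with zeros so (5,) compares equal to (5, 0, 0)."""
    return release + (0,) * (length - len(release))


def satisfies_vllm_pin(version_string, lower="4.56.0", upper="5"):
    """Check lower <= version < upper, with pre-releases of either bound excluded."""
    release, is_prerelease = parse(version_string)
    lower_release, _ = parse(lower)
    upper_release, _ = parse(upper)
    if is_prerelease and pad(release) == pad(lower_release):
        return False  # e.g. 4.56.0rc1 sorts before 4.56.0
    # 5.0.0rc0 has release (5, 0, 0), which is not below the upper bound (5, 0, 0).
    return pad(lower_release) <= pad(release) < pad(upper_release)


for candidate in ["4.57.3", "5.0.0rc0"]:
    print(candidate, satisfies_vllm_pin(candidate))
```

The first candidate is accepted and the second is rejected, which matches the "requirements are unsatisfiable" conclusion in the message.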

nmitchko

Dec 14, 2025

edited Dec 14, 2025

try this installation:

pip install transformers==5.0.0rc1 --upgrade --no-deps

If using vllm docker, build like so

# ./vllm/dockerfile
FROM vllm/vllm-openai:nightly
RUN uv pip install transformers==5.0.0rc0 --upgrade --no-deps --system
RUN uv pip install huggingface-hub --upgrade --no-deps --system

docker build . -t vllm/vllm-openai:glm46v

then run that docker image according to normal vllm commands.

Felladrin

Dec 15, 2025

This was useful! Thank you for sharing!

nmitchko

Dec 17, 2025

edited Dec 17, 2025

Glad it works for you. With --kv-cache-dtype fp8 you can easily fit the AWQ or NVFP4 variants on a single RTX 6000 PRO with tokens to burn. 256k context works on a single GPU :)

Edit: added that this applies to the AWQ/NVFP4 variants on a single 6000 Blackwell system.

mratsim

Dec 17, 2025

How?

The RTX Pro 6000 has 96 GiB, while the FP8 version of the model takes 110 GiB.
This model was trained with a context size of up to 131072 tokens, not 256k.
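
The arithmetic behind this objection, and behind the 8x L4 setup that did work earlier in the thread, can be written out as a rough check. It only compares weight size to the memory budget, using the figures quoted in the thread (about 110 GiB for the FP8 weights, 96 GiB on an RTX Pro 6000, 24 GB per L4, `--gpu_memory_utilization 0.8`). It ignores KV cache, activations, and the GB versus GiB difference, so treat it as an estimate only.

```python
def weight_headroom_gib(gpu_count, gib_per_gpu, utilization, weights_gib):
    """Memory left after loading weights, in GiB; negative means the weights do not fit."""
    budget = gpu_count * gib_per_gpu * utilization
    return budget - weights_gib


FP8_WEIGHTS_GIB = 110  # figure quoted in the thread, not measured here

# Single RTX Pro 6000 (96 GiB), even at 100% utilization.
single_gpu = weight_headroom_gib(1, 96, 1.0, FP8_WEIGHTS_GIB)

# 8x L4 (24 GB each, treated as GiB for simplicity) at --gpu_memory_utilization 0.8.
eight_l4 = weight_headroom_gib(8, 24, 0.8, FP8_WEIGHTS_GIB)

print(f"1x RTX Pro 6000: {single_gpu:+.1f} GiB")  # -14.0 GiB
print(f"8x L4 @ 0.8:     {eight_l4:+.1f} GiB")  # +43.6 GiB
```

A negative result means the weights alone exceed the budget, before any KV cache is allocated.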

nmitchko

Dec 17, 2025

I meant the NVFP4 or AWQ variants. I'll update my original comment; I got two threads mixed up.

And for 256k here you go.

    --max-model-len 262144 \
    --rope-scaling '{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":131072 }'
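
The two flags are tied together: the extended window is the original training window multiplied by the YaRN factor, here 131072 × 2.0 = 262144. A small sketch that derives both values from the same inputs so they cannot drift apart. The `yarn_flags` helper is hypothetical, and the flag names are copied from the post above rather than verified against a particular vLLM version.

```python
import json


def yarn_flags(original_max_positions, factor):
    """Return the two CLI arguments for a YaRN-extended context window."""
    rope_scaling = {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": original_max_positions,
    }
    # Extended window = training window scaled by the YaRN factor.
    max_model_len = int(original_max_positions * factor)
    return [
        f"--max-model-len {max_model_len}",
        f"--rope-scaling '{json.dumps(rope_scaling)}'",
    ]


for flag in yarn_flags(131072, 2.0):
    print(flag)
```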

mratsim

Dec 18, 2025

Ah, I see. However, I suspect there is significant quality degradation in those variants. From what I've seen, all of them are produced with LLM Compressor, which does not yet include a "calibrate all experts" flag for GLM4.x. That flag is necessary so that all experts are quantized properly; see: https://github.com/vllm-project/llm-compressor/blob/0.9.0/examples/quantization_w4a4_fp4/README.md?plain=1#L85-L92

Quantizing MoEs

To quantize MoEs, MoE calibration is now handled automatically by the pipeline. An example quantizing Llama4 can be found under llama4_example.py. The pipeline automatically applies the appropriate MoE calibration context which:

Linearizes the model to enable quantization and execution in vLLM. This is required as the native model definition does not include torch.nn.Linear layers in its MoE blocks, a requirement for LLM Compressor to run quantization.

Ensures experts are quantized correctly as not all experts are activated during calibration
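
The second point in the excerpt, that not all experts are activated during calibration, can be illustrated with a toy router. This is a made-up simulation, not LLM Compressor code: the expert count, top-k value, and skewed routing weights are arbitrary assumptions, chosen only to show that a small calibration set can leave some experts with few or no tokens.

```python
import random


def simulate_routing(num_experts, top_k, num_tokens, seed=0):
    """Count how many calibration tokens each expert receives under skewed top-k routing."""
    rng = random.Random(seed)
    # Assumed popularity: expert i is picked with weight 1 / (i + 1).
    weights = [1.0 / (i + 1) for i in range(num_experts)]
    counts = [0] * num_experts
    for _ in range(num_tokens):
        chosen = set()
        # Keep sampling until this token has top_k distinct experts.
        while len(chosen) < top_k:
            chosen.add(rng.choices(range(num_experts), weights=weights)[0])
        for expert in chosen:
            counts[expert] += 1
    return counts


counts = simulate_routing(num_experts=128, top_k=8, num_tokens=64)
unseen = sum(1 for c in counts if c == 0)
print(f"{unseen} of {len(counts)} experts saw no calibration tokens")
```

Experts that receive zero or very few tokens have no activation statistics to calibrate against, which is the situation a "calibrate all experts" option is meant to cover.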

Regarding the 256k setup (--max-model-len 262144 with YaRN --rope-scaling at factor 2.0): I try to avoid that because it degrades performance on small contexts, and long-context performance is mediocre even without it; see https://contextarena.ai/ and https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87

nmitchko

Dec 18, 2025

This setup is my daily driver for web dev. It handles long context at my project size, and the multimodal capabilities have worked great.

I notice that NVFP4 models are much more performant than AWQ models. The model cyankiwi/GLM-4.6V-AWQ-4bit seems to work fine (at least until my next GPU arrives :)
