Instructions to use zai-org/GLM-4.7 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use zai-org/GLM-4.7 with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

# device_map="auto" spreads the weights over the available GPUs (requires accelerate);
# torch_dtype="auto" loads the weights in the dtype stored in the checkpoint.
pipe = pipeline("text-generation", model="zai-org/GLM-4.7", torch_dtype="auto", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
print(pipe(messages))
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7")
model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-4.7", torch_dtype="auto", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

Inference: HuggingChat
Notebooks: Google Colab, Kaggle
Local Apps

vLLM

How to use zai-org/GLM-4.7 with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zai-org/GLM-4.7"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "zai-org/GLM-4.7",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
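The same curl call can be made from Python. This minimal sketch uses only the standard library (`json`, `urllib.request`) rather than any vLLM-specific client, and assumes the server started above is listening on `localhost:8000`; `build_chat_request` and `send_chat_request` are illustrative helpers, not part of vLLM.

```python
import json
import urllib.request

# Minimal sketch: build and send an OpenAI-compatible chat-completions request
# with the standard library only. Helper names are hypothetical.


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Return a POST request for the /v1/chat/completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(req: urllib.request.Request, timeout: float = 300.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-style response layout: choices[0].message.content
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
#   req = build_chat_request("http://localhost:8000", "zai-org/GLM-4.7",
#                            "What is the capital of France?")
#   print(send_chat_request(req))
```

Keeping request construction separate from sending makes the payload easy to inspect before a long-running call to a large model.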

Use Docker

```
docker model run hf.co/zai-org/GLM-4.7
```

SGLang

How to use zai-org/GLM-4.7 with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-4.7" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "zai-org/GLM-4.7",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
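Both servers also accept `"stream": true` in the request body, in which case the OpenAI-compatible endpoint replies with server-sent-event lines of the form `data: {...}`, ending with `data: [DONE]`. A small parser for those lines might look like the sketch below; `iter_stream_text` is a hypothetical helper, and it only reads the first choice of each chunk.

```python
import json
from typing import Iterable, Iterator

# Minimal sketch: turn the SSE lines of a streaming /v1/chat/completions
# response into text fragments. Assumes the request set "stream": true.


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield choices[0].delta.content from each OpenAI-style 'data:' line."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, keep-alives
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g. a trailing usage-only chunk
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text
```

With `urllib`, the lines can come straight from the response object, e.g. `for piece in iter_stream_text(l.decode("utf-8") for l in resp): print(piece, end="", flush=True)`.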

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "zai-org/GLM-4.7" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "zai-org/GLM-4.7",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
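Downloading and loading a checkpoint of this size can take a long time, so the curl call may fail until the server is ready. The sketch below polls a `/health` route before any requests are sent; it assumes your SGLang (or vLLM) version exposes that route, and `wait_for_server` is an illustrative helper, not a library function.

```python
import http.client
import time
import urllib.request

# Minimal sketch: block until GET {base_url}/health returns HTTP 200.
# Assumption: the server exposes a /health route.


def wait_for_server(base_url: str, timeout_s: float = 1800.0,
                    interval_s: float = 5.0) -> bool:
    """Return True once the server reports healthy, False if the deadline passes."""
    url = f"{base_url.rstrip('/')}/health"
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or an HTTPError while weights are loading.
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval_s, remaining))
```

For the container above, `wait_for_server("http://localhost:30000")` would be called once before sending the first chat request.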

Docker Model Runner
How to use zai-org/GLM-4.7 with Docker Model Runner:
```
docker model run hf.co/zai-org/GLM-4.7
```

Request details on GPU and memory requirements

#16

by DragoZatch - opened Dec 26, 2025

Discussion

DragoZatch

Dec 26, 2025

I would like to check whether anyone has tried to run the model on GPUs and found out how much GPU memory this model requires. I also want to know the maximum memory requirement for the full-scale model with full context-length support.
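For the full-context part of the question, memory grows with the KV cache on top of the weights. For standard grouped-query attention a rough estimate is 2 × layers × KV heads × head dimension × context length × batch size × bytes per element (the 2 covers K and V). The sketch below encodes that formula; the config values in it are hypothetical placeholders, not GLM-4.7's real configuration, so substitute `num_hidden_layers`, `num_key_value_heads` and `head_dim` from the model's `config.json`, and expect serving engines to add their own overhead.

```python
# Minimal sketch: estimate KV-cache memory for a given context length.
# Assumes standard grouped-query attention (one K and one V tensor per layer).


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache K and V for every token in the context.

    bytes_per_elem: 2 for a BF16/FP16 cache, 1 for an FP8 cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * batch_size * bytes_per_elem


def kv_cache_bytes_from_config(cfg: dict, context_len: int, **kwargs) -> int:
    """Same estimate, reading the usual Hugging Face config.json keys."""
    head_dim = cfg.get("head_dim") or cfg["hidden_size"] // cfg["num_attention_heads"]
    return kv_cache_bytes(
        num_layers=cfg["num_hidden_layers"],
        num_kv_heads=cfg.get("num_key_value_heads", cfg["num_attention_heads"]),
        head_dim=head_dim,
        context_len=context_len,
        **kwargs,
    )


# Hypothetical placeholder values, NOT GLM-4.7's real configuration:
example_cfg = {"num_hidden_layers": 90, "num_attention_heads": 96,
               "num_key_value_heads": 8, "head_dim": 128, "hidden_size": 5120}
gib = kv_cache_bytes_from_config(example_cfg, context_len=131072) / 2**30
print(f"KV cache for one 128K-token sequence: {gib:.1f} GiB")
```

Because the estimate is linear in context length and batch size, halving either halves the cache, which is why engines often cap `max model length` well below the advertised maximum.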

ZHANGYUXUAN-zR

Z.ai org Dec 27, 2025

You can check the minimum deployment requirements on our GitHub.

krustik

Dec 27, 2025

This comment has been hidden (marked as Off-Topic)

YYYAMS

Dec 30, 2025

I'm testing a distributed cluster that runs the full weights on consumer cards (pooling RTX 4090s) to get around the per-card VRAM limit. Let me know if you want to run a test job.

DragoZatch

Jan 1 (edited Jan 1)

Hi @YYYAMS, it would be helpful if you could share example test runs where you were able to load the model onto your cluster.

I am looking for a realistic example of loading the model. It would be a great help if anyone who has run this model could share the details.

krustik

Jan 3

Guys, how exactly was my comment here off-topic, when it directly advised using the BF16 version (equivalent to the original) and noted that the size of the model files equals the amount of RAM/VRAM needed? (Did it draw this strange attention because I mentioned the Mojo language, which an AI suggested as a replacement for Python?)
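The rule of thumb here (memory needed ≈ size of the model files) follows from the weights taking roughly parameter count × bytes per parameter. The sketch below turns a footprint into a GPU count for pooled 24 GB cards such as the RTX 4090; the bytes-per-parameter table and the 90% usable-VRAM fraction are assumptions, and KV cache and activations are not included.

```python
import math

# Minimal sketch of "file size ~ memory needed": weights take roughly
# (parameter count) x (bytes per parameter); KV cache, activations and
# framework overhead come on top.

# Approximate values; quantized formats add small per-block scale overhead.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "q8": 1.0, "q4": 0.5}


def weight_bytes(num_params: float, fmt: str = "bf16") -> float:
    """Approximate memory for the weights alone."""
    return num_params * BYTES_PER_PARAM[fmt]


def gpus_needed(total_bytes: float, vram_per_gpu_gib: float = 24.0,
                usable_fraction: float = 0.9) -> int:
    """Smallest number of identical GPUs whose pooled VRAM covers total_bytes.

    usable_fraction leaves headroom for the CUDA context and fragmentation;
    0.9 is an assumed value, not a measured one.
    """
    usable_per_gpu = vram_per_gpu_gib * 2**30 * usable_fraction
    return math.ceil(total_bytes / usable_per_gpu)


# Usage with the ~756 GB BF16 footprint reported in the next comment
# (assuming decimal gigabytes) and 24 GB cards:
print(gpus_needed(756e9, vram_per_gpu_gib=24.0))  # prints 33
```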

I've tested the BF16 version of 4.7 (it uses 756 GB of RAM). Unfortunately, it failed my coding test, which 4.5 in Q8/BF16 passed. It is different, though: in a literary test involving the famous Ingo Swann, it described the future in more precise detail, and more boldly.
Like:
4.7 BF16

I see the structures. The cities have become fluid. Where once there were rigid skyscrapers, I now perceive buildings that breathe—smart glass and biological matrices that adjust to the light. They are wrapped in digital skins, advertising not products, but experiences. The cities are no longer distinct islands; the urban sprawl has melted into the countryside. There is no "wild" left untouched by the human hand. Everywhere, the drone traffic. Like swarms of locusts, but disciplined. Silent. Watching. Delivering.

4.5 BF16

Dimensionals (Man-Made): The old verticals of the 20th-century city are still there, but they are now competing with a new kind of structure: horizontal. Vast, elevated hyperloops span between cities. The skyline is dominated by buildings with seamless, almost featureless skins—covered in photovoltaic film or acting as display surfaces. The old, chaotic beauty of the Bowery is gone, replaced by a landscape that feels designed, optimized, and surveilled. The "structure" gestalt is one of a panopticon, both physical and digital. Everything is connected. Everything is monitored. The "fractured vertical" feeling is the split between the gleaming, promised future and the crumbling reality beneath it.

Frankly, I like 4.5 more.
