Instructions for using thoughtworks/GLM-4.7-Flash-Eagle3 with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use thoughtworks/GLM-4.7-Flash-Eagle3 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="thoughtworks/GLM-4.7-Flash-Eagle3")

# Load model directly
from transformers import AutoTokenizer, LlamaForCausalLMEagle3

tokenizer = AutoTokenizer.from_pretrained("thoughtworks/GLM-4.7-Flash-Eagle3")
model = LlamaForCausalLMEagle3.from_pretrained("thoughtworks/GLM-4.7-Flash-Eagle3")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use thoughtworks/GLM-4.7-Flash-Eagle3 with vLLM:
Install from pip and serve the model:

```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "thoughtworks/GLM-4.7-Flash-Eagle3"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "thoughtworks/GLM-4.7-Flash-Eagle3",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- SGLang
How to use thoughtworks/GLM-4.7-Flash-Eagle3 with SGLang:
Install from pip and serve the model:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "thoughtworks/GLM-4.7-Flash-Eagle3" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "thoughtworks/GLM-4.7-Flash-Eagle3",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images:

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "thoughtworks/GLM-4.7-Flash-Eagle3" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "thoughtworks/GLM-4.7-Flash-Eagle3",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

- Docker Model Runner
How to use thoughtworks/GLM-4.7-Flash-Eagle3 with Docker Model Runner:
```shell
docker model run hf.co/thoughtworks/GLM-4.7-Flash-Eagle3
```
# EAGLE3 Draft Model for GLM-4.7-Flash
An EAGLE3 draft model that accelerates inference for zai-org/GLM-4.7-Flash (30B MoE, ~3B active) through speculative decoding.
1.66x mean speedup at B=1 across 4 benchmarks on a single H200.
## Results

Verified 2026-04-12 on 1x NVIDIA H200 144GB, TP=1, FlashInfer, temperature=0, max_tokens=512.
### B=1 (single request)
| Dataset | Baseline tok/s | Eagle3 tok/s | Speedup | Accept Rate | Accept Length |
|---|---|---|---|---|---|
| HumanEval (75) | 130.2 | 231.8 | 1.78x | 57.1% | 3.42 |
| Terminal-Bench (112) | 128.0 | 220.2 | 1.72x | 62.9% | 3.77 |
| MT-Bench (154) | 129.2 | 207.1 | 1.60x | 47.9% | 2.88 |
| SWEBench-Verified (75) | 127.4 | 194.4 | 1.53x | 51.7% | 3.10 |
| Mean | 128.7 | 213.4 | 1.66x | 54.9% | 3.29 |
### B=32 (32 concurrent requests)
| Dataset | Baseline tok/s | Eagle3 tok/s | Speedup |
|---|---|---|---|
| SWEBench-Verified | 1,415.3 | 1,830.4 | 1.29x |
| HumanEval | 1,595.8 | 1,851.5 | 1.16x |
| MT-Bench | 1,489.9 | 1,627.9 | 1.09x |
| Terminal-Bench | 1,479.4 | 1,614.0 | 1.09x |
| Mean | 1,495.1 | 1,731.0 | 1.16x |
Protocol: B=1 runs 5 warmup + 20 measured requests (sequential); B=32 runs 15 warmup + 60 measured requests (32 concurrent). All metrics come from server-side Prometheus counters.
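The Speedup and Mean rows follow directly from the throughput columns. A quick sanity check of the B=1 table (values copied from above; the mean row averages throughput per side first, then takes the ratio):

```python
# Per-dataset throughput (tok/s) from the B=1 table: (baseline, Eagle3)
b1 = {
    "HumanEval": (130.2, 231.8),
    "Terminal-Bench": (128.0, 220.2),
    "MT-Bench": (129.2, 207.1),
    "SWEBench-Verified": (127.4, 194.4),
}

# Per-dataset speedup is Eagle3 throughput over baseline throughput.
for name, (base, eagle) in b1.items():
    print(f"{name}: {eagle / base:.2f}x")

# The Mean row averages throughput first, then takes the ratio.
mean_base = sum(b for b, _ in b1.values()) / len(b1)   # 128.7
mean_eagle = sum(e for _, e in b1.values()) / len(b1)  # 213.4
print(f"Mean: {mean_base:.1f} -> {mean_eagle:.1f} tok/s, {mean_eagle / mean_base:.2f}x")
```

The same arithmetic reproduces the B=32 mean (1,731.0 / 1,495.1 ≈ 1.16x).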
## Architecture
| Parameter | Value |
|---|---|
| Type | LlamaForCausalLMEagle3 |
| Hidden Size | 2048 |
| Heads / KV Heads | 16 / 4 (GQA) |
| Head Dimension | 128 |
| Intermediate Size | 8192 |
| Layers | 1 |
| Vocab Size | 154,880 (draft: 32,000) |
| Size | 278 MB |
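The 278 MB checkpoint size is consistent with a back-of-envelope parameter count from the table above. The breakdown below is an estimate, not the actual state dict: it assumes bf16 weights (2 bytes/param), the draft vocabulary of 32,000 for the LM head, a 3x-hidden fusion projection as in the EAGLE-3 paper, and no stored embedding table (EAGLE3 drafts typically reuse the target model's embeddings).

```python
# Values from the Architecture table
hidden, inter, heads, kv_heads, head_dim, draft_vocab = 2048, 8192, 16, 4, 128, 32_000

params = {
    # Attention: Q/O map hidden <-> heads*head_dim; K/V map hidden -> kv_heads*head_dim (GQA)
    "attn": 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim,
    # SwiGLU MLP: gate and up (hidden -> inter) plus down (inter -> hidden)
    "mlp": 3 * hidden * inter,
    # EAGLE3 fuses three target hidden states into one draft hidden state (assumed layout)
    "fusion": 3 * hidden * hidden,
    # LM head over the reduced draft vocabulary
    "lm_head": hidden * draft_vocab,
}

total = sum(params.values())   # ~138.9M parameters for the single layer
size_mb = total * 2 / 1e6      # bf16 -> ~278 MB
print(f"{total / 1e6:.1f}M params, ~{size_mb:.0f} MB")
```

Norm weights and biases are omitted; they contribute well under 1 MB.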
## Training

Trained on 54K samples (45% ShareGPT, 35% UltraChat, 20% PerfectBlend) for 3 epochs with LR=1e-4, max_length=1024, batch_size=1, using SpecForge with `--target-model-backend sglang`.
Best training accuracy (acc_0): 79.2%. Note that training accuracy does not predict the inference accept rate; the gap between the two is 30-60pp.
## Usage

Benchmarked with our SGLang fork (tails-mpt/sglang, commit 63291f7f51); upstream SGLang may produce different speedups due to differences in scheduling overhead.
```shell
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-Flash-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 1 --trust-remote-code --port 30000 \
  --enable-metrics --mem-fraction-static 0.65
```
Pinned dependencies: `sgl-kernel 0.3.18.post2`, `flashinfer 0.6.6`, `torch 2.9.1+cu126`.
Verify that the accept rate is above 0% after startup to confirm the draft model loaded correctly.
## Citation

```bibtex
@article{li2025eagle3,
  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and others},
  journal={arXiv preprint arXiv:2503.01840},
  year={2025}
}
```