Instructions for using thoughtworks/GLM-4.7-Flash-Eagle3 with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use thoughtworks/GLM-4.7-Flash-Eagle3 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="thoughtworks/GLM-4.7-Flash-Eagle3")

# Load model directly
from transformers import AutoTokenizer, LlamaForCausalLMEagle3

tokenizer = AutoTokenizer.from_pretrained("thoughtworks/GLM-4.7-Flash-Eagle3")
model = LlamaForCausalLMEagle3.from_pretrained("thoughtworks/GLM-4.7-Flash-Eagle3")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use thoughtworks/GLM-4.7-Flash-Eagle3 with vLLM:
Install from pip and serve the model:

```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "thoughtworks/GLM-4.7-Flash-Eagle3"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "thoughtworks/GLM-4.7-Flash-Eagle3",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- SGLang
How to use thoughtworks/GLM-4.7-Flash-Eagle3 with SGLang:
Install from pip and serve the model:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "thoughtworks/GLM-4.7-Flash-Eagle3" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "thoughtworks/GLM-4.7-Flash-Eagle3",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images:

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "thoughtworks/GLM-4.7-Flash-Eagle3" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "thoughtworks/GLM-4.7-Flash-Eagle3",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

- Docker Model Runner
How to use thoughtworks/GLM-4.7-Flash-Eagle3 with Docker Model Runner:
```shell
docker model run hf.co/thoughtworks/GLM-4.7-Flash-Eagle3
```
# EAGLE3 Draft Model for GLM-4.7-Flash
An EAGLE3 draft model that accelerates inference for zai-org/GLM-4.7-Flash (30B MoE, ~3B active) through speculative decoding.
1.66x mean speedup at B=1 across 4 benchmarks on a single H200.
## Results

Verified 2026-04-12 on 1x NVIDIA H200 144GB, TP=1, FlashInfer, temperature=0, max_tokens=512.
### B=1 (single request)
| Dataset | Baseline tok/s | Eagle3 tok/s | Speedup | Accept Rate | Accept Length |
|---|---|---|---|---|---|
| HumanEval (75) | 130.2 | 231.8 | 1.78x | 57.1% | 3.42 |
| Terminal-Bench (112) | 128.0 | 220.2 | 1.72x | 62.9% | 3.77 |
| MT-Bench (154) | 129.2 | 207.1 | 1.60x | 47.9% | 2.88 |
| SWEBench-Verified (75) | 127.4 | 194.4 | 1.53x | 51.7% | 3.10 |
| Mean | 128.7 | 213.4 | 1.66x | 54.9% | 3.29 |
### B=32 (32 concurrent requests)
| Dataset | Baseline tok/s | Eagle3 tok/s | Speedup |
|---|---|---|---|
| SWEBench-Verified | 1,415.3 | 1,830.4 | 1.29x |
| HumanEval | 1,595.8 | 1,851.5 | 1.16x |
| MT-Bench | 1,489.9 | 1,627.9 | 1.09x |
| Terminal-Bench | 1,479.4 | 1,614.0 | 1.09x |
| Mean | 1,495.1 | 1,731.0 | 1.16x |
Protocol: B=1 runs 5 warmup + 20 measured requests (sequential); B=32 runs 15 warmup + 60 measured requests (32 concurrent). All metrics come from server-side Prometheus counters.
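The Speedup and Mean rows follow directly from the throughput columns. A quick sanity check of the B=1 table (values copied from above; the mean row averages throughput per side first, then takes the ratio):

```python
# Per-dataset throughput (tok/s) from the B=1 table: (baseline, Eagle3)
b1 = {
    "HumanEval": (130.2, 231.8),
    "Terminal-Bench": (128.0, 220.2),
    "MT-Bench": (129.2, 207.1),
    "SWEBench-Verified": (127.4, 194.4),
}

# Per-dataset speedup is Eagle3 throughput over baseline throughput.
for name, (base, eagle) in b1.items():
    print(f"{name}: {eagle / base:.2f}x")

# The Mean row averages throughput first, then takes the ratio.
mean_base = sum(b for b, _ in b1.values()) / len(b1)   # 128.7
mean_eagle = sum(e for _, e in b1.values()) / len(b1)  # 213.4
print(f"Mean: {mean_base:.1f} -> {mean_eagle:.1f} tok/s, {mean_eagle / mean_base:.2f}x")
```

The same arithmetic reproduces the B=32 mean (1,731.0 / 1,495.1 ≈ 1.16x).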
## Architecture
| Parameter | Value |
|---|---|
| Type | LlamaForCausalLMEagle3 |
| Hidden Size | 2048 |
| Heads / KV Heads | 16 / 4 (GQA) |
| Head Dimension | 128 |
| Intermediate Size | 8192 |
| Layers | 1 |
| Vocab Size | 154,880 (draft: 32,000) |
| Size | 278 MB |
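The 278 MB checkpoint size is consistent with a back-of-envelope parameter count from the table above. The breakdown below is an estimate, not the actual state dict: it assumes bf16 weights (2 bytes/param), the draft vocabulary of 32,000 for the LM head, a 3x-hidden fusion projection as in the EAGLE-3 paper, and no stored embedding table (EAGLE3 drafts typically reuse the target model's embeddings).

```python
# Values from the Architecture table
hidden, inter, heads, kv_heads, head_dim, draft_vocab = 2048, 8192, 16, 4, 128, 32_000

params = {
    # Attention: Q/O map hidden <-> heads*head_dim; K/V map hidden -> kv_heads*head_dim (GQA)
    "attn": 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim,
    # SwiGLU MLP: gate and up (hidden -> inter) plus down (inter -> hidden)
    "mlp": 3 * hidden * inter,
    # EAGLE3 fuses three target hidden states into one draft hidden state (assumed layout)
    "fusion": 3 * hidden * hidden,
    # LM head over the reduced draft vocabulary
    "lm_head": hidden * draft_vocab,
}

total = sum(params.values())   # ~138.9M parameters for the single layer
size_mb = total * 2 / 1e6      # bf16 -> ~278 MB
print(f"{total / 1e6:.1f}M params, ~{size_mb:.0f} MB")
```

Norm weights and biases are omitted; they contribute well under 1 MB.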
## Training

Trained on 54K samples (45% ShareGPT, 35% UltraChat, 20% PerfectBlend) for 3 epochs with LR=1e-4, max_length=1024, batch_size=1, using SpecForge with `--target-model-backend sglang`.
Best training accuracy (acc_0): 79.2%. Note that training accuracy does not predict the inference accept rate; the gap between the two is 30-60pp.
## Usage

Benchmarked with our SGLang fork (tails-mpt/sglang, commit 63291f7f51); upstream SGLang may produce different speedups due to differences in scheduling overhead.
```shell
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-Flash-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 1 --trust-remote-code --port 30000 \
  --enable-metrics --mem-fraction-static 0.65
```
Pinned dependencies: `sgl-kernel 0.3.18.post2`, `flashinfer 0.6.6`, `torch 2.9.1+cu126`.
Verify that the accept rate is above 0% after startup to confirm the draft model loaded correctly.
## Citation

```bibtex
@article{li2025eagle3,
  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and others},
  journal={arXiv preprint arXiv:2503.01840},
  year={2025}
}
```