SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct

This is an EAGLE3 draft model for speculative decoding with Qwen/Qwen3-Coder-30B-A3B-Instruct.

Model Description

EAGLE3 is the third version of EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative decoding technique in which a lightweight draft model proposes several future tokens and the target model then verifies them in parallel. This can speed up inference substantially (roughly 2-3x) without any loss in output quality, because every proposed token is checked by the target model before it is kept.
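
The draft-then-verify loop described above can be sketched with toy stand-ins for the two models. This is a minimal illustration under stated assumptions, not how SGLang or EAGLE3 is implemented: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names, acceptance is greedy (exact match), and drafting is a simple chain rather than a tree.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, num_draft: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1) The cheap draft model proposes num_draft tokens autoregressively.
    proposal: List[int] = []
    for _ in range(num_draft):
        proposal.append(draft_next(prefix + proposal))

    # 2) The target model checks each proposed position. A real engine scores
    #    all positions in one batched forward pass; this loop is sequential
    #    only to keep the sketch readable.
    accepted: List[int] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the target's token, drop the rest
            return accepted
        accepted.append(tok)

    # 3) Every draft token matched, so the target's pass yields a bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft_next: NextToken, target_next: NextToken,
             max_new: int, num_draft: int = 4) -> List[int]:
    """Generate max_new tokens; the result equals plain greedy target decoding."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, num_draft))
    return out[: len(prefix) + max_new]


def toy_target(seq: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except after a 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
```

Because the target model has the final say at every position, the output is the same as decoding with the target alone; the draft model only changes how many target forward passes are needed.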

Key Features

Target Model: Qwen3-Coder-30B-A3B-Instruct (30B parameters, 3B active)
Draft Model Size: ~350MB (single transformer layer)
Training Data: OpenPromptContainer (OPC) regenerated dataset
Training Steps: 295,000 (Epoch 1)
Framework: Trained with SpecForge

Training Metrics

Metric	Value
First Token Accuracy (acc_0)	88.19%
Average Accuracy (7 positions)	85.19%
Training Epochs	1+ (295k steps)
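
As a rough way to read these numbers, the sketch below estimates how many draft tokens a linear chain would have accepted per verification step. It rests on assumptions that are not in this card: each draft token is accepted independently with the same probability `p`, verification stops at the first rejection, and training accuracy is used as a stand-in for the serving-time acceptance rate. Real acceptance depends on the prompts, the sampling settings, and tree drafting, so treat the result as a back-of-the-envelope figure only. The function name `expected_accepted` is hypothetical.

```python
def expected_accepted(p: float, n: int) -> float:
    """Expected number of accepted draft tokens in a chain of length n.

    Assumes independent per-token acceptance with probability p and that
    verification stops at the first rejected token, so position k is kept
    only if all k tokens up to it were accepted (probability p ** k).
    """
    return sum(p ** k for k in range(1, n + 1))


# Average accuracy over 7 positions reported in the table above,
# used here as an assumed acceptance probability.
estimate = expected_accepted(0.8519, 7)
```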

Usage

With SGLang

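import sglang as sgl

# The main guard is needed because the engine spawns worker subprocesses.
if __name__ == "__main__":
    # Launch the target model with EAGLE3 speculative decoding.
    # EAGLE3 draft models require speculative_algorithm="EAGLE3".
    llm = sgl.Engine(
        model_path="Qwen/Qwen3-Coder-30B-A3B-Instruct",
        speculative_algorithm="EAGLE3",
        speculative_draft_model_path="sgl-project/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct",
        speculative_num_steps=5,
        speculative_eagle_topk=8,
        speculative_num_draft_tokens=64,
    )

    # Generate text; the result is a dict whose "text" field holds the completion.
    output = llm.generate("Write a Python function to sort a list:")
    print(output["text"])

    # Release GPU memory and stop the worker processes.
    llm.shutdown()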

With SGLang Server

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path sgl-project/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --tp 8
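
Once the server is running, it can be queried through its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server listens on `localhost` at SGLang's default port 30000, since the command above sets neither `--host` nor `--port`. The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "Qwen/Qwen3-Coder-30B-A3B-Instruct",
                       base_url: str = "http://localhost:30000",
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt to the server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `print(chat("Write a Python function to sort a list."))` prints the reply. Speculative decoding happens inside the server, so the client code is the same with or without the draft model.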

Model Architecture

The EAGLE3 draft model is a lightweight transformer that:

Shares embeddings with the target model
Uses a single transformer layer (hidden_size=2048, intermediate_size=12288)
Predicts multiple future tokens autoregressively
Uses the target model's hidden states as input

{
  "architectures": ["LlamaForCausalLMEagle3"],
  "hidden_size": 2048,
  "intermediate_size": 12288,
  "num_attention_heads": 32,
  "num_key_value_heads": 4,
  "num_hidden_layers": 1,
  "vocab_size": 151936
}
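
A few quantities follow directly from this configuration. The sketch below derives the grouped-query attention ratio and a rough size for the MLP block. It assumes a Llama-style gated MLP with three bias-free projection matrices (gate, up, down) and BF16 storage at 2 bytes per parameter; neither assumption is stated in the config itself, and the function name `summarize` is hypothetical. It does not attempt a full parameter count, because the attention input width and the draft vocabulary size are not given here.

```python
import json

CONFIG_JSON = """
{
  "architectures": ["LlamaForCausalLMEagle3"],
  "hidden_size": 2048,
  "intermediate_size": 12288,
  "num_attention_heads": 32,
  "num_key_value_heads": 4,
  "num_hidden_layers": 1,
  "vocab_size": 151936
}
"""


def summarize(config: dict) -> dict:
    """Derive a few architecture facts from the config values."""
    hidden = config["hidden_size"]
    inter = config["intermediate_size"]
    # Assumption: gated MLP with gate, up, and down projections, no biases.
    mlp_params = 3 * hidden * inter * config["num_hidden_layers"]
    return {
        # Query heads that share each key/value head (grouped-query attention).
        "gqa_group_size": config["num_attention_heads"] // config["num_key_value_heads"],
        "mlp_params": mlp_params,
        # Assumption: BF16 weights, 2 bytes per parameter.
        "mlp_bytes_bf16": 2 * mlp_params,
    }


summary = summarize(json.loads(CONFIG_JSON))
```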

Training Details

Framework: SpecForge with SGLang backend
Hardware: 4x NVIDIA H200 GPUs (TP=4)
Batch Size: 1 per GPU
Learning Rate: 1e-4 with cosine annealing
Max Sequence Length: 4096
Attention Backend: FlexAttention
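
The learning-rate schedule listed above can be illustrated with a plain cosine annealing curve. This is a sketch under assumptions rather than the actual SpecForge scheduler: it assumes no warmup, a minimum learning rate of 0, and a total of 295,000 steps (the step count reported in this card). The function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int = 295_000,
              base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at a given training step."""
    # Clamp so steps outside [0, total_steps] stay on the curve's endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at 1e-4, passes through 5e-5 at the halfway point, and decays to 0 at the final step.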

Citation

If you use this model, please cite:

@article{li2024eagle,
  title={EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2401.15077},
  year={2024}
}

@misc{sglang2024,
  title={SGLang: Efficient Execution of Structured Language Model Programs},
  author={Zheng, Lianmin and others},
  year={2024},
  url={https://github.com/sgl-project/sglang}
}

License

This model is released under the Apache 2.0 License, following the base model's license.
