SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct
This is an EAGLE3 draft model for speculative decoding with Qwen/Qwen3-Coder-30B-A3B-Instruct.
Model Description
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a family of speculative decoding methods, and EAGLE3 is its third iteration. A lightweight draft model proposes several future tokens, which the target model then verifies in parallel. Because every emitted token is checked by the target model, output quality is unchanged, while decoding can run substantially faster (on the order of 2-3x, depending on workload, hardware, and drafting parameters).
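The draft-then-verify loop described above can be sketched with toy stand-ins for the two models. This is a minimal illustration of greedy speculative decoding in general, not SGLang's or EAGLE3's actual implementation: `target_next`, `draft_next`, and `num_draft_tokens` are hypothetical names, the "models" are simple arithmetic functions, and verification is written as a loop, whereas a real engine scores all drafted positions in one batched forward pass.

```python
from typing import Callable, List

NextTokenFn = Callable[[List[int]], int]


def greedy_decode(next_token: NextTokenFn, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive greedy decoding, used as the reference."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_greedy_decode(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[int],
    max_new_tokens: int,
    num_draft_tokens: int = 4,
) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes a short continuation autoregressively.
        proposal: List[int] = []
        for _ in range(num_draft_tokens):
            proposal.append(draft_next(tokens + proposal))

        # 2. The target checks each proposed token against its own choice
        #    and stops at the first disagreement.
        accepted = 0
        for guess in proposal:
            if guess != target_next(tokens + proposal[:accepted]):
                break
            accepted += 1
        tokens.extend(proposal[:accepted])

        # 3. The target always contributes one token: the correction after a
        #    mismatch, or a bonus token if every draft token was accepted.
        tokens.append(target_next(tokens))
    return tokens[:goal]


# Toy stand-ins (assumptions for illustration only).
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except when the last token is a multiple of 4.
    return 0 if ctx[-1] % 4 == 0 else toy_target(ctx)
```

With greedy verification, the speculative result matches what the target would have produced on its own; the draft only changes how many target steps are needed to get there.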
Key Features
- Target Model: Qwen3-Coder-30B-A3B-Instruct (30B parameters, 3B active)
- Draft Model Size: ~350MB (single transformer layer)
- Training Data: OpenPromptContainer (OPC) regenerated dataset
- Training Steps: 295,000 (Epoch 1)
- Framework: Trained with SpecForge
Training Metrics
| Metric | Value |
|---|---|
| First Token Accuracy (acc_0) | 88.19% |
| Average Accuracy (7 positions) | 85.19% |
| Training Epochs | 1+ (295k steps) |
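As a rough way to read these numbers, per-position accuracy can be turned into an expected count of accepted draft tokens per verification step. The sketch below makes strong assumptions that are not from this card: drafting is a single chain (not a tree), each position is accepted independently, and training accuracy equals serving-time acceptance rate. The card reports only the 7-position average, so the example reuses that average for every position. Treat the result as an illustrative upper-level estimate, not a measured speedup.

```python
from typing import Sequence


def expected_accepted_tokens(per_position_accuracy: Sequence[float]) -> float:
    """Expected number of accepted draft tokens in a chain of drafts.

    Position k is only accepted if positions 1..k were all accepted, so the
    expectation is the sum of the running products of the accuracies.
    """
    expected = 0.0
    survive = 1.0  # probability that every position so far was accepted
    for acc in per_position_accuracy:
        survive *= acc
        expected += survive
    return expected


# Assumption: apply the reported average accuracy (85.19%) to all 7 positions.
average_accuracy = 0.8519
estimate = expected_accepted_tokens([average_accuracy] * 7)
```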
Usage
With SGLang
import sglang as sgl

# The __main__ guard is needed because the engine spawns worker processes.
if __name__ == "__main__":
    # Launch with EAGLE3 speculative decoding
    llm = sgl.Engine(
        model_path="Qwen/Qwen3-Coder-30B-A3B-Instruct",
        speculative_algorithm="EAGLE3",  # EAGLE3 draft models use the EAGLE3 algorithm
        speculative_draft_model_path="sgl-project/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct",
        speculative_num_steps=5,
        speculative_eagle_topk=8,
        speculative_num_draft_tokens=64,
    )
    # Generate text
    output = llm.generate("Write a Python function to sort a list:")
    print(output["text"])
    llm.shutdown()
With SGLang Server
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path sgl-project/SGLang-EAGLE3-Qwen3-Coder-30B-A3B-Instruct \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --tp 8
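Once the server is running, it can be queried through SGLang's OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on SGLang's default port 30000 on localhost (the launch command above does not set `--port`); `build_chat_request` and `chat` are hypothetical helper names, and `max_tokens=256` is an arbitrary choice.

```python
import json
import urllib.request
from typing import Any, Dict


def build_chat_request(
    prompt: str,
    model: str = "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    max_tokens: int = 256,
) -> Dict[str, Any]:
    """Build an OpenAI-style chat completion payload for a single user turn."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = "http://localhost:30000") -> str:
    """POST the prompt to /v1/chat/completions and return the reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `print(chat("Write a Python function to sort a list."))` sends one request. Speculative decoding is transparent to the client: the request names the target model, and the draft model is only used inside the server.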
Model Architecture
The EAGLE3 draft model is a lightweight transformer that:
- Shares embeddings with the target model
- Uses a single transformer layer (hidden_size=2048, intermediate_size=12288)
- Predicts multiple future tokens autoregressively
- Uses the target model's hidden states as input
{
  "architectures": ["LlamaForCausalLMEagle3"],
  "hidden_size": 2048,
  "intermediate_size": 12288,
  "num_attention_heads": 32,
  "num_key_value_heads": 4,
  "num_hidden_layers": 1,
  "vocab_size": 151936
}
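To get a feel for where the draft model's weights go, the configuration values above can be plugged into the parameter count of a standard Llama-style decoder layer. This is a back-of-the-envelope sketch under stated assumptions: `head_dim = hidden_size / num_attention_heads`, bias-free projections, a gated MLP with three weight matrices, and no normalization weights. It does not model any EAGLE3-specific components (for example, extra input projections or the output head), so it will not reproduce the full checkpoint size.

```python
from typing import Dict


def decoder_layer_params(
    hidden_size: int,
    intermediate_size: int,
    num_attention_heads: int,
    num_key_value_heads: int,
) -> Dict[str, int]:
    """Weight count for one Llama-style layer with grouped-query attention."""
    head_dim = hidden_size // num_attention_heads  # assumption: not set separately
    kv_dim = num_key_value_heads * head_dim
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # Gated MLP: gate_proj, up_proj, down_proj.
    mlp = 3 * hidden_size * intermediate_size
    return {"attention": attention, "mlp": mlp, "total": attention + mlp}


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Size of a token embedding table."""
    return vocab_size * hidden_size


layer = decoder_layer_params(
    hidden_size=2048,
    intermediate_size=12288,
    num_attention_heads=32,
    num_key_value_heads=4,
)
embedding = embedding_params(vocab_size=151936, hidden_size=2048)
```

Under these assumptions the MLP dominates the layer, and the embedding table alone would be several times larger than the whole layer, which is why sharing embeddings with the target model keeps the draft model small.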
Training Details
- Framework: SpecForge with SGLang backend
- Hardware: 4x NVIDIA H200 GPUs (TP=4)
- Batch Size: 1 per GPU
- Learning Rate: 1e-4 with cosine annealing
- Max Sequence Length: 4096
- Attention Backend: FlexAttention
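The learning-rate setting above can be illustrated with the standard cosine annealing formula. This sketch assumes no warmup, a minimum learning rate of 0, and a schedule spanning the full 295,000 steps reported earlier; none of those details are stated in this card, and SpecForge's actual scheduler may differ. The function name `cosine_annealed_lr` is hypothetical.

```python
import math


def cosine_annealed_lr(
    step: int,
    total_steps: int = 295_000,
    base_lr: float = 1e-4,
    min_lr: float = 0.0,
) -> float:
    """Cosine annealing from base_lr at step 0 down to min_lr at total_steps."""
    # Clamp so steps outside [0, total_steps] hold the boundary value.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and decays smoothly to `min_lr` at the final step.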
Citation
If you use this model, please cite:
@article{li2024eagle,
title={EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty},
author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
journal={arXiv preprint arXiv:2401.15077},
year={2024}
}
@misc{sglang2024,
title={SGLang: Efficient Execution of Structured Language Model Programs},
author={Zheng, Lianmin and others},
year={2024},
url={https://github.com/sgl-project/sglang}
}
License
This model is released under the Apache 2.0 License, following the base model's license.