joerowell committed
Commit e5916e1 · 1 Parent(s): 7ffaee5

Add DFlash speculative decoding section

Files changed (1)
  1. README.md +18 -0
README.md CHANGED
@@ -147,6 +147,24 @@ VLLM_USE_DEEP_GEMM=0 vllm serve \
 
 See the [vLLM recipes page](https://recipes.vllm.ai/poolside/Laguna-XS.2) for additional deployment guidance.
 
+#### Speculative decoding (DFlash)
+
+For lower latency, serve Laguna XS.2 with the [Laguna-XS.2 DFlash speculator](https://huggingface.co/poolside/Laguna-XS.2-speculator.dflash), a 5-layer Llama-style draft model that proposes up to 7 tokens per step at ~70% per-position acceptance on coding tasks.
+
+> [!NOTE]
+> DFlash support landed in vLLM via [vllm-project/vllm#41880](https://github.com/vllm-project/vllm/pull/41880) and is available in the nightly wheels above. `VLLM_USE_DEEP_GEMM=0` is required: DeepGEMM is currently incompatible with the DFlash draft path.
+
+```shell
+VLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS.2 \
+  --trust-remote-code \
+  --enable-auto-tool-choice \
+  --tool-call-parser poolside_v1 \
+  --reasoning-parser poolside_v1 \
+  --speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
+```
+
+See the [DFlash section of the vLLM recipes page](https://recipes.vllm.ai/poolside/Laguna-XS.2) for the full recipe.
+
 #### Transformers
 
 Laguna XS.2 is supported in Transformers `v5.7.0` and later ([huggingface/transformers#45673](https://github.com/huggingface/transformers/pull/45673)).
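
The `--speculative-config` value in the added section is a JSON object passed as a single shell argument, so a misplaced quote or comma is easy to introduce when editing it by hand. The following is a minimal sketch of one way to assemble that argument from shell variables. The variable names `SPEC_MODEL`, `NUM_SPEC_TOKENS`, and `SPEC_CONFIG` are illustrative assumptions and do not appear in the README; the JSON keys and values are the ones from the commit.

```shell
# Hypothetical variable names; the JSON keys and values come from the README change.
SPEC_MODEL="poolside/Laguna-XS.2-speculator.dflash"
NUM_SPEC_TOKENS=7

# printf fills a fixed template, so the JSON quoting lives in one place.
SPEC_CONFIG=$(printf '{"model":"%s","num_speculative_tokens":%d,"method":"dflash"}' \
  "$SPEC_MODEL" "$NUM_SPEC_TOKENS")

# Print the assembled argument for inspection before launching the server.
echo "$SPEC_CONFIG"
```

The result could then be passed as `--speculative-config "$SPEC_CONFIG"` in place of the inline literal in the `vllm serve` command, which keeps the model name and token count editable without touching the JSON quoting.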