Deep Dive Notes · May 2026

Understanding DFlash

Complete technical notes — EAGLE-3 architecture, how diffusion accesses a model's internals, DFlash code walkthrough, GPU usage, and copy-paste commands to run it yourself.

Why is LLM inference slow?

Language models generate text by producing one token at a time, where every new token depends on all previous tokens. This is called autoregressive generation. A token is roughly one word or word fragment — "Paris" = 1 token, "unbelievable" = 3 tokens.

The core problem: each token needs a full forward pass through the entire model. For a 27B parameter model, that means on the order of 27 billion multiply-accumulate operations (roughly one per weight) for every single token. And you can't skip ahead: token 7 depends on tokens 1–6 already existing.

🧠 What is a forward pass?

A forward pass is when data travels through the entire neural network from input to output. Each transformer layer does matrix multiplications on all the token representations. For a 32-layer model, that's 32 rounds of heavy math, outputting one token at the end. You then feed that token back in and do it again for the next token.

The Sequential Bottleneck
Your prompt (processed in ONE fast parallel pass, like reading the whole sentence at once):
The · capital · of · France · is
Response (one full model forward pass per token, sequential, can't parallelize):
pass 1 → Paris
pass 2 → .
pass 3 → It
pass 4 → is
pass 5 → known
→ ...

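GPUs are built for doing millions of operations in parallel. But here, each pass waits for the previous one. With a single user, every pass streams all of the model's weights from memory just to produce one token, so most of the GPU's compute units sit idle. You're using maybe a few percent of your GPU's peak capability.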
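
The loop below is a minimal sketch of that sequential dependency in plain Python. `toy_forward_pass` and its scripted answers are hypothetical stand-ins for a real model; only the control flow matters, where each pass needs the output of the previous one.

```python
PROMPT = ["The", "capital", "of", "France", "is"]
SCRIPT = PROMPT + ["Paris", ".", "It", "is", "known"]


def toy_forward_pass(tokens):
    """Hypothetical stand-in for one full model pass: reads every token so far, returns the next."""
    return SCRIPT[len(tokens)]


def generate(prompt, max_new_tokens):
    tokens = list(prompt)
    passes = 0
    for _ in range(max_new_tokens):
        next_token = toy_forward_pass(tokens)  # needs ALL earlier tokens to exist first
        passes += 1
        tokens.append(next_token)              # fed back in for the next pass
    return tokens[len(prompt):], passes


new_tokens, passes = generate(PROMPT, 5)
print(new_tokens, passes)  # ['Paris', '.', 'It', 'is', 'known'] 5
```

Five new tokens cost five passes, and no pass can start before the one before it has finished.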


How Draft Models Work

The insight: use a cheap fast model to guess ahead, then let the big model verify all guesses simultaneously in one parallel pass.

Draft-Verify Cycle Example
Draft model guesses 8 tokens:
Paris · is · a · beautiful · historic · city · in · Europe
Big model verifies all 8 in ONE pass, accepting or rejecting each:
Paris ✓ · is ✓ · a ✓ · beautiful ✓ · historic ✓ · city ✗ · in (discarded) · Europe (discarded)
Result: 5 accepted + big model's correction "capital":
Paris · is · a · beautiful · historic · capital

6 new tokens for roughly the cost of 1 big model pass. Net speedup.
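
The accept/reject step can be sketched in a few lines of plain Python. This is a greedy-matching toy, not any particular library's implementation: `verify_draft` and the `target_choice` list are hypothetical names, and the target's choices are hard-coded rather than read from a real verification pass.

```python
def verify_draft(draft, target_choice):
    """Keep the longest prefix of `draft` that matches the target, plus one correction.

    target_choice[i] is the token the big model itself picks at draft position i.
    In a real system all of these come out of a single verification pass.
    """
    accepted = []
    for drafted, wanted in zip(draft, target_choice):
        if drafted == wanted:
            accepted.append(drafted)
        else:
            accepted.append(wanted)  # the big model's correction
            break                    # later drafts were conditioned on a wrong token
    return accepted


draft = ["Paris", "is", "a", "beautiful", "historic", "city", "in", "Europe"]
# Hypothetical target choices; entries after the first mismatch are never used.
target_choice = ["Paris", "is", "a", "beautiful", "historic", "capital", "of", "France"]

result = verify_draft(draft, target_choice)
print(result)  # ['Paris', 'is', 'a', 'beautiful', 'historic', 'capital']
```

Everything after the first mismatch is thrown away, which is why draft quality (how long the matching prefix is) decides the speedup.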


EAGLE-3: Architecture Deep Dive

EAGLE-3 is the previous state-of-the-art for speculative decoding. Instead of a completely separate small model, it attaches a tiny "draft head" directly to the big model and feeds it the big model's internal representations. Let's understand every component.

What is a Hidden State?

This is the most important concept in this whole document. You must understand it to understand EAGLE-3 and DFlash.

A transformer model has many layers stacked on top of each other. Think of them as processing stages. After each layer, every token position has a hidden state: a large vector of numbers. For an 8B model such as Llama-3.1-8B this is 4096 numbers. Larger models use wider vectors, for example 5120 in Qwen3-32B.

🧠 Hidden States — the model's "thoughts" at each layer

A hidden state is NOT just the token identity. It's a rich numerical encoding of meaning in context. The hidden state of the word "bank" in "I deposited money at the bank" looks completely different from "bank" in "I sat by the river bank" — even though it's the same word. The model has already resolved the ambiguity. Critically: later hidden states implicitly encode information about what token logically comes next. This is what draft models exploit.

Hidden States Inside a Transformer: What They Are
Input: Token IDs [seq_len]
↓ embedding
Layer 0 (embedding): h₀, shape [seq_len, 4096]
Layer 8 (early): h₈, syntax ("is this a noun?")
Layer 16 (middle): h₁₆, semantics ("what does it mean?")
Layer 32 (final): h₃₂, prediction ("what comes next?")
↓ LM Head
Output: Token logits [vocab_size ≈ 100k]

Each hidden state hₙ is a vector of floats, and there is one per layer for every token position. A sequence of 512 tokens in a 32-layer 8B model generates 32 × 512 per-layer hidden state vectors (33 × 512 if you also count the embedding output h₀), each 4096 floats wide.
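
To get a feel for the sizes involved, here is a back-of-the-envelope calculation. It counts the embedding output as well as the 32 layer outputs, and it assumes 2 bytes per value (BF16 or FP16 storage), which is an assumption rather than something the numbers above fix.

```python
num_layers = 32
seq_len = 512
hidden_size = 4096
bytes_per_value = 2  # assumption: BF16/FP16 storage

num_states = num_layers + 1             # embedding output + one per layer
num_vectors = num_states * seq_len      # one vector per (layer, position) pair
total_bytes = num_vectors * hidden_size * bytes_per_value

print(num_vectors)                 # 16896
print(total_bytes / 2**20, "MiB")  # 132.0 MiB
```

Keeping every layer's hidden states for a whole sequence is therefore affordable but not free, which is one reason drafters only read a handful of layers.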

Early layers learn local patterns — grammar, part-of-speech. Middle layers learn semantics — what words mean in context. Late layers are "prediction-ready" — they encode what the next token should be.

EAGLE-3 extracts 3 specific hidden states (early, mid, late) and fuses them. DFlash extracts hidden states from many layers uniformly and injects them even more deeply into the drafter.

EAGLE-3 Three Components

EAGLE-3 is built from three chained pieces. Here is the full architecture:

EAGLE-3 Complete Architecture
Target Model (e.g. Llama-3.1-8B): 32 transformer layers, full forward pass
↓ extract 3 layers
h_low + h_mid + h_high: 3 hidden states, 3 × 4096 = 12,288 dims
↓
Component 1: Encoder (FC layer). Feature fusion, 12,288 → 4096. Concatenates the 3 hidden states and compresses them into one 4096-dim fused vector.
↓
Component 2: Decoder. 1 transformer layer, autoregressive loop.
↓ K times (1 per draft token)
Component 3: Output. Draft token tree, 5–7 tokens, branching.

Component 1: The Feature Fusion (FC Layer)

EAGLE-1 only used the final layer's hidden state. EAGLE-3 improves this with tri-layer fusion. It samples hidden states from three points in the target model:

📐 Concrete example — Llama-3.1-8B (hidden dim = 4096, 32 layers)

h_low (layer 8): encodes syntax and local token context → 4096 floats
h_mid (layer 16): encodes semantic meaning in context → 4096 floats
h_high (layer 32, final): encodes next-token prediction signals → 4096 floats

Concatenate: [h_low || h_mid || h_high] = 12,288 floats
FC layer compresses: 12,288 → 4096 floats
The FC layer is learned — it learns which signals from which layers matter most for predicting future tokens.

Why three layers? Because different things matter. Early layers know grammar. Middle layers know meaning. Late layers know what's likely next. Together they give the drafter a holistic picture of what the big model is "thinking."
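
The concatenate-then-compress step can be sketched with plain Python lists. The dimensions are shrunk from 4096 to 4 so the example runs instantly, and the weights are random rather than learned; `concat` and `linear` are illustrative helpers, not EAGLE-3 code.

```python
import random


def concat(*vectors):
    """[h_low || h_mid || h_high]: join the vectors end to end."""
    return [x for v in vectors for x in v]


def linear(weights, vector):
    """Bias-free fully connected layer; `weights` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


random.seed(0)
hidden = 4  # toy stand-in for 4096
h_low, h_mid, h_high = ([random.random() for _ in range(hidden)] for _ in range(3))
# Random here; in the real drafter these weights are learned.
fc_weights = [[random.uniform(-1, 1) for _ in range(3 * hidden)] for _ in range(hidden)]

stacked = concat(h_low, h_mid, h_high)
fused = linear(fc_weights, stacked)
print(len(stacked), "->", len(fused))  # 12 -> 4

# At real scale the FC layer holds (3 * 4096) * 4096 weights:
fc_weight_count = 3 * 4096 * 4096
print(fc_weight_count)  # 50331648
```

At about 50 million weights, the fusion layer is tiny next to the target model, so it adds almost nothing to each step.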

Component 2: The Draft Decoder (1 transformer layer)

The decoder is a single standard transformer decoder layer. It's tiny, about 277 MB. It receives two things: the fused 4096-dim feature vector from Component 1, and the embedding of the most recent token.

It autoregressively produces one draft token, then feeds that back in to produce the next. For 7 draft tokens, this loop runs 7 times.

The Draft Tree (not just a sequence)

EAGLE-3 doesn't output just a single chain of tokens. At each step it considers the top-N most probable tokens and creates branching paths. The target model verifies the entire tree in one pass and accepts the longest matching path.

EAGLE-3 Draft Tree (simplified) — Multiple Paths Explored
[After "Paris is a..."]
┌────────────────┴────────────────┐
[beautiful...]              [historic...]
┌──────┴──────┐         ┌──────┴──────┐
[city]  [capital]     [city]  [capital]

The big model verifies all 4 leaf paths simultaneously. If it agrees with "historic capital" the model accepts 2 tokens. This tree structure gives EAGLE-3 better acceptance rates than a simple linear draft.
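
A toy version of "accept the longest matching path" is sketched below. The nested-dict tree and the `target_choices` lookup are hypothetical: a real system reads the target's choices for every tree node out of one verification pass with a tree-shaped attention mask, while this sketch simply looks them up.

```python
def longest_accepted_path(tree, target_next):
    """Walk the draft tree, following the branch the target model agrees with.

    tree: nested dict {token: subtree}.
    target_next: callable returning the target's token after a given path (or None).
    """
    path = []
    node = tree
    while node:
        wanted = target_next(tuple(path))
        if wanted not in node:
            break  # the target wants a token that was never drafted here
        path.append(wanted)
        node = node[wanted]
    return path


draft_tree = {
    "beautiful": {"city": {}, "capital": {}},
    "historic": {"city": {}, "capital": {}},
}
# Hypothetical target decisions: "historic" first, then "capital".
target_choices = {(): "historic", ("historic",): "capital"}

accepted = longest_accepted_path(draft_tree, target_choices.get)
print(accepted)  # ['historic', 'capital']
```

With a single linear draft of "beautiful city", nothing would have matched. The extra branches are what rescue these two tokens.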

EAGLE-3's Hard Ceiling

Despite being clever, EAGLE-3 has a fundamental limitation baked into its design: the decoder generates one token at a time. Seven draft tokens = seven sequential decoder runs. The latency stacks up.

To compensate, EAGLE-3 keeps its decoder at exactly one transformer layer. Adding a second layer would double the per-step compute, making the sequential cost too high. So EAGLE-3 is permanently stuck with a very shallow, limited draft model.

⚠️ EAGLE-3's Architectural Bottleneck

Drafting is still sequential (N steps for N tokens) → can't parallelize
Decoder forced to 1 layer → low model capacity → lower acceptance rate
Maximum practical speedup: ~3–4× (a ceiling that is hard to push beyond)
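
A rough cost model makes the trade-off concrete. It rests on a loud simplifying assumption: one pass through one draft layer costs about the same whether it processes 1 token or a small block of tokens, which is most plausible when the pass is memory-bound. `draft_cost` is a hypothetical helper, and the units are "passes through one draft layer", not measured latencies.

```python
def draft_cost(num_tokens, num_layers, sequential):
    """Drafting latency in units of 'one pass through one draft layer'.

    Assumption: a layer pass costs the same for 1 token as for a small block.
    """
    passes = num_tokens if sequential else 1
    return passes * num_layers


one_layer_sequential = draft_cost(num_tokens=7, num_layers=1, sequential=True)
two_layer_sequential = draft_cost(num_tokens=7, num_layers=2, sequential=True)
five_layer_parallel = draft_cost(num_tokens=16, num_layers=5, sequential=False)

print(one_layer_sequential, two_layer_sequential, five_layer_parallel)  # 7 14 5
```

Under this assumption, a second sequential layer doubles the drafting cost, while a drafter that emits every token in one pass could afford five layers and still cost less.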

python · eagle-3 draft loop (showing the bottleneck)
# EAGLE-3: generates K draft tokens in K sequential steps
# Each step depends on the previous one, so the loop cannot be parallelized
# (eagle_decoder_layer, embed, and sample are placeholders, not a real API)
draft_tokens = []
features = fused_features   # 4096-dim vector from target model
token = last_token          # most recent token in the context

for step in range(K):   # K = 7 → 7 sequential decoder runs
    # 1 transformer layer: sees the features plus the previous token's embedding
    logits = eagle_decoder_layer(features, embed(token))
    token = sample(logits)
    draft_tokens.append(token)  # fed back in on the next iteration
    # ↑ each iteration WAITS for the previous one: pure sequential

# Result: 7 tokens from 7 sequential passes through 1 decoder layer
# DFlash will replace this entire loop with ONE parallel pass
vllm · run eagle-3 (for comparison)
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --speculative-config '{"model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B",
    "num_speculative_tokens": 3,
    "method": "eagle3",
    "draft_tensor_parallel_size": 1}'
# Draft head is tiny (~277 MB), co-deployed on same GPU as target
# num_speculative_tokens: start at 3-5 for most tasks

Diffusion Models: From Images to Text

Part 1 — Image Diffusion (the original idea)

Diffusion models were invented for images first. Here's how they work conceptually:

Imagine you take a clear photo of a cat and add random noise to it — like TV static. Each step you add more noise until the image is completely unrecognizable. Now train a neural network to do the reverse: given a noisy image, predict how to make it slightly cleaner. This network learns the "denoising direction."

At inference time (generating a new image): start with pure noise, apply the denoising network 20–50 times, and a clear image emerges. The crucial property: the network processes all pixels simultaneously at each step — not one pixel at a time. This is massively parallel.

Part 2 — Text Diffusion (dLLMs)

For text, you can't add Gaussian noise to pixels — you have discrete tokens. Instead, text diffusion uses masking: randomly mask tokens (replace with [?]) and train the network to predict what should fill each masked position.

The key property carries over: the model fills ALL masked positions at once in one forward pass. This is called masked language modeling at generation time.

Text Diffusion: Iterative Denoising (all positions updated in parallel each step)
Step 0: all tokens masked
[?] [?] [?] [?] [?] [?]
Step 1: model fills highest-confidence positions (all positions evaluated simultaneously)
Paris [?] a [?] city [?]
Step 2: fills remaining, final output
Paris is a beautiful city .
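
The unmasking schedule can be sketched in plain Python. `toy_denoiser` is a hypothetical model with hard-coded proposals and confidences. A real denoiser would condition on the tokens already filled in; this sketch only shows the "score every position, commit the most confident" loop.

```python
MASK = "[?]"


def toy_denoiser(tokens):
    """Hypothetical model: proposes (token, confidence) for every position in one call."""
    return [("Paris", 0.9), ("is", 0.4), ("a", 0.8),
            ("beautiful", 0.3), ("city", 0.7), (".", 0.5)]


def denoise_step(tokens, fill_per_step):
    proposals = toy_denoiser(tokens)  # all positions evaluated together
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    # Commit only the most confident masked positions this step.
    chosen = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:fill_per_step]
    return [proposals[i][0] if i in chosen else t for i, t in enumerate(tokens)]


step0 = [MASK] * 6
step1 = denoise_step(step0, fill_per_step=3)
step2 = denoise_step(step1, fill_per_step=3)
print(step1)  # ['Paris', '[?]', 'a', '[?]', 'city', '[?]']
print(step2)  # ['Paris', 'is', 'a', 'beautiful', 'city', '.']
```

Six tokens take two model calls here instead of six, at the price of committing several tokens per call without strict left-to-right conditioning.
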
⚠️ Why pure text diffusion models aren't used as main models

Autoregressive models (GPT, Llama, Qwen) condition each token on all previous tokens in strict order. This produces very precise, coherent token probability distributions. Text diffusion models generate tokens in parallel but sacrifice that strict sequential conditioning — their outputs are less precise, and they can't exactly match an autoregressive model's distribution. Fast, but lower quality. DFlash sidesteps this by only using diffusion for drafting, not final output.

Part 3 — How the Drafter Accesses LLM Internals

This is the most technical part. How exactly does a diffusion draft model "look inside" the big model?

There are two mechanisms: feature extraction (pulling hidden states out) and KV injection (pushing them into every draft layer). Together they make the drafter's predictions much better than if it ran alone.

Mechanism 1 — Feature Extraction

During every big model forward pass, you can request the hidden states from every layer. This is just a flag you pass to the model. Once you have them, you sample uniformly across all layers (not just 3 like EAGLE-3), project them with a small linear layer, and fuse them into one conditioning vector:

python · extracting hidden states from a HuggingFace model
import torch

# Forward pass with output_hidden_states=True
# This returns the hidden state tensor at every layer. No extra compute,
# though holding on to every layer's output does cost memory.
outputs = big_model(
    input_ids=token_ids,                # [batch, seq_len]
    output_hidden_states=True           # the magic flag
)

# outputs.hidden_states is a tuple of tensors: the embedding output plus one per layer
# hidden_states[0]  → shape [batch, seq_len, 4096]  (embedding output)
# hidden_states[8]  → shape [batch, seq_len, 4096]  (layer 8 = early)
# hidden_states[16] → shape [batch, seq_len, 4096]  (layer 16 = mid)
# hidden_states[32] → shape [batch, seq_len, 4096]  (layer 32 = final)

num_states = len(outputs.hidden_states)            # e.g. 33 (indices 0 to 32)

# Pick 6 evenly spaced layers, first to last (DFlash samples more than EAGLE-3).
# A fixed stride such as range(0, 33, 33 // 6) would return 7 indices, not 6.
n_sample = 6
layer_ids = [round(1 + i * (num_states - 2) / (n_sample - 1)) for i in range(n_sample)]
# for 33 entries: [1, 7, 13, 20, 26, 32]
sampled = [outputs.hidden_states[i][0, -1, :] for i in layer_ids]   # last token position
# sampled: list of 6 tensors, each [4096]

stacked = torch.stack(sampled, dim=0)             # [6, 4096]
# feature_projector is assumed to be a learned nn.Linear(6 * 4096, 4096)
fused = feature_projector(stacked.flatten())      # [6 * 4096] → [4096]
# fused is the distilled "essence" of what the big model knows about context

Mechanism 2 — KV Cache Injection

In a standard transformer layer, attention is computed like this: Query vectors from the current token ask "what am I looking for?" and Key/Value vectors from all other tokens answer "here's what I have, and here's what I carry." The attention scores determine how much each token's Value contributes to the output.

DFlash injects the big model's fused features directly into the Key and Value projections of every draft layer. This means the drafter "attends to" the big model's knowledge at every layer of its computation — not just at the input as in EAGLE-3.

python · standard attention vs. DFlash KV injection
########################################
# Standard transformer attention
########################################
d = 64  # head dimension
Q = linear_q(hidden)       # [seq, d] — "what am I looking for?"
K = linear_k(hidden)       # [seq, d] — "what do I have as keys?"
V = linear_v(hidden)       # [seq, d] — "what info do I carry?"

weights = torch.softmax(Q @ K.T / d**0.5, dim=-1)  # [seq, seq] attention weights
output = weights @ V                                # [seq, d] attended output

########################################
# DFlash attention with KV injection
########################################
# The big model's fused features are projected to K and V space
K_from_big = linear_k_inject(fused_features).unsqueeze(0)   # [d] → [1, d]
V_from_big = linear_v_inject(fused_features).unsqueeze(0)   # [d] → [1, d]

# Concatenate draft model's own KV with injected KV from big model
K_combined = torch.cat([linear_k(hidden), K_from_big], dim=0)  # [seq+1, d]
V_combined = torch.cat([linear_v(hidden), V_from_big], dim=0)  # [seq+1, d]

# Now every draft token attends to BOTH draft context AND big model knowledge
weights = torch.softmax(Q @ K_combined.T / d**0.5, dim=-1)  # [seq, seq+1]
output = weights @ V_combined                                # [seq, d]
# This is done at EVERY draft layer — not just the first
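
For readers without a GPU stack at hand, the same idea runs in plain Python for a single query vector. All vectors here are made-up toy values, and `attend` is an illustrative helper. It is meant to show that an injected key/value pair is just one more slot the query can attend to, not to reproduce the drafter's actual layers.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of key/value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, out


query = [1.0, 1.0]
draft_keys = [[1.0, 0.0], [0.0, 1.0]]      # the drafter's own two positions
draft_values = [[1.0, 0.0], [0.0, 1.0]]
injected_key = [1.0, 1.0]                  # toy stand-in for the projected big-model features
injected_value = [5.0, 5.0]

w_plain, out_plain = attend(query, draft_keys, draft_values)
w_inj, out_inj = attend(query, draft_keys + [injected_key], draft_values + [injected_value])

print(w_plain, out_plain)  # [0.5, 0.5] [0.5, 0.5]
print(len(w_inj))          # 3
```

In this toy setup the injected slot receives the largest weight, because its key lines up best with the query, so the output is pulled toward the injected value.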

Why non-causal attention enables parallel generation

Normally, attention is causal: token 5 can only see tokens 1–5, not 6 onward. This enforces left-to-right ordering, which is essential for autoregressive generation.

DFlash's draft model uses non-causal (bidirectional) attention among draft positions. All 16 [?] mask tokens see each other. This is what lets all 16 draft tokens be generated simultaneously — they inform each other's predictions during the single forward pass.

python · causal mask vs. bidirectional mask
# CAUSAL mask — standard autoregressive (EAGLE-3's decoder uses this)
# 1 = can attend, 0 = blocked
causal = torch.tensor([
    [1, 0, 0, 0],  # pos 0: sees only itself
    [1, 1, 0, 0],  # pos 1: sees 0,1
    [1, 1, 1, 0],  # pos 2: sees 0,1,2
    [1, 1, 1, 1],  # pos 3: sees all
])
# Each position must be generated one at a time — SEQUENTIAL

# BIDIRECTIONAL (non-causal) mask — DFlash draft model uses this
bidirectional = torch.tensor([
    [1, 1, 1, 1],  # draft[?] 0 sees all draft positions
    [1, 1, 1, 1],  # draft[?] 1 sees all draft positions
    [1, 1, 1, 1],  # draft[?] 2 sees all draft positions
    [1, 1, 1, 1],  # draft[?] 3 sees all draft positions
])
# All positions computed at once: ONE FORWARD PASS (4 positions shown here, 16 in DFlash)

DFlash: Full Technical Architecture

DFlash combines three insights: (1) use diffusion for parallel draft generation, (2) condition the diffusion drafter deeply on the big model's multi-layer hidden states, (3) inject those features into every draft layer's KV cache so the drafter is continuously informed at every computation step.

The Complete Pipeline

DFlash: One Full Inference Cycle
Step 1: Big model forward pass (normal generation step, output_hidden_states=True)
Step 2: Extract hidden states from N layers spaced evenly through the model (e.g. N = 6, from the first layer to the last)
Step 3: Feature fusion projection (N×d_model → d_model)
Step 4: Draft model: inject fused features into the K/V of every draft layer, generate 16 [?] tokens simultaneously (non-causal attention, one forward pass)
Step 5: Big model verifies 16 drafts in one pass (accept / reject per position)
Step 6: Repeat

Full Code Walkthrough (Annotated Pseudocode)

python · complete DFlash inference cycle with explanations
##########################################################
# STEP 1: Run big model forward pass
# Identical to normal generation — just add output_hidden_states=True
##########################################################
big_outputs = big_model(
    input_ids=current_tokens,           # [1, seq_len]
    use_cache=True,                   # reuse KV cache from previous tokens
    output_hidden_states=True          # ← DFlash needs these
)
# big_outputs.hidden_states: tuple of (num_layers+1) tensors
# each tensor shape: [1, seq_len, hidden_size]
last_token_logits = big_outputs.logits[0, -1]   # [vocab_size]

##########################################################
# STEP 2: Extract hidden states from multiple layers
##########################################################
all_hidden = big_outputs.hidden_states            # tuple of (num_layers + 1) tensors
num_states = len(all_hidden)                      # e.g. 33 for a 32-layer model

# Pick 6 evenly spaced layers (more coverage than EAGLE-3's 3 layers).
# A fixed stride like range(1, 33, 33 // 6) would yield 7 indices, not 6.
n_extract = 6
extract_layers = [round(1 + i * (num_states - 2) / (n_extract - 1))
                  for i in range(n_extract)]      # e.g. [1, 7, 13, 20, 26, 32]
sampled = torch.stack(
    [all_hidden[i][0, -1, :] for i in extract_layers], dim=0
)   # [num_extracted, hidden_size]  e.g. [6, 4096]

##########################################################
# STEP 3: Fuse into one conditioning vector
##########################################################
# feature_projector is a small learned linear layer: [num_extracted * d] → [d]
# It learns which layer's signals matter most for predicting future tokens
fused_ctx = feature_projector(sampled.flatten())   # [hidden_size]

##########################################################
# STEP 4: Draft model generates K tokens in ONE parallel pass
##########################################################
K = 16   # number of tokens to draft simultaneously

# Initialize K mask token embeddings — these are what get "denoised"
# Think of them as K [MASK] tokens, all unknown
mask_embeds = mask_embedding.expand(K, -1)        # [K, hidden_size]

# The draft model's forward pass:
# - Input: K mask token embeddings
# - KV injection: fused_ctx goes into Key and Value of EVERY draft layer
# - Attention mask: non-causal (all K positions attend to all K positions)
# - Output: logits for all K positions simultaneously
draft_logits = draft_model.forward(
    input_embeds=mask_embeds,           # [K, hidden_size]
    kv_injection_context=fused_ctx,     # injected into every layer's KV
    attention_mask="bidirectional"     # non-causal: all see all
)                                       # output: [K, vocab_size]

# Pick draft tokens greedily (argmax) from the logits: ALL K at once, no loop needed
draft_tokens = torch.argmax(draft_logits, dim=-1)   # [K]
# ↑ This is the entire architectural difference from EAGLE-3.
# EAGLE-3 needs a for-loop with K iterations. DFlash produces all K in one call.

##########################################################
# STEP 5: Big model verifies all K drafts in one pass
##########################################################
# Concatenate existing context with K draft tokens
verify_ids = torch.cat([current_tokens, draft_tokens.unsqueeze(0)], dim=1)
verify_out = big_model(verify_ids)                  # one forward pass
verify_logits = verify_out.logits[0]                # [seq_len+K, vocab_size]

##########################################################
# STEP 6: Accept/reject loop
##########################################################
accepted = []
seq_len = current_tokens.shape[1]

for i in range(K):
    # What would the big model have produced at position seq_len+i?
    big_model_token = torch.argmax(verify_logits[seq_len + i - 1])
    if big_model_token == draft_tokens[i]:
        accepted.append(draft_tokens[i])
    else:
        accepted.append(big_model_token)  # use big model's correction
        break                              # stop here — can't trust later drafts

# Typical result: 8–12 tokens accepted
# Net throughput: 8–12 tokens for cost of ~2 big model passes
# vs baseline: 8–12 tokens for cost of 8–12 big model passes
# Speedup: 4–6×
# torch.stack keeps the accepted 0-dim tensors on their original device
current_tokens = torch.cat([current_tokens, torch.stack(accepted).unsqueeze(0)], dim=1)
✅ The key difference from EAGLE-3 in one line

EAGLE-3: for step in range(K): draft_token = decoder(prev_token) — a loop with K sequential iterations.

DFlash: draft_tokens = draft_model(mask_embeds) — one call produces all K tokens. No loop. That's it.


Using the Same Model as Its Own Drafter

The DFlash draft model for Qwen3.6-35B-A3B is called z-lab/Qwen3.6-35B-A3B-DFlash. It uses the same Qwen3-style architecture as the target — same hidden dimensions, same tokenizer, same vocabulary. But it is not a full copy. Here is exactly what it is.

What is shared vs. uniquely trained

Target Model vs DFlash Drafter: Component Breakdown

TARGET: Qwen3.6-35B-A3B
Embedding Layer: ~1.5 GB (vocabulary → vectors)
62 Transformer Layers: ~62 GB (the "brain")
LM Head (unembedding): ~1.5 GB (vectors → token probs)
Total: ~35B params / ~70 GB

DRAFTER: Qwen3.6-35B-A3B-DFlash
Embedding Layer: SHARED (pointer, 0 extra RAM)
Feature Projector: ~0.2 GB (newly trained)
3–5 Draft Transformer Layers: ~2.5 GB (newly trained, same architecture)
LM Head (unembedding): SHARED (pointer, 0 extra RAM)
Total extra: ~4 GB

💡 Why sharing embeddings and LM Head saves significant memory

The embedding layer (token ID → 4096-dim vector) and LM Head (4096-dim vector → 100k-dim vocab probabilities) are each ~1.5 GB for a large model. When the draft model reuses them as pointers to the same memory, not copies, you save ~3 GB and also guarantee the draft model speaks the same "language" as the target — same token representations, same vocabulary logits.

Why the same architecture type is used

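The draft model uses Qwen3-style transformer layers because the hidden dimensions then match the target's (which the shared embedding layer, the shared LM Head, and the injected features all rely on) and because initialization is clean.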

GPU Memory Breakdown

Let's look at what actually occupies GPU memory when running DFlash with Qwen3.5-27B on a single A100 80GB.

❌ Without DFlash
Target model weights (BF16): ~56 GB
KV cache (8k context): ~8 GB
Draft model: 0 GB
Activations / overhead: ~4 GB
Throughput: ~20 tok/s (single user)

✅ With DFlash
Target model weights (BF16): ~56 GB
KV cache (target + draft): ~11 GB
Draft model extra params: ~4 GB
Activations / overhead: ~4 GB
Throughput: ~80–100 tok/s (single user)
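
Adding up the two budgets shows how much headroom is left on the 80 GB card. The figures are the approximate ones listed above; the dictionary names are only for illustration.

```python
GPU_CAPACITY_GB = 80  # single A100 80GB

without_dflash = {
    "target weights (BF16)": 56,
    "KV cache (8k context)": 8,
    "draft model": 0,
    "activations / overhead": 4,
}
with_dflash = {
    "target weights (BF16)": 56,
    "KV cache (target + draft)": 11,
    "draft model extra params": 4,
    "activations / overhead": 4,
}

total_without = sum(without_dflash.values())
total_with = sum(with_dflash.values())
print(f"without DFlash: {total_without} GB used, {GPU_CAPACITY_GB - total_without} GB free")
print(f"with DFlash:    {total_with} GB used, {GPU_CAPACITY_GB - total_with} GB free")
# without DFlash: 68 GB used, 12 GB free
# with DFlash:    75 GB used, 5 GB free
```

DFlash still fits, but the spare memory shrinks from about 12 GB to about 5 GB, so longer contexts or larger batches hit the limit sooner.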

Compute cost breakdown per cycle

Time budget: generating 12 tokens (DFlash vs baseline)

BASELINE: 12 tokens, no DFlash
12 × big model passes: 12.0 units
Total: 12.0 units of compute

DFLASH: 16 drafted, 12 accepted
1× big model (step N): 1.0 unit
Feature extract + fuse: 0.05 unit
Draft model (16 tokens, 1 pass): 0.2 units
1× big model verify (16 tokens): 1.1 units
Total: ~2.35 units → 12 tokens → 5.1× faster
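
The same arithmetic as a small function, so the effect of the acceptance count is easy to see. The per-stage costs are the unit estimates from the budget above; `cycle_speedup` is a hypothetical helper name.

```python
def cycle_speedup(accepted_tokens, big_pass=1.0, extract=0.05, draft=0.2, verify=1.1):
    """Speedup of one DFlash cycle over baseline decoding.

    Baseline: one big-model pass per token.
    DFlash: big pass + feature extraction + one draft pass + one verify pass per cycle.
    """
    cycle_cost = big_pass + extract + draft + verify
    return accepted_tokens * big_pass / cycle_cost


for accepted in (4, 8, 12, 16):
    print(accepted, round(cycle_speedup(accepted), 1))
# 4 1.7
# 8 3.4
# 12 5.1
# 16 6.8
```

The cycle cost is fixed at about 2.35 units, so the speedup scales directly with how many drafted tokens survive verification.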

What happens to the KV cache when drafts are rejected

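When the big model rejects a draft token, the KV cache entries for that position and every later draft position are simply discarded, so they're never committed to the main context. The KV cache only grows when tokens are accepted. Combined with the rule that a draft token is kept only when the big model would have produced it anyway, this is what makes speculative decoding "lossless": the committed KV state and the output match what normal generation would have produced, up to small floating-point differences between batched and one-token-at-a-time passes.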
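
In code, discarding rejected positions amounts to truncating the cache back to "committed length + accepted drafts". The list-of-tuples cache and `commit_accepted` below are a toy model. Real engines keep the cache in preallocated tensors or pages, but the bookkeeping follows the same idea.

```python
def commit_accepted(kv_cache, committed_len, num_accepted):
    """Drop cache entries for rejected draft positions by truncating in place."""
    del kv_cache[committed_len + num_accepted:]
    return kv_cache


kv_cache = [("K0", "V0"), ("K1", "V1"), ("K2", "V2")]   # committed context
committed_len = len(kv_cache)

# Verification writes entries for 4 speculative draft positions.
kv_cache += [(f"Kd{i}", f"Vd{i}") for i in range(4)]

# The big model agrees with the first 2 drafts and rejects the rest.
commit_accepted(kv_cache, committed_len, num_accepted=2)
print(len(kv_cache))   # 5
print(kv_cache[-1])    # ('Kd1', 'Vd1')
```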

The Qwen3.6 DeltaNet (GDN) Complication

Qwen3.6 mixes Gated Delta Networks (a linear recurrent attention mechanism) with standard full attention. GDN layers maintain a running recurrent state — like a hidden memory that updates as each token is processed. This breaks standard speculative decoding.

python · the GDN rollback problem and solution
# Standard transformer KV cache — rolling back is trivial:
kv_cache[position] = (K_vector, V_vector)   # stored separately per position
# If rejected: just don't include that position in future attention. Easy.

########################################
# GDN / DeltaNet — stateful, hard to roll back:
########################################
# GDN maintains a recurrent state S that updates multiplicatively:
S_new = S_old * gate + delta_value   # S encodes compressed history
# After processing a draft token: S_old is gone, replaced by S_new
# If we reject that token, we can't undo S_new → S_old automatically
# (matrix multiplication is not trivially reversible)

########################################
# SGLang's solution: extra_buffer strategy
########################################
# Before starting the speculative draft phase:
S_checkpoint = copy(S)                # save current recurrent state

# Run draft model, run verification...
num_accepted = verify_and_count(draft_tokens)

# If any tokens rejected:
if num_accepted < K:
    S = S_checkpoint                    # restore clean state
    S = advance_state(S, accepted_tokens)  # replay only accepted tokens
# Result: S is exactly what it would have been without speculation

# This is why you need --mamba-scheduler-strategy extra_buffer in SGLang
# for Qwen3.6 models with DFlash
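
A runnable toy version of checkpoint-and-replay is sketched below. It uses a scalar state and made-up numbers in place of the real matrix-valued GDN state, and `advance_state` is a hypothetical helper rather than SGLang's implementation. With real tensors the checkpoint would need an explicit copy, such as `clone()`.

```python
def advance_state(state, deltas, gate=0.9):
    """Toy scalar recurrence standing in for a GDN update: S = S * gate + delta."""
    for delta in deltas:
        state = state * gate + delta
    return state


initial_state = 1.0
draft_deltas = [0.5, -0.2, 0.7, 0.1]   # one toy update per draft token
num_accepted = 2

checkpoint = initial_state                           # save before speculating
state = advance_state(initial_state, draft_deltas)   # verification consumes every draft token
if num_accepted < len(draft_deltas):
    # Restore the checkpoint and replay only the accepted tokens.
    state = advance_state(checkpoint, draft_deltas[:num_accepted])

# What plain, non-speculative decoding would have produced:
reference = advance_state(initial_state, draft_deltas[:num_accepted])
print(state == reference)  # True
```

The replay costs a little extra compute, but it guarantees the recurrent state never contains traces of rejected tokens.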

Benchmarks

Speedup over autoregressive decoding:
Autoregressive baseline: 1×
EAGLE-3 (state-of-art before DFlash): 3–4×
DFlash on code / math / structured text: up to 6.2×
DFlash with thinking/reasoning mode: ≈4.5×

Metric | EAGLE-3 | DFlash
Draft method | Sequential (K steps) | Parallel (1 pass)
Draft model depth | 1 layer (forced) | 3–5 layers
Tokens drafted per cycle | 5–7 | 15–16
Max speedup over baseline | 3–4× | 6.2×
Speedup vs EAGLE-3 | n/a | 2.5× faster
Extra VRAM needed | ~280 MB | ~4 GB
Output quality | Lossless | Lossless

Supported Models

All DFlash draft models are published by z-lab at https://huggingface.co/z-lab

Target Model | Draft Model | Backend notes (vLLM, SGLang, MLX)
Qwen/Qwen3-8B | z-lab/Qwen3-8B-DFlash-b16 |
Qwen/Qwen3.5-27B | z-lab/Qwen3.5-27B-DFlash |
Qwen/Qwen3.5-35B-A3B | z-lab/Qwen3.5-35B-A3B-DFlash | Exp
Qwen/Qwen3.6-35B-A3B | z-lab/Qwen3.6-35B-A3B-DFlash | Patched, PR branch
Qwen/Qwen3-Coder-30B-A3B | z-lab/Qwen3-Coder-30B-A3B-DFlash |
meta-llama/Llama-3.1 family | z-lab/Llama-3.1-*-DFlash |
Kimi-K2.5 (coming soon) | Preview on z-lab HF | Soon, Soon

How to Run DFlash

Option A — vLLM stable (Qwen3.5-27B, easiest)

shell · install + launch
# Install (--torch-backend is a uv option, so install through uv rather than plain pip)
uv pip install -U vllm --torch-backend=auto

# Launch OpenAI-compatible server on localhost:8000
vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method": "dflash",
    "model": "z-lab/Qwen3.5-27B-DFlash",
    "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768 \
  --speculative-disable-by-batch-size 32   # auto-disable at high concurrency

Option B — SGLang (Qwen3.5-35B-A3B, best for agents)

shell · sglang launch
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1

# --mamba-scheduler-strategy extra_buffer handles GDN state rollback
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-35B-A3B \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --attention-backend trtllm_mha \
  --speculative-draft-attention-backend fa4 \
  --mem-fraction-static 0.75 \
  --mamba-scheduler-strategy extra_buffer \
  --trust-remote-code

Option C — Qwen3.6 (patched vLLM)

shell · qwen3.6 + dflash
# Install patched build until main vLLM release includes it
# (--torch-backend is a uv option, so this uses uv pip rather than plain pip)
uv pip install vllm
uv pip install -U --torch-backend=auto \
  "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/40898/head"

vllm serve Qwen/Qwen3.6-35B-A3B \
  --speculative-config '{"method": "dflash",
    "model": "z-lab/Qwen3.6-35B-A3B-DFlash",
    "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768

Option D — HuggingFace Transformers (quick experiments only)

python · transformers backend
from transformers import AutoModel, AutoModelForCausalLM

# Load draft model (shares embedding + LM head with target automatically)
draft = AutoModel.from_pretrained(
    "z-lab/Qwen3-8B-DFlash-b16",
    trust_remote_code=True, dtype="auto", device_map="cuda:0"
).eval()

# Load target model
target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", dtype="auto", device_map="cuda:0"
).eval()

Calling the server (after vLLM or SGLang launch)

python · openai-compatible client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Completely normal OpenAI API call — DFlash is transparent
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B",
    messages=[{"role": "user", "content": "Write a binary search in Python"}],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
# Output follows the same distribution as non-DFlash decoding (the same tokens under greedy decoding). It just arrives faster.

When to Use DFlash

Criterion | ✅ DFlash | ⚡ EAGLE-3 | Standard
Concurrent users | 1–32 | 1–50 | Any
Output length | Long (code, essays) | Medium-long | Any
Content type | Code, math, structured | Most types | Any
Extra VRAM | ~4 GB more | ~280 MB more | 0

💡 Decision rule from the community

if batch_size > 32 or output_tokens < 50 → standard decoding
elif DFlash checkpoint exists and acceptance_rate > 0.7 → DFlash
else → EAGLE-3
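
The rule translates directly into a small Python function. The function name, argument names, and return strings are illustrative; the thresholds are exactly the ones in the rule above.

```python
def choose_decoding(batch_size, output_tokens, dflash_checkpoint_exists, acceptance_rate):
    """Pick a decoding strategy using the community rule of thumb."""
    if batch_size > 32 or output_tokens < 50:
        return "standard"
    if dflash_checkpoint_exists and acceptance_rate > 0.7:
        return "dflash"
    return "eagle3"


print(choose_decoding(64, 500, True, 0.8))   # standard (batch too large)
print(choose_decoding(4, 20, True, 0.8))     # standard (output too short)
print(choose_decoding(4, 500, True, 0.8))    # dflash
print(choose_decoding(4, 500, False, 0.8))   # eagle3 (no DFlash checkpoint)
print(choose_decoding(4, 500, True, 0.6))    # eagle3 (acceptance rate too low)
```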

Everything in One Place

LLMs are slow because generation is one token per full model pass — sequential, can't skip ahead.

Draft models solve this by guessing ahead and verifying all guesses in one parallel big-model pass.

EAGLE-3 attaches a tiny 1-layer decoder to the big model, fusing hidden states from 3 layers (early/mid/late) via a learned FC layer. It drafts autoregressively — one token per step, forced to stay at 1 layer. Ceiling: ~4×.

Diffusion means generating all token positions simultaneously (one forward pass) instead of sequentially. Fast, but lower quality when used alone as the main model.

DFlash uses a 3–5 layer diffusion draft model conditioned on the big model's hidden states from many layers. Features are injected into every draft layer's Key-Value cache. Non-causal (bidirectional) attention lets all 16 draft tokens be generated simultaneously in one pass. Big model still verifies everything — output quality is identical. Speedup: up to 6.2×, 2.5× faster than EAGLE-3.

The draft model shares embedding and LM Head weights with the target (zero extra RAM for those). Its 3–5 new transformer layers use the same architecture as the target so hidden dimensions match and initialization is clean. Total extra VRAM: ~4 GB.

Run it with vLLM (stable, Qwen3.5 and older) or SGLang (recommended for MoE and agents). For Qwen3.6 you need a patched vLLM for now. All draft models are at https://huggingface.co/z-lab.