Deep Dive Notes · May 2026

Understanding DFlash

Complete technical notes — EAGLE-3 architecture, how diffusion accesses a model's internals, DFlash code walkthrough, GPU usage, and copy-paste commands to run it yourself.

Foundation · 01

Why is LLM inference slow?

Language models generate text by producing one token at a time, where every new token depends on all previous tokens. This is called autoregressive generation. A token is roughly one word or word fragment — "Paris" = 1 token, "unbelievable" = 3 tokens.

The core problem: each token needs a full forward pass through the entire model. For a dense 27B parameter model, that means reading all 27 billion weights and doing roughly 27 billion multiply-accumulate operations for every single token. And you can't skip ahead: token 7 depends on tokens 1–6 already existing.

🧠 What is a forward pass?

A forward pass is when data travels through the entire neural network from input to output. Each transformer layer does matrix multiplications on all the token representations. For a 32-layer model, that's 32 rounds of heavy math, outputting one token at the end. You then feed that token back in and do it again for the next token.
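The loop described above can be sketched in a few lines. This is a toy, not a real model: `toy_forward_pass` and its lookup table are made-up stand-ins for a full forward pass, used only to show that each new token needs its own pass over everything generated so far.

```python
def toy_forward_pass(tokens):
    """Hypothetical stand-in for a full model forward pass: returns the next token."""
    continuation = {"France": "is", "is": "Paris", "Paris": "."}
    return continuation.get(tokens[-1], "<eos>")


def generate(prompt, max_new_tokens):
    tokens = list(prompt)
    passes = 0
    for _ in range(max_new_tokens):
        # Each call needs ALL earlier tokens, so calls cannot overlap.
        next_token = toy_forward_pass(tokens)
        passes += 1
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens, passes


tokens, passes = generate(["The", "capital", "of", "France"], max_new_tokens=3)
print(tokens, passes)
```

Three new tokens cost three passes; that one-to-one ratio is what speculative decoding tries to break.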

The Sequential Bottleneck

Prompt (processed in ONE fast parallel pass, like reading the whole sentence at once): The, capital, France

Response (one full model forward pass per token, sequential, can't parallelize): pass 1 (Paris) → pass 2 → pass 3 → pass 4 → pass 5 (known) → ...

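GPUs are built for doing millions of operations in parallel. But here each pass waits for the previous one, and with a single request most of each pass is spent streaming weights from memory rather than computing. You're using maybe 5% of your GPU's compute capability.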

Foundation · 02

How Draft Models Work

The insight: use a cheap fast model to guess ahead, then let the big model verify all guesses simultaneously in one parallel pass.

1. Draft model guesses K tokens fast. A tiny model (much smaller than the main model) predicts the next 5–16 tokens. It's cheaper because it has far fewer layers. It's wrong sometimes, but wrong drafts get caught.
2. Big model verifies all K drafts in one pass. The transformer can read all K draft tokens simultaneously, just like it reads your prompt in one pass. This single verification step costs about as much as generating one normal token.
3. Accept correct tokens, fix the first wrong one. Any draft token the big model agrees with is accepted. At the first disagreement, the big model's correction is used and drafting restarts.
4. Output is mathematically lossless. You only accept what the big model would have produced anyway. The output distribution is identical to running the big model alone. Speed, no quality cost.

Draft-Verify Cycle Example

Draft model guesses ahead: Paris, beautiful, historic, city, Europe

Big model verifies all drafts in ONE pass and accepts or rejects each: Paris ✓, beautiful ✓, historic ✓, city ✗ (everything after the first rejection is dropped)

Result: 3 accepted + big model's correction "capital": Paris, beautiful, historic, capital

4 new tokens for ~cost of 1 big model pass. Net speedup.
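The accept/reject rule in this example can be written as a short function. A minimal sketch for greedy decoding only; `verify_greedy` is a name invented here, and the final "of" in `target` is a made-up placeholder that never gets used because it sits after the first mismatch.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Accept drafts until the first mismatch, then take the target's token.

    target_tokens[i] is what the big model predicts at draft position i,
    assuming every earlier draft token was accepted.
    """
    output = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted == wanted:
            output.append(drafted)
        else:
            output.append(wanted)  # big model's correction
            break                  # later drafts were built on a wrong token
    return output


draft = ["Paris", "beautiful", "historic", "city", "Europe"]
target = ["Paris", "beautiful", "historic", "capital", "of"]
result = verify_greedy(draft, target)
print(result)
```

With sampling instead of greedy decoding, the comparison is replaced by a probabilistic acceptance test, but the "stop at the first rejection" structure stays the same.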

EAGLE-3 · 03

EAGLE-3: Architecture Deep Dive

EAGLE-3 is the previous state-of-the-art for speculative decoding. Instead of a completely separate small model, it attaches a tiny "draft head" directly to the big model and feeds it the big model's internal representations. Let's understand every component.

What is a Hidden State?

This is the most important concept in this whole document. You must understand it to understand EAGLE-3 and DFlash.

A transformer model has many layers stacked on top of each other — think of them as processing stages. After each layer, every token position has a hidden state: a large vector of numbers. For an 8B model this is 4096 numbers. For a 27B model it might be 7168 numbers.

🧠 Hidden States — the model's "thoughts" at each layer

A hidden state is NOT just the token identity. It's a rich numerical encoding of meaning in context. The hidden state of the word "bank" in "I deposited money at the bank" looks completely different from "bank" in "I sat by the river bank" — even though it's the same word. The model has already resolved the ambiguity. Critically: later hidden states implicitly encode information about what token logically comes next. This is what draft models exploit.

Hidden States Inside a Transformer — What They Are

Input: Token IDs [seq_len]
↓ embedding
Layer 0 (embedding): h₀ [seq_len, 4096]
↓
Layer 8 (early): h₈, syntax ("is this a noun?")
↓
Layer 16 (middle): h₁₆, semantics ("what does it mean?")
↓
Layer 32 (final): h₃₂, prediction ("what comes next?")
↓ LM Head
Output: Token logits [vocab_size ≈ 100k]

Each hidden state hₙ is a vector of floats, one per layer, one per token position. A sequence of 512 tokens in a 32-layer 8B model generates 32 × 512 hidden state vectors — each 4096 floats wide.
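To get a feel for the sizes involved, the count above can be turned into bytes. The 2 bytes per value is an assumption (BF16 or FP16 storage); everything else comes from the numbers in the paragraph.

```python
num_layers = 32
seq_len = 512
hidden_dim = 4096
bytes_per_value = 2  # assumption: BF16/FP16 storage

num_vectors = num_layers * seq_len        # one vector per layer per position
num_values = num_vectors * hidden_dim
size_mib = num_values * bytes_per_value / 2**20

print(num_vectors, num_values, size_mib)
```

Under these assumptions the full stack of hidden states for one 512-token sequence is on the order of a hundred MiB, which is why drafters only keep a handful of layers.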

Early layers learn local patterns — grammar, part-of-speech. Middle layers learn semantics — what words mean in context. Late layers are "prediction-ready" — they encode what the next token should be.

EAGLE-3 extracts 3 specific hidden states (early, mid, late) and fuses them. DFlash extracts hidden states from many layers uniformly and injects them even more deeply into the drafter.

EAGLE-3 Three Components

EAGLE-3 is built from three chained pieces. Here is the full architecture:

EAGLE-3 Complete Architecture

Target Model (e.g. Llama-3.1-8B): 32 transformer layers, full forward pass
↓ extract 3 layers
h_low + h_mid + h_high: 3 hidden states, 3 × 4096 = 12,288 dims
→ Component 1: Encoder (FC layer). Feature fusion, 12,288 → 4096. Concatenates and compresses the 3 hidden states into one 4096-dim fused vector.
→ Component 2: Decoder. 1 transformer layer, autoregressive loop.
↓ K times (1 per draft token)
Component 3: Output. Draft token tree, 5–7 tokens, branching.

Component 1: The Feature Fusion (FC Layer)

EAGLE-1 only used the final layer's hidden state. EAGLE-3 improves this with tri-layer fusion. It samples hidden states from three points in the target model:

📐 Concrete example — Llama-3.1-8B (hidden dim = 4096, 32 layers)

h_low (layer 8): encodes syntax and local token context → 4096 floats
h_mid (layer 16): encodes semantic meaning in context → 4096 floats
h_high (layer 32, final): encodes next-token prediction signals → 4096 floats

Concatenate: [h_low || h_mid || h_high] = 12,288 floats
FC layer compresses: 12,288 → 4096 floats
The FC layer is learned — it learns which signals from which layers matter most for predicting future tokens.
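The concatenate-then-compress step can be sketched with toy dimensions. This is an illustration under assumptions, not EAGLE-3's code: the hidden size is shrunk to 4, the weights are random rather than learned, and whether the real layer has a bias term is not stated in these notes.

```python
import random


def concat(*vectors):
    out = []
    for v in vectors:
        out.extend(v)
    return out


def linear(x, weight, bias):
    # weight has shape [out_dim][in_dim]; returns a vector of length out_dim
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


random.seed(0)
d = 4  # toy hidden size; the example model above uses 4096
h_low = [random.gauss(0, 1) for _ in range(d)]
h_mid = [random.gauss(0, 1) for _ in range(d)]
h_high = [random.gauss(0, 1) for _ in range(d)]

weight = [[random.gauss(0, 0.1) for _ in range(3 * d)] for _ in range(d)]
bias = [0.0] * d
fused = linear(concat(h_low, h_mid, h_high), weight, bias)  # 3*d values in, d out

# Weight count of the full-size 12,288 -> 4096 projection
fc_weights_full_size = (3 * 4096) * 4096
print(len(fused), fc_weights_full_size)
```

At full size the projection holds about 50 million weights, small next to the 8B-parameter target.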

Why three layers? Because different things matter. Early layers know grammar. Middle layers know meaning. Late layers know what's likely next. Together they give the drafter a holistic picture of what the big model is "thinking."

Component 2: The Draft Decoder (1 transformer layer)

The decoder is a single standard transformer decoder layer — it's tiny, about 277 MB. It receives two things:

The 4096-dim fused feature from Component 1
The embedding of the last generated draft token (as input to the next step)

It autoregressively produces one draft token, then feeds that back in to produce the next. For 7 draft tokens, this loop runs 7 times.

The Draft Tree (not just a sequence)

EAGLE-3 doesn't output just a single chain of tokens. At each step it considers the top-N most probable tokens and creates branching paths. The target model verifies the entire tree in one pass and accepts the longest matching path.

EAGLE-3 Draft Tree (simplified) — Multiple Paths Explored

[After "Paris is a..."]
┌────────────────┴────────────────┐
[beautiful...]              [historic...]
┌──────┴──────┐         ┌──────┴──────┐
[city]  [capital]     [city]  [capital]

The big model verifies all 4 leaf paths simultaneously. If it agrees with "historic capital" the model accepts 2 tokens. This tree structure gives EAGLE-3 better acceptance rates than a simple linear draft.
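The tree above can be represented as nested dictionaries, and "accept the longest matching path" becomes a small search. A simplified sketch: real systems verify the whole tree in one pass with a tree-shaped attention mask, while here the target's preferred continuation is just given as a hypothetical tuple.

```python
tree = {
    "beautiful": {"city": {}, "capital": {}},
    "historic": {"city": {}, "capital": {}},
}


def leaf_paths(node, prefix=()):
    """Enumerate every root-to-leaf token path in the draft tree."""
    if not node:
        return [prefix]
    paths = []
    for token, child in node.items():
        paths.extend(leaf_paths(child, prefix + (token,)))
    return paths


def accepted_length(path, target_continuation):
    """Number of leading tokens on which the path and the target agree."""
    n = 0
    for drafted, wanted in zip(path, target_continuation):
        if drafted != wanted:
            break
        n += 1
    return n


paths = leaf_paths(tree)
target_continuation = ("historic", "capital")  # assumed target preference
best = max(paths, key=lambda p: accepted_length(p, target_continuation))
print(len(paths), best, accepted_length(best, target_continuation))
```

A single linear draft of "beautiful city" would have had zero tokens accepted here; the extra branches are what rescue the cycle.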

EAGLE-3's Hard Ceiling

Despite being clever, EAGLE-3 has a fundamental limitation baked into its design: the decoder generates one token at a time. Seven draft tokens = seven sequential decoder runs. The latency stacks up.

To compensate, EAGLE-3 keeps its decoder at exactly one transformer layer. Adding a second layer would double the per-step compute, making the sequential cost too high. So EAGLE-3 is permanently stuck with a very shallow, limited draft model.

⚠️ EAGLE-3's Architectural Bottleneck

Drafting is still sequential (N steps for N tokens) → can't parallelize
Decoder forced to 1 layer → low model capacity → lower acceptance rate
Practical speedup tops out around 3–4×, and the sequential design makes it hard to push beyond that

python · eagle-3 draft loop (showing the bottleneck)

# EAGLE-3 (pseudocode): generates K draft tokens in K sequential steps
# Each step depends on the previous one, so the loop cannot be parallelized
draft_tokens = []
features = fused_features        # 4096-dim fused vector from the target model
prev_embed = embed(last_token)   # embedding of the most recent token

for step in range(K):   # K = 7 → 7 sequential decoder runs
    # the decoder receives BOTH inputs: the feature vector and the last token's embedding
    logits, features = eagle_decoder_layer(features, prev_embed)  # 1 transformer layer
    next_token = sample(logits)
    draft_tokens.append(next_token)
    prev_embed = embed(next_token)  # feed back for next step
    # ↑ each iteration WAITS for the previous one: pure sequential

# Result: 7 tokens from 7 sequential passes through 1 decoder layer
# DFlash will replace this entire loop with ONE parallel pass

vllm · run eagle-3 (for comparison)

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --speculative-config '{"model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B",
    "num_speculative_tokens": 3,
    "method": "eagle3",
    "draft_tensor_parallel_size": 1}'
# Draft head is tiny (~277 MB), co-deployed on same GPU as target
# num_speculative_tokens: start at 3-5 for most tasks

Diffusion · 04

Diffusion Models: From Images to Text

Part 1 — Image Diffusion (the original idea)

Diffusion models were invented for images first. Here's how they work conceptually:

Imagine you take a clear photo of a cat and add random noise to it — like TV static. Each step you add more noise until the image is completely unrecognizable. Now train a neural network to do the reverse: given a noisy image, predict how to make it slightly cleaner. This network learns the "denoising direction."

At inference time (generating a new image): start with pure noise, apply the denoising network 20–50 times, and a clear image emerges. The crucial property: the network processes all pixels simultaneously at each step — not one pixel at a time. This is massively parallel.

Part 2 — Text Diffusion (dLLMs)

For text, you can't add Gaussian noise to pixels — you have discrete tokens. Instead, text diffusion uses masking: randomly mask tokens (replace with [?]) and train the network to predict what should fill each masked position.

The key property carries over: the model fills ALL masked positions at once in one forward pass. This is called masked language modeling at generation time.

Text Diffusion — Iterative Denoising (all positions updated in parallel each step)

Step 0: all tokens masked: [?] at every position

Step 1: model fills highest-confidence positions (all positions evaluated simultaneously): Paris, [?], city, [?]

Step 2: fills remaining, final output: Paris, beautiful, city
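The denoising steps above can be mimicked with a toy that commits the most confident masked positions on each step. Everything here is illustrative: `toy_denoiser` and its fixed proposals and confidences are made up, and a real diffusion LM would recompute its proposals from the partially filled sequence.

```python
MASK = "[?]"


def toy_denoiser(tokens):
    """Hypothetical stand-in for a diffusion LM: proposes (token, confidence)
    for every position in one call, as a real model would in one forward pass."""
    proposals = [("Paris", 0.9), ("beautiful", 0.4), ("city", 0.8), (".", 0.3)]
    return proposals[: len(tokens)]


def denoise(length, steps, per_step):
    tokens = [MASK] * length
    history = [list(tokens)]
    for _ in range(steps):
        proposals = toy_denoiser(tokens)  # all positions scored at once
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit only the highest-confidence masked positions this step.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return history


history = denoise(length=4, steps=2, per_step=2)
for step, state in enumerate(history):
    print(step, state)
```

The number of model calls is `steps`, not `length`: four positions are filled in two passes.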

⚠️ Why pure text diffusion models aren't used as main models

Autoregressive models (GPT, Llama, Qwen) condition each token on all previous tokens in strict order. This produces very precise, coherent token probability distributions. Text diffusion models generate tokens in parallel but sacrifice that strict sequential conditioning — their outputs are less precise, and they can't exactly match an autoregressive model's distribution. Fast, but lower quality. DFlash sidesteps this by only using diffusion for drafting, not final output.

Part 3 — How the Drafter Accesses LLM Internals

This is the most technical part. How exactly does a diffusion draft model "look inside" the big model?

There are two mechanisms: feature extraction (pulling hidden states out) and KV injection (pushing them into every draft layer). Together they make the drafter's predictions much better than if it ran alone.

Mechanism 1 — Feature Extraction

During every big model forward pass, you can request the hidden states from every layer. This is just a flag you pass to the model. Once you have them, you sample uniformly across all layers (not just 3 like EAGLE-3), project them with a small linear layer, and fuse them into one conditioning vector:

python · extracting hidden states from a HuggingFace model

import torch

# Forward pass with output_hidden_states=True
# This returns the hidden state tensor at every layer — no extra compute!
outputs = big_model(
    input_ids=token_ids,                # [batch, seq_len]
    output_hidden_states=True          # the magic flag
)

# outputs.hidden_states is a tuple of tensors, one per layer:
# hidden_states[0]  → shape [batch, seq_len, 4096]  (embedding layer)
# hidden_states[8]  → shape [batch, seq_len, 4096]  (layer 8 = early)
# hidden_states[16] → shape [batch, seq_len, 4096]  (layer 16 = mid)
# hidden_states[32] → shape [batch, seq_len, 4096]  (layer 32 = final)

num_states = len(outputs.hidden_states)            # e.g. 33 (embedding + 32 layers)

# Sample 6 evenly spaced layers (DFlash samples more than EAGLE-3's 3)
# A fixed stride such as range(0, 33, 33 // 6) would return 7 indices, not 6
layer_ids = torch.linspace(1, num_states - 1, steps=6).round().long().tolist()
# for 33 states → [1, 7, 13, 20, 26, 32]
sampled = [outputs.hidden_states[i][0, -1, :] for i in layer_ids]   # last token position
# sampled: list of 6 tensors, each [4096]

stacked = torch.stack(sampled, dim=0)             # [6, 4096]
fused = feature_projector(stacked.flatten())      # learned FC: [6 * 4096] → [4096]
# fused is the distilled "essence" of what the big model knows about context
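One pitfall when picking layers "uniformly" is that a fixed integer stride can return more indices than intended. The helper below is a sketch of one way to get exactly `count` evenly spaced indices; the function name is invented here and the layers DFlash actually samples may differ.

```python
def evenly_spaced_layers(num_states, count, first=1):
    """Pick `count` indices spread evenly over first..num_states-1 (inclusive)."""
    last = num_states - 1
    if count == 1:
        return [last]
    step = (last - first) / (count - 1)
    return [round(first + i * step) for i in range(count)]


# 33 hidden states = embedding output + 32 transformer layers
naive = list(range(0, 33, 33 // 6))          # fixed stride of 5
picked = evenly_spaced_layers(num_states=33, count=6)

print(len(naive), naive)
print(len(picked), picked)
```

The stride version yields seven indices and never reaches the final layer, while the helper returns six and always includes it.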

Mechanism 2 — KV Cache Injection

In a standard transformer layer, attention is computed like this: Query vectors from the current token ask "what am I looking for?" and Key/Value vectors from all other tokens answer "here's what I have, and here's what I carry." The attention scores determine how much each token's Value contributes to the output.

DFlash injects the big model's fused features directly into the Key and Value projections of every draft layer. This means the drafter "attends to" the big model's knowledge at every layer of its computation — not just at the input as in EAGLE-3.

python · standard attention vs. DFlash KV injection

########################################
# Standard transformer attention
########################################
d = 64  # head dimension
Q = linear_q(hidden)       # [seq, d] — "what am I looking for?"
K = linear_k(hidden)       # [seq, d] — "what do I have as keys?"
V = linear_v(hidden)       # [seq, d] — "what info do I carry?"

scores = softmax(Q @ K.T / d**0.5)  # [seq, seq] attention weights
output = scores @ V                   # [seq, d] attended output

########################################
# DFlash attention with KV injection
########################################
# The big model's fused features are projected to K and V space
K_from_big = linear_k_inject(fused_features)   # [1, d]
V_from_big = linear_v_inject(fused_features)   # [1, d]

# Concatenate draft model's own KV with injected KV from big model
K_combined = torch.cat([linear_k(hidden), K_from_big], dim=0)  # [seq+1, d]
V_combined = torch.cat([linear_v(hidden), V_from_big], dim=0)  # [seq+1, d]

# Now every draft token attends to BOTH draft context AND big model knowledge
scores = softmax(Q @ K_combined.T / d**0.5)  # [seq, seq+1]
output = scores @ V_combined                    # [seq, d]
# This is done at EVERY draft layer — not just the first
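The pseudocode above leaves the projections undefined, so here is a runnable toy of the same idea in plain Python. It is a sketch under assumptions: single head, hand-picked numbers in place of learned projections, and one injected key/value row standing in for the projected target-model features.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of vectors."""
    d = len(Q[0])
    weights = [
        softmax([sum(q * k for q, k in zip(qv, kv)) / math.sqrt(d) for kv in K])
        for qv in Q
    ]
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return weights, output


# Toy draft-side projections: 3 draft positions, head dimension 2
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# One extra row standing in for the injected target-model features
K_from_big = [[2.0, 2.0]]
V_from_big = [[9.0, 9.0]]

weights, output = attention(Q, K + K_from_big, V + V_from_big)
print(len(weights), len(weights[0]))  # 3 query rows, 4 key columns
```

Each query row now has a fourth attention weight for the injected entry, so every draft position mixes in some of the target model's signal.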

Why non-causal attention enables parallel generation

Normally, attention is causal: token 5 can only see tokens 1–5, not 6 onward. This enforces left-to-right ordering, which is essential for autoregressive generation.

DFlash's draft model uses non-causal (bidirectional) attention among draft positions. All 16 [?] mask tokens see each other. This is what lets all 16 draft tokens be generated simultaneously — they inform each other's predictions during the single forward pass.

python · causal mask vs. bidirectional mask

# CAUSAL mask — standard autoregressive (EAGLE-3's decoder uses this)
# 1 = can attend, 0 = blocked
causal = torch.tensor([
    [1, 0, 0, 0],  # pos 0: sees only itself
    [1, 1, 0, 0],  # pos 1: sees 0,1
    [1, 1, 1, 0],  # pos 2: sees 0,1,2
    [1, 1, 1, 1],  # pos 3: sees all
])
# Each position must be generated one at a time — SEQUENTIAL

# BIDIRECTIONAL (non-causal) mask — DFlash draft model uses this
bidirectional = torch.tensor([
    [1, 1, 1, 1],  # draft[?] 0 sees all draft positions
    [1, 1, 1, 1],  # draft[?] 1 sees all draft positions
    [1, 1, 1, 1],  # draft[?] 2 sees all draft positions
    [1, 1, 1, 1],  # draft[?] 3 sees all draft positions
])
# All positions computed at once: ONE FORWARD PASS for every draft token (4 shown here, 16 in DFlash)
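The two masks can also be built programmatically for any number of draft positions. A small sketch in plain Python (nested lists instead of tensors) that counts how many position pairs are allowed to interact in each case.

```python
def causal_mask(n):
    """Row i may attend to columns 0..i only."""
    return [[1 if col <= row else 0 for col in range(n)] for row in range(n)]


def bidirectional_mask(n):
    """Every row may attend to every column."""
    return [[1] * n for _ in range(n)]


n = 16  # draft block size used in these notes
causal_visible = sum(map(sum, causal_mask(n)))                # n * (n + 1) / 2
bidirectional_visible = sum(map(sum, bidirectional_mask(n)))  # n * n

print(causal_visible, bidirectional_visible)
```

The count matters less than the dependency structure: with the causal mask, row i cannot be finalized before rows 0..i-1 exist as tokens, whereas the bidirectional mask has no such ordering.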

DFlash · 05

DFlash: Full Technical Architecture

DFlash combines three insights: (1) use diffusion for parallel draft generation, (2) condition the diffusion drafter deeply on the big model's multi-layer hidden states, (3) inject those features into every draft layer's KV cache so the drafter is continuously informed at every computation step.

The Complete Pipeline

DFlash — One Full Inference Cycle

Step 1: Big model forward pass (normal generation step, output_hidden_states=True)
→ Step 2: Extract hidden states from N layers uniformly (h₁, h₅, h₁₀, h₁₅, h₂₀, h₂₇)
→ Step 3: Feature fusion projection (N×d_model → d_model)
→ Step 4: Draft model: inject fused features into all K/V layers, generate 16 [?] tokens simultaneously (non-causal attention, one forward pass)
→ Step 5: Big model verifies 16 drafts in one pass (accept / reject per position)
→ Step 6: Repeat

Full Code Walkthrough (Annotated Pseudocode)

python · complete DFlash inference cycle with explanations

##########################################################
# STEP 1: Run big model forward pass
# Identical to normal generation — just add output_hidden_states=True
##########################################################
big_outputs = big_model(
    input_ids=current_tokens,           # [1, seq_len]
    use_cache=True,                   # reuse KV cache from previous tokens
    output_hidden_states=True          # ← DFlash needs these
)
# big_outputs.hidden_states: tuple of (num_layers+1) tensors
# each tensor shape: [1, seq_len, hidden_size]
last_token_logits = big_outputs.logits[0, -1]   # [vocab_size]

##########################################################
# STEP 2: Extract hidden states from multiple layers
##########################################################
all_hidden = big_outputs.hidden_states            # tuple of num_layers + 1 tensors (embeddings + each layer)
num_states = len(all_hidden)                      # e.g. 33 for a 32-layer model

# Sample 6 evenly spaced layers (more coverage than EAGLE-3's 3 layers)
extract_layers = torch.linspace(1, num_states - 1, steps=6).round().long().tolist()  # e.g. [1, 7, 13, 20, 26, 32]
sampled = torch.stack(
    [all_hidden[i][0, -1, :] for i in extract_layers], dim=0
)   # [num_extracted, hidden_size]  e.g. [6, 4096]

##########################################################
# STEP 3: Fuse into one conditioning vector
##########################################################
# feature_projector is a small learned linear layer: [num_extracted * d] → [d]
# It learns which layer's signals matter most for predicting future tokens
fused_ctx = feature_projector(sampled.flatten())   # [hidden_size]

##########################################################
# STEP 4: Draft model generates K tokens in ONE parallel pass
##########################################################
K = 16   # number of tokens to draft simultaneously

# Initialize K mask token embeddings — these are what get "denoised"
# Think of them as K [MASK] tokens, all unknown
mask_embeds = mask_embedding.expand(K, -1)        # [K, hidden_size]

# The draft model's forward pass:
# - Input: K mask token embeddings
# - KV injection: fused_ctx goes into Key and Value of EVERY draft layer
# - Attention mask: non-causal (all K positions attend to all K positions)
# - Output: logits for all K positions simultaneously
draft_logits = draft_model.forward(
    input_embeds=mask_embeds,           # [K, hidden_size]
    kv_injection_context=fused_ctx,     # injected into every layer's KV
    attention_mask="bidirectional"     # non-causal: all see all
)                                       # output: [K, vocab_size]

# Pick draft tokens greedily (argmax) for ALL K positions at once, no loop needed
draft_tokens = torch.argmax(draft_logits, dim=-1)   # [K]
# ↑ This is the entire architectural difference from EAGLE-3.
# EAGLE-3 needs a for-loop with K iterations. DFlash produces all K in one call.

##########################################################
# STEP 5: Big model verifies all K drafts in one pass
##########################################################
# Concatenate existing context with K draft tokens
verify_ids = torch.cat([current_tokens, draft_tokens.unsqueeze(0)], dim=1)
verify_out = big_model(verify_ids)                  # one forward pass
verify_logits = verify_out.logits[0]                # [seq_len+K, vocab_size]

##########################################################
# STEP 6: Accept/reject loop
##########################################################
accepted = []
seq_len = current_tokens.shape[1]

for i in range(K):
    # What would the big model have produced at position seq_len+i?
    big_model_token = torch.argmax(verify_logits[seq_len + i - 1])
    if big_model_token == draft_tokens[i]:
        accepted.append(draft_tokens[i])
    else:
        accepted.append(big_model_token)  # use big model's correction
        break                              # stop here — can't trust later drafts

# Typical result: 8–12 tokens accepted
# Net throughput: 8–12 tokens for cost of ~2 big model passes
# vs baseline: 8–12 tokens for cost of 8–12 big model passes
# Speedup: 4–6×
# torch.stack keeps the accepted 0-d tensors on their original device
current_tokens = torch.cat([current_tokens, torch.stack(accepted).unsqueeze(0)], dim=1)
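How many tokens a cycle yields depends on how often drafts are accepted. A rough way to reason about it is the standard geometric-series estimate below, which assumes every draft token is accepted independently with the same probability (a simplification; real acceptance varies by position and content) and counts the correction or bonus token that the verify pass contributes.

```python
def expected_tokens_per_cycle(accept_prob, num_draft):
    """Expected tokens emitted per draft-verify cycle.

    Assumes each draft token is accepted independently with probability
    accept_prob; equals 1 + p + p^2 + ... + p^num_draft.
    """
    if accept_prob == 1.0:
        return float(num_draft + 1)
    return (1 - accept_prob ** (num_draft + 1)) / (1 - accept_prob)


for p in (0.7, 0.8, 0.9):
    print(p, round(expected_tokens_per_cycle(p, 16), 2))
```

Under this model, drafting more tokens helps little unless the per-token acceptance rate is high, which is why drafter quality matters as much as block size.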

✅ The key difference from EAGLE-3 in one line

EAGLE-3: for step in range(K): draft_token = decoder(prev_token) — a loop with K sequential iterations.

DFlash: draft_tokens = draft_model(mask_embeds) — one call produces all K tokens. No loop. That's it.

DFlash · 06

Using the Same Model as Its Own Drafter

The DFlash draft model for Qwen3.6-35B-A3B is called z-lab/Qwen3.6-35B-A3B-DFlash. It uses the same Qwen3-style architecture as the target — same hidden dimensions, same tokenizer, same vocabulary. But it is not a full copy. Here is exactly what it is.

What is shared vs. uniquely trained

Target Model vs DFlash Drafter — Component Breakdown

TARGET: Qwen3.6-35B-A3B
Embedding Layer: ~1.5 GB · vocabulary → vectors
62 Transformer Layers: ~62 GB · the "brain"
LM Head (unembedding): ~1.5 GB · vectors → token probs
Total: ~35B params / ~70 GB

DRAFTER: Qwen3.6-35B-A3B-DFlash
Embedding Layer: SHARED (pointer, 0 extra RAM)
Feature Projector: ~0.2 GB · newly trained
3–5 Draft Transformer Layers: ~2.5 GB · newly trained, same architecture
LM Head (unembedding): SHARED (pointer, 0 extra RAM)
Total extra: ~4 GB

💡 Why sharing embeddings and LM Head saves significant memory

The embedding layer (token ID → 4096-dim vector) and LM Head (4096-dim vector → 100k-dim vocab probabilities) are each ~1.5 GB for a large model. When the draft model reuses them as pointers to the same memory, not copies, you save ~3 GB and also guarantee the draft model speaks the same "language" as the target — same token representations, same vocabulary logits.
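The size of an embedding or LM Head matrix follows from vocabulary size × hidden size × bytes per parameter. The sketch below plugs in the round numbers quoted above (100k vocabulary, 4096 dims) and assumes BF16 storage; with those inputs the result comes out below the ~1.5 GB estimate, so treat both as order-of-magnitude figures that depend on the real model's vocabulary and hidden size.

```python
def matrix_gb(vocab_size, hidden_dim, bytes_per_param=2):
    """Size in GB of one vocab_size x hidden_dim weight matrix (BF16 by default)."""
    return vocab_size * hidden_dim * bytes_per_param / 1e9


round_numbers = matrix_gb(100_000, 4096)


class Embedding:
    """Toy stand-in for a weight matrix; only records its shape."""

    def __init__(self, vocab_size, hidden_dim):
        self.shape = (vocab_size, hidden_dim)


target_embedding = Embedding(100_000, 4096)
draft_embedding = target_embedding  # shared: same object, no second copy

print(round(round_numbers, 2), draft_embedding is target_embedding)
```

Sharing by reference is what makes the saving exact: the drafter adds no bytes for these two matrices, whatever their true size.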

Why the same architecture type is used

The draft model uses Qwen3-style transformer layers because:

Hidden dimensions match: the target model outputs 4096-dim hidden states; the draft model's layers also expect 4096-dim inputs. No shape adapters needed.
Weight initialization: the draft model's layers can be initialized from the target model's early layers before fine-tuning. This gives a much better starting point than random initialization, making training faster and the final model more accurate.
Tokenizer is shared: same vocabulary, same byte-pair encoding. Draft tokens are directly comparable to target tokens with no conversion step.

GPU Memory Breakdown

Let's look at what actually occupies GPU memory when running DFlash with Qwen3.5-27B on a single A100 80GB.

Component	❌ Without DFlash	✅ With DFlash
Target model weights (BF16)	~56 GB	~56 GB
KV cache	~8 GB (8k context)	~11 GB (target + draft)
Draft model extra params	0 GB	~4 GB
Activations / overhead	~4 GB	~4 GB
Throughput (single user)	~20 tok/s	~80–100 tok/s
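Adding up the two budgets shows how much headroom is left on an 80 GB card. The figures are the estimates from the breakdown above, not measurements.

```python
without_dflash = {"weights": 56, "kv_cache": 8, "draft": 0, "overhead": 4}
with_dflash = {"weights": 56, "kv_cache": 11, "draft": 4, "overhead": 4}
gpu_capacity_gb = 80  # single A100 80GB

total_without = sum(without_dflash.values())
total_with = sum(with_dflash.values())
extra = total_with - total_without
headroom = gpu_capacity_gb - total_with

print(total_without, total_with, extra, headroom)
```

The drafter and its extra cache fit, but only just; longer contexts or larger batches would eat into the remaining few GB quickly.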

Compute cost breakdown per cycle

Time budget for generating 12 tokens (DFlash vs baseline)

BASELINE: 12 tokens, no DFlash
12 × big model passes	12.0 units
Total	12.0 units of compute

DFLASH: 16 drafted, 12 accepted
1× big model (step N)	1.0 unit
Feature extract + fuse	0.05 unit
Draft model (16 tokens, 1 pass)	0.2 units
1× big model verify (16 tokens)	1.1 units
Total	~2.35 units → 12 tokens → 5.1× faster
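The speedup figure is just tokens produced divided by compute units spent, so it is easy to recompute for other acceptance counts. The unit costs below are the estimates from the budget above; the 8-token case is an added what-if, not a measured result.

```python
cycle_cost = {
    "big_model_step": 1.0,
    "feature_extract_fuse": 0.05,
    "draft_model_pass": 0.2,
    "big_model_verify": 1.1,
}
total_units = sum(cycle_cost.values())


def speedup(tokens_accepted, units_spent):
    # Baseline spends 1 unit per token, so speedup = tokens / units.
    return tokens_accepted / units_spent


best_case = speedup(12, total_units)
weaker_case = speedup(8, total_units)  # hypothetical: only 8 tokens accepted

print(round(total_units, 2), round(best_case, 1), round(weaker_case, 1))
```

Because the cycle cost is fixed, the speedup scales linearly with the number of accepted tokens.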

What happens to the KV cache when drafts are rejected

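When the big model rejects a draft token, the KV cache entries for that position and for every draft position after it are discarded; they're never committed to the main context. The KV cache only grows by the tokens that are accepted, so the cached state matches what normal generation would have produced for the same tokens (up to small floating-point differences between batched and single-token kernels). The "lossless" guarantee itself comes from the verification rule: only tokens the big model would have chosen are kept.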
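The discard step amounts to truncating the cache back to the last accepted position. A toy sketch: `ToyKVCache` is invented for illustration and stores strings instead of key/value tensors; real engines typically manage this with block tables rather than a Python list.

```python
class ToyKVCache:
    """Per-position key/value store; strings stand in for tensors."""

    def __init__(self):
        self.entries = []  # one (key, value) pair per position

    def append(self, key, value):
        self.entries.append((key, value))

    def truncate(self, length):
        # Drop everything written at or after `length`.
        del self.entries[length:]


cache = ToyKVCache()
for pos in range(10):                # 10 committed context tokens
    cache.append(f"k{pos}", f"v{pos}")

committed = len(cache.entries)
for i in range(16):                  # the verify pass writes 16 draft positions
    cache.append(f"draft_k{i}", f"draft_v{i}")

num_accepted = 3
cache.truncate(committed + num_accepted)

print(len(cache.entries), cache.entries[-1])
```

Only the three accepted draft positions survive; the other thirteen leave no trace in the cache.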

The Qwen3.6 DeltaNet (GDN) Complication

Qwen3.6 mixes Gated Delta Networks (a linear recurrent attention mechanism) with standard full attention. GDN layers maintain a running recurrent state, like a hidden memory that updates as each token is processed. This breaks the simple per-position rollback that standard speculative decoding relies on.

python · the GDN rollback problem and solution

# Standard transformer KV cache — rolling back is trivial:
kv_cache[position] = (K_vector, V_vector)   # stored separately per position
# If rejected: just don't include that position in future attention. Easy.

########################################
# GDN / DeltaNet — stateful, hard to roll back:
########################################
# GDN maintains a recurrent state S that updates multiplicatively:
S_new = S_old * gate + delta_value   # S encodes compressed history
# After processing a draft token: S_old is gone, replaced by S_new
# If we reject that token, we can't undo S_new → S_old automatically
# (matrix multiplication is not trivially reversible)

########################################
# SGLang's solution: extra_buffer strategy
########################################
# Before starting the speculative draft phase:
S_checkpoint = copy(S)                # save current recurrent state

# Run draft model, run verification...
num_accepted = verify_and_count(draft_tokens)

# If any tokens rejected:
if num_accepted < K:
    S = S_checkpoint                    # restore clean state
    S = advance_state(S, accepted_tokens)  # replay only accepted tokens
# Result: S is exactly what it would have been without speculation

# This is why you need --mamba-scheduler-strategy extra_buffer in SGLang
# for Qwen3.6 models with DFlash
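The checkpoint-and-replay idea can be checked end to end with a toy recurrence. This is a sketch, not SGLang's implementation: the state is a single float instead of a matrix, the gate is a fixed constant, and `advance` is a made-up helper.

```python
def advance(state, tokens, gate=0.5):
    """Toy scalar recurrence: S_new = S_old * gate + token_value."""
    for value in tokens:
        state = state * gate + value
    return state


context = [1.0, 2.0, 3.0]
draft = [4.0, 5.0, 6.0, 7.0]
num_accepted = 2

state = advance(0.0, context)
checkpoint = state                   # save before speculation

state = advance(state, draft)        # verification consumes all 4 drafts
polluted = state                     # now includes rejected tokens

# Rejection: restore the checkpoint, then replay only the accepted tokens.
state = advance(checkpoint, draft[:num_accepted])

# What the state would be without any speculation at all.
reference = advance(0.0, context + draft[:num_accepted])

print(polluted, state, reference)
```

The restored state equals the reference exactly, while the polluted one does not; the cost is one saved copy of the state plus a short replay.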

Results · 07

Benchmarks

Method	Speedup
Autoregressive baseline	1×
EAGLE-3 (state of the art before DFlash)	3–4×
DFlash on code / math / structured text	up to 6.2×
DFlash with thinking/reasoning mode	≈4.5×

	EAGLE-3	DFlash
Draft method	Sequential (K steps)	Parallel (1 pass)
Draft model depth	1 layer (forced)	3–5 layers
Tokens drafted per cycle	5–7	15–16
Max speedup over baseline	3–4×	6.2×
vs EAGLE-3	—	2.5× faster
Extra VRAM needed	~280 MB	~4 GB
Output quality	Lossless	Lossless

Practical · 08

Supported Models

All DFlash draft models are published by z-lab at https://huggingface.co/z-lab

Target Model	Draft Model	vLLM	SGLang	MLX
Qwen/Qwen3-8B	`z-lab/Qwen3-8B-DFlash-b16`	✓	✓	✓
Qwen/Qwen3.5-27B	`z-lab/Qwen3.5-27B-DFlash`	✓	✓	✓
Qwen/Qwen3.5-35B-A3B	`z-lab/Qwen3.5-35B-A3B-DFlash`	✓	✓	Exp
Qwen/Qwen3.6-35B-A3B	`z-lab/Qwen3.6-35B-A3B-DFlash`	Patched	PR branch	✗
Qwen/Qwen3-Coder-30B-A3B	`z-lab/Qwen3-Coder-30B-A3B-DFlash`	✓	✓	✗
meta-llama/Llama-3.1 family	`z-lab/Llama-3.1-*-DFlash`	✓	✓	✗
Kimi-K2.5 (coming soon)	Preview on z-lab HF	Soon	Soon	✗

Practical · 09

How to Run DFlash

Option A — vLLM stable (Qwen3.5-27B, easiest)

shell · install + launch

# Install (--torch-backend is a uv option, so install through uv)
uv pip install -U vllm --torch-backend=auto

# Launch OpenAI-compatible server on localhost:8000
vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method": "dflash",
    "model": "z-lab/Qwen3.5-27B-DFlash",
    "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768 \
  --speculative-disable-by-batch-size 32   # auto-disable at high concurrency

Option B — SGLang (Qwen3.5-35B-A3B, best for agents)

shell · sglang launch

export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1

# --mamba-scheduler-strategy extra_buffer enables GDN state rollback
# (a comment after a trailing backslash would break the line continuation)
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-35B-A3B \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --attention-backend trtllm_mha \
  --speculative-draft-attention-backend fa4 \
  --mem-fraction-static 0.75 \
  --mamba-scheduler-strategy extra_buffer \
  --trust-remote-code

Option C — Qwen3.6 (patched vLLM)

shell · qwen3.6 + dflash

# Install patched build until main vLLM release includes it
uv pip install vllm
uv pip install -U --torch-backend=auto \
  "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/40898/head"

vllm serve Qwen/Qwen3.6-35B-A3B \
  --speculative-config '{"method": "dflash",
    "model": "z-lab/Qwen3.6-35B-A3B-DFlash",
    "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768

Option D — HuggingFace Transformers (quick experiments only)

python · transformers backend

from transformers import AutoModel, AutoModelForCausalLM

# Load draft model (shares embedding + LM head with target automatically)
draft = AutoModel.from_pretrained(
    "z-lab/Qwen3-8B-DFlash-b16",
    trust_remote_code=True, dtype="auto", device_map="cuda:0"
).eval()

# Load target model
target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", dtype="auto", device_map="cuda:0"
).eval()

Calling the server (after vLLM or SGLang launch)

python · openai-compatible client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Completely normal OpenAI API call — DFlash is transparent
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B",
    messages=[{"role": "user", "content": "Write a binary search in Python"}],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
# Same output distribution as non-DFlash decoding. It just arrives faster.

Practical · 10

When to Use DFlash

	DFlash	EAGLE-3	Standard decoding
Concurrent users	1–32	1–50	Any
Output length	Long (code, essays)	Medium-long	Any
Content type	Code, math, structured	Most types	Any
Extra VRAM	~4 GB more	~280 MB more	(not listed)

💡 Decision rule from the community

if batch_size > 32 or output_tokens < 50 → standard decoding
elif DFlash checkpoint exists and acceptance_rate > 0.7 → DFlash
else → EAGLE-3
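The rule above translates directly into a small function. The thresholds are the ones quoted in the rule; the function and argument names are invented here, and the acceptance rate would have to come from your own measurements.

```python
def choose_decoder(batch_size, output_tokens, has_dflash_checkpoint, acceptance_rate):
    """Return 'standard', 'dflash', or 'eagle3' following the community rule."""
    if batch_size > 32 or output_tokens < 50:
        return "standard"
    if has_dflash_checkpoint and acceptance_rate > 0.7:
        return "dflash"
    return "eagle3"


print(choose_decoder(1, 800, True, 0.85))    # long single-user generation
print(choose_decoder(64, 800, True, 0.85))   # high concurrency
print(choose_decoder(4, 800, False, 0.85))   # no DFlash checkpoint available
```

The order of the checks matters: the batch-size and short-output conditions win even when a DFlash checkpoint with a high acceptance rate exists.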

Everything in One Place

LLMs are slow because generation is one token per full model pass — sequential, can't skip ahead.

Draft models solve this by guessing ahead and verifying all guesses in one parallel big-model pass.

EAGLE-3 attaches a tiny 1-layer decoder to the big model, fusing hidden states from 3 layers (early/mid/late) via a learned FC layer. It drafts autoregressively — one token per step, forced to stay at 1 layer. Ceiling: ~4×.

Diffusion means generating all token positions simultaneously (one forward pass) instead of sequentially. Fast, but lower quality when used alone as the main model.

DFlash uses a 3–5 layer diffusion draft model conditioned on the big model's hidden states from many layers. Features are injected into every draft layer's Key-Value cache. Non-causal (bidirectional) attention lets all 16 draft tokens be generated simultaneously in one pass. Big model still verifies everything — output quality is identical. Speedup: up to 6.2×, 2.5× faster than EAGLE-3.

The draft model shares embedding and LM Head weights with the target (zero extra RAM for those). Its 3–5 new transformer layers use the same architecture as the target so hidden dimensions match and initialization is clean. Total extra VRAM: ~4 GB.

Run it with vLLM (stable, Qwen3.5 and older) or SGLang (recommended for MoE and agents). For Qwen3.6 you need a patched vLLM for now. All draft models are at https://huggingface.co/z-lab.
