Skip to content

feat: EAGLE-3 post-norm support#4

Open
Dogacel wants to merge 7 commits into
ruixiang63:eagle3-adapt-new-archfrom
Dogacel:post-norm
Open

feat: EAGLE-3 post-norm support#4
Dogacel wants to merge 7 commits into
ruixiang63:eagle3-adapt-new-archfrom
Dogacel:post-norm

Conversation

@Dogacel
Copy link
Copy Markdown

@Dogacel Dogacel commented May 19, 2026

Overview

Support the new post-norm architecture as described in our paper & blog:

https://x.com/dogacel0/status/2054200111043949012?s=20

Additional information

Use the models to test,

  • llama 3.1 8B: Dogacel/paper-attention-drift-llama-post
  • gpt-oss 20B: Dogacel/specdrift-gpt-oss-20b-eagle3
  • gpt-oss 120B: Dogacel/specdrift-gpt-oss-120b-eagle3

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, used to adapt changes from SGLang diff -> llama.cpp and write test scripts. All code written by AI is manually inspected.

@Dogacel
Copy link
Copy Markdown
Author

Dogacel commented May 21, 2026

@ruixiang63 hi, can you help me merge this into your base? I don't have access to a DGX Spark yet, would be glad if you can test my checkpoints too! I've only tested on H100s.

@ruixiang63 ruixiang63 force-pushed the eagle3-adapt-new-arch branch from 9d2fdcf to e2ad542 Compare May 22, 2026 10:29
@tnhnyzc
Copy link
Copy Markdown

tnhnyzc commented May 25, 2026

I tested this branch locally with Dogacel/specdrift-gpt-oss-120b-eagle3 and wanted to share two small findings in case they are useful. I may be missing some context from the ongoing refactor, so please treat this as a repro/data point rather than a requested design direction.

Setup:

  • branch: Dogacel:post-norm at d1b50c76a
  • target: gpt-oss-120b-F16.gguf
  • draft: Dogacel/specdrift-gpt-oss-120b-eagle3, converted to BF16 GGUF
  • backend: Metal / Apple M3 Max
  • server mode: --spec-type draft-eagle3

1. SpecDrift GPT-OSS uses 5 extract layers

The draft config has:

"eagle_aux_hidden_state_layer_ids": [1, 9, 17, 25, 33]

The converted GGUF has fc.weight shaped as 5 * 2880 -> 2880 and five fc_norm.*.weight tensors. The current exact-3 checks reject this model even though the rest of the encoder path already appears to handle n_extract_layers dynamically.

Locally I changed the guard from n_extract_layers != 3 to extract_layers != nullptr && n_extract_layers > 0, and made the logging dynamic.

2. Target layer-input extraction appears to need synchronization before EAGLE3 consumes it

After relaxing the 3-layer guard, the model loaded and ran, but acceptance was extremely low until I synchronized the target context before reading extracted target layer inputs.

The relevant path appears to be:

  • extract_layer_inputs() copies layer inputs with ggml_backend_tensor_get_async(...)
  • common_speculative_impl_draft_eagle3::process() immediately calls llama_get_output_layer_inp(...) and copies those buffers into the EAGLE3 feature buffer

As a local correctness test, I added:

// Target layer inputs are copied out asynchronously after llama_decode(ctx_tgt).
// Make them host-visible before feeding them to the EAGLE3 encoder.
llama_synchronize(ctx_tgt);

right before the loop that reads llama_get_output_layer_inp(...) in common/speculative.cpp.

That changed acceptance/perf substantially on the same prompts:

config before sync after sync
n_max=1, p_min=0.75, short prompt 47.9 t/s, 10/34 accepted 66.2 t/s, 43/45 accepted
n_max=4, p_min=0.5, short prompt 33.9 t/s, 8/63 accepted 77.9 t/s, 50/54 accepted
n_max=4, p_min=0.75, longer prompt not tested before sync 62.0 t/s, 96/104 accepted
no-spec baseline, same longer prompt 55.3 t/s 55.3 t/s

The exact location of the sync may not be the right final abstraction; it may belong closer to the layer-input extraction API or speculative process boundary. But the acceptance jump made it look like the current branch can read stale/not-yet-visible target features in server mode.

Minimal local diff shape:

- if (n_extract_layers != 3) {
-     throw std::runtime_error("draft model is not eagle3 (expected 3 extract layers, got " +
+ if (extract_layers == nullptr || n_extract_layers == 0) {
+     throw std::runtime_error("draft model is not eagle3 (invalid extract layer count " +
      std::to_string(n_extract_layers) + ")");
  }

and the sync before building features_buf from llama_get_output_layer_inp(...).

Hope this helps; happy to split this into a tiny PR against post-norm if that is easier to inspect.

@Dogacel
Copy link
Copy Markdown
Author

Dogacel commented May 25, 2026

I tested this branch locally with Dogacel/specdrift-gpt-oss-120b-eagle3 and wanted to share two small findings in case they are useful. I may be missing some context from the ongoing refactor, so please treat this as a repro/data point rather than a requested design direction.

Setup:

* branch: `Dogacel:post-norm` at `d1b50c76a`

* target: `gpt-oss-120b-F16.gguf`

* draft: `Dogacel/specdrift-gpt-oss-120b-eagle3`, converted to BF16 GGUF

* backend: Metal / Apple M3 Max

* server mode: `--spec-type draft-eagle3`

1. SpecDrift GPT-OSS uses 5 extract layers

The draft config has:

"eagle_aux_hidden_state_layer_ids": [1, 9, 17, 25, 33]

The converted GGUF has fc.weight shaped as 5 * 2880 -> 2880 and five fc_norm.*.weight tensors. The current exact-3 checks reject this model even though the rest of the encoder path already appears to handle n_extract_layers dynamically.

Locally I changed the guard from n_extract_layers != 3 to extract_layers != nullptr && n_extract_layers > 0, and made the logging dynamic.

2. Target layer-input extraction appears to need synchronization before EAGLE3 consumes it

After relaxing the 3-layer guard, the model loaded and ran, but acceptance was extremely low until I synchronized the target context before reading extracted target layer inputs.

The relevant path appears to be:

* `extract_layer_inputs()` copies layer inputs with `ggml_backend_tensor_get_async(...)`

* `common_speculative_impl_draft_eagle3::process()` immediately calls `llama_get_output_layer_inp(...)` and copies those buffers into the EAGLE3 feature buffer

As a local correctness test, I added:

// Target layer inputs are copied out asynchronously after llama_decode(ctx_tgt).
// Make them host-visible before feeding them to the EAGLE3 encoder.
llama_synchronize(ctx_tgt);

right before the loop that reads llama_get_output_layer_inp(...) in common/speculative.cpp.

That changed acceptance/perf substantially on the same prompts:
config before sync after sync
n_max=1, p_min=0.75, short prompt 47.9 t/s, 10/34 accepted 66.2 t/s, 43/45 accepted
n_max=4, p_min=0.5, short prompt 33.9 t/s, 8/63 accepted 77.9 t/s, 50/54 accepted
n_max=4, p_min=0.75, longer prompt not tested before sync 62.0 t/s, 96/104 accepted
no-spec baseline, same longer prompt 55.3 t/s 55.3 t/s

The exact location of the sync may not be the right final abstraction; it may belong closer to the layer-input extraction API or speculative process boundary. But the acceptance jump made it look like the current branch can read stale/not-yet-visible target features in server mode.

Minimal local diff shape:

- if (n_extract_layers != 3) {
-     throw std::runtime_error("draft model is not eagle3 (expected 3 extract layers, got " +
+ if (extract_layers == nullptr || n_extract_layers == 0) {
+     throw std::runtime_error("draft model is not eagle3 (invalid extract layer count " +
      std::to_string(n_extract_layers) + ")");
  }

and the sync before building features_buf from llama_get_output_layer_inp(...).

Hope this helps; happy to split this into a tiny PR against post-norm if that is easier to inspect.

Nice thanks! I need to rebase my changes, maybe upstream has fixed those issues, if not I will use your fix.

@ruixiang63
Copy link
Copy Markdown
Owner

@tnhnyzc Regarding "Target layer-input extraction appears to need synchronization before EAGLE3 consumes it" — did you observe the same issue on the original EAGLE3 PR (ggml-org#18039), or only on this one? My guess is that post-norm changes the graph topology, so the ggml_set_sync() barrier from the original PR no longer lands on the correct split boundary. If that's the case, this would be a sync issue specific to this PR rather than a regression of the original behavior.

@Dogacel Thanks for the PR! I think this work deserves its own separate PR upstream once the original EAGLE3 PR lands, since it introduces some breaking changes and a variant of eagle3.
I would like to keep the original PR scoped to native EAGLE3 support so it stays easy to review and doesn't add extra burden on the llama.cpp maintainers.
I'd encourage you to open this directly against llama.cpp once ggml-org#18039 is merged. (though I'm not sure of the timeline, as Georgi hasn't had much bandwidth to review it yet.)
Thanks again for the PR. I've left some comments on the code as well.

@tnhnyzc
Copy link
Copy Markdown

tnhnyzc commented May 26, 2026

@ruixiang63 I tested this against the original EAGLE3 PR path with a regular GPT-OSS EAGLE3 draft.

Setup:

Result:

  • original path, no explicit sync before reading target layer inputs:
    draft acceptance = 0.06383, eval ~= 41.95 tok/s
  • after adding ctx->synchronize() inside llama_context::get_output_layer_inp():
    draft acceptance = 0.89474, eval ~= 50.51 tok/s

So at least on Metal this does not look specific to the post-norm PR. The original EAGLE3 layer-input handoff also appears to expose async-copied host buffers before completion.

Separate converter note: the PR converter currently derives GPT-OSS extract layers as [2,18,33], but this draft’s HF config explicitly provides [1,17,33] under eagle_config.eagle_aux_hidden_state_layer_ids. Preserving that config value also works locally; with [1,17,33], I saw the same sync pattern:

  • no sync: acceptance = 0.10526, eval ~= 41.98 tok/s
  • sync: acceptance = 0.91209, eval ~= 53.81 tok/s

@ruixiang63 ruixiang63 self-requested a review May 26, 2026 22:15
Comment thread common/speculative.cpp
n_extract_layers = llama_model_n_target_extract_layers(model_dft);
if (n_extract_layers != 3) {
throw std::runtime_error("draft model is not eagle3 (expected 3 extract layers, got " +
if (!(extract_layers != nullptr && n_extract_layers > 0)) {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this new eagle3 variant extracts 5 layers embedding from target model instead of 3. So here may be better to use n_extract_layers != 3 || n_extract_layers != 5.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, there is no explicit assumption, the model makers can choose how many layers they want, I think it should be dynamic.

Comment thread conversion/__init__.py Outdated
"Llama4ForConditionalGeneration": "llama",
"LlamaBidirectionalModel": "llama",
"LlamaForCausalLM": "llama",
"LlamaForCausalLMEagle3": "llama",
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upstream changes already contains LlamaForCausalLMEagle3

Comment thread conversion/llama.py
target_num_layers = target_config["num_hidden_layers"]
extract_layers = [2, target_num_layers // 2, target_num_layers - 3]
logger.info(f"EAGLE-3: extract_layers = {extract_layers} (target model has {target_num_layers} layers)")
cfg_extract = (eagle3_raw_config.get("eagle_config") or {}).get("eagle_aux_hidden_state_layer_ids")
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn’t use eagle_aux_hidden_state_layer_ids here because different Eagle3 checkpoints interpret layer_ids differently: some expect them to be set before extracting the layers, while others expect them afterward, which can sometimes require adding +1.
To avoid this ambiguity, I decided to compute the values manually based on the original paper instead of relying on the Eagle3 config.
So you may follow the same strategy here.

Copy link
Copy Markdown
Author

@Dogacel Dogacel May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should rely on eagle_aux_hidden_state_layer_ids because there is no explicit formula beyond the standard EAGLE-3.

If you can point out what models require +1 and what models' dont, maybe we can generalize the logic. I don't like hardcoding the layer ids.

Comment thread src/models/eagle3.cpp
if (target_extract_layers.size() != 3) {
throw std::runtime_error("EAGLE3 requires exactly 3 entries in 'extract_layers'");
if (target_extract_layers.size() == 0) {
throw std::runtime_error("EAGLE3 requires at least 1 entry in 'extract_layers'");
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eagle3 needs 3 or 5. "at least 1 " is misleading.

Comment thread src/llama-context.cpp
}

extract_layer_inputs(res);
extract_layer_inputs(res, (uint32_t) n_tokens_prev, n_tokens_all);
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This complicates the logic, so we should keep the original API parameters. ubatch is already handled correctly by this function.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my case it was causing segfault, I will share the reproduction script.

Comment thread src/llama-context.cpp
}

void llama_context::extract_layer_inputs(const llm_graph_result * res) {
void llama_context::extract_layer_inputs(const llm_graph_result * res, uint32_t n_tokens_prev, uint32_t n_tokens_all) {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is unnecessary.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is necessary unless my launch config is wrong (and I assume most people would set it up similarly.

Try to comment this change and reproduce a segfault using,

./build/bin/llama-server -m $TARGET -md $DRAFTER --spec-type draft-eagle3 --spec-draft-n-max 4 --spec-draft-p-min 0.25 -np 1 -c 4096 --port 8080 -ngl 99 -fa on --jinja --no-mmap

llama-benchy --base-url http://localhost:8080/v1 --tg 128 --model openai/gpt-oss-120b

Comment thread src/llama-context.h
// the buffer holds the whole batch ([n_tokens_all] rows); each ubatch is
// written at its running offset n_tokens_prev so multi-ubatch decodes
// accumulate instead of overwriting (mirrors the logits/pre-norm paths).
void extract_layer_inputs(const llm_graph_result * res, uint32_t n_tokens_prev, uint32_t n_tokens_all);
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same unnecessary.

Comment thread tools/server/server-context.cpp Outdated

params_base = params;

const bool spec_eagle3 = std::find(params_base.speculative.types.begin(), params_base.speculative.types.end(),
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why need this? Eagle3 already supportws np > 1 and doesn't need unified KV cache.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had crashes in long-context with ubatch, let me share the script to reproduce bugs I have faced.

cparams.ctx_type = LLAMA_CONTEXT_TYPE_MTP;
}

if (spec_eagle3 && cparams.n_ubatch < cparams.n_batch) {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ubatch and n_batch in eagle3 should be correctly supported. Why this?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try removing this line and see the abort,

./build/bin/llama-server -m "$TARGET" -md "$DRAFT" \
    --spec-type draft-eagle3 --spec-draft-n-max 8 --spec-draft-p-min 0.5 \
    -np 1 -c 4096 --port 8080 -ngl 99 -fa on --jinja --no-mmap 2>&1 | tee /tmp/repro.log

  llama-benchy --base-url http://localhost:8080/v1 --pp 2048 --tg 8 --concurrency 1 --model openai/gpt-oss-120b

This hits:

    // micro-batching is not possible for non-causal encoding, so we process the batch in a single shot
    GGML_ASSERT(cparams.n_ubatch >= n_tokens && "encoder requires n_ubatch >= n_tokens");

@Dogacel
Copy link
Copy Markdown
Author

Dogacel commented May 27, 2026

@ruixiang63 pushed new changes, LMK if you have any issues.

@ruixiang63 ruixiang63 force-pushed the eagle3-adapt-new-arch branch 2 times, most recently from cf1aab7 to e8ddf01 Compare May 27, 2026 16:53
@ruixiang63
Copy link
Copy Markdown
Owner

@tnhnyzc I wasn’t able to reproduce the issue on DGX Spark, but I think it makes sense to add a sync after embedding extraction. I added you as a co-author on commit acc31b1. Could you please verify if it fixes your issue?

@Dogacel Thanks for reporting this. I was able to reproduce the bug, and it appears to be caused by llama_encode not supporting ubatch and extract_layer_inputs not handling the ubatch correctly. I pushed a fix in commit e8ddf01 and added you as a co-author. I also validated it locally, and it works well for both np > 1 and ub < b. Please verify.

@tnhnyzc
Copy link
Copy Markdown

tnhnyzc commented May 27, 2026

@ruixiang63 Yep, verified on my Metal/macOS setup and it fixes the issue. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants