feat: EAGLE-3 post-norm support by Dogacel · Pull Request #4 · ruixiang63/llama.cpp

Dogacel · 2026-05-19T02:18:39Z

Overview

Support the new post-norm architecture as described in our paper & blog:

https://x.com/dogacel0/status/2054200111043949012?s=20

Additional information

Use the models to test,

llama 3.1 8B: Dogacel/paper-attention-drift-llama-post
gpt-oss 20B: Dogacel/specdrift-gpt-oss-20b-eagle3
gpt-oss 120B: Dogacel/specdrift-gpt-oss-120b-eagle3

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, used to adapt changes from SGLang diff -> llama.cpp and write test scripts. All code written by AI is manually inspected.

Dogacel · 2026-05-21T15:57:56Z

@ruixiang63 hi, can you help me merge this into your base? I don't have access to a DGX Spark yet, would be glad if you can test my checkpoints too! I've only tested on H100s.

tnhnyzc · 2026-05-25T10:15:19Z

I tested this branch locally with Dogacel/specdrift-gpt-oss-120b-eagle3 and wanted to share two small findings in case they are useful. I may be missing some context from the ongoing refactor, so please treat this as a repro/data point rather than a requested design direction.

Setup:

branch: Dogacel:post-norm at d1b50c76a
target: gpt-oss-120b-F16.gguf
draft: Dogacel/specdrift-gpt-oss-120b-eagle3, converted to BF16 GGUF
backend: Metal / Apple M3 Max
server mode: --spec-type draft-eagle3

1. SpecDrift GPT-OSS uses 5 extract layers

The draft config has:

"eagle_aux_hidden_state_layer_ids": [1, 9, 17, 25, 33]

The converted GGUF has fc.weight shaped as 5 * 2880 -> 2880 and five fc_norm.*.weight tensors. The current exact-3 checks reject this model even though the rest of the encoder path already appears to handle n_extract_layers dynamically.

Locally I changed the guard from n_extract_layers != 3 to extract_layers != nullptr && n_extract_layers > 0, and made the logging dynamic.

2. Target layer-input extraction appears to need synchronization before EAGLE3 consumes it

After relaxing the 3-layer guard, the model loaded and ran, but acceptance was extremely low until I synchronized the target context before reading extracted target layer inputs.

The relevant path appears to be:

extract_layer_inputs() copies layer inputs with ggml_backend_tensor_get_async(...)
common_speculative_impl_draft_eagle3::process() immediately calls llama_get_output_layer_inp(...) and copies those buffers into the EAGLE3 feature buffer

As a local correctness test, I added:

// Target layer inputs are copied out asynchronously after llama_decode(ctx_tgt).
// Make them host-visible before feeding them to the EAGLE3 encoder.
llama_synchronize(ctx_tgt);

right before the loop that reads llama_get_output_layer_inp(...) in common/speculative.cpp.

That changed acceptance/perf substantially on the same prompts:

config	before sync	after sync
`n_max=1, p_min=0.75`, short prompt	47.9 t/s, 10/34 accepted	66.2 t/s, 43/45 accepted
`n_max=4, p_min=0.5`, short prompt	33.9 t/s, 8/63 accepted	77.9 t/s, 50/54 accepted
`n_max=4, p_min=0.75`, longer prompt	not tested before sync	62.0 t/s, 96/104 accepted
no-spec baseline, same longer prompt	55.3 t/s	55.3 t/s

The exact location of the sync may not be the right final abstraction; it may belong closer to the layer-input extraction API or speculative process boundary. But the acceptance jump made it look like the current branch can read stale/not-yet-visible target features in server mode.

Minimal local diff shape:

- if (n_extract_layers != 3) {
-     throw std::runtime_error("draft model is not eagle3 (expected 3 extract layers, got " +
+ if (extract_layers == nullptr || n_extract_layers == 0) {
+     throw std::runtime_error("draft model is not eagle3 (invalid extract layer count " +
      std::to_string(n_extract_layers) + ")");
  }

and the sync before building features_buf from llama_get_output_layer_inp(...).

Hope this helps; happy to split this into a tiny PR against post-norm if that is easier to inspect.

Dogacel · 2026-05-25T15:19:43Z

I tested this branch locally with Dogacel/specdrift-gpt-oss-120b-eagle3 and wanted to share two small findings in case they are useful. I may be missing some context from the ongoing refactor, so please treat this as a repro/data point rather than a requested design direction.

Setup:
* branch: `Dogacel:post-norm` at `d1b50c76a`

* target: `gpt-oss-120b-F16.gguf`

* draft: `Dogacel/specdrift-gpt-oss-120b-eagle3`, converted to BF16 GGUF

* backend: Metal / Apple M3 Max

* server mode: `--spec-type draft-eagle3`
1. SpecDrift GPT-OSS uses 5 extract layers

The draft config has:
"eagle_aux_hidden_state_layer_ids": [1, 9, 17, 25, 33]
The converted GGUF has fc.weight shaped as 5 * 2880 -> 2880 and five fc_norm.*.weight tensors. The current exact-3 checks reject this model even though the rest of the encoder path already appears to handle n_extract_layers dynamically.

Locally I changed the guard from n_extract_layers != 3 to extract_layers != nullptr && n_extract_layers > 0, and made the logging dynamic.

2. Target layer-input extraction appears to need synchronization before EAGLE3 consumes it

After relaxing the 3-layer guard, the model loaded and ran, but acceptance was extremely low until I synchronized the target context before reading extracted target layer inputs.

The relevant path appears to be:
* `extract_layer_inputs()` copies layer inputs with `ggml_backend_tensor_get_async(...)`

* `common_speculative_impl_draft_eagle3::process()` immediately calls `llama_get_output_layer_inp(...)` and copies those buffers into the EAGLE3 feature buffer
As a local correctness test, I added:
// Target layer inputs are copied out asynchronously after llama_decode(ctx_tgt).
// Make them host-visible before feeding them to the EAGLE3 encoder.
llama_synchronize(ctx_tgt);
right before the loop that reads llama_get_output_layer_inp(...) in common/speculative.cpp.

That changed acceptance/perf substantially on the same prompts:
config before sync after sync
n_max=1, p_min=0.75, short prompt 47.9 t/s, 10/34 accepted 66.2 t/s, 43/45 accepted
n_max=4, p_min=0.5, short prompt 33.9 t/s, 8/63 accepted 77.9 t/s, 50/54 accepted
n_max=4, p_min=0.75, longer prompt not tested before sync 62.0 t/s, 96/104 accepted
no-spec baseline, same longer prompt 55.3 t/s 55.3 t/s

The exact location of the sync may not be the right final abstraction; it may belong closer to the layer-input extraction API or speculative process boundary. But the acceptance jump made it look like the current branch can read stale/not-yet-visible target features in server mode.

Minimal local diff shape:
- if (n_extract_layers != 3) {
-     throw std::runtime_error("draft model is not eagle3 (expected 3 extract layers, got " +
+ if (extract_layers == nullptr || n_extract_layers == 0) {
+     throw std::runtime_error("draft model is not eagle3 (invalid extract layer count " +
      std::to_string(n_extract_layers) + ")");
  }
and the sync before building features_buf from llama_get_output_layer_inp(...).

Hope this helps; happy to split this into a tiny PR against post-norm if that is easier to inspect.

Nice thanks! I need to rebase my changes, maybe upstream has fixed those issues, if not I will use your fix.

ruixiang63 · 2026-05-26T10:47:36Z

@tnhnyzc Regarding "Target layer-input extraction appears to need synchronization before EAGLE3 consumes it" — did you observe the same issue on the original EAGLE3 PR (ggml-org#18039), or only on this one? My guess is that post-norm changes the graph topology, so the ggml_set_sync() barrier from the original PR no longer lands on the correct split boundary. If that's the case, this would be a sync issue specific to this PR rather than a regression of the original behavior.

@Dogacel Thanks for the PR! I think this work deserves its own separate PR upstream once the original EAGLE3 PR lands, since it introduces some breaking changes and a variant of eagle3.
I would like to keep the original PR scoped to native EAGLE3 support so it stays easy to review and doesn't add extra burden on the llama.cpp maintainers.
I'd encourage you to open this directly against llama.cpp once ggml-org#18039 is merged. (though I'm not sure of the timeline, as Georgi hasn't had much bandwidth to review it yet.)
Thanks again for the PR. I've left some comments on the code as well.

tnhnyzc · 2026-05-26T13:07:57Z

@ruixiang63 I tested this against the original EAGLE3 PR path with a regular GPT-OSS EAGLE3 draft.

Setup:

PR/head: [Speculative decoding] feat: add EAGLE3 speculative decoding support ggml-org/llama.cpp#18039, ruixiang63:eagle3-adapt-new-arch at e2ad542
target: openai/gpt-oss-120b GGUF
draft: lmsys/EAGLE3-gpt-oss-120b-bf16
draft GGUF converted with the PR converter behavior, i.e. extract_layers=[2,18,33]
backend: Metal/macOS
same prompt/server args in both runs

Result:

original path, no explicit sync before reading target layer inputs:
draft acceptance = 0.06383, eval ~= 41.95 tok/s
after adding ctx->synchronize() inside llama_context::get_output_layer_inp():
draft acceptance = 0.89474, eval ~= 50.51 tok/s

So at least on Metal this does not look specific to the post-norm PR. The original EAGLE3 layer-input handoff also appears to expose async-copied host buffers before completion.

Separate converter note: the PR converter currently derives GPT-OSS extract layers as [2,18,33], but this draft’s HF config explicitly provides [1,17,33] under eagle_config.eagle_aux_hidden_state_layer_ids. Preserving that config value also works locally; with [1,17,33], I saw the same sync pattern:

no sync: acceptance = 0.10526, eval ~= 41.98 tok/s
sync: acceptance = 0.91209, eval ~= 53.81 tok/s

ruixiang63 · 2026-05-26T10:50:17Z

        n_extract_layers = llama_model_n_target_extract_layers(model_dft);
-        if (n_extract_layers != 3) {
-            throw std::runtime_error("draft model is not eagle3 (expected 3 extract layers, got " +
+        if (!(extract_layers != nullptr && n_extract_layers > 0)) {


If I understand correctly, this new eagle3 variant extracts 5 layers embedding from target model instead of 3. So here may be better to use n_extract_layers != 3 || n_extract_layers != 5.

Oh, there is no explicit assumption, the model makers can choose how many layers they want, I think it should be dynamic.

ruixiang63 · 2026-05-26T10:51:36Z

    "Llama4ForConditionalGeneration": "llama",
    "LlamaBidirectionalModel": "llama",
    "LlamaForCausalLM": "llama",
+    "LlamaForCausalLMEagle3": "llama",


The upstream changes already contains LlamaForCausalLMEagle3

ruixiang63 · 2026-05-26T10:54:42Z

            target_num_layers = target_config["num_hidden_layers"]
-            extract_layers = [2, target_num_layers // 2, target_num_layers - 3]
-            logger.info(f"EAGLE-3: extract_layers = {extract_layers} (target model has {target_num_layers} layers)")
+            cfg_extract = (eagle3_raw_config.get("eagle_config") or {}).get("eagle_aux_hidden_state_layer_ids")


I didn’t use eagle_aux_hidden_state_layer_ids here because different Eagle3 checkpoints interpret layer_ids differently: some expect them to be set before extracting the layers, while others expect them afterward, which can sometimes require adding +1.
To avoid this ambiguity, I decided to compute the values manually based on the original paper instead of relying on the Eagle3 config.
So you may follow the same strategy here.

I think we should rely on eagle_aux_hidden_state_layer_ids because there is no explicit formula beyond the standard EAGLE-3.

If you can point out what models require +1 and what models' dont, maybe we can generalize the logic. I don't like hardcoding the layer ids.

ruixiang63 · 2026-05-26T11:08:12Z

-    if (target_extract_layers.size() != 3) {
-        throw std::runtime_error("EAGLE3 requires exactly 3 entries in 'extract_layers'");
+    if (target_extract_layers.size() == 0) {
+        throw std::runtime_error("EAGLE3 requires at least 1 entry in 'extract_layers'");


eagle3 needs 3 or 5. "at least 1 " is misleading.

ruixiang63 · 2026-05-26T11:10:18Z

        }

-        extract_layer_inputs(res);
+        extract_layer_inputs(res, (uint32_t) n_tokens_prev, n_tokens_all);


This complicates the logic, so we should keep the original API parameters. ubatch is already handled correctly by this function.

In my case it was causing segfault, I will share the reproduction script.

ruixiang63 · 2026-05-26T11:11:46Z

 }

-void llama_context::extract_layer_inputs(const llm_graph_result * res) {
+void llama_context::extract_layer_inputs(const llm_graph_result * res, uint32_t n_tokens_prev, uint32_t n_tokens_all) {


This change is unnecessary.

It is necessary unless my launch config is wrong (and I assume most people would set it up similarly.

Try to comment this change and reproduce a segfault using,

./build/bin/llama-server -m $TARGET -md $DRAFTER --spec-type draft-eagle3 --spec-draft-n-max 4 --spec-draft-p-min 0.25 -np 1 -c 4096 --port 8080 -ngl 99 -fa on --jinja --no-mmap llama-benchy --base-url http://localhost:8080/v1 --tg 128 --model openai/gpt-oss-120b

ruixiang63 · 2026-05-26T11:12:05Z

+    // the buffer holds the whole batch ([n_tokens_all] rows); each ubatch is
+    // written at its running offset n_tokens_prev so multi-ubatch decodes
+    // accumulate instead of overwriting (mirrors the logits/pre-norm paths).
+    void extract_layer_inputs(const llm_graph_result * res, uint32_t n_tokens_prev, uint32_t n_tokens_all);


same unnecessary.

ruixiang63 · 2026-05-26T11:15:46Z


        params_base = params;

+        const bool spec_eagle3 = std::find(params_base.speculative.types.begin(), params_base.speculative.types.end(),


Why need this? Eagle3 already supportws np > 1 and doesn't need unified KV cache.

I had crashes in long-context with ubatch, let me share the script to reproduce bugs I have faced.

ruixiang63 · 2026-05-26T11:16:40Z

                cparams.ctx_type = LLAMA_CONTEXT_TYPE_MTP;
            }

+            if (spec_eagle3 && cparams.n_ubatch < cparams.n_batch) {


ubatch and n_batch in eagle3 should be correctly supported. Why this?

Try removing this line and see the abort,

./build/bin/llama-server -m "$TARGET" -md "$DRAFT" \ --spec-type draft-eagle3 --spec-draft-n-max 8 --spec-draft-p-min 0.5 \ -np 1 -c 4096 --port 8080 -ngl 99 -fa on --jinja --no-mmap 2>&1 | tee /tmp/repro.log llama-benchy --base-url http://localhost:8080/v1 --pp 2048 --tg 8 --concurrency 1 --model openai/gpt-oss-120b

This hits:

// micro-batching is not possible for non-causal encoding, so we process the batch in a single shot GGML_ASSERT(cparams.n_ubatch >= n_tokens && "encoder requires n_ubatch >= n_tokens");

Dogacel · 2026-05-27T05:47:11Z

@ruixiang63 pushed new changes, LMK if you have any issues.

ruixiang63 · 2026-05-27T17:02:01Z

@tnhnyzc I wasn’t able to reproduce the issue on DGX Spark, but I think it makes sense to add a sync after embedding extraction. I added you as a co-author on commit acc31b1. Could you please verify if it fixes your issue?

@Dogacel Thanks for reporting this. I was able to reproduce the bug, and it appears to be caused by llama_encode not supporting ubatch and extract_layer_inputs not handling the ubatch correctly. I pushed a fix in commit e8ddf01 and added you as a co-author. I also validated it locally, and it works well for both np > 1 and ub < b. Please verify.

tnhnyzc · 2026-05-27T17:42:45Z

@ruixiang63 Yep, verified on my Metal/macOS setup and it fixes the issue. Thanks!

Dogacel mentioned this pull request May 19, 2026

[Speculative decoding] feat: add EAGLE3 speculative decoding support ggml-org/llama.cpp#18039

Draft

ruixiang63 force-pushed the eagle3-adapt-new-arch branch from 5eafd44 to 71eb0c7 Compare May 20, 2026 14:35

ggerganov and others added 4 commits May 22, 2026 10:27

llama : enable layer input extraction

4815f8f

spec: support eagle3

91f698b

eagle3: fix params bug

35c1b77

eagle3: support Gemma4 eagle3 from RedHatAI

e2ad542

ruixiang63 force-pushed the eagle3-adapt-new-arch branch from 9d2fdcf to e2ad542 Compare May 22, 2026 10:29

Dogacel added 2 commits May 25, 2026 11:10

feat: EAGLE-3 post-norm support

c657fc9

bug fixes for u_batch

adcedaa

Dogacel force-pushed the post-norm branch from d1b50c7 to adcedaa Compare May 26, 2026 05:36

ruixiang63 self-requested a review May 26, 2026 22:15

ruixiang63 reviewed May 26, 2026

View reviewed changes

tidy up

9da1990

ruixiang63 force-pushed the eagle3-adapt-new-arch branch 2 times, most recently from cf1aab7 to e8ddf01 Compare May 27, 2026 16:53


		params_base = params;

		const bool spec_eagle3 = std::find(params_base.speculative.types.begin(), params_base.speculative.types.end(),

Conversation

Dogacel commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Additional information

Requirements

Uh oh!

Dogacel commented May 21, 2026

Uh oh!

tnhnyzc commented May 25, 2026

1. SpecDrift GPT-OSS uses 5 extract layers

2. Target layer-input extraction appears to need synchronization before EAGLE3 consumes it

Uh oh!

Dogacel commented May 25, 2026

1. SpecDrift GPT-OSS uses 5 extract layers

2. Target layer-input extraction appears to need synchronization before EAGLE3 consumes it

Uh oh!

ruixiang63 commented May 26, 2026

Uh oh!

tnhnyzc commented May 26, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dogacel May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dogacel commented May 27, 2026

Uh oh!

ruixiang63 commented May 27, 2026

Uh oh!

tnhnyzc commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Dogacel commented May 19, 2026 •

edited

Loading

Dogacel May 26, 2026 •

edited

Loading