feat: EAGLE-3 post-norm support#4
Conversation
5eafd44 to
71eb0c7
Compare
|
@ruixiang63 hi, can you help me merge this into your base? I don't have access to a DGX Spark yet, would be glad if you can test my checkpoints too! I've only tested on H100s. |
9d2fdcf to
e2ad542
Compare
|
I tested this branch locally with Setup:
1. SpecDrift GPT-OSS uses 5 extract layersThe draft config has: "eagle_aux_hidden_state_layer_ids": [1, 9, 17, 25, 33]The converted GGUF has Locally I changed the guard from 2. Target layer-input extraction appears to need synchronization before EAGLE3 consumes itAfter relaxing the 3-layer guard, the model loaded and ran, but acceptance was extremely low until I synchronized the target context before reading extracted target layer inputs. The relevant path appears to be:
As a local correctness test, I added: // Target layer inputs are copied out asynchronously after llama_decode(ctx_tgt).
// Make them host-visible before feeding them to the EAGLE3 encoder.
llama_synchronize(ctx_tgt);right before the loop that reads That changed acceptance/perf substantially on the same prompts:
The exact location of the sync may not be the right final abstraction; it may belong closer to the layer-input extraction API or speculative process boundary. But the acceptance jump made it look like the current branch can read stale/not-yet-visible target features in server mode. Minimal local diff shape: - if (n_extract_layers != 3) {
- throw std::runtime_error("draft model is not eagle3 (expected 3 extract layers, got " +
+ if (extract_layers == nullptr || n_extract_layers == 0) {
+ throw std::runtime_error("draft model is not eagle3 (invalid extract layer count " +
std::to_string(n_extract_layers) + ")");
}and the sync before building Hope this helps; happy to split this into a tiny PR against |
Nice thanks! I need to rebase my changes, maybe upstream has fixed those issues, if not I will use your fix. |
|
@tnhnyzc Regarding "Target layer-input extraction appears to need synchronization before EAGLE3 consumes it" — did you observe the same issue on the original EAGLE3 PR (ggml-org#18039), or only on this one? My guess is that post-norm changes the graph topology, so the @Dogacel Thanks for the PR! I think this work deserves its own separate PR upstream once the original EAGLE3 PR lands, since it introduces some breaking changes and a variant of eagle3. |
|
@ruixiang63 I tested this against the original EAGLE3 PR path with a regular GPT-OSS EAGLE3 draft. Setup:
Result:
So at least on Metal this does not look specific to the post-norm PR. The original EAGLE3 layer-input handoff also appears to expose async-copied host buffers before completion. Separate converter note: the PR converter currently derives GPT-OSS extract layers as [2,18,33], but this draft’s HF config explicitly provides [1,17,33] under eagle_config.eagle_aux_hidden_state_layer_ids. Preserving that config value also works locally; with [1,17,33], I saw the same sync pattern:
|
| n_extract_layers = llama_model_n_target_extract_layers(model_dft); | ||
| if (n_extract_layers != 3) { | ||
| throw std::runtime_error("draft model is not eagle3 (expected 3 extract layers, got " + | ||
| if (!(extract_layers != nullptr && n_extract_layers > 0)) { |
There was a problem hiding this comment.
If I understand correctly, this new eagle3 variant extracts 5 layers embedding from target model instead of 3. So here may be better to use n_extract_layers != 3 || n_extract_layers != 5.
There was a problem hiding this comment.
Oh, there is no explicit assumption, the model makers can choose how many layers they want, I think it should be dynamic.
| "Llama4ForConditionalGeneration": "llama", | ||
| "LlamaBidirectionalModel": "llama", | ||
| "LlamaForCausalLM": "llama", | ||
| "LlamaForCausalLMEagle3": "llama", |
There was a problem hiding this comment.
The upstream changes already contains LlamaForCausalLMEagle3
| target_num_layers = target_config["num_hidden_layers"] | ||
| extract_layers = [2, target_num_layers // 2, target_num_layers - 3] | ||
| logger.info(f"EAGLE-3: extract_layers = {extract_layers} (target model has {target_num_layers} layers)") | ||
| cfg_extract = (eagle3_raw_config.get("eagle_config") or {}).get("eagle_aux_hidden_state_layer_ids") |
There was a problem hiding this comment.
I didn’t use eagle_aux_hidden_state_layer_ids here because different Eagle3 checkpoints interpret layer_ids differently: some expect them to be set before extracting the layers, while others expect them afterward, which can sometimes require adding +1.
To avoid this ambiguity, I decided to compute the values manually based on the original paper instead of relying on the Eagle3 config.
So you may follow the same strategy here.
There was a problem hiding this comment.
I think we should rely on eagle_aux_hidden_state_layer_ids because there is no explicit formula beyond the standard EAGLE-3.
If you can point out what models require +1 and what models' dont, maybe we can generalize the logic. I don't like hardcoding the layer ids.
| if (target_extract_layers.size() != 3) { | ||
| throw std::runtime_error("EAGLE3 requires exactly 3 entries in 'extract_layers'"); | ||
| if (target_extract_layers.size() == 0) { | ||
| throw std::runtime_error("EAGLE3 requires at least 1 entry in 'extract_layers'"); |
There was a problem hiding this comment.
eagle3 needs 3 or 5. "at least 1 " is misleading.
| } | ||
|
|
||
| extract_layer_inputs(res); | ||
| extract_layer_inputs(res, (uint32_t) n_tokens_prev, n_tokens_all); |
There was a problem hiding this comment.
This complicates the logic, so we should keep the original API parameters. ubatch is already handled correctly by this function.
There was a problem hiding this comment.
In my case it was causing segfault, I will share the reproduction script.
| } | ||
|
|
||
| void llama_context::extract_layer_inputs(const llm_graph_result * res) { | ||
| void llama_context::extract_layer_inputs(const llm_graph_result * res, uint32_t n_tokens_prev, uint32_t n_tokens_all) { |
There was a problem hiding this comment.
It is necessary unless my launch config is wrong (and I assume most people would set it up similarly.
Try to comment this change and reproduce a segfault using,
./build/bin/llama-server -m $TARGET -md $DRAFTER --spec-type draft-eagle3 --spec-draft-n-max 4 --spec-draft-p-min 0.25 -np 1 -c 4096 --port 8080 -ngl 99 -fa on --jinja --no-mmap
llama-benchy --base-url http://localhost:8080/v1 --tg 128 --model openai/gpt-oss-120b| // the buffer holds the whole batch ([n_tokens_all] rows); each ubatch is | ||
| // written at its running offset n_tokens_prev so multi-ubatch decodes | ||
| // accumulate instead of overwriting (mirrors the logits/pre-norm paths). | ||
| void extract_layer_inputs(const llm_graph_result * res, uint32_t n_tokens_prev, uint32_t n_tokens_all); |
|
|
||
| params_base = params; | ||
|
|
||
| const bool spec_eagle3 = std::find(params_base.speculative.types.begin(), params_base.speculative.types.end(), |
There was a problem hiding this comment.
Why need this? Eagle3 already supportws np > 1 and doesn't need unified KV cache.
There was a problem hiding this comment.
I had crashes in long-context with ubatch, let me share the script to reproduce bugs I have faced.
| cparams.ctx_type = LLAMA_CONTEXT_TYPE_MTP; | ||
| } | ||
|
|
||
| if (spec_eagle3 && cparams.n_ubatch < cparams.n_batch) { |
There was a problem hiding this comment.
ubatch and n_batch in eagle3 should be correctly supported. Why this?
There was a problem hiding this comment.
Try removing this line and see the abort,
./build/bin/llama-server -m "$TARGET" -md "$DRAFT" \
--spec-type draft-eagle3 --spec-draft-n-max 8 --spec-draft-p-min 0.5 \
-np 1 -c 4096 --port 8080 -ngl 99 -fa on --jinja --no-mmap 2>&1 | tee /tmp/repro.log
llama-benchy --base-url http://localhost:8080/v1 --pp 2048 --tg 8 --concurrency 1 --model openai/gpt-oss-120bThis hits:
// micro-batching is not possible for non-causal encoding, so we process the batch in a single shot
GGML_ASSERT(cparams.n_ubatch >= n_tokens && "encoder requires n_ubatch >= n_tokens");|
@ruixiang63 pushed new changes, LMK if you have any issues. |
cf1aab7 to
e8ddf01
Compare
|
@tnhnyzc I wasn’t able to reproduce the issue on DGX Spark, but I think it makes sense to add a sync after embedding extraction. I added you as a co-author on commit acc31b1. Could you please verify if it fixes your issue? @Dogacel Thanks for reporting this. I was able to reproduce the bug, and it appears to be caused by |
|
@ruixiang63 Yep, verified on my Metal/macOS setup and it fixes the issue. Thanks! |
Overview
Support the new post-norm architecture as described in our paper & blog:
https://x.com/dogacel0/status/2054200111043949012?s=20
Additional information
Use the models to test,
Requirements