[NPU] Enable Eagle3 top-1 proposal with SDPA stateful pipeline #2947
base: master
Conversation
Pull Request Overview
This PR enables Eagle3 speculative decoding with top-1 proposal using SDPA (Scaled Dot-Product Attention) stateful pipeline for NPU devices. Eagle3 is an advanced speculative decoding algorithm that uses hidden states from the target model to guide draft model token generation.
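The loop below is a rough, self-contained illustration of that idea, not the pipeline's real API: every type and function name is a hypothetical stand-in for what StatefulEagle3LLMPipeline does internally (top-1 proposal by the draft model, single-pass validation by the target model, hidden states carried between the two).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical result of one target-model pass over a candidate window.
struct TargetStep {
    std::vector<int64_t> top1_tokens;   // target's greedy token at each candidate position, plus one bonus token
    std::vector<float>   hidden_state;  // features the draft model conditions on (the Eagle3-specific part)
};

// Dummy stand-ins so the sketch compiles; real code runs OpenVINO infer requests.
TargetStep run_target(const std::vector<int64_t>& tokens, size_t num_candidates) {
    return {std::vector<int64_t>(num_candidates + 1, 0), std::vector<float>(16, 0.f)};
}
int64_t run_draft(const std::vector<int64_t>&, const std::vector<float>&) { return 0; }

std::vector<int64_t> generate(std::vector<int64_t> tokens,
                              size_t max_new_tokens,
                              size_t num_assistant_tokens) {
    TargetStep last = run_target(tokens, 0);     // prefill: first token + hidden state
    tokens.push_back(last.top1_tokens.back());
    size_t generated = 1;
    while (generated < max_new_tokens) {         // may overshoot slightly; real code truncates
        // 1. Draft proposes num_assistant_tokens candidates, one greedy (top-1)
        //    token per step, conditioned on the target's last hidden state.
        std::vector<int64_t> candidates;
        std::vector<int64_t> context = tokens;
        for (size_t i = 0; i < num_assistant_tokens; ++i) {
            int64_t t = run_draft(context, last.hidden_state);
            candidates.push_back(t);
            context.push_back(t);
        }
        // 2. Target scores all candidates in a single pass and accepts the longest
        //    prefix matching its own greedy choices; the first mismatch is replaced
        //    by the target's token, so each iteration commits at least one token.
        last = run_target(context, candidates.size());
        size_t accepted = 0;
        while (accepted < candidates.size() && candidates[accepted] == last.top1_tokens[accepted])
            ++accepted;
        tokens.insert(tokens.end(), candidates.begin(), candidates.begin() + accepted);
        tokens.push_back(last.top1_tokens[accepted]);  // target's correction/bonus token
        generated += accepted + 1;
    }
    return tokens;
}
```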
Key changes:
- Implements StatefulEagle3LLMPipeline for Eagle3 speculative decoding on NPU
- Adds Eagle3 model transformations (embedding weight sharing, hidden state extraction)
- Integrates Eagle3 into continuous batching pipeline with draft-to-target token mapping
- Updates test infrastructure to support Eagle3 models and excludes them from standard API tests
Reviewed Changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tools/llm_bench/task/text_generation.py | Adds eagle3_mode flag to disable special tokens during encoding for better compression |
| tools/llm_bench/benchmark.py | Adds command-line argument for the eagle3_mode flag |
| tests/python_tests/utils/hugging_face.py | Adds eagle3-specific model loading logic with conditional tokenizer handling |
| tests/python_tests/test_continuous_batching.py | Adds Eagle3 test cases and parametrization for speculative decoding tests |
| tests/python_tests/samples/test_speculative_decoding_lm.py | Adds an Eagle3-specific test class (currently skipped) |
| tests/python_tests/samples/conftest.py | Adds Eagle3 model configurations for Qwen3-1.7B models |
| src/cpp/src/speculative_decoding/update_request_structs.hpp | Adds a hidden_states field to the GeneratedSequence structure |
| src/cpp/src/speculative_decoding/speculative_decoding_utils.* | Implements Eagle3 configuration extraction and runtime info structures |
| src/cpp/src/speculative_decoding/speculative_decoding_stateful_eagle3.* | Implements the complete Eagle3 pipeline with inference wrappers and validation |
| src/cpp/src/speculative_decoding/speculative_decoding_impl.* | Refactors speculative decoding to support a common generation strategy |
| src/cpp/src/speculative_decoding/speculative_decoding_eagle3_impl.* | Implements Eagle3-specific transformations and continuous batching integration |
| src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.* | Adds the Eagle3DecodingImpl class and hidden state management |
| src/cpp/src/sequence_group.hpp | Adds hidden state storage and getters/setters to the Sequence class |
| src/cpp/src/sampling/sampler.* | Adds draft-to-target token mapping support |
| src/cpp/src/llm/pipeline.cpp | Integrates Eagle3 pipeline selection based on configuration |
| src/cpp/src/continuous_batching/pipeline.cpp | Adds Eagle3 pipeline initialization in continuous batching |
| src/cpp/src/continuous_batching/model_runner.hpp | Adds hidden state flags and management infrastructure |
| src/cpp/include/openvino/genai/continuous_batching_pipeline.hpp | Declares Eagle3DecodingImpl as a friend class |
| samples/python/text_generation/speculative_decoding_lm.py | Updates the sample to use the NPU device with a reduced token count |
| .github/workflows/*.yml | Adds Eagle3-specific test workflows and excludes Eagle3 from standard tests |
Pull Request Overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 2 comments.
Pull Request Overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp:1
- [nitpick] There appears to be a missing comment or documentation explaining what 'adjust_factor' represents and why it's needed for Eagle3. Consider adding a brief comment for maintainability.
// Copyright (C) 2023-2025 Intel Corporation
Force-pushed ff7728e to a47dbaf (compare)
Pull Request Overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.
```cpp
struct Eagle3RTInfo {
    bool eagle3_mode = false;
    std::vector<int> hidden_layers_list;
    std::filesystem::path dt_mapping_table;
};

Eagle3RTInfo
extract_eagle_mode_from_config(ov::AnyMap& config, const std::filesystem::path& models_path) {
    Eagle3RTInfo eagle_rt_info;
    if (config.find("eagle3_mode") != config.end()) {
        eagle_rt_info.eagle3_mode = config.at("eagle3_mode").as<bool>();
        config.erase("eagle3_mode");
        if (config.find("hidden_layers_list") != config.end()) {
            eagle_rt_info.hidden_layers_list = config.at("hidden_layers_list").as<std::vector<int>>();
            config.erase("hidden_layers_list");
        } else {
            // compute the layers from number of hidden layers
            auto config_file_path = models_path / "config.json";
            if (!std::filesystem::exists(config_file_path))
                OPENVINO_THROW("cannot deduce layers for hidden layer extraction");
            std::ifstream file(config_file_path);
            nlohmann::json data = nlohmann::json::parse(file);
            using ov::genai::utils::read_json_param;
            int num_decoder_layers = 0;
            read_json_param(data, "num_hidden_layers", num_decoder_layers);
            OPENVINO_ASSERT(num_decoder_layers > 3, "num_decoder_layers is too small to deduce hidden layers for extraction");
            // The following default hidden layer selection corresponds to the EAGLE reference implementation:
            // https://github.com/SafeAILab/EAGLE/blob/0ea94696/eagle/model/modeling_llama_kv.py#L1138
            // These layers (2, num_decoder_layers / 2, num_decoder_layers - 3) are chosen to capture features from
            // early, middle, and late stages of the decoder, as recommended by the EAGLE authors.
            // If you wish to use different layers, provide the "hidden_layers_list" parameter in the config.
            eagle_rt_info.hidden_layers_list = { 2, num_decoder_layers / 2, num_decoder_layers - 3 };
        }
    }
    return eagle_rt_info;
}
```
Copilot AI (Nov 17, 2025)
The Eagle3RTInfo struct and extract_eagle_mode_from_config function are duplicated between this file and speculative_decoding_utils.hpp/cpp. This code duplication should be eliminated by using the implementation from speculative_decoding_utils instead.
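A minimal version of that fix might look like the sketch below; the header name and the surrounding function are assumptions for illustration, since the shared helper is what this PR already adds in speculative_decoding_utils:

```cpp
// Sketch (hypothetical call site): reuse the single shared definition instead
// of keeping a local copy of Eagle3RTInfo / extract_eagle_mode_from_config.
#include "speculative_decoding_utils.hpp"  // assumed header name from this PR

void configure_eagle3(ov::AnyMap& config, const std::filesystem::path& models_path) {
    auto eagle_rt_info = extract_eagle_mode_from_config(config, models_path);
    // ... use eagle_rt_info.eagle3_mode and eagle_rt_info.hidden_layers_list ...
}
```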
Force-pushed b4d9c7d to 8aed616 (compare)
Pull Request Overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 10 comments.
| if "eagle3" not in str(model_id).lower(): | ||
| hf_tokenizer = retry_request(lambda: AutoTokenizer.from_pretrained(model_id, local_files_only=local_files_only)) | ||
| opt_model = retry_request(lambda: model_class.from_pretrained(model_id, export=isinstance(model_id, str), compile=False, load_in_8bit=False, ov_config=get_default_llm_properties(), local_files_only=local_files_only)) | ||
| return opt_model, hf_tokenizer | ||
| else: | ||
| hf_tokenizer = None | ||
| opt_model = retry_request(lambda: model_class.from_pretrained(model_id, eagle3=True, export=isinstance(model_id, str), compile=False, load_in_8bit=False, ov_config=get_default_llm_properties(), local_files_only=local_files_only)) | ||
| return opt_model, hf_tokenizer |
Copilot AI (Nov 19, 2025)
Duplicated code in both branches: the opt_model creation logic is nearly identical except for the eagle3=True parameter. Consider refactoring to eliminate the duplication by setting the eagle3 parameter conditionally.
| if "eagle3" not in str(model_id).lower(): | |
| hf_tokenizer = retry_request(lambda: AutoTokenizer.from_pretrained(model_id, local_files_only=local_files_only)) | |
| opt_model = retry_request(lambda: model_class.from_pretrained(model_id, export=isinstance(model_id, str), compile=False, load_in_8bit=False, ov_config=get_default_llm_properties(), local_files_only=local_files_only)) | |
| return opt_model, hf_tokenizer | |
| else: | |
| hf_tokenizer = None | |
| opt_model = retry_request(lambda: model_class.from_pretrained(model_id, eagle3=True, export=isinstance(model_id, str), compile=False, load_in_8bit=False, ov_config=get_default_llm_properties(), local_files_only=local_files_only)) | |
| return opt_model, hf_tokenizer | |
| eagle3_flag = "eagle3" in str(model_id).lower() | |
| if not eagle3_flag: | |
| hf_tokenizer = retry_request(lambda: AutoTokenizer.from_pretrained(model_id, local_files_only=local_files_only)) | |
| else: | |
| hf_tokenizer = None | |
| opt_model = retry_request(lambda: model_class.from_pretrained( | |
| model_id, | |
| eagle3=eagle3_flag, | |
| export=isinstance(model_id, str), | |
| compile=False, | |
| load_in_8bit=False, | |
| ov_config=get_default_llm_properties(), | |
| local_files_only=local_files_only | |
| )) | |
| return opt_model, hf_tokenizer |
```python
if draft_model_id is None:
    generation_config = GenerationConfig(do_sample=False, max_new_tokens=20, ignore_eos=True, num_assistant_tokens=5)
    extended_perf_metrics = run_extended_perf_metrics_collection(main_model_id, generation_config, prompt, pipeline_type, draft_model_id)
    total_time = (time.perf_counter() - start_time) * 1000
else:
    if (pipeline_type == PipelineType.SPECULATIVE_DECODING):
        generation_config = GenerationConfig(do_sample=False, max_new_tokens=20, ignore_eos=True, num_assistant_tokens=5)
        extended_perf_metrics = run_extended_perf_metrics_collection(main_model_id, generation_config, prompt, pipeline_type, draft_model_id)
        total_time = (time.perf_counter() - start_time) * 1000
```
Copilot AI (Nov 19, 2025)
Lines 510-512 and 516-518 are identical. The nested condition on line 515 makes the logic unclear. Consider simplifying by removing the redundant else branch or restructuring the conditions.
| "qwen3_1.7b_eagle3": { | ||
| "name": "AngelSlim/Qwen3-1.7B_eagle3", | ||
| "convert_args": ["--task", "text-generation-with-past", "--trust-remote-code", "--eagle3"] | ||
| } |
Copilot AI (Nov 19, 2025)
The --eagle3 flag is passed in convert_args but there's no documentation or comment explaining what this flag does or why it's needed specifically for this model configuration.
```cpp
void ensure_num_assistant_tokens_is_set(ov::genai::GenerationConfig& config) {
    // Only num_assistant_tokens is supported, not assistant_confidence_threshold
    OPENVINO_ASSERT(
        config.assistant_confidence_threshold == 0.f,
```
Copilot AI (Nov 19, 2025)
Magic number 0.f should be defined as a named constant (e.g., DISABLED_ASSISTANT_CONFIDENCE_THRESHOLD) to clarify its meaning and improve maintainability.
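One possible spelling of that suggestion (the constant name and message wording are illustrative, not taken from the PR):

```cpp
// Named constant clarifying that 0.f means "feature disabled" (name is illustrative).
constexpr float DISABLED_ASSISTANT_CONFIDENCE_THRESHOLD = 0.f;

void ensure_num_assistant_tokens_is_set(ov::genai::GenerationConfig& config) {
    // Only num_assistant_tokens is supported, not assistant_confidence_threshold.
    OPENVINO_ASSERT(
        config.assistant_confidence_threshold == DISABLED_ASSISTANT_CONFIDENCE_THRESHOLD,
        "The stateful Eagle3 pipeline supports only num_assistant_tokens; "
        "assistant_confidence_threshold must remain disabled.");
}
```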
| ov::Tensor& input_ids, | ||
| ov::Tensor& attention_mask, | ||
| ov::Tensor& position_ids) { | ||
| OPENVINO_ASSERT(!m_tokens.empty() && token_count > 0, "Cannot build inputs: empty sequence or zero token count"); |
Copilot AI (Nov 19, 2025)
Error message on line 169 could be more specific by indicating which condition failed (empty tokens vs zero token count) to aid debugging.
| OPENVINO_ASSERT(!m_tokens.empty() && token_count > 0, "Cannot build inputs: empty sequence or zero token count"); | |
| OPENVINO_ASSERT(!m_tokens.empty(), "Cannot build inputs: empty sequence (m_tokens is empty)"); | |
| OPENVINO_ASSERT(token_count > 0, "Cannot build inputs: zero token count (token_count <= 0)"); |
```cpp
void share_embedding_weights(std::shared_ptr<ov::Model>& main_model, std::shared_ptr<ov::Model>& draft_model) {
    // extract embedding weight from main model
    auto find_embedding_gather = [](const std::shared_ptr<ov::Model>& model)
        -> std::shared_ptr<ov::Node> {
```
Copilot AI (Nov 19, 2025)
The threshold value 1000 for vocabulary size is a heuristic but lacks explanation. Add a comment explaining why this specific value was chosen.
Suggested change:

```cpp
        -> std::shared_ptr<ov::Node> {
        // The threshold value 1000 for vocabulary size is a heuristic chosen because most modern language models
        // (e.g., GPT, Llama, Falcon) have vocabulary sizes well above 1000, while other embedding layers (e.g., for classification)
        // typically have much smaller vocabularies. This helps reliably identify the token embedding layer in the model.
        // Adjust this value if working with models with unusually small or large vocabularies.
```
```cpp
if (m_device.find("NPU") != std::string::npos) {
    // Scale input down by 100x before MatMul to avoid FP16 overflow, then scale result back up
    // The factor 100 (0.01 and 100.0) is an empirical value
    auto scale_down_const = std::make_shared<v0::Constant>(matmul_input0.get_element_type(), ov::Shape{}, 0.01f);
```
Copilot AI (Nov 19, 2025)
The scaling factors (0.01 and 100.0) are described as empirical values in the comment on line 170, but there's no explanation of how they were determined or under what conditions they might need adjustment.
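For context, the overall rewrite being discussed looks roughly like the sketch below: a condensed version of the transformation, with the helper name and the transpose_b flag as assumptions. The factors come straight from the PR comment and are tuned empirically rather than derived.

```cpp
#include <memory>
#include "openvino/op/constant.hpp"
#include "openvino/op/matmul.hpp"
#include "openvino/op/multiply.hpp"

// Replace `input x weights` with (input * 0.01) x weights, then * 100.0, so the
// FP16 intermediate of the MatMul stays within range on NPU while the final
// result is numerically equivalent. The factor 100 is the empirical value from
// this PR; it may need retuning for models with different activation ranges.
ov::Output<ov::Node> make_rescaled_matmul(const ov::Output<ov::Node>& input,
                                          const ov::Output<ov::Node>& weights) {
    namespace op = ov::op;
    auto scale_down = std::make_shared<op::v0::Constant>(input.get_element_type(), ov::Shape{}, 0.01f);
    auto scaled_in  = std::make_shared<op::v1::Multiply>(input, scale_down);
    auto matmul     = std::make_shared<op::v0::MatMul>(scaled_in, weights,
                                                       /*transpose_a=*/false, /*transpose_b=*/true);
    auto scale_up   = std::make_shared<op::v0::Constant>(matmul->get_element_type(), ov::Shape{}, 100.0f);
    return std::make_shared<op::v1::Multiply>(matmul, scale_up);
}
```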
```cpp
new_matmul->set_friendly_name(matmul_node->get_friendly_name() + "/matmul");

// Scale result back up to maintain numerical equivalence
auto scale_up_const = std::make_shared<v0::Constant>(new_matmul->get_element_type(), ov::Shape{}, 100.0f);
```
Copilot AI (Nov 19, 2025)
(Same remark as above: the scaling factors 0.01 and 100.0 are described as empirical, with no explanation of how they were determined or when they might need adjustment.)
```cpp
        }
    }
} else {
    OPENVINO_ASSERT(false, "missing hidden state from target model to eagle draft model");
```
Copilot AI (Nov 19, 2025)
Error message is unclear about what action the user should take. Consider adding guidance such as 'Ensure target model has generated hidden states before draft model inference'.
Suggested change:

```cpp
OPENVINO_ASSERT(false, "Missing hidden state from target model to eagle draft model. Ensure the target model has generated hidden states before draft model inference.");
```
```yaml
  timeout: 90
- name: 'EAGLE3 speculative decoding tests'
  cmd: |
    python -m pip install git+https://github.com/xufang-lisa/optimum-intel.git@ea9607daf32919024cdd4390deec9693a7b64d23
```
Copilot AI (Nov 19, 2025)
Installing from a specific commit hash (ea9607daf32919024cdd4390deec9693a7b64d23) in a personal GitHub repository is fragile and not reproducible long-term. Consider using a tagged release from the official repository or documenting why this specific commit is required.
Suggested change:

```yaml
    # Install optimum-intel from the official repository for reproducibility.
    python -m pip install optimum-intel
```
Description
Ticket: https://jira.devtools.intel.com/browse/CVS-175909
Fixes #(issue)
Checklist: