[NPU] Enable Eagle3 top-1 proposal with SDPA stateful pipeline #2947
base: master
Conversation
Pull Request Overview
This PR enables Eagle3 speculative decoding with top-1 proposal using SDPA (Scaled Dot-Product Attention) stateful pipeline for NPU devices. Eagle3 is an advanced speculative decoding algorithm that uses hidden states from the target model to guide draft model token generation.
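The loop below is a rough, self-contained illustration of that idea, not the pipeline's real API: every type and function name is a hypothetical stand-in for what StatefulEagle3LLMPipeline does internally (top-1 proposal by the draft model, single-pass validation by the target model, hidden states carried between the two).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical result of one target-model pass over a candidate window.
struct TargetStep {
    std::vector<int64_t> top1_tokens;   // target's greedy token at each candidate position, plus one bonus token
    std::vector<float>   hidden_state;  // features the draft model conditions on (the Eagle3-specific part)
};

// Dummy stand-ins so the sketch compiles; real code runs OpenVINO infer requests.
TargetStep run_target(const std::vector<int64_t>& tokens, size_t num_candidates) {
    return {std::vector<int64_t>(num_candidates + 1, 0), std::vector<float>(16, 0.f)};
}
int64_t run_draft(const std::vector<int64_t>&, const std::vector<float>&) { return 0; }

std::vector<int64_t> generate(std::vector<int64_t> tokens,
                              size_t max_new_tokens,
                              size_t num_assistant_tokens) {
    TargetStep last = run_target(tokens, 0);     // prefill: first token + hidden state
    tokens.push_back(last.top1_tokens.back());
    size_t generated = 1;
    while (generated < max_new_tokens) {         // may overshoot slightly; real code truncates
        // 1. Draft proposes num_assistant_tokens candidates, one greedy (top-1)
        //    token per step, conditioned on the target's last hidden state.
        std::vector<int64_t> candidates;
        std::vector<int64_t> context = tokens;
        for (size_t i = 0; i < num_assistant_tokens; ++i) {
            int64_t t = run_draft(context, last.hidden_state);
            candidates.push_back(t);
            context.push_back(t);
        }
        // 2. Target scores all candidates in a single pass and accepts the longest
        //    prefix matching its own greedy choices; the first mismatch is replaced
        //    by the target's token, so each iteration commits at least one token.
        last = run_target(context, candidates.size());
        size_t accepted = 0;
        while (accepted < candidates.size() && candidates[accepted] == last.top1_tokens[accepted])
            ++accepted;
        tokens.insert(tokens.end(), candidates.begin(), candidates.begin() + accepted);
        tokens.push_back(last.top1_tokens[accepted]);  // target's correction/bonus token
        generated += accepted + 1;
    }
    return tokens;
}
```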
Key changes:
- Implements StatefulEagle3LLMPipeline for Eagle3 speculative decoding on NPU
- Adds Eagle3 model transformations (embedding weight sharing, hidden state extraction)
- Integrates Eagle3 into continuous batching pipeline with draft-to-target token mapping
- Updates test infrastructure to support Eagle3 models and excludes them from standard API tests
Reviewed Changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tools/llm_bench/task/text_generation.py | Adds eagle3_mode flag to disable special tokens during encoding for better compression |
| tools/llm_bench/benchmark.py | Adds command-line argument for the eagle3_mode flag |
| tests/python_tests/utils/hugging_face.py | Adds eagle3-specific model loading logic with conditional tokenizer handling |
| tests/python_tests/test_continuous_batching.py | Adds Eagle3 test cases and parametrization for speculative decoding tests |
| tests/python_tests/samples/test_speculative_decoding_lm.py | Adds an Eagle3-specific test class (currently skipped) |
| tests/python_tests/samples/conftest.py | Adds Eagle3 model configurations for Qwen3-1.7B models |
| src/cpp/src/speculative_decoding/update_request_structs.hpp | Adds a hidden_states field to the GeneratedSequence structure |
| src/cpp/src/speculative_decoding/speculative_decoding_utils.* | Implements Eagle3 configuration extraction and runtime info structures |
| src/cpp/src/speculative_decoding/speculative_decoding_stateful_eagle3.* | Implements the complete Eagle3 pipeline with inference wrappers and validation |
| src/cpp/src/speculative_decoding/speculative_decoding_impl.* | Refactors speculative decoding to support a common generation strategy |
| src/cpp/src/speculative_decoding/speculative_decoding_eagle3_impl.* | Implements Eagle3-specific transformations and continuous batching integration |
| src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.* | Adds the Eagle3DecodingImpl class and hidden state management |
| src/cpp/src/sequence_group.hpp | Adds hidden state storage and getters/setters to the Sequence class |
| src/cpp/src/sampling/sampler.* | Adds draft-to-target token mapping support |
| src/cpp/src/llm/pipeline.cpp | Integrates Eagle3 pipeline selection based on configuration |
| src/cpp/src/continuous_batching/pipeline.cpp | Adds Eagle3 pipeline initialization in continuous batching |
| src/cpp/src/continuous_batching/model_runner.hpp | Adds hidden state flags and management infrastructure |
| src/cpp/include/openvino/genai/continuous_batching_pipeline.hpp | Declares Eagle3DecodingImpl as a friend class |
| samples/python/text_generation/speculative_decoding_lm.py | Updates the sample to use the NPU device with a reduced token count |
| .github/workflows/*.yml | Adds Eagle3-specific test workflows and excludes Eagle3 from standard tests |
Pull Request Overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 2 comments.
Pull Request Overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp:1
- [nitpick] There appears to be a missing comment or documentation explaining what 'adjust_factor' represents and why it's needed for Eagle3. Consider adding a brief comment for maintainability.
// Copyright (C) 2023-2025 Intel Corporation
Force-pushed ff7728e to a47dbaf (compare)
Pull Request Overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.
```cpp
struct Eagle3RTInfo {
    bool eagle3_mode = false;
    std::vector<int> hidden_layers_list;
    std::filesystem::path dt_mapping_table;
};

Eagle3RTInfo
extract_eagle_mode_from_config(ov::AnyMap& config, const std::filesystem::path& models_path) {
    Eagle3RTInfo eagle_rt_info;
    if (config.find("eagle3_mode") != config.end()) {
        eagle_rt_info.eagle3_mode = config.at("eagle3_mode").as<bool>();
        config.erase("eagle3_mode");
        if (config.find("hidden_layers_list") != config.end()) {
            eagle_rt_info.hidden_layers_list = config.at("hidden_layers_list").as<std::vector<int>>();
            config.erase("hidden_layers_list");
        } else {
            // compute the layers from number of hidden layers
            auto config_file_path = models_path / "config.json";
            if (!std::filesystem::exists(config_file_path))
                OPENVINO_THROW("cannot deduce layers for hidden layer extraction");
            std::ifstream file(config_file_path);
            nlohmann::json data = nlohmann::json::parse(file);
            using ov::genai::utils::read_json_param;
            int num_decoder_layers = 0;
            read_json_param(data, "num_hidden_layers", num_decoder_layers);
            OPENVINO_ASSERT(num_decoder_layers > 3, "num_decoder_layers is too small to deduce hidden layers for extraction");
            // The following default hidden layer selection corresponds to the EAGLE reference implementation:
            // https://github.com/SafeAILab/EAGLE/blob/0ea94696/eagle/model/modeling_llama_kv.py#L1138
            // These layers (2, num_decoder_layers / 2, num_decoder_layers - 3) are chosen to capture features from
            // early, middle, and late stages of the decoder, as recommended by the EAGLE authors.
            // If you wish to use different layers, provide the "hidden_layers_list" parameter in the config.
            eagle_rt_info.hidden_layers_list = { 2, num_decoder_layers / 2, num_decoder_layers - 3 };
        }
    }
    return eagle_rt_info;
}
```
Copilot AI (Nov 17, 2025)
The Eagle3RTInfo struct and extract_eagle_mode_from_config function are duplicated between this file and speculative_decoding_utils.hpp/cpp. This code duplication should be eliminated by using the implementation from speculative_decoding_utils instead.
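A minimal version of that fix might look like the sketch below; the header name and the surrounding function are assumptions for illustration, since the shared helper is what this PR already adds in speculative_decoding_utils:

```cpp
// Sketch (hypothetical call site): reuse the single shared definition instead
// of keeping a local copy of Eagle3RTInfo / extract_eagle_mode_from_config.
#include "speculative_decoding_utils.hpp"  // assumed header name from this PR

void configure_eagle3(ov::AnyMap& config, const std::filesystem::path& models_path) {
    auto eagle_rt_info = extract_eagle_mode_from_config(config, models_path);
    // ... use eagle_rt_info.eagle3_mode and eagle_rt_info.hidden_layers_list ...
}
```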
Force-pushed b4d9c7d to 8aed616 (compare)
Pull Request Overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 10 comments.
| if "eagle3" not in str(model_id).lower(): | ||
| hf_tokenizer = retry_request(lambda: AutoTokenizer.from_pretrained(model_id, local_files_only=local_files_only)) | ||
| opt_model = retry_request(lambda: model_class.from_pretrained(model_id, export=isinstance(model_id, str), compile=False, load_in_8bit=False, ov_config=get_default_llm_properties(), local_files_only=local_files_only)) | ||
| return opt_model, hf_tokenizer | ||
| else: | ||
| hf_tokenizer = None | ||
| opt_model = retry_request(lambda: model_class.from_pretrained(model_id, eagle3=True, export=isinstance(model_id, str), compile=False, load_in_8bit=False, ov_config=get_default_llm_properties(), local_files_only=local_files_only)) | ||
| return opt_model, hf_tokenizer |
Copilot AI (Nov 19, 2025)
Duplicated code in both branches: the opt_model creation logic is nearly identical except for the eagle3=True parameter. Consider refactoring to eliminate the duplication by setting the eagle3 parameter conditionally.
| if "eagle3" not in str(model_id).lower(): | |
| hf_tokenizer = retry_request(lambda: AutoTokenizer.from_pretrained(model_id, local_files_only=local_files_only)) | |
| opt_model = retry_request(lambda: model_class.from_pretrained(model_id, export=isinstance(model_id, str), compile=False, load_in_8bit=False, ov_config=get_default_llm_properties(), local_files_only=local_files_only)) | |
| return opt_model, hf_tokenizer | |
| else: | |
| hf_tokenizer = None | |
| opt_model = retry_request(lambda: model_class.from_pretrained(model_id, eagle3=True, export=isinstance(model_id, str), compile=False, load_in_8bit=False, ov_config=get_default_llm_properties(), local_files_only=local_files_only)) | |
| return opt_model, hf_tokenizer | |
| eagle3_flag = "eagle3" in str(model_id).lower() | |
| if not eagle3_flag: | |
| hf_tokenizer = retry_request(lambda: AutoTokenizer.from_pretrained(model_id, local_files_only=local_files_only)) | |
| else: | |
| hf_tokenizer = None | |
| opt_model = retry_request(lambda: model_class.from_pretrained( | |
| model_id, | |
| eagle3=eagle3_flag, | |
| export=isinstance(model_id, str), | |
| compile=False, | |
| load_in_8bit=False, | |
| ov_config=get_default_llm_properties(), | |
| local_files_only=local_files_only | |
| )) | |
| return opt_model, hf_tokenizer |
```python
if draft_model_id is None:
    generation_config = GenerationConfig(do_sample=False, max_new_tokens=20, ignore_eos=True, num_assistant_tokens=5)
    extended_perf_metrics = run_extended_perf_metrics_collection(main_model_id, generation_config, prompt, pipeline_type, draft_model_id)
    total_time = (time.perf_counter() - start_time) * 1000
else:
    if (pipeline_type == PipelineType.SPECULATIVE_DECODING):
        generation_config = GenerationConfig(do_sample=False, max_new_tokens=20, ignore_eos=True, num_assistant_tokens=5)
        extended_perf_metrics = run_extended_perf_metrics_collection(main_model_id, generation_config, prompt, pipeline_type, draft_model_id)
        total_time = (time.perf_counter() - start_time) * 1000
```
Copilot AI (Nov 19, 2025)
Lines 510-512 and 516-518 are identical. The nested condition on line 515 makes the logic unclear. Consider simplifying by removing the redundant else branch or restructuring the conditions.
| "qwen3_1.7b_eagle3": { | ||
| "name": "AngelSlim/Qwen3-1.7B_eagle3", | ||
| "convert_args": ["--task", "text-generation-with-past", "--trust-remote-code", "--eagle3"] | ||
| } |
Copilot AI (Nov 19, 2025)
The --eagle3 flag is passed in convert_args but there's no documentation or comment explaining what this flag does or why it's needed specifically for this model configuration.
```cpp
void ensure_num_assistant_tokens_is_set(ov::genai::GenerationConfig& config) {
    // Only num_assistant_tokens is supported, not assistant_confidence_threshold
    OPENVINO_ASSERT(
        config.assistant_confidence_threshold == 0.f,
```
Copilot AI (Nov 19, 2025)
Magic number 0.f should be defined as a named constant (e.g., DISABLED_ASSISTANT_CONFIDENCE_THRESHOLD) to clarify its meaning and improve maintainability.
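One possible spelling of that suggestion (the constant name and message wording are illustrative, not taken from the PR):

```cpp
// Named constant clarifying that 0.f means "feature disabled" (name is illustrative).
constexpr float DISABLED_ASSISTANT_CONFIDENCE_THRESHOLD = 0.f;

void ensure_num_assistant_tokens_is_set(ov::genai::GenerationConfig& config) {
    // Only num_assistant_tokens is supported, not assistant_confidence_threshold.
    OPENVINO_ASSERT(
        config.assistant_confidence_threshold == DISABLED_ASSISTANT_CONFIDENCE_THRESHOLD,
        "The stateful Eagle3 pipeline supports only num_assistant_tokens; "
        "assistant_confidence_threshold must remain disabled.");
}
```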
| ov::Tensor& input_ids, | ||
| ov::Tensor& attention_mask, | ||
| ov::Tensor& position_ids) { | ||
| OPENVINO_ASSERT(!m_tokens.empty() && token_count > 0, "Cannot build inputs: empty sequence or zero token count"); |
Copilot AI (Nov 19, 2025)
Error message on line 169 could be more specific by indicating which condition failed (empty tokens vs zero token count) to aid debugging.
| OPENVINO_ASSERT(!m_tokens.empty() && token_count > 0, "Cannot build inputs: empty sequence or zero token count"); | |
| OPENVINO_ASSERT(!m_tokens.empty(), "Cannot build inputs: empty sequence (m_tokens is empty)"); | |
| OPENVINO_ASSERT(token_count > 0, "Cannot build inputs: zero token count (token_count <= 0)"); |
```cpp
void share_embedding_weights(std::shared_ptr<ov::Model>& main_model, std::shared_ptr<ov::Model>& draft_model) {
    // extract embedding weight from main model
    auto find_embedding_gather = [](const std::shared_ptr<ov::Model>& model)
        -> std::shared_ptr<ov::Node> {
```
Copilot AI (Nov 19, 2025)
The threshold value 1000 for vocabulary size is a heuristic but lacks explanation. Add a comment explaining why this specific value was chosen.
Suggested change:

```cpp
        -> std::shared_ptr<ov::Node> {
        // The threshold value 1000 for vocabulary size is a heuristic chosen because most modern language models
        // (e.g., GPT, Llama, Falcon) have vocabulary sizes well above 1000, while other embedding layers (e.g., for classification)
        // typically have much smaller vocabularies. This helps reliably identify the token embedding layer in the model.
        // Adjust this value if working with models with unusually small or large vocabularies.
```
```cpp
if (m_device.find("NPU") != std::string::npos) {
    // Scale input down by 100x before MatMul to avoid FP16 overflow, then scale result back up
    // The factor 100 (0.01 and 100.0) is an empirical value
    auto scale_down_const = std::make_shared<v0::Constant>(matmul_input0.get_element_type(), ov::Shape{}, 0.01f);
```
Copilot AI (Nov 19, 2025)
The scaling factors (0.01 and 100.0) are described as empirical values in the comment on line 170, but there's no explanation of how they were determined or under what conditions they might need adjustment.
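For context, the overall rewrite being discussed looks roughly like the sketch below: a condensed version of the transformation, with the helper name and the transpose_b flag as assumptions. The factors come straight from the PR comment and are tuned empirically rather than derived.

```cpp
#include <memory>
#include "openvino/op/constant.hpp"
#include "openvino/op/matmul.hpp"
#include "openvino/op/multiply.hpp"

// Replace `input x weights` with (input * 0.01) x weights, then * 100.0, so the
// FP16 intermediate of the MatMul stays within range on NPU while the final
// result is numerically equivalent. The factor 100 is the empirical value from
// this PR; it may need retuning for models with different activation ranges.
ov::Output<ov::Node> make_rescaled_matmul(const ov::Output<ov::Node>& input,
                                          const ov::Output<ov::Node>& weights) {
    namespace op = ov::op;
    auto scale_down = std::make_shared<op::v0::Constant>(input.get_element_type(), ov::Shape{}, 0.01f);
    auto scaled_in  = std::make_shared<op::v1::Multiply>(input, scale_down);
    auto matmul     = std::make_shared<op::v0::MatMul>(scaled_in, weights,
                                                       /*transpose_a=*/false, /*transpose_b=*/true);
    auto scale_up   = std::make_shared<op::v0::Constant>(matmul->get_element_type(), ov::Shape{}, 100.0f);
    return std::make_shared<op::v1::Multiply>(matmul, scale_up);
}
```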
```cpp
new_matmul->set_friendly_name(matmul_node->get_friendly_name() + "/matmul");

// Scale result back up to maintain numerical equivalence
auto scale_up_const = std::make_shared<v0::Constant>(new_matmul->get_element_type(), ov::Shape{}, 100.0f);
```
Copilot AI (Nov 19, 2025)
(Same remark as above: the scaling factors 0.01 and 100.0 are described as empirical, with no explanation of how they were determined or when they might need adjustment.)
```cpp
        }
    }
} else {
    OPENVINO_ASSERT(false, "missing hidden state from target model to eagle draft model");
```
Copilot AI (Nov 19, 2025)
Error message is unclear about what action the user should take. Consider adding guidance such as 'Ensure target model has generated hidden states before draft model inference'.
Suggested change:

```cpp
OPENVINO_ASSERT(false, "Missing hidden state from target model to eagle draft model. Ensure the target model has generated hidden states before draft model inference.");
```
```yaml
  timeout: 90
- name: 'EAGLE3 speculative decoding tests'
  cmd: |
    python -m pip install git+https://github.com/xufang-lisa/optimum-intel.git@ea9607daf32919024cdd4390deec9693a7b64d23
```
Copilot AI (Nov 19, 2025)
Installing from a specific commit hash (ea9607daf32919024cdd4390deec9693a7b64d23) in a personal GitHub repository is fragile and not reproducible long-term. Consider using a tagged release from the official repository or documenting why this specific commit is required.
Suggested change:

```yaml
    # Install optimum-intel from the official repository for reproducibility.
    python -m pip install optimum-intel
```
Description
Ticket: https://jira.devtools.intel.com/browse/CVS-175909
Fixes #(issue)
Checklist: