Conversation

@GuoliangShiIntel commented on Oct 30, 2025

Description

Ticket: https://jira.devtools.intel.com/browse/CVS-175909

Fixes #(issue)

Checklist:

  • Tests have been updated or added to cover the new code
  • This patch fully addresses the ticket.
  • I have made corresponding changes to the documentation

@github-actions bot added the following labels on Oct 30, 2025: category: llm_bench, category: continuous batching, category: LLM, category: sampling, category: speculative decoding, category: GHA, category: LLM samples, category: CPP API, no-match-files, category: GGUF
Copilot AI review requested due to automatic review settings November 5, 2025 03:02
Copilot AI left a comment

Pull Request Overview

This PR enables Eagle3 speculative decoding with top-1 proposals, using an SDPA (Scaled Dot-Product Attention) stateful pipeline on NPU devices. Eagle3 is an advanced speculative decoding algorithm that uses hidden states from the target model to guide draft model token generation (a usage sketch follows the key changes below).

Key changes:

  • Implements StatefulEagle3LLMPipeline for Eagle3 speculative decoding on NPU
  • Adds Eagle3 model transformations (embedding weight sharing, hidden state extraction)
  • Integrates Eagle3 into continuous batching pipeline with draft-to-target token mapping
  • Updates test infrastructure to support Eagle3 models and excludes them from standard API tests
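
For orientation, here is a minimal sketch of how a caller might construct such a pipeline through the GenAI C++ API. The "eagle3_mode" key mirrors the config option this PR's extract_eagle_mode_from_config() consumes; the model paths and the NPU device strings are placeholders, not a verified end-to-end setup:

#include <iostream>

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    ov::AnyMap properties{
        // Attach the Eagle3 draft model; the path is a placeholder.
        ov::genai::draft_model("Qwen3-1.7B_eagle3", "NPU"),
        // Assumed Eagle3 switch, matching the config key parsed by this PR.
        {"eagle3_mode", true},
    };

    ov::genai::LLMPipeline pipe("Qwen3-1.7B", "NPU", properties);

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    std::string result = pipe.generate("Why is the sky blue?", config);
    std::cout << result << std::endl;
}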

Reviewed Changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.

Summary per file:

  • tools/llm_bench/task/text_generation.py: Adds an eagle3_mode flag to disable special tokens during encoding for better compression
  • tools/llm_bench/benchmark.py: Adds a command-line argument for the eagle3_mode flag
  • tests/python_tests/utils/hugging_face.py: Adds Eagle3-specific model loading logic with conditional tokenizer handling
  • tests/python_tests/test_continuous_batching.py: Adds Eagle3 test cases and parametrization for speculative decoding tests
  • tests/python_tests/samples/test_speculative_decoding_lm.py: Adds an Eagle3-specific test class (currently skipped)
  • tests/python_tests/samples/conftest.py: Adds Eagle3 model configurations for Qwen3-1.7B models
  • src/cpp/src/speculative_decoding/update_request_structs.hpp: Adds a hidden_states field to the GeneratedSequence structure
  • src/cpp/src/speculative_decoding/speculative_decoding_utils.*: Implements Eagle3 configuration extraction and runtime info structures
  • src/cpp/src/speculative_decoding/speculative_decoding_stateful_eagle3.*: Implements the complete Eagle3 pipeline with inference wrappers and validation
  • src/cpp/src/speculative_decoding/speculative_decoding_impl.*: Refactors speculative decoding to support a common generation strategy
  • src/cpp/src/speculative_decoding/speculative_decoding_eagle3_impl.*: Implements Eagle3-specific transformations and continuous batching integration
  • src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.*: Adds the Eagle3DecodingImpl class and hidden state management
  • src/cpp/src/sequence_group.hpp: Adds hidden state storage and getters/setters to the Sequence class
  • src/cpp/src/sampling/sampler.*: Adds draft-to-target token mapping support
  • src/cpp/src/llm/pipeline.cpp: Integrates Eagle3 pipeline selection based on configuration
  • src/cpp/src/continuous_batching/pipeline.cpp: Adds Eagle3 pipeline initialization in continuous batching
  • src/cpp/include/openvino/genai/continuous_batching_pipeline.hpp: Declares Eagle3DecodingImpl as a friend class
  • src/cpp/src/continuous_batching/model_runner.hpp: Adds hidden state flags and management infrastructure
  • samples/python/text_generation/speculative_decoding_lm.py: Updates the sample to use the NPU device with a reduced token count
  • .github/workflows/*.yml: Adds Eagle3-specific test workflows and excludes Eagle3 from standard tests


Copilot AI review requested due to automatic review settings November 6, 2025 07:51
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 2 comments.



Copilot AI review requested due to automatic review settings November 6, 2025 08:34
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp:1

  • [nitpick] There appears to be a missing comment or documentation explaining what 'adjust_factor' represents and why it's needed for Eagle3. Consider adding a brief comment for maintainability.
// Copyright (C) 2023-2025 Intel Corporation


Copilot AI review requested due to automatic review settings November 17, 2025 05:25
@GuoliangShiIntel force-pushed the sgl/eagle_3_npu_support_v2 branch from ff7728e to a47dbaf on November 17, 2025 05:25
@github-actions bot removed the category: llm_bench label on Nov 17, 2025
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.



Comment on lines 25 to 62
struct Eagle3RTInfo {
    bool eagle3_mode = false;
    std::vector<int> hidden_layers_list;
    std::filesystem::path dt_mapping_table;
};

Eagle3RTInfo
extract_eagle_mode_from_config(ov::AnyMap& config, const std::filesystem::path& models_path) {
    Eagle3RTInfo eagle_rt_info;
    if (config.find("eagle3_mode") != config.end()) {
        eagle_rt_info.eagle3_mode = config.at("eagle3_mode").as<bool>();
        config.erase("eagle3_mode");
        if (config.find("hidden_layers_list") != config.end()) {
            eagle_rt_info.hidden_layers_list = config.at("hidden_layers_list").as<std::vector<int>>();
            config.erase("hidden_layers_list");
        } else {
            // compute the layers from number of hidden layers
            auto config_file_path = models_path / "config.json";
            if (!std::filesystem::exists(config_file_path))
                OPENVINO_THROW("cannot deduce layers for hidden layer extraction");
            std::ifstream file(config_file_path);

            nlohmann::json data = nlohmann::json::parse(file);
            using ov::genai::utils::read_json_param;
            int num_decoder_layers = 0;
            read_json_param(data, "num_hidden_layers", num_decoder_layers);
            OPENVINO_ASSERT(num_decoder_layers > 3, "num_decoder_layers is too small to deduce hidden layers for extraction");
            // The following default hidden layer selection corresponds to the EAGLE reference implementation:
            // https://github.com/SafeAILab/EAGLE/blob/0ea94696/eagle/model/modeling_llama_kv.py#L1138
            // These layers (2, num_decoder_layers / 2, num_decoder_layers - 3) are chosen to capture features from
            // early, middle, and late stages of the decoder, as recommended by the EAGLE authors.
            // If you wish to use different layers, provide the "hidden_layers_list" parameter in the config.
            eagle_rt_info.hidden_layers_list = { 2, num_decoder_layers / 2, num_decoder_layers - 3 };
        }
    }
    return eagle_rt_info;
}

Copilot AI commented on Nov 17, 2025

The Eagle3RTInfo struct and extract_eagle_mode_from_config function are duplicated between this file and speculative_decoding_utils.hpp/cpp. This code duplication should be eliminated by using the implementation from speculative_decoding_utils instead.

Suggested change: delete the duplicated Eagle3RTInfo struct and extract_eagle_mode_from_config function quoted above (lines 25 to 62) in favor of the speculative_decoding_utils implementation.

@GuoliangShiIntel force-pushed the sgl/eagle_3_npu_support_v2 branch from b4d9c7d to 8aed616 on November 19, 2025 05:26
Copilot AI review requested due to automatic review settings November 19, 2025 05:26
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 10 comments.



Comment on lines +169 to +176
if "eagle3" not in str(model_id).lower():
hf_tokenizer = retry_request(lambda: AutoTokenizer.from_pretrained(model_id, local_files_only=local_files_only))
opt_model = retry_request(lambda: model_class.from_pretrained(model_id, export=isinstance(model_id, str), compile=False, load_in_8bit=False, ov_config=get_default_llm_properties(), local_files_only=local_files_only))
return opt_model, hf_tokenizer
else:
hf_tokenizer = None
opt_model = retry_request(lambda: model_class.from_pretrained(model_id, eagle3=True, export=isinstance(model_id, str), compile=False, load_in_8bit=False, ov_config=get_default_llm_properties(), local_files_only=local_files_only))
return opt_model, hf_tokenizer
Copilot AI commented on Nov 19, 2025

Duplicated code in both branches: the opt_model creation logic is nearly identical except for the eagle3=True parameter. Consider refactoring to eliminate duplication by conditionally setting eagle3 parameter.

Suggested change
- if "eagle3" not in str(model_id).lower():
-     hf_tokenizer = retry_request(lambda: AutoTokenizer.from_pretrained(model_id, local_files_only=local_files_only))
-     opt_model = retry_request(lambda: model_class.from_pretrained(model_id, export=isinstance(model_id, str), compile=False, load_in_8bit=False, ov_config=get_default_llm_properties(), local_files_only=local_files_only))
-     return opt_model, hf_tokenizer
- else:
-     hf_tokenizer = None
-     opt_model = retry_request(lambda: model_class.from_pretrained(model_id, eagle3=True, export=isinstance(model_id, str), compile=False, load_in_8bit=False, ov_config=get_default_llm_properties(), local_files_only=local_files_only))
-     return opt_model, hf_tokenizer
+ eagle3_flag = "eagle3" in str(model_id).lower()
+ if not eagle3_flag:
+     hf_tokenizer = retry_request(lambda: AutoTokenizer.from_pretrained(model_id, local_files_only=local_files_only))
+ else:
+     hf_tokenizer = None
+ opt_model = retry_request(lambda: model_class.from_pretrained(
+     model_id,
+     eagle3=eagle3_flag,
+     export=isinstance(model_id, str),
+     compile=False,
+     load_in_8bit=False,
+     ov_config=get_default_llm_properties(),
+     local_files_only=local_files_only
+ ))
+ return opt_model, hf_tokenizer

Comment on lines +509 to +518
if draft_model_id is None:
    generation_config = GenerationConfig(do_sample=False, max_new_tokens=20, ignore_eos=True, num_assistant_tokens=5)
    extended_perf_metrics = run_extended_perf_metrics_collection(main_model_id, generation_config, prompt, pipeline_type, draft_model_id)
    total_time = (time.perf_counter() - start_time) * 1000

else:
    if (pipeline_type == PipelineType.SPECULATIVE_DECODING):
        generation_config = GenerationConfig(do_sample=False, max_new_tokens=20, ignore_eos=True, num_assistant_tokens=5)
        extended_perf_metrics = run_extended_perf_metrics_collection(main_model_id, generation_config, prompt, pipeline_type, draft_model_id)
        total_time = (time.perf_counter() - start_time) * 1000
Copilot AI commented on Nov 19, 2025

Lines 510-512 and 516-518 are identical. The nested condition on line 515 makes the logic unclear. Consider simplifying by removing the redundant else branch or restructuring the conditions.

Comment on lines +151 to 154
"qwen3_1.7b_eagle3": {
"name": "AngelSlim/Qwen3-1.7B_eagle3",
"convert_args": ["--task", "text-generation-with-past", "--trust-remote-code", "--eagle3"]
}
Copilot AI commented on Nov 19, 2025

The --eagle3 flag is passed in convert_args but there's no documentation or comment explaining what this flag does or why it's needed specifically for this model configuration.

void ensure_num_assistant_tokens_is_set(ov::genai::GenerationConfig& config) {
    // Only num_assistant_tokens is supported, not assistant_confidence_threshold
    OPENVINO_ASSERT(
        config.assistant_confidence_threshold == 0.f,
Copilot AI commented on Nov 19, 2025

Magic number 0.f should be defined as a named constant (e.g., DISABLED_ASSISTANT_CONFIDENCE_THRESHOLD) to clarify its meaning and improve maintainability.
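
A minimal sketch of that refactor, using the constant name from the suggestion above (the assertion message here is illustrative, not the PR's actual text):

// 0.f is the sentinel meaning "assistant_confidence_threshold disabled".
constexpr float DISABLED_ASSISTANT_CONFIDENCE_THRESHOLD = 0.f;

void ensure_num_assistant_tokens_is_set(ov::genai::GenerationConfig& config) {
    // Only num_assistant_tokens is supported, not assistant_confidence_threshold
    OPENVINO_ASSERT(
        config.assistant_confidence_threshold == DISABLED_ASSISTANT_CONFIDENCE_THRESHOLD,
        "assistant_confidence_threshold is not supported; use num_assistant_tokens instead");
}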

    ov::Tensor& input_ids,
    ov::Tensor& attention_mask,
    ov::Tensor& position_ids) {
    OPENVINO_ASSERT(!m_tokens.empty() && token_count > 0, "Cannot build inputs: empty sequence or zero token count");
Copilot AI commented on Nov 19, 2025

Error message on line 169 could be more specific by indicating which condition failed (empty tokens vs zero token count) to aid debugging.

Suggested change
- OPENVINO_ASSERT(!m_tokens.empty() && token_count > 0, "Cannot build inputs: empty sequence or zero token count");
+ OPENVINO_ASSERT(!m_tokens.empty(), "Cannot build inputs: empty sequence (m_tokens is empty)");
+ OPENVINO_ASSERT(token_count > 0, "Cannot build inputs: zero token count (token_count <= 0)");

void share_embedding_weights(std::shared_ptr<ov::Model>& main_model, std::shared_ptr<ov::Model>& draft_model) {
    // extract embedding weight from main model
    auto find_embedding_gather = [](const std::shared_ptr<ov::Model>& model)
        -> std::shared_ptr<ov::Node> {
Copilot AI commented on Nov 19, 2025

The threshold value 1000 for vocabulary size is a heuristic but lacks explanation. Add a comment explaining why this specific value was chosen.

Suggested change
-     -> std::shared_ptr<ov::Node> {
+     -> std::shared_ptr<ov::Node> {
+         // The threshold value 1000 for vocabulary size is a heuristic chosen because most modern language models
+         // (e.g., GPT, Llama, Falcon) have vocabulary sizes well above 1000, while other embedding layers (e.g., for classification)
+         // typically have much smaller vocabularies. This helps reliably identify the token embedding layer in the model.
+         // Adjust this value if working with models with unusually small or large vocabularies.
if (m_device.find("NPU") != std::string::npos) {
// Scale input down by 100x before MatMul to avoid FP16 overflow, then scale result back up
// The factor 100 (0.01 and 100.0) is an empirical value
auto scale_down_const = std::make_shared<v0::Constant>(matmul_input0.get_element_type(), ov::Shape{}, 0.01f);
Copilot AI commented on Nov 19, 2025

The scaling factors (0.01 and 100.0) are described as empirical values in the comment on line 170, but there's no explanation of how they were determined or under what conditions they might need adjustment.

    new_matmul->set_friendly_name(matmul_node->get_friendly_name() + "/matmul");

    // Scale result back up to maintain numerical equivalence
    auto scale_up_const = std::make_shared<v0::Constant>(new_matmul->get_element_type(), ov::Shape{}, 100.0f);
Copilot AI commented on Nov 19, 2025

The scaling factors (0.01 and 100.0) are described as empirical values in the comment on line 170, but there's no explanation of how they were determined or under what conditions they might need adjustment.
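
Reading the two hunks together, the transformation appears to follow the pattern below, reconstructed here for illustration (matmul_input1 and the names of the Multiply nodes are assumptions; the surrounding graph-rewrite plumbing is omitted):

using namespace ov::op;

// Scale the MatMul input down by 100x so FP16 intermediates on NPU stay
// well below the fp16 maximum of ~65504.
auto scale_down_const = std::make_shared<v0::Constant>(
    matmul_input0.get_element_type(), ov::Shape{}, 0.01f);
auto scaled_input = std::make_shared<v1::Multiply>(matmul_input0, scale_down_const);

auto new_matmul = std::make_shared<v0::MatMul>(scaled_input, matmul_input1);
new_matmul->set_friendly_name(matmul_node->get_friendly_name() + "/matmul");

// Multiply the result by 100x to restore numerical equivalence: scaling one
// MatMul input by a scalar scales the output by the same factor.
auto scale_up_const = std::make_shared<v0::Constant>(
    new_matmul->get_element_type(), ov::Shape{}, 100.0f);
auto rescaled = std::make_shared<v1::Multiply>(new_matmul, scale_up_const);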

        }
    }
} else {
    OPENVINO_ASSERT(false, "missing hidden state from target model to eagle draft model");
Copilot AI commented on Nov 19, 2025

Error message is unclear about what action the user should take. Consider adding guidance such as 'Ensure target model has generated hidden states before draft model inference'.

Suggested change
- OPENVINO_ASSERT(false, "missing hidden state from target model to eagle draft model");
+ OPENVINO_ASSERT(false, "Missing hidden state from target model to eagle draft model. Ensure the target model has generated hidden states before draft model inference.");

    timeout: 90
  - name: 'EAGLE3 speculative decoding tests'
    cmd: |
      python -m pip install git+https://github.com/xufang-lisa/optimum-intel.git@ea9607daf32919024cdd4390deec9693a7b64d23
Copilot AI commented on Nov 19, 2025

Installing from a specific commit hash (ea9607daf32919024cdd4390deec9693a7b64d23) in a personal GitHub repository is fragile and not reproducible long-term. Consider using a tagged release from the official repository or documenting why this specific commit is required.

Suggested change
- python -m pip install git+https://github.com/xufang-lisa/optimum-intel.git@ea9607daf32919024cdd4390deec9693a7b64d23
+ # Install optimum-intel from the official repository for reproducibility.
+ python -m pip install optimum-intel
