forked from vllm-project/vllm
Imarkov/conditional compilation ranges #127
Closed
ProExpertProg wants to merge 304 commits into imarkov/fused_allreduce_torch_native from imarkov/conditional_compilation_ranges
Changes from 20 commits

Commits (304)
d381eb9
Multi turn benchmark progress bar for synthetic conversation generati…
segevido 2e78150
[CI] Add mergify rules for `nvidia` label (#28417)
mgoin b30dfa0
[Attention] Refactor CUDA attention backend selection logic (#24794)
MatthewBonanni 7dbe6d8
Fix Fused MoE LoRA Triton kernel bug (#28450)
chaojun-zhang afffd3c
[Model] Pass `mm_features` directly into `get_mrope_input_positions` …
DarkLight1337 3380543
Add request timeout override for multi-turn benchmarks (#28386)
segevido fa19702
[Docs] Fix grammar in CPU installation guide (#28461)
maryamtahhan a1448b4
[Kernels] Split up fused_moe/layer.py, isolate more modular kernel co…
bnellnm 533b018
[BugFix] Fix Failing Ruff Check (#28469)
jvlunteren a90ad7d
Add @markmc to CODEOWNERS for Observability (#28457)
markmc b886068
[BugFix] Fix RuntimeError in PixtralHFAttention on CPU/XPU (#28444)
faaany 3143eb2
[BugFix] Add test_outputs.py to CI pipeline (#28466)
usberkeley 287bbbe
[Doc] Fix typo in serving docs (#28474)
the-codeboy f9a4087
Remove weight_scale.T special case for SM90 Block FP8 CUTLASS kernel …
mgoin a7ef3eb
[NIXL] Generalize block-first backend layouts (FlashInfer-like) (#28282)
NickLucche 68c09ef
[Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Mo…
izhuhaoran 05576df
[ROCm][Quantization] extend AMD Quark to support mixed-precision quan…
xuebwang-amd 5a1271d
[Quantization] fix attention quantization of gpt_oss model (#27334)
xuebwang-amd e553424
[CI/Build] Refactor Attention backend for test_prefix_prefill from xf…
zhewenl 684f254
Prefer FlashAttention MLA as default over FlashMLA (#27363)
MatthewBonanni 6c3c0f8
[Kernel] Optimize rms_norm kernel (#27931)
xyang16 d5edcb8
[BugFix] Fix Siglip2Attention on XPU (#28448)
faaany 76e4dcf
[Misc] Remove unused attention prefix prefill ops functions (#26971)
lgeiger 4228be7
[Perf] Use np.ndarray instead of list[list[int]] to reduce GC overhea…
Jialin de120bc
[V0 deprecation] Clean up num_prefill_tokens logic for V0 (#28203)
gcanlin 8c32c6e
[Misc] fix typo in DCP comment (#28389)
Livinfly 9d1c474
[LoRA][1/N]Remove LoRA extra vocab (#28382)
jeejeelee df4d3a4
[TPU] Rename path to tpu platform (#28452)
kyuyeunk d4902ba
[Misc] Cleanup Executor interface (#28441)
wangxiyuan 28534b9
Add Zurich vLLM Meetup (#28488)
mgoin e5f599d
[Bugfix] Disable shared expert overlap if Marlin MoE is used (#28410)
mgoin 412e153
[Feature] Allow configuring FlashInfer workspace size (#28269)
maxyanghu d235395
Use FLASHINFER MLA backend when testing fp8_kv_scale_compile (#28491)
adabeyta 1788aa1
[BugFix] Graceful handling of torch symm mem errors. (#27671)
ilmarkov 48c8793
[Frontend] Change CompilationMode to a proper Enum (#28165)
gmagogsfm 3f770f4
[Performance] Cache loaded custom logitsprocs to avoid overheads (#28…
Isotr0py e171039
[[V0 deprecation]]Remove VLLM_USE_V1 env (#28204)
wangxiyuan 7f829be
[CPU] Refactor CPU attention backend (#27954)
bigPYJ1151 9f0247c
`VLLM_USE_TRITON_FLASH_ATTN` V0 variable deprecation (#27611)
AndreasKaratzas cbb799e
[Model][Qwen3VL] Simplify `get_mrope_input_positions` using numpy (#2…
lgeiger 4ccffe5
[Core] Encoder separation for Encode-Prefill-Decode Disaggregation (#…
fake0fan b9ce9a3
[BugFix] Add fallback path in `apply_rotary_pos_emb_flashattn` for no…
faaany f31419e
[Benchmark] Add retry support to fix workload bias in multi-turn benc…
ai-jz ac0bb2c
[Core] Cache `vllm_is_batch_invariant` (#28304)
lgeiger 91864b7
[CI/Build] Fix crash due to removed VLLM_USE_V1 attribute in EPD (#28…
fake0fan c748355
[CI] Introduce autorun_on_main feature (#27836)
hl475 1761dea
[BugFix]: --enable-lora with model granite-4.0-micro crash (#27733)
yyzxw d3ade61
[Model] fix glm4_moe_mtp load weights with GLM-4.6 checkpoint. (#27597)
wuyaoxuehun a4730c1
[XPU]Fix crash due to removed VLLM_USE_V1 attribute (#28520)
chaojun-zhang d143152
[KVConnector] Enable get_block_ids_with_load_errors() in LMCache conn…
ziruiliu c5f10cc
add cpu option for p/d in nixl_connector (#28356)
ZhengHongming888 edb59a9
[ROCm] [Bugfix] Fix `fused_qknorm_rope_kernel` rocm compatibility (#2…
tjtanaa a9d18b5
[Bugfix] Fix gpt_oss packed_modules_mapping (#28536)
jeejeelee 10138c9
[V0 deprecation] Deprecate use_v1 parameter (#28112)
wangxiyuan 54aecd9
Fix pre-commit (and XPU) on `main` (#28556)
hmellor f76e85c
[Performance][Hopper] Avoid M dim padding to 4x for most cases (due t…
alexm-redhat bc5bd45
[Refactor] Remove redundant TP gather/split in split_qkv in QwenVL (#…
gcanlin 728a9eb
[Misc] Refactor Attention kv transfer methods into decorator (#27816)
NickLucche a742134
Remove deprecated fields from `CompilationConfig` (#27593)
hmellor 3044195
[Perf] Refactor cudagraph_support to enable full CUDA graphs for spec…
benchislett bac9045
Implement ARC KV cache eviction policy for CPU offloader (#27039)
albertoperdomo2 a1e7fa3
[EPLB][ROCm]: support EPBL for ROCm backend (#27731)
PerryZhang01 64d57c3
[Model] [Config] Correctly identify granite-4.0-micro as non-hybrid m…
tdoublep 319abd5
Remove dynamic shape
ilmarkov a39dd7b
[CI] Skip "Multi-Modal Models Test (Extended) 3" test that's broken i…
hmellor 94a9ebc
[KV connector][WIP] KV cache proxy based on LMCache multi-process mod…
ApostaC 58ce8d1
[BugFix] Priority scheduling and spec tokens preemption (#28558)
andylolu2 478ee51
[Misc]Fix typo in llm_engine.py (#28584)
frank-wei 74a9a9f
[Performance][B200] Fix deepgemm prologue (#27897)
varun-sundar-rabindranath d8140b9
[ROCM] Fix ROCm warnings, environment flag access, and GEMM kernel na…
vllmellm 3eb0c26
[TPU] Support GCS path in VLLM_TORCH_PROFILER_DIR (#28487)
QiliangCui 10f01d5
[Bugfix] Adjust Marlin CUDA arch selection to 8.0+PTX;9.0+PTX (#28294)
mgoin 4ca5cd5
[Core][AMD] Migrate fully transparent sleep mode to ROCm platform (#1…
HollowMan6 69d0e90
[MoE][Kernel][Perf] Improve Shared Expert Stream Overlap (#28406)
alexm-redhat 51c599f
Skip models that cannot currently init on Transformers v5 (#28471)
hmellor 52eadce
[Docs] Update meetups.md description (#28583)
mgoin d75ad04
[ROCm][Bugfix] Revert removing setuptools version restriction (#28592)
gshtras 2dacd57
[platform] Move get_cu_count to utils (#27005)
wangxiyuan a543e67
[Bugfix] Fix SM100 gpt-oss regression due to faulty attn sink support…
mgoin 8832fff
[BugFix] Fix `mm_encoder_attn_backend` arg type checking (#28599)
njhill 3226283
[Docs] Add some details about what the MoE block needs for the Transf…
hmellor 97d1c99
Rename clashing method names for vLLM model protocol (#27583)
hmellor a1d3866
[n-gen] DO NOT repeatedly return finished child requests (#28591)
Jialin 7c38ed0
[Frontend] split append tool output (#28333)
qandrew 1a0b157
[Frontend][responsesAPI][1/n] convert responses API tool input to cha…
qandrew 7dca0c9
[BugFix][ROCm] Fix `get_cu_count` missing variable error (#28608)
ganyi1996ppo dbbe0c7
[XPU] Support Triton path for LoRA operations on XPU (#28511)
faaany 7e082bc
Support DeepEP for Kimi-k2-thinking through enabling gemm selection f…
luccafong d44fbba
[build][cmake]: Bundle static ACL and torch libgomp for CPU extension…
Radu2k ca00b1b
[ROCm][BugFix] Remove the usage of `device_info` from aiter (#28383)
ganyi1996ppo 4504e80
[Bugfix] Prevent crash on empty grammar string (#28210)
tjandy98 c33b87e
Use official xformers-0.0.33 built for PT 2.9 (#28600)
huydhn 4ab34f6
Add NUMA node validation for CPU thread binding (#28555)
usberkeley fa183e9
[Bugfix] fix kimi-linear crash (#28445)
ZJY0516 5c9ad13
[Frontend] supports interleaved thinking (#28531)
chaunceyjiang 11ac9dd
Support all interleaved layer types (#28485)
sarckk d168de0
Make ranges inclusive-inclusive
ilmarkov e63fd44
Fix: Correctly filter special tokens in benchmark_prefix_caching (#28…
dw2761 5e97320
[BugFix] Fix type error when assign a trition kernel tensor to a torc…
liuzijing2014 c428e8d
Fix io processor pooling #28273 (#28484)
baonudesifeizhai c47b6c8
[XPU] add sym params to IPEXConfig (#28611)
zufangzhu c9fe6ab
[Bugfix] Fix FPS value type for Qwen2.5-Omni video processing (#28630)
faaany 86d15bf
[Hardware][PowerPC] Fix fp16 compilation error for Power in cpu atten…
Akashcodes732 8da2f28
[ROCm][BugFix]Fix `get_cu_count` in rocm_aiter_fa.py (#28618)
ganyi1996ppo a7791ea
[CI/Build] Install uv for AMD MI300: Language Models Tests (Hybrid) %…
amdfaa 07a606a
[CI Failure] Fix backend selection for encoder-only models (#28534)
hl475 3035d1a
[BugFix] DeepSeek-OCR: apply NoRepeatNGramLogitsProcessor to greedy p…
YuanpingSong b230286
Fix `get_num_experts` when config sets it explicitly to `None` (#28652)
hmellor d338775
[Misc] Turn off encoder torch compile by default (#28634)
ywang96 06c4873
Rewrite C++ meta funcs to Python (#28595)
janeyx99 327c0a9
[BugFix] Ensure `EngineArgs.create_engine_config` is idempotent (#28515)
njhill fdfd507
[TPU] patch TPU wheel build script to resolve metadata issue (#27279)
jcyang43 fe1cd77
[Performance][B200] silu_mul_quant: pack scales in int32 (#28358)
varun-sundar-rabindranath 119c492
[Bugfix] Fix validate model input for decoder models (#27099)
yannicks1 f9f3b59
[Attention][Bugfix] Fix FA sink support (#28660)
MatthewBonanni 5d6ce2b
[Perf] Support stream interval for reducing host overhead (#27869)
elvischenv 968060c
[bugfix] correct local_chunk_len for DCP in reorg_kvcache with long c…
pisceskkk 262d263
[Bugfix] Eliminate tuple inputs to submodules in graph partitioning (…
gmagogsfm faed7bf
[Bugfix] [CPU] bump torch to 2.9.0 for Darwin to fix segmentation fau…
kebe7jun 1b622de
[Misc] Update CODEOWNERS for simon-mo and comaniac (#28675)
simon-mo e64011f
[CI] Bug: Fix ci entrypoint pooling (#28684)
yewentao256 6e25b1c
[KV Connector] Test async mode in scheduler tests (#28550)
markmc f2b8e1c
Mirrored test group definitions for AMD (2025-11-11) (#28573)
Alexei-V-Ivanov-AMD 4d5943b
[quantization][config] enable override existing quant_config (#28510)
ILikeIneine 2aa75c7
[ROCm] Bump up the version of amd-smi to 6.4.3 (#28680)
SageMoore 622e610
[CPU][Bugfix] Fix Apple Silicon M1 compilation failure (#28681)
mgoin b39a502
[ci][amd] fix basic models extra init test (#28676)
bradleyhd 01bea11
[Misc] Remove `warn_for_unimplemented_methods` (#28613)
DarkLight1337 da14ae0
[XPU][CI]disable lm cache uts (#28696)
jikunshang 0aecd91
[Misc] Update xformers to 0.33.0.post1 (#28678)
ywang96 0b25498
[Misc] add ignore mapper for quark quantization (#28275)
haoyangli-amd 15ae8e0
[Bugfix][CI/Test][Spec Decode] Fix illegal memory access in offline_i…
rasmith 9310357
[BugFix][CI/Build][ROCM] Fix import error and apply assert in appropr…
rasmith 529cea3
use default CCL_ZE_IPC_EXCHANGE (#28700)
yma11 b65e752
Merge branch 'main' into imarkov/conditional_compilation_ranges
ilmarkov c36bcfe
[Bugfix] fix dots.ocr pp support (#28705)
ZJY0516 bc3e430
[BugFix] Fix multi-modal async scheduling race condition (#28706)
njhill c9a3a02
Add output token counting to gsm8k eval (#28594)
mgoin fd75d3e
[Minor] avoid register new custom and just import silly_attn (#28578)
BoyuanFeng 8cfbe89
[Misc] fix comment in test_envs (#28529)
xingliu14 ecf8230
[Metrics] Log number of preempted requests (#28522)
610lyn 360bd87
[Frontend] Added chat-style multimodal support to /classify. (#27516)
WorldExplored 41b92f7
[Model][MM] Extract conv layer as CustomOp (#28455)
shen-shanshan 4516d44
[DCP] Support Decode Context Parallel (DCP) for GQA with Flashinfer (…
gjc0824 9324e10
Fix KV sharing fast prefill with cudagraph enabled (#28537)
sarckk db56a59
[BugFix] Fix FA3 IMA with FULL_AND_PIECEWISE and cascade attention (d…
LucasWilkinson 8d3748d
[Doc] Fix macOS installation dependency resolution issue (#26721)
shahfasal 433c0f8
[Model] Fix bailing_moe accuracy problem (#28277)
zhaozx-cn 96b23b8
[Bugfix][Nixl] Fix kernel physical<>logical block_size issue (#28677)
NickLucche 511a6b6
[Config] Clean up SchedulerConfig initialization (#28665)
DarkLight1337 3f8a874
[Kernels] Enable FlashInfer FP8 Blockscale on SM90 (for TEP DSR1) (#2…
djmmoss c934cae
[Fix] improve aspect ratio in dummy image generation and add common …
dongbo910220 5f3cd7f
[Docs] Update the name of `Transformers backend` -> `Transformers mod…
hmellor d54a18a
[CI][CPU] Smoke test for Apple Silicon using GHA MacOS runner (#28688)
mgoin 6f1e7f7
[DisaggEverything] Tokens in<>out `/generate` endpoint (#24261)
NickLucche 8cc40f8
[Attention] Bump FA for removed method (#28429)
MatthewBonanni a17e36f
Fix typo in comment: existance -> existence (#28737)
OthmanMohammad 0854248
Remove audio optional dependency for mistral-common (#28722)
juliendenize cdd7025
[kernel] Improve FP8 PTPC on Hopper for larger shapes (#28692)
czhu-cohere 9261eb3
docs(lora_resolvers): clarify multi-resolver order and storage path r…
wangchen615 964d65d
LLaMA4 LoRA Adapter Enablement (#28602)
kfhfar a425dc2
[Bugfix] [ROCm] [AITER]: Fix aiter block quant not compatible with to…
tjtanaa 6718755
[Docs] Enable some more markdown lint rules for the docs (#28731)
hmellor e2741f6
[Chore] Rename `SchedulerConfig.chunked_prefill_enabled` (#28735)
DarkLight1337 cec275e
[Bugfix] resolve Qwen3-VL GPTQModel quantized model loading failure (…
GuanH fd45550
[BugFix] Fix misprint introduced by modular_kernel refactoring. (#28728)
halyavin 8977ffb
[ROCm][Bugfix] Fix compilation errors with fused_qknorm_rope_kernel.c…
SageMoore f08eab2
[CI] Fix macos smoke test uv cache issue (#28736)
mgoin 0de4f21
[Bugfix] TypeError: 'NoneType' object is not callable (#27410)
mostrowskix 5a84b76
[ROCm][CI/Build] Change install location of uv (#28741)
gshtras 2e0ad62
Avoid bytecode hook and simplify TorchCompileWrapperWithCustomDipatch…
laithsakka e5c7895
[Bugfix] Fix incorrect use of hidden_states for shared_experts due to…
alexm-redhat bf3ffb6
[Bugfix] Fix ChunkedLocalAttention CUDA Graph setting (#28739)
benchislett e0c910b
[Hybrid] [Kernel] Fix chunk scan kernel when BLOCK_SIZE_DSTATE > 128 …
tdoublep ba041d9
[Log] Save profiler results to file instead of stdout (#28144)
rasmith 75f01b9
[ROCm][CI/Build] Upgrade to ROCm 7.1 and AITER main (#28753)
gshtras 58e61e5
[Test] Rework e2e async scheduling tests (#28744)
njhill 186352b
[Core] Performance: Use list[np.ndarray] instead of list[list[int]] f…
Jialin 9fc81ec
[TPU] Fix import error in tpu launch (#28758)
QiliangCui f05d474
[Model][Qwen3VL] Use `mm_position` to compute mrope positions (#28730)
lgeiger edfe498
[Bugfix] Build hadacore kernels on >SM90 (#28748)
mgoin ac86bff
Revert "[Core] Performance: Use list[np.ndarray] instead of list[list…
njhill 363aaee
Fix IntermediateTensors initialization and add type hints (#28743)
OthmanMohammad c9e6658
[NIXL] heterogeneous block_size support (#26759)
xuechendi 6965ef4
[Performance][DeepGEMM] Estimate expected_m (#28694)
varun-sundar-rabindranath 98b4d38
[Redo] #26368 (#28771)
DarkLight1337 dd6ac1c
[RL] [V1] Remove unused device argument from reset_kv_cache (#28766)
zhuohan123 74b5267
Use narrow over indexing in `hadacore_transform` to prep for ABI stab…
janeyx99 1ec978c
[Kernel][Moe Configs] llama4 maverick fp8 moe config tp8 on mi325 (#2…
zhewenl 638e419
[Misc] Make `SchedulerConfig.max_model_len` init-only (#28733)
DarkLight1337 173b356
[PERF] Remove TRTLLM Gen attn kernel limitation `max_seq_len <=131072…
vadiklyutiy f36292d
[compile] Enable sequence parallelism matching w/o custom ops enabled…
angelayi cb15ee2
Allow Gemma3 to take image embeddings (#28483)
tingtingtangmeta 89d3679
[Doc] Fix failing doc build (#28772)
DarkLight1337 085a525
[Model] Fix lmhead init bug of bailing_moe (#28777)
hwhaokun e439c78
Add support for Eagle with separate lm-head and embed_tokens layers (…
eldarkurtic 637f292
[CI] Fix broken pipeline (#28781)
njhill 07cadab
[Model][Qwen3VL] Cache positional embedding indices (#28475)
lgeiger 2bb4435
[Doc]: fix typos in various files (#28567)
didier-durand be263f7
[BugFix] Fix `AssertionError: DCP not support reorder_batch_threshold…
LucasWilkinson f849ee7
Adding a benchmark for batch invariance (#28161)
bwasti d231876
[Benchmark] Fix client seed synchronization in multi-turn benchmark (…
ai-jz a55b646
[Model] Allow users to control skip reading cache per request. (#28194)
noooop b316ac6
[V1] Support MP Executor for multi node distributed inference (#23691)
luccafong af02c40
Fixed gpt-oss _load_weights_other() parameter position bug (#28715)
River12 3bc1175
[Bugfix] Fix host and port join for ipv6 in bench serve (#28679)
scottzh8 8d259fa
Fix gpt oss weight loading with EP + bf16 (#28765)
ashors1 63fed55
[Doc]: fix typos in various files (#28811)
didier-durand ac1daf3
fix comment typo (#28802)
andyxning 5a87076
[Model][QwenVL] Optimize `Qwen2_5_VisionAttention` q,k preparation (#…
lgeiger 03ee481
Feature: Support Relu2 in FusedMoE fp8 cutlass path (#27261)
amirkl94 80b6080
[BugFix] Fix async scheduling + chunked prefill + preemption (#28787)
njhill 561253b
[Performance][Fix] update nvfp4 code to support renorm routing (#28569)
jiahanc d64429b
[NIXL][XPU] update install script of NIXL (#28778)
zhenwei-intel 60e089f
[ROCm][Qwen3-32B] Fix AITER MHA accuracy issue cause by #25763 (#28670)
sammysun0711 6f37419
[Bugfix][Model] Prevent special token leakage in KimiK2ToolParser str…
jscaldwell55 3380ed5
[Doc] Add llama4 LoRA tag (#28825)
jeejeelee 577bb34
[CPU][Bugfix] Fix _to_list in CPU model runner (#28824)
bigPYJ1151 ab01cd1
[BugFix] Fix glm4_moe_mtp load weights bug (#28805)
wuyaoxuehun d4acf51
[Metrics] Fix KV cache usage percent metric multiproc (#28792)
jaywonchung 1b82fb0
[XPU] work around for sp, avoid custom op import error (#28822)
jikunshang 64e39d6
[BugFix] Temporary fix for IMA with MTP = 2 and full-cg (#28315)
LucasWilkinson 7f06449
[Bugfix][Perf] Revert applying HF processor on text-only inputs for m…
ywang96 e42bd8c
Cast return value to int64_t for cache size (#28814)
tiehexue f8b19c0
[Bugfix] Fix GPT-OSS on AMD after #28603 (#28816)
zhewenl d8874c6
[Core] Async Scheduling X Spec Decoding Compatibility (#24799)
Ronald1995 7765e5b
[BugFix] Fix PP performance and PP kv connector output regression (#…
njhill 95ae50b
[Quantization] [Eagle] Add complete quantization support to the draft…
shreyas269 a289cc1
[Test] Batch Invariant: Rename and organize tests (#27421)
yewentao256 f77bce0
[Model] Add Afmoe architecture implementation (#28332)
pranav4501 6148584
[BugFix] Corner case that could cause out-of-sync with external launc…
bangshengtang 552cac9
[Misc] Fix wrong comment in scheduler (#28880)
zhuohan123 b6e0439
[Bugfix] Fix Kimi-K2 tool parser concatenated tool calls parsing (#28…
bbartels 88ab591
Run macos smoke test workflow on main commit (#28752)
mgoin d0a7362
[ROCm][Quantization] add apply_vllm_mapper in quark config for models…
xuebwang-amd 3ddcf46
[Refactor] Remove Unused Func in Batch Invariant (#28881)
yewentao256 bf9e1e8
[Bugfix] Fix wrong CLI defaults for dynamic `SchedulerConfig` fields …
DarkLight1337 083cf32
[Doc]: fix typos in various files (#28863)
didier-durand 0168f69
[Misc] Remove unnecessary parentheses from log statements (#28897)
andyxning 5bdd155
[CI] Fix async scheduling + spec decoding test flake (#28902)
njhill 5bb1da5
[MISC] Remove format.sh (#28906)
KuntaiDu 896e41a
[CI/Build] Replace wikipedia url with local server ones (#28908)
Isotr0py 4393684
[BugFix] Fix PP/async scheduling with pooling models (#28899)
njhill 285eaa4
[Bugfix] Safeguard against missing backend in AttentionBackendEnum (#…
jesse996 b9489f5
[Model][Perf] Use cos and sin cache in QwenVL (#28798)
gcanlin 184b12f
[Bugfix][NIXL] Fix `block_size_ratio` when logical !=physical blocks …
NickLucche f6aa122
[CI Sprint] Quantization CI Cleanup (#24130)
killershrimp 49a986e
[Benchmark] multi_turn: Report warmup-inclusive runtime (#28937)
segevido c261237
[Model] Add Gemma3 GGUF multimodal support (#27772)
lucianommartins af10400
Merge branch 'main' into imarkov/conditional_compilation_ranges
ilmarkov
New test file added by this PR (95 added lines, per the diff header):

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import torch
from torch import fx as fx
from torch import nn
from torch.library import Library

from vllm.compilation.counter import compilation_counter
from vllm.compilation.decorators import support_torch_compile
from vllm.compilation.inductor_pass import (
    CustomGraphPass,
    InductorPass,
    get_pass_context,
)
from vllm.config import (
    VllmConfig,
    set_current_vllm_config,
)
from vllm.config.compilation import CompilationConfig, CompilationMode
from vllm.config.scheduler import SchedulerConfig
from vllm.forward_context import set_forward_context

# create a library to hold the custom op
silly_lib = Library("silly", "FRAGMENT")  # noqa

BATCH_SIZE = 64
MLP_SIZE = 128


@support_torch_compile
class TestModel(nn.Module):
    def __init__(self, *, vllm_config: VllmConfig, prefix: str = "", **kwargs) -> None:
        super().__init__()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + x
        attn_output = torch.empty_like(x)
        torch.ops.silly.attention(x, x, x, attn_output)
        x = attn_output
        x = x * 3
        return x


@torch.inference_mode
def run_model(vllm_config: VllmConfig, model: nn.Module, batch_sizes: list[int]):
    with set_forward_context({}, vllm_config=vllm_config):
        model(torch.randn(BATCH_SIZE, MLP_SIZE).cuda())
        for batch_size in batch_sizes:
            model(torch.randn(batch_size, MLP_SIZE).cuda())


class PostGradPassManagerCheckRanges(CustomGraphPass):
    def __init__(self, ranges: list[tuple[int, int]]):
        self.ranges = ranges

    def __call__(self, graph: fx.Graph):
        compile_range = get_pass_context().compile_range
        assert compile_range in self.ranges, (
            f"Compile range {compile_range} not in {self.ranges}"
        )

    def uuid(self) -> str:
        state = {
            "ranges": self.ranges,
        }
        return InductorPass.hash_dict(state)


def test_compile_ranges():
    vllm_config = VllmConfig(
        scheduler_config=SchedulerConfig(
            max_num_batched_tokens=8192,
        ),
        compilation_config=CompilationConfig(
            mode=CompilationMode.VLLM_COMPILE,
            compile_ranges_split_points=[8, 32],
            inductor_compile_config={
                "post_grad_custom_post_pass": PostGradPassManagerCheckRanges(
                    [(1, 8), (8, 32), (32, 2049)]
                )
            },
        ),
    )

    with set_current_vllm_config(vllm_config):
        model = TestModel(vllm_config=vllm_config, prefix="").eval().cuda()

    batch_sizes = [1, 16, 48]
    # TestModel has support_torch_compile
    with compilation_counter.expect(
        num_graphs_seen=1,
        num_piecewise_graphs_seen=1,
        num_backend_compilations=3,
        # num_cudagraph_sizes * num_piecewise_capturable_graphs_seen
    ):
        run_model(vllm_config, model, batch_sizes)
```
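The test invokes `torch.ops.silly.attention`, whose registration is not part of this excerpt (the commit "[Minor] avoid register new custom and just import silly_attn (#28578)" suggests the suite imports a shared helper instead of re-registering it here). Below is a minimal sketch of what such a registration could look like with `torch.library`; the schema and both kernels are illustrative assumptions, not the PR's code:

```python
import torch
from torch.library import Library

# Hypothetical registration of the "silly::attention" op used by TestModel.
# The real test suite provides its own implementation; this only sketches the
# torch.library pattern for an opaque, mutating op that torch.compile splits on.
_silly_lib = Library("silly", "FRAGMENT")  # noqa
_silly_lib.define("attention(Tensor q, Tensor k, Tensor v, Tensor(a!) out) -> ()")


def _attention_impl(
    q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, out: torch.Tensor
) -> None:
    # Stand-in "attention": write a simple combination of the inputs into out,
    # giving the op a side effect the compiler must treat as opaque.
    out.copy_(q + k + v)


_silly_lib.impl("attention", _attention_impl, "CUDA")
# No-op meta kernel so the op can be traced through torch.compile.
_silly_lib.impl("attention", lambda q, k, v, out: None, "Meta")
```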
Changes to `CompilerManager`: the PR replaces the per-shape `runtime_shape: int | None` plumbing with a `compile_range: tuple[int, int] | None` in the compiled-graph cache key, the compile context, and the cache load/compile paths.

```diff
@@ -83,7 +83,7 @@ class CompilerManager:
     """

     def __init__(self, compilation_config: CompilationConfig):
-        self.cache: dict[tuple[int | None, int, str], Any] = dict()
+        self.cache: dict[tuple[tuple[int, int] | None, int, str], Any] = dict()
         self.is_cache_updated = False
         self.compilation_config = compilation_config
         self.compiler = make_compiler(compilation_config)
@@ -92,11 +92,11 @@ def compute_hash(self, vllm_config: VllmConfig) -> str:
         return self.compiler.compute_hash(vllm_config)

     @contextmanager
-    def compile_context(self, runtime_shape: int | None = None):
+    def compile_context(self, compile_range: tuple[int, int] | None = None):
         """Provide compilation context for the duration of compilation to set
         any torch global properties we want to scope to a single Inductor
         compilation (e.g. partition rules, pass context)."""
-        with pass_context(runtime_shape):
+        with pass_context(compile_range):
             if self.compilation_config.use_inductor_graph_partition:
                 with inductor_partition_rule_context(
                     self.compilation_config.splitting_ops
@@ -152,26 +152,28 @@ def load(
         graph: fx.GraphModule,
         example_inputs: list[Any],
         graph_index: int,
-        runtime_shape: int | None = None,
+        compile_range: tuple[int, int] | None = None,
     ) -> Callable | None:
-        if (runtime_shape, graph_index, self.compiler.name) not in self.cache:
+        if (compile_range, graph_index, self.compiler.name) not in self.cache:
             return None
-        handle = self.cache[(runtime_shape, graph_index, self.compiler.name)]
+        handle = self.cache[(compile_range, graph_index, self.compiler.name)]
         compiled_graph = self.compiler.load(
-            handle, graph, example_inputs, graph_index, runtime_shape
+            handle, graph, example_inputs, graph_index, compile_range
         )
-        if runtime_shape is None:
+        if compile_range is None:
             logger.debug(
-                "Directly load the %s-th graph for dynamic shape from %s via handle %s",
+                "Directly load the %s-th graph for dynamic compile range "
+                "from %s via handle %s",
                 graph_index,
                 self.compiler.name,
                 handle,
             )
         else:
             logger.debug(
-                "Directly load the %s-th graph for shape %s from %s via handle %s",
+                "Directly load the %s-th graph for compile range %s "
+                "from %s via handle %s",
                 graph_index,
-                str(runtime_shape),
+                str(compile_range),
                 self.compiler.name,
                 handle,
             )
@@ -185,7 +187,7 @@ def compile(
         compilation_config: CompilationConfig,
         graph_index: int = 0,
         num_graphs: int = 1,
-        runtime_shape: int | None = None,
+        compile_range: tuple[int, int] | None = None,
     ) -> Any:
         if graph_index == 0:
             # before compiling the first graph, record the start time
@@ -197,25 +199,24 @@
         compiled_graph = None

         # try to load from the cache
-        compiled_graph = self.load(graph, example_inputs, graph_index, runtime_shape)
+        compiled_graph = self.load(graph, example_inputs, graph_index, compile_range)
         if compiled_graph is not None:
             if graph_index == num_graphs - 1:
                 # after loading the last graph for this shape, record the time.
                 # there can be multiple graphs due to piecewise compilation.
                 now = time.time()
                 elapsed = now - compilation_start_time
                 compilation_config.compilation_time += elapsed
-            if runtime_shape is None:
+            if compile_range is None:
```
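For context, here is a minimal sketch (not vLLM's actual `CompilerManager`) of what keying the cache by a range changes: one compiled artifact now serves every token count inside its range rather than a single specialized shape. The helper names and the inclusive-bound membership check are assumptions, based on the diff above and the "Make ranges inclusive-inclusive" commit.

```python
from typing import Any

# Cache key mirrors the new type in the diff:
# (compile_range, graph_index, compiler_name); None means the fully dynamic graph.
CacheKey = tuple[tuple[int, int] | None, int, str]
cache: dict[CacheKey, Any] = {}


def put(
    compile_range: tuple[int, int] | None, graph_index: int, compiler_name: str, handle: Any
) -> None:
    # One entry per (range, piecewise graph, compiler) instead of per shape.
    cache[(compile_range, graph_index, compiler_name)] = handle


def lookup(num_tokens: int, graph_index: int, compiler_name: str) -> Any | None:
    # Serve any batch size that falls inside a compiled range; fall back to the
    # fully dynamic entry (key None) if no range matches.
    fallback = None
    for (rng, idx, name), handle in cache.items():
        if idx != graph_index or name != compiler_name:
            continue
        if rng is None:
            fallback = handle
        elif rng[0] <= num_tokens <= rng[1]:
            return handle
    return fallback


# With the ranges asserted in the test above:
put((1, 8), 0, "inductor", "small")
put((8, 32), 0, "inductor", "medium")
put((32, 2049), 0, "inductor", "large")
assert lookup(48, 0, "inductor") == "large"
```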
Review comment: Add the current range to cache key and check the number of times the manager gets called (to make sure the bug you found doesn't manifest).
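A minimal sketch of one way to implement the counting the comment asks for: reuse the range-checking pass from the test above but also count invocations, so the test can assert the post-grad pass ran once per compile range. The class name and the final assertion are illustrative assumptions, not part of the PR.

```python
from torch import fx


class CountingRangeCheckPass(PostGradPassManagerCheckRanges):
    def __init__(self, ranges: list[tuple[int, int]]):
        super().__init__(ranges)
        self.num_calls = 0

    def __call__(self, graph: fx.Graph):
        super().__call__(graph)  # still verify the current compile range
        self.num_calls += 1


# After run_model(...), the test could then check:
#     assert check_pass.num_calls == len(check_pass.ranges)
```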