[None][fix] Correct virtual memory allocation alignment #9491
Conversation
📝 Walkthrough
The changes enhance CUDA memory allocation error reporting and refactor alignment computation in virtual memory management. A new optional `info` parameter augments CUDA driver error messages with call-site context, and the virtual memory alignment is now computed lazily from the system page size and GPU allocation granularity.
Sequence Diagram(s)
sequenceDiagram
participant Caller
participant VirtualMemory as VirtualMemory:<br/>allocate()
participant Config as Configuration:<br/>pageAligned()
participant Atomic as mAlignment<br/>(atomic)
participant CUDA as CUDA Runtime
Caller->>VirtualMemory: allocate(n, device)
VirtualMemory->>Config: pageAligned(n, device)
Config->>Atomic: load mAlignment
alt mAlignment == 0 (uninitialized)
Config->>CUDA: cuMemGetAllocationGranularity()
CUDA-->>Config: gpu_alignment
Config->>Config: lcm(system_page_size, gpu_alignment)
Config->>Atomic: store alignment (spin-wait)
end
Config->>Config: round n to alignment
Config-->>VirtualMemory: page_aligned_size
VirtualMemory->>CUDA: cuMemAddressReserve(page_aligned_size)
alt Success
CUDA-->>VirtualMemory: reserved address
else Error
VirtualMemory->>VirtualMemory: checkDriver with info log
end
VirtualMemory-->>Caller: address
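The size rounding in the diagram above can be sketched as a standalone illustration. This is not the actual TensorRT-LLM code; the function names (`roundToAlignment`, `pageAlignedSize`) and parameters are hypothetical, standing in for the real `Configuration::pageAligned` logic:

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>

// Round n up to the next multiple of alignment (alignment must be > 0).
std::size_t roundToAlignment(std::size_t n, std::size_t alignment)
{
    return (n + alignment - 1) / alignment * alignment;
}

// The allocation size must satisfy both the OS page size and the GPU
// allocation granularity, so the effective alignment is their least
// common multiple. In practice the GPU granularity (e.g. 2 MiB) is
// usually already a multiple of the 4 KiB system page size.
std::size_t pageAlignedSize(std::size_t n, std::size_t systemPageSize, std::size_t gpuGranularity)
{
    std::size_t const alignment = std::lcm(systemPageSize, gpuGranularity);
    return roundToAlignment(n, alignment);
}
```

Because the rounding is deterministic for a given cached alignment, the same `n` always maps to the same aligned size, which is what keeps the later reserve/map/free calls consistent with one another.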
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
cpp/include/tensorrt_llm/runtime/virtualMemory.h (2)
465-495: Atomic alignment field is a good fit; ensure required headers are included
Using `std::atomic<std::size_t> mAlignment` lazily initialized to 0 matches the new `pageAligned` design and avoids doing CUDA driver queries in the constructor. However, this header now directly depends on `std::atomic` and `std::numeric_limits` (used below), but only `<numeric>` is explicitly included.
To keep the header self-contained and robust against transitive include changes, consider explicitly adding:
 #include <map>
 #include <mutex>
 #include <numeric>
+#include <atomic>
+#include <limits>
 #include <unistd.h>
496-529: Lazy, device-aware `pageAligned` implementation is sound; minor edge considerations
The lazy initialization using a sentinel (loading = max `size_t`), CAS, and a spin-wait with pause/yield is correct for typical usage:
- Alignment is computed once via `cuMemGetAllocationGranularity` for the first device that calls the function, then cached.
- `std::lcm(getpagesize(), gpuAlignment)` ensures the size is valid for both the OS page size and the GPU allocation granularity.
- Concurrent callers either compute once or spin briefly until the value is stored; there is no data race on `mAlignment`.
Two observations:
1. Device-specific alignment: in heterogeneous multi-GPU scenarios (different GPU types with different granularities), the single cached alignment ignores the `device` parameter after first initialization. This is acceptable for homogeneous setups but worth documenting.
2. Memory ordering: the current `memory_order_relaxed` semantics are correct for the scalar alignment value. Consider acquire/release only if this function evolves to coordinate additional state beyond the alignment value itself.
Both observations are optional and forward-looking, not critical bugs. The implementation is correct as-is for its current usage.
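The sentinel-based lazy initialization described above can be illustrated with a self-contained sketch. The names here (`lazyAlignment`, `kLoading`, the `compute` callback standing in for the `cuMemGetAllocationGranularity` query) are hypothetical and do not come from the actual code:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <limits>
#include <thread>

// Zero means "not yet computed"; max size_t is a "someone is computing"
// sentinel. Exactly one caller wins the CAS and performs the expensive
// query; everyone else spins until the winner publishes the real value.
std::size_t lazyAlignment(std::atomic<std::size_t>& cache, std::size_t (*compute)())
{
    constexpr std::size_t kLoading = std::numeric_limits<std::size_t>::max();
    std::size_t expected = 0;
    if (cache.compare_exchange_strong(expected, kLoading, std::memory_order_relaxed))
    {
        std::size_t const value = compute(); // first caller pays the cost
        cache.store(value, std::memory_order_relaxed);
        return value;
    }
    // CAS failed: either the value is already cached (expected != kLoading)
    // or another thread is mid-computation, so spin-wait on the sentinel.
    while (expected == kLoading)
    {
        std::this_thread::yield();
        expected = cache.load(std::memory_order_relaxed);
    }
    return expected;
}
```

Relaxed ordering suffices in this sketch because the published alignment is a single scalar read back through the same atomic; as the review notes, acquire/release would only matter if the published value guarded other memory.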
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cpp/include/tensorrt_llm/runtime/virtualMemory.h (4 hunks)
cpp/tensorrt_llm/common/cudaDriverWrapper.h (2 hunks)
cpp/tensorrt_llm/runtime/virtualMemory.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., `} // namespace foo`)
Prefer `const` or `constexpr` variables over `#define` whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as `const`
Except `0` (only used in comparison for checking signness/existence/emptiness) and `nullptr`, `true`, `false`, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty `for` or `while` loop in a new line
The statement forming the body of a `switch`, `while`, `do .. while` or `for` statement shall be a compound statement (use brace-delimited statements)
`if` and `else` should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., `thisIsASubDir` and `thisIsAFilename.cpp`)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., `FooBarClass`)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., `localFooBar`)
Non-magic-number global variables that are non-static and not defined in an anonymous namespace should use camel case prefixed by a lowercase 'g' (e.g., `gDontUseGlobalFoos`)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lowercase 's' (e.g., `sMutableStaticGlobal`)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., `static std::once_flag sFlag;`)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., `mNbFooValues`), though the 'm' pre...
Files:
cpp/tensorrt_llm/common/cudaDriverWrapper.h
cpp/tensorrt_llm/runtime/virtualMemory.cpp
cpp/include/tensorrt_llm/runtime/virtualMemory.h
**/*.h
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.h: Use a preprocessor guard in C++ header files with the guard name format `TRTLLM_` followed by the filename in all caps (e.g., `TRTLLM_FOO_BAR_HELLO_H` for file `FooBarHello.h`); do not include directory names in the symbol
Do not use underscore prefix or suffix in C++ preprocessor guard symbols; they are reserved in the C++ standard for compilers or implementations
Files:
cpp/tensorrt_llm/common/cudaDriverWrapper.h
cpp/include/tensorrt_llm/runtime/virtualMemory.h
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
cpp/tensorrt_llm/common/cudaDriverWrapper.h
cpp/tensorrt_llm/runtime/virtualMemory.cpp
cpp/include/tensorrt_llm/runtime/virtualMemory.h
🧠 Learnings (9)
📓 Common learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1185-1189
Timestamp: 2025-08-25T00:03:39.294Z
Learning: TLLM_CHECK_WITH_INFO is a host-side utility function and cannot be called from CUDA device functions (those marked with __device__ or __global__). In device code, assert() is the primary mechanism for handling "should never happen" conditions, and like standard C++ assert, CUDA's assert only works in debug builds and is compiled out in release builds.
📚 Learning: 2025-08-25T00:03:39.294Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1185-1189
Timestamp: 2025-08-25T00:03:39.294Z
Learning: TLLM_CHECK_WITH_INFO is a host-side utility function and cannot be called from CUDA device functions (those marked with __device__ or __global__). In device code, assert() is the primary mechanism for handling "should never happen" conditions, and like standard C++ assert, CUDA's assert only works in debug builds and is compiled out in release builds.
Applied to files:
cpp/tensorrt_llm/common/cudaDriverWrapper.h
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Applied to files:
cpp/tensorrt_llm/common/cudaDriverWrapper.h
cpp/include/tensorrt_llm/runtime/virtualMemory.h
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
cpp/tensorrt_llm/common/cudaDriverWrapper.h
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/common/cudaDriverWrapper.h
cpp/tensorrt_llm/runtime/virtualMemory.cpp
cpp/include/tensorrt_llm/runtime/virtualMemory.h
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/common/cudaDriverWrapper.h
cpp/include/tensorrt_llm/runtime/virtualMemory.h
📚 Learning: 2025-11-24T17:09:17.870Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:09:17.870Z
Learning: Applies to **/*.h : Use a preprocessor guard in C++ header files with the guard name format `TRTLLM_` followed by the filename in all caps (e.g., `TRTLLM_FOO_BAR_HELLO_H` for file `FooBarHello.h`); do not include directory names in the symbol
Applied to files:
cpp/tensorrt_llm/common/cudaDriverWrapper.h
cpp/include/tensorrt_llm/runtime/virtualMemory.h
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
cpp/tensorrt_llm/runtime/virtualMemory.cpp
📚 Learning: 2025-11-24T17:09:17.870Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:09:17.870Z
Learning: Applies to **/*.cu : CUDA code must be compiled with a CUDA compiler and includes declarations/definitions with CUDA keywords (`__device__`, `__managed__`, `__constant__`, `__global__`), device functions, and kernel launching with <<<...>>> syntax
Applied to files:
cpp/include/tensorrt_llm/runtime/virtualMemory.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
cpp/tensorrt_llm/common/cudaDriverWrapper.h (2)
143-162: Extended `checkDriver` info parameter looks correct and preserves existing behavior
The added `info` parameter cleanly augments the error message when non-null and defaults to `nullptr`, so existing call sites remain source-compatible. The conditional formatting branch is straightforward and still falls back to the previous message structure when `info` is absent. No functional issues spotted here.
187-192: `TLLM_CU_CHECK_WITH_INFO` is well-integrated; consider usage constraints
The new macro correctly preformats the info string and passes it into `checkDriver`, giving richer context on CUDA driver failures. The lifetime of the `fmtstr(...).c_str()` argument is safe for the duration of the call. Just keep in mind this remains a host-side utility (like other TLLM_*_CHECK macros) and should not be invoked from device code.
cpp/include/tensorrt_llm/runtime/virtualMemory.h (1)
203-239: Improved cuMemCreate error reporting in `LocalCreator::create`
Switching to `TLLM_CU_CHECK_WITH_INFO` for `cuMemCreate` with "allocating %zu bytes of memory" and `mSize` gives exactly the extra context you want when allocations fail, without changing semantics. The memory accounting that follows still uses the same `mSize`, so behavior remains consistent.
cpp/tensorrt_llm/runtime/virtualMemory.cpp (6)
49-60: More informative status checks in `materialize`
Using `TLLM_CHECK_WITH_INFO` with the current status value before materializing a chunk is an improvement over a plain assert-style check. It keeps the control flow unchanged but gives clear diagnostics if materialize is called in an invalid state.
88-115: Release-path status validation is clearer and safer
The new `TLLM_CHECK_WITH_INFO` in `_release` precisely documents valid states (`MATERIALIZED` or `ERRORED` with non-`INVALID_STATE`) and reports the actual status on misuse. This doesn't change behavior when called correctly but improves failure-mode debuggability.
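As a rough illustration of the check-with-info pattern these comments describe: on failure the checker reports both the failing call and optional call-site context. The real `TLLM_CHECK_WITH_INFO` / `TLLM_CU_CHECK_WITH_INFO` macros differ in detail; `checkStatus` and its message format here are hypothetical:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// On failure, report the failing expression plus optional call-site
// context; passing nullptr keeps the old, shorter message, so existing
// call sites remain source-compatible.
void checkStatus(int status, char const* expr, char const* info = nullptr)
{
    if (status != 0)
    {
        std::string msg = std::string("driver call failed: ") + expr;
        if (info != nullptr)
        {
            msg += " while ";
            msg += info; // e.g. "allocating 4096 bytes of memory"
        }
        throw std::runtime_error(msg);
    }
}
```

The defaulted `info` parameter mirrors the source-compatibility point above: existing callers compile unchanged, while new callers can opt into richer diagnostics.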
153-172: Manager `add` now surfaces detailed misuse information
Switching these internal checks in `CudaVirtualMemoryManager::add` to `TLLM_CHECK_WITH_INFO` (status validation and duplicate-handle detection) is a straight upgrade in error reporting and doesn't alter the success path. The ScopeGuard rollback logic remains intact.
339-380: Alignment fix in `allocate` correctly ties VA reservation, mapping, and allocation to GPU granularity
The revised allocation path looks correct and addresses the original bug:
- `pageAlignedSize = mConfig->pageAligned(n, device)` ensures the size passed to `cuMemAddressReserve`, `UnicastConfigurator` (for `cuMemMap`/`cuMemUnmap`), and `LocalCreator`/`cuMemCreate` is a common multiple of both the system page size and the GPU's allocation granularity.
- `TLLM_CU_CHECK_WITH_INFO` around `cuMemAddressReserve` with "allocating %zu bytes of address space" provides precise context on failures.
- The configurators that operate on the logical payload (`MemsetConfigurator`, `OffloadConfigurator`) still use `n`, so they only touch/back up the intended region inside the larger aligned allocation, which is fine.
Taken together, the VA reservation, mapping size, and physical allocation size are now consistent and satisfy the GPU alignment requirements.
You may want to run a quick smoke test on devices with non-trivial `cuMemGetAllocationGranularity` (e.g., large-BAR configurations) and verify that `cuMemCreate`, `cuMemMap`, and `cuMemAddressReserve`/`Free` all succeed for a variety of non-aligned `n` values.
382-389: Deallocate remains consistent with new alignment strategy
`deallocate` now relies on `mConfig->pageAligned(n)` for `cuMemAddressFree`, which uses the same cached alignment value computed on first use in `pageAligned`. For a given `n`, the aligned size used in `cuMemAddressReserve` and in `cuMemAddressFree` will match, so freeing the VA range remains correct under the new device-aware alignment scheme.
417-426: Allocator configuration guard now emits helpful context on misuse
The `TLLM_CHECK_WITH_INFO` in `setVirtualMemoryAllocator` that prints the tag, mode, and stream of an already-active allocator is a useful diagnostic improvement when someone attempts to reconfigure virtual memory while it is already in use. No behavior change for the valid path, and the message should make misconfiguration much easier to track down.
f51758c to a35aac3
/bot run
PR_Github #25929 [ run ] triggered by Bot. Commit:
PR_Github #25929 [ run ] completed with state
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
a35aac3 to 30d67da
/bot run
PR_Github #26120 [ run ] triggered by Bot. Commit:
PR_Github #26120 [ run ] completed with state
/bot run
PR_Github #26244 [ run ] triggered by Bot. Commit:
PR_Github #26244 [ run ] completed with state
Description
This PR fixes a bug where the virtual address reservation / virtual memory allocation size may not satisfy the GPU alignment requirement.
Test Coverage
VirtualMemoryManagerTest.TestCudaVirtualMemoryAllocatorUnalignedSize
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with the pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.