Handle --gpus flag using CDI #4617
base: main
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -9,8 +9,8 @@ nerdctl provides docker-compatible NVIDIA GPU support. | |||||
|
|
||||||
| - NVIDIA Drivers | ||||||
| - Same requirement as when you use GPUs on Docker. For details, please refer to [the doc by NVIDIA](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#pre-requisites). | ||||||
| - `nvidia-container-cli` | ||||||
| - containerd relies on this CLI for setting up GPUs inside container. You can install this via [`libnvidia-container` package](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/arch-overview.html#libnvidia-container). | ||||||
| - The NVIDIA Container Toolkit | ||||||
| - containerd relies on the NVIDIA Container Toolkit to make GPUs usable inside a container. You can install the NVIDIA Container Toolkit by following the [official installation instructions](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html). | ||||||
|
|
||||||
| ## Options for `nerdctl run --gpus` | ||||||
|
|
||||||
|
|
@@ -27,23 +27,24 @@ You can also pass detailed configuration to `--gpus` option as a list of key-val | |||||
|
|
||||||
| - `count`: number of GPUs to use. `all` exposes all available GPUs. | ||||||
| - `device`: IDs of GPUs to use. UUID or numbers of GPUs can be specified. | ||||||
| - `capabilities`: [Driver capabilities](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#driver-capabilities). If unset, the default driver capabilities `utility` and `compute` are used. | ||||||
|
|
||||||
| The following example exposes a specific GPU to the container. | ||||||
|
|
||||||
| ``` | ||||||
| nerdctl run -it --rm --gpus '"capabilities=utility,compute",device=GPU-3a23c669-1f69-c64e-cf85-44e9b07e7a2a' nvidia/cuda:12.3.1-base-ubuntu20.04 nvidia-smi | ||||||
| nerdctl run -it --rm --gpus 'device=GPU-3a23c669-1f69-c64e-cf85-44e9b07e7a2a' nvidia/cuda:12.3.1-base-ubuntu20.04 nvidia-smi | ||||||
| ``` | ||||||
|
|
||||||
| Note that although `capabilities` options may be provided, these are ignored when processing the GPU request. | ||||||
|
|
||||||
|
|
||||||
| ## Fields for `nerdctl compose` | ||||||
|
|
||||||
| `nerdctl compose` also supports GPUs following [compose-spec](https://github.com/compose-spec/compose-spec/blob/master/deploy.md#devices). | ||||||
|
|
||||||
| You can use GPUs on compose when you specify some of the following `capabilities` in `services.demo.deploy.resources.reservations.devices`. | ||||||
| You can use GPUs on compose when you specify the `driver` as `nvidia` or one or | ||||||
| more of the following `capabilities` in `services.demo.deploy.resources.reservations.devices`. | ||||||
|
|
||||||
| - `gpu` | ||||||
| - `nvidia` | ||||||
| - all allowed capabilities for `nerdctl run --gpus` | ||||||
|
|
||||||
| Available fields are the same as `nerdctl run --gpus`. | ||||||
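The rule above (a device entry counts as a GPU request when its `driver` is `nvidia` or when it lists a recognized capability) can be sketched as a small shell predicate. `is_gpu_reservation` is a hypothetical helper written for illustration, not part of nerdctl, and its capability list is an assumption limited to the names that appear in this document (`gpu`, `nvidia`, `utility`, `compute`); the set nerdctl actually accepts may be larger.

```shell
# Hypothetical helper (not part of nerdctl):
#   is_gpu_reservation DRIVER [CAPABILITY...]
# Succeeds when a compose device entry would count as a GPU request:
# either the driver is "nvidia" or at least one capability is recognized.
# Assumption: only the capability names shown in this document are listed.
is_gpu_reservation() {
  gpu_res_driver=${1-}
  if [ "$#" -gt 0 ]; then
    shift
  fi
  if [ "$gpu_res_driver" = "nvidia" ]; then
    return 0
  fi
  for gpu_res_cap in "$@"; do
    case "$gpu_res_cap" in
      gpu|nvidia|utility|compute) return 0 ;;
    esac
  done
  return 1
}

# Usage:
#   is_gpu_reservation nvidia        # driver alone is enough
#   is_gpu_reservation "" utility    # no driver, one recognized capability
```

Passing the driver as the first argument, with an empty string when it is unset, keeps the two conditions of the rule visibly separate.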
|
|
||||||
|
|
@@ -59,12 +60,37 @@ services: | |||||
| resources: | ||||||
| reservations: | ||||||
| devices: | ||||||
| - capabilities: ["utility"] | ||||||
| - driver: nvidia | ||||||
| count: all | ||||||
| ``` | ||||||
|
|
||||||
| ## Troubleshooting | ||||||
|
|
||||||
| ### `nerdctl run --gpus` fails due to an unresolvable CDI device | ||||||
|
|
||||||
| If the required CDI specifications for NVIDIA devices are not available on the | ||||||
| system, the `nerdctl run` command will fail with an error similar to: `CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all` (the | ||||||
| exact error message will depend on the device(s) requested). | ||||||
|
|
||||||
| This should be the same error message that is reported when the `--device` flag | ||||||
| is used to request a CDI device: | ||||||
| ``` | ||||||
| nerdctl run --device=nvidia.com/gpu=all | ||||||
| ``` | ||||||
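Since the same error is reported for `--gpus` and for `--device`, one way to reason about a `--gpus` request is as a CDI device name under `nvidia.com/gpu`. The shell sketch below illustrates that mapping for the simple forms shown in this document. `gpus_to_cdi` is a hypothetical helper, not part of nerdctl, and treating `count=all` the same as `all` is an assumption; nerdctl's real parsing covers more cases (numeric counts, device lists, quoting).

```shell
# Hypothetical helper (not part of nerdctl): print the CDI device name that
# a simple --gpus value would correspond to. Only `all`, `count=all`, and a
# single `device=<index-or-UUID>` are handled; anything else is rejected.
gpus_to_cdi() {
  case "${1-}" in
    all|count=all)
      printf 'nvidia.com/gpu=all\n'
      ;;
    device=?*)
      # Strip the "device=" prefix and keep the index or UUID as-is.
      printf 'nvidia.com/gpu=%s\n' "${1#device=}"
      ;;
    *)
      printf 'unsupported --gpus value: %s\n' "${1-}" >&2
      return 1
      ;;
  esac
}

# Usage (assumes nerdctl is installed; <image> is a placeholder):
#   nerdctl run --rm --device="$(gpus_to_cdi device=0)" <image> nvidia-smi
```

Reproducing a failing `--gpus` request with the equivalent `--device` flag can help confirm that the problem lies in the CDI specifications rather than in the `--gpus` handling.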
|
|
||||||
| Ensure that the NVIDIA Container Toolkit (>= v1.18.0 is recommended) is installed and the requested CDI devices are present in the output of `nvidia-ctk cdi list`: | ||||||
|
|
||||||
| ``` | ||||||
| $ nvidia-ctk cdi list | ||||||
| INFO[0000] Found 3 CDI devices | ||||||
| nvidia.com/gpu=0 | ||||||
| nvidia.com/gpu=GPU-3eb87630-93d5-b2b6-b8ff-9b359caf4ee2 | ||||||
| nvidia.com/gpu=all | ||||||
| ``` | ||||||
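The check described above can be scripted. The sketch below assumes the device names appear one per line, as in the sample output; `cdi_device_present` is a hypothetical helper name, not an `nvidia-ctk` feature. It reads the listing from standard input so it can be tried against saved output as well as a live system.

```shell
# Hypothetical helper: succeed if the CDI device name given as $1 appears as
# a complete line in the `nvidia-ctk cdi list` output read from stdin.
# Leading and trailing whitespace is stripped before comparing, and the match
# is exact, so "nvidia.com/gpu=1" does not match "nvidia.com/gpu=10".
cdi_device_present() {
  sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | grep -Fxq -- "$1"
}

# Usage (assumes nvidia-ctk is installed):
#   if ! nvidia-ctk cdi list | cdi_device_present nvidia.com/gpu=all; then
#     echo "nvidia.com/gpu=all is not available as a CDI device" >&2
#   fi
```

Log lines such as `INFO[0000] Found 3 CDI devices` never equal a device name, so they do not need to be filtered out separately.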
|
|
||||||
| See the NVIDIA Container Toolkit [CDI documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html) for more information. | ||||||
|
|
||||||
|
|
||||||
| ### `nerdctl run --gpus` fails when using the Nvidia gpu-operator | ||||||
|
|
||||||
| If the NVIDIA driver is installed by the [gpu-operator](https://github.com/NVIDIA/gpu-operator), `nerdctl run` will fail with the error message `(FATA[0000] exec: "nvidia-container-cli": executable file not found in $PATH)`. | ||||||
|
|
||||||
|
Member
Thanks, does this need any change on https://github.com/containerd/nerdctl/blob/main/docs/gpu.md ?
Contributor
Author
I have updated this as well in the latest revision.
Add something like
Maybe the note should be rather moved to the top of the documentation