
Commit f4aeae8

Documentation updates - part 1 (#493)
This is the first batch of documentation updates. I divided my work into several parts to minimize merge conflicts, as multiple people are contributing to these files. The main focus of my changes is to apply good writing practices that enhance readability, flow, and overall professionalism. I also split the Quick Start guide into three separate documents, as it’s an important procedure that should remain easy to follow. It was starting to feel too long and complex in its previous form.

Please review these updates and let me know if any corrections are needed, particularly regarding the environment variables section. It wasn’t clear where VLLM_PROMPT_SEQ_BUCKET_MAX, VLLM_HANDLE_TOPK_DUPLICATES, and VLLM_CONFIG_HIDDEN_LAYERS belong, since they were only mentioned in a tip. I’ve placed them in an additional table for now, but please let me know if that’s not accurate.

---------

Signed-off-by: mhelf-intel <monika.helfer@intel.com>
Co-authored-by: Patryk Wolsza <patryk.wolsza@intel.com>
1 parent 6d64695 commit f4aeae8

14 files changed, +831 -835 lines changed


docs/.nav.yml

Lines changed: 5 additions & 2 deletions
@@ -2,8 +2,11 @@ nav:
   - Home:
     - vLLM x Intel Gaudi: README.md
   - Getting Started:
-    - getting_started/quickstart.md
-    - getting_started/installation.md
+    - Quick Start:
+        - getting_started/quickstart.md
+        - getting_started/quickstart_configuration.md
+        - getting_started/quickstart_inference.md
+    - Installation: getting_started/installation.md
   - Quick Links:
     - User Guide: user_guide/README.md
     - Developer Guide: dev_guide/README.md

docs/configuration/env_vars.md

Lines changed: 82 additions & 63 deletions
Large diffs are not rendered by default.

docs/configuration/long_context.md

Lines changed: 15 additions & 21 deletions
@@ -2,6 +2,7 @@
 # Long Context Configuration
 
 Long context feature enables support for a token context window exceeding 128K tokens. It is supported by the following models:
+
 - [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b)
 - [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b)
 - [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
@@ -17,39 +18,32 @@ Set the following environment variables to avoid OOM/functional issues. Additio
 - `VLLM_RPC_TIMEOUT=100000`
 - `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1`
 
-## Warmup buckets preparation
-Exponential bucketing mechanism automatically prepares buckets for long context. Linear bucketing mechanism requires manual flags settings.
+## Warm-up Buckets Preparation
 
-**32K context length flags examples for linear warmup:**
+Exponential bucketing mechanism automatically prepares buckets for long context. Linear bucketing mechanism requires manual flags settings. The following table presents 32K context length flags examples for linear warm-up:
 
-- `VLLM_GRAPH_RESERVED_MEM`: The value depends on the model and context length settings. Use `VLLM_GRAPH_RESERVED_MEM=0.02` for Llama3.1-8B or `VLLM_GRAPH_RESERVED_MEM=0.1` for Llama3.1-70B.
-- `VLLM_PROMPT_SEQ_BUCKET_MIN=24576`: Suggested value, depends on warmup results.
-- `VLLM_PROMPT_SEQ_BUCKET_STEP=2048`: Suggested value, depends on warmup results. It is recommended to increase it to a higher value for faster warmup. `VLLM_PROMPT_SEQ_BUCKET_STEP=16384` - Suggested value for Intel Gaudi 3.
-- `VLLM_PROMPT_SEQ_BUCKET_MAX=32768`: Value for context length of 32K. Use 16384 for 16K.
-- `VLLM_DECODE_BLOCK_BUCKET_MIN=1024`: Suggested value, depends on warmup results.
-- `VLLM_DECODE_BLOCK_BUCKET_STEP=1024`: Suggested value, depends on warmup results.
-- `VLLM_DECODE_BLOCK_BUCKET_MAX=33792`: `max_num_seqs * max_decode_seq // self.block_size`, where `max_decode_seq` represents the sum of input and output sequences. For example:
-  - `128 *((32 + 1)* 1024) / 128`
-  - `32 *((32 + 1)* 1024) / 128`
+| Flag | Suggested value | Notes |
+|------|-----------------|-------|
+| `VLLM_GRAPH_RESERVED_MEM` | `0.02` or `0.1` | The value depends on the model and context length settings. Set to `0.02` for Llama3.1-8B or `0.1` for Llama3.1-70B. |
+| `VLLM_PROMPT_SEQ_BUCKET_MIN` | `24576` | The value depends on the warm-up results. |
+| `VLLM_PROMPT_SEQ_BUCKET_STEP` | `2048` | The value depends on the warm-up results. We recommend increasing it to a higher value for faster warm-up. For Intel Gaudi 3, we suggest setting it to `16384`. |
+| `VLLM_PROMPT_SEQ_BUCKET_MAX` | `32768` | The value for a 32K context length; use `16384` for 16K. |
+| `VLLM_DECODE_BLOCK_BUCKET_MIN` | `1024` | The value depends on the warm-up results. |
+| `VLLM_DECODE_BLOCK_BUCKET_STEP` | `1024` | The value depends on the warm-up results. |
+| `VLLM_DECODE_BLOCK_BUCKET_MAX` | `33792` | Calculate the value as `max_num_seqs * max_decode_seq // self.block_size`, where `max_decode_seq` represents the sum of input and output sequences. For example: `128 * ((32 + 1) * 1024) / 128` or `32 * ((32 + 1) * 1024) / 128`. |
 
 ## Batch Size Settings
 
-The default `batch_size=256` is not optimal for long contexts (8K+). Recompilations may occur if there is not enough KV cache space for some sequence groups.
-
-If recompilation or next recomputation warnings appear during inference, reduce `batch_size` to improve stability.
+The default `batch_size=256` setting is not optimal for long contexts (8K+). Recompilations may occur if there is not enough KV cache space for some sequence groups. If recompilation or next recomputation warnings appear during inference, reduce `batch_size` to improve stability.
 
-**Recompilation message example:**
+An example of a recompilation message:
 
 ```bash
 Configuration: (prompt, 1, 36864) was not warmed-up!
 ```
 
-**Warning message example:**
+An example of a warning message:
 
 ```bash
 Sequence group cmpl-3cbf19b0c6d74b3f90b5d5db2ed2385e-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory.
 ```
-
-## Multi-Step Scheduling Feature Usage
-
-Enabling Multi-Step Scheduling is recommended for better decode performance. Refer to vllm-project#6854 for more details.
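The bucket arithmetic in the table above can be checked with a short shell sketch. This is only an illustration of the `VLLM_DECODE_BLOCK_BUCKET_MAX` formula for the 32K example; `max_num_seqs=128` and `block_size=128` are taken from the first worked expression in the table, not recommendations.

```bash
# Sketch: verify the decode-block bucket arithmetic from the table above.
# Assumed example values: max_num_seqs=128, block_size=128, 32K input plus output.
MAX_NUM_SEQS=128
BLOCK_SIZE=128
MAX_DECODE_SEQ=$(( (32 + 1) * 1024 ))                    # sum of input and output sequence lengths
echo $(( MAX_NUM_SEQS * MAX_DECODE_SEQ / BLOCK_SIZE ))   # prints 33792

# Exporting the suggested 32K linear warm-up flags before launching vLLM:
export VLLM_GRAPH_RESERVED_MEM=0.02        # 0.1 for Llama3.1-70B
export VLLM_PROMPT_SEQ_BUCKET_MIN=24576
export VLLM_PROMPT_SEQ_BUCKET_STEP=2048    # 16384 suggested for Intel Gaudi 3
export VLLM_PROMPT_SEQ_BUCKET_MAX=32768
export VLLM_DECODE_BLOCK_BUCKET_MIN=1024
export VLLM_DECODE_BLOCK_BUCKET_STEP=1024
export VLLM_DECODE_BLOCK_BUCKET_MAX=33792
```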
Lines changed: 13 additions & 18 deletions
@@ -1,31 +1,26 @@
 # Quantization, FP8 Inference and Model Calibration Process
 
-!!! note
-    Measurement files are required to run quantized models with vLLM on Gaudi accelerators. The FP8 model calibration procedure is described in detail in [docs.habana.ai vLLM Inference Section](https://docs.habana.ai/en/v1.21.0/PyTorch/Inference_on_PyTorch/vLLM_Inference/vLLM_FP8_Inference.html).
+To run quantized models using vLLM Hardware Plugin for Intel® Gaudi®, you need measurement files, which you can get by following the FP8 model calibration procedure available in the [FP8 Calibration and Inference with vLLM](https://docs.habana.ai/en/v1.21.0/PyTorch/Inference_on_PyTorch/vLLM_Inference/vLLM_FP8_Inference.html) guide. For an end-to-end example tutorial for quantizing a BF16 Llama 3.1 model to FP8 and then inferencing, see [this guide](https://github.com/HabanaAI/Gaudi-tutorials/blob/main/PyTorch/vLLM_Tutorials/FP8_Quantization_using_INC/FP8_Quantization_using_INC.ipynb).
 
-An end-to-end example tutorial for quantizing a BF16 Llama 3.1 model to FP8 and then inferencing is provided in this [Gaudi-tutorials repository](https://github.com/HabanaAI/Gaudi-tutorials/blob/main/PyTorch/vLLM_Tutorials/FP8_Quantization_using_INC/FP8_Quantization_using_INC.ipynb).
-
-Once you have completed the model calibration process and collected the measurements, you can run FP8 inference with vLLM using the following command:
+Once you have completed the model calibration process and collected the measurements, you can run FP8 inference with vLLM using the following commands:
 
 ```bash
 export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_quant_g3.json
 vllm serve meta-llama/Llama-3.1-405B-Instruct --dtype bfloat16 --max-model-len 2048 --block-size 128 --max-num-seqs 32 --quantization inc --kv-cache-dtype fp8_inc --weights-load-device cpu --tensor-parallel-size 8
 ```
 
-`QUANT_CONFIG` is an environment variable that points to the measurement or quantization configuration file. The measurement configuration file is used during the calibration procedure to collect
-measurements for a given model. The quantization configuration is used during inference.
+`QUANT_CONFIG` is an environment variable that points to the measurement or quantization configuration file. The measurement configuration file is required during calibration to collect
+measurements for a given model. The quantization configuration is needed during inference.
+
+Here are a few recommendations for this process:
+
+- For prototyping or testing your model with FP8, you can use the `VLLM_SKIP_WARMUP=true` environment variable to disable the time-consuming warm-up stage. However, we do not recommend disabling this feature in production environments, as it can lead to a significant performance decrease.
 
-!!! tip
-    If you are prototyping or testing your model with FP8, you can use the `VLLM_SKIP_WARMUP=true` environment variable to disable the warmup stage, which is time-consuming.
-    However, disabling this feature in production environments is not recommended, as it can lead to a significant performance decrease.
+- For benchmarking an FP8 model with `scale_format=const`, the `VLLM_DISABLE_MARK_SCALES_AS_CONST=true` setting can help speed up the warm-up stage.
 
-!!! tip
-    If you are benchmarking an FP8 model with `scale_format=const`, setting `VLLM_DISABLE_MARK_SCALES_AS_CONST=true` can help speed up the warmup stage.
+- When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this, set the following environment variables:
 
-!!! tip
-    When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this, set the following environment variables:
-    - `VLLM_ENGINE_ITERATION_TIMEOUT_S` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
-    - `VLLM_RPC_TIMEOUT` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in microseconds, e.g., 600000 equals 10 minutes.
+    - `VLLM_ENGINE_ITERATION_TIMEOUT_S`: Adjusts the vLLM server timeout to the provided value, in seconds.
+    - `VLLM_RPC_TIMEOUT`: Adjusts the RPC protocol timeout used by the OpenAI-compatible API, in microseconds.
 
-!!! tip
-    When running FP8 models with `scale_format=scalar` and lazy mode (`PT_HPU_LAZY_MODE=1`) in order to reduce warmup time it is useful to set `RUNTIME_SCALE_PATCHING=1`. This may introduce a small performance degradation but warmup time should be significantly reduced. Runtime Scale Patching is enabled by default for Torch compile.
+- When running FP8 models with `scale_format=scalar` and lazy mode (`PT_HPU_LAZY_MODE=1`), it is useful to set `RUNTIME_SCALE_PATCHING=1` to reduce warm-up time. This may introduce a small performance degradation, but the warm-up time should be significantly reduced. Runtime Scale Patching is enabled by default for Torch compile.
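The timeout-related variables from the recommendations above can be combined into one snippet. This is a hedged sketch only: the 10-minute values (`600` seconds and `600000` microseconds) are the examples given in the earlier version of this page, not mandated defaults.

```bash
# Sketch: mitigate FP8 compilation timeouts before starting the server.
export VLLM_ENGINE_ITERATION_TIMEOUT_S=600       # vLLM server timeout, in seconds (600 = 10 minutes)
export VLLM_RPC_TIMEOUT=600000                   # OpenAI-compatible API RPC timeout, in microseconds (600000 = 10 minutes)
export VLLM_DISABLE_MARK_SCALES_AS_CONST=true    # optional: speeds up warm-up when benchmarking with scale_format=const
```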

docs/configuration/multi_node.md

Lines changed: 37 additions & 37 deletions
@@ -1,59 +1,59 @@
 
-# Multi-node Configuration
+# Configuring Multi-Node Environment
 
-vLLM works with a multi-node environment setup via Ray. To run models on multiple nodes, follow the procedure below.
+This feature will be introduced in a future release.
+
+vLLM works with a multi-node environment setup via Ray. To run models on multiple nodes, follow this procedure.
 
 ## Prerequisites
-Perform the following on all nodes:
 
-- Install the latest [vllm-fork](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#build-and-install-vllm).
+Before you start, install the latest [vllm-fork](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#build-and-install-vllm).
+
+## Configuration Procedure
 
-- Check if all Gaudi NIC ports are up by running:
+1. Check if all Gaudi NIC ports are up using the following commands on the host, not inside the container.
 
-  !!! note
-      Following commands should be run on the host and NOT inside the container.
+    ```bash
+    cd /opt/habanalabs/qual/gaudi2/bin
+    ./manage_network_ifs.sh --status
+    # All the ports should be in 'up' state. Try flipping the state
+    ./manage_network_ifs.sh --down
+    ./manage_network_ifs.sh --up
+    # Give it a minute for the NIC's to flip and check the status again
+    ```
 
-  ```bash
-  cd /opt/habanalabs/qual/gaudi2/bin
-  ./manage_network_ifs.sh --status
-  # All the ports should be in 'up' state. Try flipping the state
-  ./manage_network_ifs.sh --down
-  ./manage_network_ifs.sh --up
-  # Give it a minute for the NIC's to flip and check the status again
-  ```
+2. Set the following flags:
 
-- Set the following flags:
+    ```bash
+    # Check the network interface for outbound/inbound comms. Command 'ip a' or 'ifconfig' should list all the interfaces
+    export GLOO_SOCKET_IFNAME=eth0
+    export HCCL_SOCKET_IFNAME=eth0
+    ```
 
-  ```bash
-  # Check the network interface for outbound/inbound comms. Command 'ip a' or 'ifconfig' should list all the interfaces
-  export GLOO_SOCKET_IFNAME=eth0
-  export HCCL_SOCKET_IFNAME=eth0
-  ```
+3. Start Ray on the head node.
 
-## 1. Start Ray on the head node:
+    ```bash
+    ray start --head --port=6379
+    ```
 
-```bash
-ray start --head --port=6379
-```
+4. Add workers to the Ray cluster.
 
-## 2. Add workers to the Ray cluster:
+    ```bash
+    ray start --address='<ip-of-ray-head-node>:6379'
+    ```
 
-```bash
-ray start --address='<ip-of-ray-head-node>:6379'
-```
+5. Start the vLLM server:
 
-## 3. Start the vLLM server:
+    ```bash
+    vllm serve meta-llama/Llama-3.1-405B-Instruct --dtype bfloat16 --max-model-len 2048 --block-size 128 --max-num-seqs 32 --tensor-parallel-size 16 --distributed-executor-backend ray
+    ```
 
-```bash
-vllm serve meta-llama/Llama-3.1-405B-Instruct --dtype bfloat16 --max-model-len 2048 --block-size 128 --max-num-seqs 32 --tensor-parallel-size 16 --distributed-executor-backend ray
-```
+For information on running FP8 models with a multi-node setup, see [this guide](https://github.com/HabanaAI/vllm-hpu-extension/blob/main/calibration/README.md).
 
-!!! note
-    Running FP8 models with a multi-node setup is described in the documentation of FP8 calibration procedure: [README](https://github.com/HabanaAI/vllm-hpu-extension/blob/main/calibration/README.md).
+## Online Serving Examples
 
-# Other Online Serving Examples
+For information about reproducing performance numbers using vLLM Hardware Plugin for Intel® Gaudi® for various types of models and varying context lengths, refer to this [collection](https://github.com/HabanaAI/Gaudi-tutorials/tree/main/PyTorch/vLLM_Tutorials/Benchmarking_on_vLLM/Online_Static#quick-start) of static-batched online serving example scripts. The following models and example scripts are available for 2K and 4K context length scenarios:
 
-Please refer to this [collection](https://github.com/HabanaAI/Gaudi-tutorials/tree/main/PyTorch/vLLM_Tutorials/Benchmarking_on_vLLM/Online_Static#quick-start) of static-batched online serving example scripts designed to help the user reproduce performance numbers with vLLM on Gaudi for various types of models and varying context lengths. Below is a list of the models and example scripts provided for 2K and 4K context length scenarios:
 - deepseek-r1-distill-llama-70b_gaudi3_1.20_contextlen-2k
 - deepseek-r1-distill-llama-70b_gaudi3_1.20_contextlen-4k
 - llama-3.1-70b-instruct_gaudi3_1.20_contextlen-2k
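For readers following the new numbered procedure, the commands can also be grouped by node role as in the sketch below. The interface name `eth0` comes from the example in the diff; the head-node address `10.0.0.1` is a placeholder you must replace with your own.

```bash
# Sketch: two-node (2 x 8 HPU) bring-up, grouped by node role.

# On every node: pick the interface used for inbound/outbound communication ('ip a' lists them).
export GLOO_SOCKET_IFNAME=eth0
export HCCL_SOCKET_IFNAME=eth0

# On the head node only:
ray start --head --port=6379

# On each worker node (replace 10.0.0.1 with the head node's IP):
ray start --address='10.0.0.1:6379'

# Back on the head node, once all workers have joined the Ray cluster:
vllm serve meta-llama/Llama-3.1-405B-Instruct --dtype bfloat16 --max-model-len 2048 --block-size 128 --max-num-seqs 32 --tensor-parallel-size 16 --distributed-executor-backend ray
```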
Lines changed: 5 additions & 9 deletions
@@ -1,17 +1,13 @@
 # Pipeline Parallelism
 
-Pipeline parallelism is a distributed model parallelization technique that splits the model vertically across its layers, distributing different parts of the model across multiple devices.
-With this feature, when running a model that does not fit on a single node with tensor parallelism and requires a multi-node solution, we can split the model vertically across its layers and distribute the slices across available nodes.
-For example, if we have two nodes, each with 8 HPUs, we no longer have to use `tensor_parallel_size=16` but can instead set `tensor_parallel_size=8` with pipeline_parallel_size=2.
-Because pipeline parallelism runs a `pp_size` number of virtual engines on each device, we have to lower `max_num_seqs` accordingly, since it acts as a micro batch for each virtual engine.
+Pipeline parallelism is a distributed model parallelization technique that splits the model vertically across its layers, distributing different parts of the model across multiple devices. This approach is particularly useful when a model cannot fit on a single node using tensor parallelism alone and requires a multi-node setup. In such cases, the model’s layers can be split across multiple nodes, allowing each node to handle a segment of the model. For example, if you have two nodes, each equipped with 8 HPUs, you no longer need to set `tensor_parallel_size=16`. Instead, you can configure `tensor_parallel_size=8` and `pipeline_parallel_size=2`.
 
-## Running Pipeline Parallelism
-
-The following example shows how to use Pipeline parallelism with vLLM on HPU:
+The following example shows how to use pipeline parallelism with vLLM on HPU:
 
 ```bash
 vllm serve <model_path> --device hpu --tensor-parallel-size 8 --pipeline_parallel_size 2 --distributed-executor-backend ray
 ```
 
-!!! note
-    Currently, pipeline parallelism on Lazy mode requires the `PT_HPUGRAPH_DISABLE_TENSOR_CACHE=0` flag.
+Since pipeline parallelism runs a `pp_size` number of virtual engines on each device, you have to lower `max_num_seqs` accordingly, as it acts as a micro batch for each virtual engine.
+
+Currently, pipeline parallelism in lazy mode requires the `PT_HPUGRAPH_DISABLE_TENSOR_CACHE=0` flag.
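A combined invocation may make the notes above easier to follow. This is a sketch under assumptions: `PT_HPU_LAZY_MODE=1` selects lazy mode as mentioned in the FP8 section, `<model_path>` is your model, and `--max-num-seqs 16` is only an illustrative lowered value, not a recommendation.

```bash
# Sketch: pipeline parallelism across 2 x 8 HPUs in lazy mode.
export PT_HPU_LAZY_MODE=1                    # lazy mode (assumption; see the FP8 section)
export PT_HPUGRAPH_DISABLE_TENSOR_CACHE=0    # required for pipeline parallelism in lazy mode
vllm serve <model_path> --device hpu \
  --tensor-parallel-size 8 \
  --pipeline_parallel_size 2 \
  --max-num-seqs 16 \
  --distributed-executor-backend ray
```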
