
Commit f4aeae8

Documentation updates - part 1 (#493)
This is the first batch of documentation updates. I divided my work into several parts to minimize merge conflicts, as multiple people are contributing to these files. The main focus of my changes is to apply good writing practices that enhance readability, flow, and overall professionalism. I also split the Quick Start guide into three separate documents, as it’s an important procedure that should remain easy to follow. It was starting to feel too long and complex in its previous form.

Please review these updates and let me know if any corrections are needed, particularly regarding the environment variables section. It wasn’t clear where VLLM_PROMPT_SEQ_BUCKET_MAX, VLLM_HANDLE_TOPK_DUPLICATES, and VLLM_CONFIG_HIDDEN_LAYERS belong, since they were only mentioned in a tip. I’ve placed them in an additional table for now, but please let me know if that’s not accurate.

---------

Signed-off-by: mhelf-intel <monika.helfer@intel.com>
Co-authored-by: Patryk Wolsza <patryk.wolsza@intel.com>
1 parent 6d64695 commit f4aeae8

14 files changed, +831 -835 lines changed


docs/.nav.yml

Lines changed: 5 additions & 2 deletions
@@ -2,8 +2,11 @@ nav:
   - Home:
     - vLLM x Intel Gaudi: README.md
   - Getting Started:
-    - getting_started/quickstart.md
-    - getting_started/installation.md
+    - Quick Start:
+        - getting_started/quickstart.md
+        - getting_started/quickstart_configuration.md
+        - getting_started/quickstart_inference.md
+    - Installation: getting_started/installation.md
   - Quick Links:
     - User Guide: user_guide/README.md
     - Developer Guide: dev_guide/README.md

docs/configuration/env_vars.md

Lines changed: 82 additions & 63 deletions
Large diffs are not rendered by default.

docs/configuration/long_context.md

Lines changed: 15 additions & 21 deletions
@@ -2,6 +2,7 @@
 # Long Context Configuration
 
 Long context feature enables support for a token context window exceeding 128K tokens. It is supported by the following models:
+
 - [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b)
 - [meta-llama/Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b)
 - [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
@@ -17,39 +18,32 @@ Set the following environment variables to avoid OOM/functional issues. Additio
 - `VLLM_RPC_TIMEOUT=100000`
 - `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1`
 
-## Warmup buckets preparation
-Exponential bucketing mechanism automatically prepares buckets for long context. Linear bucketing mechanism requires manual flags settings.
+## Warm-up Buckets Preparation
 
-**32K context length flags examples for linear warmup:**
+Exponential bucketing mechanism automatically prepares buckets for long context. Linear bucketing mechanism requires manual flags settings. The following table presents 32K context length flags examples for linear warm-up:
 
-- `VLLM_GRAPH_RESERVED_MEM`: The value depends on the model and context length settings. Use `VLLM_GRAPH_RESERVED_MEM=0.02` for Llama3.1-8B or `VLLM_GRAPH_RESERVED_MEM=0.1` for Llama3.1-70B.
-- `VLLM_PROMPT_SEQ_BUCKET_MIN=24576`: Suggested value, depends on warmup results.
-- `VLLM_PROMPT_SEQ_BUCKET_STEP=2048`: Suggested value, depends on warmup results. It is recommended to increase it to a higher value for faster warmup. `VLLM_PROMPT_SEQ_BUCKET_STEP=16384` - Suggested value for Intel Gaudi 3.
-- `VLLM_PROMPT_SEQ_BUCKET_MAX=32768`: Value for context length of 32K. Use 16384 for 16K.
-- `VLLM_DECODE_BLOCK_BUCKET_MIN=1024`: Suggested value, depends on warmup results.
-- `VLLM_DECODE_BLOCK_BUCKET_STEP=1024`: Suggested value, depends on warmup results.
-- `VLLM_DECODE_BLOCK_BUCKET_MAX=33792`: `max_num_seqs * max_decode_seq // self.block_size`, where `max_decode_seq` represents the sum of input and output sequences. For example:
-  - `128 *((32 + 1)* 1024) / 128`
-  - `32 *((32 + 1)* 1024) / 128`
+| Flag | Suggested value | Notes |
+|------|-----------------|-------|
+| `VLLM_GRAPH_RESERVED_MEM` | `0.02` or `0.1` | The value depends on the model and context length settings. Set to `0.02` for Llama3.1-8B or `0.1` for Llama3.1-70B. |
+| `VLLM_PROMPT_SEQ_BUCKET_MIN` | `24576` | The value depends on the warm-up results. |
+| `VLLM_PROMPT_SEQ_BUCKET_STEP` | `2048` | The value depends on the warm-up results. We recommend increasing it to a higher value for faster warm-up. For Intel Gaudi 3, we suggest setting it to `16384`. |
+| `VLLM_PROMPT_SEQ_BUCKET_MAX` | `32768` | The value for a 32K context length; use `16384` for 16K. |
+| `VLLM_DECODE_BLOCK_BUCKET_MIN` | `1024` | The value depends on the warm-up results. |
+| `VLLM_DECODE_BLOCK_BUCKET_STEP` | `1024` | The value depends on the warm-up results. |
+| `VLLM_DECODE_BLOCK_BUCKET_MAX` | `33792` | Calculate the value as `max_num_seqs * max_decode_seq // self.block_size`, where `max_decode_seq` represents the sum of input and output sequences. For example: `128 * ((32 + 1) * 1024) / 128` or `32 * ((32 + 1) * 1024) / 128`. |
 
 ## Batch Size Settings
 
-The default `batch_size=256` is not optimal for long contexts (8K+). Recompilations may occur if there is not enough KV cache space for some sequence groups.
-
-If recompilation or next recomputation warnings appear during inference, reduce `batch_size` to improve stability.
+The default `batch_size=256` setting is not optimal for long contexts (8K+). Recompilations may occur if there is not enough KV cache space for some sequence groups. If recompilation or next recomputation warnings appear during inference, reduce `batch_size` to improve stability.
 
-**Recompilation message example:**
+An example of a recompilation message:
 
 ```bash
 Configuration: (prompt, 1, 36864) was not warmed-up!
 ```
 
-**Warning message example:**
+An example of a warning message:
 
 ```bash
 Sequence group cmpl-3cbf19b0c6d74b3f90b5d5db2ed2385e-0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory.
 ```
-
-## Multi-Step Scheduling Feature Usage
-
-Enabling Multi-Step Scheduling is recommended for better decode performance. Refer to vllm-project#6854 for more details.
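The bucket arithmetic in the table above can be checked with a short shell sketch. This is only an illustration of the `VLLM_DECODE_BLOCK_BUCKET_MAX` formula for the 32K example; `max_num_seqs=128` and `block_size=128` are taken from the first worked expression in the table, not recommendations.

```bash
# Sketch: verify the decode-block bucket arithmetic from the table above.
# Assumed example values: max_num_seqs=128, block_size=128, 32K input plus output.
MAX_NUM_SEQS=128
BLOCK_SIZE=128
MAX_DECODE_SEQ=$(( (32 + 1) * 1024 ))                    # sum of input and output sequence lengths
echo $(( MAX_NUM_SEQS * MAX_DECODE_SEQ / BLOCK_SIZE ))   # prints 33792

# Exporting the suggested 32K linear warm-up flags before launching vLLM:
export VLLM_GRAPH_RESERVED_MEM=0.02        # 0.1 for Llama3.1-70B
export VLLM_PROMPT_SEQ_BUCKET_MIN=24576
export VLLM_PROMPT_SEQ_BUCKET_STEP=2048    # 16384 suggested for Intel Gaudi 3
export VLLM_PROMPT_SEQ_BUCKET_MAX=32768
export VLLM_DECODE_BLOCK_BUCKET_MIN=1024
export VLLM_DECODE_BLOCK_BUCKET_STEP=1024
export VLLM_DECODE_BLOCK_BUCKET_MAX=33792
```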
Lines changed: 13 additions & 18 deletions
@@ -1,31 +1,26 @@
 # Quantization, FP8 Inference and Model Calibration Process
 
-!!! note
-    Measurement files are required to run quantized models with vLLM on Gaudi accelerators. The FP8 model calibration procedure is described in detail in [docs.habana.ai vLLM Inference Section](https://docs.habana.ai/en/v1.21.0/PyTorch/Inference_on_PyTorch/vLLM_Inference/vLLM_FP8_Inference.html).
+To run quantized models using vLLM Hardware Plugin for Intel® Gaudi®, you need measurement files, which you can get by following the FP8 model calibration procedure available in the [FP8 Calibration and Inference with vLLM](https://docs.habana.ai/en/v1.21.0/PyTorch/Inference_on_PyTorch/vLLM_Inference/vLLM_FP8_Inference.html) guide. For an end-to-end example tutorial for quantizing a BF16 Llama 3.1 model to FP8 and then inferencing, see [this guide](https://github.com/HabanaAI/Gaudi-tutorials/blob/main/PyTorch/vLLM_Tutorials/FP8_Quantization_using_INC/FP8_Quantization_using_INC.ipynb).
 
-An end-to-end example tutorial for quantizing a BF16 Llama 3.1 model to FP8 and then inferencing is provided in this [Gaudi-tutorials repository](https://github.com/HabanaAI/Gaudi-tutorials/blob/main/PyTorch/vLLM_Tutorials/FP8_Quantization_using_INC/FP8_Quantization_using_INC.ipynb).
-
-Once you have completed the model calibration process and collected the measurements, you can run FP8 inference with vLLM using the following command:
+Once you have completed the model calibration process and collected the measurements, you can run FP8 inference with vLLM using the following commands:
 
 ```bash
 export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_quant_g3.json
 vllm serve meta-llama/Llama-3.1-405B-Instruct --dtype bfloat16 --max-model-len 2048 --block-size 128 --max-num-seqs 32 --quantization inc --kv-cache-dtype fp8_inc --weights-load-device cpu --tensor-parallel-size 8
 ```
 
-`QUANT_CONFIG` is an environment variable that points to the measurement or quantization configuration file. The measurement configuration file is used during the calibration procedure to collect
-measurements for a given model. The quantization configuration is used during inference.
+`QUANT_CONFIG` is an environment variable that points to the measurement or quantization configuration file. The measurement configuration file is required during calibration to collect
+measurements for a given model. The quantization configuration is needed during inference.
+
+Here are a few recommendations for this process:
+
+- For prototyping or testing your model with FP8, you can use the `VLLM_SKIP_WARMUP=true` environment variable to disable the time-consuming warm-up stage. However, we do not recommend disabling this feature in production environments, as it can lead to a significant performance decrease.
 
-!!! tip
-    If you are prototyping or testing your model with FP8, you can use the `VLLM_SKIP_WARMUP=true` environment variable to disable the warmup stage, which is time-consuming.
-    However, disabling this feature in production environments is not recommended, as it can lead to a significant performance decrease.
+- For benchmarking an FP8 model with `scale_format=const`, the `VLLM_DISABLE_MARK_SCALES_AS_CONST=true` setting can help speed up the warm-up stage.
 
-!!! tip
-    If you are benchmarking an FP8 model with `scale_format=const`, setting `VLLM_DISABLE_MARK_SCALES_AS_CONST=true` can help speed up the warmup stage.
+- When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this, set the following environment variables:
 
-!!! tip
-    When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this, set the following environment variables:
-    - `VLLM_ENGINE_ITERATION_TIMEOUT_S` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
-    - `VLLM_RPC_TIMEOUT` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in microseconds, e.g., 600000 equals 10 minutes.
+    - `VLLM_ENGINE_ITERATION_TIMEOUT_S`: Adjusts the vLLM server timeout to the provided value, in seconds.
+    - `VLLM_RPC_TIMEOUT`: Adjusts the RPC protocol timeout used by the OpenAI-compatible API, in microseconds.
 
-!!! tip
-    When running FP8 models with `scale_format=scalar` and lazy mode (`PT_HPU_LAZY_MODE=1`) in order to reduce warmup time it is useful to set `RUNTIME_SCALE_PATCHING=1`. This may introduce a small performance degradation but warmup time should be significantly reduced. Runtime Scale Patching is enabled by default for Torch compile.
+- When running FP8 models with `scale_format=scalar` and lazy mode (`PT_HPU_LAZY_MODE=1`), it is useful to set `RUNTIME_SCALE_PATCHING=1` to reduce warm-up time. This may introduce a small performance degradation, but the warm-up time should be significantly reduced. Runtime Scale Patching is enabled by default for Torch compile.
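The timeout-related variables from the recommendations above can be combined into one snippet. This is a hedged sketch only: the 10-minute values (`600` seconds and `600000` microseconds) are the examples given in the earlier version of this page, not mandated defaults.

```bash
# Sketch: mitigate FP8 compilation timeouts before starting the server.
export VLLM_ENGINE_ITERATION_TIMEOUT_S=600       # vLLM server timeout, in seconds (600 = 10 minutes)
export VLLM_RPC_TIMEOUT=600000                   # OpenAI-compatible API RPC timeout, in microseconds (600000 = 10 minutes)
export VLLM_DISABLE_MARK_SCALES_AS_CONST=true    # optional: speeds up warm-up when benchmarking with scale_format=const
```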

docs/configuration/multi_node.md

Lines changed: 37 additions & 37 deletions
@@ -1,59 +1,59 @@
 
-# Multi-node Configuration
+# Configuring Multi-Node Environment
 
-vLLM works with a multi-node environment setup via Ray. To run models on multiple nodes, follow the procedure below.
+This feature will be introduced in a future release.
+
+vLLM works with a multi-node environment setup via Ray. To run models on multiple nodes, follow this procedure.
 
 ## Prerequisites
-Perform the following on all nodes:
 
-- Install the latest [vllm-fork](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#build-and-install-vllm).
+Before you start, install the latest [vllm-fork](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#build-and-install-vllm).
+
+## Configuration Procedure
 
-- Check if all Gaudi NIC ports are up by running:
+1. Check if all Gaudi NIC ports are up using the following commands on the host, not inside the container.
 
-  !!! note
-      Following commands should be run on the host and NOT inside the container.
+    ```bash
+    cd /opt/habanalabs/qual/gaudi2/bin
+    ./manage_network_ifs.sh --status
+    # All the ports should be in 'up' state. Try flipping the state
+    ./manage_network_ifs.sh --down
+    ./manage_network_ifs.sh --up
+    # Give it a minute for the NIC's to flip and check the status again
+    ```
 
-  ```bash
-  cd /opt/habanalabs/qual/gaudi2/bin
-  ./manage_network_ifs.sh --status
-  # All the ports should be in 'up' state. Try flipping the state
-  ./manage_network_ifs.sh --down
-  ./manage_network_ifs.sh --up
-  # Give it a minute for the NIC's to flip and check the status again
-  ```
+2. Set the following flags:
 
-- Set the following flags:
+    ```bash
+    # Check the network interface for outbound/inbound comms. Command 'ip a' or 'ifconfig' should list all the interfaces
+    export GLOO_SOCKET_IFNAME=eth0
+    export HCCL_SOCKET_IFNAME=eth0
+    ```
 
-  ```bash
-  # Check the network interface for outbound/inbound comms. Command 'ip a' or 'ifconfig' should list all the interfaces
-  export GLOO_SOCKET_IFNAME=eth0
-  export HCCL_SOCKET_IFNAME=eth0
-  ```
+3. Start Ray on the head node.
 
-## 1. Start Ray on the head node:
+    ```bash
+    ray start --head --port=6379
+    ```
 
-```bash
-ray start --head --port=6379
-```
+4. Add workers to the Ray cluster.
 
-## 2. Add workers to the Ray cluster:
+    ```bash
+    ray start --address='<ip-of-ray-head-node>:6379'
+    ```
 
-```bash
-ray start --address='<ip-of-ray-head-node>:6379'
-```
+5. Start the vLLM server:
 
-## 3. Start the vLLM server:
+    ```bash
+    vllm serve meta-llama/Llama-3.1-405B-Instruct --dtype bfloat16 --max-model-len 2048 --block-size 128 --max-num-seqs 32 --tensor-parallel-size 16 --distributed-executor-backend ray
+    ```
 
-```bash
-vllm serve meta-llama/Llama-3.1-405B-Instruct --dtype bfloat16 --max-model-len 2048 --block-size 128 --max-num-seqs 32 --tensor-parallel-size 16 --distributed-executor-backend ray
-```
+For information on running FP8 models with a multi-node setup, see [this guide](https://github.com/HabanaAI/vllm-hpu-extension/blob/main/calibration/README.md).
 
-!!! note
-    Running FP8 models with a multi-node setup is described in the documentation of FP8 calibration procedure: [README](https://github.com/HabanaAI/vllm-hpu-extension/blob/main/calibration/README.md).
+## Online Serving Examples
 
-# Other Online Serving Examples
+For information about reproducing performance numbers using vLLM Hardware Plugin for Intel® Gaudi® for various types of models and varying context lengths, refer to this [collection](https://github.com/HabanaAI/Gaudi-tutorials/tree/main/PyTorch/vLLM_Tutorials/Benchmarking_on_vLLM/Online_Static#quick-start) of static-batched online serving example scripts. The following models and example scripts are available for 2K and 4K context length scenarios:
 
-Please refer to this [collection](https://github.com/HabanaAI/Gaudi-tutorials/tree/main/PyTorch/vLLM_Tutorials/Benchmarking_on_vLLM/Online_Static#quick-start) of static-batched online serving example scripts designed to help the user reproduce performance numbers with vLLM on Gaudi for various types of models and varying context lengths. Below is a list of the models and example scripts provided for 2K and 4K context length scenarios:
 - deepseek-r1-distill-llama-70b_gaudi3_1.20_contextlen-2k
 - deepseek-r1-distill-llama-70b_gaudi3_1.20_contextlen-4k
 - llama-3.1-70b-instruct_gaudi3_1.20_contextlen-2k
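For readers following the new numbered procedure, the commands can also be grouped by node role as in the sketch below. The interface name `eth0` comes from the example in the diff; the head-node address `10.0.0.1` is a placeholder you must replace with your own.

```bash
# Sketch: two-node (2 x 8 HPU) bring-up, grouped by node role.

# On every node: pick the interface used for inbound/outbound communication ('ip a' lists them).
export GLOO_SOCKET_IFNAME=eth0
export HCCL_SOCKET_IFNAME=eth0

# On the head node only:
ray start --head --port=6379

# On each worker node (replace 10.0.0.1 with the head node's IP):
ray start --address='10.0.0.1:6379'

# Back on the head node, once all workers have joined the Ray cluster:
vllm serve meta-llama/Llama-3.1-405B-Instruct --dtype bfloat16 --max-model-len 2048 --block-size 128 --max-num-seqs 32 --tensor-parallel-size 16 --distributed-executor-backend ray
```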
Lines changed: 5 additions & 9 deletions
@@ -1,17 +1,13 @@
 # Pipeline Parallelism
 
-Pipeline parallelism is a distributed model parallelization technique that splits the model vertically across its layers, distributing different parts of the model across multiple devices.
-With this feature, when running a model that does not fit on a single node with tensor parallelism and requires a multi-node solution, we can split the model vertically across its layers and distribute the slices across available nodes.
-For example, if we have two nodes, each with 8 HPUs, we no longer have to use `tensor_parallel_size=16` but can instead set `tensor_parallel_size=8` with pipeline_parallel_size=2.
-Because pipeline parallelism runs a `pp_size` number of virtual engines on each device, we have to lower `max_num_seqs` accordingly, since it acts as a micro batch for each virtual engine.
+Pipeline parallelism is a distributed model parallelization technique that splits the model vertically across its layers, distributing different parts of the model across multiple devices. This approach is particularly useful when a model cannot fit on a single node using tensor parallelism alone and requires a multi-node setup. In such cases, the model’s layers can be split across multiple nodes, allowing each node to handle a segment of the model. For example, if you have two nodes, each equipped with 8 HPUs, you no longer need to set `tensor_parallel_size=16`. Instead, you can configure `tensor_parallel_size=8` and `pipeline_parallel_size=2`.
 
-## Running Pipeline Parallelism
-
-The following example shows how to use Pipeline parallelism with vLLM on HPU:
+The following example shows how to use pipeline parallelism with vLLM on HPU:
 
 ```bash
 vllm serve <model_path> --device hpu --tensor-parallel-size 8 --pipeline_parallel_size 2 --distributed-executor-backend ray
 ```
 
-!!! note
-    Currently, pipeline parallelism on Lazy mode requires the `PT_HPUGRAPH_DISABLE_TENSOR_CACHE=0` flag.
+Since pipeline parallelism runs a `pp_size` number of virtual engines on each device, you have to lower `max_num_seqs` accordingly, as it acts as a micro batch for each virtual engine.
+
+Currently, pipeline parallelism in lazy mode requires the `PT_HPUGRAPH_DISABLE_TENSOR_CACHE=0` flag.
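A combined invocation may make the notes above easier to follow. This is a sketch under assumptions: `PT_HPU_LAZY_MODE=1` selects lazy mode as mentioned in the FP8 section, `<model_path>` is your model, and `--max-num-seqs 16` is only an illustrative lowered value, not a recommendation.

```bash
# Sketch: pipeline parallelism across 2 x 8 HPUs in lazy mode.
export PT_HPU_LAZY_MODE=1                    # lazy mode (assumption; see the FP8 section)
export PT_HPUGRAPH_DISABLE_TENSOR_CACHE=0    # required for pipeline parallelism in lazy mode
vllm serve <model_path> --device hpu \
  --tensor-parallel-size 8 \
  --pipeline_parallel_size 2 \
  --max-num-seqs 16 \
  --distributed-executor-backend ray
```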
