From 82a7cf7965ebe54dc9048f3bc7b8d2457ef7d6a1 Mon Sep 17 00:00:00 2001
From: Benjamin Ironside Goldstein
Date: Fri, 7 Nov 2025 16:04:43 -0600
Subject: [PATCH 1/5] Creates vLLM connection guide

---
 .../security/ai/connect-to-own-local-llm.md   |  2 +-
 solutions/security/ai/connect-to-vLLM.md      | 55 +++++++++++++++++++
 ...onnectors-for-large-language-models-llm.md |  5 +-
 solutions/toc.yml                             |  1 +
 4 files changed, 60 insertions(+), 3 deletions(-)
 create mode 100644 solutions/security/ai/connect-to-vLLM.md

diff --git a/solutions/security/ai/connect-to-own-local-llm.md b/solutions/security/ai/connect-to-own-local-llm.md
index ac38c2dab8..985f24ea41 100644
--- a/solutions/security/ai/connect-to-own-local-llm.md
+++ b/solutions/security/ai/connect-to-own-local-llm.md
@@ -11,7 +11,7 @@ products:
  - id: cloud-serverless
---

-# Connect to your own local LLM
+# Connect to your own local LLM using LM Studio

This page provides instructions for setting up a connector to a large language model (LLM) of your choice using LM Studio. This allows you to use your chosen model within {{elastic-sec}}. You’ll first need to set up a reverse proxy to communicate with {{elastic-sec}}, then set up LM Studio on a server, and finally configure the connector in your Elastic deployment. [Learn more about the benefits of using a local LLM](https://www.elastic.co/blog/ai-assistant-locally-hosted-models).

diff --git a/solutions/security/ai/connect-to-vLLM.md b/solutions/security/ai/connect-to-vLLM.md
new file mode 100644
index 0000000000..8a696071d5
--- /dev/null
+++ b/solutions/security/ai/connect-to-vLLM.md
@@ -0,0 +1,55 @@
+---
+applies_to:
+  stack: all
+  serverless:
+    security: all
+products:
+  - id: security
+  - id: cloud-serverless
+---
+
+# Connect to your own LLM using vLLM (air-gapped environments)
+This page provides an example of how to connect to a self-hosted, open-source large language model (LLM) using the [vLLM inference engine](https://docs.vllm.ai/en/latest/) running in a Docker or Podman container.
+
+Using this approach, you can power Elastic's AI features with an LLM of your choice, deployed and managed on infrastructure you control, without granting external network access. This is particularly useful for air-gapped environments and organizations with strict network security policies.
+
+## Requirements
+
+* Docker or Podman.
+* Necessary GPU drivers.
+
+## Server used in this example
+
+This example was tested using a GCP server configured as follows:
+
+* Operating system: Ubuntu 24.10
+* Machine type: a2-ultragpu-2g
+* vCPU: 24 (12 cores)
+* Architecture: x86/64
+* CPU Platform: Intel Cascade Lake
+* Memory: 340GB
+* Accelerator: 2 x NVIDIA A100 80GB GPUs
+* Reverse Proxy: Nginx
+
+## Outline
+The process involves four main steps:
+
+1. Configure your host server with the necessary GPU resources.
+2. Run the desired model in a vLLM container.
+3. Use a reverse proxy like Nginx to securely expose the endpoint to {{ecloud}}.
+4. Configure the OpenAI connector in your Elastic deployment.
+
+## Step 1: Configure your host server
+
+1. (Optional) If you plan to use a gated model (like Llama 3.1) or a private model, you need to create a [Hugging Face user access token](https://huggingface.co/docs/hub/en/security-tokens).
+   1. Log in to your Hugging Face account.
+   2. Navigate to **Settings > Access Tokens**.
+   3. Create a new token with at least `read` permissions. Copy it to a secure location.
+2. Create an OpenAI-compatible secret token.
You will need a secret token to authenticate communication between {{ecloud}} and your Nginx reverse proxy. Generate a strong, random string and save it in a secure location. You will need it both when configuring Nginx and when configuring the Elastic connector. + +## Step 2: Run your LLM with a vLLM container + +To pull and run your chosen vLLM image: + +1. Connect to your server via SSH. +2. Run the vLLM Docker Container: Execute the following command in your terminal. This command will start the vLLM server, download the model, and expose it on port 8000. \ No newline at end of file diff --git a/solutions/security/ai/set-up-connectors-for-large-language-models-llm.md b/solutions/security/ai/set-up-connectors-for-large-language-models-llm.md index 7c28c156f4..7cdc90c367 100644 --- a/solutions/security/ai/set-up-connectors-for-large-language-models-llm.md +++ b/solutions/security/ai/set-up-connectors-for-large-language-models-llm.md @@ -33,9 +33,10 @@ Follow these guides to connect to one or more third-party LLM providers: * [OpenAI](/solutions/security/ai/connect-to-openai.md) * [Google Vertex](/solutions/security/ai/connect-to-google-vertex.md) -## Connect to a custom local LLM +## Connect to a self-managed LLM -You can [connect to LM Studio](/solutions/security/ai/connect-to-own-local-llm.md) to use a custom LLM deployed and managed by you. +- You can [connect to LM Studio](/solutions/security/ai/connect-to-own-local-llm.md) to use a custom LLM deployed and managed by you. +- For air-gapped environments, you can [connect to vLLM](/solutions/security/ai/connect-to-vLLM.md). ## Preconfigured connectors diff --git a/solutions/toc.yml b/solutions/toc.yml index df2ad35fcc..4f3e62f61d 100644 --- a/solutions/toc.yml +++ b/solutions/toc.yml @@ -575,6 +575,7 @@ toc: - file: security/ai/connect-to-openai.md - file: security/ai/connect-to-google-vertex.md - file: security/ai/connect-to-own-local-llm.md + - file: security/ai/connect-to-vLLM.md - file: security/ai/use-cases.md children: - file: security/ai/triage-alerts.md From ada1c84222a7696f34d66a9b42e4ffe13f9415ae Mon Sep 17 00:00:00 2001 From: Benjamin Ironside Goldstein Date: Fri, 7 Nov 2025 16:20:53 -0600 Subject: [PATCH 2/5] Update connect-to-vLLM.md --- solutions/security/ai/connect-to-vLLM.md | 44 +++++++++++++++++++++--- 1 file changed, 39 insertions(+), 5 deletions(-) diff --git a/solutions/security/ai/connect-to-vLLM.md b/solutions/security/ai/connect-to-vLLM.md index 8a696071d5..38f83a7a11 100644 --- a/solutions/security/ai/connect-to-vLLM.md +++ b/solutions/security/ai/connect-to-vLLM.md @@ -20,7 +20,7 @@ Using this approach, you can power elastic's AI features with an LLM of your cho ## Server used in this example -This example was tested using a GCP server configured as follows: +This example uses a GCP server configured as follows: * Operating system: Ubuntu 24.10 * Machine type: a2-ultragpu-2g @@ -45,11 +45,45 @@ The process involves four main steps: 1. Log in to your Hugging Face account. 2. Navigate to **Settings > Access Tokens**. 3. Create a new token with at least `read` permissions. Copy it to a secure location. -2. Create an OpenAI-compatible secret token. You will need a secret token to authenticate communication between {{ecloud}} and your Nginx reverse proxy. Generate a strong, random string and save it in a secure location. You will need it both when configuring Nginx and when configuring the Elastic connector. +2. Create an OpenAI-compatible secret token. 
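For example, one way to create such a token, as a minimal sketch (this assumes `openssl` is available on the host; any other strong random-string generator works equally well):

```bash
# Print a 64-character random hex string to use as the connector secret token
openssl rand -hex 32
```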
Generate a strong, random string and save it in a secure location. You need the secret token to authenticate communication between {{ecloud}} and your Nginx reverse proxy.

-## Step 2: Run your LLM with a vLLM container
+## Step 2: Run your vLLM container

To pull and run your chosen vLLM image:

-1. Connect to your server via SSH.
-2. Run the vLLM Docker Container: Execute the following command in your terminal. This command will start the vLLM server, download the model, and expose it on port 8000.
\ No newline at end of file
+1. Connect to your server using SSH.
+2. Run the following terminal command to start the vLLM server, download the model, and expose it on port 8000:
+
+```bash
+docker run --name Mistral-Small-3.2-24B --gpus all \
+-v /root/.cache/huggingface:/root/.cache/huggingface \
+--env HUGGING_FACE_HUB_TOKEN=xxxx \
+--env VLLM_API_KEY=xxxx \
+-p 8000:8000 \
+--ipc=host \
+vllm/vllm-openai:v0.9.1 \
+--model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
+--tool-call-parser mistral \
+--tokenizer-mode mistral \
+--config-format mistral \
+--load-format mistral \
+--enable-auto-tool-choice \
+--gpu-memory-utilization 0.90 \
+--tensor-parallel-size 2
+```
+
+::::{admonition} Explanation of command
+`--gpus all`: Exposes all available GPUs to the container.
+`--name`: Set predefined name for the container, otherwise it’s going to be generated
+`-v /root/.cache/huggingface:/root/.cache/huggingface`: Hugging Face cache directory (optional if used with `HUGGING_FACE_HUB_TOKEN`).
+`--env HUGGING_FACE_HUB_TOKEN`: Sets the environment variable for your Hugging Face token (only required for gated models).
+`--env VLLM_API_KEY`: vLLM API Key used for authentication between {{ecloud}} and vLLM.
+`-p 8000:8000`: Maps port 8000 on the host to port 8000 in the container.
+`--ipc=host`: Enables sharing memory between host and container.
+`vllm/vllm-openai:v0.9.1`: Specifies the official vLLM OpenAI-compatible image, version 0.9.1. This is the version of vLLM we recommend.
+`--model`: ID of the Hugging Face model you wish to serve. In this example, it specifies the `Mistral-Small-3.2-24B` model.
+`--tool-call-parser mistral`, `--tokenizer-mode mistral`, `--config-format mistral`, and `--load-format mistral`: Mistral-specific parameters; refer to the Hugging Face model card for recommended values.
+`--enable-auto-tool-choice`: Enables automatic function calling.
+`--gpu-memory-utilization 0.90`: Limits the fraction of GPU memory used by vLLM (may vary depending on the machine resources available).
+`--tensor-parallel-size 2`: This value should match the number of available GPUs (in this case, 2). This is critical for performance on multi-GPU systems.
+::::
\ No newline at end of file
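The first startup can take a while because vLLM downloads the model weights into the mounted cache directory. One way to follow the download and server startup, as a sketch that reuses the container name from the command above, is to tail the container logs until the OpenAI-compatible server reports that it is listening on port 8000:

```bash
# Stream the vLLM container logs; press Ctrl+C to stop following
docker logs -f Mistral-Small-3.2-24B
```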
From e526500d7db377706e2df5c805fa479b6d5be363 Mon Sep 17 00:00:00 2001
From: Benjamin Ironside Goldstein
Date: Fri, 7 Nov 2025 16:46:57 -0600
Subject: [PATCH 3/5] adds collapsible explanation section

---
 solutions/security/ai/connect-to-vLLM.md | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/solutions/security/ai/connect-to-vLLM.md b/solutions/security/ai/connect-to-vLLM.md
index 38f83a7a11..3db1249895 100644
--- a/solutions/security/ai/connect-to-vLLM.md
+++ b/solutions/security/ai/connect-to-vLLM.md
@@ -44,8 +44,9 @@ The process involves four main steps:
1. (Optional) If you plan to use a gated model (like Llama 3.1) or a private model, you need to create a [Hugging Face user access token](https://huggingface.co/docs/hub/en/security-tokens).
 1. Log in to your Hugging Face account.
 2. Navigate to **Settings > Access Tokens**.
- 3. Create a new token with at least `read` permissions. Copy it to a secure location.
+ 3. Create a new token with at least `read` permissions. Save it in a secure location.
2. Create an OpenAI-compatible secret token. Generate a strong, random string and save it in a secure location. You need the secret token to authenticate communication between {{ecloud}} and your Nginx reverse proxy.
+3. Install any necessary GPU drivers.

## Step 2: Run your vLLM container

@@ -72,9 +73,11 @@ vllm/vllm-openai:v0.9.1 \
--tensor-parallel-size 2
```

-::::{admonition} Explanation of command
+.**Click to expand an explanation of the command**
+[%collapsible]
+=====
`--gpus all`: Exposes all available GPUs to the container.
-`--name`: Set predefined name for the container, otherwise it’s going to be generated
+`--name`: Defines a name for the container.
`-v /root/.cache/huggingface:/root/.cache/huggingface`: Hugging Face cache directory (optional if used with `HUGGING_FACE_HUB_TOKEN`).
`--env HUGGING_FACE_HUB_TOKEN`: Sets the environment variable for your Hugging Face token (only required for gated models).
`--env VLLM_API_KEY`: vLLM API Key used for authentication between {{ecloud}} and vLLM.
`-p 8000:8000`: Maps port 8000 on the host to port 8000 in the container.
`--ipc=host`: Enables sharing memory between host and container.
@@ -86,4 +89,5 @@ vllm/vllm-openai:v0.9.1 \
`--enable-auto-tool-choice`: Enables automatic function calling.
`--gpu-memory-utilization 0.90`: Limits the fraction of GPU memory used by vLLM (may vary depending on the machine resources available).
`--tensor-parallel-size 2`: This value should match the number of available GPUs (in this case, 2). This is critical for performance on multi-GPU systems.
-::::
\ No newline at end of file
+=====
+

From 4ce881b868d2861312bf32994f4a8888ccd89bdc Mon Sep 17 00:00:00 2001
From: Benjamin Ironside Goldstein
Date: Fri, 7 Nov 2025 17:11:26 -0600
Subject: [PATCH 4/5] Update connect-to-vLLM.md

---
 solutions/security/ai/connect-to-vLLM.md | 63 +++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 2 deletions(-)

diff --git a/solutions/security/ai/connect-to-vLLM.md b/solutions/security/ai/connect-to-vLLM.md
index 3db1249895..ec9bb5d580 100644
--- a/solutions/security/ai/connect-to-vLLM.md
+++ b/solutions/security/ai/connect-to-vLLM.md
@@ -41,7 +41,7 @@ The process involves four main steps:
## Step 1: Configure your host server

-1. (Optional) If you plan to use a gated model (like Llama 3.1) or a private model, you need to create a [Hugging Face user access token](https://huggingface.co/docs/hub/en/security-tokens).
+1. (Optional) If you plan to use a gated model (such as Llama 3.1) or a private model, create a [Hugging Face user access token](https://huggingface.co/docs/hub/en/security-tokens).
 1. Log in to your Hugging Face account.
 2. Navigate to **Settings > Access Tokens**.
 3. Create a new token with at least `read` permissions. Save it in a secure location.
@@ -73,7 +73,7 @@ vllm/vllm-openai:v0.9.1 \
--tensor-parallel-size 2
```

-.**Click to expand an explanation of the command**
+.**Click to expand a full explanation of the command**
[%collapsible]
=====
`--gpus all`: Exposes all available GPUs to the container.
@@ -91,3 +91,62 @@ vllm/vllm-openai:v0.9.1 \
`--tensor-parallel-size 2`: This value should match the number of available GPUs (in this case, 2). This is critical for performance on multi-GPU systems.
=====
+3. Verify the containers were created by running `docker ps -a`. The output should show the value you specified for the `--name` parameter.
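As an additional check, you can send a test request to the OpenAI-compatible endpoint from the host itself. The following sketch assumes the example container above, with `xxxx` standing in for the `VLLM_API_KEY` value you set in the run command:

```bash
# Send a minimal chat completion request to the local vLLM server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer xxxx" \
  -d '{
        "model": "mistralai/Mistral-Small-3.2-24B-Instruct-2506",
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}]
      }'
```

A JSON response containing a `choices` array indicates that the model is being served correctly.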
+
+## Step 3: Expose the API with a reverse proxy
+
+This example uses Nginx to create a reverse proxy. This improves stability and enables monitoring through Elastic's native Nginx integration. The following example configuration forwards traffic to the vLLM container and uses a secret token for authentication.
+
+1. Install Nginx on your server.
+2. Create a configuration file, for example at `/etc/nginx/sites-available/default`. Give it the following content, replacing `<your-domain>` with your server's domain name and `<your-secret-token>` with the secret token you created in Step 1:
+
+```
+server {
+    listen 80;
+    server_name <your-domain>;
+    return 301 https://$server_name$request_uri;
+}
+
+server {
+    listen 443 ssl http2;
+    server_name <your-domain>;
+
+    ssl_certificate /etc/letsencrypt/live/<your-domain>/fullchain.pem;
+    ssl_certificate_key /etc/letsencrypt/live/<your-domain>/privkey.pem;
+
+    location / {
+        if ($http_authorization != "Bearer <your-secret-token>") {
+            return 401;
+        }
+        proxy_pass http://localhost:8000/;
+    }
+}
+```
+
+3. Enable and restart Nginx to apply the configuration.
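On Ubuntu, for example, the sequence typically looks like the following sketch (assuming the configuration was written to the default site file shown above):

```bash
# Check the configuration for syntax errors, then restart Nginx to apply it
sudo nginx -t
sudo systemctl restart nginx
```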
+
+:::{note}
+For quick testing, you can use [ngrok](https://ngrok.com/) as an alternative to Nginx, but it is not recommended for production use.
+:::
+
+## Step 4: Configure the connector in your Elastic deployment
+
+Finally, create the connector within your Elastic deployment to link it to your vLLM instance.
+
+1. Log in to {{kib}}.
+2. Navigate to the **Connectors** page, click **Create Connector**, and select **OpenAI**.
+3. Give the connector a descriptive name, such as `vLLM - Mistral Small 3.2`.
+4. In **Connector settings**, configure the following:
+ * For **Select an OpenAI provider**, select **Other (OpenAI Compatible Service)**.
+ * For **URL**, enter your server's public URL followed by `/v1/chat/completions`.
+5. For **Default Model**, enter `mistralai/Mistral-Small-3.2-24B-Instruct-2506` or the model ID you used during setup.
+6. For **Authentication**, configure the following:
+ * For **API key**, enter the secret token you created in Step 1 and specified in your Nginx configuration file.
+ * If your chosen model supports tool use, then turn on **Enable native function calling**.
+7. Click **Save**.
+
+Setup is now complete. The model served by your vLLM container can now power Elastic's generative AI features, such as the AI Assistant.
+
+:::{note}
+To run a different model, stop the current container and run a new one with an updated `--model` parameter.
+:::
\ No newline at end of file

From 7fbdd2c97b5b15b348450a22351b4367bfa63086 Mon Sep 17 00:00:00 2001
From: Benjamin Ironside Goldstein
Date: Fri, 7 Nov 2025 17:26:02 -0600
Subject: [PATCH 5/5] Adds final setup steps

---
 solutions/security/ai/connect-to-vLLM.md | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/solutions/security/ai/connect-to-vLLM.md b/solutions/security/ai/connect-to-vLLM.md
index ec9bb5d580..383e2f4299 100644
--- a/solutions/security/ai/connect-to-vLLM.md
+++ b/solutions/security/ai/connect-to-vLLM.md
@@ -91,7 +91,7 @@ vllm/vllm-openai:v0.9.1 \
`--tensor-parallel-size 2`: This value should match the number of available GPUs (in this case, 2). This is critical for performance on multi-GPU systems.
=====
-3. Verify the containers were created by running `docker ps -a`. The output should show the value you specified for the `--name` parameter.
+3. Verify the container's status by running the `docker ps -a` command. The output should show the value you specified for the `--name` parameter.

## Step 3: Expose the API with a reverse proxy

@@ -144,8 +144,22 @@ Finally, create the connector within your Elastic deployment to link it to your
* For **API key**, enter the secret token you created in Step 1 and specified in your Nginx configuration file.
* If your chosen model supports tool use, then turn on **Enable native function calling**.
7. Click **Save**.
+8. Finally, open the **AI Assistant for Security** page using the navigation menu or the [global search field](/explore-analyze/find-and-organize/find-apps-and-objects.md).
+ * On the **Conversations** tab, turn off **Streaming**.
+ * If your model supports tool use, then on the **System prompts** page, create a new system prompt with a variation of the following prompt, to prevent your model from returning tool calls in AI Assistant conversations:
+
+    ```
+    You are a model running under OpenAI-compatible tool calling mode.
+
+    Rules:
+    1. When you want to invoke a tool, never describe the call in text.
+    2. Always return the invocation in the `tool_calls` field.
+    3. The `content` field must remain empty for any assistant message that performs a tool call.
+    4. Only use tool calls defined in the "tools" parameter.
+    ```
+
+Setup is now complete. The model served by your vLLM container can now power Elastic's generative AI features.

:::{note}
To run a different model, stop the current container and run a new one with an updated `--model` parameter.