2 changes: 1 addition & 1 deletion solutions/security/ai/connect-to-own-local-llm.md
@@ -11,7 +11,7 @@ products:
- id: cloud-serverless
---

# Connect to your own local LLM
# Connect to your own local LLM using LM Studio

This page provides instructions for setting up a connector to a large language model (LLM) of your choice using LM Studio. This allows you to use your chosen model within {{elastic-sec}}. You’ll first need to set up a reverse proxy to communicate with {{elastic-sec}}, then set up LM Studio on a server, and finally configure the connector in your Elastic deployment. [Learn more about the benefits of using a local LLM](https://www.elastic.co/blog/ai-assistant-locally-hosted-models).

166 changes: 166 additions & 0 deletions solutions/security/ai/connect-to-vLLM.md
@@ -0,0 +1,166 @@
---
applies_to:
stack: all
serverless:
security: all
products:
- id: security
- id: cloud-serverless
---

# Connect to your own LLM using vLLM (air gapped environments)
This page provides an example of how to connect to a self-hosted, open-source large language model (LLM) using the [vLLM inference engine](https://docs.vllm.ai/en/latest/) running in a Docker or Podman container.

Using this approach, you can power Elastic's AI features with an LLM of your choice, deployed and managed on infrastructure you control, without granting external network access. This is particularly useful for air-gapped environments and organizations with strict network security policies.
Comment on lines +12 to +14
**Member:**

I feel like you're missing other callouts from this guide. Is something like this TMI? Also what about calling this a guide instead of an example?

This guide shows you how to run an OpenAI-compatible large language model with vLLM and connect it to Elastic. The setup runs inside Docker or Podman, is served through an Nginx reverse proxy, and does not require any outbound network access. This makes it a safe option for air-gapped environments or deployments with strict network controls.

The steps below show one example configuration, but you can use any model supported by vLLM, including private and gated models on Hugging Face.

Suggested change:
- This page provides an example of how to connect to a self-hosted, open-source large language model (LLM) using the [vLLM inference engine](https://docs.vllm.ai/en/latest/) running in a Docker or Podman container.
- Using this approach, you can power elastic's AI features with an LLM of your choice deployed and managed on infrastructure you control without granting external network access, which is particularly useful for air-gapped environments and organizations with strict network security policies.
+ This guide shows you how to run an OpenAI-compatible large language model with vLLM and connect it to Elastic. The setup runs inside Docker or Podman, is served through an Nginx reverse proxy, and does not require any outbound network access. This makes it a safe option for air-gapped environments or deployments with strict network controls.
+ The steps below show one example configuration, but you can use any model supported by vLLM, including private and gated models on Hugging Face.


## Requirements

* Docker or Podman.
* Necessary GPU drivers.
**Member:**

Do we need to include a model here as well?

  • Access to a model that works with vLLM.

**Member:**

What about NGINX as a requirement?

  • A reverse proxy, like NGINX


## Server used in this example
**Member:**

Could get rid of the passive voice:

Suggested change:
- `## Server used in this example`
+ `## Example server configuration`


This example uses a GCP server configured as follows:

* Operating system: Ubuntu 24.10
* Machine type: a2-ultragpu-2g
* vCPU: 24 (12 cores)
* Architecture: x86/64
* CPU Platform: Intel Cascade Lake
* Memory: 340GB
* Accelerator: 2 x NVIDIA A100 80GB GPUs
Comment on lines +26 to +31
**Member:**

wowzer this is a big machine 👀

**Member:**

Coming back to this section after reading the whole guide. What's the importance of this Server? AFAICT, these values aren't referenced later on in the guide.

**Member:**

Is this just to show the type of server that may be required?

**Member:**

Maybe this could be a subsection of step 2?

* Reverse Proxy: Nginx

## Outline
**Member:**

Overview? Idk that I've seen outline used in other guides

**Member:**

I'm not even sure this section is necessary

The process involves four main steps:

1. Configure your host server with the necessary GPU resources.
2. Run the desired model in a vLLM container.
3. Use a reverse proxy like Nginx to securely expose the endpoint to {{ecloud}}.
**Contributor Author:**

Is it just Elastic Cloud that this works with? Not other deployment types?

4. Configure the OpenAI connector in your Elastic deployment.

## Step 1: Configure your host server
**Member:**

Did you consider using the stepper component?


1. (Optional) If you plan to use a gated model (such as Llama 3.1) or a private model, create a [Hugging Face user access token](https://huggingface.co/docs/hub/en/security-tokens).
1. Log in to your Hugging Face account.
2. Navigate to **Settings > Access Tokens**.
3. Create a new token with at least `read` permissions. Save it in a secure location.
2. Create an OpenAI-compatible secret token. Generate a strong, random string and save it in a secure location. You need the secret token to authenticate communication between {{ecloud}} and your Nginx reverse proxy.
3. Install any necessary GPU drivers, as shown in the example below.
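
The following is a minimal sketch of steps 2 and 3 above, assuming an Ubuntu host with NVIDIA GPUs and the `openssl` and `ubuntu-drivers` utilities available. Adapt the commands to your distribution and hardware:

```bash
# Generate a strong random string to use as the OpenAI-compatible secret token.
openssl rand -hex 32

# Install the recommended NVIDIA driver (Ubuntu example; package names vary by distribution).
sudo apt-get update
sudo ubuntu-drivers autoinstall

# Running GPU containers also requires the NVIDIA Container Toolkit
# (nvidia-container-toolkit), installed from NVIDIA's package repository.

# Confirm the GPUs are visible once the driver is loaded.
nvidia-smi
```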

## Step 2: Run your vLLM container

To pull and run your chosen vLLM image:

1. Connect to your server using SSH.
2. Run the following terminal command to start the vLLM server, download the model, and expose it on port 8000:

```bash
docker run --name Mistral-Small-3.2-24B --gpus all \
  -v /root/.cache/huggingface:/root/.cache/huggingface \
  --env HUGGING_FACE_HUB_TOKEN=xxxx \
  --env VLLM_API_KEY=xxxx \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.9.1 \
  --model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
  --tool-call-parser mistral \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 2
```

**Comment:** Is this something we will be able to update shortly? I mean we should avoid recommending Mistral-Small-3.2-24B as it has a lot of issues with Security Assistant tool calling.

**Contributor Author:** We can update this any time. For now, since this model isn't recommended, I replaced it with [YOUR_MODEL_ID]. Make sense to you?

:::{dropdown} Click to expand a full explanation of the command
* `--gpus all`: Exposes all available GPUs to the container.
* `--name`: Defines a name for the container.
* `-v /root/.cache/huggingface:/root/.cache/huggingface`: Mounts the Hugging Face cache directory (optional if used with `HUGGING_FACE_HUB_TOKEN`).
* `--env HUGGING_FACE_HUB_TOKEN`: Sets the environment variable for your Hugging Face token (only required for gated models).
* `--env VLLM_API_KEY`: The vLLM API key used for authentication between {{ecloud}} and vLLM.
* `-p 8000:8000`: Maps port 8000 on the host to port 8000 in the container.
* `--ipc=host`: Enables sharing memory between the host and the container.
* `vllm/vllm-openai:v0.9.1`: Specifies the official vLLM OpenAI-compatible image, version 0.9.1. This is the version of vLLM we recommend.
* `--model`: The ID of the Hugging Face model you wish to serve. In this example, it is the `Mistral-Small-3.2-24B` model.
* `--tool-call-parser mistral`, `--tokenizer-mode mistral`, `--config-format mistral`, and `--load-format mistral`: Mistral-specific parameters; refer to the Hugging Face model card for recommended values.
* `--enable-auto-tool-choice`: Enables automatic function calling.
* `--gpu-memory-utilization 0.90`: Limits the maximum GPU memory used by vLLM (adjust depending on the machine resources available).
* `--tensor-parallel-size 2`: Must match the number of available GPUs (in this case, 2). This is critical for performance on multi-GPU systems.
:::

3. Verify the container's status by running the `docker ps -a` command. The output should show the value you specified for the `--name` parameter.
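
You can also query the OpenAI-compatible API from the host to confirm the model loaded correctly. This is a minimal check, assuming `xxxx` is the `VLLM_API_KEY` value you passed to the container:

```bash
# List the models served by vLLM; the response should include the --model ID you specified.
curl -s http://localhost:8000/v1/models \
  -H "Authorization: Bearer xxxx"
```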

## Step 3: Expose the API with a reverse proxy

This example uses Nginx as a reverse proxy. This improves stability and enables monitoring through Elastic's native Nginx integration. The following example configuration forwards traffic to the vLLM container and uses a secret token for authentication.

1. Install Nginx on your server.
2. Create a configuration file, for example at `/etc/nginx/sites-available/default`. Give it the following content:

```
server {
listen 80;
server_name <yourdomainname.com>;
return 301 https://$server_name$request_uri;
}

server {
listen 443 ssl http2;
server_name <yourdomainname.com>;

ssl_certificate /etc/letsencrypt/live/<yourdomainname.com>/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/<yourdomainname.com>/privkey.pem;

location / {
if ($http_authorization != "Bearer <secret token>") {
return 401;
}
proxy_pass http://localhost:8000/;
}
}
```

3. Enable and restart Nginx to apply the configuration.
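
For example, on a systemd-based host you might validate and apply the configuration, then confirm that the proxy enforces the secret token. This is a sketch; it assumes the secret token matches the `VLLM_API_KEY` passed to the container, because Nginx forwards the `Authorization` header to vLLM unchanged:

```bash
# Validate the Nginx configuration and restart the service.
sudo nginx -t
sudo systemctl restart nginx

# A request without the secret token should be rejected with 401
# (replace yourdomainname.com with your domain).
curl -i https://yourdomainname.com/v1/models

# A request with the secret token should be proxied through to vLLM.
curl -i https://yourdomainname.com/v1/models \
  -H "Authorization: Bearer <secret token>"
```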

:::{note}
For quick testing, you can use [ngrok](https://ngrok.com/) as an alternative to Nginx, but it is not recommended for production use.
:::
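
For instance, a temporary tunnel for testing might look like the following. This bypasses the Nginx token check, so the connector would authenticate directly with the `VLLM_API_KEY` (this assumes an ngrok account and authtoken are configured on the host):

```bash
# Expose the local vLLM port through a temporary ngrok tunnel (testing only).
ngrok http 8000
```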

## Step 4: Configure the connector in your Elastic deployment
**Member:**

Suggested change:
- `## Step 4: Configure the connector in your elastic deployment`
+ `## Step 4: Configure the connector in your Elastic deployment`


Create the connector within your Elastic deployment to link it to your vLLM instance.
**Member:**

You say finally twice in this section


1. Log in to {{kib}}.
**Member:**

Is it typical to say "Log into Kibana" or is that assumed? I think the general practice is to just say, "In Kibana, navigate to the Connectors page..."

2. Navigate to the **Connectors** page, click **Create Connector**, and select **OpenAI**.
3. Give the connector a descriptive name, such as `vLLM - Mistral Small 3.2`.
4. In **Connector settings**, configure the following:
* For **Select an OpenAI provider**, select **Other (OpenAI Compatible Service)**.
* For **URL**, enter your server's public URL followed by `/v1/chat/completions`.
5. For **Default Model**, enter `mistralai/Mistral-Small-3.2-24B-Instruct-2506` or the model ID you used during setup.
6. For **Authentication**, configure the following:
* For **API key**, enter the secret token you created in Step 1 and specified in your Nginx configuration file.
* If your chosen model supports tool use, then turn on **Enable native function calling**.
7. Click **Save**.
8. Finally, open the **AI Assistant for Security** page using the navigation menu or the [global search field](/explore-analyze/find-and-organize/find-apps-and-objects.md).
* On the **Conversations** tab, turn off **Streaming**.
* If your model supports tool use, then on the **System prompts** page, create a new system prompt with a variation of the following prompt, to prevent your model from returning tool calls in AI Assistant conversations:

```
You are a model running under OpenAI-compatible tool calling mode.

Rules:
1. When you want to invoke a tool, never describe the call in text.
2. Always return the invocation in the `tool_calls` field.
3. The `content` field must remain empty for any assistant message that performs a tool call.
4. Only use tool calls defined in the "tools" parameter.
```

**Contributor Author:** Note to self: Following https://github.com/elastic/sdh-security-team/issues/1417 to confirm if this system prompt fix works.

**Comment:** Since 9.1.7 it seems it is not needed anymore, but we can keep it until we change the recommended model. More important in this case is to make sure they add

    feature_flags.overrides:
      securitySolution.inferenceChatModelDisabled: true

to `config/kibana.yml`, otherwise Mistral is not going to work with Security Assistant (more details in the linked SDH above).

Setup is now complete. The model served by your vLLM container can power Elastic's generative AI features.
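
As a final end-to-end check, you can send a test request through the proxy to the same chat completions endpoint the connector uses. This sketch uses placeholder values for the domain and token, and assumes the Nginx secret token matches the `VLLM_API_KEY` because Nginx forwards the `Authorization` header unchanged:

```bash
# End-to-end test: client -> Nginx -> vLLM (replace the domain, token, and model ID).
curl -s https://yourdomainname.com/v1/chat/completions \
  -H "Authorization: Bearer <secret token>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-Small-3.2-24B-Instruct-2506",
        "messages": [{"role": "user", "content": "Hello from Elastic"}]
      }'
```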


:::{note}
To run a different model, stop the current container and run a new one with an updated `--model` parameter.
:::
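
For example, a hypothetical switch to another model might look like this (the container name and model ID below are placeholders):

```bash
# Stop and remove the existing container (the name matches the --name value used earlier).
docker stop Mistral-Small-3.2-24B && docker rm Mistral-Small-3.2-24B

# Start a new container that serves a different Hugging Face model.
docker run --name my-other-model --gpus all \
  -v /root/.cache/huggingface:/root/.cache/huggingface \
  --env VLLM_API_KEY=xxxx \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.9.1 \
  --model <your-model-id> \
  --tensor-parallel-size 2
```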
@@ -33,9 +33,10 @@ Follow these guides to connect to one or more third-party LLM providers:
* [OpenAI](/solutions/security/ai/connect-to-openai.md)
* [Google Vertex](/solutions/security/ai/connect-to-google-vertex.md)

## Connect to a custom local LLM
## Connect to a self-managed LLM

You can [connect to LM Studio](/solutions/security/ai/connect-to-own-local-llm.md) to use a custom LLM deployed and managed by you.
- You can [connect to LM Studio](/solutions/security/ai/connect-to-own-local-llm.md) to use a custom LLM deployed and managed by you.
- For air-gapped environments, you can [connect to vLLM](/solutions/security/ai/connect-to-vLLM.md).

## Preconfigured connectors

1 change: 1 addition & 0 deletions solutions/toc.yml
@@ -575,6 +575,7 @@ toc:
- file: security/ai/connect-to-openai.md
- file: security/ai/connect-to-google-vertex.md
- file: security/ai/connect-to-own-local-llm.md
- file: security/ai/connect-to-vLLM.md
- file: security/ai/use-cases.md
children:
- file: security/ai/triage-alerts.md