vLLM custom connector setup guide #3858
Draft: benironside wants to merge 5 commits into main from 3474-vLLM-guide (+171 −3)
Commits (the diff below shows changes from 2 of the 5 commits):
* 82a7cf7 Creates vLLM connection guide (benironside)
* ada1c84 Update connect-to-vLLM.md (benironside)
* e526500 adds collapsible explanation section (benironside)
* 4ce881b Update connect-to-vLLM.md (benironside)
* 7fbdd2c Adds final setup steps (benironside)
**connect-to-vLLM.md** (new file, @@ -0,0 +1,89 @@):
---
applies_to:
  stack: all
  serverless:
    security: all
products:
  - id: security
  - id: cloud-serverless
---

# Connect to your own LLM using vLLM (air-gapped environments)

This page provides an example of how to connect to a self-hosted, open-source large language model (LLM) using the [vLLM inference engine](https://docs.vllm.ai/en/latest/) running in a Docker or Podman container.

Using this approach, you can power Elastic's AI features with an LLM of your choice, deployed and managed on infrastructure you control, without granting external network access. This is particularly useful for air-gapped environments and organizations with strict network security policies.

## Requirements

* Docker or Podman.
* Necessary GPU drivers.
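Before moving on, it can help to confirm that the GPU drivers and the container runtime both work. The commands below are an illustrative sanity check, not part of the original guide; the CUDA image tag is only an example.

```bash
# Confirm the NVIDIA driver is installed and the GPUs are visible on the host
nvidia-smi

# Confirm the container runtime is available
docker --version   # or: podman --version

# Optional: confirm containers can access the GPUs
# (requires the NVIDIA Container Toolkit; the image tag below is illustrative)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```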

## Server used in this example

This example uses a GCP server configured as follows:

* Operating system: Ubuntu 24.10
* Machine type: a2-ultragpu-2g
* vCPU: 24 (12 cores)
* Architecture: x86/64
* CPU platform: Intel Cascade Lake
* Memory: 340 GB
* Accelerator: 2 x NVIDIA A100 80GB GPUs
* Reverse proxy: Nginx
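If you are adapting this example to different hardware, you can quickly compare your server's resources against the specification above. This check is an illustrative addition, not part of the original guide.

```bash
# CPU model and vCPU count
lscpu | grep -E 'Model name|^CPU\(s\)'

# Total memory
free -h

# Attached NVIDIA GPUs (the tensor-parallel size in Step 2 should match this count)
nvidia-smi -L
```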

## Outline

The process involves four main steps:

1. Configure your host server with the necessary GPU resources.
2. Run the desired model in a vLLM container.
3. Use a reverse proxy such as Nginx to securely expose the endpoint to {{ecloud}}.
4. Configure the OpenAI connector in your Elastic deployment.

> **Review comment (author), on step 3:** Is it just Elastic Cloud that this works with? Not other deployment types?

## Step 1: Configure your host server

1. (Optional) If you plan to use a gated model (like Llama 3.1) or a private model, you need to create a [Hugging Face user access token](https://huggingface.co/docs/hub/en/security-tokens):
   1. Log in to your Hugging Face account.
   2. Navigate to **Settings > Access Tokens**.
   3. Create a new token with at least `read` permissions. Copy it to a secure location.
2. Create an OpenAI-compatible secret token. Generate a strong, random string and save it in a secure location. You need the secret token to authenticate communication between {{ecloud}} and your Nginx reverse proxy.
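The guide does not prescribe how to generate the secret token; one common approach, shown here as an illustrative sketch, is to use `openssl`. Any sufficiently long random string works.

```bash
# Generate a 32-byte random hex string to use as the secret token,
# then store it somewhere safe (for example, a password manager)
openssl rand -hex 32
```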

## Step 2: Run your vLLM container

To pull and run your chosen vLLM image:

1. Connect to your server using SSH.
2. Run the following terminal command to start the vLLM server, download the model, and expose it on port 8000:
```bash
docker run --name Mistral-Small-3.2-24B --gpus all \
  -v /root/.cache/huggingface:/root/.cache/huggingface \
  --env HUGGING_FACE_HUB_TOKEN=xxxx \
  --env VLLM_API_KEY=xxxx \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.9.1 \
  --model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
  --tool-call-parser mistral \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 2
```

::::{admonition} Explanation of command
* `--gpus all`: Exposes all available GPUs to the container.
* `--name`: Sets a predefined name for the container; otherwise, a name is generated automatically.
* `-v /root/.cache/huggingface:/root/.cache/huggingface`: Mounts the Hugging Face cache directory (optional if used with `HUGGING_FACE_HUB_TOKEN`).
* `--env HUGGING_FACE_HUB_TOKEN`: Sets the environment variable for your Hugging Face token (only required for gated models).
* `--env VLLM_API_KEY`: The vLLM API key used for authentication between {{ecloud}} and vLLM.
* `-p 8000:8000`: Maps port 8000 on the host to port 8000 in the container.
* `--ipc=host`: Enables sharing memory between the host and the container.
* `vllm/vllm-openai:v0.9.1`: Specifies the official vLLM OpenAI-compatible image, version 0.9.1. This is the version of vLLM we recommend.
* `--model`: The ID of the Hugging Face model you wish to serve. In this example it is the `Mistral-Small-3.2-24B` model.
* `--tool-call-parser mistral`, `--tokenizer-mode mistral`, `--config-format mistral`, and `--load-format mistral`: Mistral-specific parameters; refer to the Hugging Face model card for recommended values.
* `--enable-auto-tool-choice`: Enables automatic function calling.
* `--gpu-memory-utilization 0.90`: Limits the maximum GPU memory used by vLLM (may vary depending on the machine resources available).
* `--tensor-parallel-size 2`: This value should match the number of available GPUs (in this case, 2). This is critical for performance on multi-GPU systems.
::::
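Once the model has finished downloading and loading, you may want to confirm that the vLLM server responds before exposing it through your reverse proxy. The following `curl` checks against vLLM's OpenAI-compatible API are an illustrative addition, not part of the original guide; replace `xxxx` with the `VLLM_API_KEY` value you passed to the container.

```bash
# List the models served by vLLM (should include the Mistral model ID)
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer xxxx"

# Send a minimal chat completion request to verify end-to-end inference
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer xxxx" \
  -d '{
    "model": "mistralai/Mistral-Small-3.2-24B-Instruct-2506",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```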
> **Review comment:** There were a few places throughout the guide that referred to a Docker container. I changed those to just refer to a container since it seems Podman is an acceptable alternative too.