# SLO-Aware Routing with Latency Prediction

This document describes the modifications made to the InferencePool Helm chart to support SLO-aware routing with latency prediction sidecars.

## Overview

The SLO-aware routing feature enables intelligent request routing based on predicted latency using machine learning models. The system consists of:

1. **EPP (Endpoint Picker) Container**: Main routing logic with latency prediction enabled
2. **Training Server Sidecar**: Continuously trains XGBoost models on observed latency metrics
3. **Prediction Server Sidecars**: Multiple replicas that serve latency predictions for TTFT (Time to First Token) and TPOT (Time Per Output Token)

## Architecture

```
┌────────────────────────────────────────────────────┐
│                      EPP Pod                       │
├──────────────┬──────────────┬──────────────────────┤
│     EPP      │   Training   │  Prediction Servers  │
│  Container   │    Server    │     (3 replicas)     │
│              │              │                      │
│  Port 9002   │  Port 8000   │   Ports 8001-8003    │
│  (ext-proc)  │  (training)  │     (prediction)     │
└──────────────┴──────────────┴──────────────────────┘
       │              │                  │
       │              └────────┬─────────┘
       │                       │
       │                 Model Training
       │                & Synchronization
       │
Routing Decision
(with latency prediction)
```
## Modified Files

### 1. `templates/epp-deployment.yaml`

- Added support for the `sidecars.trainingServer` configuration
- Added support for `sidecars.predictionServers` with a configurable replica count (rendered roughly as sketched below)
- Automatically creates volumes for model storage
- Injects ConfigMaps for the training and prediction server configuration
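As a rough illustration, the deployment template can stamp out one sidecar per replica with a Helm `range` loop along these lines (a simplified sketch, not the chart's literal template; the container names and port arithmetic follow the conventions used elsewhere in this document):

```yaml
{{- if .Values.sidecars.predictionServers.enabled }}
{{- range $i := until (int .Values.sidecars.predictionServers.replicas) }}
# One sidecar per replica: prediction-server-1 ... prediction-server-N
- name: prediction-server-{{ add $i 1 }}
  image: "{{ $.Values.sidecars.predictionServers.image.hub }}/{{ $.Values.sidecars.predictionServers.image.name }}:{{ $.Values.sidecars.predictionServers.image.tag }}"
  ports:
    - containerPort: {{ add 8001 $i }}  # 8001, 8002, 8003, ...
  resources:
{{ toYaml $.Values.sidecars.predictionServers.resources | indent 4 }}
{{- end }}
{{- end }}
```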
### 2. `templates/epp-service.yaml`

- Automatically exposes the training server port (8000)
- Automatically exposes the prediction server ports (8001-8003 by default)
- Ports are added only when the corresponding sidecars are enabled (see the sketch below)
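The service template can append the matching port entries under the same conditions (again a sketch; the port names are illustrative):

```yaml
{{- if .Values.sidecars.trainingServer.enabled }}
- name: latency-training
  port: 8000
  targetPort: 8000
{{- end }}
{{- if .Values.sidecars.predictionServers.enabled }}
{{- range $i := until (int .Values.sidecars.predictionServers.replicas) }}
- name: latency-prediction-{{ add $i 1 }}
  port: {{ add 8001 $i }}
  targetPort: {{ add 8001 $i }}
{{- end }}
{{- end }}
```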
### 3. `templates/latency-predictor-config.yaml` (NEW)

- Creates a ConfigMap for the training server configuration
- Creates a ConfigMap for the prediction server configuration
- Supports customizable model paths, retraining intervals, and other parameters
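The rendered ConfigMap might look roughly like the following; the data keys here are illustrative placeholders, so check the template for the exact keys the servers consume:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: latency-predictor-config  # name referenced in the Troubleshooting section
data:
  # Hypothetical keys mirroring the sidecars.trainingServer.config defaults
  RETRAINING_INTERVAL_SEC: "1"
  MIN_SAMPLES_FOR_RETRAIN: "100"
  MODEL_TYPE: "xgboost"
```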
### 4. `values.yaml`

- Added a comprehensive `sidecars` section with commented examples
- Supports configuration for training and prediction server images, resources, and behavior

### 5. `values-slo-example.yaml` (NEW)

- Complete working example of an SLO-aware routing configuration
- Demonstrates all required settings, including EPP flags, environment variables, and plugin configuration

## Usage

### Quick Start with Example Configuration

```bash
# Install with SLO-aware routing enabled
helm install my-slo-pool oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
  --namespace inference \
  --values values-slo-example.yaml \
  --set inferencePool.modelServers.matchLabels.app=my-model-server
```
### Custom Configuration

Create a custom values file:

```yaml
inferenceExtension:
  image:
    hub: quay.io/your-org
    name: epp
    tag: slo-experimental

  flags:
    - name: enable-latency-predictor
      value: "true"
    - name: v
      value: "4"

  env:
    - name: PREDICTION_SERVER_URL
      value: "http://localhost:8001,http://localhost:8002,http://localhost:8003"
    - name: TRAINING_SERVER_URL
      value: "http://localhost:8000"
    - name: LATENCY_MAX_SAMPLE_SIZE
      value: "10000"

  pluginsCustomConfig:
    slo-plugins.yaml: |
      apiVersion: inference.networking.x-k8s.io/v1alpha1
      kind: EndpointPickerConfig
      plugins:
        - type: slo-request-tracker
        - type: slo-scorer
        - type: slo-aware-profile-handler
      schedulingProfiles:
        - name: slo
          plugins:
            - pluginRef: slo-request-tracker
            - pluginRef: slo-scorer

sidecars:
  trainingServer:
    enabled: true
    image:
      hub: quay.io/your-org
      name: latency-training
      tag: latest
    resources:
      requests:
        cpu: "2000m"
        memory: "4Gi"
      limits:
        cpu: "4000m"
        memory: "8Gi"

  predictionServers:
    enabled: true
    replicas: 3
    image:
      hub: quay.io/your-org
      name: latency-prediction
      tag: latest
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "1000m"
        memory: "2Gi"
```
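Then install the chart with your file (saved here as `my-slo-values.yaml`, a name chosen for illustration):

```bash
helm install my-slo-pool oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
  --namespace inference \
  --values my-slo-values.yaml \
  --set inferencePool.modelServers.matchLabels.app=my-model-server
```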
## Configuration Reference

### Training Server Configuration

| Parameter | Description | Default |
|-----------|-------------|---------|
| `sidecars.trainingServer.enabled` | Enable training server sidecar | `false` |
| `sidecars.trainingServer.image.hub` | Container registry | `us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension` |
| `sidecars.trainingServer.image.name` | Image name | `latency-training` |
| `sidecars.trainingServer.image.tag` | Image tag | `latest` |
| `sidecars.trainingServer.config.retrainingIntervalSec` | Retraining interval in seconds | `1` |
| `sidecars.trainingServer.config.minSamplesForRetrain` | Minimum samples before retraining | `100` |
| `sidecars.trainingServer.config.modelType` | ML model type | `xgboost` |
| `sidecars.trainingServer.persistence.enabled` | Enable persistent storage for models | `false` |
### Prediction Server Configuration

| Parameter | Description | Default |
|-----------|-------------|---------|
| `sidecars.predictionServers.enabled` | Enable prediction server sidecars | `false` |
| `sidecars.predictionServers.replicas` | Number of prediction server replicas | `3` |
| `sidecars.predictionServers.image.hub` | Container registry | `us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension` |
| `sidecars.predictionServers.image.name` | Image name | `latency-prediction` |
| `sidecars.predictionServers.image.tag` | Image tag | `latest` |
| `sidecars.predictionServers.config.modelSyncIntervalSec` | Model sync interval in seconds | `10` |
| `sidecars.predictionServers.config.modelType` | ML model type | `xgboost` |
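For example, to retrain less aggressively while syncing models more often, override the defaults in your values file (the numbers are illustrative):

```yaml
sidecars:
  trainingServer:
    enabled: true
    config:
      retrainingIntervalSec: 30   # default: 1
      minSamplesForRetrain: 500   # default: 100
  predictionServers:
    enabled: true
    config:
      modelSyncIntervalSec: 5     # default: 10
```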
### EPP Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `PREDICTION_SERVER_URL` | Comma-separated prediction server URLs | `http://localhost:8001,http://localhost:8002,http://localhost:8003` |
| `TRAINING_SERVER_URL` | Training server URL | `http://localhost:8000` |
| `LATENCY_MAX_SAMPLE_SIZE` | Maximum sample size for latency prediction | `10000` |
| `NEG_HEADROOM_TPOT_WEIGHT` | Weight for TPOT in the negative-headroom calculation | `0.2` |
| `NEG_HEADROOM_TTFT_WEIGHT` | Weight for TTFT in the negative-headroom calculation | `0.8` |
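The two headroom weights bias the scorer toward TTFT or TPOT violations. Assuming they are combined as a weighted sum over the predicted SLO overruns (an assumption; the exact formula lives in the EPP scorer), an even split would be set like this:

```yaml
inferenceExtension:
  env:
    - name: NEG_HEADROOM_TTFT_WEIGHT
      value: "0.5"
    - name: NEG_HEADROOM_TPOT_WEIGHT
      value: "0.5"
```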
## Building Container Images

### Prerequisites

```bash
cd /path/to/gateway-api-inference-extension
git checkout slo-prediction-experimental
```

### Build EPP Image

```bash
export IMAGE_REGISTRY="quay.io/your-org"
export EPP_TAG="slo-experimental"
make image-build image-push
```

### Build Latency Predictor Images

```bash
cd latencypredictor-v1

# Edit build-deploy.sh to set your registry
# Then build and push:
./build-deploy.sh build

# Tag and push manually
docker tag latencypredictor-v2-training-server:latest ${IMAGE_REGISTRY}/latency-training:slo-experimental
docker tag latencypredictor-v2-prediction-server:latest ${IMAGE_REGISTRY}/latency-prediction:slo-experimental
docker push ${IMAGE_REGISTRY}/latency-training:slo-experimental
docker push ${IMAGE_REGISTRY}/latency-prediction:slo-experimental
```
## Verification

After deployment, verify that all containers are running:

```bash
# Check pod status
kubectl get pods -n your-namespace

# Expected: 1 pod with 5 containers (1 EPP + 1 training + 3 prediction)

# Check EPP logs
kubectl logs -n your-namespace <pod-name> -c epp

# Check training server logs
kubectl logs -n your-namespace <pod-name> -c training-server

# Check prediction server logs
kubectl logs -n your-namespace <pod-name> -c prediction-server-1
```
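To spot-check the training server from your workstation, port-forward the pod and hit the same `/healthz` endpoint used in the Troubleshooting section below:

```bash
kubectl port-forward <pod-name> 8000:8000 -n your-namespace &
curl http://localhost:8000/healthz
```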
## Service Ports

When sidecars are enabled, the service automatically exposes these ports:

- `9002`: EPP gRPC ext-proc (always)
- `9090`: EPP metrics (always)
- `8000`: Training server (when `trainingServer.enabled: true`)
- `8001-800N`: Prediction servers (when `predictionServers.enabled: true`, where N = replicas)
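You can confirm which ports the service actually exposes with a `jsonpath` query (substitute the service name created by your release):

```bash
kubectl get service <service-name> -n your-namespace \
  -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.port}{"\n"}{end}'
```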
## Plugins

SLO-aware routing requires these plugins:

- `slo-request-tracker`: Tracks request SLO requirements
- `slo-scorer`: Scores endpoints based on predicted latency versus SLO
- `slo-aware-profile-handler`: Handles the different scheduling profiles
- `max-score-picker`: Selects the endpoint with the maximum score

### Scheduling Profiles

- **default**: Standard routing with queue and kv-cache scoring
- **slo**: SLO-aware routing using latency predictions (see the sketch below)
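Putting the pieces together, a plugin configuration that wires `max-score-picker` into the `slo` profile could look like the following sketch, which extends the earlier `pluginsCustomConfig` example (the `default` profile's queue and kv-cache plugins are omitted for brevity):

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - type: slo-request-tracker
  - type: slo-scorer
  - type: slo-aware-profile-handler
  - type: max-score-picker
schedulingProfiles:
  - name: slo
    plugins:
      - pluginRef: slo-request-tracker
      - pluginRef: slo-scorer
      - pluginRef: max-score-picker
```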
## Troubleshooting

### Sidecars Not Starting

Check whether the images are accessible:

```bash
kubectl describe pod <pod-name> -n your-namespace
```

### Training Server Issues

Check the ConfigMap and logs:

```bash
kubectl get configmap latency-predictor-config -n your-namespace -o yaml
kubectl logs <pod-name> -c training-server -n your-namespace
```

### Prediction Server Issues

Verify that the prediction servers can reach the training server:

```bash
kubectl exec <pod-name> -c prediction-server-1 -n your-namespace -- \
  curl http://localhost:8000/healthz
```
## Integration with llm-d

To use this chart in llm-d, update your helmfile:

```yaml
releases:
  - name: gaie-slo
    namespace: llm-d-slo
    chart: oci://quay.io/your-org/charts/inferencepool
    version: v1.0.1-slo
    values:
      - gaie-slo/values.yaml
      - gaie-slo/values-slo.yaml
```
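Then sync just this release (assuming a standard helmfile setup):

```bash
helmfile -l name=gaie-slo apply
```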
See the main documentation for complete integration instructions.
