Conformance: Adds Data Parallelism Test #1769

danehans · 2025-10-24T21:39:28Z

What type of PR is this?

/kind test
/area conformance-test

What this PR does / why we need it:

Updates the header-based-testing-filter to select pods whose "IP" or "IP:port" matches any value in the "test-epp-endpoint-selection" header. If a port is provided, only an exact "IP:port" match is accepted.
Adds a conformance test that asserts data parallelism by using the "X-Echo-HTTP-Port" response header introduced in Conformance: Adds Port Response Header to Echo Server gateway-api#4230.
Bumps the EPP image tag in conformance to the tip of main, e.g. v20251105-cbb8928.

Which issue(s) this PR fixes:

Does this PR introduce a user-facing change?:

Conformance: Updates the header-based-testing-filter to select pods whose "IP" or "IP:port" matches any value in the "test-epp-endpoint-selection" header. Adds test to exercise data parallelism routing. Bumps conformance EPP image tag to `v20251105-cbb8928`.

k8s-ci-robot · 2025-10-24T21:39:33Z

@danehans: The label(s) kind/test cannot be applied, because the repository doesn't have them.

In response to this:

What type of PR is this?
/kind test
/area conformance-test

What this PR does / why we need it:

Adds a conformance test that tests routing to endpoints with data parallelism enabled.

Bumps the EPP image tag in conformance to v20251023-d788a2c.

Which issue(s) this PR fixes:

Fixes #1680

Does this PR introduce a user-facing change?:
Conformance: Adds test to exercise data parallelism routing. Bumps conformance EPP image tag to `v20251023-d788a2c`.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

netlify · 2025-10-24T21:40:01Z

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`53bfda7`
🔍 Latest deploy log	https://app.netlify.com/projects/gateway-api-inference-extension/deploys/690e7c94850d4900086079a8
😎 Deploy Preview	https://deploy-preview-1769--gateway-api-inference-extension.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

danehans · 2025-10-24T21:40:30Z

/cc @shmuelk @robscott @zetxqx

shmuelk · 2025-10-28T15:01:04Z

conformance/resources/base.yaml

+  targetPorts:
+    - number: 3000
+    - number: 3002
+    - number: 3004


~~While this is an interesting configuration, I don't think you could do this with a real vLLM server~~

I was wrong

@shmuelk can you elaborate? Are you referring to the non-contiguous port numbers? If so, the reason for this configuration is that the backend is an echo server which listens on multiple ports (xref).

I had thought that when one launched vLLM with Data Parallel in a configuration that listened on multiple ports, the "vLLM Launcher" started up all of the processes.

I was wrong. They are started individually and each one can listen on any port one wants.

shmuelk · 2025-10-28T15:02:41Z

conformance/resources/base.yaml

+    spec:
+      containers:
+      - name: echoserver-3000
+        image: gcr.io/k8s-staging-gateway-api/echo-basic:v20240412-v1.0.0-394-g40c666fd


Other than your use of non-contiguous ports here, why not use llm-d-inference-sim, which supports --data-parallel-size=N?

I don't think we want to swap out the backend to support a single conformance test. @robscott llm-d/llm-d-inference-sim#230 added support for echoing the port number. I don't see this supported in the echo-basic image that we currently use. I submitted kubernetes-sigs/gateway-api#4230 upstream to add the required functionality to the echo server. cc: @robscott

shmuelk · 2025-10-28T15:04:19Z

conformance/tests/gateway_following_epp_routing_dp.go

+			{
+				name:                                  "DP routes only to all pods (EPP returns all; ranks balanced internally)",
+				podIPsToBeReturnedByEPP:               []string{podIPs[0], podIPs[1], podIPs[2]},
+				expectAllRequestsRoutedWithinPodNames: []string{podNames[0], podNames[1], podNames[2]},


You are only checking Pod IPs. Shouldn't you also be checking Pod Ports?

We can't compare the port number in the response with the backend ports since the request/response is being proxied. We need the echo server to support echoing the port number back in a response header, i.e. llm-d/llm-d-inference-sim#232, or switch to vllm-sim. cc: @robscott @zetxqx

Commit 0d46ef5 updates the conformance test to compare the response IP:port with the IP:port set in the "test-epp-endpoint-selection" header. Note that kubernetes-sigs/gateway-api#4230 is required for the echo server to return the port number in the "X-Echo-HTTP-Port" response header.
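To illustrate why the backend has to report its own port when the request is proxied, here is a minimal sketch of a handler that echoes the accepting listener's port in the "X-Echo-HTTP-Port" response header. It is not the echo-basic implementation from kubernetes-sigs/gateway-api#4230; `echoPort` and `portSeenBy` are hypothetical names, and the sketch assumes the local address is available through the standard `http.LocalAddrContextKey`, which a real `net/http` server populates for each connection.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// echoPort writes the port of the listener that accepted the request
// into the "X-Echo-HTTP-Port" response header.
func echoPort(w http.ResponseWriter, r *http.Request) {
	if addr, ok := r.Context().Value(http.LocalAddrContextKey).(*net.TCPAddr); ok {
		w.Header().Set("X-Echo-HTTP-Port", strconv.Itoa(addr.Port))
	}
	w.WriteHeader(http.StatusOK)
}

// portSeenBy simulates a request accepted on the given local port, without
// opening a socket, and returns the header value the handler echoed back.
func portSeenBy(port int) string {
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	local := &net.TCPAddr{IP: net.IPv4(10, 0, 0, 2), Port: port} // made-up pod IP
	req = req.WithContext(context.WithValue(req.Context(), http.LocalAddrContextKey, local))
	rec := httptest.NewRecorder()
	echoPort(rec, req)
	return rec.Header().Get("X-Echo-HTTP-Port")
}

func main() {
	fmt.Println(portSeenBy(3002)) // prints 3002
}
```

A client behind the gateway only sees the gateway's address, so a value like this, set by the backend itself, is what lets a test tell ranks on the same pod apart.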

shmuelk · 2025-10-28T15:07:01Z

pkg/epp/scheduling/framework/plugins/test/filter/request_header_based_filter.go


-// Filter selects pods that match the IP addresses specified in the request header.
+// Filter selects pods whose IPs match any value in the "test-epp-endpoint-selection" header.
+// Values may be "IP" or "IP:port"; ports (ranks) are ignored here because DP fan-out happens later.


Why do you want to ignore the port? I would have thought you would add code that if a port was specified, it filters by both IP and port.

Without that how do you know that DP really worked and sent requests to a "non-base" of the model server?

Commit 0d46ef5 updated this method to select pods whose IP or IP:port matches any value in the "test-epp-endpoint-selection" header. If a port is provided, only an exact IP:port match is accepted.

danehans · 2025-11-05T21:44:11Z

/test pull-gateway-api-inference-extension-test-unit-main

zetxqx

Sorry for the delayed review. Looks good, just some nit comments.

I'm also setting up GKE to run against this conformance test as well.

zetxqx · 2025-11-07T21:04:38Z

go.mod

 	// Update the CONTROLLER_TOOLS_VERSION in Makefile when bumping controller-tools.
 	sigs.k8s.io/controller-tools v0.19.0
-	sigs.k8s.io/gateway-api v1.4.0
+	sigs.k8s.io/gateway-api v1.3.1-0.20251106052652-079e4774d76b


I'm wondering whether this can be 1.4.0-something? This will be synced back to our internal pipeline, which has already been updated to 1.4.0.

I thought it would be a v1.4.0-something too, but ^ is how the dep gets resolved when I:

$ REQ_COMMIT=079e4774d76b909a8d43eae7dba570e2208cc9a7
$ go get sigs.k8s.io/gateway-api@$REQ_COMMIT

zetxqx · 2025-11-07T21:05:26Z

pkg/epp/scheduling/framework/plugins/test/consts.go

+	//   - "IP:port"     — selects only pods whose IP and port both match exactly.
+	//                     Ports correspond to data-parallel ranks or specific targetPorts.
+	//
+	// IPv6 addresses are supported, with or without brackets (e.g. "fd00::1" or "[fd00::1]:3002").


Thank you for adding the ipv6 support!
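For anyone wondering how "IP", "IP:port", and bracketed or unbracketed IPv6 values can share one parser, here is a rough sketch using the standard `net/netip` package. `parseSelector` is a hypothetical name and this is not the code from the PR; it just shows one way to resolve the ambiguity by trying a bare address before an address-with-port.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseSelector accepts "IP", "[IPv6]", "IPv4:port", or "[IPv6]:port".
// hasPort is false when the value names only an address.
func parseSelector(s string) (addr netip.Addr, port uint16, hasPort bool, err error) {
	s = strings.TrimSpace(s)
	// Try a bare address first so that "fd00::1" is read as an IPv6
	// address rather than as a host followed by a port.
	if a, e := netip.ParseAddr(strings.Trim(s, "[]")); e == nil {
		return a, 0, false, nil
	}
	// Otherwise require the address:port form (IPv6 must be bracketed here).
	ap, e := netip.ParseAddrPort(s)
	if e != nil {
		return netip.Addr{}, 0, false, fmt.Errorf("invalid selector %q: %w", s, e)
	}
	return ap.Addr(), ap.Port(), true, nil
}

func main() {
	for _, v := range []string{"10.0.0.2", "10.0.0.2:3000", "fd00::1", "[fd00::1]:3002"} {
		addr, port, hasPort, err := parseSelector(v)
		fmt.Println(v, addr, port, hasPort, err)
	}
}
```

One consequence of this ordering: an unbracketed value such as "fd00::1:3002" parses as a plain IPv6 address, so a port on an IPv6 selector has to use the bracketed form.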

conformance/tests/gateway_following_epp_routing_dp.go

Signed-off-by: Daneyon Hansen <daneyon.hansen@solo.io>

zetxqx

I ran this test successfully with GKE as well.

command

go test -v ./conformance -args -debug -gateway-class gke-l7-regional-external-managed -cleanup-base-resources=false -allow-crds-mismatch=true -run-test GatewayFollowingEPPRoutingWithDataParallelism

=== RUN   TestConformance/InferencePoolResolvedRefsCondition
    conformance.go:66: Skipping InferencePoolResolvedRefsCondition: test explicitly skipped
--- PASS: TestConformance (33.63s)
    --- SKIP: TestConformance/EppUnAvailableFailOpen (0.00s)
    --- SKIP: TestConformance/GatewayFollowingEPPRouting (0.00s)
    --- PASS: TestConformance/GatewayFollowingEPPRoutingWithDataParallelism (31.70s)
        --- PASS: TestConformance/GatewayFollowingEPPRoutingWithDataParallelism/DP_routes_only_to_one_designated_pod_(any_rank) (0.77s)
        --- PASS: TestConformance/GatewayFollowingEPPRoutingWithDataParallelism/DP_routes_only_to_two_designated_pods;_one_has_a_fixed_rank (0.76s)
        --- PASS: TestConformance/GatewayFollowingEPPRoutingWithDataParallelism/DP_routes_to_all_pods;_one_pod_restricted_to_3000,3004 (0.93s)
    --- SKIP: TestConformance/GatewayWeightedAcrossTwoInferencePools (0.00s)
    --- SKIP: TestConformance/HTTPRouteInvalidInferencePoolRef (0.00s)
    --- SKIP: TestConformance/HTTPRouteMultipleGatewaysDifferentPools (0.00s)
    --- SKIP: TestConformance/InferencePoolAccepted (0.00s)
    --- SKIP: TestConformance/InferencePoolHTTPRoutePortValidation (0.00s)
    --- SKIP: TestConformance/InferencePoolInvalidEPPService (0.00s)
    --- SKIP: TestConformance/HTTPRouteMultipleRulesDifferentPools (0.00s)
    --- SKIP: TestConformance/InferencePoolResolvedRefsCondition (0.00s)
PASS
ok  	sigs.k8s.io/gateway-api-inference-extension/conformance	33.840s

k8s-ci-robot · 2025-11-08T23:09:54Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danehans, zetxqx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [danehans]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

zetxqx · 2025-11-08T23:10:17Z

/hold

in case others want to take another look @robscott @shmuelk

zetxqx · 2025-11-08T23:17:13Z

/lgtm

shmuelk · 2025-11-09T14:30:03Z

This looks good to me. As I am not an official reviewer, I'm not adding slash-lgtm and slash-approve

k8s-ci-robot added the area/conformance-test Issues or PRs related to Conformance tests. label Oct 24, 2025

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 24, 2025

k8s-ci-robot requested review from ahg-g and elevran October 24, 2025 21:39

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 24, 2025

k8s-ci-robot requested review from robscott, shmuelk and zetxqx October 24, 2025 21:40

danehans force-pushed the issue_1680 branch from 4b2a410 to 78e6755 Compare October 27, 2025 15:18

danehans mentioned this pull request Oct 27, 2025

Conformance: Support Pod Port in Data Parallelism Tests #1773

Open

shmuelk reviewed Oct 28, 2025

View reviewed changes

danehans force-pushed the issue_1680 branch from 78e6755 to ba01b5b Compare November 5, 2025 20:12

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 5, 2025

danehans force-pushed the issue_1680 branch from ba01b5b to 0d46ef5 Compare November 5, 2025 20:14

danehans requested a review from shmuelk November 5, 2025 20:22

danehans force-pushed the issue_1680 branch from 0d46ef5 to 65299a3 Compare November 5, 2025 20:34

danehans force-pushed the issue_1680 branch from 4809247 to c2f2195 Compare November 6, 2025 17:12

danehans self-assigned this Nov 6, 2025

zetxqx reviewed Nov 7, 2025

View reviewed changes

danehans added 2 commits November 7, 2025 15:03

Conformance: Adds Data Parallelism Test

8c758f9

Signed-off-by: Daneyon Hansen <daneyon.hansen@solo.io>

Use port from json body and go.mod replace

53bfda7

Signed-off-by: Daneyon Hansen <daneyon.hansen@solo.io>

danehans force-pushed the issue_1680 branch from c2f2195 to 53bfda7 Compare November 7, 2025 23:11

danehans requested a review from zetxqx November 7, 2025 23:11

zetxqx approved these changes Nov 8, 2025

View reviewed changes

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 8, 2025

k8s-ci-robot assigned zetxqx Nov 8, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 8, 2025
