test/integration: introduce robust harness and migrate EPP suite #2022

LukeAVanDrie · 2025-12-18T00:33:34Z

What type of PR is this?
/kind test
/kind cleanup

What this PR does / why we need it:

This PR completes the migration of the integration test suite to the new isolated harness architecture. Following the pattern established in the Body-Based Routing (BBR) migration (#1959), this PR overhauls the EPP hermetic test suite.

The EPP tests are significantly more complex than BBR as they involve the full Kubernetes Controller Runtime lifecycle (Manager, Informers, Caches) and the Scheduler logic. This refactor introduces strict namespace isolation to prevent resource pollution between test cases and promotes server lifecycle logic to shared utilities. This will also be useful for future Flow Control tests which must modify pod metrics to toggle between saturated and unsaturated pool states.

Key Improvements

Namespace-Scoped Controller Manager:
- Introduces TestHarness, which generates a unique Namespace (via UUID) for every test case.
- Configures the controller-runtime Manager with a namespace-scoped cache. This ensures that the EPP server under test only sees Pods and Pools relevant to the current test case, eliminating "ghost" resources from previous runs affecting scheduling decisions.
- Note: EPP tests currently run serially (t.Parallel() is disabled) because they rely on the singleton global Prometheus registry.
Domain-Specific Language (DSL):
- Introduces EPP-specific builders in epp/util_test.go (e.g., ReqSubset, P() for pod states) and assertions (e.g., ExpectRouteTo, ExpectBufferResp).
- Reduces visual noise in table-driven tests, making the scheduling logic (Queue vs. KV Cache vs. LoRA Affinity) immediately obvious.
Deterministic Synchronization:
- Adds WaitForReadyPodsMetric. This blocks test execution until the internal scheduler state (via Prometheus metrics) matches the Kubernetes API state, preventing race conditions where requests are sent before the scheduler has indexed the backend pods.
Shared Infrastructure:
- Promotes server lifecycle logic (startup, strict 127.0.0.1 binding, TCP readiness checks) to the shared test/integration package.
- Updates the existing BBR harness to use these shared utilities, reducing boilerplate.

Refactoring Note

test/integration/util.go still retains some logic specific to the EPP domain. While these belong conceptually in test/integration/epp/util_test.go, I opted to keep them in the shared util.go package for this PR to minimize the already large diff size.

Test Coverage Mapping

Old Test Case Name	New Test Case Name	Changes / Notes
Standard Routing Logic
`select lower queue and kv cache, no active lora`	`select lower queue and kv cache`	Direct mapping.
`select active lora, low queue`	`select active lora, low queue`	Direct mapping.
`select lora despite higher kv cache usage`	`select lora despite higher kv cache (affinity)`	Direct mapping.
`don't shed requests by default`	`do not shed requests by default`	Direct mapping.
Error Handling & Protocol
`invalid json; return body`	`invalid json body`	Direct mapping.
`body sent over multiple requests...`	`split body across chunks`	Renamed for clarity. Tests that EPP correctly reassembles fragmented JSON bodies.
`no backend pods are available`	`no backend pods available`	Direct mapping.
`request don't contains invalid payload, model not exist`	`request missing model field`	Renamed to fix grammar; asserts 400 when JSON is valid but missing mandatory fields.
`simple GET Request`	`protocol: simple GET (header only)`	Renamed to clarify intent.
Subsetting & Metadata
`select active lora with subsetting tag, all pods available`	`subsetting: select best from subset`	Renamed for clarity.
`select active lora with subsetting tag, some pods match`	`subsetting: partial match`	Renamed for clarity.
`select active lora with subsetting tag, no pods available`	`subsetting: no pods match`	Renamed for clarity.
Passthrough & Rewrites
`inferenceobjective's modelName is not translated, passthrough`	`passthrough: model not in objectives`	Renamed for clarity.
`rewrite request model`	`rewrite request model`	Direct mapping.
Response Processing
`responsebody sent over multiple requests, content-type is json, buffer`	`response buffering: multi-chunk JSON`	Renamed for clarity.
`Response is invalid json; return body`	`response buffering: invalid JSON`	Renamed for clarity.
`responsebody sent over a single request...` (JSON variant)	`response buffering: empty EOS chunk (JSON)`	Renamed. Old name was verbose; tests Envoy behavior of sending empty data frame to signal EOS.
`responsebody sent over a single request...` (SSE variant)	`response streaming: SSE token counting`	Renamed. Old name was a duplicate copy-paste error; tests token counting logic on SSE streams.

Which issue(s) this PR fixes:
Refactoring groundwork necessary to support complex state-mutation tests (e.g., Flow Control) which require a strictly isolated environment.

Does this PR introduce a user-facing change?:

NONE

k8s-ci-robot · 2025-12-18T00:33:38Z

@LukeAVanDrie: The label(s) kind/test cannot be applied, because the repository doesn't have them.

Details

In response to this:

What type of PR is this?
/kind test
/kind cleanup

What this PR does / why we need it:

This PR completes the migration of the integration test suite to the new isolated harness architecture. Following the pattern established in the Body-Based Routing (BBR) migration, this PR overhauls the EPP hermetic test suite.

The EPP tests are significantly more complex than BBR as they involve the full Kubernetes Controller Runtime lifecycle (Manager, Informers, Caches) and the Scheduler logic. This refactor introduces strict namespace isolation to prevent resource pollution between test cases and promotes server lifecycle logic to shared utilities. This will also be useful for future Flow Control tests which must modify pod metrics to toggle between saturated and unsaturated pool states.

Key Improvements

Namespace-Scoped Controller Manager:

Introduces TestHarness, which generates a unique Namespace (via UUID) for every test case.

Configures the controller-runtime Manager with a namespace-scoped cache. This ensures that the EPP server under test only sees Pods and Pools relevant to the current test case, eliminating "ghost" resources from previous runs affecting scheduling decisions.

Note: EPP tests currently run serially (t.Parallel() is disabled) because they rely on the singleton global Prometheus registry.

Domain-Specific Language (DSL):

Introduces EPP-specific builders in epp/util_test.go (e.g., ReqSubset, P() for pod states) and assertions (e.g., ExpectRouteTo, ExpectBufferResp).

Reduces visual noise in table-driven tests, making the scheduling logic (Queue vs. KV Cache vs. LoRA Affinity) immediately obvious.

Deterministic Synchronization:

Adds WaitForReadyPodsMetric. This blocks test execution until the internal scheduler state (via Prometheus metrics) matches the Kubernetes API state, preventing race conditions where requests are sent before the scheduler has indexed the backend pods.

Shared Infrastructure:

Promotes server lifecycle logic (startup, strict 127.0.0.1 binding, TCP readiness checks) to the shared test/integration package.

Updates the existing BBR harness to use these shared utilities, reducing boilerplate.

Refactoring Note

test/integration/util.go still retains some logic specific to the EPP domain. While these belong conceptually in test/integration/epp/util_test.go, I opted to keep them in the shared util.go package for this PR to minimize the already large diff size.

Test Coverage Mapping

Old Test Case Name New Test Case Name Changes / Notes

Standard Routing Logic

select lower queue and kv cache, no active lora select lower queue and kv cache Direct mapping.

select active lora, low queue select active lora, low queue Direct mapping.

select lora despite higher kv cache usage select lora despite higher kv cache (affinity) Direct mapping.

don't shed requests by default do not shed requests by default Direct mapping.

Error Handling & Protocol

invalid json; return body invalid json body Direct mapping.

body sent over multiple requests... split body across chunks Renamed for clarity. Tests that EPP correctly reassembles fragmented JSON bodies.

no backend pods are available no backend pods available Direct mapping.

request don't contains invalid payload, model not exist request missing model field Renamed to fix grammar; asserts 400 when JSON is valid but missing mandatory fields.

simple GET Request protocol: simple GET (header only) Renamed to clarify intent.

Subsetting & Metadata

select active lora with subsetting tag, all pods available subsetting: select best from subset Renamed for clarity.

select active lora with subsetting tag, some pods match subsetting: partial match Renamed for clarity.

select active lora with subsetting tag, no pods available subsetting: no pods match Renamed for clarity.

Passthrough & Rewrites

inferenceobjective's modelName is not translated, passthrough passthrough: model not in objectives Renamed for clarity.

rewrite request model rewrite request model Direct mapping.

Response Processing

responsebody sent over multiple requests, content-type is json, buffer response buffering: multi-chunk JSON Renamed for clarity.

Response is invalid json; return body response buffering: invalid JSON Renamed for clarity.

responsebody sent over a single request... (JSON variant) response buffering: empty EOS chunk (JSON) Renamed. Old name was verbose; tests Envoy behavior of sending empty data frame to signal EOS.

responsebody sent over a single request... (SSE variant) response streaming: SSE token counting Renamed. Old name was a duplicate copy-paste error; tests token counting logic on SSE streams.

Which issue(s) this PR fixes:
Refactoring groundwork necessary to support complex state-mutation tests (e.g., Flow Control) which require a strictly isolated environment.

Does this PR introduce a user-facing change?:
NONE

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

netlify · 2025-12-18T00:33:41Z

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`6e1742e`
🔍 Latest deploy log	https://app.netlify.com/projects/gateway-api-inference-extension/deploys/69487393e2c3e00008268fa0
😎 Deploy Preview	https://deploy-preview-2022--gateway-api-inference-extension.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

k8s-ci-robot · 2025-12-18T00:33:44Z

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

LukeAVanDrie · 2025-12-18T00:34:31Z

/assign @ahg-g

I included a test mapping in the PR description which should hopefully make reviewing hermetic_test.go a bit easier. It's probably best to review as a full rewrite. I have ensured existing test cases and assertions strictly match.

ahg-g · 2025-12-19T02:46:29Z

/ok-to-test

k8s-ci-robot · 2025-12-19T07:02:26Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LukeAVanDrie
Once this PR has been reviewed and has the lgtm label, please ask for approval from ahg-g. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

LukeAVanDrie · 2025-12-19T07:03:08Z

Fixed the boilerplate issue on last push.

Refactor the integration test utilities to provide a standardized way to start hermetic gRPC servers, and migrate the BBR harness to use it. - Add `StartExtProcServer`: A helper for background server management with strict TCP readiness checks and 127.0.0.1 binding. - Update `GetFreePort`: Return `int` instead of `*net.TCPAddr` for easier consumption. - Refactor `BBRHarness`: Remove manual server startup boilerplate in favor of the new shared helper, reducing code duplication.

Add a new testing infrastructure for the Endpoint Picker (EPP) to enable hermetic, namespace-isolated testing. - Add `TestHarness`: Manages a namespace-scoped Controller Manager and server instance per test case. This ensures resources from one test do not pollute the scheduler state of another. - Add DSL: Introduce intent-based builders (e.g., `ReqSubset`, `P`) and assertions (e.g., `ExpectRouteTo`) to decouple test logic from gRPC boilerplate. - Support deterministic synchronization by waiting for Prometheus metrics to settle before generating traffic.

Rewrite the hermetic integration suite to use the new isolated harness and DSL. - Migrate all existing test cases from epp_test.go to hermetic_test.go. - Enable pre-loading of base CRDs in TestMain to optimize setup time. - Remove legacy setup code and global variable reliance.

LukeAVanDrie · 2025-12-21T22:24:41Z

resolved merge conflict after rebasing

k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Dec 18, 2025

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 18, 2025

k8s-ci-robot requested review from danehans and elevran December 18, 2025 00:33

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 18, 2025

k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 18, 2025

k8s-ci-robot assigned ahg-g Dec 18, 2025

LukeAVanDrie mentioned this pull request Dec 18, 2025

fix: streaming metrics and header parsing bugs #1627

Closed

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 19, 2025

LukeAVanDrie force-pushed the refactor/epp-hermetic-tests branch from 06fd22f to 5177995 Compare December 19, 2025 07:02

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 20, 2025

LukeAVanDrie added 4 commits December 21, 2025 22:21

resolve conflicts from kubernetes-sigs#1976

6e1742e

LukeAVanDrie force-pushed the refactor/epp-hermetic-tests branch from 5177995 to 6e1742e Compare December 21, 2025 22:24

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 21, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test/integration: introduce robust harness and migrate EPP suite #2022

test/integration: introduce robust harness and migrate EPP suite #2022

LukeAVanDrie commented Dec 18, 2025 •

edited

Loading

Uh oh!

k8s-ci-robot commented Dec 18, 2025

Key Improvements

Refactoring Note

Test Coverage Mapping

Uh oh!

netlify bot commented Dec 18, 2025 •

edited

Loading

Uh oh!

k8s-ci-robot commented Dec 18, 2025

Uh oh!

LukeAVanDrie commented Dec 18, 2025

Uh oh!

ahg-g commented Dec 19, 2025

Uh oh!

k8s-ci-robot commented Dec 19, 2025

Uh oh!

LukeAVanDrie commented Dec 19, 2025

Uh oh!

LukeAVanDrie commented Dec 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

test/integration: introduce robust harness and migrate EPP suite #2022

Are you sure you want to change the base?

test/integration: introduce robust harness and migrate EPP suite #2022

Conversation

LukeAVanDrie commented Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Key Improvements

Refactoring Note

Test Coverage Mapping

Uh oh!

k8s-ci-robot commented Dec 18, 2025

Key Improvements

Refactoring Note

Test Coverage Mapping

Uh oh!

netlify bot commented Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for gateway-api-inference-extension ready!

Uh oh!

k8s-ci-robot commented Dec 18, 2025

Uh oh!

LukeAVanDrie commented Dec 18, 2025

Uh oh!

ahg-g commented Dec 19, 2025

Uh oh!

k8s-ci-robot commented Dec 19, 2025

Uh oh!

LukeAVanDrie commented Dec 19, 2025

Uh oh!

LukeAVanDrie commented Dec 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

LukeAVanDrie commented Dec 18, 2025 •

edited

Loading

netlify bot commented Dec 18, 2025 •

edited

Loading