Skip to content

Conversation

@LukeAVanDrie
Copy link
Contributor

@LukeAVanDrie LukeAVanDrie commented Dec 18, 2025

What type of PR is this?
/kind test
/kind cleanup

What this PR does / why we need it:

This PR completes the migration of the integration test suite to the new isolated harness architecture. Following the pattern established in the Body-Based Routing (BBR) migration (#1959), this PR overhauls the EPP hermetic test suite.

The EPP tests are significantly more complex than BBR as they involve the full Kubernetes Controller Runtime lifecycle (Manager, Informers, Caches) and the Scheduler logic. This refactor introduces strict namespace isolation to prevent resource pollution between test cases and promotes server lifecycle logic to shared utilities. This will also be useful for future Flow Control tests which must modify pod metrics to toggle between saturated and unsaturated pool states.

Key Improvements

  1. Namespace-Scoped Controller Manager:

    • Introduces TestHarness, which generates a unique Namespace (via UUID) for every test case.
    • Configures the controller-runtime Manager with a namespace-scoped cache. This ensures that the EPP server under test only sees Pods and Pools relevant to the current test case, eliminating "ghost" resources from previous runs affecting scheduling decisions.
    • Note: EPP tests currently run serially (t.Parallel() is disabled) because they rely on the singleton global Prometheus registry.
  2. Domain-Specific Language (DSL):

    • Introduces EPP-specific builders in epp/util_test.go (e.g., ReqSubset, P() for pod states) and assertions (e.g., ExpectRouteTo, ExpectBufferResp).
    • Reduces visual noise in table-driven tests, making the scheduling logic (Queue vs. KV Cache vs. LoRA Affinity) immediately obvious.
  3. Deterministic Synchronization:

    • Adds WaitForReadyPodsMetric. This blocks test execution until the internal scheduler state (via Prometheus metrics) matches the Kubernetes API state, preventing race conditions where requests are sent before the scheduler has indexed the backend pods.
  4. Shared Infrastructure:

    • Promotes server lifecycle logic (startup, strict 127.0.0.1 binding, TCP readiness checks) to the shared test/integration package.
    • Updates the existing BBR harness to use these shared utilities, reducing boilerplate.

Refactoring Note

test/integration/util.go still retains some logic specific to the EPP domain. While these belong conceptually in test/integration/epp/util_test.go, I opted to keep them in the shared util.go package for this PR to minimize the already large diff size.

Test Coverage Mapping

Old Test Case Name New Test Case Name Changes / Notes
Standard Routing Logic
select lower queue and kv cache, no active lora select lower queue and kv cache Direct mapping.
select active lora, low queue select active lora, low queue Direct mapping.
select lora despite higher kv cache usage select lora despite higher kv cache (affinity) Direct mapping.
don't shed requests by default do not shed requests by default Direct mapping.
Error Handling & Protocol
invalid json; return body invalid json body Direct mapping.
body sent over multiple requests... split body across chunks Renamed for clarity. Tests that EPP correctly reassembles fragmented JSON bodies.
no backend pods are available no backend pods available Direct mapping.
request don't contains invalid payload, model not exist request missing model field Renamed to fix grammar; asserts 400 when JSON is valid but missing mandatory fields.
simple GET Request protocol: simple GET (header only) Renamed to clarify intent.
Subsetting & Metadata
select active lora with subsetting tag, all pods available subsetting: select best from subset Renamed for clarity.
select active lora with subsetting tag, some pods match subsetting: partial match Renamed for clarity.
select active lora with subsetting tag, no pods available subsetting: no pods match Renamed for clarity.
Passthrough & Rewrites
inferenceobjective's modelName is not translated, passthrough passthrough: model not in objectives Renamed for clarity.
rewrite request model rewrite request model Direct mapping.
Response Processing
responsebody sent over multiple requests, content-type is json, buffer response buffering: multi-chunk JSON Renamed for clarity.
Response is invalid json; return body response buffering: invalid JSON Renamed for clarity.
responsebody sent over a single request... (JSON variant) response buffering: empty EOS chunk (JSON) Renamed. Old name was verbose; tests Envoy behavior of sending empty data frame to signal EOS.
responsebody sent over a single request... (SSE variant) response streaming: SSE token counting Renamed. Old name was a duplicate copy-paste error; tests token counting logic on SSE streams.

Which issue(s) this PR fixes:
Refactoring groundwork necessary to support complex state-mutation tests (e.g., Flow Control) which require a strictly isolated environment.

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Dec 18, 2025
@k8s-ci-robot
Copy link
Contributor

@LukeAVanDrie: The label(s) kind/test cannot be applied, because the repository doesn't have them.

Details

In response to this:

What type of PR is this?
/kind test
/kind cleanup

What this PR does / why we need it:

This PR completes the migration of the integration test suite to the new isolated harness architecture. Following the pattern established in the Body-Based Routing (BBR) migration, this PR overhauls the EPP hermetic test suite.

The EPP tests are significantly more complex than BBR as they involve the full Kubernetes Controller Runtime lifecycle (Manager, Informers, Caches) and the Scheduler logic. This refactor introduces strict namespace isolation to prevent resource pollution between test cases and promotes server lifecycle logic to shared utilities. This will also be useful for future Flow Control tests which must modify pod metrics to toggle between saturated and unsaturated pool states.

Key Improvements

  1. Namespace-Scoped Controller Manager:
  • Introduces TestHarness, which generates a unique Namespace (via UUID) for every test case.
  • Configures the controller-runtime Manager with a namespace-scoped cache. This ensures that the EPP server under test only sees Pods and Pools relevant to the current test case, eliminating "ghost" resources from previous runs affecting scheduling decisions.
  • Note: EPP tests currently run serially (t.Parallel() is disabled) because they rely on the singleton global Prometheus registry.
  1. Domain-Specific Language (DSL):
  • Introduces EPP-specific builders in epp/util_test.go (e.g., ReqSubset, P() for pod states) and assertions (e.g., ExpectRouteTo, ExpectBufferResp).
  • Reduces visual noise in table-driven tests, making the scheduling logic (Queue vs. KV Cache vs. LoRA Affinity) immediately obvious.
  1. Deterministic Synchronization:
  • Adds WaitForReadyPodsMetric. This blocks test execution until the internal scheduler state (via Prometheus metrics) matches the Kubernetes API state, preventing race conditions where requests are sent before the scheduler has indexed the backend pods.
  1. Shared Infrastructure:
  • Promotes server lifecycle logic (startup, strict 127.0.0.1 binding, TCP readiness checks) to the shared test/integration package.
  • Updates the existing BBR harness to use these shared utilities, reducing boilerplate.

Refactoring Note

test/integration/util.go still retains some logic specific to the EPP domain. While these belong conceptually in test/integration/epp/util_test.go, I opted to keep them in the shared util.go package for this PR to minimize the already large diff size.

Test Coverage Mapping

Old Test Case Name New Test Case Name Changes / Notes
Standard Routing Logic
select lower queue and kv cache, no active lora select lower queue and kv cache Direct mapping.
select active lora, low queue select active lora, low queue Direct mapping.
select lora despite higher kv cache usage select lora despite higher kv cache (affinity) Direct mapping.
don't shed requests by default do not shed requests by default Direct mapping.
Error Handling & Protocol
invalid json; return body invalid json body Direct mapping.
body sent over multiple requests... split body across chunks Renamed for clarity. Tests that EPP correctly reassembles fragmented JSON bodies.
no backend pods are available no backend pods available Direct mapping.
request don't contains invalid payload, model not exist request missing model field Renamed to fix grammar; asserts 400 when JSON is valid but missing mandatory fields.
simple GET Request protocol: simple GET (header only) Renamed to clarify intent.
Subsetting & Metadata
select active lora with subsetting tag, all pods available subsetting: select best from subset Renamed for clarity.
select active lora with subsetting tag, some pods match subsetting: partial match Renamed for clarity.
select active lora with subsetting tag, no pods available subsetting: no pods match Renamed for clarity.
Passthrough & Rewrites
inferenceobjective's modelName is not translated, passthrough passthrough: model not in objectives Renamed for clarity.
rewrite request model rewrite request model Direct mapping.
Response Processing
responsebody sent over multiple requests, content-type is json, buffer response buffering: multi-chunk JSON Renamed for clarity.
Response is invalid json; return body response buffering: invalid JSON Renamed for clarity.
responsebody sent over a single request... (JSON variant) response buffering: empty EOS chunk (JSON) Renamed. Old name was verbose; tests Envoy behavior of sending empty data frame to signal EOS.
responsebody sent over a single request... (SSE variant) response streaming: SSE token counting Renamed. Old name was a duplicate copy-paste error; tests token counting logic on SSE streams.

Which issue(s) this PR fixes:
Refactoring groundwork necessary to support complex state-mutation tests (e.g., Flow Control) which require a strictly isolated environment.

Does this PR introduce a user-facing change?:

NONE

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@netlify
Copy link

netlify bot commented Dec 18, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 6e1742e
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/69487393e2c3e00008268fa0
😎 Deploy Preview https://deploy-preview-2022--gateway-api-inference-extension.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 18, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 18, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 18, 2025
@LukeAVanDrie
Copy link
Contributor Author

/assign @ahg-g

I included a test mapping in the PR description which should hopefully make reviewing hermetic_test.go a bit easier. It's probably best to review as a full rewrite. I have ensured existing test cases and assertions strictly match.

@ahg-g
Copy link
Contributor

ahg-g commented Dec 19, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 19, 2025
@LukeAVanDrie LukeAVanDrie force-pushed the refactor/epp-hermetic-tests branch from 06fd22f to 5177995 Compare December 19, 2025 07:02
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LukeAVanDrie
Once this PR has been reviewed and has the lgtm label, please ask for approval from ahg-g. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@LukeAVanDrie
Copy link
Contributor Author

Fixed the boilerplate issue on last push.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 20, 2025
Refactor the integration test utilities to provide a standardized way to
start hermetic gRPC servers, and migrate the BBR harness to use it.

- Add `StartExtProcServer`: A helper for background server management
  with strict TCP readiness checks and 127.0.0.1 binding.
- Update `GetFreePort`: Return `int` instead of `*net.TCPAddr` for
  easier consumption.
- Refactor `BBRHarness`: Remove manual server startup boilerplate in
  favor of the new shared helper, reducing code duplication.
Add a new testing infrastructure for the Endpoint Picker (EPP) to enable
hermetic, namespace-isolated testing.

- Add `TestHarness`: Manages a namespace-scoped Controller Manager and
  server instance per test case. This ensures resources from one test
  do not pollute the scheduler state of another.
- Add DSL: Introduce intent-based builders (e.g., `ReqSubset`, `P`) and
  assertions (e.g., `ExpectRouteTo`) to decouple test logic from gRPC
  boilerplate.
- Support deterministic synchronization by waiting for Prometheus
  metrics to settle before generating traffic.
Rewrite the hermetic integration suite to use the new isolated harness
and DSL.

- Migrate all existing test cases from epp_test.go to hermetic_test.go.
- Enable pre-loading of base CRDs in TestMain to optimize setup time.
- Remove legacy setup code and global variable reliance.
@LukeAVanDrie LukeAVanDrie force-pushed the refactor/epp-hermetic-tests branch from 5177995 to 6e1742e Compare December 21, 2025 22:24
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 21, 2025
@LukeAVanDrie
Copy link
Contributor Author

resolved merge conflict after rebasing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants