fix: job retry mechanism not triggering #4961
base: main
Conversation
Dependency Review: ✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found. Scanned Files: None
584b846 to b02ebc2
Co-authored-by: Brend Smits <brend.smits@philips.com>
b02ebc2 to 3eee9b2
Pull request overview
This pull request fixes a critical bug where the job retry mechanism was not being triggered during the scale-up process. The fix re-introduces the publishRetryMessage call and corrects the logic for skipping runner creation when the maximum runner count is exceeded or when newRunners would be negative.
Key Changes
- Re-introduced the `publishRetryMessage` call in the scale-up loop to ensure retry messages are published for queued jobs
- Fixed the condition for skipping runner creation from `missingInstanceCount === scaleUp` to `newRunners <= 0`, preventing attempts to create negative numbers of runners (see the sketch after this list)
- Added comprehensive test coverage for the retry mechanism with 7 new test cases covering various scenarios
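To make the two code changes concrete, here is a minimal sketch of the intended control flow. It is illustrative only: the types and helpers (`isJobQueued`, `getCurrentRunnerCount`, `createRunners`) are simplified stand-ins for the real code in `scale-up.ts`, not the repository's actual implementation.

```typescript
// Hypothetical, simplified types and helpers — stand-ins for the real ones in scale-up.ts.
interface ActionRequestMessage { id: number; eventType: string; }
interface ActionRequestMessageRetry extends ActionRequestMessage { retryCounter?: number; }

declare function isJobQueued(message: ActionRequestMessage): Promise<boolean>;
declare function publishRetryMessage(message: ActionRequestMessageRetry): Promise<void>;
declare function getCurrentRunnerCount(): Promise<number>;
declare function createRunners(count: number): Promise<void>;

async function scaleUpSketch(messages: ActionRequestMessage[], maximumRunners: number): Promise<void> {
  let scaleUp = 0;
  for (const message of messages) {
    if (await isJobQueued(message)) {
      scaleUp++;
      // Re-introduced call: publish a retry message for every queued job so the
      // job is re-checked later, even if runner creation does not happen.
      await publishRetryMessage(message as ActionRequestMessageRetry);
    }
  }

  const currentRunners = await getCurrentRunnerCount();
  // newRunners can be negative when currentRunners already exceeds maximumRunners,
  // hence the fixed skip condition `newRunners <= 0` instead of comparing counts.
  const newRunners = Math.min(scaleUp, maximumRunners - currentRunners);
  if (newRunners <= 0) return;

  await createRunners(newRunners);
}
```

The key point is that `newRunners` can go negative when the pool is already over the maximum, a case the old count comparison did not handle.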
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lambdas/functions/control-plane/src/scale-runners/scale-up.ts | Imports `publishRetryMessage`, calls it for each queued message, and fixes the skip condition to handle negative `newRunners` values |
| lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts | Adds mock setup for `publishRetryMessage` and a new test suite with 7 tests covering retry mechanism behavior in various scenarios |
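As a rough idea of the kind of mock setup and assertion the new test suite adds, here is a hedged sketch. It assumes a vitest-style test runner; the module paths, the `scaleUp` call signature, and the `queuedJobMessage` fixture are hypothetical and will differ from the actual `scale-up.test.ts`.

```typescript
import { describe, expect, it, vi } from 'vitest';
import { publishRetryMessage } from './job-retry'; // assumed module path
import { scaleUp } from './scale-up';               // assumed export and signature

// Replace the real retry publisher with a spy so the test can assert it was called.
vi.mock('./job-retry', () => ({ publishRetryMessage: vi.fn() }));

// Hypothetical fixture for a still-queued workflow_job message.
const queuedJobMessage = () => ({
  id: 1,
  eventType: 'workflow_job',
  repositoryName: 'example-repo',
  repositoryOwner: 'example-owner',
});

describe('retry mechanism (sketch)', () => {
  it('publishes a retry message when a job is still queued', async () => {
    await scaleUp('aws:sqs', [queuedJobMessage()]);
    expect(publishRetryMessage).toHaveBeenCalledWith(expect.objectContaining({ id: 1 }));
  });
});
```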
}
scaleUp++;
await publishRetryMessage(message as ActionRequestMessageRetry);
Good catch, the call got lost in the PR that introduced the batches.
I'm a bit unsure whether we should publish the message here. At this point no runner is created, only the count is increased. On the other hand, we cannot match the created runners with the event.
I think it would be a bit safer to call publishRetryMessage at the end for each message that is not marked as invalid, since the invalid ones go back to the queue and will be retried anyway. Otherwise more retry messages are created for the same event.
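For illustration, a rough sketch of the alternative described above: publish the retry messages only at the end, and only for messages that were not marked invalid. This is a hypothetical shape, not code from this PR or from the experiment branch; the `ProcessedMessage` type and helper names are made up.

```typescript
// Hypothetical types and helper, standing in for the real ones in scale-up.ts.
interface ActionRequestMessageRetry {
  id: number;
  eventType: string;
  retryCounter?: number;
}

declare function publishRetryMessage(message: ActionRequestMessageRetry): Promise<void>;

interface ProcessedMessage {
  message: ActionRequestMessageRetry;
  invalid: boolean;
}

// Publish retries only after the batch has been processed, skipping messages that
// were marked invalid: those go back to the queue and are retried there, so
// publishing a retry message for them as well would duplicate work.
async function publishRetriesForValidMessages(processed: ProcessedMessage[]): Promise<void> {
  for (const { message, invalid } of processed) {
    if (invalid) continue;
    await publishRetryMessage(message);
  }
}
```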
@copilot what do you think?
I'm not sure it's worth adding all that additional logic (and an additional for-loop) to this function; it would make it considerably more complex as well.
If a message is marked as invalid, worst case it goes through the retry mechanism, ends up in the same scale-up function, and is still an invalid message, leading to minimal waste.
IMO that's a good trade-off to avoid making this function even more complex.
What do you think @npalm?
@npalm, what is your feeling about what @Brend-Smits said?
No, I do not agree. The problem is that in case of failure, pushing all messages back quickly increases the pressure on the APIs. So I think it would be better to filter. I did a quick experiment on a branch with some Copilot refactoring, but I have not had the time to dig in further.
iainlane left a comment
Cheers - right now I just had one request!
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@stuartp44 I've opened a new pull request, #4980, to work on those changes. Once the pull request is ready, I'll request review from you.
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: stuartp44 <1926002+stuartp44@users.noreply.github.com>
This pull request adds comprehensive tests for the retry mechanism in the `scaleUp` functionality and reintroduces the `publishRetryMessage` call to the scale-up process. The tests ensure that the retry logic works correctly under various scenarios, such as when jobs are queued, when the maximum number of runners is reached, and when queue checks are disabled.

Testing and retry mechanism enhancements:
- Added tests to `scale-up.test.ts` to cover scenarios where `publishRetryMessage` should be called, including: when jobs are queued, when maximum runners are reached, with correct message structure, and when job queue checks are disabled.

Other code updates:
- Skip runner creation based on `newRunners <= 0` instead of comparing counts, improving clarity and correctness.

Example scenarios for the above bug
Scenario 1
Scenario 2
Scenario 3
We tested this in our staging environment and verified it's working.
Closes #4960