fix: job retry mechanism not triggering #4961
base: main
Conversation
Dependency Review: ✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found. Scanned Files: None
584b846 to b02ebc2
Co-authored-by: Brend Smits <brend.smits@philips.com>
b02ebc2 to 3eee9b2
Pull request overview
This pull request fixes a critical bug where the job retry mechanism was not being triggered during the scale-up process. The fix re-introduces the publishRetryMessage call and corrects the logic for skipping runner creation when the maximum runner count is exceeded or when newRunners would be negative.
Key Changes
- Re-introduced the `publishRetryMessage` call in the scale-up loop to ensure retry messages are published for queued jobs
- Fixed the condition for skipping runner creation from `missingInstanceCount === scaleUp` to `newRunners <= 0`, preventing attempts to create negative numbers of runners (see the sketch after this list)
- Added comprehensive test coverage for the retry mechanism with 7 new test cases covering various scenarios
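To make the two code changes concrete, here is a minimal sketch of the intended control flow. It is illustrative only: the types and helpers (`isJobQueued`, `getCurrentRunnerCount`, `createRunners`) are simplified stand-ins for the real code in `scale-up.ts`, not the repository's actual implementation.

```typescript
// Hypothetical, simplified types and helpers — stand-ins for the real ones in scale-up.ts.
interface ActionRequestMessage { id: number; eventType: string; }
interface ActionRequestMessageRetry extends ActionRequestMessage { retryCounter?: number; }

declare function isJobQueued(message: ActionRequestMessage): Promise<boolean>;
declare function publishRetryMessage(message: ActionRequestMessageRetry): Promise<void>;
declare function getCurrentRunnerCount(): Promise<number>;
declare function createRunners(count: number): Promise<void>;

async function scaleUpSketch(messages: ActionRequestMessage[], maximumRunners: number): Promise<void> {
  let scaleUp = 0;
  for (const message of messages) {
    if (await isJobQueued(message)) {
      scaleUp++;
      // Re-introduced call: publish a retry message for every queued job so the
      // job is re-checked later, even if runner creation does not happen.
      await publishRetryMessage(message as ActionRequestMessageRetry);
    }
  }

  const currentRunners = await getCurrentRunnerCount();
  // newRunners can be negative when currentRunners already exceeds maximumRunners,
  // hence the fixed skip condition `newRunners <= 0` instead of comparing counts.
  const newRunners = Math.min(scaleUp, maximumRunners - currentRunners);
  if (newRunners <= 0) return;

  await createRunners(newRunners);
}
```

The key point is that `newRunners` can go negative when the pool is already over the maximum, a case the old count comparison did not handle.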
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lambdas/functions/control-plane/src/scale-runners/scale-up.ts | Imports `publishRetryMessage`, calls it for each queued message, and fixes the skip condition to handle negative `newRunners` values |
| lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts | Adds mock setup for `publishRetryMessage` and a new test suite with 7 tests covering retry mechanism behavior in various scenarios |
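As a rough idea of the kind of mock setup and assertion the new test suite adds, here is a hedged sketch. It assumes a vitest-style test runner; the module paths, the `scaleUp` call signature, and the `queuedJobMessage` fixture are hypothetical and will differ from the actual `scale-up.test.ts`.

```typescript
import { describe, expect, it, vi } from 'vitest';
import { publishRetryMessage } from './job-retry'; // assumed module path
import { scaleUp } from './scale-up';               // assumed export and signature

// Replace the real retry publisher with a spy so the test can assert it was called.
vi.mock('./job-retry', () => ({ publishRetryMessage: vi.fn() }));

// Hypothetical fixture for a still-queued workflow_job message.
const queuedJobMessage = () => ({
  id: 1,
  eventType: 'workflow_job',
  repositoryName: 'example-repo',
  repositoryOwner: 'example-owner',
});

describe('retry mechanism (sketch)', () => {
  it('publishes a retry message when a job is still queued', async () => {
    await scaleUp('aws:sqs', [queuedJobMessage()]);
    expect(publishRetryMessage).toHaveBeenCalledWith(expect.objectContaining({ id: 1 }));
  });
});
```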
}
scaleUp++;
await publishRetryMessage(message as ActionRequestMessageRetry);
Good catch, the call got lost in the PR that introduced the batches.
I'm a bit unsure whether we should publish the message here. At this point no runner is created, only the count is increased. On the other hand, we cannot match the created runners with the event.
I think it would be a bit safer to call publishRetryMessage at the end for each message that is not marked as invalid, since the invalid ones go back to the queue and will be retried anyway. Otherwise more retry messages are created for the same event.
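For illustration, a rough sketch of the alternative described above: publish the retry messages only at the end, and only for messages that were not marked invalid. This is a hypothetical shape, not code from this PR or from the experiment branch; the `ProcessedMessage` type and helper names are made up.

```typescript
// Hypothetical types and helper, standing in for the real ones in scale-up.ts.
interface ActionRequestMessageRetry {
  id: number;
  eventType: string;
  retryCounter?: number;
}

declare function publishRetryMessage(message: ActionRequestMessageRetry): Promise<void>;

interface ProcessedMessage {
  message: ActionRequestMessageRetry;
  invalid: boolean;
}

// Publish retries only after the batch has been processed, skipping messages that
// were marked invalid: those go back to the queue and are retried there, so
// publishing a retry message for them as well would duplicate work.
async function publishRetriesForValidMessages(processed: ProcessedMessage[]): Promise<void> {
  for (const { message, invalid } of processed) {
    if (invalid) continue;
    await publishRetryMessage(message);
  }
}
```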
@copilot what do you think?
I'm not sure it's worth adding all that additional logic (and an additional for-loop) to this function; it would make it considerably more complex as well.
If a message is marked as invalid, worst case it goes through the retry mechanism, ends up in the same scale-up function, and is still an invalid message, leading to minimal waste.
IMO that's a good trade-off to avoid making this function even more complex.
What do you think @npalm?
@npalm, what is your feeling about what @Brend-Smits said?
No, I do not agree. The problem is that in case of failure, pushing all messages back quickly increases the pressure on the APIs. So I think it would be better to filter. I did a quick experiment on a branch with some Copilot refactoring, but I have not had the time to dig in further.
iainlane left a comment
Cheers - right now I just had one request!
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@stuartp44 I've opened a new pull request, #4980, to work on those changes. Once the pull request is ready, I'll request review from you.
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: stuartp44 <1926002+stuartp44@users.noreply.github.com>
This pull request adds comprehensive tests for the retry mechanism in the `scaleUp` functionality and reintroduces the `publishRetryMessage` call to the scale-up process. The tests ensure that the retry logic works correctly under various scenarios, such as when jobs are queued, when the maximum number of runners is reached, and when queue checks are disabled.

Testing and retry mechanism enhancements:
- Added tests to `scale-up.test.ts` to cover scenarios where `publishRetryMessage` should be called, including: when jobs are queued, when maximum runners are reached, with correct message structure, and when job queue checks are disabled.

Other code updates:
- Skip runner creation based on `newRunners <= 0` instead of comparing counts, improving clarity and correctness.

Example scenarios for the above bug
Scenario 1
Scenario 2
Scenario 3
We tested this in our staging environment and verified it's working.
Closes #4960