Skip to content

Conversation

@krishna-st
Copy link
Contributor

Implements a drain mode for Pinot Minion instances, allowing operators to gracefully scale down minions by preventing new task assignments while allowing existing tasks to complete.

  • Extended PUT /instances/{instanceName}/state to support DRAIN state for minion instances
  • Draining changes minion tag from minion_untaggedminion_drained, preventing Helix from assigning new tasks
  • Leverages Helix's native tag-based task assignment mechanism (.setInstanceGroupTag())

@krishna-st krishna-st changed the title Kk/minion drain Add Support to tag a minion as Drained Dec 16, 2025
@codecov-commenter
Copy link

codecov-commenter commented Dec 16, 2025

Codecov Report

❌ Patch coverage is 63.41463% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.27%. Comparing base (82b7d2a) to head (e532cfa).

Files with missing lines Patch % Lines
...ntroller/helix/core/PinotHelixResourceManager.java 66.66% 4 Missing and 7 partials ⚠️
...er/api/resources/PinotInstanceRestletResource.java 42.85% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17375      +/-   ##
============================================
+ Coverage     63.25%   63.27%   +0.01%     
+ Complexity     1475     1474       -1     
============================================
  Files          3162     3162              
  Lines        188668   188708      +40     
  Branches      28869    28880      +11     
============================================
+ Hits         119348   119397      +49     
+ Misses        60074    60057      -17     
- Partials       9246     9254       +8     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.23% <63.41%> (+0.01%) ⬆️
java-21 55.57% <ø> (-7.66%) ⬇️
temurin 63.27% <63.41%> (+0.01%) ⬆️
unittests 63.26% <63.41%> (+0.01%) ⬆️
unittests1 55.60% <ø> (+0.01%) ⬆️
unittests2 34.00% <63.41%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a drain mode for Pinot Minion instances to enable graceful scale-down operations. The implementation allows operators to prevent new task assignments to specific minions while existing tasks complete, using Helix's tag-based assignment mechanism.

Key Changes:

  • Added DRAIN state support to the instance state API endpoint
  • Implemented tag switching mechanism (minion_untagged → minion_drained) to prevent new task assignments
  • Added comprehensive test coverage for drain operations including edge cases and concurrency scenarios

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
CommonConstants.java Added DRAINED_MINION_INSTANCE constant for the new drain tag
PinotHelixResourceManagerMinionDrainTest.java Comprehensive unit tests covering basic drain operations, edge cases (multiple tags, empty tags, already drained), and concurrent safety
PinotInstanceRestletResourceTest.java Integration tests verifying API behavior for drain operations including error cases
PinotHelixResourceManager.java Core drain logic with drainMinionInstance() method and helper for tag replacement
StateType.java Extended enum to include DRAIN state
PinotInstanceRestletResource.java API endpoint updates to handle DRAIN state with validation for minion instances only

@krishna-st krishna-st requested a review from Copilot December 16, 2025 16:45
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

* @param instanceName Name of the minion instance to drain
* @return Response indicating success or failure
*/
public synchronized PinotResourceManagerResponse drainMinionInstance(String instanceName) {
Copy link

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The synchronized method-level lock could become a bottleneck under concurrent drain requests across different minion instances. Consider using instance-level locking or a more granular synchronization mechanism (e.g., ConcurrentHashMap with per-instance locks) to allow concurrent draining of different minions.

Copilot uses AI. Check for mistakes.
@ApiOperation(value = "Enable/disable an instance", notes = "Enable/disable an instance")
@ApiOperation(value = "Enable/disable/drain an instance", notes = "Enable/disable/drain an instance. "
+ "DRAIN state is only applicable to minion instances and retags them from minion_untagged to minion_drained, "
+ "preventing new task assignments while allowing existing tasks to complete.")
Copy link

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation should mention that the drain operation is idempotent and preserves other tags on the instance. This is important operational context for users of the API.

Suggested change
+ "preventing new task assignments while allowing existing tasks to complete.")
+ "preventing new task assignments while allowing existing tasks to complete. "
+ "The drain operation is idempotent and preserves other tags on the instance.")

Copilot uses AI. Check for mistakes.
@Consumes(MediaType.TEXT_PLAIN)
@ApiOperation(value = "Enable/disable an instance", notes = "Enable/disable an instance")
@ApiOperation(value = "Enable/disable/drain an instance", notes = "Enable/disable/drain an instance. "
+ "DRAIN state is only applicable to minion instances and retags them from minion_untagged to minion_drained, "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it mean if a minion is tagged to something else, then we cannot drain this node?

Copy link
Contributor Author

@krishna-st krishna-st Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this scenario, yes, because ideally, if the minion is tagged with something else, then there are no tasks that are currently being assigned to it, unless the table config has minionInstanceTag present

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiangfu0 noticed that we can custom-tag minions so that specific tasks get executed on those. So replacing all the tags with minion_drained

"DRAIN state only applies to minion instances. Instance '" + instanceName + "' is not a minion.",
Response.Status.BAD_REQUEST);
}
PinotResourceManagerResponse response = _pinotHelixResourceManager.drainMinionInstance(instanceName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a minion is in drained mode, when the same minion come up again, the tag is still drained, then it won't be able to serve tasks right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct. In case a minion is in drained mode, you would need to retag them with minion_untagged so that Helix can start assigning tasks. This is only when the minion is drained, but not removed from the controller

@krishna-st krishna-st requested a review from xiangfu0 December 22, 2025 06:03
LOGGER.info("Restored pre-drain tags for minion instance: {}", instanceName);
// Return success immediately for drained minions - no need to wait for partition states
// since minions don't have partitions like servers do
return PinotResourceManagerResponse.success("Successfully restored tags for minion instance: "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to process the code for enableInstance(xxxxxx).

}

public PinotResourceManagerResponse enableInstance(String instanceName) {
// If enabling a drained minion, restore to its previous tags
Copy link
Contributor

@xiangfu0 xiangfu0 Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrap this restore minion tags logic into a method and put it in to the:

private PinotResourceManagerResponse enableInstance(String instanceName, boolean enableInstance, long timeOutMs)

// Validate that minion is not already drained
List<String> currentTags = instanceConfig.getTags();
if (currentTags != null && currentTags.contains(Helix.DRAINED_MINION_INSTANCE)) {
throw new UnsupportedOperationException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this exception won't be caught by the external caller in pinot-controller/src/main/java/org/apache/pinot/controller/api/resources/PinotInstanceRestletResource.java#toggleInstanceState()

I feel return is better:

return PinotResourceManagerResponse.failure("Minion instance " + instanceName + " is already drained");

Copy link
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm otherwise

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants