Skip to content

Conversation

@sirtimid
Copy link
Contributor

@sirtimid sirtimid commented Dec 18, 2025

Closes #660

  • Add connection limit (default 100 concurrent connections)
  • Add message size limit (default 1MB per message)
  • Add stale peer cleanup (removes data for peers disconnected >1 hour)
  • Make all limits configurable via RemoteCommsOptions
  • Add ResourceLimitError for limit violations
  • Add comprehensive tests for all resource limits

This prevents memory exhaustion and manages system resources by:

  • Rejecting new connections when limit is reached
  • Rejecting messages exceeding size limit
  • Periodically cleaning up stale peer data

Note

Introduces resource enforcement in remote comms and a dedicated error type, with robust reconnection/race handling and cleanup.

  • Enforces max concurrent connections (default 100) and max message size (default 1MB) in network.ts; rejects excess with ResourceLimitError; all limits configurable via RemoteCommsOptions (maxConcurrentConnections, maxMessageSizeBytes, cleanupIntervalMs, stalePeerTimeoutMs).
  • Adds periodic stale peer cleanup (default every 15m; peers idle >1h) clearing queues, hints, reconnection state; integrates wake handling and intentional-close logic.
  • Improves reconnection: ignores stale channel errors, reuses existing channels, rechecks limits post-dial, flushes queues on channel replacement.
  • Adds ConnectionFactory.closeChannel to explicitly close/abort underlying streams and uses it for rejected/replaced channels.
  • Adds ResourceLimitError (code RESOURCE_LIMIT_ERROR) with marshal/unmarshal validation and exports.
  • Expands tests across limits, reconnection races, inbound handling, channel replacement/closing; adjusts coverage thresholds for ocap-kernel.

Written by Cursor Bugbot for commit 71fe98b. This will update automatically on new commits. Configure here.

@sirtimid sirtimid requested a review from a team as a code owner December 18, 2025 14:38
@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from a40d3eb to 96f26e2 Compare December 18, 2025 15:05
@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from b6c23e0 to 25e81ac Compare December 19, 2025 19:55
@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from 8e04986 to ce1e774 Compare January 5, 2026 15:35
@rekmarks
Copy link
Member

rekmarks commented Jan 5, 2026

@cursor review

- Add connection limit (default 100 concurrent connections)
- Add message size limit (default 1MB per message)
- Add stale peer cleanup (removes data for peers disconnected >1 hour)
- Make all limits configurable via RemoteCommsOptions
- Add ResourceLimitError for limit violations
- Add comprehensive tests for all resource limits

This prevents memory exhaustion and manages system resources by:
- Rejecting new connections when limit is reached
- Rejecting messages exceeding size limit
- Periodically cleaning up stale peer data
@sirtimid sirtimid force-pushed the sirtimid/remote-comms-resource-limits branch from ce1e774 to 8323837 Compare January 5, 2026 20:51
@sirtimid sirtimid requested a review from FUDCo January 5, 2026 20:51
Copy link
Contributor

@FUDCo FUDCo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine for what it is.

One lingering question I have is that as we get more careful about detecting error conditions (or proclaiming them, in the case of configurable resource limits), are we potentially setting ourselves up for situations where an error bubbles up to user code at some location other than the location that is actually responsible for causing it? In other words, will a message transmission error always find its way back to the actual send operation that triggered it?

More generally, could user code find itself in an unrecoverable error state (by that I don't mean a state where there's an error that you can't get rid of -- things can always break unfixably, e.g., a remote host dies forever -- but rather a state where you don't actually know you're stuck). It's entirely plausible to me that everything is fine, but I can't tell from reading the tests whether our tests give us reason to believe we are ok on this score.

@sirtimid
Copy link
Contributor Author

sirtimid commented Jan 6, 2026

One lingering question I have is that as we get more careful about detecting error conditions (or proclaiming them, in the case of configurable resource limits), are we potentially setting ourselves up for situations where an error bubbles up to user code at some location other than the location that is actually responsible for causing it? In other words, will a message transmission error always find its way back to the actual send operation that triggered it?

More generally, could user code find itself in an unrecoverable error state (by that I don't mean a state where there's an error that you can't get rid of -- things can always break unfixably, e.g., a remote host dies forever -- but rather a state where you don't actually know you're stuck). It's entirely plausible to me that everything is fine, but I can't tell from reading the tests whether our tests give us reason to believe we are ok on this score.

@FUDCo Two parts to your question:

Error attribution: Yeah transmission errors are always caught within the specific sendRemoteMessage() call that triggered them. Each call is independent with its own try/catch, so errors won't bubble up to the wrong location. The only errors that surface to callers are synchronous validation errors (ResourceLimitError, intentional close), which also come from the correct call site.

Stuck without knowing: Currently sendRemoteMessage is fire-and-forget—messages queue during reconnection, but if retries are exhausted, they're silently lost. The caller never knows. This is exactly what the Message Acknowledgment work addresses: sendRemoteMessage will resolve on ACK or reject after retries, so user code will know definitively.

@sirtimid sirtimid enabled auto-merge (squash) January 6, 2026 19:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remote comms: Implement Resource Limits

4 participants