
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Jan 6, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| rivetkit-serverless | Error | Error | Jan 6, 2026 1:34am |

3 Skipped Deployments

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| rivet-cloud | Ignored | Ignored | Jan 6, 2026 1:34am |
| rivet-inspector | Ignored | Ignored Preview | Jan 6, 2026 1:34am |
| rivet-site | Ignored | Ignored Preview | Jan 6, 2026 1:34am |

Contributor Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite. Learn more.


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Jan 6, 2026

Pull Request Review: Fix Memory Leak with Local Subscribers

Overview

This PR addresses a memory leak in the UniversalPubSub (UPS) system by replacing RwLock<HashMap> with lock-free scc::HashMap data structures and implementing proper cleanup mechanisms. The changes also add comprehensive metrics for monitoring subscriber counts.


✅ Strengths

  1. Root Cause Analysis: The PR correctly identifies and fixes the memory leak issue - subscribers weren't being cleaned up properly.

  2. Lock-Free Concurrency: Migration to scc::HashMap eliminates lock contention and potential deadlocks, which is excellent for a high-throughput pubsub system.

  3. Comprehensive Metrics: Metrics tracking subscriber counts at multiple layers (local, reply, memory driver, postgres driver) aid debugging and monitoring.

  4. Proper Cleanup in Drop: The Subscriber::Drop implementation (pubsub.rs:365-381) ensures immediate cleanup when subscribers are dropped.

  5. GC Task Pattern: Both the memory driver and the pubsub layer now run periodic GC tasks that hold weak references, so the GC task doesn't keep the driver alive.


🔍 Potential Issues & Suggestions

Critical: Race Condition in Subscriber::Drop

Location: engine/packages/universalpubsub/src/pubsub.rs:365-381

The Drop implementation spawns an async task without ensuring it completes. If a subscriber is dropped and immediately recreated for the same subject, the cleanup task might remove the new subscriber's entry.

Scenario:

  1. Subscriber A is dropped → spawns cleanup task
  2. Subscriber B subscribes to same subject → creates new entry
  3. Cleanup task from A wakes up, sees receiver_count==0, removes entry
  4. Subscriber B never receives messages

Recommendation:

  • The GC task already handles this cleanup safely every 60 seconds
  • Consider removing the Drop cleanup entirely, OR
  • Add a generation counter/timestamp to entries so a stale cleanup task cannot remove a newer entry (see the sketch below)
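
A minimal sketch of the generation-counter idea. All names here (SubEntry, NEXT_GEN, the field layout) are hypothetical, not the PR's actual types, and the real entry would hold channel senders rather than a bare count:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical entry type; the real one in this PR holds broadcast senders.
struct SubEntry {
    generation: u64,
    receiver_count: usize,
}

static NEXT_GEN: AtomicU64 = AtomicU64::new(0);

fn subscribe(map: &scc::HashMap<String, SubEntry>, subject: &str) -> u64 {
    let gen = NEXT_GEN.fetch_add(1, Ordering::Relaxed);
    // Re-subscribing bumps the generation, so a stale cleanup task spawned by
    // a previously dropped subscriber no longer matches the entry.
    map.entry(subject.to_string())
        .and_modify(|e| {
            e.generation = gen;
            e.receiver_count += 1;
        })
        .or_insert(SubEntry { generation: gen, receiver_count: 1 });
    gen
}

fn drop_cleanup(map: &scc::HashMap<String, SubEntry>, subject: &str, gen: u64) {
    // Only remove the entry if it is still the generation this subscriber
    // created; a newer subscriber's entry is left untouched.
    map.remove_if(subject, |e| e.generation == gen && e.receiver_count == 0);
}
```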

Performance: Redundant Metric Updates

Location: Multiple files

Metrics are updated on every HashMap operation, and .len() on scc::HashMap iterates all entries (an O(n) operation). Calling it this frequently could impact performance.

Recommendation:

  • Use atomic counters, incrementing/decrementing on add/remove (sketched below)
  • OR let the GC task update metrics periodically instead of on every operation
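
A sketch of the atomic-counter approach; the counter name and the reporting hook are assumptions, not existing code in this crate:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Hypothetical gauge backing store for the local-subscriber metric.
static LOCAL_SUBSCRIBER_COUNT: AtomicI64 = AtomicI64::new(0);

fn on_subscribe() {
    // O(1) bump instead of an O(n) scc::HashMap::len() scan per operation.
    LOCAL_SUBSCRIBER_COUNT.fetch_add(1, Ordering::Relaxed);
}

fn on_unsubscribe() {
    LOCAL_SUBSCRIBER_COUNT.fetch_sub(1, Ordering::Relaxed);
}

fn report_subscriber_gauge() {
    // Called from the periodic GC task instead of on every map operation.
    let count = LOCAL_SUBSCRIBER_COUNT.load(Ordering::Relaxed);
    // e.g. set the project's gauge to `count` here.
    let _ = count;
}
```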

Code Quality: Inconsistent Error Handling

Location: engine/packages/universalpubsub/src/driver/memory/mod.rs:100-101

Send failures are silently ignored. While closed channels are cleaned up by GC, this makes debugging difficult.

Recommendation: Add debug logging for send failures, for example:
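
A possible shape, assuming a tokio broadcast sender and the project's structured tracing conventions (the function and field names are illustrative):

```rust
use tokio::sync::broadcast;

// Illustrative only; the real driver sends into entries stored in scc::HashMap.
fn publish_local(tx: &broadcast::Sender<Vec<u8>>, subject: &str, message: &[u8]) {
    if let Err(err) = tx.send(message.to_vec()) {
        // A send error here just means every receiver is gone; log at debug so
        // misrouted subjects stay diagnosable without adding log spam.
        tracing::debug!(subject, ?err, "dropping message for closed subscriber");
    }
}
```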


Postgres Driver: Potential Memory Leak in Cleanup Task

Location: engine/packages/universalpubsub/src/driver/postgres/mod.rs:356-371

If receiver_count() > 0 when the token is cancelled, this task exits without removing the subscription entry. The entry remains forever.

Scenario:

  1. Multiple subscribers to same subject created
  2. All drop their guards (cancelling the token)
  3. Cleanup task runs but receiver_count() > 0
  4. Task exits, entry never gets cleaned up

Recommendation: Restructure so cleanup is guaranteed to happen when receiver_count reaches 0; one possible shape is sketched below.
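
One way to restructure it, sketched with hypothetical names (SubscriptionGuard and the entry layout are not the PR's actual types): do the check in each guard's Drop, so whichever subscriber drops last removes the entry instead of a single task exiting early.

```rust
use std::sync::Arc;
use tokio::sync::broadcast;

struct SubscriptionEntry {
    tx: broadcast::Sender<Vec<u8>>,
}

struct SubscriptionGuard {
    subject: String,
    subscriptions: Arc<scc::HashMap<String, SubscriptionEntry>>,
}

impl Drop for SubscriptionGuard {
    fn drop(&mut self) {
        // Assumes the broadcast::Receiver is dropped before this guard, so
        // receiver_count() already excludes this subscriber; the last guard
        // then sees 0 and removes the entry instead of leaving it forever.
        self.subscriptions
            .remove_if(&self.subject, |e| e.tx.receiver_count() == 0);
    }
}
```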


Documentation: Missing Safety Comments

The weak reference pattern in the GC tasks is critical for preventing memory leaks but isn't documented. Add comments explaining why weak references are needed, for example:
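
A sketch of what that comment could look like; DriverInner and the GC body are illustrative stand-ins for the real driver:

```rust
use std::sync::{Arc, Weak};
use std::time::Duration;

struct DriverInner {
    subscriptions: scc::HashMap<String, ()>,
}

fn spawn_gc(inner: &Arc<DriverInner>) {
    // Hold only a Weak here: if the GC task held an Arc, the driver could
    // never be dropped, and the task (plus everything the driver owns) would
    // leak for the lifetime of the runtime.
    let weak: Weak<DriverInner> = Arc::downgrade(inner);
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(Duration::from_secs(60));
        loop {
            interval.tick().await;
            // Upgrade fails once the last strong reference is gone; exiting
            // here is what lets the driver (and this task) shut down.
            let Some(inner) = weak.upgrade() else { break };
            // Placeholder for the real sweep of closed subscribers.
            inner.subscriptions.retain(|_, _| true);
        }
    });
}
```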


Testing: No Test Coverage

The PR doesn't include tests for the memory leak fix.

Recommendation: Add tests that verify:

  1. Subscribers are cleaned up when dropped
  2. GC tasks remove closed subscribers
  3. Metrics accurately reflect subscriber counts
  4. No memory leaks under rapid subscribe/unsubscribe patterns (see the test sketch below)
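
A minimal regression test for item 4 might look like this; PubSub::new_memory, subscribe, and local_subscriber_count are hypothetical stand-ins for the crate's real API:

```rust
#[tokio::test]
async fn rapid_subscribe_unsubscribe_does_not_leak() {
    let pubsub = PubSub::new_memory();
    // Churn many short-lived subscribers across a handful of subjects.
    for i in 0..10_000 {
        let sub = pubsub.subscribe(&format!("subject-{}", i % 8)).await.unwrap();
        drop(sub);
    }
    // Give the spawned Drop-cleanup tasks and the GC a chance to run.
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;
    assert_eq!(pubsub.local_subscriber_count(), 0);
}
```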

Minor: UniversalDB Metrics Rename

The rename from udb_ping_duration to udb_fdb_ping_duration is a breaking change for monitoring/alerting. Document this in the PR description or release notes.


🔒 Security Considerations

No security concerns identified.


📊 Performance Considerations

Positive:

  • Lock-free HashMap eliminates lock contention
  • GC tasks prevent unbounded memory growth

Concerns:

  • Frequent .len() calls on HashMap
  • Memory overhead of spawning cleanup tasks in Drop

🎯 Overall Assessment

This is a solid fix for a critical memory leak issue. The migration to lock-free data structures is the right approach. However, address these issues before merging:

  1. Fix the Subscriber::Drop race condition (critical)
  2. Optimize metric updates (performance)
  3. Fix Postgres cleanup task (potential leak)
  4. Add tests (quality)

📋 Checklist

  • ✅ Code follows Rust best practices
  • ✅ Uses workspace dependencies correctly
  • ⚠️ Missing test coverage
  • ✅ Error handling follows project patterns (mostly)
  • ✅ Logging uses structured tracing
  • ⚠️ Race conditions present in Drop implementation
  • ✅ Metrics added for observability

Overall: Approve with requested changes - The core fix is sound, but address the race condition and consider the performance optimizations.

@pkg-pr-new

pkg-pr-new bot commented Jan 6, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3752

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3752

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3752

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3752

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3752

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3752

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3752

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3752

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3752

commit: 634f682

@MasterPtato force-pushed the 01-05-chore_fmt_fix_config_properties branch from cb04567 to 39100b1 on January 6, 2026 01:33
@MasterPtato force-pushed the 01-05-fix_ups_fix_memory_leak_with_local_subscribers branch from 634f682 to 271abff on January 6, 2026 01:33
@graphite-app
Contributor

graphite-app bot commented Jan 6, 2026

Merge activity

  • Jan 6, 1:36 AM UTC: MasterPtato added this pull request to the Graphite merge queue.
  • Jan 6, 1:37 AM UTC: CI is running for this pull request on a draft pull request (#3753) due to your merge queue CI optimization settings.
  • Jan 6, 1:37 AM UTC: Merged by the Graphite merge queue via draft PR: #3753.

@claude

claude bot commented Jan 6, 2026

PR Review: Fix Memory Leak with Local Subscribers (UPDATED)

This PR addresses a memory leak in universalpubsub by replacing std::collections::HashMap with scc::HashMap and implementing proper garbage collection.

Strengths

  • Root cause correctly identified: RwLock prevented cleanup
  • scc::HashMap is the right choice for lock-free async access
  • Comprehensive metrics at all layers (local, reply, memory driver, postgres)
  • Good test coverage for both memory optimization paths

Critical Issues

  1. MemoryDriver GC spawns per instance (memory/mod.rs:44) - each new() spawns a 60s GC task. Consider a singleton pattern, or document this.

  2. Drop handler spawns tasks (pubsub.rs:365) - cleanup tasks may not run during tokio shutdown. Add a comment explaining this (see the sketch after this list).

  3. Metrics temporarily inconsistent - scc::HashMap::len() may be inaccurate during concurrent ops. Probably acceptable, but worth noting.
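
For item 2, the suggested comment (plus a shutdown-safe guard) might look like this sketch; Subscriber, Entry, and the map layout are hypothetical, not the PR's actual types:

```rust
use std::sync::Arc;

struct Entry {
    receiver_count: usize,
}

struct Subscriber {
    subject: String,
    map: Arc<scc::HashMap<String, Entry>>,
}

impl Drop for Subscriber {
    fn drop(&mut self) {
        // tokio::spawn panics outside a runtime, and tasks spawned during
        // shutdown may never run; the periodic GC remains the fallback that
        // guarantees eventual cleanup in both cases.
        if let Ok(handle) = tokio::runtime::Handle::try_current() {
            let map = self.map.clone();
            let subject = std::mem::take(&mut self.subject);
            handle.spawn(async move {
                map.remove_if(&subject, |e| e.receiver_count == 0);
            });
        }
    }
}
```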

Code Quality

  1. Postgres client simplified (postgres/mod.rs:56) - removed unnecessary Arc layer, good!

  2. GC interval hardcoded at 60s - consider making configurable or document rationale

  3. Race in drop cleanup (pubsub.rs:372) - checking receiver_count() and then removing leaves a window, but the GC handles it

Recommendations

  • Add memory leak regression test (create/drop many subs, verify cleanup)
  • Document GC strategy and interval choice
  • Add comment in Drop impl about shutdown behavior
  • Consider concurrent stress test

Impact

  • Risk: Medium (core functionality, lock-free concurrency)
  • Performance: Positive (removes RwLock contention)
  • Breaking: None (internal only)

Verdict

Approve with minor suggestions - Solid fix with appropriate solution. Metrics will help monitor in prod.

Great work fixing this leak!

graphite-app bot closed this on Jan 6, 2026
graphite-app bot deleted the 01-05-fix_ups_fix_memory_leak_with_local_subscribers branch on January 6, 2026 01:37