[Graphite MQ] Draft PR GROUP:spec_f90959 (PRs 3655, 3656, 3657, 3660, 3721, 3749, 3752) #3753

graphite-app · 2026-01-06T01:37:00Z

This draft PR was created by the Graphite merge queue.
Trunk will be fast forwarded to the HEAD of this PR when CI passes, and the original PRs will be closed.

The following PRs are included in this draft PR:

…3721)

vercel · 2026-01-06T01:37:05Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Review	Updated (UTC)
rivetkit-serverless	Error		Jan 6, 2026 1:37am

3 Skipped Deployments

Project	Deployment	Review	Updated (UTC)
rivet-cloud	Ignored		Jan 6, 2026 1:37am
rivet-inspector	Ignored	Preview	Jan 6, 2026 1:37am
rivet-site	Ignored	Preview	Jan 6, 2026 1:37am

claude · 2026-01-06T01:38:57Z

Pull Request Review

This is a Graphite merge queue PR combining 7 individual PRs (#3655, #3656, #3657, #3660, #3721, #3749, #3752). Here's my comprehensive review:

Overall Assessment

✅ Generally Good: The PR includes important fixes (memory leak, metrics revert) and infrastructure improvements (dashboard updates, config cleanup). The memory leak fix in PR #3752 is particularly critical.

Critical Issues

1. Memory Leak Fix (#3752) - Critical But Well Implemented

File: engine/packages/universalpubsub/src/pubsub.rs, src/driver/memory/mod.rs, src/driver/postgres/mod.rs

Positive Points:

✅ Switched from RwLock<HashMap> to lock-free scc::HashMap - excellent performance improvement
✅ Added proper GC intervals for cleanup (60s)
✅ Added metrics tracking for subscriber counts
✅ Implemented proper Drop trait for immediate cleanup

Concerns:

⚠️ The Drop implementation spawns a tokio task without checking if runtime is available:

impl Drop for Subscriber {
    fn drop(&mut self) {
        if let Some(subject) = &self.subject {
            let pubsub = self.pubsub.clone();
            let subject = subject.clone();
            tokio::spawn(async move { // <- Could panic if runtime is shutting down

Recommendation: Use `tokio::runtime::Handle::try_current()" to check if runtime is available before spawning.

⚠️ Weak reference pattern in GC task is good, but consider logging when GC starts/stops for debugging
⚠️ retain_async operations could be expensive under high load - consider rate limiting or throttling

Code Quality Issues

2. Inconsistent Error Handling Pattern

File: engine/packages/guard-core/tests/streaming_response.rs:256

} else {
    use rivet_guard_core::proxy_service::StructuredResponse;
    Ok(RoutingOutput::Response(StructuredResponse {
        status: StatusCode::NOT_FOUND,
        message: std::borrow::Cow::Borrowed("Not found"),
        docs: None,
    }))
}

Changed to:

}

bail!("bad path");

Issue: This change removes structured error response in favor of generic bail!. This is inconsistent with the API pattern and provides less useful error information.

Recommendation: Keep the structured response or at least use a descriptive error message that matches the previous "Not found" semantics.

3. Metrics Naming Convention Change (#3660)

File: engine/packages/api-builder/src/metrics.rs, various dashboard files

Changes:

Renamed error_code label to error across metrics
Renamed UDB metrics from udb_ping_* to udb_fdb_ping_*

Concerns:

⚠️ Breaking change for existing monitoring/alerting rules that rely on error_code label
⚠️ No deprecation period or migration guide provided
❓ The UDB rename suggests these metrics are specific to FoundationDB, but the dashboard for FoundationDB was deleted

Recommendations:

Document this as a breaking change in release notes
Provide migration guide for teams updating dashboards/alerts
Consider keeping both labels temporarily with deprecation warning

4. Cache Metric Calculation Fix (#3660)

File: engine/docker/template/grafana-dashboards/cache.json

Old (incorrect):

sum(irate(rivet_cache_value_miss_total{...}[]) / rate(rivet_cache_value_total{...}[])) by (key)

New (correct):

sum by (key) (rate(rivet_cache_value_miss_total{...}[])) / sum by (key) (rate(rivet_cache_value_total{...}[]))

Assessment: ✅ Good fix! The old query was aggregating ratios which is mathematically incorrect. The new version properly calculates the ratio of aggregated values.

Performance Considerations

5. Lock-Free Data Structures ✅

The switch to scc::HashMap is excellent:

Removes lock contention
Better scalability under concurrent access
Async-friendly with entry_async, retain_async methods

Benchmark suggestion: Consider load testing the pubsub system to verify memory leak is fixed and performance improved.

6. Postgres Driver Connection Handling

File: engine/packages/universalpubsub/src/driver/postgres/mod.rs

Changes:

Changed from Arc<tokio_postgres::Client> to just tokio_postgres::Client in mutex
Simplified cloning logic

Assessment: ✅ Good simplification, but verify:

Connection recovery works correctly under network failures
Re-subscription logic is race-free during reconnection
Multiple concurrent subscribe calls during reconnection don't cause issues

Testing recommendation: Add integration tests for connection recovery scenarios.

Security Concerns

7. No Security Issues Identified ✅

No SQL injection, XSS, or command injection risks introduced
Proper use of parameterized queries maintained
No secrets exposed in code

Test Coverage

8. Missing Tests ⚠️

Critical gaps:

No tests for the memory leak fix verification
No tests for Drop behavior in Subscriber
No tests for GC cleanup logic
No tests for connection recovery in Postgres driver

Recommendations:

// Add tests like:
#[tokio::test]
async fn test_subscriber_cleanup_on_drop() {
    // Verify local_subscribers map is cleaned up when subscriber drops
}

#[tokio::test]
async fn test_gc_removes_closed_senders() {
    // Verify GC properly removes closed senders
}

#[tokio::test]
async fn test_postgres_reconnection_resubscribes() {
    // Verify LISTEN commands are re-executed after reconnection
}

Documentation

9. Good Addition ✅

File: docs/engine/MEMORY_DEBUGGING.md

Excellent addition for debugging memory issues with heaptrack. Clear, actionable instructions.

Minor suggestion: Add a troubleshooting section for common issues.

10. CLAUDE.md Update ✅

File: CLAUDE.md

Good addition about not documenting deltas in comments. This aligns with clean code principles.

Infrastructure Changes

11. Docker Configuration Updates ✅

The Prometheus and OTEL collector configuration changes look good:

Removed redundant otel-collector-client services
Consolidated metric collection
Updated Grafana datasources

Verification needed: Ensure all services can still report metrics correctly after consolidation.

Style & Conventions

12. Dependency Management ✅

File: engine/packages/universalpubsub/Cargo.toml

Dependencies properly alphabetized and using workspace versions. Follows CLAUDE.md guidelines.

Recommendations Summary

Before Merging:

⚠️ Fix the tokio::spawn in Drop to handle runtime shutdown gracefully
⚠️ Document breaking changes in metrics labels (error_code → error)
⚠️ Add tests for memory leak fix verification
⚠️ Reconsider the bail!("bad path") change in test code

After Merging:

Monitor memory usage to verify leak is fixed
Update any existing alerting rules for renamed metrics
Load test the pubsub system
Add integration tests for edge cases

Conclusion

This is a solid PR with important fixes, especially the memory leak resolution. The switch to lock-free data structures and improved GC logic should significantly improve performance. However, address the Drop implementation concern and add tests before merging.

Overall Rating: 7.5/10
Recommendation: Approve with minor fixes requested

Files Requiring Attention

engine/packages/universalpubsub/src/pubsub.rs:365-379 - Drop implementation
engine/packages/guard-core/tests/streaming_response.rs:256 - Error handling regression
Add test files for memory leak verification

Let me know if you'd like me to create issues for any of these items!

MasterPtato added 7 commits January 6, 2026 01:36

chore: update dashboards (#3655)

b894bf8

fix: revert otel metrics to prometheus (#3656)

8966908

fix: add wf and guard metrics (#3657)

6c03bd5

chore: clean up metrics (#3660)

615342d

fix: fix gateway token reading, rivetkit path token, vergen git sha (#…

6a73c66

…3721)

chore: fmt, fix config properties (#3749)

e0ebf34

fix(ups): fix memory leak with local subscribers (#3752)

787c723

graphite-app bot closed this Jan 6, 2026

graphite-app bot deleted the gtmq_spec_f90959_1767663418335-7777de91-6ea1-4599-9770-4ec0b8630b4f branch January 6, 2026 01:37

vercel bot had a problem deploying to Preview – rivetkit-serverless January 6, 2026 01:37 Failure

vercel bot had a problem deploying to Production – rivetkit-serverless January 6, 2026 01:38 Failure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Graphite MQ] Draft PR GROUP:spec_f90959 (PRs 3655, 3656, 3657, 3660, 3721, 3749, 3752) #3753

[Graphite MQ] Draft PR GROUP:spec_f90959 (PRs 3655, 3656, 3657, 3660, 3721, 3749, 3752) #3753

Uh oh!

graphite-app bot commented Jan 6, 2026

Uh oh!

vercel bot commented Jan 6, 2026 •

edited

Loading

Uh oh!

claude bot commented Jan 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[Graphite MQ] Draft PR GROUP:spec_f90959 (PRs 3655, 3656, 3657, 3660, 3721, 3749, 3752) #3753

[Graphite MQ] Draft PR GROUP:spec_f90959 (PRs 3655, 3656, 3657, 3660, 3721, 3749, 3752) #3753

Uh oh!

Conversation

graphite-app bot commented Jan 6, 2026

Uh oh!

vercel bot commented Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude bot commented Jan 6, 2026

Pull Request Review

Overall Assessment

Critical Issues

1. Memory Leak Fix (#3752) - Critical But Well Implemented

Code Quality Issues

2. Inconsistent Error Handling Pattern

3. Metrics Naming Convention Change (#3660)

4. Cache Metric Calculation Fix (#3660)

Performance Considerations

5. Lock-Free Data Structures ✅

6. Postgres Driver Connection Handling

Security Concerns

7. No Security Issues Identified ✅

Test Coverage

8. Missing Tests ⚠️

Documentation

9. Good Addition ✅

10. CLAUDE.md Update ✅

Infrastructure Changes

11. Docker Configuration Updates ✅

Style & Conventions

12. Dependency Management ✅

Recommendations Summary

Conclusion

Files Requiring Attention

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vercel bot commented Jan 6, 2026 •

edited

Loading