Skip to content

Conversation

@graphite-app
Copy link
Contributor

@graphite-app graphite-app bot commented Jan 6, 2026

This draft PR was created by the Graphite merge queue.
Trunk will be fast forwarded to the HEAD of this PR when CI passes, and the original PRs will be closed.

The following PRs are included in this draft PR:

@vercel
Copy link

vercel bot commented Jan 6, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
rivetkit-serverless Error Error Jan 6, 2026 1:37am
3 Skipped Deployments
Project Deployment Review Updated (UTC)
rivet-cloud Ignored Ignored Jan 6, 2026 1:37am
rivet-inspector Ignored Ignored Preview Jan 6, 2026 1:37am
rivet-site Ignored Ignored Preview Jan 6, 2026 1:37am

@graphite-app graphite-app bot closed this Jan 6, 2026
@graphite-app graphite-app bot deleted the gtmq_spec_f90959_1767663418335-7777de91-6ea1-4599-9770-4ec0b8630b4f branch January 6, 2026 01:37
@claude
Copy link

claude bot commented Jan 6, 2026

Pull Request Review

This is a Graphite merge queue PR combining 7 individual PRs (#3655, #3656, #3657, #3660, #3721, #3749, #3752). Here's my comprehensive review:

Overall Assessment

✅ Generally Good: The PR includes important fixes (memory leak, metrics revert) and infrastructure improvements (dashboard updates, config cleanup). The memory leak fix in PR #3752 is particularly critical.


Critical Issues

1. Memory Leak Fix (#3752) - Critical But Well Implemented

File: engine/packages/universalpubsub/src/pubsub.rs, src/driver/memory/mod.rs, src/driver/postgres/mod.rs

Positive Points:

  • ✅ Switched from RwLock<HashMap> to lock-free scc::HashMap - excellent performance improvement
  • ✅ Added proper GC intervals for cleanup (60s)
  • ✅ Added metrics tracking for subscriber counts
  • ✅ Implemented proper Drop trait for immediate cleanup

Concerns:

  • ⚠️ The Drop implementation spawns a tokio task without checking if runtime is available:

    impl Drop for Subscriber {
        fn drop(&mut self) {
            if let Some(subject) = &self.subject {
                let pubsub = self.pubsub.clone();
                let subject = subject.clone();
                tokio::spawn(async move { // <- Could panic if runtime is shutting down

    Recommendation: Use `tokio::runtime::Handle::try_current()" to check if runtime is available before spawning.

  • ⚠️ Weak reference pattern in GC task is good, but consider logging when GC starts/stops for debugging

  • ⚠️ retain_async operations could be expensive under high load - consider rate limiting or throttling


Code Quality Issues

2. Inconsistent Error Handling Pattern

File: engine/packages/guard-core/tests/streaming_response.rs:256

} else {
    use rivet_guard_core::proxy_service::StructuredResponse;
    Ok(RoutingOutput::Response(StructuredResponse {
        status: StatusCode::NOT_FOUND,
        message: std::borrow::Cow::Borrowed("Not found"),
        docs: None,
    }))
}

Changed to:

}

bail!("bad path");

Issue: This change removes structured error response in favor of generic bail!. This is inconsistent with the API pattern and provides less useful error information.

Recommendation: Keep the structured response or at least use a descriptive error message that matches the previous "Not found" semantics.


3. Metrics Naming Convention Change (#3660)

File: engine/packages/api-builder/src/metrics.rs, various dashboard files

Changes:

  • Renamed error_code label to error across metrics
  • Renamed UDB metrics from udb_ping_* to udb_fdb_ping_*

Concerns:

  • ⚠️ Breaking change for existing monitoring/alerting rules that rely on error_code label
  • ⚠️ No deprecation period or migration guide provided
  • ❓ The UDB rename suggests these metrics are specific to FoundationDB, but the dashboard for FoundationDB was deleted

Recommendations:

  1. Document this as a breaking change in release notes
  2. Provide migration guide for teams updating dashboards/alerts
  3. Consider keeping both labels temporarily with deprecation warning

4. Cache Metric Calculation Fix (#3660)

File: engine/docker/template/grafana-dashboards/cache.json

Old (incorrect):

sum(irate(rivet_cache_value_miss_total{...}[]) / rate(rivet_cache_value_total{...}[])) by (key)

New (correct):

sum by (key) (rate(rivet_cache_value_miss_total{...}[])) / sum by (key) (rate(rivet_cache_value_total{...}[]))

Assessment: ✅ Good fix! The old query was aggregating ratios which is mathematically incorrect. The new version properly calculates the ratio of aggregated values.


Performance Considerations

5. Lock-Free Data Structures

The switch to scc::HashMap is excellent:

  • Removes lock contention
  • Better scalability under concurrent access
  • Async-friendly with entry_async, retain_async methods

Benchmark suggestion: Consider load testing the pubsub system to verify memory leak is fixed and performance improved.


6. Postgres Driver Connection Handling

File: engine/packages/universalpubsub/src/driver/postgres/mod.rs

Changes:

  • Changed from Arc<tokio_postgres::Client> to just tokio_postgres::Client in mutex
  • Simplified cloning logic

Assessment: ✅ Good simplification, but verify:

  • Connection recovery works correctly under network failures
  • Re-subscription logic is race-free during reconnection
  • Multiple concurrent subscribe calls during reconnection don't cause issues

Testing recommendation: Add integration tests for connection recovery scenarios.


Security Concerns

7. No Security Issues Identified

  • No SQL injection, XSS, or command injection risks introduced
  • Proper use of parameterized queries maintained
  • No secrets exposed in code

Test Coverage

8. Missing Tests ⚠️

Critical gaps:

  1. No tests for the memory leak fix verification
  2. No tests for Drop behavior in Subscriber
  3. No tests for GC cleanup logic
  4. No tests for connection recovery in Postgres driver

Recommendations:

// Add tests like:
#[tokio::test]
async fn test_subscriber_cleanup_on_drop() {
    // Verify local_subscribers map is cleaned up when subscriber drops
}

#[tokio::test]
async fn test_gc_removes_closed_senders() {
    // Verify GC properly removes closed senders
}

#[tokio::test]
async fn test_postgres_reconnection_resubscribes() {
    // Verify LISTEN commands are re-executed after reconnection
}

Documentation

9. Good Addition

File: docs/engine/MEMORY_DEBUGGING.md

Excellent addition for debugging memory issues with heaptrack. Clear, actionable instructions.

Minor suggestion: Add a troubleshooting section for common issues.


10. CLAUDE.md Update

File: CLAUDE.md

Good addition about not documenting deltas in comments. This aligns with clean code principles.


Infrastructure Changes

11. Docker Configuration Updates

The Prometheus and OTEL collector configuration changes look good:

  • Removed redundant otel-collector-client services
  • Consolidated metric collection
  • Updated Grafana datasources

Verification needed: Ensure all services can still report metrics correctly after consolidation.


Style & Conventions

12. Dependency Management

File: engine/packages/universalpubsub/Cargo.toml

Dependencies properly alphabetized and using workspace versions. Follows CLAUDE.md guidelines.


Recommendations Summary

Before Merging:

  1. ⚠️ Fix the tokio::spawn in Drop to handle runtime shutdown gracefully
  2. ⚠️ Document breaking changes in metrics labels (error_codeerror)
  3. ⚠️ Add tests for memory leak fix verification
  4. ⚠️ Reconsider the bail!("bad path") change in test code

After Merging:

  1. Monitor memory usage to verify leak is fixed
  2. Update any existing alerting rules for renamed metrics
  3. Load test the pubsub system
  4. Add integration tests for edge cases

Conclusion

This is a solid PR with important fixes, especially the memory leak resolution. The switch to lock-free data structures and improved GC logic should significantly improve performance. However, address the Drop implementation concern and add tests before merging.

Overall Rating: 7.5/10
Recommendation: Approve with minor fixes requested


Files Requiring Attention

  1. engine/packages/universalpubsub/src/pubsub.rs:365-379 - Drop implementation
  2. engine/packages/guard-core/tests/streaming_response.rs:256 - Error handling regression
  3. Add test files for memory leak verification

Let me know if you'd like me to create issues for any of these items!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants