ENG-506 Add performance metrics tracking for key operations #3857
Conversation
- Introduced a new system to monitor various operations within the Keep Core node, including wallet actions, DKG processes, signing operations, coordination procedures, and network activities.
- Metrics are recorded through a new interface, allowing for optional integration without impacting performance when disabled (a rough sketch of this pattern follows below).
- Updated relevant components to wire in metrics recording, ensuring comprehensive coverage of critical operations.
- Added documentation detailing the implemented metrics and their usage.

This enhancement provides better visibility into node performance and health, facilitating monitoring and troubleshooting.
- Introduced performance metrics for the deposit and redemption process, including execution and proof submission metrics.
- Updated the .gitignore file to exclude new directories: data/, logs/, and storage/.
- Enhanced existing code to wire in metrics recording for redemption actions, improving visibility into redemption performance and potential bottlenecks.
- Added documentation outlining the new metrics and their implementation details.
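A minimal sketch of the optional-recorder pattern described above, assuming a nil-checked interface value; the interface and method names here are hypothetical, not the PR's actual API:

```go
package main

import (
	"fmt"
	"time"
)

// MetricsRecorder is a hypothetical stand-in for the recorder interface
// introduced by this PR; the real interface and method names may differ.
type MetricsRecorder interface {
	RecordWalletActionDuration(action string, duration time.Duration)
}

// walletDispatcher holds an optional recorder; a nil value means metrics
// are disabled and recording is skipped entirely.
type walletDispatcher struct {
	metricsRecorder MetricsRecorder
}

func (wd *walletDispatcher) executeAction(action string, run func() error) error {
	start := time.Now()
	err := run()

	// Optional integration: record only when a recorder was wired in.
	if wd.metricsRecorder != nil {
		wd.metricsRecorder.RecordWalletActionDuration(action, time.Since(start))
	}
	return err
}

// stdoutRecorder is a trivial recorder used only for this example.
type stdoutRecorder struct{}

func (stdoutRecorder) RecordWalletActionDuration(action string, d time.Duration) {
	fmt.Printf("wallet action %q took %s\n", action, d)
}

func main() {
	wd := &walletDispatcher{metricsRecorder: stdoutRecorder{}}
	_ = wd.executeAction("redemption", func() error {
		time.Sleep(50 * time.Millisecond)
		return nil
	})
}
```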
jose-blockchain
left a comment
Updated recommendations:
- Fix the deadlock in wallet.go before merge - this will freeze the node if triggered; it is confirmed.
- Add context cancellation to `monitorQueueSizes` - minor resource leak, not urgent but good to fix.
- Document that the metrics endpoint should be firewalled - standard practice, just worth noting in the docs.

the code doesn't introduce direct vulnerabilities like injection or auth bypass. The metrics are useful operational data that node operators need. Just ensure port 9601 isn't exposed publicly (standard practice for any metrics endpoint).
```go
pm.registerAllMetrics()

// Register gauge observers for all gauges
go pm.observeGauges()
```
just for clarity, this starts a goroutine that calls observeGauges() which is essentially empty (line 1077-1080). might want to either remove the goroutine or add a TODO comment explaining future plans for it?
pkg/clientinfo/rpc_health.go (Outdated)

```go
// Configuration
checkInterval time.Duration
timeout       time.Duration
```
just a minor comment, nice that you have a timeout field configured, but it doesn't seem to be used anywhere in the actual health checks. was this intended for wrapping the RPC calls with context timeout? might be worth adding or removing if not needed.
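A rough sketch of what using the `timeout` field could look like, with a function-valued stand-in for the real RPC call; the type and field names mirror the snippet above but are illustrative, not the PR's actual code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// rpcHealthChecker mirrors the fields from the snippet above; checkEth is a
// function-valued stand-in for the real Ethereum RPC health call.
type rpcHealthChecker struct {
	checkInterval time.Duration
	timeout       time.Duration
	checkEth      func(ctx context.Context) error
}

// checkEthHealth bounds the RPC call with the configured timeout so a hung
// endpoint cannot stall the health-check loop.
func (r *rpcHealthChecker) checkEthHealth(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, r.timeout)
	defer cancel()
	return r.checkEth(ctx)
}

func main() {
	checker := &rpcHealthChecker{
		checkInterval: 30 * time.Second,
		timeout:       5 * time.Second,
		checkEth: func(ctx context.Context) error {
			select {
			case <-time.After(10 * time.Second): // simulate a slow endpoint
				return nil
			case <-ctx.Done():
				return errors.New("eth RPC health check timed out")
			}
		},
	}
	fmt.Println(checker.checkEthHealth(context.Background()))
}
```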
```go
}

// monitorQueueSizes periodically records queue sizes as metrics.
func (c *channel) monitorQueueSizes(recorder interface {
```
potential suggestion: the monitorQueueSizes function creates its own context but there's no way to stop it when the channel is closed. it'll keep running forever once started. maybe consider passing in a context from the channel or using a done channel?
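One possible shape for that change, as a self-contained sketch; the recorder interface, field names, and tick interval are placeholders rather than the PR's actual signatures:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// queueSizeRecorder and logRecorder are stand-ins for the PR's types.
type queueSizeRecorder interface {
	RecordQueueSize(name string, size int)
}

type logRecorder struct{}

func (logRecorder) RecordQueueSize(name string, size int) {
	fmt.Printf("%s: %d\n", name, size)
}

type channel struct {
	incoming chan []byte
}

// monitorQueueSizes stops when ctx is cancelled, e.g. when the channel is
// closed or the node shuts down, instead of running forever.
func (c *channel) monitorQueueSizes(ctx context.Context, recorder queueSizeRecorder) {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			recorder.RecordQueueSize("incoming_message_queue", len(c.incoming))
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	c := &channel{incoming: make(chan []byte, 4096)}
	c.monitorQueueSizes(ctx, logRecorder{})
}
```

A done channel closed alongside the broadcast channel would work just as well; the key point is that the loop has some exit signal.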
```go
	p.broadcastChannelManager.setMetricsRecorder(recorder)
}
// Update notifiee with metrics recorder
p.host.Network().Notify(buildNotifiee(p.host, recorder))
```
looks like buildNotifiee gets called twice... once at connection time with nil metrics, and again in SetMetricsRecorder. the second call adds a new notifiee but doesn't remove the first one. this should work fine but you'll have two notifiees registered. just flagging in case that wasn't intentional.
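If swapping rather than stacking notifiees is preferred, one possible shape is below; `provider`, `MetricsRecorder`, and the `notifiee` field are illustrative additions, and only `Notify`/`StopNotify` come from go-libp2p's `network.Network` interface:

```go
package libp2pnet // hypothetical package name used only for this sketch

import (
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/network"
)

// MetricsRecorder stands in for the PR's recorder interface.
type MetricsRecorder interface{}

type provider struct {
	host     host.Host
	notifiee network.Notifiee
}

// buildNotifiee stands in for the existing helper; it would record peer
// connect/disconnect events through the recorder when one is set.
func buildNotifiee(h host.Host, recorder MetricsRecorder) network.Notifiee {
	return &network.NotifyBundle{
		// ConnectedF / DisconnectedF would call into the recorder here.
	}
}

// SetMetricsRecorder swaps the registered notifiee instead of stacking a
// second one next to the nil-recorder notifiee added at connection time.
func (p *provider) SetMetricsRecorder(recorder MetricsRecorder) {
	if p.notifiee != nil {
		p.host.Network().StopNotify(p.notifiee)
	}
	p.notifiee = buildNotifiee(p.host, recorder)
	p.host.Network().Notify(p.notifiee)
}
```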
```go
// Update metrics
if wd.metricsRecorder != nil {
	wd.actionsMutex.Lock()
```
the mutex wd.actionsMutex is already held from other lock calls. calling Lock() again here will deadlock since Go mutexes aren't reentrant.
suggest removing the lock/unlock and just using len(wd.actions) directly, if this makes sense (maybe not)
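A small, self-contained illustration of the problem and the suggested fix; the `dispatcher` type stands in for `wd` in the snippet above and is not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// Go's sync.Mutex is not reentrant, so re-acquiring a lock the goroutine
// already holds blocks forever.
type dispatcher struct {
	actionsMutex sync.Mutex
	actions      map[string]struct{}
}

func (d *dispatcher) dispatch() {
	d.actionsMutex.Lock()
	defer d.actionsMutex.Unlock()

	// ... action bookkeeping happens here while the mutex is held ...

	// Deadlock variant (do not do this): calling d.actionsMutex.Lock() again
	// here would block forever because this goroutine already holds the lock.

	// Suggested fix: read the map length directly; the lock is already held.
	active := len(d.actions)
	fmt.Println("active actions:", active)
}

func main() {
	d := &dispatcher{actions: map[string]struct{}{"redemption": {}}}
	d.dispatch()
}
```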
```go
r.ethLastCheck = startTime
r.ethLastError = err
r.ethMutex.Unlock()
rpcHealthLogger.Warnf(
```
Error messages and RPC response times are logged and exposed as metrics:
- `rpc_eth_health_status` - reveals if the Ethereum RPC is down
- `rpc_btc_health_status` - reveals if the Bitcoin RPC is down
- `rpc_eth_response_time_seconds` - reveals RPC latency

an attacker monitoring these metrics knows:
- when to attack (the RPC is slow/degraded)
- which backend service to target
- when their DoS attack on the RPC is succeeding

not sure this is expected
- Updated the performance metrics initialization to accept an existing instance, preventing duplicate registrations.
- Improved error handling in the metrics observer to log duplicate registrations at the debug level instead of as warnings.
- Added a method to periodically observe gauge metrics, ensuring better monitoring capabilities.
The Keep Core node now exposes 31+ performance metrics via the `/metrics` endpoint (port 9601). These metrics provide comprehensive visibility into node operations, network health, and system performance.
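As a rough illustration of consuming this endpoint programmatically, the sketch below scrapes a locally running node on the default port above and computes a signing success rate from two of the counters listed further down; the parsing library is the generic Prometheus text-format parser, not part of this PR:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Fetch the metrics exposition from a locally running node.
	resp, err := http.Get("http://localhost:9601/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse the Prometheus text format into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	// Helper that reads the first sample of a counter, or 0 if absent.
	counter := func(name string) float64 {
		if mf, ok := families[name]; ok && len(mf.GetMetric()) > 0 {
			return mf.GetMetric()[0].GetCounter().GetValue()
		}
		return 0
	}

	total := counter("performance_signing_operations_total")
	success := counter("performance_signing_success_total")
	if total > 0 {
		fmt.Printf("signing success rate: %.2f%%\n", 100*success/total)
	}
}
```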
Integrated Metrics by Category

1. DKG (Distributed Key Generation) Metrics (6 metrics)

Counters:
- `performance_dkg_joined_total` - Total number of DKG joins (members joined)
- `performance_dkg_failed_total` - Total number of failed DKG executions
- `performance_dkg_validation_total` - Total number of DKG result validations performed
- `performance_dkg_challenges_submitted_total` - Total number of DKG challenges submitted on-chain
- `performance_dkg_approvals_submitted_total` - Total number of DKG approvals submitted on-chain

Duration Metrics:
- `performance_dkg_duration_seconds` - Average duration of DKG operations
- `performance_dkg_duration_seconds_count` - Total count of DKG operations

Performance Insights:
- `dkg_joined_total / (dkg_joined_total + dkg_failed_total)` - Monitor DKG participation and success rates
- Alert if `dkg_duration_seconds` exceeds 300 seconds (5 minutes) - indicates slow DKG operations
- Track `dkg_challenges_submitted_total` and `dkg_approvals_submitted_total` to monitor dispute resolution activity
- High `dkg_validation_total` relative to joins indicates active validation of DKG results

2. Signing Operations Metrics (5 metrics)
Counters:
- `performance_signing_operations_total` - Total number of signing operations attempted
- `performance_signing_success_total` - Total number of successful signing operations
- `performance_signing_failed_total` - Total number of failed signing operations
- `performance_signing_timeouts_total` - Total number of signing operations that timed out

Duration Metrics:
- `performance_signing_duration_seconds` - Average duration of signing operations
- `performance_signing_duration_seconds_count` - Total count of signing operations

Performance Insights:
- `signing_success_total / signing_operations_total` - Critical metric for node reliability
- Alert if the `signing_failed_total` rate exceeds 10% of total operations
- `signing_timeouts_total / signing_operations_total` - Indicates network or coordination issues
- Alert if `signing_duration_seconds` exceeds 60 seconds - indicates slow signing operations
- Track the `signing_operations_total` rate to understand signing workload

3. Wallet Dispatcher Metrics (6 metrics)
Counters:
- `performance_wallet_actions_total` - Total number of wallet actions dispatched
- `performance_wallet_action_success_total` - Total number of successfully completed wallet actions
- `performance_wallet_action_failed_total` - Total number of failed wallet actions
- `performance_wallet_dispatcher_rejected_total` - Total number of wallet actions rejected (wallet busy)
- `performance_wallet_heartbeat_failures_total` - Total number of wallet heartbeat failures

Gauges:
- `performance_wallet_dispatcher_active_actions` - Current number of wallets with active actions

Duration Metrics:
- `performance_wallet_action_duration_seconds` - Average duration of wallet actions
- `performance_wallet_action_duration_seconds_count` - Total count of wallet actions

Performance Insights:
- `wallet_dispatcher_rejected_total / wallet_actions_total` - Alert if > 5%; indicates wallet saturation
- `wallet_action_success_total / wallet_actions_total` - Monitor wallet action reliability
- `wallet_dispatcher_active_actions` shows the current wallet workload
- `wallet_heartbeat_failures_total` indicates wallet connectivity issues

4. Coordination Operations Metrics (4 metrics)
Counters:
- `performance_coordination_windows_detected_total` - Total number of coordination windows detected
- `performance_coordination_procedures_executed_total` - Total number of coordination procedures executed successfully
- `performance_coordination_failed_total` - Total number of failed coordination procedures

Duration Metrics:
- `performance_coordination_duration_seconds` - Average duration of coordination procedures
- `performance_coordination_duration_seconds_count` - Total count of coordination procedures

Performance Insights:
- `coordination_procedures_executed_total / coordination_windows_detected_total` - Success rate of coordination
- Alert if the `coordination_failed_total` rate exceeds 5% of detected windows
- Track `coordination_windows_detected_total` to understand coordination frequency
- Track `coordination_duration_seconds` to identify slow coordination operations

5. Network Operations Metrics (10 metrics)
Peer Connection Metrics:
- `performance_peer_connections_total` - Total number of peer connections established
- `performance_peer_disconnections_total` - Total number of peer disconnections

Message Metrics:
- `performance_message_broadcast_total` - Total number of messages broadcast to the network
- `performance_message_received_total` - Total number of messages received from the network

Queue Size Metrics (Gauges):
- `performance_incoming_message_queue_size` - Current size of the incoming message queue (with `channel` label)
- `performance_message_handler_queue_size` - Current size of message handler queues (with `channel` and `handler` labels)

Ping Test Metrics:
- `performance_ping_test_total` - Total number of ping tests performed
- `performance_ping_test_success_total` - Total number of successful ping tests
- `performance_ping_test_failed_total` - Total number of failed ping tests
- `performance_ping_test_duration_seconds` - Average duration of ping tests
- `performance_ping_test_duration_seconds_count` - Total count of ping tests

Performance Insights:
- `peer_connections_total` vs `peer_disconnections_total` - Monitor connection stability
- Track `message_broadcast_total` and `message_received_total` rates
- Alert if `incoming_message_queue_size` > 3000 (75% of the 4096 capacity) - indicates a message processing bottleneck
- Alert if `message_handler_queue_size` > 400 (75% of the 512 capacity) - indicates handler saturation
- `ping_test_duration_seconds` shows network round-trip time
- Alert if the `ping_test_failed_total` rate exceeds 10% of ping tests - indicates network issues
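For context on the labeled queue-size gauges listed above, here is a generic client_golang sketch of a gauge with a `channel` label exposed on the port mentioned earlier; the PR records these values through its own recorder interface, so this is illustrative only and the label value is hypothetical:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// A labeled gauge matching the metric name documented above.
	queueSize := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "performance_incoming_message_queue_size",
			Help: "Current size of incoming message queue",
		},
		[]string{"channel"},
	)
	prometheus.MustRegister(queueSize)

	// Record the current queue depth for one broadcast channel
	// ("tbtc-coordination" is a hypothetical channel name).
	queueSize.WithLabelValues("tbtc-coordination").Set(42)

	// Expose the metrics endpoint.
	http.Handle("/metrics", promhttp.Handler())
	fmt.Println("serving metrics on :9601/metrics")
	_ = http.ListenAndServe(":9601", nil)
}
```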