ENG-506 Add performance metrics tracking for key operations #3857
Conversation
- Introduced a new system to monitor various operations within the Keep Core node, including wallet actions, DKG processes, signing operations, coordination procedures, and network activities.
- Metrics are recorded through a new interface, allowing for optional integration without impacting performance when disabled (a rough sketch of this pattern follows below).
- Updated relevant components to wire in metrics recording, ensuring comprehensive coverage of critical operations.
- Added documentation detailing the implemented metrics and their usage.

This enhancement provides better visibility into node performance and health, facilitating monitoring and troubleshooting.
- Introduced performance metrics for the deposit and redemption process, including execution and proof submission metrics.
- Updated the .gitignore file to exclude new directories: data/, logs/, and storage/.
- Enhanced existing code to wire in metrics recording for redemption actions, improving visibility into redemption performance and potential bottlenecks.
- Added documentation outlining the new metrics and their implementation details.
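A minimal sketch of the optional-recorder pattern described above, assuming a nil-checked interface value; the interface and method names here are hypothetical, not the PR's actual API:

```go
package main

import (
	"fmt"
	"time"
)

// MetricsRecorder is a hypothetical stand-in for the recorder interface
// introduced by this PR; the real interface and method names may differ.
type MetricsRecorder interface {
	RecordWalletActionDuration(action string, duration time.Duration)
}

// walletDispatcher holds an optional recorder; a nil value means metrics
// are disabled and recording is skipped entirely.
type walletDispatcher struct {
	metricsRecorder MetricsRecorder
}

func (wd *walletDispatcher) executeAction(action string, run func() error) error {
	start := time.Now()
	err := run()

	// Optional integration: record only when a recorder was wired in.
	if wd.metricsRecorder != nil {
		wd.metricsRecorder.RecordWalletActionDuration(action, time.Since(start))
	}
	return err
}

// stdoutRecorder is a trivial recorder used only for this example.
type stdoutRecorder struct{}

func (stdoutRecorder) RecordWalletActionDuration(action string, d time.Duration) {
	fmt.Printf("wallet action %q took %s\n", action, d)
}

func main() {
	wd := &walletDispatcher{metricsRecorder: stdoutRecorder{}}
	_ = wd.executeAction("redemption", func() error {
		time.Sleep(50 * time.Millisecond)
		return nil
	})
}
```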
jose-blockchain
left a comment
Updated recommendations:
- Fix the deadlock in wallet.go before merge - this will freeze the node if triggered; it is confirmed.
- Add context cancellation to `monitorQueueSizes` - minor resource leak, not urgent but good to fix.
- Document that the metrics endpoint should be firewalled - standard practice, just worth noting in the docs.

the code doesn't introduce direct vulnerabilities like injection or auth bypass. The metrics are useful operational data that node operators need. Just ensure port 9601 isn't exposed publicly (standard practice for any metrics endpoint).
```go
pm.registerAllMetrics()

// Register gauge observers for all gauges
go pm.observeGauges()
```
just for clarity, this starts a goroutine that calls observeGauges() which is essentially empty (line 1077-1080). might want to either remove the goroutine or add a TODO comment explaining future plans for it?
pkg/clientinfo/rpc_health.go (Outdated)

```go
// Configuration
checkInterval time.Duration
timeout       time.Duration
```
just a minor comment, nice that you have a timeout field configured, but it doesn't seem to be used anywhere in the actual health checks. was this intended for wrapping the RPC calls with context timeout? might be worth adding or removing if not needed.
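A rough sketch of what using the `timeout` field could look like, with a function-valued stand-in for the real RPC call; the type and field names mirror the snippet above but are illustrative, not the PR's actual code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// rpcHealthChecker mirrors the fields from the snippet above; checkEth is a
// function-valued stand-in for the real Ethereum RPC health call.
type rpcHealthChecker struct {
	checkInterval time.Duration
	timeout       time.Duration
	checkEth      func(ctx context.Context) error
}

// checkEthHealth bounds the RPC call with the configured timeout so a hung
// endpoint cannot stall the health-check loop.
func (r *rpcHealthChecker) checkEthHealth(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, r.timeout)
	defer cancel()
	return r.checkEth(ctx)
}

func main() {
	checker := &rpcHealthChecker{
		checkInterval: 30 * time.Second,
		timeout:       5 * time.Second,
		checkEth: func(ctx context.Context) error {
			select {
			case <-time.After(10 * time.Second): // simulate a slow endpoint
				return nil
			case <-ctx.Done():
				return errors.New("eth RPC health check timed out")
			}
		},
	}
	fmt.Println(checker.checkEthHealth(context.Background()))
}
```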
```go
}

// monitorQueueSizes periodically records queue sizes as metrics.
func (c *channel) monitorQueueSizes(recorder interface {
```
potential suggestion: the monitorQueueSizes function creates its own context but there's no way to stop it when the channel is closed. it'll keep running forever once started. maybe consider passing in a context from the channel or using a done channel?
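One possible shape for that change, as a self-contained sketch; the recorder interface, field names, and tick interval are placeholders rather than the PR's actual signatures:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// queueSizeRecorder and logRecorder are stand-ins for the PR's types.
type queueSizeRecorder interface {
	RecordQueueSize(name string, size int)
}

type logRecorder struct{}

func (logRecorder) RecordQueueSize(name string, size int) {
	fmt.Printf("%s: %d\n", name, size)
}

type channel struct {
	incoming chan []byte
}

// monitorQueueSizes stops when ctx is cancelled, e.g. when the channel is
// closed or the node shuts down, instead of running forever.
func (c *channel) monitorQueueSizes(ctx context.Context, recorder queueSizeRecorder) {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			recorder.RecordQueueSize("incoming_message_queue", len(c.incoming))
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	c := &channel{incoming: make(chan []byte, 4096)}
	c.monitorQueueSizes(ctx, logRecorder{})
}
```

A done channel closed alongside the broadcast channel would work just as well; the key point is that the loop has some exit signal.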
```go
	p.broadcastChannelManager.setMetricsRecorder(recorder)
}
// Update notifiee with metrics recorder
p.host.Network().Notify(buildNotifiee(p.host, recorder))
```
looks like buildNotifiee gets called twice... once at connection time with nil metrics, and again in SetMetricsRecorder. the second call adds a new notifiee but doesn't remove the first one. this should work fine but you'll have two notifiees registered. just flagging in case that wasn't intentional.
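If swapping rather than stacking notifiees is preferred, one possible shape is below; `provider`, `MetricsRecorder`, and the `notifiee` field are illustrative additions, and only `Notify`/`StopNotify` come from go-libp2p's `network.Network` interface:

```go
package libp2pnet // hypothetical package name used only for this sketch

import (
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/network"
)

// MetricsRecorder stands in for the PR's recorder interface.
type MetricsRecorder interface{}

type provider struct {
	host     host.Host
	notifiee network.Notifiee
}

// buildNotifiee stands in for the existing helper; it would record peer
// connect/disconnect events through the recorder when one is set.
func buildNotifiee(h host.Host, recorder MetricsRecorder) network.Notifiee {
	return &network.NotifyBundle{
		// ConnectedF / DisconnectedF would call into the recorder here.
	}
}

// SetMetricsRecorder swaps the registered notifiee instead of stacking a
// second one next to the nil-recorder notifiee added at connection time.
func (p *provider) SetMetricsRecorder(recorder MetricsRecorder) {
	if p.notifiee != nil {
		p.host.Network().StopNotify(p.notifiee)
	}
	p.notifiee = buildNotifiee(p.host, recorder)
	p.host.Network().Notify(p.notifiee)
}
```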
```go
// Update metrics
if wd.metricsRecorder != nil {
	wd.actionsMutex.Lock()
```
the mutex wd.actionsMutex is already held from other lock calls. calling Lock() again here will deadlock since Go mutexes aren't reentrant.
suggest removing the lock/unlock and just using len(wd.actions) directly, if this makes sense (maybe not)
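A small, self-contained illustration of the problem and the suggested fix; the `dispatcher` type stands in for `wd` in the snippet above and is not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// Go's sync.Mutex is not reentrant, so re-acquiring a lock the goroutine
// already holds blocks forever.
type dispatcher struct {
	actionsMutex sync.Mutex
	actions      map[string]struct{}
}

func (d *dispatcher) dispatch() {
	d.actionsMutex.Lock()
	defer d.actionsMutex.Unlock()

	// ... action bookkeeping happens here while the mutex is held ...

	// Deadlock variant (do not do this): calling d.actionsMutex.Lock() again
	// here would block forever because this goroutine already holds the lock.

	// Suggested fix: read the map length directly; the lock is already held.
	active := len(d.actions)
	fmt.Println("active actions:", active)
}

func main() {
	d := &dispatcher{actions: map[string]struct{}{"redemption": {}}}
	d.dispatch()
}
```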
```go
r.ethLastCheck = startTime
r.ethLastError = err
r.ethMutex.Unlock()
rpcHealthLogger.Warnf(
```
Error messages and RPC response times are logged and exposed as metrics:
- `rpc_eth_health_status` - reveals if the Ethereum RPC is down
- `rpc_btc_health_status` - reveals if the Bitcoin RPC is down
- `rpc_eth_response_time_seconds` - reveals RPC latency

an attacker monitoring these metrics knows:
- when to attack (the RPC is slow/degraded)
- which backend service to target
- when their DoS attack on the RPC is succeeding

not sure this is expected
- Updated the performance metrics initialization to accept an existing instance, preventing duplicate registrations.
- Improved error handling in the metrics observer to log duplicate registrations at the debug level instead of as warnings.
- Added a method to periodically observe gauge metrics, ensuring better monitoring capabilities.
The Keep Core node now exposes 31+ performance metrics via the `/metrics` endpoint (port 9601). These metrics provide comprehensive visibility into node operations, network health, and system performance.
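As a rough illustration of consuming this endpoint programmatically, the sketch below scrapes a locally running node on the default port above and computes a signing success rate from two of the counters listed further down; the parsing library is the generic Prometheus text-format parser, not part of this PR:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Fetch the metrics exposition from a locally running node.
	resp, err := http.Get("http://localhost:9601/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse the Prometheus text format into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	// Helper that reads the first sample of a counter, or 0 if absent.
	counter := func(name string) float64 {
		if mf, ok := families[name]; ok && len(mf.GetMetric()) > 0 {
			return mf.GetMetric()[0].GetCounter().GetValue()
		}
		return 0
	}

	total := counter("performance_signing_operations_total")
	success := counter("performance_signing_success_total")
	if total > 0 {
		fmt.Printf("signing success rate: %.2f%%\n", 100*success/total)
	}
}
```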
Integrated Metrics by Category

1. DKG (Distributed Key Generation) Metrics (6 metrics)

Counters:
- `performance_dkg_joined_total` - Total number of DKG joins (members joined)
- `performance_dkg_failed_total` - Total number of failed DKG executions
- `performance_dkg_validation_total` - Total number of DKG result validations performed
- `performance_dkg_challenges_submitted_total` - Total number of DKG challenges submitted on-chain
- `performance_dkg_approvals_submitted_total` - Total number of DKG approvals submitted on-chain

Duration Metrics:
- `performance_dkg_duration_seconds` - Average duration of DKG operations
- `performance_dkg_duration_seconds_count` - Total count of DKG operations

Performance Insights:
- `dkg_joined_total / (dkg_joined_total + dkg_failed_total)` - Monitor DKG participation and success rates
- Alert if `dkg_duration_seconds` exceeds 300 seconds (5 minutes) - indicates slow DKG operations
- Track `dkg_challenges_submitted_total` and `dkg_approvals_submitted_total` to monitor dispute resolution activity
- High `dkg_validation_total` relative to joins indicates active validation of DKG results

2. Signing Operations Metrics (5 metrics)
Counters:
- `performance_signing_operations_total` - Total number of signing operations attempted
- `performance_signing_success_total` - Total number of successful signing operations
- `performance_signing_failed_total` - Total number of failed signing operations
- `performance_signing_timeouts_total` - Total number of signing operations that timed out

Duration Metrics:
- `performance_signing_duration_seconds` - Average duration of signing operations
- `performance_signing_duration_seconds_count` - Total count of signing operations

Performance Insights:
- `signing_success_total / signing_operations_total` - Critical metric for node reliability
- Alert if the `signing_failed_total` rate exceeds 10% of total operations
- `signing_timeouts_total / signing_operations_total` - Indicates network or coordination issues
- Alert if `signing_duration_seconds` exceeds 60 seconds - indicates slow signing operations
- Track the `signing_operations_total` rate to understand signing workload

3. Wallet Dispatcher Metrics (6 metrics)
Counters:
- `performance_wallet_actions_total` - Total number of wallet actions dispatched
- `performance_wallet_action_success_total` - Total number of successfully completed wallet actions
- `performance_wallet_action_failed_total` - Total number of failed wallet actions
- `performance_wallet_dispatcher_rejected_total` - Total number of wallet actions rejected (wallet busy)
- `performance_wallet_heartbeat_failures_total` - Total number of wallet heartbeat failures

Gauges:
- `performance_wallet_dispatcher_active_actions` - Current number of wallets with active actions

Duration Metrics:
- `performance_wallet_action_duration_seconds` - Average duration of wallet actions
- `performance_wallet_action_duration_seconds_count` - Total count of wallet actions

Performance Insights:
- `wallet_dispatcher_rejected_total / wallet_actions_total` - Alert if > 5%; indicates wallet saturation
- `wallet_action_success_total / wallet_actions_total` - Monitor wallet action reliability
- `wallet_dispatcher_active_actions` shows the current wallet workload
- `wallet_heartbeat_failures_total` indicates wallet connectivity issues

4. Coordination Operations Metrics (4 metrics)
Counters:
- `performance_coordination_windows_detected_total` - Total number of coordination windows detected
- `performance_coordination_procedures_executed_total` - Total number of coordination procedures executed successfully
- `performance_coordination_failed_total` - Total number of failed coordination procedures

Duration Metrics:
- `performance_coordination_duration_seconds` - Average duration of coordination procedures
- `performance_coordination_duration_seconds_count` - Total count of coordination procedures

Performance Insights:
- `coordination_procedures_executed_total / coordination_windows_detected_total` - Success rate of coordination
- Alert if the `coordination_failed_total` rate exceeds 5% of detected windows
- Track `coordination_windows_detected_total` to understand coordination frequency
- Track `coordination_duration_seconds` to identify slow coordination operations

5. Network Operations Metrics (10 metrics)
Peer Connection Metrics:
- `performance_peer_connections_total` - Total number of peer connections established
- `performance_peer_disconnections_total` - Total number of peer disconnections

Message Metrics:
- `performance_message_broadcast_total` - Total number of messages broadcast to the network
- `performance_message_received_total` - Total number of messages received from the network

Queue Size Metrics (Gauges):
- `performance_incoming_message_queue_size` - Current size of the incoming message queue (with `channel` label)
- `performance_message_handler_queue_size` - Current size of message handler queues (with `channel` and `handler` labels)

Ping Test Metrics:
- `performance_ping_test_total` - Total number of ping tests performed
- `performance_ping_test_success_total` - Total number of successful ping tests
- `performance_ping_test_failed_total` - Total number of failed ping tests
- `performance_ping_test_duration_seconds` - Average duration of ping tests
- `performance_ping_test_duration_seconds_count` - Total count of ping tests

Performance Insights:
- `peer_connections_total` vs `peer_disconnections_total` - Monitor connection stability
- Track `message_broadcast_total` and `message_received_total` rates
- Alert if `incoming_message_queue_size` > 3000 (75% of the 4096 capacity) - indicates a message processing bottleneck
- Alert if `message_handler_queue_size` > 400 (75% of the 512 capacity) - indicates handler saturation
- `ping_test_duration_seconds` shows network round-trip time
- Alert if the `ping_test_failed_total` rate exceeds 10% of ping tests - indicates network issues
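For context on the labeled queue-size gauges listed above, here is a generic client_golang sketch of a gauge with a `channel` label exposed on the port mentioned earlier; the PR records these values through its own recorder interface, so this is illustrative only and the label value is hypothetical:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// A labeled gauge matching the metric name documented above.
	queueSize := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "performance_incoming_message_queue_size",
			Help: "Current size of incoming message queue",
		},
		[]string{"channel"},
	)
	prometheus.MustRegister(queueSize)

	// Record the current queue depth for one broadcast channel
	// ("tbtc-coordination" is a hypothetical channel name).
	queueSize.WithLabelValues("tbtc-coordination").Set(42)

	// Expose the metrics endpoint.
	http.Handle("/metrics", promhttp.Handler())
	fmt.Println("serving metrics on :9601/metrics")
	_ = http.ListenAndServe(":9601", nil)
}
```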