
Conversation

@DmarshalTU

This implements the Feat: Embedded HTTP API Server #7 requirements. Here's what was accomplished:

Changes:

  1. SRE Server Implementation (internal/sre/server.go):
    • Complete HTTP API server with REST endpoints
    • Real-time event streaming via Server-Sent Events (SSE)
    • Alert management and status tracking
    • Integration with Kubernetes Hook resources
  2. Workflow Integration:
    • Updated workflow/coordinator.go to accept an SRE server parameter
    • Updated workflow/workflow_manager.go to pass the SRE server to the processor
    • Updated pipeline/processor.go to integrate with the SRE server
    • Updated cmd/main.go to start the SRE server on port 8082

Enhanced API Implementation (FR1-FR6):

  1. FR1: Embedded HTTP API Server
    • ✅ Configurable port (default: 8082): Implemented
    • ✅ Graceful startup/shutdown: Implemented with context cancellation
    • ✅ Request routing, middleware, error handling: Implemented with HTTP mux
    • ⚠️ Authentication/authorization: Basic structure (needs enhancement)
    • ✅ Request logging: Implemented
  2. FR2: Event Query Endpoints
    • ✅ GET /api/v1/events: Implemented with query parameter filtering
    • ✅ GET /api/v1/events/{namespace}/{name}/events: Implemented
    • ✅ Query Parameters: namespace, eventType, resourceName, status filtering
    • ✅ Response format: JSON with data and total count
  3. FR3: Event Statistics and Aggregation
    • ✅ GET /api/v1/stats/events/summary: Implemented
    • ✅ GET /api/v1/stats/events/by-type: Implemented with percentages
    • ✅ Response format: Matches the specified JSON structure
  4. FR4: Real-time Event Streaming (see the sketch after this list)
    • ✅ Server-Sent Events: /api/v1/events/stream implemented
    • ✅ Connection management: Client channel management with cleanup
    • ✅ Heartbeat mechanism: 30-second heartbeat for connection health
    • ✅ CORS headers: Proper CORS configuration
  5. FR5: Hook Configuration Management
    • ✅ GET /api/v1/hooks: Lists all Hook configurations
    • ✅ Hook status integration: Syncs with active events
    • ⚠️ Individual hook endpoints: Basic structure (needs enhancement)
  6. FR6: Health and Diagnostics
    • ✅ GET /api/v1/health: Health check endpoint
    • ✅ GET /api/v1/diagnostics: Comprehensive diagnostics
    • ✅ GET /api/v1/metrics: Prometheus-style metrics
    • ✅ Health information: API status, memory usage, connection counts
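For reference, below is a minimal sketch of what the FR4 streaming endpoint can look like: SSE with a 30-second heartbeat, a CORS header, and per-client channel cleanup. The type and method names (Server, Event, addClient, removeClient) are illustrative stand-ins, not the actual code in internal/sre/server.go.

```go
package sre

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// Event is a trimmed stand-in for the controller's event payload.
type Event struct {
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
	EventType string `json:"eventType"`
	Status    string `json:"status"`
}

// Server tracks one channel per connected SSE client.
type Server struct {
	mu      sync.Mutex
	clients map[chan Event]struct{}
}

func NewServer() *Server {
	return &Server{clients: make(map[chan Event]struct{})}
}

func (s *Server) addClient() chan Event {
	ch := make(chan Event, 16)
	s.mu.Lock()
	s.clients[ch] = struct{}{}
	s.mu.Unlock()
	return ch
}

func (s *Server) removeClient(ch chan Event) {
	s.mu.Lock()
	delete(s.clients, ch)
	s.mu.Unlock()
}

// handleEventStream serves GET /api/v1/events/stream as Server-Sent Events.
func (s *Server) handleEventStream(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	w.Header().Set("Access-Control-Allow-Origin", "*") // CORS for browser clients

	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}

	ch := s.addClient()
	defer s.removeClient(ch)

	heartbeat := time.NewTicker(30 * time.Second)
	defer heartbeat.Stop()

	for {
		select {
		case <-r.Context().Done(): // client disconnected, clean up
			return
		case <-heartbeat.C:
			fmt.Fprint(w, ": heartbeat\n\n") // SSE comment line keeps the connection alive
			flusher.Flush()
		case ev := <-ch:
			data, _ := json.Marshal(ev)
			fmt.Fprintf(w, "data: %s\n\n", data)
			flusher.Flush()
		}
	}
}
```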

🔧 Key Features Implemented:

  1. Dual API Structure: Both /api/v1/* (new) and /api/* (legacy) endpoints
  2. Real-time Integration: Events automatically create alerts in the SRE server
  3. Comprehensive Filtering: Query parameters for namespace, event type, resource name, status
  4. Statistics & Metrics: Event counts, percentages, severity breakdowns
  5. Streaming Support: Server-Sent Events for real-time updates
  6. Health Monitoring: Detailed diagnostics and metrics endpoints

How to Use:

  1. Start the controller: The SRE server will automatically start on port 8082
  2. Access the API: endpoints are served under /api/v1/ on port 8082
  3. Query Examples:
    • Filter by namespace: GET /api/v1/events?namespace=production
    • Filter by event type: GET /api/v1/events?eventType=pod-restart
    • Filter by status: GET /api/v1/events?status=firing
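A minimal Go sketch of calling the filtered events endpoint, assuming the default port 8082 described above:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Build GET /api/v1/events?namespace=production&eventType=pod-restart&status=firing
	q := url.Values{}
	q.Set("namespace", "production")
	q.Set("eventType", "pod-restart")
	q.Set("status", "firing")

	resp, err := http.Get("http://localhost:8082/api/v1/events?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body)) // JSON payload with "data" and a total count
}
```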

@DmarshalTU (Author)

This is the general structure; don't approve until all functionality is added.

Denis Tu and others added 5 commits September 19, 2025 15:57

@EItanya (Collaborator) left a comment

This is super awesome, thanks so much for getting this started! This is gonna unlock so many awesome use-cases :)

I think there are a few conversations we need to have before merging, but otherwise super happy about this.

Comment on lines +36 to +38
// AgentId specifies the Kagent agent to call when this event occurs (legacy format)
// +kubebuilder:validation:Optional
AgentId string `json:"agentId,omitempty"`
Collaborator

Why was this added back in? This format has been deprecated and has many issues.

Collaborator

I guess a rebase gone wrong. @DmarshalTU - we switched to AgentRef instead of AgentId to be aligned with the rest of Kagent

deduplicationManager interfaces.DeduplicationManager,
kagentClient interfaces.KagentClient,
statusManager interfaces.StatusManager,
sreServer interface{},
Collaborator

Why is this an interface{} and not a concrete type?
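One way to tighten this, sketched with hypothetical names rather than the actual khook types: declare a small interface describing only what the processor needs from the server and accept that instead of interface{}. A nil value keeps the server optional without losing type safety.

```go
package pipeline

import "context"

// AlertSink is a hypothetical narrow contract for what the processor needs
// from the API server; the method name and signature are illustrative only.
type AlertSink interface {
	AddAlert(ctx context.Context, namespace, name, eventType string) error
}

// Processor keeps a typed reference instead of interface{}.
// (The other dependencies from the real constructor are elided here.)
type Processor struct {
	alerts AlertSink // may be nil if the server is disabled
}

func NewProcessor(alerts AlertSink) *Processor {
	return &Processor{alerts: alerts}
}

func (p *Processor) notifyAlert(ctx context.Context, ns, name, eventType string) {
	if p.alerts == nil {
		return // no API server wired in
	}
	_ = p.alerts.AddAlert(ctx, ns, name, eventType) // errors ignored in this sketch
}
```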

agentRef := types.NamespacedName{
Name: match.Configuration.AgentRef.Name,
Namespace: agentRefNs,
// Handle both agentId (legacy) and agentRef (new) formats
Collaborator

Not a fan of this as I mentioned above

// Continue even if status recording fails
}

// Add alert to SRE server if available
Collaborator

At a high level, why do we have a callout to the server from here, rather than the server sharing the same datasource as this component?
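For illustration, a minimal sketch of the shared-datasource alternative: the processor appends to one store and the HTTP server reads from the same store, so neither component needs a direct reference to the other. All names here are hypothetical.

```go
package store

import "sync"

// Event is a trimmed placeholder for khook's event record.
type Event struct {
	Namespace, Name, EventType, Status string
}

// EventStore is written by the pipeline processor and read by the API server.
type EventStore struct {
	mu     sync.RWMutex
	events []Event
}

// Append is called from the processor after an event is handled.
func (s *EventStore) Append(e Event) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.events = append(s.events, e)
}

// List is called from the API server's query handlers.
func (s *EventStore) List() []Event {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make([]Event, len(s.events))
	copy(out, s.events)
	return out
}
```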

@@ -0,0 +1,1590 @@
package sre
Collaborator

Why is this specifically called sre? I think we should think about this as a general httpserver for khook rather than anything specific to your UI.

Collaborator

apiServer would also work

Comment on lines +42 to +50
// AlertSummary represents alert statistics
type AlertSummary struct {
Total int `json:"total"`
Firing int `json:"firing"`
Resolved int `json:"resolved"`
Acknowledged int `json:"acknowledged"`
BySeverity map[string]int `json:"bySeverity"`
ByEventType map[string]int `json:"byEventType"`
}
Collaborator

Should we rely on a DB for this, similar to kagent? We can use SQLite by default, but rely on the same Postgres instance as kagent if it's available.
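A rough sketch of that idea; the env var name, file path, and package layout are assumptions, not existing kagent or khook conventions:

```go
package storage

import (
	"database/sql"
	"os"

	_ "github.com/lib/pq"           // Postgres driver
	_ "github.com/mattn/go-sqlite3" // SQLite driver
)

// OpenAlertDB uses kagent's Postgres instance when a DSN is provided and
// falls back to a local SQLite file otherwise.
func OpenAlertDB() (*sql.DB, error) {
	if dsn := os.Getenv("KAGENT_POSTGRES_DSN"); dsn != "" {
		return sql.Open("postgres", dsn)
	}
	return sql.Open("sqlite3", "file:khook-alerts.db?_busy_timeout=5000")
}
```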


// convertAgentIDFormat converts agent ID from kagent/k8s-agent to kagent__NS__k8s_agent format
// This matches the Python identifier format used by kagent
func convertAgentIDFormat(agentRef string) string {
Collaborator

We have libraries for this in kagent

Collaborator

A couple of general comments about this file:

  1. Alerting is really an entirely new feature set; I'd personally prefer that we merge a v1 of the httpserver before adding alerting.
  2. This file is very, very large; can we split the handlers up by feature set (see the sketch below)?
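Something along these lines, as a sketch only - the file split and function names are hypothetical, and the handler bodies are stubs:

```go
package apiserver

import (
	"encoding/json"
	"net/http"
)

// Server stands in for the existing server type.
type Server struct{}

// routes wires together per-feature registration functions; each would live
// in its own file (handlers_events.go, handlers_stats.go, handlers_health.go, ...).
func (s *Server) routes() *http.ServeMux {
	mux := http.NewServeMux()
	s.registerEventRoutes(mux)
	s.registerHealthRoutes(mux)
	return mux
}

// would live in handlers_events.go
func (s *Server) registerEventRoutes(mux *http.ServeMux) {
	mux.HandleFunc("/api/v1/events", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]any{"data": []any{}, "total": 0})
	})
}

// would live in handlers_health.go
func (s *Server) registerHealthRoutes(mux *http.ServeMux) {
	mux.HandleFunc("/api/v1/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
}
```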

Collaborator

I agree; I would also like the API server and alerting as two separate features. And yes, splitting the handlers would make this easier to comprehend and change.

@DmarshalTU (Author)

> This is super awesome, thanks so much for getting this started! This is gonna unlock so many awesome use-cases :)
>
> I think there are a few conversations we need to have before merging, but otherwise super happy about this.

So basically I showcased my use case for khook in the IDE to @antweiss, and unfortunately I lost the source code, so this is a fast attempt to recreate it and share my thoughts and needs. I think there's no need to merge it, at least in this state; we should discuss the main guidance and code a new one, but keep this PR for reference.

@antweiss (Collaborator) left a comment

@DmarshalTU in general - this is great. Two important features that I'm sure many folks can benefit from!
To summarize:

  1. check out @EItanya's comments regarding keeping in line with the rest of kagent - it's important we keep things uniform
  2. let's change all sre references to something more generic like apiserver or httpserver
  3. let's limit this PR to only adding an API server for existing types and add alerting as a separate feature (which I'm very thankful for)
  4. use agentRef rather than agentId

anything I missed?
I do realize these are massive changes. But then - this is a massive feature!

os.Exit(1)
}

// Start SRE-IDE server
Collaborator

I have an issue with the nomenclature - I know you built this specifically for your use case, but I would like to make things generic so other people can potentially build on them. I think just calling it "API server" will work better.


EventTimeoutDuration = 10 * time.Minute
// NotificationSuppressionDuration is the window to suppress re-sending after success
NotificationSuppressionDuration = 10 * time.Minute
NotificationSuppressionDuration = 30 * time.Second
Collaborator

This needs to be discussed. While the original 10 minutes may be too long, 30 seconds is IMHO too short. Maybe we should at least make this configurable via values/ConfigMap.
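For example, a sketch of reading the window from an env var (which a Helm value or ConfigMap could set) with a safe default; the variable name and default are assumptions:

```go
package config

import (
	"os"
	"time"
)

// notificationSuppressionDuration returns the suppression window, configurable
// at deploy time, falling back to a conservative default.
func notificationSuppressionDuration() time.Duration {
	const fallback = 10 * time.Minute
	raw := os.Getenv("NOTIFICATION_SUPPRESSION_DURATION")
	if raw == "" {
		return fallback
	}
	d, err := time.ParseDuration(raw) // accepts values like "30s" or "5m"
	if err != nil {
		return fallback
	}
	return d
}
```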

}

// determineSeverity determines alert severity based on event type
func (s *Server) determineSeverity(eventType string) string {
Collaborator

I don't think we should hard-code severity here. My plan for khook is to support pluggable event sources - and so potentially many more eventTypes in the future. Also, severity definitions may change across teams/environments. I'd suggest adding an optional "severity" field on eventConfigurations in the hook - then we can set the default values to what's defined here and allow changing them according to the user's needs.
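A sketch of what that optional field and the fallback could look like; the field name, enum values, and default mapping are assumptions for illustration:

```go
package v1alpha1

// EventConfiguration shows only the proposed addition, not the full type.
type EventConfiguration struct {
	// EventType selects which event this configuration matches.
	EventType string `json:"eventType"`

	// Severity optionally overrides the default severity for this event type.
	// +kubebuilder:validation:Optional
	// +kubebuilder:validation:Enum=critical;high;medium;low
	Severity string `json:"severity,omitempty"`
}

// severityFor falls back to built-in defaults when the hook doesn't set one.
func severityFor(cfg EventConfiguration) string {
	if cfg.Severity != "" {
		return cfg.Severity
	}
	switch cfg.EventType {
	case "pod-restart", "oom-kill":
		return "high"
	default:
		return "medium"
	}
}
```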

Collaborator

10000%, in fact I’d love to potentially start planning how we’d extend the API/system for multiple sources.

@DmarshalTU mentioned this pull request Nov 8, 2025