monitoring

package
v1.1.0
Published: Aug 21, 2025 License: MIT Imports: 15 Imported by: 0

README

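Go Backend Kit - Monitoring System

This project provides a complete monitoring solution, including Prometheus metrics collection, Grafana visualization, and SLA monitoring.

Core Problems and Solutions

1. API response time monitoring

Problem: How do you accurately monitor the response time of each API?

Solution: Use the dedicated Gin middleware, GinMetricsMiddleware.

// Configure the monitoring middleware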
metricsConfig := &monitoring.MetricsConfig{
    SLAThresholds: map[string]float64{
        "/api/v1/users":         0.5,  // 500ms SLA
        "/api/v1/user/profile":  0.3,  // 300ms SLA
        "/api/v1/orders":        1.0,  // 1s SLA
    },
    SkipPaths: []string{
        "/health", "/metrics", "/ready", "/live",
    },
    PathGrouping: map[string]string{
        `/api/v1/users/\d+`: "/api/v1/users/:id",  // group paths that differ only by the ID parameter
    },
}

// Use it with Gin
router.Use(monitoring.GinMetricsMiddleware(metricsConfig))
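
The PathGrouping map pairs a regular expression with the route template that should replace matching paths. How the middleware applies these patterns is not shown in this README, so the following is only a minimal sketch of one way such grouping could work. The helpers compileGroups and normalizePath are hypothetical, and anchoring each pattern to the whole path is an assumption made here, not documented behavior.

```go
package main

import "regexp"

// pathGroup pairs a compiled pattern with the route template that replaces it.
type pathGroup struct {
	pattern  *regexp.Regexp
	template string
}

// compileGroups compiles a PathGrouping-style map (pattern -> template).
// Each pattern is anchored so it must match the entire path (an assumption).
func compileGroups(grouping map[string]string) ([]pathGroup, error) {
	groups := make([]pathGroup, 0, len(grouping))
	for pattern, template := range grouping {
		re, err := regexp.Compile("^(?:" + pattern + ")$")
		if err != nil {
			return nil, err
		}
		groups = append(groups, pathGroup{pattern: re, template: template})
	}
	return groups, nil
}

// normalizePath returns the template of a matching group, or the path
// unchanged when nothing matches. Map iteration order is random, so
// patterns should not overlap.
func normalizePath(groups []pathGroup, path string) string {
	for _, g := range groups {
		if g.pattern.MatchString(path) {
			return g.template
		}
	}
	return path
}
```

With the pattern from the configuration above, normalizePath maps /api/v1/users/42 to /api/v1/users/:id and leaves /api/v1/users/42/orders untouched, because the anchored pattern does not cover the longer path.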
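
2. Key metrics

HTTP request metrics
  • gin_requests_total - total number of requests (labeled by method, route, status_code)
  • gin_request_duration_seconds - request duration histogram
  • gin_request_size_bytes - request body size
  • gin_response_size_bytes - response body size
  • gin_active_requests - number of requests currently in flight
API endpoint metrics
  • api_endpoint_requests_total - number of requests per API endpoint
  • api_endpoint_duration_seconds - API endpoint response time
  • sla_violations_total - number of SLA violations
System metrics
  • cpu_usage_percent - CPU usage
  • memory_usage_bytes - memory usage
  • goroutines_total - number of goroutines
  • gc_duration_seconds - time spent in garbage collection
Middleware metrics
  • authentication_attempts_total - number of authentication attempts
  • rate_limit_hits_total - number of rate limit hits
  • cache_hits_total / cache_misses_total - cache hits / misses
Business metrics
  • user_registrations_total - number of user registrations
  • user_logins_total - number of user logins
  • active_users - number of active users
  • errors_total - total number of errors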

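Usage

1. Basic integration
import (
    "log/slog"
    "os"

    // Assumed import path for the config package; adjust it to the module layout.
    "github.com/NHYCRaymond/go-backend-kit/config"
    "github.com/NHYCRaymond/go-backend-kit/monitoring"
    "github.com/gin-gonic/gin"
)

func main() {
    // StartServer expects a *slog.Logger
    logger := slog.New(slog.NewTextHandler(os.Stdout, nil))
    router := gin.New()

    // 1. Register the monitoring middleware first
    router.Use(monitoring.GinMetricsMiddleware(nil)) // nil uses the default configuration

    // 2. Register the other middleware
    router.Use(gin.Logger())
    router.Use(gin.Recovery())

    // 3. Start the metrics server
    monitoring.StartServer(&config.MonitoringConfig{
        Port: 9090,
        Path: "/metrics",
    }, logger)
}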

2. Custom configuration
config := &monitoring.MetricsConfig{
    SLAThresholds: map[string]float64{
        "/api/v1/users":    0.5,  // 500ms
        "/api/v1/orders":   1.0,  // 1s
    },
    SkipPaths: []string{"/health", "/metrics"},
    PathGrouping: map[string]string{
        `/api/v1/users/\d+`: "/api/v1/users/:id",
    },
}

router.Use(monitoring.GinMetricsMiddleware(config))
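
3. Recording business metrics
// Record metrics from within business logic
func loginHandler(c *gin.Context) {
    // Business logic (performLogin is an application-defined helper)
    success := performLogin(c)

    if success {
        monitoring.RecordUserLogin("success")
    } else {
        monitoring.RecordUserLogin("failed")
    }
}

func registerHandler(c *gin.Context) {
    // Business logic (performRegistration is an application-defined helper)
    performRegistration(c)

    monitoring.RecordUserRegistration()
}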

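Prometheus Query Examples

1. API response time monitoring
# 95th percentile response time
histogram_quantile(0.95, rate(gin_request_duration_seconds_bucket[5m]))

# Average response time per route
sum by (route) (rate(gin_request_duration_seconds_sum[5m])) / sum by (route) (rate(gin_request_duration_seconds_count[5m]))

# Fraction of slow requests (> 1 second); the le="1" bucket counts requests that took at most 1 second
1 - sum(rate(gin_request_duration_seconds_bucket{le="1"}[5m])) / sum(rate(gin_request_duration_seconds_count[5m]))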
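
Because the duration metrics are histograms, percentile queries are estimates derived from cumulative bucket counts rather than exact values. The sketch below illustrates the usual linear-interpolation idea in plain Go so the effect of bucket boundaries is easier to see. The names bucket, posInf, and estimateQuantile are hypothetical, and the code is a simplified illustration, not Prometheus' actual implementation.

```go
package main

import "math"

// posInf is the upper bound of the final "+Inf" bucket.
var posInf = math.Inf(1)

// bucket is one cumulative histogram bucket: count is the number of
// observations less than or equal to upperBound (the "le" label).
type bucket struct {
	upperBound float64
	count      float64
}

// estimateQuantile approximates quantile q (0..1) from cumulative buckets
// sorted by upper bound, where the last bucket is the +Inf bucket.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total

	// Find the first bucket whose cumulative count reaches the rank.
	i := 0
	for i < len(buckets)-1 && buckets[i].count < rank {
		i++
	}
	if i == len(buckets)-1 {
		// The rank falls into the +Inf bucket: report the highest finite bound.
		return buckets[len(buckets)-2].upperBound
	}

	lower, prev := 0.0, 0.0
	if i > 0 {
		lower = buckets[i-1].upperBound
		prev = buckets[i-1].count
	}
	inBucket := buckets[i].count - prev
	if inBucket == 0 {
		return lower
	}
	// Assume observations are spread evenly inside the bucket.
	return lower + (buckets[i].upperBound-lower)*(rank-prev)/inBucket
}
```

For cumulative counts of 50, 90, 100, and 100 at bounds 0.1, 0.5, 1, and +Inf, the 95th percentile lands in the (0.5, 1] bucket and is estimated at about 0.75 seconds. The estimate can never be more precise than the bucket layout allows, which is why bucket boundaries should be chosen close to the SLA thresholds of interest.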
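
2. Error rate monitoring
# Overall error rate (share of 5xx responses); both sides are aggregated so their label sets match
sum(rate(gin_requests_total{status_code=~"5.."}[5m])) / sum(rate(gin_requests_total[5m]))

# Error rate per route
sum by (route) (rate(gin_requests_total{status_code=~"5.."}[5m])) / sum by (route) (rate(gin_requests_total[5m]))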
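
3. SLA monitoring
# SLA violation rate (violations per second)
rate(sla_violations_total[5m])

# SLA violations for a single endpoint
rate(sla_violations_total{endpoint="/api/v1/users"}[5m])

4. System resource monitoring
# Memory usage (percent)
memory_usage_bytes / memory_total_bytes * 100

# Goroutine count over time
goroutines_total

# Active requests
gin_active_requests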

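Grafana Integration

1. Import the dashboard

Import the preconfigured dashboard from the provided monitoring/grafana/dashboard.json file.

2. Key panels
  1. HTTP request rate - real-time request volume
  2. Response time distribution - 95th/50th percentiles
  3. Error rate trend - 4xx/5xx errors
  4. SLA violation monitoring - tracks requests that exceed their threshold
  5. System resource usage - CPU/memory/goroutines
  6. Business metrics - user registrations/logins/active users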

Alerting Rules

1. High response time alert
- alert: HighResponseTime
  expr: histogram_quantile(0.95, rate(gin_request_duration_seconds_bucket[5m])) > 1
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "High response time detected"

2. High error rate alert
- alert: HighErrorRate
  expr: sum(rate(gin_requests_total{status_code=~"5.."}[5m])) / sum(rate(gin_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"

3. SLA violation alert
- alert: SLAViolation
  expr: rate(sla_violations_total[5m]) > 0.01
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "SLA violation detected"

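Best Practices

1. Control label cardinality
  • Use PathGrouping to group similar paths
  • Avoid high-cardinality labels (such as user IDs)
  • Limit the number of distinct label values
2. Setting SLAs
  • Set reasonable SLA thresholds based on business requirements
  • Distinguish the SLA requirements of different kinds of APIs
  • Review and adjust SLA thresholds regularly
3. Monitoring coverage
  • Monitor all critical business flows
  • Include both infrastructure and application-level metrics
  • Set appropriate alert thresholds
4. Performance
  • Clean up stale metrics regularly
  • Use an appropriate scrape interval
  • Monitor the performance of Prometheus itself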
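
One way to act on the label cardinality advice is to cap how many distinct values a label may take and fold everything beyond the cap into a single fallback value. The sketch below shows that idea with a hypothetical labelLimiter type. It is not part of this package, and the fallback value "other" is an arbitrary choice.

```go
package main

import "sync"

// labelLimiter admits at most limit distinct values for a label.
// Any further value is collapsed into "other".
type labelLimiter struct {
	mu    sync.Mutex
	limit int
	seen  map[string]struct{}
}

// newLabelLimiter creates a limiter that admits up to limit distinct values.
func newLabelLimiter(limit int) *labelLimiter {
	return &labelLimiter{limit: limit, seen: make(map[string]struct{})}
}

// value returns v if it is already known or there is still room for it,
// and "other" once the limit has been reached.
func (l *labelLimiter) value(v string) string {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, ok := l.seen[v]; ok {
		return v
	}
	if len(l.seen) >= l.limit {
		return "other"
	}
	l.seen[v] = struct{}{}
	return v
}
```

A label value would be passed through value before it is handed to a metric, so the number of time series per metric stays bounded no matter what input arrives. The trade-off is that which values are kept depends on arrival order.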

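Troubleshooting

1. Missing metrics
  • Check that the middleware is registered correctly
  • Confirm that routes are configured correctly
  • Verify the Prometheus scrape configuration
2. High cardinality
  • Check that path parameters are grouped correctly
  • Limit dynamic label values
  • Use aggregation rules to reduce storage
3. Performance problems
  • Monitor the overhead of metrics collection
  • Optimize label usage
  • Consider a sampling strategy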

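Extensions

1. Distributed tracing

Jaeger or OpenTelemetry can be integrated for distributed tracing.

2. Log aggregation

Loki can be integrated for log aggregation and correlation.

3. Custom metrics

Custom metrics can be added to suit business requirements.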

Documentation

Constants

This section is empty.

Variables

var (
	// Gin specific metrics
	GinRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "gin_requests_total",
			Help: "Total number of HTTP requests processed by Gin",
		},
		[]string{"method", "route", "status_code"},
	)

	GinRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "gin_request_duration_seconds",
			Help:    "Duration of HTTP requests processed by Gin",
			Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
		},
		[]string{"method", "route"},
	)

	GinRequestSize = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "gin_request_size_bytes",
			Help:    "Size of HTTP requests processed by Gin",
			Buckets: prometheus.ExponentialBuckets(1024, 2, 10),
		},
		[]string{"method", "route"},
	)

	GinResponseSize = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "gin_response_size_bytes",
			Help:    "Size of HTTP responses processed by Gin",
			Buckets: prometheus.ExponentialBuckets(1024, 2, 10),
		},
		[]string{"method", "route"},
	)

	GinActiveRequests = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "gin_active_requests",
			Help: "Number of active HTTP requests being processed by Gin",
		},
		[]string{"method", "route"},
	)

	// API endpoint specific metrics
	APIEndpointMetrics = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "api_endpoint_requests_total",
			Help: "Total number of requests per API endpoint",
		},
		[]string{"endpoint", "method", "status_code", "version"},
	)

	APIEndpointDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "api_endpoint_duration_seconds",
			Help:    "Duration of API endpoint requests",
			Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
		},
		[]string{"endpoint", "method", "version"},
	)

	// SLA metrics
	SLAViolations = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "sla_violations_total",
			Help: "Total number of SLA violations",
		},
		[]string{"endpoint", "threshold"},
	)
)
var (
	// HTTP metrics
	HTTPRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	HTTPRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Duration of HTTP requests",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)

	// Database metrics
	DBConnectionsActive = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "db_connections_active",
			Help: "Number of active database connections",
		},
		[]string{"database"},
	)

	DBConnectionsIdle = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "db_connections_idle",
			Help: "Number of idle database connections",
		},
		[]string{"database"},
	)

	DBQueryDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "db_query_duration_seconds",
			Help:    "Duration of database queries",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"operation"},
	)

	DBQueryTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "db_query_total",
			Help: "Total number of database queries",
		},
		[]string{"operation", "status"},
	)

	// Redis metrics
	RedisCommandsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "redis_commands_total",
			Help: "Total number of Redis commands",
		},
		[]string{"command", "status"},
	)

	RedisCommandsDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "redis_commands_duration_seconds",
			Help:    "Duration of Redis commands",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"command"},
	)

	RedisConnectionsActive = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "redis_connections_active",
			Help: "Number of active Redis connections",
		},
	)

	// Message queue metrics
	MessageQueuePublishedTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "message_queue_published_total",
			Help: "Total number of messages published to queue",
		},
		[]string{"queue", "status"},
	)

	MessageQueueConsumedTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "message_queue_consumed_total",
			Help: "Total number of messages consumed from queue",
		},
		[]string{"queue", "status"},
	)

	MessageQueueProcessingDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "message_queue_processing_duration_seconds",
			Help:    "Duration of message processing",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"queue"},
	)

	// Application metrics
	AppInfo = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "app_info",
			Help: "Application information",
		},
		[]string{"version", "instance_id"},
	)

	AppUptime = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "app_uptime_seconds",
			Help: "Application uptime in seconds",
		},
	)

	// Custom business metrics
	ActiveUsers = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_users",
			Help: "Number of active users",
		},
	)

	RequestsInFlight = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "requests_in_flight",
			Help: "Number of requests currently being processed",
		},
	)

	// System metrics
	CPUUsage = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "cpu_usage_percent",
			Help: "Current CPU usage percentage",
		},
	)

	MemoryUsage = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "memory_usage_bytes",
			Help: "Current memory usage in bytes",
		},
	)

	MemoryTotal = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "memory_total_bytes",
			Help: "Total available memory in bytes",
		},
	)

	Goroutines = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "goroutines_total",
			Help: "Number of goroutines currently running",
		},
	)

	GCDuration = promauto.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "gc_duration_seconds",
			Help:    "Time spent in garbage collection",
			Buckets: prometheus.DefBuckets,
		},
	)

	// Middleware metrics
	AuthenticationAttempts = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "authentication_attempts_total",
			Help: "Total number of authentication attempts",
		},
		[]string{"status", "method"},
	)

	RateLimitHits = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "rate_limit_hits_total",
			Help: "Total number of rate limit hits",
		},
		[]string{"key", "endpoint"},
	)

	CacheHits = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "cache_hits_total",
			Help: "Total number of cache hits",
		},
		[]string{"cache_type", "key"},
	)

	CacheMisses = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "cache_misses_total",
			Help: "Total number of cache misses",
		},
		[]string{"cache_type", "key"},
	)

	// Business domain metrics
	UserRegistrations = promauto.NewCounter(
		prometheus.CounterOpts{
			Name: "user_registrations_total",
			Help: "Total number of user registrations",
		},
	)

	UserLogins = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "user_logins_total",
			Help: "Total number of user logins",
		},
		[]string{"status"},
	)

	APICallsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "api_calls_total",
			Help: "Total number of API calls",
		},
		[]string{"service", "operation", "status"},
	)

	ErrorsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "errors_total",
			Help: "Total number of errors",
		},
		[]string{"type", "component", "severity"},
	)

	// External service metrics
	ExternalServiceCalls = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "external_service_calls_total",
			Help: "Total number of external service calls",
		},
		[]string{"service", "endpoint", "status"},
	)

	ExternalServiceDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "external_service_duration_seconds",
			Help:    "Duration of external service calls",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"service", "endpoint"},
	)
)

Functions

func GetAPIMetrics

func GetAPIMetrics() map[string]interface{}

GetAPIMetrics returns current API metrics for health checks

func GetCacheHitRatio

func GetCacheHitRatio(cacheType string) float64

GetCacheHitRatio calculates cache hit ratio
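
A hit ratio is conventionally hits divided by total lookups. The documentation does not say how GetCacheHitRatio aggregates the key label or what it returns before any lookups are recorded, so the following is only a sketch of the arithmetic with a hypothetical hitRatio helper that returns 0 for an empty cache history (an assumption).

```go
package main

// hitRatio returns hits / (hits + misses).
// It returns 0 when no lookups have been recorded, to avoid dividing by zero.
func hitRatio(hits, misses float64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return hits / total
}
```

For example, 3 hits and 1 miss give a ratio of 0.75.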

func GinMetricsMiddleware

func GinMetricsMiddleware(config *MetricsConfig) gin.HandlerFunc

GinMetricsMiddleware returns a Gin middleware that collects metrics

func HTTPMetricsMiddleware

func HTTPMetricsMiddleware() func(next http.Handler) http.Handler

HTTPMetricsMiddleware returns a middleware that collects HTTP metrics
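
The returned value has the common func(next http.Handler) http.Handler shape used by net/http middleware chains. The package's implementation is not shown here, so the sketch below only illustrates how a middleware of this shape can capture status code and duration using the standard library. The names statusRecorder, observeFunc, metricsMiddleware, and demoRequest are hypothetical.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"time"
)

// statusRecorder wraps http.ResponseWriter to remember the status code written.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

// WriteHeader stores the status code before delegating to the wrapped writer.
func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// observeFunc receives one observation per request (hypothetical callback).
type observeFunc func(method, path string, status int, elapsed time.Duration)

// metricsMiddleware times each request and reports the result to observe.
func metricsMiddleware(observe observeFunc) func(next http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// Default to 200 because handlers may never call WriteHeader.
			rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
			start := time.Now()
			next.ServeHTTP(rec, r)
			observe(r.Method, r.URL.Path, rec.status, time.Since(start))
		})
	}
}

// demoRequest sends one in-memory request through the middleware and
// returns the method and status that were observed.
func demoRequest() (string, int) {
	var gotMethod string
	var gotStatus int
	observe := func(method, _ string, status int, _ time.Duration) {
		gotMethod, gotStatus = method, status
	}
	handler := metricsMiddleware(observe)(http.NotFoundHandler())
	req := httptest.NewRequest(http.MethodGet, "/missing", nil)
	handler.ServeHTTP(httptest.NewRecorder(), req)
	return gotMethod, gotStatus
}
```

In a real service the observe callback would update counters and histograms. Wrapping the ResponseWriter is needed because net/http does not otherwise expose the status code a handler wrote.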

func RecordAPICall

func RecordAPICall(service, operation, status string)

RecordAPICall records API call metrics

func RecordAPICallWithContext

func RecordAPICallWithContext(endpoint, method, status, version string, duration time.Duration)

RecordAPICallWithContext records an API call with additional context

func RecordAuthenticationAttempt

func RecordAuthenticationAttempt(status, method string)

RecordAuthenticationAttempt records authentication attempt metrics

func RecordCacheHit

func RecordCacheHit(cacheType, key string)

RecordCacheHit records cache hit metrics

func RecordCacheMiss

func RecordCacheMiss(cacheType, key string)

RecordCacheMiss records cache miss metrics

func RecordDBMetrics

func RecordDBMetrics(operation string, duration time.Duration, err error)

RecordDBMetrics records database metrics
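
RecordDBMetrics takes an operation name, a duration, and an error, which suggests it is meant to be called once an operation has finished. A small wrapper can take care of the timing so call sites stay short. The sketch below uses hypothetical names (recordFunc, timed, memoryRecorder) and only the standard library. In application code, monitoring.RecordDBMetrics could be passed as the record argument, since its documented signature matches recordFunc.

```go
package main

import "time"

// recordFunc matches the documented parameter list of RecordDBMetrics
// and RecordRedisMetrics: operation, duration, error.
type recordFunc func(operation string, duration time.Duration, err error)

// timed runs fn, measures how long it took, reports the outcome through
// record, and returns fn's error unchanged.
func timed(record recordFunc, operation string, fn func() error) error {
	start := time.Now()
	err := fn()
	record(operation, time.Since(start), err)
	return err
}

// memoryRecorder keeps recorded operations in memory, for example in unit tests.
type memoryRecorder struct {
	operations []string
	failures   int
}

// record implements recordFunc by appending to the in-memory log.
func (m *memoryRecorder) record(operation string, _ time.Duration, err error) {
	m.operations = append(m.operations, operation)
	if err != nil {
		m.failures++
	}
}
```

Passing the recorder as a function value keeps the wrapper independent of any metrics backend, which makes it straightforward to test without a Prometheus registry.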

func RecordError

func RecordError(errorType, component, severity string)

RecordError records error metrics

func RecordExternalServiceCall

func RecordExternalServiceCall(service, endpoint, status string, duration time.Duration)

RecordExternalServiceCall records external service call metrics

func RecordMessageQueueMetrics

func RecordMessageQueueMetrics(queue, operation string, duration time.Duration, err error)

RecordMessageQueueMetrics records message queue metrics

func RecordRateLimitHit

func RecordRateLimitHit(key, endpoint string)

RecordRateLimitHit records rate limit hit metrics

func RecordRedisMetrics

func RecordRedisMetrics(command string, duration time.Duration, err error)

RecordRedisMetrics records Redis metrics

func RecordSLAViolation

func RecordSLAViolation(endpoint string, threshold float64)

RecordSLAViolation records an SLA violation
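
An SLA check compares the measured duration of a request with the threshold configured for its route, where thresholds are expressed in seconds as in MetricsConfig.SLAThresholds. The sketch below shows that comparison with hypothetical helpers checkSLA and thresholdLabel. How the middleware itself decides when to call RecordSLAViolation, and how it formats the threshold label, is not documented here.

```go
package main

import (
	"strconv"
	"time"
)

// checkSLA looks up the threshold (in seconds) configured for route and
// reports whether elapsed exceeds it. Routes without a configured
// threshold never violate.
func checkSLA(thresholds map[string]float64, route string, elapsed time.Duration) (float64, bool) {
	limit, ok := thresholds[route]
	if !ok {
		return 0, false
	}
	return limit, elapsed.Seconds() > limit
}

// thresholdLabel formats a threshold as a compact label value.
func thresholdLabel(threshold float64) string {
	return strconv.FormatFloat(threshold, 'f', -1, 64)
}
```

When checkSLA reports a violation, the returned threshold could be passed on to monitoring.RecordSLAViolation(route, threshold). The lookup should use the grouped route rather than the raw path so that thresholds and metric labels stay aligned.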

func RecordUserLogin

func RecordUserLogin(status string)

RecordUserLogin records user login metrics

func RecordUserRegistration

func RecordUserRegistration()

RecordUserRegistration records user registration metrics

func SetActiveUsers

func SetActiveUsers(count float64)

SetActiveUsers sets the number of active users

func SetAppInfo

func SetAppInfo(version, instanceID string)

SetAppInfo sets application information

func SetDBConnections

func SetDBConnections(database string, active, idle int)

SetDBConnections sets database connection metrics

func SetRedisConnections

func SetRedisConnections(active int)

SetRedisConnections sets Redis connection metrics

func StartServer

func StartServer(cfg *config.MonitoringConfig, logger *slog.Logger)

StartServer starts the Prometheus metrics server

Types

type DatabaseMetricsCollector

type DatabaseMetricsCollector struct {
	// contains filtered or unexported fields
}

DatabaseMetricsCollector collects database metrics from database instances

func NewDatabaseMetricsCollector

func NewDatabaseMetricsCollector(databases map[string]interface{}) *DatabaseMetricsCollector

NewDatabaseMetricsCollector creates a new database metrics collector

func (*DatabaseMetricsCollector) CollectDatabaseMetrics

func (c *DatabaseMetricsCollector) CollectDatabaseMetrics()

CollectDatabaseMetrics collects metrics from all registered databases

type HealthCheckCollector

type HealthCheckCollector struct {
	// contains filtered or unexported fields
}

HealthCheckCollector provides health check metrics

func NewHealthCheckCollector

func NewHealthCheckCollector() *HealthCheckCollector

NewHealthCheckCollector creates a new health check collector

func (*HealthCheckCollector) CollectHealthMetrics

func (c *HealthCheckCollector) CollectHealthMetrics()

CollectHealthMetrics collects health check metrics

func (*HealthCheckCollector) RegisterHealthCheck

func (c *HealthCheckCollector) RegisterHealthCheck(name string, check func() error)

RegisterHealthCheck registers a health check function
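
RegisterHealthCheck accepts a name and a func() error, so a check is considered healthy when it returns nil. The collector's internals are unexported, so the sketch below is only an illustration of the registry pattern behind such an API. The names healthRegistry, register, run, and errUnavailable are hypothetical.

```go
package main

import (
	"errors"
	"sync"
)

// errUnavailable is an example error a failing check might return.
var errUnavailable = errors.New("dependency unavailable")

// healthRegistry stores named health checks.
type healthRegistry struct {
	mu     sync.Mutex
	checks map[string]func() error
}

// newHealthRegistry creates an empty registry.
func newHealthRegistry() *healthRegistry {
	return &healthRegistry{checks: make(map[string]func() error)}
}

// register adds or replaces the check stored under name.
func (r *healthRegistry) register(name string, check func() error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.checks[name] = check
}

// run executes every registered check and returns name -> healthy.
// Checks are copied first so they run without holding the lock.
func (r *healthRegistry) run() map[string]bool {
	r.mu.Lock()
	checks := make(map[string]func() error, len(r.checks))
	for name, check := range r.checks {
		checks[name] = check
	}
	r.mu.Unlock()

	results := make(map[string]bool, len(checks))
	for name, check := range checks {
		results[name] = check() == nil
	}
	return results
}
```

A collector built this way could translate each boolean into a gauge value of 1 or 0 per check name. Running the checks outside the lock keeps a slow dependency from blocking new registrations.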

type MetricsConfig

type MetricsConfig struct {
	// SLA thresholds for different endpoints (in seconds)
	SLAThresholds map[string]float64
	// Skip paths that should not be monitored
	SkipPaths []string
	// Group similar paths to reduce cardinality
	PathGrouping map[string]string
}

MetricsConfig holds configuration for metrics middleware

func DefaultMetricsConfig

func DefaultMetricsConfig() *MetricsConfig

DefaultMetricsConfig returns default configuration
