monitoring

package

Version: v1.1.0 (latest) | Published: Aug 21, 2025 | License: MIT | Imports: 15 | Imported by: 0

Repository

github.com/NHYCRaymond/go-backend-kit

README ¶

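Go Backend Kit - Monitoring System

This package provides a complete monitoring solution: Prometheus metrics collection, Grafana visualization, and SLA monitoring.

Core Problems and Solutions

1. API response time monitoring

Problem: how can the response time of each API be measured accurately?

Solution: use the dedicated Gin middleware, GinMetricsMiddleware.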

// Configure the metrics middleware
metricsConfig := &monitoring.MetricsConfig{
    SLAThresholds: map[string]float64{
        "/api/v1/users":         0.5,  // 500ms SLA
        "/api/v1/user/profile":  0.3,  // 300ms SLA
        "/api/v1/orders":        1.0,  // 1s SLA
    },
    SkipPaths: []string{
        "/health", "/metrics", "/ready", "/live",
    },
    PathGrouping: map[string]string{
        `/api/v1/users/\d+`: "/api/v1/users/:id",  // group paths that differ only by a path parameter
    },
}

// Register it with Gin
router.Use(monitoring.GinMetricsMiddleware(metricsConfig))
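
The PathGrouping entries map a regular expression to the route label that should replace every matching path. The following is a minimal standard-library sketch of that idea, not the package's implementation: groupPath is a hypothetical helper, and anchoring each pattern to the whole path is an assumption about how the middleware matches.

```go
package main

import (
	"fmt"
	"regexp"
)

// groupPath returns the label of the first pattern that matches the whole
// path, or the path unchanged when no pattern matches.
// Assumption: patterns are anchored to the full path. Map iteration order is
// random, so overlapping patterns would need an ordered slice instead.
func groupPath(path string, grouping map[string]string) string {
	for pattern, label := range grouping {
		re, err := regexp.Compile("^(?:" + pattern + ")$")
		if err != nil {
			continue // skip invalid patterns in this sketch
		}
		if re.MatchString(path) {
			return label
		}
	}
	return path
}

func main() {
	grouping := map[string]string{
		`/api/v1/users/\d+`: "/api/v1/users/:id",
	}
	fmt.Println(groupPath("/api/v1/users/42", grouping)) // /api/v1/users/:id
	fmt.Println(groupPath("/api/v1/orders", grouping))   // /api/v1/orders
}
```

A production version would compile each pattern once at startup rather than on every request.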

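2. Key metrics

HTTP request metrics

gin_requests_total - total number of requests (labeled by method, route, status_code)
gin_request_duration_seconds - request duration histogram
gin_request_size_bytes - request body size
gin_response_size_bytes - response body size
gin_active_requests - number of requests currently in flight

API endpoint metrics

api_endpoint_requests_total - number of requests per API endpoint
api_endpoint_duration_seconds - response time per API endpoint
sla_violations_total - number of SLA violations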

System metrics

cpu_usage_percent - CPU usage
memory_usage_bytes - memory usage
goroutines_total - number of goroutines
gc_duration_seconds - time spent in garbage collection
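
Values such as the goroutine count, memory usage, and GC time can be read from the Go runtime. The sketch below shows one way to sample them with the standard library; runtimeSnapshot and takeSnapshot are hypothetical names, and which runtime fields the package actually exports is an assumption.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// runtimeSnapshot holds a few runtime readings that could feed gauges.
type runtimeSnapshot struct {
	Goroutines     int
	HeapAllocBytes uint64
	GCPauseTotal   time.Duration
}

// takeSnapshot samples the Go runtime once.
func takeSnapshot() runtimeSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return runtimeSnapshot{
		Goroutines:     runtime.NumGoroutine(),
		HeapAllocBytes: m.HeapAlloc,
		GCPauseTotal:   time.Duration(m.PauseTotalNs),
	}
}

func main() {
	s := takeSnapshot()
	fmt.Printf("goroutines=%d heap_alloc_bytes=%d gc_pause_total=%s\n",
		s.Goroutines, s.HeapAllocBytes, s.GCPauseTotal)
}
```

runtime.ReadMemStats briefly stops the world, so sampling every few seconds is usually preferable to sampling per request.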

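Middleware metrics

authentication_attempts_total - number of authentication attempts
rate_limit_hits_total - number of rate limit hits
cache_hits_total / cache_misses_total - cache hits and misses

Business metrics

user_registrations_total - number of user registrations
user_logins_total - number of user logins
active_users - number of active users
errors_total - total number of errors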

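Usage

1. Basic integration

package main

import (
    "log/slog"

    // Assumption: the config package lives at this path in the same module.
    "github.com/NHYCRaymond/go-backend-kit/config"
    "github.com/NHYCRaymond/go-backend-kit/monitoring"
    "github.com/gin-gonic/gin"
)

func main() {
    logger := slog.Default()
    router := gin.New()

    // 1. Register the metrics middleware first
    router.Use(monitoring.GinMetricsMiddleware(nil)) // nil selects the default configuration

    // 2. Register the remaining middleware
    router.Use(gin.Logger())
    router.Use(gin.Recovery())

    // 3. Start the metrics server
    monitoring.StartServer(&config.MonitoringConfig{
        Port: 9090,
        Path: "/metrics",
    }, logger)
}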

2. Custom configuration

// Named metricsConfig so it does not shadow an imported config package.
metricsConfig := &monitoring.MetricsConfig{
    SLAThresholds: map[string]float64{
        "/api/v1/users":    0.5,  // 500ms
        "/api/v1/orders":   1.0,  // 1s
    },
    SkipPaths: []string{"/health", "/metrics"},
    PathGrouping: map[string]string{
        `/api/v1/users/\d+`: "/api/v1/users/:id",
    },
}

router.Use(monitoring.GinMetricsMiddleware(metricsConfig))

3. Recording business metrics

// Record metrics from within business logic.
// performLogin and performRegistration stand for application-specific code.
func loginHandler(c *gin.Context) {
    // Business logic
    success := performLogin(c)

    if success {
        monitoring.RecordUserLogin("success")
    } else {
        monitoring.RecordUserLogin("failed")
    }
}

func registerHandler(c *gin.Context) {
    // Business logic
    performRegistration(c)

    monitoring.RecordUserRegistration()
}

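Prometheus Query Examples

1. API response time monitoring

# 95th percentile response time (one result per method and route)
histogram_quantile(0.95, rate(gin_request_duration_seconds_bucket[5m]))

# Average response time, grouped by route
sum by (route) (rate(gin_request_duration_seconds_sum[5m])) / sum by (route) (rate(gin_request_duration_seconds_count[5m]))

# Fraction of slow requests (> 1 second): buckets are cumulative, so le="1" counts the fast requests
# (Prometheus 3.x normalizes this label value to le="1.0")
1 - sum(rate(gin_request_duration_seconds_bucket{le="1"}[5m])) / sum(rate(gin_request_duration_seconds_count[5m]))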
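
Prometheus histogram buckets are cumulative, so the bucket with an upper bound of one second counts the requests that finished within one second; the slow share is whatever remains of the total. A small worked example of that arithmetic, using a hypothetical slowFraction helper and made-up counts:

```go
package main

import "fmt"

// slowFraction returns the share of observations above a bucket bound, given
// the cumulative count at or below the bound and the total count.
func slowFraction(atOrBelow, total float64) float64 {
	if total <= 0 {
		return 0 // no traffic, nothing to report
	}
	return 1 - atOrBelow/total
}

func main() {
	// Illustrative numbers: 750 of 1000 requests finished within 1 second.
	fmt.Printf("%.2f\n", slowFraction(750, 1000)) // 0.25
}
```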

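2. Error rate monitoring

# Overall error ratio (5xx responses / all responses)
# Both sides are aggregated; dividing the raw series would match each 5xx series with itself.
sum(rate(gin_requests_total{status_code=~"5.."}[5m])) / sum(rate(gin_requests_total[5m]))

# Error ratio grouped by route
sum by (route) (rate(gin_requests_total{status_code=~"5.."}[5m])) / sum by (route) (rate(gin_requests_total[5m]))

3. SLA monitoring

# SLA violations per second
rate(sla_violations_total[5m])

# SLA violations per second for a single endpoint
rate(sla_violations_total{endpoint="/api/v1/users"}[5m])

4. System resource monitoring

# Memory usage as a percentage
memory_usage_bytes / memory_total_bytes * 100

# Goroutine count over time
goroutines_total

# Requests currently in flight
gin_active_requests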

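Grafana Integration

1. Importing the dashboard

Import the preconfigured dashboard from the provided monitoring/grafana/dashboard.json file.

2. Key panels

HTTP request rate - real-time request volume
Response time distribution - 95th/50th percentile
Error rate trend - 4xx/5xx errors
SLA violation monitoring - tracks requests that exceeded their threshold
System resource usage - CPU/memory/goroutines
Business metrics - user registrations/logins/active users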

Alerting Rules

1. High response time alert

- alert: HighResponseTime
  expr: histogram_quantile(0.95, rate(gin_request_duration_seconds_bucket[5m])) > 1
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "High response time detected"

2. High error rate alert

- alert: HighErrorRate
  expr: sum(rate(gin_requests_total{status_code=~"5.."}[5m])) / sum(rate(gin_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"

3. SLA violation alert

- alert: SLAViolation
  expr: rate(sla_violations_total[5m]) > 0.01
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "SLA violation detected"

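Best Practices

1. Controlling label cardinality

Use PathGrouping to group similar paths
Avoid high-cardinality labels (such as user IDs)
Limit the number of distinct label values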
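
The reason cardinality matters is that a metric can produce one time series per combination of label values, so the worst case is the product of the distinct values of each label. A short worked example with a hypothetical estimateSeries helper and illustrative counts:

```go
package main

import "fmt"

// estimateSeries returns the upper bound on time series for one metric:
// the product of the number of distinct values of each label.
func estimateSeries(labelValueCounts ...int) int {
	total := 1
	for _, n := range labelValueCounts {
		total *= n
	}
	return total
}

func main() {
	// method (5 values) x route (20 grouped routes) x status_code (8 values)
	fmt.Println(estimateSeries(5, 20, 8)) // 800

	// The same metric when 10000 raw paths leak into the route label
	fmt.Println(estimateSeries(5, 10000, 8)) // 400000
}
```

Histograms multiply this again by the number of buckets, which is why grouping path parameters has such a large effect.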

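2. Setting SLAs

Set SLA thresholds that reflect business requirements
Distinguish the SLA requirements of different kinds of APIs
Review and adjust SLA thresholds regularly

3. Monitoring coverage

Monitor every critical business flow
Include both infrastructure and application-level metrics
Set appropriate alert thresholds

4. Performance

Clean up stale metrics regularly
Use an appropriate scrape interval
Monitor the performance of Prometheus itself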

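Troubleshooting

1. Missing metrics

Check that the middleware is registered correctly
Confirm that the routes are configured correctly
Verify the Prometheus scrape configuration

2. High cardinality

Check that path parameters are grouped correctly
Limit dynamic label values
Use aggregation (recording) rules to reduce storage

3. Performance problems

Measure the overhead of metrics collection
Reduce the number of labels and label values
Consider a sampling strategy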

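Extensions

1. Distributed tracing

Jaeger or OpenTelemetry can be integrated for distributed tracing.

2. Log aggregation

Loki can be integrated for log aggregation and correlation.

3. Custom metrics

Custom metrics can be added to suit business requirements.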

Documentation ¶

Index ¶

Variables
func GetAPIMetrics() map[string]interface{}
func GetCacheHitRatio(cacheType string) float64
func GinMetricsMiddleware(config *MetricsConfig) gin.HandlerFunc
func HTTPMetricsMiddleware() func(next http.Handler) http.Handler
func RecordAPICall(service, operation, status string)
func RecordAPICallWithContext(endpoint, method, status, version string, duration time.Duration)
func RecordAuthenticationAttempt(status, method string)
func RecordCacheHit(cacheType, key string)
func RecordCacheMiss(cacheType, key string)
func RecordDBMetrics(operation string, duration time.Duration, err error)
func RecordError(errorType, component, severity string)
func RecordExternalServiceCall(service, endpoint, status string, duration time.Duration)
func RecordMessageQueueMetrics(queue, operation string, duration time.Duration, err error)
func RecordRateLimitHit(key, endpoint string)
func RecordRedisMetrics(command string, duration time.Duration, err error)
func RecordSLAViolation(endpoint string, threshold float64)
func RecordUserLogin(status string)
func RecordUserRegistration()
func SetActiveUsers(count float64)
func SetAppInfo(version, instanceID string)
func SetDBConnections(database string, active, idle int)
func SetRedisConnections(active int)
func StartServer(cfg *config.MonitoringConfig, logger *slog.Logger)
type DatabaseMetricsCollector
- func NewDatabaseMetricsCollector(databases map[string]interface{}) *DatabaseMetricsCollector
- func (c *DatabaseMetricsCollector) CollectDatabaseMetrics()
type HealthCheckCollector
- func NewHealthCheckCollector() *HealthCheckCollector
- func (c *HealthCheckCollector) CollectHealthMetrics()
- func (c *HealthCheckCollector) RegisterHealthCheck(name string, check func() error)
type MetricsConfig
- func DefaultMetricsConfig() *MetricsConfig

Constants ¶

This section is empty.

Variables ¶

var (
	// Gin specific metrics
	GinRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "gin_requests_total",
			Help: "Total number of HTTP requests processed by Gin",
		},
		[]string{"method", "route", "status_code"},
	)

	GinRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "gin_request_duration_seconds",
			Help:    "Duration of HTTP requests processed by Gin",
			Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
		},
		[]string{"method", "route"},
	)

	GinRequestSize = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "gin_request_size_bytes",
			Help:    "Size of HTTP requests processed by Gin",
			Buckets: prometheus.ExponentialBuckets(1024, 2, 10),
		},
		[]string{"method", "route"},
	)

	GinResponseSize = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "gin_response_size_bytes",
			Help:    "Size of HTTP responses processed by Gin",
			Buckets: prometheus.ExponentialBuckets(1024, 2, 10),
		},
		[]string{"method", "route"},
	)

	GinActiveRequests = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "gin_active_requests",
			Help: "Number of active HTTP requests being processed by Gin",
		},
		[]string{"method", "route"},
	)

	// API endpoint specific metrics
	APIEndpointMetrics = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "api_endpoint_requests_total",
			Help: "Total number of requests per API endpoint",
		},
		[]string{"endpoint", "method", "status_code", "version"},
	)

	APIEndpointDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "api_endpoint_duration_seconds",
			Help:    "Duration of API endpoint requests",
			Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
		},
		[]string{"endpoint", "method", "version"},
	)

	// SLA metrics
	SLAViolations = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "sla_violations_total",
			Help: "Total number of SLA violations",
		},
		[]string{"endpoint", "threshold"},
	)
)

var (
	// HTTP metrics
	HTTPRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	HTTPRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Duration of HTTP requests",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)

	// Database metrics
	DBConnectionsActive = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "db_connections_active",
			Help: "Number of active database connections",
		},
		[]string{"database"},
	)

	DBConnectionsIdle = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "db_connections_idle",
			Help: "Number of idle database connections",
		},
		[]string{"database"},
	)

	DBQueryDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "db_query_duration_seconds",
			Help:    "Duration of database queries",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"operation"},
	)

	DBQueryTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "db_query_total",
			Help: "Total number of database queries",
		},
		[]string{"operation", "status"},
	)

	// Redis metrics
	RedisCommandsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "redis_commands_total",
			Help: "Total number of Redis commands",
		},
		[]string{"command", "status"},
	)

	RedisCommandsDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "redis_commands_duration_seconds",
			Help:    "Duration of Redis commands",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"command"},
	)

	RedisConnectionsActive = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "redis_connections_active",
			Help: "Number of active Redis connections",
		},
	)

	// Message queue metrics
	MessageQueuePublishedTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "message_queue_published_total",
			Help: "Total number of messages published to queue",
		},
		[]string{"queue", "status"},
	)

	MessageQueueConsumedTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "message_queue_consumed_total",
			Help: "Total number of messages consumed from queue",
		},
		[]string{"queue", "status"},
	)

	MessageQueueProcessingDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "message_queue_processing_duration_seconds",
			Help:    "Duration of message processing",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"queue"},
	)

	// Application metrics
	AppInfo = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "app_info",
			Help: "Application information",
		},
		[]string{"version", "instance_id"},
	)

	AppUptime = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "app_uptime_seconds",
			Help: "Application uptime in seconds",
		},
	)

	// Custom business metrics
	ActiveUsers = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "active_users",
			Help: "Number of active users",
		},
	)

	RequestsInFlight = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "requests_in_flight",
			Help: "Number of requests currently being processed",
		},
	)

	// System metrics
	CPUUsage = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "cpu_usage_percent",
			Help: "Current CPU usage percentage",
		},
	)

	MemoryUsage = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "memory_usage_bytes",
			Help: "Current memory usage in bytes",
		},
	)

	MemoryTotal = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "memory_total_bytes",
			Help: "Total available memory in bytes",
		},
	)

	Goroutines = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "goroutines_total",
			Help: "Number of goroutines currently running",
		},
	)

	GCDuration = promauto.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "gc_duration_seconds",
			Help:    "Time spent in garbage collection",
			Buckets: prometheus.DefBuckets,
		},
	)

	// Middleware metrics
	AuthenticationAttempts = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "authentication_attempts_total",
			Help: "Total number of authentication attempts",
		},
		[]string{"status", "method"},
	)

	RateLimitHits = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "rate_limit_hits_total",
			Help: "Total number of rate limit hits",
		},
		[]string{"key", "endpoint"},
	)

	CacheHits = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "cache_hits_total",
			Help: "Total number of cache hits",
		},
		[]string{"cache_type", "key"},
	)

	CacheMisses = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "cache_misses_total",
			Help: "Total number of cache misses",
		},
		[]string{"cache_type", "key"},
	)

	// Business domain metrics
	UserRegistrations = promauto.NewCounter(
		prometheus.CounterOpts{
			Name: "user_registrations_total",
			Help: "Total number of user registrations",
		},
	)

	UserLogins = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "user_logins_total",
			Help: "Total number of user logins",
		},
		[]string{"status"},
	)

	APICallsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "api_calls_total",
			Help: "Total number of API calls",
		},
		[]string{"service", "operation", "status"},
	)

	ErrorsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "errors_total",
			Help: "Total number of errors",
		},
		[]string{"type", "component", "severity"},
	)

	// External service metrics
	ExternalServiceCalls = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "external_service_calls_total",
			Help: "Total number of external service calls",
		},
		[]string{"service", "endpoint", "status"},
	)

	ExternalServiceDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "external_service_duration_seconds",
			Help:    "Duration of external service calls",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"service", "endpoint"},
	)
)

Functions ¶

func GetAPIMetrics ¶

func GetAPIMetrics() map[string]interface{}

GetAPIMetrics returns current API metrics for health checks

func GetCacheHitRatio ¶

func GetCacheHitRatio(cacheType string) float64

GetCacheHitRatio calculates cache hit ratio
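
A hit ratio is conventionally hits divided by hits plus misses. The sketch below shows that arithmetic with a hypothetical hitRatio helper; how GetCacheHitRatio reads its counters and what it returns when there have been no lookups are not documented here, so returning 0 in that case is an assumption.

```go
package main

import "fmt"

// hitRatio returns hits / (hits + misses), or 0 when there were no lookups.
func hitRatio(hits, misses float64) float64 {
	total := hits + misses
	if total == 0 {
		return 0 // assumption: an unused cache reports a ratio of 0
	}
	return hits / total
}

func main() {
	fmt.Println(hitRatio(3, 1)) // 0.75
	fmt.Println(hitRatio(0, 0)) // 0
}
```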

func GinMetricsMiddleware ¶

func GinMetricsMiddleware(config *MetricsConfig) gin.HandlerFunc

GinMetricsMiddleware returns a Gin middleware that collects metrics

func HTTPMetricsMiddleware ¶

func HTTPMetricsMiddleware() func(next http.Handler) http.Handler

HTTPMetricsMiddleware returns a middleware that collects HTTP metrics

func RecordAPICall ¶

func RecordAPICall(service, operation, status string)

RecordAPICall records API call metrics

func RecordAPICallWithContext ¶

func RecordAPICallWithContext(endpoint, method, status, version string, duration time.Duration)

RecordAPICallWithContext records an API call with additional context

func RecordAuthenticationAttempt ¶

func RecordAuthenticationAttempt(status, method string)

RecordAuthenticationAttempt records authentication attempt metrics

func RecordCacheHit ¶

func RecordCacheHit(cacheType, key string)

RecordCacheHit records cache hit metrics

func RecordCacheMiss ¶

func RecordCacheMiss(cacheType, key string)

RecordCacheMiss records cache miss metrics

func RecordDBMetrics ¶

func RecordDBMetrics(operation string, duration time.Duration, err error)

RecordDBMetrics records database metrics

func RecordError ¶

func RecordError(errorType, component, severity string)

RecordError records error metrics

func RecordExternalServiceCall ¶

func RecordExternalServiceCall(service, endpoint, status string, duration time.Duration)

RecordExternalServiceCall records external service call metrics

func RecordMessageQueueMetrics ¶

func RecordMessageQueueMetrics(queue, operation string, duration time.Duration, err error)

RecordMessageQueueMetrics records message queue metrics

func RecordRateLimitHit ¶

func RecordRateLimitHit(key, endpoint string)

RecordRateLimitHit records rate limit hit metrics

func RecordRedisMetrics ¶

func RecordRedisMetrics(command string, duration time.Duration, err error)

RecordRedisMetrics records Redis metrics

func RecordSLAViolation ¶

func RecordSLAViolation(endpoint string, threshold float64)

RecordSLAViolation records an SLA violation

func RecordUserLogin ¶

func RecordUserLogin(status string)

RecordUserLogin records user login metrics

func RecordUserRegistration ¶

func RecordUserRegistration()

RecordUserRegistration records user registration metrics

func SetActiveUsers ¶

func SetActiveUsers(count float64)

SetActiveUsers sets the number of active users

func SetAppInfo ¶

func SetAppInfo(version, instanceID string)

SetAppInfo sets application information

func SetDBConnections ¶

func SetDBConnections(database string, active, idle int)

SetDBConnections sets database connection metrics

func SetRedisConnections ¶

func SetRedisConnections(active int)

SetRedisConnections sets Redis connection metrics

func StartServer ¶

func StartServer(cfg *config.MonitoringConfig, logger *slog.Logger)

StartServer starts the Prometheus metrics server

Types ¶

type DatabaseMetricsCollector ¶

type DatabaseMetricsCollector struct {
	// contains filtered or unexported fields
}

DatabaseMetricsCollector collects database metrics from database instances

func NewDatabaseMetricsCollector ¶

func NewDatabaseMetricsCollector(databases map[string]interface{}) *DatabaseMetricsCollector

NewDatabaseMetricsCollector creates a new database metrics collector

func (*DatabaseMetricsCollector) CollectDatabaseMetrics ¶

func (c *DatabaseMetricsCollector) CollectDatabaseMetrics()

CollectDatabaseMetrics collects metrics from all registered databases

type HealthCheckCollector ¶

type HealthCheckCollector struct {
	// contains filtered or unexported fields
}

HealthCheckCollector provides health check metrics

func NewHealthCheckCollector ¶

func NewHealthCheckCollector() *HealthCheckCollector

NewHealthCheckCollector creates a new health check collector

func (*HealthCheckCollector) CollectHealthMetrics ¶

func (c *HealthCheckCollector) CollectHealthMetrics()

CollectHealthMetrics collects health check metrics

func (*HealthCheckCollector) RegisterHealthCheck ¶

func (c *HealthCheckCollector) RegisterHealthCheck(name string, check func() error)

RegisterHealthCheck registers a health check function

type MetricsConfig ¶

type MetricsConfig struct {
	// SLA thresholds for different endpoints (in seconds)
	SLAThresholds map[string]float64
	// Skip paths that should not be monitored
	SkipPaths []string
	// Group similar paths to reduce cardinality
	PathGrouping map[string]string
}

MetricsConfig holds configuration for metrics middleware

func DefaultMetricsConfig ¶

func DefaultMetricsConfig() *MetricsConfig

DefaultMetricsConfig returns default configuration
