evaluation

package v0.0.0-...-140e820
Published: Feb 4, 2026 License: Apache-2.0 Imports: 14 Imported by: 0

Documentation

Overview

Package evaluation provides an A/B testing framework for comparing agent variants.

The A/B testing framework provides statistical significance testing, effect size calculation, and automated experiment orchestration for comparing agent performance.

Example:

// Create A/B test
abTest := evaluation.NewABTest(
    "prompt_comparison",
    controlAgent,
    treatmentAgent,
    []string{"accuracy"},
    evaluation.SignificanceLevel005,
    evaluation.TestTypeTTest,
)

// Run experiment
results, _ := abTest.Run(testCases, 50, true)

// Check significance
if results["accuracy"].IsSignificant() {
    fmt.Printf("Winner: %s\n", results["accuracy"].Winner())
}

Package evaluation provides tools for evaluating and optimizing agent performance. Bayesian Optimization uses probabilistic models to efficiently find optimal hyperparameter configurations.

Package evaluation provides comprehensive evaluation capabilities for autonomous agents.

Designed for measuring agent quality and performance, with a special focus on extreme-scale context evaluation (1M-25M+ tokens) for systems like endless.

Example:

evaluator := evaluation.NewEvaluator(agent, nil, "")
testCases := []map[string]interface{}{
    {"input": "What is 2+2?", "expected": "4"},
}
result, _ := evaluator.Evaluate(testCases, "")
fmt.Printf("Accuracy: %.2f\n", result.Accuracy)

Package evaluation provides detailed metrics tracking for agent evaluation.

This module extends core evaluation with enhanced metric tracking, including:

  • Session status tracking (running, completed, failed, etc.)
  • Error collection and analysis
  • Metric type categorization
  • Cross-session aggregation

Key use case: "How do you know a long-running agent succeeded?"

Example:

result := evaluation.NewSessionResult("session-123", "my-agent")
result.AddMetricMeasurement(evaluation.NewMetricMeasurement(
    "accuracy",
    0.95,
    evaluation.MetricTypeSuccessRate,
))
result.SetStatus(evaluation.SessionStatusCompleted)

Package evaluation provides the Automated Optimization Framework.

This module provides intelligent optimization of agent configurations, prompts, and hyperparameters using Bayesian optimization, genetic algorithms, and other search strategies.

Interfaces:

  • Optimizer: Base interface for optimization algorithms

Implementations:

  • RandomSearchOptimizer: Baseline random search
  • BayesianOptimizer: Bayesian optimization (in bayesian_optimizer.go)

Example:

searchSpace := evaluation.NewSearchSpace()
searchSpace.AddContinuous("temperature", 0.0, 1.0)
searchSpace.AddContinuous("top_p", 0.0, 1.0)

optimizer := evaluation.NewRandomSearchOptimizer(
    objectiveFunc,
    searchSpace,
    true, // maximize
)

result, err := optimizer.Optimize(ctx, 50)
fmt.Printf("Best config: %v\n", result.BestConfig)
fmt.Printf("Best score: %.3f\n", result.BestScore)

Package evaluation provides the Prompt Optimization Framework.

Automatically improve prompts through systematic variation and testing using grid search, random search, or genetic algorithms.

Example:

template := `You are a {role}.
{instructions}`

variations := map[string][]string{
    "role":         {"helpful assistant", "expert advisor"},
    "instructions": {"Be concise.", "Be detailed."},
}

optimizer := evaluation.NewPromptOptimizer(
    template,
    variations,
    func(prompt string) agenkit.Agent { return MyAgent{SystemPrompt: prompt} },
    []string{"accuracy"},
    nil,
)

result, _ := optimizer.Optimize(ctx, testCases, "grid", nil)
Example (AccuracyMetric)

Example_accuracyMetric demonstrates accuracy measurement

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
	name string
}

func (a *ExampleAgent) Name() string {
	return a.name
}

func (a *ExampleAgent) Capabilities() []string {
	return []string{"question-answering"}
}

func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
	return agenkit.DefaultIntrospectionResult(a)
}

func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {

	content := msg.Content
	var response string

	switch content {
	case "What is the capital of France?":
		response = "The capital of France is Paris."
	case "What is 2+2?":
		response = "2+2 equals 4."
	default:
		response = "I don't know the answer to that question."
	}

	return &agenkit.Message{
		Role:    "agent",
		Content: response,
	}, nil
}

func main() {
	metric := evaluation.NewAccuracyMetric(nil, false)
	agent := &ExampleAgent{name: "qa-agent"}

	input := &agenkit.Message{
		Role:    "user",
		Content: "What is the capital of France?",
	}
	output := &agenkit.Message{
		Role:    "agent",
		Content: "The capital of France is Paris.",
	}

	ctx := map[string]interface{}{
		"expected": "Paris",
	}

	score, err := metric.Measure(agent, input, output, ctx)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Accuracy score: %.0f\n", score)

}
Output:

Accuracy score: 1
Example (BenchmarkSuite)

Example_benchmarkSuite demonstrates running benchmark suites

package main

import (
	"fmt"
	"log"

	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

func main() {
	// Create standard benchmark suite
	suite := evaluation.BenchmarkSuiteStandard()

	// Generate all test cases
	testCases, err := suite.GenerateAllTestCases()
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Generated %d test cases from standard suite\n", len(testCases))

}
Output:

Generated 19 test cases from standard suite
Example (CompressionMetrics)

Example_compressionMetrics demonstrates compression evaluation

package main

import (
	"fmt"

	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

func main() {
	// Create metrics for small test lengths
	testLengths := []int{1000, 5000, 10000}
	metric := evaluation.NewCompressionMetrics(testLengths, 3)

	fmt.Printf("Metric name: %s\n", metric.Name())
	fmt.Printf("Testing %d scale points\n", len(testLengths))

}
Output:

Metric name: compression_quality
Testing 3 scale points
Example (ContextMetrics)

Example_contextMetrics demonstrates context tracking

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
	name string
}

func (a *ExampleAgent) Name() string {
	return a.name
}

func (a *ExampleAgent) Capabilities() []string {
	return []string{"question-answering"}
}

func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
	return agenkit.DefaultIntrospectionResult(a)
}

func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {

	content := msg.Content
	var response string

	switch content {
	case "What is the capital of France?":
		response = "The capital of France is Paris."
	case "What is 2+2?":
		response = "2+2 equals 4."
	default:
		response = "I don't know the answer to that question."
	}

	return &agenkit.Message{
		Role:    "agent",
		Content: response,
	}, nil
}

func main() {
	metric := evaluation.NewContextMetrics()

	agent := &ExampleAgent{name: "test-agent"}
	input := &agenkit.Message{
		Role:    "user",
		Content: "Test query",
		Metadata: map[string]interface{}{
			"context_length": 1000.0,
		},
	}
	output := &agenkit.Message{
		Role:    "agent",
		Content: "Test response",
	}

	// Measure context length
	length, err := metric.Measure(agent, input, output, nil)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Context length: %.0f tokens\n", length)

	// Simulate growing context over time
	measurements := []float64{1000, 1200, 1400, 1600, 1800}
	aggregated := metric.Aggregate(measurements)

	fmt.Printf("Mean context: %.0f tokens\n", aggregated["mean"])
	fmt.Printf("Growth rate: %.0f tokens/interaction\n", aggregated["growth_rate"])

}
Output:

Context length: 1000 tokens
Mean context: 1400 tokens
Growth rate: 160 tokens/interaction
Example (Evaluator)

ExampleEvaluator demonstrates basic evaluation

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
	name string
}

func (a *ExampleAgent) Name() string {
	return a.name
}

func (a *ExampleAgent) Capabilities() []string {
	return []string{"question-answering"}
}

func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
	return agenkit.DefaultIntrospectionResult(a)
}

func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {

	content := msg.Content
	var response string

	switch content {
	case "What is the capital of France?":
		response = "The capital of France is Paris."
	case "What is 2+2?":
		response = "2+2 equals 4."
	default:
		response = "I don't know the answer to that question."
	}

	return &agenkit.Message{
		Role:    "agent",
		Content: response,
	}, nil
}

func main() {
	agent := &ExampleAgent{name: "qa-agent"}

	// Create evaluator with metrics
	metrics := []evaluation.Metric{
		evaluation.NewAccuracyMetric(nil, false),
		evaluation.NewQualityMetrics(false, "", nil),
	}

	evaluator := evaluation.NewEvaluator(agent, metrics, "session-123")

	// Define test cases
	testCases := []map[string]interface{}{
		{
			"input":    "What is the capital of France?",
			"expected": "Paris",
		},
		{
			"input":    "What is 2+2?",
			"expected": "4",
		},
	}

	// Run evaluation
	result, err := evaluator.Evaluate(testCases, "")
	if err != nil {
		log.Fatal(err)
	}

	// Print results
	fmt.Printf("Total Tests: %d\n", result.TotalTests)
	fmt.Printf("Passed: %d\n", result.PassedTests)
	if result.Accuracy != nil {
		fmt.Printf("Accuracy: %.2f\n", *result.Accuracy)
	}

}
Output:

Total Tests: 2
Passed: 2
Accuracy: 1.00
Example (EvaluatorContinuous)

Example_evaluatorContinuous demonstrates continuous evaluation

package main

import (
	"context"
	"fmt"

	"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
	name string
}

func (a *ExampleAgent) Name() string {
	return a.name
}

func (a *ExampleAgent) Capabilities() []string {
	return []string{"question-answering"}
}

func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
	return agenkit.DefaultIntrospectionResult(a)
}

func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {

	content := msg.Content
	var response string

	switch content {
	case "What is the capital of France?":
		response = "The capital of France is Paris."
	case "What is 2+2?":
		response = "2+2 equals 4."
	default:
		response = "I don't know the answer to that question."
	}

	return &agenkit.Message{
		Role:    "agent",
		Content: response,
	}, nil
}

func main() {
	agent := &ExampleAgent{name: "qa-agent"}
	metrics := []evaluation.Metric{
		evaluation.NewAccuracyMetric(nil, false),
	}

	evaluator := evaluation.NewEvaluator(agent, metrics, "continuous-session")

	// Set up regression detection
	detector := evaluation.NewRegressionDetector(nil, nil)

	// Baseline evaluation
	baselineTests := []map[string]interface{}{
		{
			"input":    "What is the capital of France?",
			"expected": "Paris",
		},
	}

	baselineResult, _ := evaluator.Evaluate(baselineTests, "")
	detector.SetBaseline(baselineResult)
	fmt.Println("Baseline established")

	// Current evaluation
	currentTests := baselineTests // Same tests
	currentResult, _ := evaluator.Evaluate(currentTests, "")

	// Detect regressions
	regressions := detector.Detect(currentResult, true)
	if len(regressions) > 0 {
		fmt.Printf("Detected %d regressions\n", len(regressions))
	} else {
		fmt.Println("No regressions detected")
	}

}
Output:

Baseline established
No regressions detected
Example (LatencyMetric)

Example_latencyMetric demonstrates latency tracking

package main

import (
	"fmt"

	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

func main() {
	metric := evaluation.NewLatencyMetric()

	// Simulate latency measurements
	measurements := []float64{
		95, 102, 98, 105, 110, 97, 103, 120, 99, 101,
		104, 108, 96, 100, 115, 102, 98, 105, 130, 99,
	}

	aggregated := metric.Aggregate(measurements)

	fmt.Printf("Latency Statistics:\n")
	fmt.Printf("Mean: %.0fms\n", aggregated["mean"])
	fmt.Printf("p50:  %.0fms\n", aggregated["p50"])
	fmt.Printf("p95:  %.0fms\n", aggregated["p95"])
	fmt.Printf("p99:  %.0fms\n", aggregated["p99"])

}
Output:

Latency Statistics:
Mean: 104ms
p50:  102ms
p95:  130ms
p99:  130ms
Example (PrecisionRecallMetric)

Example_precisionRecallMetric demonstrates classification evaluation

package main

import (
	"context"
	"fmt"

	"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
	name string
}

func (a *ExampleAgent) Name() string {
	return a.name
}

func (a *ExampleAgent) Capabilities() []string {
	return []string{"question-answering"}
}

func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
	return agenkit.DefaultIntrospectionResult(a)
}

func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {

	content := msg.Content
	var response string

	switch content {
	case "What is the capital of France?":
		response = "The capital of France is Paris."
	case "What is 2+2?":
		response = "2+2 equals 4."
	default:
		response = "I don't know the answer to that question."
	}

	return &agenkit.Message{
		Role:    "agent",
		Content: response,
	}, nil
}

func main() {
	metric := evaluation.NewPrecisionRecallMetric()
	agent := &ExampleAgent{name: "classifier"}

	input := &agenkit.Message{Role: "user", Content: "Classify"}
	output := &agenkit.Message{Role: "agent", Content: "Positive"}

	// Simulate classification results
	testCases := []struct {
		trueLabel      bool
		predictedLabel bool
	}{
		{true, true},   // TP
		{true, true},   // TP
		{false, true},  // FP
		{true, false},  // FN
		{false, false}, // TN
	}

	measurements := make([]float64, 0)
	for _, tc := range testCases {
		ctx := map[string]interface{}{
			"true_label":      tc.trueLabel,
			"predicted_label": tc.predictedLabel,
		}
		score, _ := metric.Measure(agent, input, output, ctx)
		measurements = append(measurements, score)
	}

	results := metric.Aggregate(measurements)

	fmt.Printf("Precision: %.2f\n", results["precision"])
	fmt.Printf("Recall: %.2f\n", results["recall"])
	fmt.Printf("F1 Score: %.2f\n", results["f1_score"])

}
Output:

Precision: 0.67
Recall: 0.67
F1 Score: 0.67
Example (QualityMetrics)

Example_qualityMetrics demonstrates quality scoring

package main

import (
	"context"
	"fmt"

	"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
	name string
}

func (a *ExampleAgent) Name() string {
	return a.name
}

func (a *ExampleAgent) Capabilities() []string {
	return []string{"question-answering"}
}

func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
	return agenkit.DefaultIntrospectionResult(a)
}

func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {

	content := msg.Content
	var response string

	switch content {
	case "What is the capital of France?":
		response = "The capital of France is Paris."
	case "What is 2+2?":
		response = "2+2 equals 4."
	default:
		response = "I don't know the answer to that question."
	}

	return &agenkit.Message{
		Role:    "agent",
		Content: response,
	}, nil
}

func main() {
	metric := evaluation.NewQualityMetrics(false, "", nil)
	agent := &ExampleAgent{name: "qa-agent"}

	input := &agenkit.Message{
		Role:    "user",
		Content: "Explain machine learning",
	}

	// Good response
	output1 := &agenkit.Message{
		Role: "agent",
		Content: "Machine learning is a branch of artificial intelligence that " +
			"enables systems to learn from data and improve their performance " +
			"over time without being explicitly programmed.",
	}

	score1, _ := metric.Measure(agent, input, output1, nil)
	fmt.Printf("Good response quality: %.2f\n", score1)

	// Poor response
	output2 := &agenkit.Message{
		Role:    "agent",
		Content: "Yes.",
	}

	score2, _ := metric.Measure(agent, input, output2, nil)
	fmt.Printf("Poor response quality: %.2f\n", score2)

}
Output:

Good response quality: 0.80
Poor response quality: 0.21
Example (RegressionDetector)

Example_regressionDetector demonstrates regression detection

package main

import (
	"fmt"

	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

func main() {
	// Create baseline result
	accuracy := 0.95
	quality := 0.90
	latency := 100.0
	baseline := &evaluation.EvaluationResult{
		Accuracy:     &accuracy,
		QualityScore: &quality,
		AvgLatencyMs: &latency,
	}

	// Create detector with baseline
	detector := evaluation.NewRegressionDetector(nil, baseline)

	// Simulate degraded performance
	newAccuracy := 0.80
	newQuality := 0.85
	newLatency := 150.0
	current := &evaluation.EvaluationResult{
		Accuracy:     &newAccuracy,
		QualityScore: &newQuality,
		AvgLatencyMs: &newLatency,
	}

	// Detect regressions
	regressions := detector.Detect(current, true)

	fmt.Printf("Found %d regressions\n", len(regressions))

}
Output:

Found 2 regressions
Example (SessionRecorder)

Example_sessionRecorder demonstrates session recording

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
	name string
}

func (a *ExampleAgent) Name() string {
	return a.name
}

func (a *ExampleAgent) Capabilities() []string {
	return []string{"question-answering"}
}

func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
	return agenkit.DefaultIntrospectionResult(a)
}

func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {

	content := msg.Content
	var response string

	switch content {
	case "What is the capital of France?":
		response = "The capital of France is Paris."
	case "What is 2+2?":
		response = "2+2 equals 4."
	default:
		response = "I don't know the answer to that question."
	}

	return &agenkit.Message{
		Role:    "agent",
		Content: response,
	}, nil
}

func main() {
	storage := evaluation.NewInMemoryRecordingStorage()
	recorder := evaluation.NewSessionRecorder(storage)

	// Start session
	recorder.StartSession("session-123", "qa-agent", nil)

	// Wrap agent to automatically record
	agent := &ExampleAgent{name: "qa-agent"}
	wrappedAgent := recorder.Wrap(agent)

	// Process message (automatically recorded)
	input := &agenkit.Message{
		Role:    "user",
		Content: "What is AI?",
		Metadata: map[string]interface{}{
			"session_id": "session-123",
		},
	}

	_, _ = wrappedAgent.Process(context.Background(), input)

	// Finalize session
	recording, err := recorder.FinalizeSession("session-123")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Session %s recorded\n", recording.SessionID)
	fmt.Printf("Interactions: %d\n", recording.InteractionCount())

}
Output:

Session session-123 recorded
Interactions: 1

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func CalculateSampleSize

func CalculateSampleSize(baselineMean, minimumDetectableEffect, alpha, power float64, stdDev *float64) int

CalculateSampleSize calculates the required sample size for an A/B test.
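
Example (the numeric values are illustrative; passing a nil stdDev is assumed to fall back to an internal estimate):

// Detect a 0.05 absolute lift over a 0.80 baseline at alpha=0.05 with 80% power.
n := evaluation.CalculateSampleSize(0.80, 0.05, 0.05, 0.80, nil)
fmt.Printf("Required samples per variant: %d\n", n)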

Types

type ABResult

type ABResult struct {
	ExperimentName     string
	ControlVariant     *ABVariant
	TreatmentVariant   *ABVariant
	MetricName         string
	PValue             float64
	TestType           StatisticalTestType
	SignificanceLevel  SignificanceLevel
	EffectSize         float64
	ConfidenceInterval [2]float64
	Timestamp          time.Time
}

ABResult contains results of an A/B test with statistical analysis.

func (*ABResult) ImprovementPercent

func (r *ABResult) ImprovementPercent() float64

ImprovementPercent returns percent improvement of treatment over control.

func (*ABResult) IsSignificant

func (r *ABResult) IsSignificant() bool

IsSignificant checks if result is statistically significant.

func (*ABResult) ToMap

func (r *ABResult) ToMap() map[string]interface{}

ToMap converts result to map for serialization.

func (*ABResult) Winner

func (r *ABResult) Winner() string

Winner returns the name of the winning variant if the result is significant, otherwise an empty string.

type ABTest

type ABTest struct {
	Name              string
	Control           *ABVariant
	Treatment         *ABVariant
	Metrics           []string
	SignificanceLevel SignificanceLevel
	TestType          StatisticalTestType
	Results           map[string]*ABResult
}

ABTest orchestrates A/B experiments comparing agent variants.

func NewABTest

func NewABTest(
	name string,
	controlAgent agenkit.Agent,
	treatmentAgent agenkit.Agent,
	metrics []string,
	significanceLevel SignificanceLevel,
	testType StatisticalTestType,
) *ABTest

NewABTest creates a new A/B test.

func (*ABTest) GetSummary

func (t *ABTest) GetSummary() map[string]interface{}

GetSummary returns experiment summary.

func (*ABTest) Run

func (t *ABTest) Run(testCases []map[string]interface{}, sampleSize int, shuffle bool) (map[string]*ABResult, error)

Run executes the A/B test experiment.

type ABVariant

type ABVariant struct {
	Name     string
	Agent    agenkit.Agent
	Samples  []float64
	Metadata map[string]interface{}
}

ABVariant represents a variant in an A/B test.

func NewABVariant

func NewABVariant(name string, agent agenkit.Agent) *ABVariant

NewABVariant creates a new A/B variant.

func (*ABVariant) AddSample

func (v *ABVariant) AddSample(value float64)

AddSample adds a measurement sample.

func (*ABVariant) Mean

func (v *ABVariant) Mean() float64

Mean returns the mean of samples.

func (*ABVariant) SampleSize

func (v *ABVariant) SampleSize() int

SampleSize returns the number of samples.

func (*ABVariant) StdDev

func (v *ABVariant) StdDev() float64

StdDev returns the standard deviation of samples.
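
Example (manually populating a variant; controlAgent is assumed to implement agenkit.Agent, and the sample values are illustrative):

variant := evaluation.NewABVariant("control", controlAgent)
variant.AddSample(0.91)
variant.AddSample(0.88)
variant.AddSample(0.93)
fmt.Printf("n=%d mean=%.2f stddev=%.3f\n", variant.SampleSize(), variant.Mean(), variant.StdDev())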

type AccuracyMetric

type AccuracyMetric struct {
	// contains filtered or unexported fields
}

AccuracyMetric measures task accuracy.

Compares agent output to expected output to determine correctness. Supports multiple validation methods:

  • Exact string matching
  • Substring matching (case-insensitive)
  • Custom validator functions

Example:

metric := NewAccuracyMetric(nil, false)
score, _ := metric.Measure(
    agent,
    inputMsg,
    outputMsg,
    map[string]interface{}{"expected": "Paris"},
)
fmt.Printf("Accuracy: %.0f\n", score)  // 0.0 or 1.0

func NewAccuracyMetric

func NewAccuracyMetric(validator ValidatorFunc, caseSensitive bool) *AccuracyMetric

NewAccuracyMetric creates a new accuracy metric.

Args:

validator: Custom validation function(expected, actual) -> bool
caseSensitive: Whether string matching is case-sensitive

Example:

metric := NewAccuracyMetric(nil, false)

func (*AccuracyMetric) Aggregate

func (m *AccuracyMetric) Aggregate(measurements []float64) map[string]float64

Aggregate aggregates accuracy measurements.

Args:

measurements: List of 0.0/1.0 values

Returns:

Accuracy statistics: accuracy, total, correct, incorrect
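
Example (keys follow the statistics listed above):

stats := metric.Aggregate([]float64{1, 1, 0, 1})
fmt.Printf("accuracy=%.2f (%.0f/%.0f correct)\n", stats["accuracy"], stats["correct"], stats["total"])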

func (*AccuracyMetric) Measure

func (m *AccuracyMetric) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)

Measure measures accuracy for single interaction.

Args:

agent: Agent being evaluated
inputMessage: Input to agent
outputMessage: Agent's response
ctx: Must contain "expected" key with expected output

Returns:

1.0 if correct, 0.0 if incorrect

func (*AccuracyMetric) Name

func (m *AccuracyMetric) Name() string

Name returns the metric name.

type AcquisitionFunction

type AcquisitionFunction string

AcquisitionFunction specifies the acquisition function type for Bayesian optimization.

const (
	// AcquisitionEI represents Expected Improvement
	AcquisitionEI AcquisitionFunction = "ei"
	// AcquisitionUCB represents Upper Confidence Bound
	AcquisitionUCB AcquisitionFunction = "ucb"
	// AcquisitionPI represents Probability of Improvement
	AcquisitionPI AcquisitionFunction = "pi"
)

type AgentFactory

type AgentFactory func(prompt string) agenkit.Agent

AgentFactory is a function that creates an agent from a prompt string.

type BayesianOptimizer

type BayesianOptimizer struct {
	// contains filtered or unexported fields
}

BayesianOptimizer implements Bayesian optimization for hyperparameter tuning.

This implementation uses a simplified surrogate model based on local statistics rather than full Gaussian Process regression. It balances exploration and exploitation through acquisition functions.

Algorithm:

  1. Sample n_initial random configurations
  2. Evaluate and build local statistics
  3. Use acquisition function to select next config
  4. Evaluate new config
  5. Update statistics and repeat

func NewBayesianOptimizer

func NewBayesianOptimizer(config BayesianOptimizerConfig) (*BayesianOptimizer, error)

NewBayesianOptimizer creates a new Bayesian optimizer.

func (*BayesianOptimizer) Optimize

func (b *BayesianOptimizer) Optimize(ctx context.Context, nIterations int) (*OptimizationResult, error)

Optimize runs the Bayesian optimization process.

type BayesianOptimizerConfig

type BayesianOptimizerConfig struct {
	SearchSpace *SearchSpace
	Objective   ObjectiveFunc
	Maximize    bool
	Acquisition AcquisitionFunction
	NInitial    int
	Xi          float64 // Exploration parameter for EI and PI (default: 0.01)
	Kappa       float64 // Exploration parameter for UCB (default: 2.576)
}

BayesianOptimizerConfig contains configuration for BayesianOptimizer.
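
Example (a minimal sketch; searchSpace and objectiveFunc are assumed to be defined as in the package-level optimization example, and the NInitial value is illustrative):

optimizer, err := evaluation.NewBayesianOptimizer(evaluation.BayesianOptimizerConfig{
    SearchSpace: searchSpace,
    Objective:   objectiveFunc,
    Maximize:    true,
    Acquisition: evaluation.AcquisitionEI,
    NInitial:    5,
})
if err != nil {
    log.Fatal(err)
}

result, err := optimizer.Optimize(ctx, 30)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Best score: %.3f\n", result.BestScore)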

type Benchmark

type Benchmark interface {
	// Name returns the benchmark name.
	Name() string

	// Description returns the benchmark description.
	Description() string

	// GenerateTestCases generates test cases for this benchmark.
	//
	// Returns:
	//   List of test cases
	GenerateTestCases() ([]*TestCase, error)
}

Benchmark is the interface for benchmarks.

Benchmarks define test suites for evaluating specific capabilities.

type BenchmarkSuite

type BenchmarkSuite struct {
	// contains filtered or unexported fields
}

BenchmarkSuite is a collection of benchmarks for comprehensive evaluation.

Provides standard and extreme-scale benchmark suites.

func BenchmarkSuiteExtremeScale

func BenchmarkSuiteExtremeScale() *BenchmarkSuite

BenchmarkSuiteExtremeScale creates an extreme-scale benchmark suite for endless.

Tests at 1M-25M+ tokens with compression and retrieval.

func BenchmarkSuiteQuick

func BenchmarkSuiteQuick() *BenchmarkSuite

BenchmarkSuiteQuick creates a quick benchmark suite for fast iteration.

Small test set for rapid feedback during development.

func BenchmarkSuiteStandard

func BenchmarkSuiteStandard() *BenchmarkSuite

BenchmarkSuiteStandard creates a standard benchmark suite.

Includes basic Q&A and small-scale retrieval tests.

func NewBenchmarkSuite

func NewBenchmarkSuite(benchmarks []Benchmark, name string) *BenchmarkSuite

NewBenchmarkSuite creates a new benchmark suite.

Args:

benchmarks: List of benchmarks to include
name: Suite name

Example:

suite := NewBenchmarkSuite([]Benchmark{NewSimpleQABenchmark()}, "custom")

func (*BenchmarkSuite) AddBenchmark

func (s *BenchmarkSuite) AddBenchmark(benchmark Benchmark)

AddBenchmark adds benchmark to suite.

func (*BenchmarkSuite) GenerateAllTestCases

func (s *BenchmarkSuite) GenerateAllTestCases() ([]map[string]interface{}, error)

GenerateAllTestCases generates all test cases from all benchmarks.

Returns:

Combined list of test cases from all benchmarks

func (*BenchmarkSuite) GetBenchmark

func (s *BenchmarkSuite) GetBenchmark(name string) Benchmark

GetBenchmark gets benchmark by name.

func (*BenchmarkSuite) RemoveBenchmark

func (s *BenchmarkSuite) RemoveBenchmark(name string)

RemoveBenchmark removes benchmark from suite.

func (*BenchmarkSuite) ToDict

func (s *BenchmarkSuite) ToDict() map[string]interface{}

ToDict converts suite to dictionary.

type CompressionMetrics

type CompressionMetrics struct {
	// contains filtered or unexported fields
}

CompressionMetrics measures compression quality at extreme scale.

Critical for endless and similar systems that use 100x-1000x compression at 25M+ tokens. Measures:

  • Compression ratio achieved
  • Information retention after compression
  • Retrieval accuracy from compressed context
  • Quality degradation as context grows

Example:

metrics := NewCompressionMetrics([]int{1000000, 10000000, 25000000}, 10)
stats, _ := metrics.EvaluateAtLengths(agent, "session-123", nil)
for length, stat := range stats {
    fmt.Printf("%dM tokens: %.1fx compression\n", length/1e6, stat.CompressionRatio)
}

func NewCompressionMetrics

func NewCompressionMetrics(testLengths []int, needleCount int) *CompressionMetrics

NewCompressionMetrics creates a new compression metrics instance.

Args:

testLengths: Context lengths to test (defaults to 1M, 10M, 25M)
needleCount: Number of "needle" facts to test retrieval

Example:

metrics := NewCompressionMetrics([]int{1000000, 10000000}, 10)

func (*CompressionMetrics) Aggregate

func (m *CompressionMetrics) Aggregate(measurements []float64) map[string]float64

Aggregate aggregates compression ratios.

Args:

measurements: List of compression ratios

Returns:

Statistics: mean, min, max, std

func (*CompressionMetrics) EvaluateAtLengths

func (m *CompressionMetrics) EvaluateAtLengths(agent agenkit.Agent, sessionID string, needleContent []string) (map[int]*CompressionStats, error)

EvaluateAtLengths evaluates compression quality at multiple context lengths.

Tests compression and retrieval at 1M, 10M, 25M tokens to detect quality degradation as context grows.

Args:

agent: Agent with compression capability
sessionID: Session to evaluate
needleContent: Specific facts to test retrieval (optional)

Returns:

Dictionary mapping context_length -> CompressionStats

func (*CompressionMetrics) Measure

func (m *CompressionMetrics) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)

Measure measures compression quality for single interaction.

Returns:

Compression ratio (raw_tokens / compressed_tokens)

func (*CompressionMetrics) Name

func (m *CompressionMetrics) Name() string

Name returns the metric name.

type CompressionStats

type CompressionStats struct {
	RawTokens           int
	CompressedTokens    int
	CompressionRatio    float64
	RetrievalAccuracy   float64
	ContextLengthTested int
	Timestamp           time.Time
}

CompressionStats contains statistics from compression evaluation.

func (*CompressionStats) ToDict

func (c *CompressionStats) ToDict() map[string]interface{}

ToDict converts stats to dictionary.

type ContextMetrics

type ContextMetrics struct{}

ContextMetrics tracks context length and growth over agent lifecycle.

Essential for extreme-scale systems (endless) that operate at 1M-25M+ token contexts. Measures:

  • Raw context token count
  • Compressed context token count (if compression used)
  • Compression ratio
  • Context growth rate

Example:

metrics := NewContextMetrics()
result, _ := metrics.Measure(agent, inputMsg, outputMsg, ctx)
fmt.Printf("Context length: %.0f tokens\n", result)

func NewContextMetrics

func NewContextMetrics() *ContextMetrics

NewContextMetrics creates a new context metrics instance.

func (*ContextMetrics) Aggregate

func (m *ContextMetrics) Aggregate(measurements []float64) map[string]float64

Aggregate aggregates context length measurements.

Args:

measurements: List of context lengths over time

Returns:

Statistics: mean, min, max, final, growth_rate

func (*ContextMetrics) Measure

func (m *ContextMetrics) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)

Measure measures context length metrics.

Args:

agent: Agent being evaluated
inputMessage: Input message
outputMessage: Agent response
ctx: Additional context with session history

Returns:

Current context length in tokens (or compressed tokens if available)

func (*ContextMetrics) Name

func (m *ContextMetrics) Name() string

Name returns the metric name.

type ErrorRecord

type ErrorRecord struct {
	// Type of error
	Type string `json:"type"`
	// Error message
	Message string `json:"message"`
	// Additional details
	Details map[string]interface{} `json:"details,omitempty"`
	// Timestamp when error occurred (RFC3339 format)
	Timestamp string `json:"timestamp"`
}

ErrorRecord represents an error that occurred during evaluation.

func NewErrorRecord

func NewErrorRecord(errorType string, message string, details map[string]interface{}) *ErrorRecord

NewErrorRecord creates a new error record with current timestamp.
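
Example (the error type, message, and details are illustrative):

rec := evaluation.NewErrorRecord("timeout", "provider did not respond within 30s", map[string]interface{}{
    "attempt": 2,
})
fmt.Printf("[%s] %s: %s\n", rec.Timestamp, rec.Type, rec.Message)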

type EvaluationResult

type EvaluationResult struct {
	// Identification
	EvaluationID string
	AgentName    string
	Timestamp    time.Time

	// Metrics
	Metrics           map[string][]float64
	AggregatedMetrics map[string]map[string]float64

	// Context information
	ContextLength    *int
	CompressedLength *int
	CompressionRatio *float64

	// Quality scores
	Accuracy     *float64
	QualityScore *float64

	// Performance
	AvgLatencyMs *float64
	P95LatencyMs *float64

	// Test details
	TotalTests  int
	PassedTests int
	FailedTests int

	// Additional metadata
	Metadata map[string]interface{}
}

EvaluationResult contains results from an evaluation run.

Includes metrics, metadata, and analysis.

func (*EvaluationResult) SuccessRate

func (r *EvaluationResult) SuccessRate() float64

SuccessRate calculates test success rate.

func (*EvaluationResult) ToDict

func (r *EvaluationResult) ToDict() map[string]interface{}

ToDict converts result to dictionary.

type Evaluator

type Evaluator struct {
	// contains filtered or unexported fields
}

Evaluator is the core evaluation orchestrator.

Runs benchmarks, collects metrics, and aggregates results.

Example:

evaluator := NewEvaluator(agent, nil, "")
suite := BenchmarkSuiteStandard()
testCases, _ := suite.GenerateAllTestCases()
results, _ := evaluator.Evaluate(testCases, "")
fmt.Printf("Accuracy: %.2f\n", results.Accuracy)

func NewEvaluator

func NewEvaluator(agent agenkit.Agent, metrics []Metric, sessionID string) *Evaluator

NewEvaluator creates a new evaluator.

Args:

agent: Agent to evaluate
metrics: List of metrics to collect (defaults to empty)
sessionID: Optional session ID for context tracking

Example:

evaluator := NewEvaluator(agent, []Metric{NewAccuracyMetric(nil, false)}, "eval-123")

func (*Evaluator) Evaluate

func (e *Evaluator) Evaluate(testCases []map[string]interface{}, evaluationID string) (*EvaluationResult, error)

Evaluate evaluates agent on test cases.

Args:

testCases: List of test cases, each with 'input' and 'expected' keys
evaluationID: Optional evaluation ID

Returns:

EvaluationResult with metrics and analysis

Example:

testCases := []map[string]interface{}{
    {"input": "What is 2+2?", "expected": "4"},
}
result, err := evaluator.Evaluate(testCases, "")

func (*Evaluator) EvaluateSingle

func (e *Evaluator) EvaluateSingle(inputMessage *agenkit.Message, expectedOutput interface{}) (map[string]float64, error)

EvaluateSingle evaluates single interaction.

Args:

inputMessage: Input to agent
expectedOutput: Expected output (optional)

Returns:

Dictionary of metric values

Example:

inputMsg := &agenkit.Message{Role: "user", Content: "Hello"}
metrics, err := evaluator.EvaluateSingle(inputMsg, "Hi there")

type ExtremeScaleBenchmark

type ExtremeScaleBenchmark struct {
	// contains filtered or unexported fields
}

ExtremeScaleBenchmark is an extreme-scale benchmark for testing at 1M-25M+ tokens.

Designed specifically for endless and similar systems that operate at unprecedented context lengths.

func NewExtremeScaleBenchmark

func NewExtremeScaleBenchmark(testLengths []int, needlesPerLength int) *ExtremeScaleBenchmark

NewExtremeScaleBenchmark creates a new extreme-scale benchmark.

Args:

testLengths: Context lengths to test (defaults to 1M, 10M, 25M)
needlesPerLength: Number of needles per context length

Example:

benchmark := NewExtremeScaleBenchmark([]int{1000000, 10000000}, 10)

func (*ExtremeScaleBenchmark) Description

func (b *ExtremeScaleBenchmark) Description() string

Description returns the benchmark description.

func (*ExtremeScaleBenchmark) GenerateTestCases

func (b *ExtremeScaleBenchmark) GenerateTestCases() ([]*TestCase, error)

GenerateTestCases generates extreme-scale test cases.

func (*ExtremeScaleBenchmark) Name

func (b *ExtremeScaleBenchmark) Name() string

Name returns the benchmark name.

type FileRecordingStorage

type FileRecordingStorage struct {
	// contains filtered or unexported fields
}

FileRecordingStorage provides file-based recording storage.

Stores recordings as JSON files on disk.

func NewFileRecordingStorage

func NewFileRecordingStorage(recordingsDir string) *FileRecordingStorage

NewFileRecordingStorage creates a new file storage.

Args:

recordingsDir: Directory to store recordings

Example:

storage := NewFileRecordingStorage("./recordings")

func (*FileRecordingStorage) DeleteRecording

func (s *FileRecordingStorage) DeleteRecording(sessionID string) error

DeleteRecording deletes recording file.

func (*FileRecordingStorage) ListRecordings

func (s *FileRecordingStorage) ListRecordings(limit, offset int) ([]*SessionRecording, error)

ListRecordings lists all recordings.

func (*FileRecordingStorage) LoadRecording

func (s *FileRecordingStorage) LoadRecording(sessionID string) (*SessionRecording, error)

LoadRecording loads recording from file.

func (*FileRecordingStorage) SaveRecording

func (s *FileRecordingStorage) SaveRecording(recording *SessionRecording) error

SaveRecording saves recording to file.
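
Example (a save/load round trip; recording is assumed to come from SessionRecorder.FinalizeSession):

storage := evaluation.NewFileRecordingStorage("./recordings")
if err := storage.SaveRecording(recording); err != nil {
    log.Fatal(err)
}

loaded, err := storage.LoadRecording(recording.SessionID)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Loaded session %s\n", loaded.SessionID)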

type InMemoryRecordingStorage

type InMemoryRecordingStorage struct {
	// contains filtered or unexported fields
}

InMemoryRecordingStorage provides in-memory recording storage for testing.

Does not persist recordings across restarts.

func NewInMemoryRecordingStorage

func NewInMemoryRecordingStorage() *InMemoryRecordingStorage

NewInMemoryRecordingStorage creates a new in-memory storage.

func (*InMemoryRecordingStorage) DeleteRecording

func (s *InMemoryRecordingStorage) DeleteRecording(sessionID string) error

DeleteRecording deletes recording from memory.

func (*InMemoryRecordingStorage) ListRecordings

func (s *InMemoryRecordingStorage) ListRecordings(limit, offset int) ([]*SessionRecording, error)

ListRecordings lists recordings from memory.

func (*InMemoryRecordingStorage) LoadRecording

func (s *InMemoryRecordingStorage) LoadRecording(sessionID string) (*SessionRecording, error)

LoadRecording loads recording from memory.

func (*InMemoryRecordingStorage) SaveRecording

func (s *InMemoryRecordingStorage) SaveRecording(recording *SessionRecording) error

SaveRecording saves recording to memory.

type InformationRetentionBenchmark

type InformationRetentionBenchmark struct {
	// contains filtered or unexported fields
}

InformationRetentionBenchmark tests information retention across long conversations.

Verifies that agents remember and can recall information from earlier in the conversation, even after compression.

func NewInformationRetentionBenchmark

func NewInformationRetentionBenchmark(conversationLength int, recallPoints []int) *InformationRetentionBenchmark

NewInformationRetentionBenchmark creates a new information retention benchmark.

Args:

conversationLength: Number of conversation turns
recallPoints: Turns at which to test recall (defaults to checkpoints)

Example:

benchmark := NewInformationRetentionBenchmark(100, []int{10, 25, 50, 75, 100})

func (*InformationRetentionBenchmark) Description

func (b *InformationRetentionBenchmark) Description() string

Description returns the benchmark description.

func (*InformationRetentionBenchmark) GenerateTestCases

func (b *InformationRetentionBenchmark) GenerateTestCases() ([]*TestCase, error)

GenerateTestCases generates information retention test cases.

func (*InformationRetentionBenchmark) Name

func (b *InformationRetentionBenchmark) Name() string

Name returns the benchmark name.

type InteractionRecord

type InteractionRecord struct {
	InteractionID string
	SessionID     string
	InputMessage  map[string]interface{}
	OutputMessage map[string]interface{}
	Timestamp     time.Time
	LatencyMs     float64
	Metadata      map[string]interface{}
}

InteractionRecord represents a record of a single agent interaction.

Contains input, output, timing, and metadata.

func InteractionRecordFromDict

func InteractionRecordFromDict(data map[string]interface{}) (*InteractionRecord, error)

InteractionRecordFromDict creates record from dictionary.

func (*InteractionRecord) ToDict

func (r *InteractionRecord) ToDict() map[string]interface{}

ToDict converts record to dictionary.
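
Example (record is assumed to be an existing *InteractionRecord):

data := record.ToDict()
restored, err := evaluation.InteractionRecordFromDict(data)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Restored interaction %s\n", restored.InteractionID)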

type LatencyMetric

type LatencyMetric struct{}

LatencyMetric measures agent response latency.

Tracks processing time per interaction. Critical for production systems where response time matters.

func NewLatencyMetric

func NewLatencyMetric() *LatencyMetric

NewLatencyMetric creates a new latency metric instance.

func (*LatencyMetric) Aggregate

func (m *LatencyMetric) Aggregate(measurements []float64) map[string]float64

Aggregate aggregates latency measurements.

Returns:

mean, p50, p95, p99, min, max

func (*LatencyMetric) Measure

func (m *LatencyMetric) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)

Measure gets latency for this interaction.

Returns:

Latency in milliseconds

func (*LatencyMetric) Name

func (m *LatencyMetric) Name() string

Name returns the metric name.

type Metric

type Metric interface {
	// Name returns the metric name.
	Name() string

	// Measure measures metric for a single agent interaction.
	//
	// Args:
	//   agent: The agent being evaluated
	//   inputMessage: Input to the agent
	//   outputMessage: Agent's response
	//   ctx: Additional context (session history, etc.)
	//
	// Returns:
	//   Metric value (typically 0.0 to 1.0)
	Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)

	// Aggregate aggregates multiple measurements.
	//
	// Args:
	//   measurements: List of individual measurements
	//
	// Returns:
	//   Aggregated statistics (mean, std, min, max, etc.)
	Aggregate(measurements []float64) map[string]float64
}

Metric is the interface for evaluation metrics.

Metrics measure specific aspects of agent performance:

  • Accuracy
  • Latency
  • Context usage
  • Quality scores
  • etc.
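
A minimal custom-metric sketch (the character-count heuristic and the ResponseLengthMetric name are purely illustrative):

// ResponseLengthMetric is a hypothetical metric that scores responses by character count.
type ResponseLengthMetric struct{}

func (m *ResponseLengthMetric) Name() string { return "response_length" }

// Measure returns the length of the agent's response; agent and ctx are unused here.
func (m *ResponseLengthMetric) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error) {
    return float64(len(outputMessage.Content)), nil
}

// Aggregate reports the mean response length across measurements.
func (m *ResponseLengthMetric) Aggregate(measurements []float64) map[string]float64 {
    var sum float64
    for _, v := range measurements {
        sum += v
    }
    mean := 0.0
    if len(measurements) > 0 {
        mean = sum / float64(len(measurements))
    }
    return map[string]float64{"mean": mean}
}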

type MetricMeasurement

type MetricMeasurement struct {
	// Name of the metric
	Name string `json:"name"`
	// Value of the measurement
	Value float64 `json:"value"`
	// Type categorizes the metric
	Type MetricType `json:"type"`
	// Timestamp when measurement was taken (RFC3339 format)
	Timestamp string `json:"timestamp"`
	// Metadata for additional context
	Metadata map[string]interface{} `json:"metadata,omitempty"`
}

MetricMeasurement represents a single metric measurement.

Note: This is distinct from the Metric interface in core.go. The Metric interface defines how to measure; MetricMeasurement stores the measurement.

func CreateCostMetric

func CreateCostMetric(cost float64, currency string, metadata map[string]interface{}) *MetricMeasurement

CreateCostMetric creates a cost metric measurement.

Helper to create a cost metric with currency information.

Args:

cost: Cost amount
currency: Currency code (default: "USD")
metadata: Additional metadata

Returns:

Cost metric measurement

Example:

metric := evaluation.CreateCostMetric(0.0042, "USD", nil)

func CreateDurationMetric

func CreateDurationMetric(durationSeconds float64, metadata map[string]interface{}) *MetricMeasurement

CreateDurationMetric creates a duration metric measurement.

Helper to create a duration metric with hours conversion.

Args:

durationSeconds: Duration in seconds
metadata: Additional metadata

Returns:

Duration metric measurement

Example:

metric := evaluation.CreateDurationMetric(125.5, nil)

func CreateQualityMetric

func CreateQualityMetric(name string, score, maxScore float64, metadata map[string]interface{}) *MetricMeasurement

CreateQualityMetric creates a quality score metric measurement.

Helper to create a quality score metric with normalized score (0.0-1.0).

Args:

name: Metric name
score: Raw score
maxScore: Maximum possible score (default: 10.0)
metadata: Additional metadata

Returns:

Metric measurement with normalized score

Example:

metric := evaluation.CreateQualityMetric("response_quality", 8.5, 10.0, nil)

func NewMetricMeasurement

func NewMetricMeasurement(name string, value float64, metricType MetricType) *MetricMeasurement

NewMetricMeasurement creates a new metric measurement with current timestamp.

type MetricType

type MetricType string

MetricType categorizes different types of metrics.

const (
	// MetricTypeSuccessRate measures success/failure rates
	MetricTypeSuccessRate MetricType = "success_rate"
	// MetricTypeQualityScore measures output quality
	MetricTypeQualityScore MetricType = "quality_score"
	// MetricTypeCost measures token/API costs
	MetricTypeCost MetricType = "cost"
	// MetricTypeDuration measures time taken
	MetricTypeDuration MetricType = "duration"
	// MetricTypeErrorRate measures error frequency
	MetricTypeErrorRate MetricType = "error_rate"
	// MetricTypeTaskCompletion measures task completion
	MetricTypeTaskCompletion MetricType = "task_completion"
	// MetricTypeCustom for custom metrics
	MetricTypeCustom MetricType = "custom"
)

type MetricsCollector

type MetricsCollector struct {
	// contains filtered or unexported fields
}

MetricsCollector aggregates metrics across multiple evaluation sessions.

Useful for analyzing agent performance over time and across different scenarios. Thread-safe for concurrent access.

Example:

collector := evaluation.NewMetricsCollector()
collector.AddResult(result1)
collector.AddResult(result2)
stats := collector.GetStatistics()
fmt.Printf("Success rate: %.2f%%\n", stats["success_rate"]*100)

func NewMetricsCollector

func NewMetricsCollector() *MetricsCollector

NewMetricsCollector creates a new metrics collector.

func (*MetricsCollector) AddResult

func (mc *MetricsCollector) AddResult(result *SessionResult)

AddResult adds a session result to the collector. Thread-safe for concurrent access.

func (*MetricsCollector) Clear

func (mc *MetricsCollector) Clear()

Clear removes all collected results. Thread-safe for concurrent access.

func (*MetricsCollector) GetMetricAggregates

func (mc *MetricsCollector) GetMetricAggregates(metricName string) map[string]interface{}

GetMetricAggregates computes aggregated statistics for a specific metric across all sessions. Thread-safe for concurrent access.

Args:

metricName: Name of the metric to aggregate

Returns:

Map with statistics: count, sum, mean, min, max
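
Example (the "accuracy" metric name is illustrative; keys follow the statistics listed above):

agg := collector.GetMetricAggregates("accuracy")
fmt.Printf("accuracy: mean=%v over %v sessions\n", agg["mean"], agg["count"])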

func (*MetricsCollector) GetResults

func (mc *MetricsCollector) GetResults() []SessionResult

GetResults returns all collected session results. Thread-safe for concurrent access.

func (*MetricsCollector) GetStatistics

func (mc *MetricsCollector) GetStatistics() map[string]interface{}

GetStatistics computes aggregated statistics across all collected results. Thread-safe for concurrent access.

Returns a map with statistics including:

  • session_count: Total number of sessions
  • completed_count: Number of completed sessions
  • failed_count: Number of failed sessions
  • success_rate: Ratio of completed to total sessions
  • avg_duration: Average session duration in seconds
  • total_errors: Total number of errors across all sessions
  • avg_errors_per_session: Average errors per session

type NeedleInHaystackBenchmark

type NeedleInHaystackBenchmark struct {
	// contains filtered or unexported fields
}

NeedleInHaystackBenchmark is a needle-in-haystack benchmark for context retrieval.

Tests ability to retrieve specific information from large contexts. Essential for extreme-scale systems like endless.

func NewNeedleInHaystackBenchmark

func NewNeedleInHaystackBenchmark(contextLength, needleCount, haystackMultiplier int) *NeedleInHaystackBenchmark

NewNeedleInHaystackBenchmark creates a new needle-in-haystack benchmark.

Args:

contextLength: Target context length in tokens
needleCount: Number of needles to hide
haystackMultiplier: How much filler per needle

Example:

benchmark := NewNeedleInHaystackBenchmark(10000, 5, 10)

func (*NeedleInHaystackBenchmark) Description

func (b *NeedleInHaystackBenchmark) Description() string

Description returns the benchmark description.

func (*NeedleInHaystackBenchmark) GenerateTestCases

func (b *NeedleInHaystackBenchmark) GenerateTestCases() ([]*TestCase, error)

GenerateTestCases generates needle-in-haystack test cases.

func (*NeedleInHaystackBenchmark) Name

func (b *NeedleInHaystackBenchmark) Name() string

Name returns the benchmark name.

type ObjectiveFunc

type ObjectiveFunc func(ctx context.Context, config map[string]interface{}) (float64, error)

ObjectiveFunc evaluates a configuration and returns a score.

type OptimizationResult

type OptimizationResult struct {
	BestConfig  map[string]interface{}
	BestScore   float64
	History     []OptimizationStep
	NIterations int
	StartTime   time.Time
	EndTime     time.Time
	Metadata    map[string]interface{}
}

OptimizationResult contains the results of an optimization run.

func (*OptimizationResult) Duration

func (r *OptimizationResult) Duration() time.Duration

Duration returns the total optimization duration.

func (*OptimizationResult) GetImprovement

func (r *OptimizationResult) GetImprovement() float64

GetImprovement returns the improvement from initial to best score.

type OptimizationStep

type OptimizationStep struct {
	Config map[string]interface{}
	Score  float64
}

OptimizationStep represents a single evaluation in the optimization.

type OptimizationStrategy

type OptimizationStrategy string

OptimizationStrategy represents prompt optimization strategies.

const (
	// StrategyGrid exhaustive grid search
	StrategyGrid OptimizationStrategy = "grid"
	// StrategyRandom random sampling
	StrategyRandom OptimizationStrategy = "random"
	// StrategyGenetic genetic algorithm
	StrategyGenetic OptimizationStrategy = "genetic"
)

type Optimizer

type Optimizer interface {
	// Optimize runs the optimization process.
	//
	// Args:
	//   ctx: Context for cancellation
	//   nIterations: Number of iterations to run
	//
	// Returns:
	//   OptimizationResult with best configuration and history
	Optimize(ctx context.Context, nIterations int) (*OptimizationResult, error)
}

Optimizer is the base interface for optimization algorithms.

Implementations should provide intelligent search over configuration spaces using various strategies (random search, Bayesian optimization, genetic algorithms, etc.).

type ParameterSpec

type ParameterSpec struct {
	Type   ParameterType
	Low    float64       // For continuous/integer
	High   float64       // For continuous/integer
	Values []interface{} // For discrete/categorical
}

ParameterSpec defines a parameter in the search space.

type ParameterType

type ParameterType string

ParameterType specifies the type of a hyperparameter.

const (
	// ParamTypeContinuous represents a continuous parameter (float)
	ParamTypeContinuous ParameterType = "continuous"
	// ParamTypeInteger represents an integer parameter
	ParamTypeInteger ParameterType = "integer"
	// ParamTypeDiscrete represents a discrete set of values
	ParamTypeDiscrete ParameterType = "discrete"
	// ParamTypeCategorical represents categorical values
	ParamTypeCategorical ParameterType = "categorical"
)

type PrecisionRecallMetric

type PrecisionRecallMetric struct {
	// contains filtered or unexported fields
}

PrecisionRecallMetric measures precision and recall for classification tasks.

Useful for agents that categorize, filter, or make binary decisions.

Example:

metric := NewPrecisionRecallMetric()
// Agent classifies documents as relevant/not relevant
for _, doc := range testDocs {
    score, _ := metric.Measure(agent, doc, output, ctx)
}

func NewPrecisionRecallMetric

func NewPrecisionRecallMetric() *PrecisionRecallMetric

NewPrecisionRecallMetric creates a new precision/recall metric.

func (*PrecisionRecallMetric) Aggregate

func (m *PrecisionRecallMetric) Aggregate(measurements []float64) map[string]float64

Aggregate aggregates precision/recall metrics.

Returns:

Precision, recall, F1 score, and confusion matrix counts

func (*PrecisionRecallMetric) Measure

func (m *PrecisionRecallMetric) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)

Measure measures precision/recall for single classification.

Context must contain:

  • "true_label": Ground truth (True/False or 1/0)
  • "predicted_label": Agent's prediction (True/False or 1/0)

Returns:

1.0 if correct classification, 0.0 if incorrect

func (*PrecisionRecallMetric) Name

func (m *PrecisionRecallMetric) Name() string

Name returns the metric name.

func (*PrecisionRecallMetric) Reset

func (m *PrecisionRecallMetric) Reset()

Reset resets confusion matrix counts.

type PrecisionRecallStats

type PrecisionRecallStats struct {
	TruePositives  int
	FalsePositives int
	FalseNegatives int
	TrueNegatives  int
}

PrecisionRecallStats contains precision and recall statistics.

func (*PrecisionRecallStats) F1Score

func (s *PrecisionRecallStats) F1Score() float64

F1Score calculates F1 score.

func (*PrecisionRecallStats) Precision

func (s *PrecisionRecallStats) Precision() float64

Precision calculates precision.

func (*PrecisionRecallStats) Recall

func (s *PrecisionRecallStats) Recall() float64

Recall calculates recall.

func (*PrecisionRecallStats) ToDict

func (s *PrecisionRecallStats) ToDict() map[string]float64

ToDict converts stats to dictionary.
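
Example (the confusion-matrix counts are illustrative):

stats := &evaluation.PrecisionRecallStats{
    TruePositives:  8,
    FalsePositives: 2,
    FalseNegatives: 1,
    TrueNegatives:  9,
}
fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", stats.Precision(), stats.Recall(), stats.F1Score())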

type PromptEvaluation

type PromptEvaluation struct {
	Prompt string
	Config map[string]string
	Scores map[string]float64
}

PromptEvaluation represents a single prompt evaluation.

type PromptOptimizationResult

type PromptOptimizationResult struct {
	// BestPrompt is the best prompt found
	BestPrompt string
	// BestConfig is the best variable configuration
	BestConfig map[string]string
	// BestScores contains best metric scores
	BestScores map[string]float64
	// History contains all (prompt, config, scores) tuples
	History []PromptEvaluation
	// NEvaluated is the number of prompts evaluated
	NEvaluated int
	// Strategy used
	Strategy string
	// StartTime in milliseconds
	StartTime int64
	// EndTime in milliseconds
	EndTime int64
}

PromptOptimizationResult contains results from prompt optimization.

func (*PromptOptimizationResult) DurationSeconds

func (r *PromptOptimizationResult) DurationSeconds() float64

DurationSeconds returns the duration in seconds.

func (*PromptOptimizationResult) ToDict

func (r *PromptOptimizationResult) ToDict() map[string]interface{}

ToDict converts result to dictionary.

type PromptOptimizer

type PromptOptimizer struct {
	// contains filtered or unexported fields
}

PromptOptimizer optimizes prompts through systematic variation.

Supports multiple optimization strategies:

  • Grid search: exhaustive evaluation of all combinations
  • Random search: random sampling of combinations
  • Genetic algorithm: evolutionary optimization

func NewPromptOptimizer

func NewPromptOptimizer(
	template string,
	variations map[string][]string,
	agentFactory AgentFactory,
	metrics []string,
	objectiveMetric *string,
) *PromptOptimizer

NewPromptOptimizer creates a new prompt optimizer.

Args:

template: Prompt template with {variable} placeholders
variations: Map of variable names to possible values
agentFactory: Function that creates agent from prompt string
metrics: List of metrics to evaluate
objectiveMetric: Primary metric for optimization (nil = first metric)
The optimization direction is set separately via SetMaximize (maximize or minimize the objective).

func (*PromptOptimizer) Optimize

func (p *PromptOptimizer) Optimize(
	ctx context.Context,
	testCases []map[string]interface{},
	strategy string,
	options map[string]interface{},
) (*PromptOptimizationResult, error)

Optimize runs prompt optimization with the specified strategy.

Args:

ctx: Context for cancellation
testCases: Test cases for evaluation
strategy: Optimization strategy ("grid", "random", "genetic")
options: Strategy-specific options (nSamples, populationSize, etc.)
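
A minimal sketch of a random-search run, assuming an optimizer built with NewPromptOptimizer and a testCases slice; the "nSamples" option key follows the option names listed above but the exact key is an assumption:

result, err := optimizer.Optimize(ctx, testCases, "random", map[string]interface{}{
	"nSamples": 20, // assumed option key
})
if err != nil {
	log.Fatal(err)
}
fmt.Printf("best prompt:\n%s\n", result.BestPrompt)
fmt.Printf("evaluated %d prompts in %.1fs\n", result.NEvaluated, result.DurationSeconds())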

func (*PromptOptimizer) OptimizeGenetic

func (p *PromptOptimizer) OptimizeGenetic(
	ctx context.Context,
	testCases []map[string]interface{},
	populationSize int,
	nGenerations int,
	mutationRate float64,
) (*PromptOptimizationResult, error)

OptimizeGenetic performs genetic algorithm optimization.

func (*PromptOptimizer) OptimizeGrid

func (p *PromptOptimizer) OptimizeGrid(
	ctx context.Context,
	testCases []map[string]interface{},
) (*PromptOptimizationResult, error)

OptimizeGrid performs grid search by evaluating all possible combinations.

func (*PromptOptimizer) OptimizeRandom

func (p *PromptOptimizer) OptimizeRandom(
	ctx context.Context,
	testCases []map[string]interface{},
	nSamples int,
) (*PromptOptimizationResult, error)

OptimizeRandom performs random search by sampling random combinations.

func (*PromptOptimizer) SetMaximize

func (p *PromptOptimizer) SetMaximize(maximize bool)

SetMaximize sets whether to maximize (true) or minimize (false) the objective.

type QualityMetrics

type QualityMetrics struct {
	// contains filtered or unexported fields
}

QualityMetrics provides comprehensive quality scoring.

Evaluates multiple quality dimensions:

  • Relevance: How relevant is response to query?
  • Completeness: Does response answer all parts?
  • Coherence: Is response logically structured?
  • Accuracy: Is information factually correct?

Uses rule-based scoring.

Example:

metric := NewQualityMetrics(false, "", nil)
score, _ := metric.Measure(agent, inputMsg, outputMsg, nil)
fmt.Printf("Quality: %.2f\n", score)  // 0.0 to 1.0

func NewQualityMetrics

func NewQualityMetrics(useLLMJudge bool, judgeModel string, weights map[string]float64) *QualityMetrics

NewQualityMetrics creates a new quality metrics instance.

Args:

useLLMJudge: Use LLM to judge quality (not yet implemented)
judgeModel: Model to use for judging (e.g., "claude-sonnet-4")
weights: Weights for each dimension (relevance, completeness, etc.)

Example:

metric := NewQualityMetrics(false, "", nil)
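
To emphasize particular dimensions, a weights map can be supplied; the keys below mirror the dimension names listed above, though the exact key strings and values are assumptions:

weights := map[string]float64{
	"relevance":    0.4,
	"completeness": 0.3,
	"coherence":    0.2,
	"accuracy":     0.1,
}
metric := NewQualityMetrics(false, "", weights)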

func (*QualityMetrics) Aggregate

func (m *QualityMetrics) Aggregate(measurements []float64) map[string]float64

Aggregate aggregates quality measurements.

Args:

measurements: List of quality scores

Returns:

Statistics: mean, min, max, std
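
For example (assuming the documented statistic names are the map keys):

stats := metric.Aggregate([]float64{0.82, 0.91, 0.76})
fmt.Printf("mean=%.2f min=%.2f max=%.2f\n", stats["mean"], stats["min"], stats["max"])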

func (*QualityMetrics) Measure

func (m *QualityMetrics) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)

Measure measures response quality.

Args:

agent: Agent being evaluated
inputMessage: Input query
outputMessage: Agent response
ctx: Optional context

Returns:

Quality score (0.0 to 1.0)

func (*QualityMetrics) Name

func (m *QualityMetrics) Name() string

Name returns the metric name.

type RandomSearchOptimizer

type RandomSearchOptimizer struct {
	// contains filtered or unexported fields
}

RandomSearchOptimizer implements baseline random search optimization.

Randomly samples configurations from the search space and evaluates them. Useful as a baseline for comparison with more sophisticated algorithms.

Example:

optimizer := evaluation.NewRandomSearchOptimizer(
    objectiveFunc,
    searchSpace,
    true, // maximize
)
result, err := optimizer.Optimize(ctx, 20)

func NewRandomSearchOptimizer

func NewRandomSearchOptimizer(
	objective ObjectiveFunc,
	searchSpace *SearchSpace,
	maximize bool,
) *RandomSearchOptimizer

NewRandomSearchOptimizer creates a new random search optimizer.

Args:

objective: Function to evaluate configurations
searchSpace: SearchSpace defining parameter space
maximize: Whether to maximize (true) or minimize (false) objective

Returns:

RandomSearchOptimizer instance

func (*RandomSearchOptimizer) GetHistory

func (r *RandomSearchOptimizer) GetHistory() []OptimizationStep

GetHistory returns the optimization history.

func (*RandomSearchOptimizer) Optimize

func (r *RandomSearchOptimizer) Optimize(ctx context.Context, nIterations int) (*OptimizationResult, error)

Optimize runs random search optimization.

Randomly samples nIterations configurations from the search space, evaluates each, and tracks the best configuration found.

Args:

ctx: Context for cancellation
nIterations: Number of configurations to sample and evaluate

Returns:

OptimizationResult with best config, score, and history
error if optimization fails

type RecordingStorage

type RecordingStorage interface {
	// SaveRecording saves recording.
	SaveRecording(recording *SessionRecording) error

	// LoadRecording loads recording by session ID.
	LoadRecording(sessionID string) (*SessionRecording, error)

	// ListRecordings lists recordings.
	ListRecordings(limit, offset int) ([]*SessionRecording, error)

	// DeleteRecording deletes recording.
	DeleteRecording(sessionID string) error
}

RecordingStorage is the interface for recording storage backends.

Implement this to create custom storage (Redis, S3, Postgres, etc.).
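
The interface is small enough that a minimal in-memory backend fits in a few dozen lines; the sketch below is illustrative only (the type name and error text are assumptions, and it is not safe for concurrent use):

type memoryRecordingStorage struct {
	recordings map[string]*evaluation.SessionRecording
}

func newMemoryRecordingStorage() *memoryRecordingStorage {
	return &memoryRecordingStorage{recordings: map[string]*evaluation.SessionRecording{}}
}

func (m *memoryRecordingStorage) SaveRecording(rec *evaluation.SessionRecording) error {
	m.recordings[rec.SessionID] = rec
	return nil
}

func (m *memoryRecordingStorage) LoadRecording(sessionID string) (*evaluation.SessionRecording, error) {
	rec, ok := m.recordings[sessionID]
	if !ok {
		return nil, fmt.Errorf("recording %q not found", sessionID)
	}
	return rec, nil
}

func (m *memoryRecordingStorage) ListRecordings(limit, offset int) ([]*evaluation.SessionRecording, error) {
	// Map iteration order is unspecified; a real backend would sort.
	all := make([]*evaluation.SessionRecording, 0, len(m.recordings))
	for _, rec := range m.recordings {
		all = append(all, rec)
	}
	if offset > len(all) {
		offset = len(all)
	}
	end := len(all)
	if limit > 0 && offset+limit < end {
		end = offset + limit
	}
	return all[offset:end], nil
}

func (m *memoryRecordingStorage) DeleteRecording(sessionID string) error {
	delete(m.recordings, sessionID)
	return nil
}

A recorder using it would then be created with NewSessionRecorder(newMemoryRecordingStorage()).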

type Regression

type Regression struct {
	MetricName         string
	BaselineValue      float64
	CurrentValue       float64
	DegradationPercent float64
	Severity           Severity
	Timestamp          time.Time
	Context            map[string]interface{}
}

Regression represents a detected regression in agent performance.

Contains information about what degraded and by how much.

func (*Regression) IsRegression

func (r *Regression) IsRegression() bool

IsRegression checks if this is a real regression (not an improvement).

func (*Regression) ToDict

func (r *Regression) ToDict() map[string]interface{}

ToDict converts regression to dictionary.

type RegressionDetector

type RegressionDetector struct {
	// contains filtered or unexported fields
}

RegressionDetector detects performance regressions by comparing results.

Monitors agent quality over time and alerts when performance degrades beyond acceptable thresholds.

Example:

detector := NewRegressionDetector(nil, nil)
detector.SetBaseline(baselineResult)

// Later, after changes
regressions := detector.Detect(currentResult, true)
if len(regressions) > 0 {
    fmt.Printf("Found %d regressions!\n", len(regressions))
    for _, r := range regressions {
        fmt.Printf("  %s: %.1f%% worse\n", r.MetricName, r.DegradationPercent)
    }
}

func NewRegressionDetector

func NewRegressionDetector(thresholds map[string]float64, baseline *EvaluationResult) *RegressionDetector

NewRegressionDetector creates a new regression detector.

Args:

thresholds: Acceptable degradation per metric (default 10%)
baseline: Baseline evaluation result to compare against

Example:

detector := NewRegressionDetector(nil, baselineResult)

func (*RegressionDetector) ClearHistory

func (d *RegressionDetector) ClearHistory()

ClearHistory clears evaluation history.

func (*RegressionDetector) CompareResults

func (d *RegressionDetector) CompareResults(resultA, resultB *EvaluationResult) map[string]map[string]float64

CompareResults compares two evaluation results.

Args:

resultA: First result (baseline)
resultB: Second result (comparison)

Returns:

Dictionary of metric comparisons

func (*RegressionDetector) Detect

func (d *RegressionDetector) Detect(result *EvaluationResult, storeHistory bool) []*Regression

Detect detects regressions in evaluation result.

Compares current result to baseline and identifies metrics that have degraded beyond acceptable thresholds.

Args:

result: Current evaluation result
storeHistory: Whether to store result in history

Returns:

List of detected regressions (empty if no regressions)

func (*RegressionDetector) GetSummary

func (d *RegressionDetector) GetSummary() map[string]interface{}

GetSummary gets a summary of the detector state.

Returns:

Summary with baseline info and history count

func (*RegressionDetector) GetTrend

func (d *RegressionDetector) GetTrend(metricName string, window int) map[string]interface{}

GetTrend gets the trend for a metric over recent history.

Args:

metricName: Metric to analyze
window: Number of recent results to analyze

Returns:

Trend statistics (slope, direction, variance)
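
For example, with a detector that has accumulated history (the metric name is illustrative):

trend := detector.GetTrend("accuracy", 5) // analyze the 5 most recent stored results
fmt.Printf("accuracy trend: %v\n", trend)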

func (*RegressionDetector) SetBaseline

func (d *RegressionDetector) SetBaseline(result *EvaluationResult)

SetBaseline sets baseline for comparison.

Args:

result: Evaluation result to use as baseline

type SearchSpace

type SearchSpace struct {
	Parameters map[string]ParameterSpec
}

SearchSpace defines the hyperparameter search space.

func NewSearchSpace

func NewSearchSpace() *SearchSpace

NewSearchSpace creates a new search space.

func (*SearchSpace) AddCategorical

func (s *SearchSpace) AddCategorical(name string, values []string)

AddCategorical adds a categorical parameter with specific values.

func (*SearchSpace) AddContinuous

func (s *SearchSpace) AddContinuous(name string, low, high float64)

AddContinuous adds a continuous parameter with range [low, high].

func (*SearchSpace) AddDiscrete

func (s *SearchSpace) AddDiscrete(name string, values []interface{})

AddDiscrete adds a discrete parameter with specific values.

func (*SearchSpace) AddInteger

func (s *SearchSpace) AddInteger(name string, low, high int)

AddInteger adds an integer parameter with range [low, high].

func (*SearchSpace) Sample

func (s *SearchSpace) Sample() map[string]interface{}

Sample generates a random configuration from the search space.
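
A short sketch combining several parameter kinds (the parameter names and ranges are illustrative):

space := evaluation.NewSearchSpace()
space.AddContinuous("temperature", 0.0, 1.0)
space.AddInteger("max_tokens", 256, 4096)
space.AddCategorical("style", []string{"concise", "detailed"})

config := space.Sample() // random configuration keyed by parameter name
fmt.Printf("sampled: %v\n", config)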

type SessionRecorder

type SessionRecorder struct {
	// contains filtered or unexported fields
}

SessionRecorder records agent sessions for replay and analysis.

Automatically records all interactions with an agent, storing inputs, outputs, timing, and metadata.

Example:

recorder := NewSessionRecorder(NewFileRecordingStorage("./recordings"))
wrappedAgent := recorder.Wrap(agent)

// Use agent normally (automatically recorded)
response, _ := wrappedAgent.Process(ctx, message)

// Save recording
recorder.FinalizeSession("test-123")

func NewSessionRecorder

func NewSessionRecorder(storage RecordingStorage) *SessionRecorder

NewSessionRecorder creates a new session recorder.

Args:

storage: Storage backend (nil = in-memory)

Example:

recorder := NewSessionRecorder(nil)

func (*SessionRecorder) DeleteRecording

func (r *SessionRecorder) DeleteRecording(sessionID string) error

DeleteRecording deletes a recording.

func (*SessionRecorder) FinalizeSession

func (r *SessionRecorder) FinalizeSession(sessionID string) (*SessionRecording, error)

FinalizeSession finalizes and saves the session recording.

Args:

sessionID: Session to finalize

Returns:

Session recording

func (*SessionRecorder) ListRecordings

func (r *SessionRecorder) ListRecordings(limit, offset int) ([]*SessionRecording, error)

ListRecordings lists all recordings.

func (*SessionRecorder) LoadRecording

func (r *SessionRecorder) LoadRecording(sessionID string) (*SessionRecording, error)

LoadRecording loads a recording from storage.

Args:

sessionID: Session to load

Returns:

Session recording if found

func (*SessionRecorder) RecordInteraction

func (r *SessionRecorder) RecordInteraction(sessionID string, inputMessage, outputMessage *agenkit.Message, latencyMs float64, metadata map[string]interface{})

RecordInteraction records a single interaction.

Args:

sessionID: Session identifier
inputMessage: Input to agent
outputMessage: Agent response
latencyMs: Processing time in milliseconds
metadata: Optional interaction metadata

func (*SessionRecorder) StartSession

func (r *SessionRecorder) StartSession(sessionID, agentName string, metadata map[string]interface{})

StartSession starts a recording session.

Args:

sessionID: Session identifier
agentName: Name of agent being recorded
metadata: Optional session metadata

func (*SessionRecorder) Wrap

func (r *SessionRecorder) Wrap(agent agenkit.Agent) agenkit.Agent

Wrap wraps an agent to record its interactions.

Args:

agent: Agent to wrap

Returns:

Wrapped agent that records all interactions

type SessionRecording

type SessionRecording struct {
	SessionID    string
	AgentName    string
	StartTime    time.Time
	EndTime      *time.Time
	Interactions []*InteractionRecord
	Metadata     map[string]interface{}
}

SessionRecording represents a recording of an entire session.

Contains all interactions and session metadata.

func SessionRecordingFromDict

func SessionRecordingFromDict(data map[string]interface{}) (*SessionRecording, error)

SessionRecordingFromDict creates recording from dictionary.

func (*SessionRecording) DurationSeconds

func (r *SessionRecording) DurationSeconds() float64

DurationSeconds calculates session duration in seconds.

func (*SessionRecording) InteractionCount

func (r *SessionRecording) InteractionCount() int

InteractionCount gets the number of interactions.

func (*SessionRecording) ToDict

func (r *SessionRecording) ToDict() map[string]interface{}

ToDict converts recording to dictionary.

func (*SessionRecording) TotalLatencyMs

func (r *SessionRecording) TotalLatencyMs() float64

TotalLatencyMs gets the total latency across all interactions.
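
Together these accessors make it easy to derive summary statistics; a small sketch, assuming a previously saved recording:

recording, _ := recorder.LoadRecording("test-123")
if n := recording.InteractionCount(); n > 0 {
	fmt.Printf("%d interactions over %.1fs, avg latency %.0f ms\n",
		n, recording.DurationSeconds(), recording.TotalLatencyMs()/float64(n))
}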

type SessionReplay

type SessionReplay struct{}

SessionReplay replays recorded sessions for analysis and A/B testing.

Takes a recorded session and replays it through a (possibly different) agent to compare behavior.

Example:

replay := NewSessionReplay()
recording, _ := recorder.LoadRecording("test-123")

// Replay with original agent
resultsA, _ := replay.Replay(recording, agentV1, "")

// Replay with new agent (A/B test)
resultsB, _ := replay.Replay(recording, agentV2, "")

// Compare
comparison := replay.Compare(resultsA, resultsB)

func NewSessionReplay

func NewSessionReplay() *SessionReplay

NewSessionReplay creates a new session replay.

func (*SessionReplay) Compare

func (r *SessionReplay) Compare(resultsA, resultsB map[string]interface{}) map[string]interface{}

Compare compares two replay results.

Useful for A/B testing different agent versions.

Args:

resultsA: First replay results
resultsB: Second replay results

Returns:

Comparison metrics

func (*SessionReplay) Replay

func (r *SessionReplay) Replay(recording *SessionRecording, agent agenkit.Agent, sessionID string) (map[string]interface{}, error)

Replay replays a session through an agent.

Args:

recording: Session recording to replay
agent: Agent to replay through
sessionID: Optional session ID (defaults to original)

Returns:

Replay results with outputs and metrics

type SessionResult

type SessionResult struct {
	// SessionID uniquely identifies this session
	SessionID string `json:"session_id"`
	// AgentName identifies the agent being evaluated
	AgentName string `json:"agent_name"`
	// Status of the session
	Status SessionStatus `json:"status"`
	// StartTime when session started (RFC3339 format)
	StartTime string `json:"start_time"`
	// EndTime when session ended (RFC3339 format, nil if still running)
	EndTime *string `json:"end_time,omitempty"`
	// Measurements collected during session
	Measurements []MetricMeasurement `json:"measurements"`
	// Errors that occurred during session
	Errors []ErrorRecord `json:"errors"`
	// Metadata for additional context
	Metadata map[string]interface{} `json:"metadata,omitempty"`
}

SessionResult contains results from evaluating an agent session with enhanced tracking.

This extends the core EvaluationResult with session status, error tracking, and richer metadata for long-running agent evaluations.

func FromJSON

func FromJSON(jsonStr string) (*SessionResult, error)

FromJSON deserializes a session result from JSON.

func NewSessionResult

func NewSessionResult(sessionID string, agentName string) *SessionResult

NewSessionResult creates a new session result.

func (*SessionResult) AddError

func (sr *SessionResult) AddError(errorType string, message string, details map[string]interface{})

AddError records an error that occurred during the session.

func (*SessionResult) AddMetricMeasurement

func (sr *SessionResult) AddMetricMeasurement(measurement *MetricMeasurement)

AddMetricMeasurement adds a metric measurement to this session.

func (*SessionResult) DurationSeconds

func (sr *SessionResult) DurationSeconds() *float64

DurationSeconds calculates session duration in seconds.

Returns nil if session hasn't ended yet.

func (*SessionResult) GetMetric

func (sr *SessionResult) GetMetric(name string) *MetricMeasurement

GetMetric retrieves a specific metric measurement by name (returns first match).

func (*SessionResult) GetMetricsByType

func (sr *SessionResult) GetMetricsByType(metricType MetricType) []MetricMeasurement

GetMetricsByType retrieves all measurements of a specific type.

func (*SessionResult) SetStatus

func (sr *SessionResult) SetStatus(status SessionStatus)

SetStatus updates the session status.

func (*SessionResult) ToJSON

func (sr *SessionResult) ToJSON() (string, error)

ToJSON serializes the session result to JSON.
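
A round-trip sketch showing error tracking and JSON serialization (the session ID and error fields are illustrative):

result := evaluation.NewSessionResult("session-456", "my-agent")
result.AddError("timeout", "tool call exceeded deadline", map[string]interface{}{"tool": "search"})
result.SetStatus(evaluation.SessionStatusFailed)

data, _ := result.ToJSON()
restored, _ := evaluation.FromJSON(data)
fmt.Printf("status=%s errors=%d\n", restored.Status, len(restored.Errors))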

type SessionStatus

type SessionStatus string

SessionStatus represents the status of an evaluation session.

const (
	// SessionStatusRunning indicates session is currently running
	SessionStatusRunning SessionStatus = "running"
	// SessionStatusCompleted indicates session completed successfully
	SessionStatusCompleted SessionStatus = "completed"
	// SessionStatusFailed indicates session failed
	SessionStatusFailed SessionStatus = "failed"
	// SessionStatusTimeout indicates session timed out
	SessionStatusTimeout SessionStatus = "timeout"
	// SessionStatusCancelled indicates session was cancelled
	SessionStatusCancelled SessionStatus = "cancelled"
)

type Severity

type Severity string

Severity represents regression severity levels.

const (
	SeverityNone     Severity = "none"
	SeverityMinor    Severity = "minor"    // <10% degradation
	SeverityModerate Severity = "moderate" // 10-20% degradation
	SeverityMajor    Severity = "major"    // 20-50% degradation
	SeverityCritical Severity = "critical" // >50% degradation
)

type SignificanceLevel

type SignificanceLevel float64

SignificanceLevel represents statistical significance thresholds.

const (
	// SignificanceLevel0001 represents 99.9% confidence
	SignificanceLevel0001 SignificanceLevel = 0.001
	// SignificanceLevel001 represents 99% confidence
	SignificanceLevel001 SignificanceLevel = 0.01
	// SignificanceLevel005 represents 95% confidence (default)
	SignificanceLevel005 SignificanceLevel = 0.05
	// SignificanceLevel010 represents 90% confidence
	SignificanceLevel010 SignificanceLevel = 0.10
)

type SimpleQABenchmark

type SimpleQABenchmark struct{}

SimpleQABenchmark is a simple question-answering benchmark.

Tests basic knowledge and reasoning.

func NewSimpleQABenchmark

func NewSimpleQABenchmark() *SimpleQABenchmark

NewSimpleQABenchmark creates a new simple Q&A benchmark.

func (*SimpleQABenchmark) Description

func (b *SimpleQABenchmark) Description() string

Description returns the benchmark description.

func (*SimpleQABenchmark) GenerateTestCases

func (b *SimpleQABenchmark) GenerateTestCases() ([]*TestCase, error)

GenerateTestCases generates simple Q&A test cases.
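
A short usage sketch:

benchmark := NewSimpleQABenchmark()
cases, err := benchmark.GenerateTestCases()
if err == nil {
	fmt.Printf("%s: %d cases\n", benchmark.Name(), len(cases))
	for _, tc := range cases {
		fmt.Println(" -", tc.Input)
	}
}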

func (*SimpleQABenchmark) Name

func (b *SimpleQABenchmark) Name() string

Name returns the benchmark name.

type StatisticalTestType

type StatisticalTestType string

StatisticalTestType represents types of statistical tests.

const (
	// TestTypeTTest is a parametric test assuming normal distribution
	TestTypeTTest StatisticalTestType = "t_test"
	// TestTypeMannWhitney is a non-parametric test
	TestTypeMannWhitney StatisticalTestType = "mann_whitney"
)

type TestCase

type TestCase struct {
	Input    string
	Expected interface{} // String or validation function
	Metadata map[string]interface{}
	Tags     []string
}

TestCase represents a single test case for evaluation.

Contains input, expected output, and metadata.

func (*TestCase) ToDict

func (t *TestCase) ToDict() map[string]interface{}

ToDict converts test case to dictionary.

type ValidatorFunc

type ValidatorFunc func(expected, actual string) bool

ValidatorFunc is a custom validation function type.
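
For example, a tolerant validator that accepts a response when the expected answer appears anywhere in it, case-insensitively:

var contains evaluation.ValidatorFunc = func(expected, actual string) bool {
	return strings.Contains(strings.ToLower(actual), strings.ToLower(expected))
}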
