evaluation

package v0.0.0-...-140e820
Published: Feb 4, 2026 License: Apache-2.0 Imports: 14 Imported by: 0

Documentation

Overview

Package evaluation provides an A/B testing framework for comparing agent variants.

The A/B testing framework provides statistical significance testing, effect size calculation, and automated experiment orchestration for comparing agent performance.

Example:

// Create A/B test
abTest := evaluation.NewABTest(
    "prompt_comparison",
    controlAgent,
    treatmentAgent,
    []string{"accuracy"},
    evaluation.SignificanceLevel005,
    evaluation.TestTypeTTest,
)

// Run experiment
results, _ := abTest.Run(testCases, 50, true)

// Check significance
if results["accuracy"].IsSignificant() {
    fmt.Printf("Winner: %s\n", results["accuracy"].Winner())
}

Package evaluation provides tools for evaluating and optimizing agent performance. Bayesian Optimization uses probabilistic models to efficiently find optimal hyperparameter configurations.

Package evaluation provides comprehensive evaluation capabilities for autonomous agents.

Designed for measuring agent quality and performance, with a special focus on extreme-scale context evaluation (1M-25M+ tokens) for systems like endless.

Example:

evaluator := evaluation.NewEvaluator(agent, nil, "")
testCases := []map[string]interface{}{
    {"input": "What is 2+2?", "expected": "4"},
}
result, _ := evaluator.Evaluate(testCases, "")
fmt.Printf("Accuracy: %.2f\n", result.Accuracy)

Package evaluation provides detailed metrics tracking for agent evaluation.

This module extends core evaluation with enhanced metric tracking, including:

  • Session status tracking (running, completed, failed, etc.)
  • Error collection and analysis
  • Metric type categorization
  • Cross-session aggregation

Key use case: "How do you know a long-running agent succeeded?"

Example:

result := evaluation.NewSessionResult("session-123", "my-agent")
result.AddMetricMeasurement(evaluation.NewMetricMeasurement(
    "accuracy",
    0.95,
    evaluation.MetricTypeSuccessRate,
))
result.SetStatus(evaluation.SessionStatusCompleted)

Package evaluation provides the Automated Optimization Framework.

This module provides intelligent optimization of agent configurations, prompts, and hyperparameters using Bayesian optimization, genetic algorithms, and other search strategies.

Interfaces:

  • Optimizer: Base interface for optimization algorithms

Implementations:

  • RandomSearchOptimizer: Baseline random search
  • BayesianOptimizer: Bayesian optimization (in bayesian_optimizer.go)

Example:

searchSpace := evaluation.NewSearchSpace()
searchSpace.AddContinuous("temperature", 0.0, 1.0)
searchSpace.AddContinuous("top_p", 0.0, 1.0)

optimizer := evaluation.NewRandomSearchOptimizer(
    objectiveFunc,
    searchSpace,
    true, // maximize
)

result, err := optimizer.Optimize(ctx, 50)
fmt.Printf("Best config: %v\n", result.BestConfig)
fmt.Printf("Best score: %.3f\n", result.BestScore)

Package evaluation provides the Prompt Optimization Framework.

Automatically improve prompts through systematic variation and testing using grid search, random search, or genetic algorithms.

Example:

template := `You are a {role}.
{instructions}`

variations := map[string][]string{
    "role":         {"helpful assistant", "expert advisor"},
    "instructions": {"Be concise.", "Be detailed."},
}

optimizer := evaluation.NewPromptOptimizer(
    template,
    variations,
    func(prompt string) agenkit.Agent { return MyAgent{SystemPrompt: prompt} },
    []string{"accuracy"},
    nil,
)

result, _ := optimizer.Optimize(ctx, testCases, "grid", nil)
Example (AccuracyMetric)

Example_accuracyMetric demonstrates accuracy measurement

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
	name string
}

func (a *ExampleAgent) Name() string {
	return a.name
}

func (a *ExampleAgent) Capabilities() []string {
	return []string{"question-answering"}
}

func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
	return agenkit.DefaultIntrospectionResult(a)
}

func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {

	content := msg.Content
	var response string

	switch content {
	case "What is the capital of France?":
		response = "The capital of France is Paris."
	case "What is 2+2?":
		response = "2+2 equals 4."
	default:
		response = "I don't know the answer to that question."
	}

	return &agenkit.Message{
		Role:    "agent",
		Content: response,
	}, nil
}

func main() {
	metric := evaluation.NewAccuracyMetric(nil, false)
	agent := &ExampleAgent{name: "qa-agent"}

	input := &agenkit.Message{
		Role:    "user",
		Content: "What is the capital of France?",
	}
	output := &agenkit.Message{
		Role:    "agent",
		Content: "The capital of France is Paris.",
	}

	ctx := map[string]interface{}{
		"expected": "Paris",
	}

	score, err := metric.Measure(agent, input, output, ctx)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Accuracy score: %.0f\n", score)

}
Output:

Accuracy score: 1
Example (BenchmarkSuite)

Example_benchmarkSuite demonstrates running benchmark suites

package main

import (
	"fmt"
	"log"

	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

func main() {
	// Create standard benchmark suite
	suite := evaluation.BenchmarkSuiteStandard()

	// Generate all test cases
	testCases, err := suite.GenerateAllTestCases()
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Generated %d test cases from standard suite\n", len(testCases))

}
Output:

Generated 19 test cases from standard suite
Example (CompressionMetrics)

Example_compressionMetrics demonstrates compression evaluation

package main

import (
	"fmt"

	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

func main() {
	// Create metrics for small test lengths
	testLengths := []int{1000, 5000, 10000}
	metric := evaluation.NewCompressionMetrics(testLengths, 3)

	fmt.Printf("Metric name: %s\n", metric.Name())
	fmt.Printf("Testing %d scale points\n", len(testLengths))

}
Output:

Metric name: compression_quality
Testing 3 scale points
Example (ContextMetrics)

Example_contextMetrics demonstrates context tracking

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
	name string
}

func (a *ExampleAgent) Name() string {
	return a.name
}

func (a *ExampleAgent) Capabilities() []string {
	return []string{"question-answering"}
}

func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
	return agenkit.DefaultIntrospectionResult(a)
}

func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {

	content := msg.Content
	var response string

	switch content {
	case "What is the capital of France?":
		response = "The capital of France is Paris."
	case "What is 2+2?":
		response = "2+2 equals 4."
	default:
		response = "I don't know the answer to that question."
	}

	return &agenkit.Message{
		Role:    "agent",
		Content: response,
	}, nil
}

func main() {
	metric := evaluation.NewContextMetrics()

	agent := &ExampleAgent{name: "test-agent"}
	input := &agenkit.Message{
		Role:    "user",
		Content: "Test query",
		Metadata: map[string]interface{}{
			"context_length": 1000.0,
		},
	}
	output := &agenkit.Message{
		Role:    "agent",
		Content: "Test response",
	}

	// Measure context length
	length, err := metric.Measure(agent, input, output, nil)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Context length: %.0f tokens\n", length)

	// Simulate growing context over time
	measurements := []float64{1000, 1200, 1400, 1600, 1800}
	aggregated := metric.Aggregate(measurements)

	fmt.Printf("Mean context: %.0f tokens\n", aggregated["mean"])
	fmt.Printf("Growth rate: %.0f tokens/interaction\n", aggregated["growth_rate"])

}
Output:

Context length: 1000 tokens
Mean context: 1400 tokens
Growth rate: 160 tokens/interaction
Example (Evaluator)

ExampleEvaluator demonstrates basic evaluation

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
	name string
}

func (a *ExampleAgent) Name() string {
	return a.name
}

func (a *ExampleAgent) Capabilities() []string {
	return []string{"question-answering"}
}

func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
	return agenkit.DefaultIntrospectionResult(a)
}

func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {

	content := msg.Content
	var response string

	switch content {
	case "What is the capital of France?":
		response = "The capital of France is Paris."
	case "What is 2+2?":
		response = "2+2 equals 4."
	default:
		response = "I don't know the answer to that question."
	}

	return &agenkit.Message{
		Role:    "agent",
		Content: response,
	}, nil
}

func main() {
	agent := &ExampleAgent{name: "qa-agent"}

	// Create evaluator with metrics
	metrics := []evaluation.Metric{
		evaluation.NewAccuracyMetric(nil, false),
		evaluation.NewQualityMetrics(false, "", nil),
	}

	evaluator := evaluation.NewEvaluator(agent, metrics, "session-123")

	// Define test cases
	testCases := []map[string]interface{}{
		{
			"input":    "What is the capital of France?",
			"expected": "Paris",
		},
		{
			"input":    "What is 2+2?",
			"expected": "4",
		},
	}

	// Run evaluation
	result, err := evaluator.Evaluate(testCases, "")
	if err != nil {
		log.Fatal(err)
	}

	// Print results
	fmt.Printf("Total Tests: %d\n", result.TotalTests)
	fmt.Printf("Passed: %d\n", result.PassedTests)
	if result.Accuracy != nil {
		fmt.Printf("Accuracy: %.2f\n", *result.Accuracy)
	}

}
Output:

Total Tests: 2
Passed: 2
Accuracy: 1.00
Example (EvaluatorContinuous)

Example_evaluatorContinuous demonstrates continuous evaluation

package main

import (
	"context"
	"fmt"

	"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
	name string
}

func (a *ExampleAgent) Name() string {
	return a.name
}

func (a *ExampleAgent) Capabilities() []string {
	return []string{"question-answering"}
}

func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
	return agenkit.DefaultIntrospectionResult(a)
}

func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {

	content := msg.Content
	var response string

	switch content {
	case "What is the capital of France?":
		response = "The capital of France is Paris."
	case "What is 2+2?":
		response = "2+2 equals 4."
	default:
		response = "I don't know the answer to that question."
	}

	return &agenkit.Message{
		Role:    "agent",
		Content: response,
	}, nil
}

func main() {
	agent := &ExampleAgent{name: "qa-agent"}
	metrics := []evaluation.Metric{
		evaluation.NewAccuracyMetric(nil, false),
	}

	evaluator := evaluation.NewEvaluator(agent, metrics, "continuous-session")

	// Set up regression detection
	detector := evaluation.NewRegressionDetector(nil, nil)

	// Baseline evaluation
	baselineTests := []map[string]interface{}{
		{
			"input":    "What is the capital of France?",
			"expected": "Paris",
		},
	}

	baselineResult, _ := evaluator.Evaluate(baselineTests, "")
	detector.SetBaseline(baselineResult)
	fmt.Println("Baseline established")

	// Current evaluation
	currentTests := baselineTests // Same tests
	currentResult, _ := evaluator.Evaluate(currentTests, "")

	// Detect regressions
	regressions := detector.Detect(currentResult, true)
	if len(regressions) > 0 {
		fmt.Printf("Detected %d regressions\n", len(regressions))
	} else {
		fmt.Println("No regressions detected")
	}

}
Output:

Baseline established
No regressions detected
Example (LatencyMetric)

Example_latencyMetric demonstrates latency tracking

package main

import (
	"fmt"

	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

func main() {
	metric := evaluation.NewLatencyMetric()

	// Simulate latency measurements
	measurements := []float64{
		95, 102, 98, 105, 110, 97, 103, 120, 99, 101,
		104, 108, 96, 100, 115, 102, 98, 105, 130, 99,
	}

	aggregated := metric.Aggregate(measurements)

	fmt.Printf("Latency Statistics:\n")
	fmt.Printf("Mean: %.0fms\n", aggregated["mean"])
	fmt.Printf("p50:  %.0fms\n", aggregated["p50"])
	fmt.Printf("p95:  %.0fms\n", aggregated["p95"])
	fmt.Printf("p99:  %.0fms\n", aggregated["p99"])

}
Output:

Latency Statistics:
Mean: 104ms
p50:  102ms
p95:  130ms
p99:  130ms
Example (PrecisionRecallMetric)

Example_precisionRecallMetric demonstrates classification evaluation

package main

import (
	"context"
	"fmt"

	"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
	name string
}

func (a *ExampleAgent) Name() string {
	return a.name
}

func (a *ExampleAgent) Capabilities() []string {
	return []string{"question-answering"}
}

func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
	return agenkit.DefaultIntrospectionResult(a)
}

func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {

	content := msg.Content
	var response string

	switch content {
	case "What is the capital of France?":
		response = "The capital of France is Paris."
	case "What is 2+2?":
		response = "2+2 equals 4."
	default:
		response = "I don't know the answer to that question."
	}

	return &agenkit.Message{
		Role:    "agent",
		Content: response,
	}, nil
}

func main() {
	metric := evaluation.NewPrecisionRecallMetric()
	agent := &ExampleAgent{name: "classifier"}

	input := &agenkit.Message{Role: "user", Content: "Classify"}
	output := &agenkit.Message{Role: "agent", Content: "Positive"}

	// Simulate classification results
	testCases := []struct {
		trueLabel      bool
		predictedLabel bool
	}{
		{true, true},   // TP
		{true, true},   // TP
		{false, true},  // FP
		{true, false},  // FN
		{false, false}, // TN
	}

	measurements := make([]float64, 0)
	for _, tc := range testCases {
		ctx := map[string]interface{}{
			"true_label":      tc.trueLabel,
			"predicted_label": tc.predictedLabel,
		}
		score, _ := metric.Measure(agent, input, output, ctx)
		measurements = append(measurements, score)
	}

	results := metric.Aggregate(measurements)

	fmt.Printf("Precision: %.2f\n", results["precision"])
	fmt.Printf("Recall: %.2f\n", results["recall"])
	fmt.Printf("F1 Score: %.2f\n", results["f1_score"])

}
Output:

Precision: 0.67
Recall: 0.67
F1 Score: 0.67
Example (QualityMetrics)

Example_qualityMetrics demonstrates quality scoring

package main

import (
	"context"
	"fmt"

	"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
	name string
}

func (a *ExampleAgent) Name() string {
	return a.name
}

func (a *ExampleAgent) Capabilities() []string {
	return []string{"question-answering"}
}

func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
	return agenkit.DefaultIntrospectionResult(a)
}

func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {

	content := msg.Content
	var response string

	switch content {
	case "What is the capital of France?":
		response = "The capital of France is Paris."
	case "What is 2+2?":
		response = "2+2 equals 4."
	default:
		response = "I don't know the answer to that question."
	}

	return &agenkit.Message{
		Role:    "agent",
		Content: response,
	}, nil
}

func main() {
	metric := evaluation.NewQualityMetrics(false, "", nil)
	agent := &ExampleAgent{name: "qa-agent"}

	input := &agenkit.Message{
		Role:    "user",
		Content: "Explain machine learning",
	}

	// Good response
	output1 := &agenkit.Message{
		Role: "agent",
		Content: "Machine learning is a branch of artificial intelligence that " +
			"enables systems to learn from data and improve their performance " +
			"over time without being explicitly programmed.",
	}

	score1, _ := metric.Measure(agent, input, output1, nil)
	fmt.Printf("Good response quality: %.2f\n", score1)

	// Poor response
	output2 := &agenkit.Message{
		Role:    "agent",
		Content: "Yes.",
	}

	score2, _ := metric.Measure(agent, input, output2, nil)
	fmt.Printf("Poor response quality: %.2f\n", score2)

}
Output:

Good response quality: 0.80
Poor response quality: 0.21
Example (RegressionDetector)

Example_regressionDetector demonstrates regression detection

package main

import (
	"fmt"

	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

func main() {
	// Create baseline result
	accuracy := 0.95
	quality := 0.90
	latency := 100.0
	baseline := &evaluation.EvaluationResult{
		Accuracy:     &accuracy,
		QualityScore: &quality,
		AvgLatencyMs: &latency,
	}

	// Create detector with baseline
	detector := evaluation.NewRegressionDetector(nil, baseline)

	// Simulate degraded performance
	newAccuracy := 0.80
	newQuality := 0.85
	newLatency := 150.0
	current := &evaluation.EvaluationResult{
		Accuracy:     &newAccuracy,
		QualityScore: &newQuality,
		AvgLatencyMs: &newLatency,
	}

	// Detect regressions
	regressions := detector.Detect(current, true)

	fmt.Printf("Found %d regressions\n", len(regressions))

}
Output:

Found 2 regressions
Example (SessionRecorder)

Example_sessionRecorder demonstrates session recording

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
	"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)

// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
	name string
}

func (a *ExampleAgent) Name() string {
	return a.name
}

func (a *ExampleAgent) Capabilities() []string {
	return []string{"question-answering"}
}

func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
	return agenkit.DefaultIntrospectionResult(a)
}

func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {

	content := msg.Content
	var response string

	switch content {
	case "What is the capital of France?":
		response = "The capital of France is Paris."
	case "What is 2+2?":
		response = "2+2 equals 4."
	default:
		response = "I don't know the answer to that question."
	}

	return &agenkit.Message{
		Role:    "agent",
		Content: response,
	}, nil
}

func main() {
	storage := evaluation.NewInMemoryRecordingStorage()
	recorder := evaluation.NewSessionRecorder(storage)

	// Start session
	recorder.StartSession("session-123", "qa-agent", nil)

	// Wrap agent to automatically record
	agent := &ExampleAgent{name: "qa-agent"}
	wrappedAgent := recorder.Wrap(agent)

	// Process message (automatically recorded)
	input := &agenkit.Message{
		Role:    "user",
		Content: "What is AI?",
		Metadata: map[string]interface{}{
			"session_id": "session-123",
		},
	}

	_, _ = wrappedAgent.Process(context.Background(), input)

	// Finalize session
	recording, err := recorder.FinalizeSession("session-123")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Session %s recorded\n", recording.SessionID)
	fmt.Printf("Interactions: %d\n", recording.InteractionCount())

}
Output:

Session session-123 recorded
Interactions: 1

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func CalculateSampleSize

func CalculateSampleSize(baselineMean, minimumDetectableEffect, alpha, power float64, stdDev *float64) int

CalculateSampleSize calculates the required sample size for an A/B test.
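
Example (the numeric values are illustrative; passing a nil stdDev is assumed to fall back to an internal estimate):

// Detect a 0.05 absolute lift over a 0.80 baseline at alpha=0.05 with 80% power.
n := evaluation.CalculateSampleSize(0.80, 0.05, 0.05, 0.80, nil)
fmt.Printf("Required samples per variant: %d\n", n)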

Types

type ABResult

type ABResult struct {
	ExperimentName     string
	ControlVariant     *ABVariant
	TreatmentVariant   *ABVariant
	MetricName         string
	PValue             float64
	TestType           StatisticalTestType
	SignificanceLevel  SignificanceLevel
	EffectSize         float64
	ConfidenceInterval [2]float64
	Timestamp          time.Time
}

ABResult contains results of an A/B test with statistical analysis.

func (*ABResult) ImprovementPercent

func (r *ABResult) ImprovementPercent() float64

ImprovementPercent returns percent improvement of treatment over control.

func (*ABResult) IsSignificant

func (r *ABResult) IsSignificant() bool

IsSignificant checks if result is statistically significant.

func (*ABResult) ToMap

func (r *ABResult) ToMap() map[string]interface{}

ToMap converts result to map for serialization.

func (*ABResult) Winner

func (r *ABResult) Winner() string

Winner returns the name of the winning variant if the result is significant, otherwise an empty string.

type ABTest

type ABTest struct {
	Name              string
	Control           *ABVariant
	Treatment         *ABVariant
	Metrics           []string
	SignificanceLevel SignificanceLevel
	TestType          StatisticalTestType
	Results           map[string]*ABResult
}

ABTest orchestrates A/B experiments comparing agent variants.

func NewABTest

func NewABTest(
	name string,
	controlAgent agenkit.Agent,
	treatmentAgent agenkit.Agent,
	metrics []string,
	significanceLevel SignificanceLevel,
	testType StatisticalTestType,
) *ABTest

NewABTest creates a new A/B test.

func (*ABTest) GetSummary

func (t *ABTest) GetSummary() map[string]interface{}

GetSummary returns experiment summary.

func (*ABTest) Run

func (t *ABTest) Run(testCases []map[string]interface{}, sampleSize int, shuffle bool) (map[string]*ABResult, error)

Run executes the A/B test experiment.

type ABVariant

type ABVariant struct {
	Name     string
	Agent    agenkit.Agent
	Samples  []float64
	Metadata map[string]interface{}
}

ABVariant represents a variant in an A/B test.

func NewABVariant

func NewABVariant(name string, agent agenkit.Agent) *ABVariant

NewABVariant creates a new A/B variant.

func (*ABVariant) AddSample

func (v *ABVariant) AddSample(value float64)

AddSample adds a measurement sample.

func (*ABVariant) Mean

func (v *ABVariant) Mean() float64

Mean returns the mean of samples.

func (*ABVariant) SampleSize

func (v *ABVariant) SampleSize() int

SampleSize returns the number of samples.

func (*ABVariant) StdDev

func (v *ABVariant) StdDev() float64

StdDev returns the standard deviation of samples.
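
Example (manually populating a variant; controlAgent is assumed to implement agenkit.Agent, and the sample values are illustrative):

variant := evaluation.NewABVariant("control", controlAgent)
variant.AddSample(0.91)
variant.AddSample(0.88)
variant.AddSample(0.93)
fmt.Printf("n=%d mean=%.2f stddev=%.3f\n", variant.SampleSize(), variant.Mean(), variant.StdDev())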

type AccuracyMetric

type AccuracyMetric struct {
	// contains filtered or unexported fields
}

AccuracyMetric measures task accuracy.

Compares agent output to expected output to determine correctness. Supports multiple validation methods:

  • Exact string matching
  • Substring matching (case-insensitive)
  • Custom validator functions

Example:

metric := NewAccuracyMetric(nil, false)
score, _ := metric.Measure(
    agent,
    inputMsg,
    outputMsg,
    map[string]interface{}{"expected": "Paris"},
)
fmt.Printf("Accuracy: %.0f\n", score)  // 0.0 or 1.0

func NewAccuracyMetric

func NewAccuracyMetric(validator ValidatorFunc, caseSensitive bool) *AccuracyMetric

NewAccuracyMetric creates a new accuracy metric.

Args:

validator: Custom validation function(expected, actual) -> bool
caseSensitive: Whether string matching is case-sensitive

Example:

metric := NewAccuracyMetric(nil, false)

func (*AccuracyMetric) Aggregate

func (m *AccuracyMetric) Aggregate(measurements []float64) map[string]float64

Aggregate aggregates accuracy measurements.

Args:

measurements: List of 0.0/1.0 values

Returns:

Accuracy statistics: accuracy, total, correct, incorrect
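
Example (keys follow the statistics listed above):

stats := metric.Aggregate([]float64{1, 1, 0, 1})
fmt.Printf("accuracy=%.2f (%.0f/%.0f correct)\n", stats["accuracy"], stats["correct"], stats["total"])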

func (*AccuracyMetric) Measure

func (m *AccuracyMetric) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)

Measure measures accuracy for single interaction.

Args:

agent: Agent being evaluated
inputMessage: Input to agent
outputMessage: Agent's response
ctx: Must contain "expected" key with expected output

Returns:

1.0 if correct, 0.0 if incorrect

func (*AccuracyMetric) Name

func (m *AccuracyMetric) Name() string

Name returns the metric name.

type AcquisitionFunction

type AcquisitionFunction string

AcquisitionFunction specifies the acquisition function type for Bayesian optimization.

const (
	// AcquisitionEI represents Expected Improvement
	AcquisitionEI AcquisitionFunction = "ei"
	// AcquisitionUCB represents Upper Confidence Bound
	AcquisitionUCB AcquisitionFunction = "ucb"
	// AcquisitionPI represents Probability of Improvement
	AcquisitionPI AcquisitionFunction = "pi"
)

type AgentFactory

type AgentFactory func(prompt string) agenkit.Agent

AgentFactory is a function that creates an agent from a prompt string.

type BayesianOptimizer

type BayesianOptimizer struct {
	// contains filtered or unexported fields
}

BayesianOptimizer implements Bayesian optimization for hyperparameter tuning.

This implementation uses a simplified surrogate model based on local statistics rather than full Gaussian Process regression. It balances exploration and exploitation through acquisition functions.

Algorithm:

  1. Sample n_initial random configurations
  2. Evaluate and build local statistics
  3. Use acquisition function to select next config
  4. Evaluate new config
  5. Update statistics and repeat

func NewBayesianOptimizer

func NewBayesianOptimizer(config BayesianOptimizerConfig) (*BayesianOptimizer, error)

NewBayesianOptimizer creates a new Bayesian optimizer.

func (*BayesianOptimizer) Optimize

func (b *BayesianOptimizer) Optimize(ctx context.Context, nIterations int) (*OptimizationResult, error)

Optimize runs the Bayesian optimization process.

type BayesianOptimizerConfig

type BayesianOptimizerConfig struct {
	SearchSpace *SearchSpace
	Objective   ObjectiveFunc
	Maximize    bool
	Acquisition AcquisitionFunction
	NInitial    int
	Xi          float64 // Exploration parameter for EI and PI (default: 0.01)
	Kappa       float64 // Exploration parameter for UCB (default: 2.576)
}

BayesianOptimizerConfig contains configuration for BayesianOptimizer.
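
Example (a minimal sketch; searchSpace and objectiveFunc are assumed to be defined as in the package-level optimization example, and the NInitial value is illustrative):

optimizer, err := evaluation.NewBayesianOptimizer(evaluation.BayesianOptimizerConfig{
    SearchSpace: searchSpace,
    Objective:   objectiveFunc,
    Maximize:    true,
    Acquisition: evaluation.AcquisitionEI,
    NInitial:    5,
})
if err != nil {
    log.Fatal(err)
}

result, err := optimizer.Optimize(ctx, 30)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Best score: %.3f\n", result.BestScore)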

type Benchmark

type Benchmark interface {
	// Name returns the benchmark name.
	Name() string

	// Description returns the benchmark description.
	Description() string

	// GenerateTestCases generates test cases for this benchmark.
	//
	// Returns:
	//   List of test cases
	GenerateTestCases() ([]*TestCase, error)
}

Benchmark is the interface for benchmarks.

Benchmarks define test suites for evaluating specific capabilities.

type BenchmarkSuite

type BenchmarkSuite struct {
	// contains filtered or unexported fields
}

BenchmarkSuite is a collection of benchmarks for comprehensive evaluation.

Provides standard and extreme-scale benchmark suites.

func BenchmarkSuiteExtremeScale

func BenchmarkSuiteExtremeScale() *BenchmarkSuite

BenchmarkSuiteExtremeScale creates an extreme-scale benchmark suite for endless.

Tests at 1M-25M+ tokens with compression and retrieval.

func BenchmarkSuiteQuick

func BenchmarkSuiteQuick() *BenchmarkSuite

BenchmarkSuiteQuick creates a quick benchmark suite for fast iteration.

Small test set for rapid feedback during development.

func BenchmarkSuiteStandard

func BenchmarkSuiteStandard() *BenchmarkSuite

BenchmarkSuiteStandard creates a standard benchmark suite.

Includes basic Q&A and small-scale retrieval tests.

func NewBenchmarkSuite

func NewBenchmarkSuite(benchmarks []Benchmark, name string) *BenchmarkSuite

NewBenchmarkSuite creates a new benchmark suite.

Args:

benchmarks: List of benchmarks to include
name: Suite name

Example:

suite := NewBenchmarkSuite([]Benchmark{NewSimpleQABenchmark()}, "custom")

func (*BenchmarkSuite) AddBenchmark

func (s *BenchmarkSuite) AddBenchmark(benchmark Benchmark)

AddBenchmark adds benchmark to suite.

func (*BenchmarkSuite) GenerateAllTestCases

func (s *BenchmarkSuite) GenerateAllTestCases() ([]map[string]interface{}, error)

GenerateAllTestCases generates all test cases from all benchmarks.

Returns:

Combined list of test cases from all benchmarks

func (*BenchmarkSuite) GetBenchmark

func (s *BenchmarkSuite) GetBenchmark(name string) Benchmark

GetBenchmark gets benchmark by name.

func (*BenchmarkSuite) RemoveBenchmark

func (s *BenchmarkSuite) RemoveBenchmark(name string)

RemoveBenchmark removes benchmark from suite.

func (*BenchmarkSuite) ToDict

func (s *BenchmarkSuite) ToDict() map[string]interface{}

ToDict converts suite to dictionary.

type CompressionMetrics

type CompressionMetrics struct {
	// contains filtered or unexported fields
}

CompressionMetrics measures compression quality at extreme scale.

Critical for endless and similar systems that use 100x-1000x compression at 25M+ tokens. Measures:

  • Compression ratio achieved
  • Information retention after compression
  • Retrieval accuracy from compressed context
  • Quality degradation as context grows

Example:

metrics := NewCompressionMetrics([]int{1000000, 10000000, 25000000}, 10)
stats, _ := metrics.EvaluateAtLengths(agent, "session-123", nil)
for length, stat := range stats {
    fmt.Printf("%dM tokens: %.1fx compression\n", length/1e6, stat.CompressionRatio)
}

func NewCompressionMetrics

func NewCompressionMetrics(testLengths []int, needleCount int) *CompressionMetrics

NewCompressionMetrics creates a new compression metrics instance.

Args:

testLengths: Context lengths to test (defaults to 1M, 10M, 25M)
needleCount: Number of "needle" facts to test retrieval

Example:

metrics := NewCompressionMetrics([]int{1000000, 10000000}, 10)

func (*CompressionMetrics) Aggregate

func (m *CompressionMetrics) Aggregate(measurements []float64) map[string]float64

Aggregate aggregates compression ratios.

Args:

measurements: List of compression ratios

Returns:

Statistics: mean, min, max, std

func (*CompressionMetrics) EvaluateAtLengths

func (m *CompressionMetrics) EvaluateAtLengths(agent agenkit.Agent, sessionID string, needleContent []string) (map[int]*CompressionStats, error)

EvaluateAtLengths evaluates compression quality at multiple context lengths.

Tests compression and retrieval at 1M, 10M, 25M tokens to detect quality degradation as context grows.

Args:

agent: Agent with compression capability
sessionID: Session to evaluate
needleContent: Specific facts to test retrieval (optional)

Returns:

Dictionary mapping context_length -> CompressionStats

func (*CompressionMetrics) Measure

func (m *CompressionMetrics) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)

Measure measures compression quality for single interaction.

Returns:

Compression ratio (raw_tokens / compressed_tokens)

func (*CompressionMetrics) Name

func (m *CompressionMetrics) Name() string

Name returns the metric name.

type CompressionStats

type CompressionStats struct {
	RawTokens           int
	CompressedTokens    int
	CompressionRatio    float64
	RetrievalAccuracy   float64
	ContextLengthTested int
	Timestamp           time.Time
}

CompressionStats contains statistics from compression evaluation.

func (*CompressionStats) ToDict

func (c *CompressionStats) ToDict() map[string]interface{}

ToDict converts stats to dictionary.

type ContextMetrics

type ContextMetrics struct{}

ContextMetrics tracks context length and growth over agent lifecycle.

Essential for extreme-scale systems (endless) that operate at 1M-25M+ token contexts. Measures:

  • Raw context token count
  • Compressed context token count (if compression used)
  • Compression ratio
  • Context growth rate

Example:

metrics := NewContextMetrics()
result, _ := metrics.Measure(agent, inputMsg, outputMsg, ctx)
fmt.Printf("Context length: %.0f tokens\n", result)

func NewContextMetrics

func NewContextMetrics() *ContextMetrics

NewContextMetrics creates a new context metrics instance.

func (*ContextMetrics) Aggregate

func (m *ContextMetrics) Aggregate(measurements []float64) map[string]float64

Aggregate aggregates context length measurements.

Args:

measurements: List of context lengths over time

Returns:

Statistics: mean, min, max, final, growth_rate

func (*ContextMetrics) Measure

func (m *ContextMetrics) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)

Measure measures context length metrics.

Args:

agent: Agent being evaluated
inputMessage: Input message
outputMessage: Agent response
ctx: Additional context with session history

Returns:

Current context length in tokens (or compressed tokens if available)

func (*ContextMetrics) Name

func (m *ContextMetrics) Name() string

Name returns the metric name.

type ErrorRecord

type ErrorRecord struct {
	// Type of error
	Type string `json:"type"`
	// Error message
	Message string `json:"message"`
	// Additional details
	Details map[string]interface{} `json:"details,omitempty"`
	// Timestamp when error occurred (RFC3339 format)
	Timestamp string `json:"timestamp"`
}

ErrorRecord represents an error that occurred during evaluation.

func NewErrorRecord

func NewErrorRecord(errorType string, message string, details map[string]interface{}) *ErrorRecord

NewErrorRecord creates a new error record with current timestamp.
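
Example (the error type, message, and details are illustrative):

rec := evaluation.NewErrorRecord("timeout", "provider did not respond within 30s", map[string]interface{}{
    "attempt": 2,
})
fmt.Printf("[%s] %s: %s\n", rec.Timestamp, rec.Type, rec.Message)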

type EvaluationResult

type EvaluationResult struct {
	// Identification
	EvaluationID string
	AgentName    string
	Timestamp    time.Time

	// Metrics
	Metrics           map[string][]float64
	AggregatedMetrics map[string]map[string]float64

	// Context information
	ContextLength    *int
	CompressedLength *int
	CompressionRatio *float64

	// Quality scores
	Accuracy     *float64
	QualityScore *float64

	// Performance
	AvgLatencyMs *float64
	P95LatencyMs *float64

	// Test details
	TotalTests  int
	PassedTests int
	FailedTests int

	// Additional metadata
	Metadata map[string]interface{}
}

EvaluationResult contains results from an evaluation run.

Includes metrics, metadata, and analysis.

func (*EvaluationResult) SuccessRate

func (r *EvaluationResult) SuccessRate() float64

SuccessRate calculates test success rate.

func (*EvaluationResult) ToDict

func (r *EvaluationResult) ToDict() map[string]interface{}

ToDict converts result to dictionary.

type Evaluator

type Evaluator struct {
	// contains filtered or unexported fields
}

Evaluator is the core evaluation orchestrator.

Runs benchmarks, collects metrics, and aggregates results.

Example:

evaluator := NewEvaluator(agent, nil, "")
suite := BenchmarkSuiteStandard()
testCases, _ := suite.GenerateAllTestCases()
results, _ := evaluator.Evaluate(testCases, "")
fmt.Printf("Accuracy: %.2f\n", results.Accuracy)

func NewEvaluator

func NewEvaluator(agent agenkit.Agent, metrics []Metric, sessionID string) *Evaluator

NewEvaluator creates a new evaluator.

Args:

agent: Agent to evaluate
metrics: List of metrics to collect (defaults to empty)
sessionID: Optional session ID for context tracking

Example:

evaluator := NewEvaluator(agent, []Metric{NewAccuracyMetric(nil, false)}, "eval-123")

func (*Evaluator) Evaluate

func (e *Evaluator) Evaluate(testCases []map[string]interface{}, evaluationID string) (*EvaluationResult, error)

Evaluate evaluates agent on test cases.

Args:

testCases: List of test cases, each with 'input' and 'expected' keys
evaluationID: Optional evaluation ID

Returns:

EvaluationResult with metrics and analysis

Example:

testCases := []map[string]interface{}{
    {"input": "What is 2+2?", "expected": "4"},
}
result, err := evaluator.Evaluate(testCases, "")

func (*Evaluator) EvaluateSingle

func (e *Evaluator) EvaluateSingle(inputMessage *agenkit.Message, expectedOutput interface{}) (map[string]float64, error)

EvaluateSingle evaluates single interaction.

Args:

inputMessage: Input to agent
expectedOutput: Expected output (optional)

Returns:

Dictionary of metric values

Example:

inputMsg := &agenkit.Message{Role: "user", Content: "Hello"}
metrics, err := evaluator.EvaluateSingle(inputMsg, "Hi there")

type ExtremeScaleBenchmark

type ExtremeScaleBenchmark struct {
	// contains filtered or unexported fields
}

ExtremeScaleBenchmark is an extreme-scale benchmark for testing at 1M-25M+ tokens.

Designed specifically for endless and similar systems that operate at unprecedented context lengths.

func NewExtremeScaleBenchmark

func NewExtremeScaleBenchmark(testLengths []int, needlesPerLength int) *ExtremeScaleBenchmark

NewExtremeScaleBenchmark creates a new extreme-scale benchmark.

Args:

testLengths: Context lengths to test (defaults to 1M, 10M, 25M)
needlesPerLength: Number of needles per context length

Example:

benchmark := NewExtremeScaleBenchmark([]int{1000000, 10000000}, 10)

func (*ExtremeScaleBenchmark) Description

func (b *ExtremeScaleBenchmark) Description() string

Description returns the benchmark description.

func (*ExtremeScaleBenchmark) GenerateTestCases

func (b *ExtremeScaleBenchmark) GenerateTestCases() ([]*TestCase, error)

GenerateTestCases generates extreme-scale test cases.

func (*ExtremeScaleBenchmark) Name

func (b *ExtremeScaleBenchmark) Name() string

Name returns the benchmark name.

type FileRecordingStorage

type FileRecordingStorage struct {
	// contains filtered or unexported fields
}

FileRecordingStorage provides file-based recording storage.

Stores recordings as JSON files on disk.

func NewFileRecordingStorage

func NewFileRecordingStorage(recordingsDir string) *FileRecordingStorage

NewFileRecordingStorage creates a new file storage.

Args:

recordingsDir: Directory to store recordings

Example:

storage := NewFileRecordingStorage("./recordings")

func (*FileRecordingStorage) DeleteRecording

func (s *FileRecordingStorage) DeleteRecording(sessionID string) error

DeleteRecording deletes recording file.

func (*FileRecordingStorage) ListRecordings

func (s *FileRecordingStorage) ListRecordings(limit, offset int) ([]*SessionRecording, error)

ListRecordings lists all recordings.

func (*FileRecordingStorage) LoadRecording

func (s *FileRecordingStorage) LoadRecording(sessionID string) (*SessionRecording, error)

LoadRecording loads recording from file.

func (*FileRecordingStorage) SaveRecording

func (s *FileRecordingStorage) SaveRecording(recording *SessionRecording) error

SaveRecording saves recording to file.
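
Example (a save/load round trip; recording is assumed to come from SessionRecorder.FinalizeSession):

storage := evaluation.NewFileRecordingStorage("./recordings")
if err := storage.SaveRecording(recording); err != nil {
    log.Fatal(err)
}

loaded, err := storage.LoadRecording(recording.SessionID)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Loaded session %s\n", loaded.SessionID)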

type InMemoryRecordingStorage

type InMemoryRecordingStorage struct {
	// contains filtered or unexported fields
}

InMemoryRecordingStorage provides in-memory recording storage for testing.

Does not persist recordings across restarts.

func NewInMemoryRecordingStorage

func NewInMemoryRecordingStorage() *InMemoryRecordingStorage

NewInMemoryRecordingStorage creates a new in-memory storage.

func (*InMemoryRecordingStorage) DeleteRecording

func (s *InMemoryRecordingStorage) DeleteRecording(sessionID string) error

DeleteRecording deletes recording from memory.

func (*InMemoryRecordingStorage) ListRecordings

func (s *InMemoryRecordingStorage) ListRecordings(limit, offset int) ([]*SessionRecording, error)

ListRecordings lists recordings from memory.

func (*InMemoryRecordingStorage) LoadRecording

func (s *InMemoryRecordingStorage) LoadRecording(sessionID string) (*SessionRecording, error)

LoadRecording loads recording from memory.

func (*InMemoryRecordingStorage) SaveRecording

func (s *InMemoryRecordingStorage) SaveRecording(recording *SessionRecording) error

SaveRecording saves recording to memory.

type InformationRetentionBenchmark

type InformationRetentionBenchmark struct {
	// contains filtered or unexported fields
}

InformationRetentionBenchmark tests information retention across long conversations.

Verifies that agents remember and can recall information from earlier in the conversation, even after compression.

func NewInformationRetentionBenchmark

func NewInformationRetentionBenchmark(conversationLength int, recallPoints []int) *InformationRetentionBenchmark

NewInformationRetentionBenchmark creates a new information retention benchmark.

Args:

conversationLength: Number of conversation turns
recallPoints: Turns at which to test recall (defaults to checkpoints)

Example:

benchmark := NewInformationRetentionBenchmark(100, []int{10, 25, 50, 75, 100})

func (*InformationRetentionBenchmark) Description

func (b *InformationRetentionBenchmark) Description() string

Description returns the benchmark description.

func (*InformationRetentionBenchmark) GenerateTestCases

func (b *InformationRetentionBenchmark) GenerateTestCases() ([]*TestCase, error)

GenerateTestCases generates information retention test cases.

func (*InformationRetentionBenchmark) Name

func (b *InformationRetentionBenchmark) Name() string

Name returns the benchmark name.

type InteractionRecord

type InteractionRecord struct {
	InteractionID string
	SessionID     string
	InputMessage  map[string]interface{}
	OutputMessage map[string]interface{}
	Timestamp     time.Time
	LatencyMs     float64
	Metadata      map[string]interface{}
}

InteractionRecord represents a record of a single agent interaction.

Contains input, output, timing, and metadata.

func InteractionRecordFromDict

func InteractionRecordFromDict(data map[string]interface{}) (*InteractionRecord, error)

InteractionRecordFromDict creates record from dictionary.

func (*InteractionRecord) ToDict

func (r *InteractionRecord) ToDict() map[string]interface{}

ToDict converts record to dictionary.
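
Example (record is assumed to be an existing *InteractionRecord):

data := record.ToDict()
restored, err := evaluation.InteractionRecordFromDict(data)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Restored interaction %s\n", restored.InteractionID)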

type LatencyMetric

type LatencyMetric struct{}

LatencyMetric measures agent response latency.

Tracks processing time per interaction. Critical for production systems where response time matters.

func NewLatencyMetric

func NewLatencyMetric() *LatencyMetric

NewLatencyMetric creates a new latency metric instance.

func (*LatencyMetric) Aggregate

func (m *LatencyMetric) Aggregate(measurements []float64) map[string]float64

Aggregate aggregates latency measurements.

Returns:

mean, p50, p95, p99, min, max

func (*LatencyMetric) Measure

func (m *LatencyMetric) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)

Measure gets latency for this interaction.

Returns:

Latency in milliseconds

func (*LatencyMetric) Name

func (m *LatencyMetric) Name() string

Name returns the metric name.

type Metric

type Metric interface {
	// Name returns the metric name.
	Name() string

	// Measure measures metric for a single agent interaction.
	//
	// Args:
	//   agent: The agent being evaluated
	//   inputMessage: Input to the agent
	//   outputMessage: Agent's response
	//   ctx: Additional context (session history, etc.)
	//
	// Returns:
	//   Metric value (typically 0.0 to 1.0)
	Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)

	// Aggregate aggregates multiple measurements.
	//
	// Args:
	//   measurements: List of individual measurements
	//
	// Returns:
	//   Aggregated statistics (mean, std, min, max, etc.)
	Aggregate(measurements []float64) map[string]float64
}

Metric is the interface for evaluation metrics.

Metrics measure specific aspects of agent performance:

  • Accuracy
  • Latency
  • Context usage
  • Quality scores
  • etc.
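
A minimal custom-metric sketch (the character-count heuristic and the ResponseLengthMetric name are purely illustrative):

// ResponseLengthMetric is a hypothetical metric that scores responses by character count.
type ResponseLengthMetric struct{}

func (m *ResponseLengthMetric) Name() string { return "response_length" }

// Measure returns the length of the agent's response; agent and ctx are unused here.
func (m *ResponseLengthMetric) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error) {
    return float64(len(outputMessage.Content)), nil
}

// Aggregate reports the mean response length across measurements.
func (m *ResponseLengthMetric) Aggregate(measurements []float64) map[string]float64 {
    var sum float64
    for _, v := range measurements {
        sum += v
    }
    mean := 0.0
    if len(measurements) > 0 {
        mean = sum / float64(len(measurements))
    }
    return map[string]float64{"mean": mean}
}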

type MetricMeasurement

type MetricMeasurement struct {
	// Name of the metric
	Name string `json:"name"`
	// Value of the measurement
	Value float64 `json:"value"`
	// Type categorizes the metric
	Type MetricType `json:"type"`
	// Timestamp when measurement was taken (RFC3339 format)
	Timestamp string `json:"timestamp"`
	// Metadata for additional context
	Metadata map[string]interface{} `json:"metadata,omitempty"`
}

MetricMeasurement represents a single metric measurement.

Note: This is distinct from the Metric interface in core.go. The Metric interface defines how to measure; MetricMeasurement stores the measurement.

func CreateCostMetric

func CreateCostMetric(cost float64, currency string, metadata map[string]interface{}) *MetricMeasurement

CreateCostMetric creates a cost metric measurement.

Helper to create a cost metric with currency information.

Args:

cost: Cost amount
currency: Currency code (default: "USD")
metadata: Additional metadata

Returns:

Cost metric measurement

Example:

metric := evaluation.CreateCostMetric(0.0042, "USD", nil)

func CreateDurationMetric

func CreateDurationMetric(durationSeconds float64, metadata map[string]interface{}) *MetricMeasurement

CreateDurationMetric creates a duration metric measurement.

Helper to create a duration metric with hours conversion.

Args:

durationSeconds: Duration in seconds
metadata: Additional metadata

Returns:

Duration metric measurement

Example:

metric := evaluation.CreateDurationMetric(125.5, nil)

func CreateQualityMetric

func CreateQualityMetric(name string, score, maxScore float64, metadata map[string]interface{}) *MetricMeasurement

CreateQualityMetric creates a quality score metric measurement.

Helper to create a quality score metric with normalized score (0.0-1.0).

Args:

name: Metric name
score: Raw score
maxScore: Maximum possible score (default: 10.0)
metadata: Additional metadata

Returns:

Metric measurement with normalized score

Example:

metric := evaluation.CreateQualityMetric("response_quality", 8.5, 10.0, nil)

func NewMetricMeasurement

func NewMetricMeasurement(name string, value float64, metricType MetricType) *MetricMeasurement

NewMetricMeasurement creates a new metric measurement with current timestamp.

type MetricType

type MetricType string

MetricType categorizes different types of metrics.

const (
	// MetricTypeSuccessRate measures success/failure rates
	MetricTypeSuccessRate MetricType = "success_rate"
	// MetricTypeQualityScore measures output quality
	MetricTypeQualityScore MetricType = "quality_score"
	// MetricTypeCost measures token/API costs
	MetricTypeCost MetricType = "cost"
	// MetricTypeDuration measures time taken
	MetricTypeDuration MetricType = "duration"
	// MetricTypeErrorRate measures error frequency
	MetricTypeErrorRate MetricType = "error_rate"
	// MetricTypeTaskCompletion measures task completion
	MetricTypeTaskCompletion MetricType = "task_completion"
	// MetricTypeCustom for custom metrics
	MetricTypeCustom MetricType = "custom"
)

type MetricsCollector

type MetricsCollector struct {
	// contains filtered or unexported fields
}

MetricsCollector aggregates metrics across multiple evaluation sessions.

Useful for analyzing agent performance over time and across different scenarios. Thread-safe for concurrent access.

Example:

collector := evaluation.NewMetricsCollector()
collector.AddResult(result1)
collector.AddResult(result2)
stats := collector.GetStatistics()
fmt.Printf("Success rate: %.2f%%\n", stats["success_rate"]*100)

func NewMetricsCollector

func NewMetricsCollector() *MetricsCollector

NewMetricsCollector creates a new metrics collector.

func (*MetricsCollector) AddResult

func (mc *MetricsCollector) AddResult(result *SessionResult)

AddResult adds a session result to the collector. Thread-safe for concurrent access.

func (*MetricsCollector) Clear

func (mc *MetricsCollector) Clear()

Clear removes all collected results. Thread-safe for concurrent access.

func (*MetricsCollector) GetMetricAggregates

func (mc *MetricsCollector) GetMetricAggregates(metricName string) map[string]interface{}

GetMetricAggregates computes aggregated statistics for a specific metric across all sessions. Thread-safe for concurrent access.

Args:

metricName: Name of the metric to aggregate

Returns:

Map with statistics: count, sum, mean, min, max
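
Example (the "accuracy" metric name is illustrative; keys follow the statistics listed above):

agg := collector.GetMetricAggregates("accuracy")
fmt.Printf("accuracy: mean=%v over %v sessions\n", agg["mean"], agg["count"])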

func (*MetricsCollector) GetResults

func (mc *MetricsCollector) GetResults() []SessionResult

GetResults returns all collected session results. Thread-safe for concurrent access.

func (*MetricsCollector) GetStatistics

func (mc *MetricsCollector) GetStatistics() map[string]interface{}

GetStatistics computes aggregated statistics across all collected results. Thread-safe for concurrent access.

Returns a map with statistics including:

  • session_count: Total number of sessions
  • completed_count: Number of completed sessions
  • failed_count: Number of failed sessions
  • success_rate: Ratio of completed to total sessions
  • avg_duration: Average session duration in seconds
  • total_errors: Total number of errors across all sessions
  • avg_errors_per_session: Average errors per session

type NeedleInHaystackBenchmark

type NeedleInHaystackBenchmark struct {
	// contains filtered or unexported fields
}

NeedleInHaystackBenchmark is a needle-in-haystack benchmark for context retrieval.

Tests ability to retrieve specific information from large contexts. Essential for extreme-scale systems like endless.

func NewNeedleInHaystackBenchmark

func NewNeedleInHaystackBenchmark(contextLength, needleCount, haystackMultiplier int) *NeedleInHaystackBenchmark

NewNeedleInHaystackBenchmark creates a new needle-in-haystack benchmark.

Args:

contextLength: Target context length in tokens
needleCount: Number of needles to hide
haystackMultiplier: How much filler per needle

Example:

benchmark := NewNeedleInHaystackBenchmark(10000, 5, 10)

func (*NeedleInHaystackBenchmark) Description

func (b *NeedleInHaystackBenchmark) Description() string

Description returns the benchmark description.

func (*NeedleInHaystackBenchmark) GenerateTestCases

func (b *NeedleInHaystackBenchmark) GenerateTestCases() ([]*TestCase, error)

GenerateTestCases generates needle-in-haystack test cases.

func (*NeedleInHaystackBenchmark) Name

func (b *NeedleInHaystackBenchmark) Name() string

Name returns the benchmark name.

type ObjectiveFunc

type ObjectiveFunc func(ctx context.Context, config map[string]interface{}) (float64, error)

ObjectiveFunc evaluates a configuration and returns a score.

type OptimizationResult

type OptimizationResult struct {
	BestConfig  map[string]interface{}
	BestScore   float64
	History     []OptimizationStep
	NIterations int
	StartTime   time.Time
	EndTime     time.Time
	Metadata    map[string]interface{}
}

OptimizationResult contains the results of an optimization run.

func (*OptimizationResult) Duration

func (r *OptimizationResult) Duration() time.Duration

Duration returns the total optimization duration.

func (*OptimizationResult) GetImprovement

func (r *OptimizationResult) GetImprovement() float64

GetImprovement returns the improvement from initial to best score.

type OptimizationStep

type OptimizationStep struct {
	Config map[string]interface{}
	Score  float64
}

OptimizationStep represents a single evaluation in the optimization.

type OptimizationStrategy

type OptimizationStrategy string

OptimizationStrategy represents prompt optimization strategies.

const (
	// StrategyGrid exhaustive grid search
	StrategyGrid OptimizationStrategy = "grid"
	// StrategyRandom random sampling
	StrategyRandom OptimizationStrategy = "random"
	// StrategyGenetic genetic algorithm
	StrategyGenetic OptimizationStrategy = "genetic"
)

type Optimizer

type Optimizer interface {
	// Optimize runs the optimization process.
	//
	// Args:
	//   ctx: Context for cancellation
	//   nIterations: Number of iterations to run
	//
	// Returns:
	//   OptimizationResult with best configuration and history
	Optimize(ctx context.Context, nIterations int) (*OptimizationResult, error)
}

Optimizer is the base interface for optimization algorithms.

Implementations should provide intelligent search over configuration spaces using various strategies (random search, Bayesian optimization, genetic algorithms, etc.).

type ParameterSpec

type ParameterSpec struct {
	Type   ParameterType
	Low    float64       // For continuous/integer
	High   float64       // For continuous/integer
	Values []interface{} // For discrete/categorical
}

ParameterSpec defines a parameter in the search space.

type ParameterType

type ParameterType string

ParameterType specifies the type of a hyperparameter.

const (
	// ParamTypeContinuous represents a continuous parameter (float)
	ParamTypeContinuous ParameterType = "continuous"
	// ParamTypeInteger represents an integer parameter
	ParamTypeInteger ParameterType = "integer"
	// ParamTypeDiscrete represents a discrete set of values
	ParamTypeDiscrete ParameterType = "discrete"
	// ParamTypeCategorical represents categorical values
	ParamTypeCategorical ParameterType = "categorical"
)

type PrecisionRecallMetric

type PrecisionRecallMetric struct {
	// contains filtered or unexported fields
}

PrecisionRecallMetric measures precision and recall for classification tasks.

Useful for agents that categorize, filter, or make binary decisions.

Example:

metric := NewPrecisionRecallMetric()
// Agent classifies documents as relevant/not relevant
for _, doc := range testDocs {
    score, _ := metric.Measure(agent, doc, output, ctx)
}

func NewPrecisionRecallMetric

func NewPrecisionRecallMetric() *PrecisionRecallMetric

NewPrecisionRecallMetric creates a new precision/recall metric.

func (*PrecisionRecallMetric) Aggregate

func (m *PrecisionRecallMetric) Aggregate(measurements []float64) map[string]float64

Aggregate aggregates precision/recall metrics.

Returns:

Precision, recall, F1 score, and confusion matrix counts

func (*PrecisionRecallMetric) Measure

func (m *PrecisionRecallMetric) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)

Measure measures precision/recall for single classification.

Context must contain:

  • "true_label": Ground truth (True/False or 1/0)
  • "predicted_label": Agent's prediction (True/False or 1/0)

Returns:

1.0 if correct classification, 0.0 if incorrect

func (*PrecisionRecallMetric) Name

func (m *PrecisionRecallMetric) Name() string

Name returns the metric name.

func (*PrecisionRecallMetric) Reset

func (m *PrecisionRecallMetric) Reset()

Reset resets confusion matrix counts.

type PrecisionRecallStats

type PrecisionRecallStats struct {
	TruePositives  int
	FalsePositives int
	FalseNegatives int
	TrueNegatives  int
}

PrecisionRecallStats contains precision and recall statistics.

func (*PrecisionRecallStats) F1Score

func (s *PrecisionRecallStats) F1Score() float64

F1Score calculates F1 score.

func (*PrecisionRecallStats) Precision

func (s *PrecisionRecallStats) Precision() float64

Precision calculates precision.

func (*PrecisionRecallStats) Recall

func (s *PrecisionRecallStats) Recall() float64

Recall calculates recall.

func (*PrecisionRecallStats) ToDict

func (s *PrecisionRecallStats) ToDict() map[string]float64

ToDict converts stats to dictionary.
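
Example (the confusion-matrix counts are illustrative):

stats := &evaluation.PrecisionRecallStats{
    TruePositives:  8,
    FalsePositives: 2,
    FalseNegatives: 1,
    TrueNegatives:  9,
}
fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", stats.Precision(), stats.Recall(), stats.F1Score())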

type PromptEvaluation

type PromptEvaluation struct {
	Prompt string
	Config map[string]string
	Scores map[string]float64
}

PromptEvaluation represents a single prompt evaluation.

type PromptOptimizationResult

type PromptOptimizationResult struct {
	// BestPrompt is the best prompt found
	BestPrompt string
	// BestConfig is the best variable configuration
	BestConfig map[string]string
	// BestScores contains best metric scores
	BestScores map[string]float64
	// History contains all (prompt, config, scores) tuples
	History []PromptEvaluation
	// NEvaluated is the number of prompts evaluated
	NEvaluated int
	// Strategy used
	Strategy string
	// StartTime in milliseconds
	StartTime int64
	// EndTime in milliseconds
	EndTime int64
}

PromptOptimizationResult contains results from prompt optimization.

func (*PromptOptimizationResult) DurationSeconds

func (r *PromptOptimizationResult) DurationSeconds() float64

DurationSeconds returns the duration in seconds.

func (*PromptOptimizationResult) ToDict

func (r *PromptOptimizationResult) ToDict() map[string]interface{}

ToDict converts result to dictionary.

type PromptOptimizer

type PromptOptimizer struct {
	// contains filtered or unexported fields
}

PromptOptimizer optimizes prompts through systematic variation.

Supports multiple optimization strategies:

  • Grid search: exhaustive evaluation of all combinations
  • Random search: random sampling of combinations
  • Genetic algorithm: evolutionary optimization

func NewPromptOptimizer

func NewPromptOptimizer(
	template string,
	variations map[string][]string,
	agentFactory AgentFactory,
	metrics []string,
	objectiveMetric *string,
) *PromptOptimizer

NewPromptOptimizer creates a new prompt optimizer.

Args:

template: Prompt template with {variable} placeholders
variations: Map of variable names to possible values
agentFactory: Function that creates agent from prompt string
metrics: List of metrics to evaluate
objectiveMetric: Primary metric for optimization (nil = first metric)
The optimization direction is set separately via SetMaximize (maximize or minimize the objective).

func (*PromptOptimizer) Optimize

func (p *PromptOptimizer) Optimize(
	ctx context.Context,
	testCases []map[string]interface{},
	strategy string,
	options map[string]interface{},
) (*PromptOptimizationResult, error)

Optimize runs prompt optimization with the specified strategy.

Args:

ctx: Context for cancellation
testCases: Test cases for evaluation
strategy: Optimization strategy ("grid", "random", "genetic")
options: Strategy-specific options (nSamples, populationSize, etc.)
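
A minimal sketch of a random-search run, assuming an optimizer built with NewPromptOptimizer and a testCases slice; the "nSamples" option key follows the option names listed above but the exact key is an assumption:

result, err := optimizer.Optimize(ctx, testCases, "random", map[string]interface{}{
	"nSamples": 20, // assumed option key
})
if err != nil {
	log.Fatal(err)
}
fmt.Printf("best prompt:\n%s\n", result.BestPrompt)
fmt.Printf("evaluated %d prompts in %.1fs\n", result.NEvaluated, result.DurationSeconds())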

func (*PromptOptimizer) OptimizeGenetic

func (p *PromptOptimizer) OptimizeGenetic(
	ctx context.Context,
	testCases []map[string]interface{},
	populationSize int,
	nGenerations int,
	mutationRate float64,
) (*PromptOptimizationResult, error)

OptimizeGenetic performs genetic algorithm optimization.

func (*PromptOptimizer) OptimizeGrid

func (p *PromptOptimizer) OptimizeGrid(
	ctx context.Context,
	testCases []map[string]interface{},
) (*PromptOptimizationResult, error)

OptimizeGrid performs grid search by evaluating all possible combinations.

func (*PromptOptimizer) OptimizeRandom

func (p *PromptOptimizer) OptimizeRandom(
	ctx context.Context,
	testCases []map[string]interface{},
	nSamples int,
) (*PromptOptimizationResult, error)

OptimizeRandom performs random search by sampling random combinations.

func (*PromptOptimizer) SetMaximize

func (p *PromptOptimizer) SetMaximize(maximize bool)

SetMaximize sets whether to maximize (true) or minimize (false) the objective.

type QualityMetrics

type QualityMetrics struct {
	// contains filtered or unexported fields
}

QualityMetrics provides comprehensive quality scoring.

Evaluates multiple quality dimensions:

  • Relevance: How relevant is response to query?
  • Completeness: Does response answer all parts?
  • Coherence: Is response logically structured?
  • Accuracy: Is information factually correct?

Uses rule-based scoring.

Example:

metric := NewQualityMetrics(false, "", nil)
score, _ := metric.Measure(agent, inputMsg, outputMsg, nil)
fmt.Printf("Quality: %.2f\n", score)  // 0.0 to 1.0

func NewQualityMetrics

func NewQualityMetrics(useLLMJudge bool, judgeModel string, weights map[string]float64) *QualityMetrics

NewQualityMetrics creates a new quality metrics instance.

Args:

useLLMJudge: Use LLM to judge quality (not yet implemented)
judgeModel: Model to use for judging (e.g., "claude-sonnet-4")
weights: Weights for each dimension (relevance, completeness, etc.)

Example:

metric := NewQualityMetrics(false, "", nil)
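
To emphasize particular dimensions, a weights map can be supplied; the keys below mirror the dimension names listed above, though the exact key strings and values are assumptions:

weights := map[string]float64{
	"relevance":    0.4,
	"completeness": 0.3,
	"coherence":    0.2,
	"accuracy":     0.1,
}
metric := NewQualityMetrics(false, "", weights)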

func (*QualityMetrics) Aggregate

func (m *QualityMetrics) Aggregate(measurements []float64) map[string]float64

Aggregate aggregates quality measurements.

Args:

measurements: List of quality scores

Returns:

Statistics: mean, min, max, std
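
For example (assuming the documented statistic names are the map keys):

stats := metric.Aggregate([]float64{0.82, 0.91, 0.76})
fmt.Printf("mean=%.2f min=%.2f max=%.2f\n", stats["mean"], stats["min"], stats["max"])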

func (*QualityMetrics) Measure

func (m *QualityMetrics) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)

Measure measures response quality.

Args:

agent: Agent being evaluated
inputMessage: Input query
outputMessage: Agent response
ctx: Optional context

Returns:

Quality score (0.0 to 1.0)

func (*QualityMetrics) Name

func (m *QualityMetrics) Name() string

Name returns the metric name.

type RandomSearchOptimizer

type RandomSearchOptimizer struct {
	// contains filtered or unexported fields
}

RandomSearchOptimizer implements baseline random search optimization.

Randomly samples configurations from the search space and evaluates them. Useful as a baseline for comparison with more sophisticated algorithms.

Example:

optimizer := evaluation.NewRandomSearchOptimizer(
    objectiveFunc,
    searchSpace,
    true, // maximize
)
result, err := optimizer.Optimize(ctx, 20)

func NewRandomSearchOptimizer

func NewRandomSearchOptimizer(
	objective ObjectiveFunc,
	searchSpace *SearchSpace,
	maximize bool,
) *RandomSearchOptimizer

NewRandomSearchOptimizer creates a new random search optimizer.

Args:

objective: Function to evaluate configurations
searchSpace: SearchSpace defining parameter space
maximize: Whether to maximize (true) or minimize (false) objective

Returns:

RandomSearchOptimizer instance

func (*RandomSearchOptimizer) GetHistory

func (r *RandomSearchOptimizer) GetHistory() []OptimizationStep

GetHistory returns the optimization history.

func (*RandomSearchOptimizer) Optimize

func (r *RandomSearchOptimizer) Optimize(ctx context.Context, nIterations int) (*OptimizationResult, error)

Optimize runs random search optimization.

Randomly samples nIterations configurations from the search space, evaluates each, and tracks the best configuration found.

Args:

ctx: Context for cancellation
nIterations: Number of configurations to sample and evaluate

Returns:

OptimizationResult with best config, score, and history
error if optimization fails

type RecordingStorage

type RecordingStorage interface {
	// SaveRecording saves recording.
	SaveRecording(recording *SessionRecording) error

	// LoadRecording loads recording by session ID.
	LoadRecording(sessionID string) (*SessionRecording, error)

	// ListRecordings lists recordings.
	ListRecordings(limit, offset int) ([]*SessionRecording, error)

	// DeleteRecording deletes recording.
	DeleteRecording(sessionID string) error
}

RecordingStorage is the interface for recording storage backends.

Implement this to create custom storage (Redis, S3, Postgres, etc.).
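
The interface is small enough that a minimal in-memory backend fits in a few dozen lines; the sketch below is illustrative only (the type name and error text are assumptions, and it is not safe for concurrent use):

type memoryRecordingStorage struct {
	recordings map[string]*evaluation.SessionRecording
}

func newMemoryRecordingStorage() *memoryRecordingStorage {
	return &memoryRecordingStorage{recordings: map[string]*evaluation.SessionRecording{}}
}

func (m *memoryRecordingStorage) SaveRecording(rec *evaluation.SessionRecording) error {
	m.recordings[rec.SessionID] = rec
	return nil
}

func (m *memoryRecordingStorage) LoadRecording(sessionID string) (*evaluation.SessionRecording, error) {
	rec, ok := m.recordings[sessionID]
	if !ok {
		return nil, fmt.Errorf("recording %q not found", sessionID)
	}
	return rec, nil
}

func (m *memoryRecordingStorage) ListRecordings(limit, offset int) ([]*evaluation.SessionRecording, error) {
	// Map iteration order is unspecified; a real backend would sort.
	all := make([]*evaluation.SessionRecording, 0, len(m.recordings))
	for _, rec := range m.recordings {
		all = append(all, rec)
	}
	if offset > len(all) {
		offset = len(all)
	}
	end := len(all)
	if limit > 0 && offset+limit < end {
		end = offset + limit
	}
	return all[offset:end], nil
}

func (m *memoryRecordingStorage) DeleteRecording(sessionID string) error {
	delete(m.recordings, sessionID)
	return nil
}

A recorder using it would then be created with NewSessionRecorder(newMemoryRecordingStorage()).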

type Regression

type Regression struct {
	MetricName         string
	BaselineValue      float64
	CurrentValue       float64
	DegradationPercent float64
	Severity           Severity
	Timestamp          time.Time
	Context            map[string]interface{}
}

Regression represents a detected regression in agent performance.

Contains information about what degraded and by how much.

func (*Regression) IsRegression

func (r *Regression) IsRegression() bool

IsRegression checks if this is a real regression (not an improvement).

func (*Regression) ToDict

func (r *Regression) ToDict() map[string]interface{}

ToDict converts regression to dictionary.

type RegressionDetector

type RegressionDetector struct {
	// contains filtered or unexported fields
}

RegressionDetector detects performance regressions by comparing results.

Monitors agent quality over time and alerts when performance degrades beyond acceptable thresholds.

Example:

detector := NewRegressionDetector(nil, nil)
detector.SetBaseline(baselineResult)

// Later, after changes
regressions := detector.Detect(currentResult, true)
if len(regressions) > 0 {
    fmt.Printf("Found %d regressions!\n", len(regressions))
    for _, r := range regressions {
        fmt.Printf("  %s: %.1f%% worse\n", r.MetricName, r.DegradationPercent)
    }
}

func NewRegressionDetector

func NewRegressionDetector(thresholds map[string]float64, baseline *EvaluationResult) *RegressionDetector

NewRegressionDetector creates a new regression detector.

Args:

thresholds: Acceptable degradation per metric (default 10%)
baseline: Baseline evaluation result to compare against

Example:

detector := NewRegressionDetector(nil, baselineResult)

func (*RegressionDetector) ClearHistory

func (d *RegressionDetector) ClearHistory()

ClearHistory clears evaluation history.

func (*RegressionDetector) CompareResults

func (d *RegressionDetector) CompareResults(resultA, resultB *EvaluationResult) map[string]map[string]float64

CompareResults compares two evaluation results.

Args:

resultA: First result (baseline)
resultB: Second result (comparison)

Returns:

Dictionary of metric comparisons

func (*RegressionDetector) Detect

func (d *RegressionDetector) Detect(result *EvaluationResult, storeHistory bool) []*Regression

Detect detects regressions in evaluation result.

Compares current result to baseline and identifies metrics that have degraded beyond acceptable thresholds.

Args:

result: Current evaluation result
storeHistory: Whether to store result in history

Returns:

List of detected regressions (empty if no regressions)

func (*RegressionDetector) GetSummary

func (d *RegressionDetector) GetSummary() map[string]interface{}

GetSummary gets a summary of the detector state.

Returns:

Summary with baseline info and history count

func (*RegressionDetector) GetTrend

func (d *RegressionDetector) GetTrend(metricName string, window int) map[string]interface{}

GetTrend gets the trend for a metric over recent history.

Args:

metricName: Metric to analyze
window: Number of recent results to analyze

Returns:

Trend statistics (slope, direction, variance)
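
For example, with a detector that has accumulated history (the metric name is illustrative):

trend := detector.GetTrend("accuracy", 5) // analyze the 5 most recent stored results
fmt.Printf("accuracy trend: %v\n", trend)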

func (*RegressionDetector) SetBaseline

func (d *RegressionDetector) SetBaseline(result *EvaluationResult)

SetBaseline sets baseline for comparison.

Args:

result: Evaluation result to use as baseline

type SearchSpace

type SearchSpace struct {
	Parameters map[string]ParameterSpec
}

SearchSpace defines the hyperparameter search space.

func NewSearchSpace

func NewSearchSpace() *SearchSpace

NewSearchSpace creates a new search space.

func (*SearchSpace) AddCategorical

func (s *SearchSpace) AddCategorical(name string, values []string)

AddCategorical adds a categorical parameter with specific values.

func (*SearchSpace) AddContinuous

func (s *SearchSpace) AddContinuous(name string, low, high float64)

AddContinuous adds a continuous parameter with range [low, high].

func (*SearchSpace) AddDiscrete

func (s *SearchSpace) AddDiscrete(name string, values []interface{})

AddDiscrete adds a discrete parameter with specific values.

func (*SearchSpace) AddInteger

func (s *SearchSpace) AddInteger(name string, low, high int)

AddInteger adds an integer parameter with range [low, high].

func (*SearchSpace) Sample

func (s *SearchSpace) Sample() map[string]interface{}

Sample generates a random configuration from the search space.
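
A short sketch combining several parameter kinds (the parameter names and ranges are illustrative):

space := evaluation.NewSearchSpace()
space.AddContinuous("temperature", 0.0, 1.0)
space.AddInteger("max_tokens", 256, 4096)
space.AddCategorical("style", []string{"concise", "detailed"})

config := space.Sample() // random configuration keyed by parameter name
fmt.Printf("sampled: %v\n", config)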

type SessionRecorder

type SessionRecorder struct {
	// contains filtered or unexported fields
}

SessionRecorder records agent sessions for replay and analysis.

Automatically records all interactions with an agent, storing inputs, outputs, timing, and metadata.

Example:

recorder := NewSessionRecorder(NewFileRecordingStorage("./recordings"))
wrappedAgent := recorder.Wrap(agent)

// Use agent normally (automatically recorded)
response, _ := wrappedAgent.Process(ctx, message)

// Save recording
recorder.FinalizeSession("test-123")

func NewSessionRecorder

func NewSessionRecorder(storage RecordingStorage) *SessionRecorder

NewSessionRecorder creates a new session recorder.

Args:

storage: Storage backend (nil = in-memory)

Example:

recorder := NewSessionRecorder(nil)

func (*SessionRecorder) DeleteRecording

func (r *SessionRecorder) DeleteRecording(sessionID string) error

DeleteRecording deletes a recording.

func (*SessionRecorder) FinalizeSession

func (r *SessionRecorder) FinalizeSession(sessionID string) (*SessionRecording, error)

FinalizeSession finalizes and saves the session recording.

Args:

sessionID: Session to finalize

Returns:

Session recording

func (*SessionRecorder) ListRecordings

func (r *SessionRecorder) ListRecordings(limit, offset int) ([]*SessionRecording, error)

ListRecordings lists all recordings.

func (*SessionRecorder) LoadRecording

func (r *SessionRecorder) LoadRecording(sessionID string) (*SessionRecording, error)

LoadRecording loads a recording from storage.

Args:

sessionID: Session to load

Returns:

Session recording if found

func (*SessionRecorder) RecordInteraction

func (r *SessionRecorder) RecordInteraction(sessionID string, inputMessage, outputMessage *agenkit.Message, latencyMs float64, metadata map[string]interface{})

RecordInteraction records a single interaction.

Args:

sessionID: Session identifier
inputMessage: Input to agent
outputMessage: Agent response
latencyMs: Processing time in milliseconds
metadata: Optional interaction metadata

func (*SessionRecorder) StartSession

func (r *SessionRecorder) StartSession(sessionID, agentName string, metadata map[string]interface{})

StartSession starts a recording session.

Args:

sessionID: Session identifier
agentName: Name of agent being recorded
metadata: Optional session metadata

func (*SessionRecorder) Wrap

func (r *SessionRecorder) Wrap(agent agenkit.Agent) agenkit.Agent

Wrap wraps an agent to record its interactions.

Args:

agent: Agent to wrap

Returns:

Wrapped agent that records all interactions

type SessionRecording

type SessionRecording struct {
	SessionID    string
	AgentName    string
	StartTime    time.Time
	EndTime      *time.Time
	Interactions []*InteractionRecord
	Metadata     map[string]interface{}
}

SessionRecording represents a recording of an entire session.

Contains all interactions and session metadata.

func SessionRecordingFromDict

func SessionRecordingFromDict(data map[string]interface{}) (*SessionRecording, error)

SessionRecordingFromDict creates recording from dictionary.

func (*SessionRecording) DurationSeconds

func (r *SessionRecording) DurationSeconds() float64

DurationSeconds calculates session duration in seconds.

func (*SessionRecording) InteractionCount

func (r *SessionRecording) InteractionCount() int

InteractionCount gets the number of interactions.

func (*SessionRecording) ToDict

func (r *SessionRecording) ToDict() map[string]interface{}

ToDict converts recording to dictionary.

func (*SessionRecording) TotalLatencyMs

func (r *SessionRecording) TotalLatencyMs() float64

TotalLatencyMs gets the total latency across all interactions.
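
Together these accessors make it easy to derive summary statistics; a small sketch, assuming a previously saved recording:

recording, _ := recorder.LoadRecording("test-123")
if n := recording.InteractionCount(); n > 0 {
	fmt.Printf("%d interactions over %.1fs, avg latency %.0f ms\n",
		n, recording.DurationSeconds(), recording.TotalLatencyMs()/float64(n))
}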

type SessionReplay

type SessionReplay struct{}

SessionReplay replays recorded sessions for analysis and A/B testing.

Takes a recorded session and replays it through a (possibly different) agent to compare behavior.

Example:

replay := NewSessionReplay()
recording, _ := recorder.LoadRecording("test-123")

// Replay with original agent
resultsA, _ := replay.Replay(recording, agentV1, "")

// Replay with new agent (A/B test)
resultsB, _ := replay.Replay(recording, agentV2, "")

// Compare
comparison := replay.Compare(resultsA, resultsB)

func NewSessionReplay

func NewSessionReplay() *SessionReplay

NewSessionReplay creates a new session replay.

func (*SessionReplay) Compare

func (r *SessionReplay) Compare(resultsA, resultsB map[string]interface{}) map[string]interface{}

Compare compares two replay results.

Useful for A/B testing different agent versions.

Args:

resultsA: First replay results
resultsB: Second replay results

Returns:

Comparison metrics

func (*SessionReplay) Replay

func (r *SessionReplay) Replay(recording *SessionRecording, agent agenkit.Agent, sessionID string) (map[string]interface{}, error)

Replay replays a session through an agent.

Args:

recording: Session recording to replay
agent: Agent to replay through
sessionID: Optional session ID (defaults to original)

Returns:

Replay results with outputs and metrics

type SessionResult

type SessionResult struct {
	// SessionID uniquely identifies this session
	SessionID string `json:"session_id"`
	// AgentName identifies the agent being evaluated
	AgentName string `json:"agent_name"`
	// Status of the session
	Status SessionStatus `json:"status"`
	// StartTime when session started (RFC3339 format)
	StartTime string `json:"start_time"`
	// EndTime when session ended (RFC3339 format, nil if still running)
	EndTime *string `json:"end_time,omitempty"`
	// Measurements collected during session
	Measurements []MetricMeasurement `json:"measurements"`
	// Errors that occurred during session
	Errors []ErrorRecord `json:"errors"`
	// Metadata for additional context
	Metadata map[string]interface{} `json:"metadata,omitempty"`
}

SessionResult contains results from evaluating an agent session with enhanced tracking.

This extends the core EvaluationResult with session status, error tracking, and richer metadata for long-running agent evaluations.

func FromJSON

func FromJSON(jsonStr string) (*SessionResult, error)

FromJSON deserializes a session result from JSON.

func NewSessionResult

func NewSessionResult(sessionID string, agentName string) *SessionResult

NewSessionResult creates a new session result.

func (*SessionResult) AddError

func (sr *SessionResult) AddError(errorType string, message string, details map[string]interface{})

AddError records an error that occurred during the session.

func (*SessionResult) AddMetricMeasurement

func (sr *SessionResult) AddMetricMeasurement(measurement *MetricMeasurement)

AddMetricMeasurement adds a metric measurement to this session.

func (*SessionResult) DurationSeconds

func (sr *SessionResult) DurationSeconds() *float64

DurationSeconds calculates session duration in seconds.

Returns nil if session hasn't ended yet.

func (*SessionResult) GetMetric

func (sr *SessionResult) GetMetric(name string) *MetricMeasurement

GetMetric retrieves a specific metric measurement by name (returns first match).

func (*SessionResult) GetMetricsByType

func (sr *SessionResult) GetMetricsByType(metricType MetricType) []MetricMeasurement

GetMetricsByType retrieves all measurements of a specific type.

func (*SessionResult) SetStatus

func (sr *SessionResult) SetStatus(status SessionStatus)

SetStatus updates the session status.

func (*SessionResult) ToJSON

func (sr *SessionResult) ToJSON() (string, error)

ToJSON serializes the session result to JSON.
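
A round-trip sketch showing error tracking and JSON serialization (the session ID and error fields are illustrative):

result := evaluation.NewSessionResult("session-456", "my-agent")
result.AddError("timeout", "tool call exceeded deadline", map[string]interface{}{"tool": "search"})
result.SetStatus(evaluation.SessionStatusFailed)

data, _ := result.ToJSON()
restored, _ := evaluation.FromJSON(data)
fmt.Printf("status=%s errors=%d\n", restored.Status, len(restored.Errors))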

type SessionStatus

type SessionStatus string

SessionStatus represents the status of an evaluation session.

const (
	// SessionStatusRunning indicates session is currently running
	SessionStatusRunning SessionStatus = "running"
	// SessionStatusCompleted indicates session completed successfully
	SessionStatusCompleted SessionStatus = "completed"
	// SessionStatusFailed indicates session failed
	SessionStatusFailed SessionStatus = "failed"
	// SessionStatusTimeout indicates session timed out
	SessionStatusTimeout SessionStatus = "timeout"
	// SessionStatusCancelled indicates session was cancelled
	SessionStatusCancelled SessionStatus = "cancelled"
)

type Severity

type Severity string

Severity represents regression severity levels.

const (
	SeverityNone     Severity = "none"
	SeverityMinor    Severity = "minor"    // <10% degradation
	SeverityModerate Severity = "moderate" // 10-20% degradation
	SeverityMajor    Severity = "major"    // 20-50% degradation
	SeverityCritical Severity = "critical" // >50% degradation
)

type SignificanceLevel

type SignificanceLevel float64

SignificanceLevel represents statistical significance thresholds.

const (
	// SignificanceLevel0001 represents 99.9% confidence
	SignificanceLevel0001 SignificanceLevel = 0.001
	// SignificanceLevel001 represents 99% confidence
	SignificanceLevel001 SignificanceLevel = 0.01
	// SignificanceLevel005 represents 95% confidence (default)
	SignificanceLevel005 SignificanceLevel = 0.05
	// SignificanceLevel010 represents 90% confidence
	SignificanceLevel010 SignificanceLevel = 0.10
)

type SimpleQABenchmark

type SimpleQABenchmark struct{}

SimpleQABenchmark is a simple question-answering benchmark.

Tests basic knowledge and reasoning.

func NewSimpleQABenchmark

func NewSimpleQABenchmark() *SimpleQABenchmark

NewSimpleQABenchmark creates a new simple Q&A benchmark.

func (*SimpleQABenchmark) Description

func (b *SimpleQABenchmark) Description() string

Description returns the benchmark description.

func (*SimpleQABenchmark) GenerateTestCases

func (b *SimpleQABenchmark) GenerateTestCases() ([]*TestCase, error)

GenerateTestCases generates simple Q&A test cases.
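
A short usage sketch:

benchmark := NewSimpleQABenchmark()
cases, err := benchmark.GenerateTestCases()
if err == nil {
	fmt.Printf("%s: %d cases\n", benchmark.Name(), len(cases))
	for _, tc := range cases {
		fmt.Println(" -", tc.Input)
	}
}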

func (*SimpleQABenchmark) Name

func (b *SimpleQABenchmark) Name() string

Name returns the benchmark name.

type StatisticalTestType

type StatisticalTestType string

StatisticalTestType represents types of statistical tests.

const (
	// TestTypeTTest is a parametric test assuming normal distribution
	TestTypeTTest StatisticalTestType = "t_test"
	// TestTypeMannWhitney is a non-parametric test
	TestTypeMannWhitney StatisticalTestType = "mann_whitney"
)

type TestCase

type TestCase struct {
	Input    string
	Expected interface{} // String or validation function
	Metadata map[string]interface{}
	Tags     []string
}

TestCase represents a single test case for evaluation.

Contains input, expected output, and metadata.

func (*TestCase) ToDict

func (t *TestCase) ToDict() map[string]interface{}

ToDict converts test case to dictionary.

type ValidatorFunc

type ValidatorFunc func(expected, actual string) bool

ValidatorFunc is a custom validation function type.
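
For example, a tolerant validator that accepts a response when the expected answer appears anywhere in it, case-insensitively:

var contains evaluation.ValidatorFunc = func(expected, actual string) bool {
	return strings.Contains(strings.ToLower(actual), strings.ToLower(expected))
}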
