Documentation
Overview ¶
Package evaluation provides an A/B testing framework for comparing agent variants.
The framework offers statistical significance testing, effect size calculation, and automated experiment orchestration.
Example:
// Create A/B test
abTest := evaluation.NewABTest(
"prompt_comparison",
controlAgent,
treatmentAgent,
[]string{"accuracy"},
evaluation.SignificanceLevel005,
evaluation.TestTypeTTest,
)
// Run experiment
results, _ := abTest.Run(testCases, 50, true)
// Check significance
if results["accuracy"].IsSignificant() {
fmt.Printf("Winner: %s\n", results["accuracy"].Winner())
}
Package evaluation provides tools for evaluating and optimizing agent performance. Bayesian Optimization uses probabilistic models to efficiently find optimal hyperparameter configurations.
Package evaluation provides comprehensive evaluation capabilities for autonomous agents.
Designed for measuring agent quality and performance, with a special focus on extreme-scale context evaluation (1M-25M+ tokens) for systems like endless.
Example:
evaluator := evaluation.NewEvaluator(agent, nil, "")
testCases := []map[string]interface{}{
{"input": "What is 2+2?", "expected": "4"},
}
result, _ := evaluator.Evaluate(testCases, "")
fmt.Printf("Accuracy: %.2f\n", result.Accuracy)
Package evaluation provides detailed metrics tracking for agent evaluation.
This module extends core evaluation with enhanced metric tracking, including:
- Session status tracking (running, completed, failed, etc.)
- Error collection and analysis
- Metric type categorization
- Cross-session aggregation
Key use case: "How do you know a long-running agent succeeded?"
Example:
result := evaluation.NewSessionResult("session-123", "my-agent")
result.AddMetricMeasurement(evaluation.NewMetricMeasurement(
"accuracy",
0.95,
evaluation.MetricTypeSuccessRate,
))
result.SetStatus(evaluation.SessionStatusCompleted)
Package evaluation provides the Automated Optimization Framework.
This module provides intelligent optimization of agent configurations, prompts, and hyperparameters using Bayesian optimization, genetic algorithms, and other search strategies.
Interfaces:
- Optimizer: Base interface for optimization algorithms
Implementations:
- RandomSearchOptimizer: Baseline random search
- BayesianOptimizer: Bayesian optimization (in bayesian_optimizer.go)
Example:
searchSpace := evaluation.NewSearchSpace()
searchSpace.AddContinuous("temperature", 0.0, 1.0)
searchSpace.AddContinuous("top_p", 0.0, 1.0)
optimizer := evaluation.NewRandomSearchOptimizer(
objectiveFunc,
searchSpace,
true, // maximize
)
result, err := optimizer.Optimize(ctx, 50)
fmt.Printf("Best config: %v\n", result.BestConfig)
fmt.Printf("Best score: %.3f\n", result.BestScore)
Package evaluation provides the Prompt Optimization Framework.
It automatically improves prompts through systematic variation and testing, using grid search, random search, or genetic algorithms.
Example:
template := `You are a {role}.
{instructions}`
variations := map[string][]string{
"role": {"helpful assistant", "expert advisor"},
"instructions": {"Be concise.", "Be detailed."},
}
optimizer := evaluation.NewPromptOptimizer(
template,
variations,
func(prompt string) agenkit.Agent { return MyAgent{SystemPrompt: prompt} },
[]string{"accuracy"},
nil,
)
result, _ := optimizer.Optimize(ctx, testCases, "grid", nil)
Example (AccuracyMetric) ¶
Example_accuracyMetric demonstrates accuracy measurement
package main
import (
"context"
"fmt"
"log"
"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)
// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
name string
}
func (a *ExampleAgent) Name() string {
return a.name
}
func (a *ExampleAgent) Capabilities() []string {
return []string{"question-answering"}
}
func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
return agenkit.DefaultIntrospectionResult(a)
}
func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {
content := msg.Content
var response string
switch content {
case "What is the capital of France?":
response = "The capital of France is Paris."
case "What is 2+2?":
response = "2+2 equals 4."
default:
response = "I don't know the answer to that question."
}
return &agenkit.Message{
Role: "agent",
Content: response,
}, nil
}
func main() {
metric := evaluation.NewAccuracyMetric(nil, false)
agent := &ExampleAgent{name: "qa-agent"}
input := &agenkit.Message{
Role: "user",
Content: "What is the capital of France?",
}
output := &agenkit.Message{
Role: "agent",
Content: "The capital of France is Paris.",
}
ctx := map[string]interface{}{
"expected": "Paris",
}
score, err := metric.Measure(agent, input, output, ctx)
if err != nil {
log.Fatal(err)
}
fmt.Printf("Accuracy score: %.0f\n", score)
}
Output: Accuracy score: 1
Example (BenchmarkSuite) ¶
Example_benchmarkSuite demonstrates running benchmark suites
package main
import (
"fmt"
"log"
"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)
func main() {
// Create standard benchmark suite
suite := evaluation.BenchmarkSuiteStandard()
// Generate all test cases
testCases, err := suite.GenerateAllTestCases()
if err != nil {
log.Fatal(err)
}
fmt.Printf("Generated %d test cases from standard suite\n", len(testCases))
}
Output: Generated 19 test cases from standard suite
Example (CompressionMetrics) ¶
Example_compressionMetrics demonstrates compression evaluation
package main
import (
"fmt"
"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)
func main() {
// Create metrics for small test lengths
testLengths := []int{1000, 5000, 10000}
metric := evaluation.NewCompressionMetrics(testLengths, 3)
fmt.Printf("Metric name: %s\n", metric.Name())
fmt.Printf("Testing %d scale points\n", len(testLengths))
}
Output:
Metric name: compression_quality
Testing 3 scale points
Example (ContextMetrics) ¶
Example_contextMetrics demonstrates context tracking
package main
import (
"context"
"fmt"
"log"
"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)
// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
name string
}
func (a *ExampleAgent) Name() string {
return a.name
}
func (a *ExampleAgent) Capabilities() []string {
return []string{"question-answering"}
}
func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
return agenkit.DefaultIntrospectionResult(a)
}
func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {
content := msg.Content
var response string
switch content {
case "What is the capital of France?":
response = "The capital of France is Paris."
case "What is 2+2?":
response = "2+2 equals 4."
default:
response = "I don't know the answer to that question."
}
return &agenkit.Message{
Role: "agent",
Content: response,
}, nil
}
func main() {
metric := evaluation.NewContextMetrics()
agent := &ExampleAgent{name: "test-agent"}
input := &agenkit.Message{
Role: "user",
Content: "Test query",
Metadata: map[string]interface{}{
"context_length": 1000.0,
},
}
output := &agenkit.Message{
Role: "agent",
Content: "Test response",
}
// Measure context length
length, err := metric.Measure(agent, input, output, nil)
if err != nil {
log.Fatal(err)
}
fmt.Printf("Context length: %.0f tokens\n", length)
// Simulate growing context over time
measurements := []float64{1000, 1200, 1400, 1600, 1800}
aggregated := metric.Aggregate(measurements)
fmt.Printf("Mean context: %.0f tokens\n", aggregated["mean"])
fmt.Printf("Growth rate: %.0f tokens/interaction\n", aggregated["growth_rate"])
}
Output:
Context length: 1000 tokens
Mean context: 1400 tokens
Growth rate: 160 tokens/interaction
Example (Evaluator) ¶
ExampleEvaluator demonstrates basic evaluation
package main
import (
"context"
"fmt"
"log"
"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)
// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
name string
}
func (a *ExampleAgent) Name() string {
return a.name
}
func (a *ExampleAgent) Capabilities() []string {
return []string{"question-answering"}
}
func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
return agenkit.DefaultIntrospectionResult(a)
}
func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {
content := msg.Content
var response string
switch content {
case "What is the capital of France?":
response = "The capital of France is Paris."
case "What is 2+2?":
response = "2+2 equals 4."
default:
response = "I don't know the answer to that question."
}
return &agenkit.Message{
Role: "agent",
Content: response,
}, nil
}
func main() {
agent := &ExampleAgent{name: "qa-agent"}
// Create evaluator with metrics
metrics := []evaluation.Metric{
evaluation.NewAccuracyMetric(nil, false),
evaluation.NewQualityMetrics(false, "", nil),
}
evaluator := evaluation.NewEvaluator(agent, metrics, "session-123")
// Define test cases
testCases := []map[string]interface{}{
{
"input": "What is the capital of France?",
"expected": "Paris",
},
{
"input": "What is 2+2?",
"expected": "4",
},
}
// Run evaluation
result, err := evaluator.Evaluate(testCases, "")
if err != nil {
log.Fatal(err)
}
// Print results
fmt.Printf("Total Tests: %d\n", result.TotalTests)
fmt.Printf("Passed: %d\n", result.PassedTests)
if result.Accuracy != nil {
fmt.Printf("Accuracy: %.2f\n", *result.Accuracy)
}
}
Output:
Total Tests: 2
Passed: 2
Accuracy: 1.00
Example (EvaluatorContinuous) ¶
Example_evaluatorContinuous demonstrates continuous evaluation
package main
import (
"context"
"fmt"
"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)
// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
name string
}
func (a *ExampleAgent) Name() string {
return a.name
}
func (a *ExampleAgent) Capabilities() []string {
return []string{"question-answering"}
}
func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
return agenkit.DefaultIntrospectionResult(a)
}
func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {
content := msg.Content
var response string
switch content {
case "What is the capital of France?":
response = "The capital of France is Paris."
case "What is 2+2?":
response = "2+2 equals 4."
default:
response = "I don't know the answer to that question."
}
return &agenkit.Message{
Role: "agent",
Content: response,
}, nil
}
func main() {
agent := &ExampleAgent{name: "qa-agent"}
metrics := []evaluation.Metric{
evaluation.NewAccuracyMetric(nil, false),
}
evaluator := evaluation.NewEvaluator(agent, metrics, "continuous-session")
// Set up regression detection
detector := evaluation.NewRegressionDetector(nil, nil)
// Baseline evaluation
baselineTests := []map[string]interface{}{
{
"input": "What is the capital of France?",
"expected": "Paris",
},
}
baselineResult, _ := evaluator.Evaluate(baselineTests, "")
detector.SetBaseline(baselineResult)
fmt.Println("Baseline established")
// Current evaluation
currentTests := baselineTests // Same tests
currentResult, _ := evaluator.Evaluate(currentTests, "")
// Detect regressions
regressions := detector.Detect(currentResult, true)
if len(regressions) > 0 {
fmt.Printf("Detected %d regressions\n", len(regressions))
} else {
fmt.Println("No regressions detected")
}
}
Output:
Baseline established
No regressions detected
Example (LatencyMetric) ¶
Example_latencyMetric demonstrates latency tracking
package main
import (
"fmt"
"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)
func main() {
metric := evaluation.NewLatencyMetric()
// Simulate latency measurements
measurements := []float64{
95, 102, 98, 105, 110, 97, 103, 120, 99, 101,
104, 108, 96, 100, 115, 102, 98, 105, 130, 99,
}
aggregated := metric.Aggregate(measurements)
fmt.Printf("Latency Statistics:\n")
fmt.Printf("Mean: %.0fms\n", aggregated["mean"])
fmt.Printf("p50: %.0fms\n", aggregated["p50"])
fmt.Printf("p95: %.0fms\n", aggregated["p95"])
fmt.Printf("p99: %.0fms\n", aggregated["p99"])
}
Output:
Latency Statistics:
Mean: 104ms
p50: 102ms
p95: 130ms
p99: 130ms
Example (PrecisionRecallMetric) ¶
Example_precisionRecallMetric demonstrates classification evaluation
package main
import (
"context"
"fmt"
"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)
// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
name string
}
func (a *ExampleAgent) Name() string {
return a.name
}
func (a *ExampleAgent) Capabilities() []string {
return []string{"question-answering"}
}
func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
return agenkit.DefaultIntrospectionResult(a)
}
func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {
content := msg.Content
var response string
switch content {
case "What is the capital of France?":
response = "The capital of France is Paris."
case "What is 2+2?":
response = "2+2 equals 4."
default:
response = "I don't know the answer to that question."
}
return &agenkit.Message{
Role: "agent",
Content: response,
}, nil
}
func main() {
metric := evaluation.NewPrecisionRecallMetric()
agent := &ExampleAgent{name: "classifier"}
input := &agenkit.Message{Role: "user", Content: "Classify"}
output := &agenkit.Message{Role: "agent", Content: "Positive"}
// Simulate classification results
testCases := []struct {
trueLabel bool
predictedLabel bool
}{
{true, true}, // TP
{true, true}, // TP
{false, true}, // FP
{true, false}, // FN
{false, false}, // TN
}
measurements := make([]float64, 0)
for _, tc := range testCases {
ctx := map[string]interface{}{
"true_label": tc.trueLabel,
"predicted_label": tc.predictedLabel,
}
score, _ := metric.Measure(agent, input, output, ctx)
measurements = append(measurements, score)
}
results := metric.Aggregate(measurements)
fmt.Printf("Precision: %.2f\n", results["precision"])
fmt.Printf("Recall: %.2f\n", results["recall"])
fmt.Printf("F1 Score: %.2f\n", results["f1_score"])
}
Output:
Precision: 0.67
Recall: 0.67
F1 Score: 0.67
Example (QualityMetrics) ¶
Example_qualityMetrics demonstrates quality scoring
package main
import (
"context"
"fmt"
"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)
// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
name string
}
func (a *ExampleAgent) Name() string {
return a.name
}
func (a *ExampleAgent) Capabilities() []string {
return []string{"question-answering"}
}
func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
return agenkit.DefaultIntrospectionResult(a)
}
func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {
content := msg.Content
var response string
switch content {
case "What is the capital of France?":
response = "The capital of France is Paris."
case "What is 2+2?":
response = "2+2 equals 4."
default:
response = "I don't know the answer to that question."
}
return &agenkit.Message{
Role: "agent",
Content: response,
}, nil
}
func main() {
metric := evaluation.NewQualityMetrics(false, "", nil)
agent := &ExampleAgent{name: "qa-agent"}
input := &agenkit.Message{
Role: "user",
Content: "Explain machine learning",
}
// Good response
output1 := &agenkit.Message{
Role: "agent",
Content: "Machine learning is a branch of artificial intelligence that " +
"enables systems to learn from data and improve their performance " +
"over time without being explicitly programmed.",
}
score1, _ := metric.Measure(agent, input, output1, nil)
fmt.Printf("Good response quality: %.2f\n", score1)
// Poor response
output2 := &agenkit.Message{
Role: "agent",
Content: "Yes.",
}
score2, _ := metric.Measure(agent, input, output2, nil)
fmt.Printf("Poor response quality: %.2f\n", score2)
}
Output:
Good response quality: 0.80
Poor response quality: 0.21
Example (RegressionDetector) ¶
Example_regressionDetector demonstrates regression detection
package main
import (
"fmt"
"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)
func main() {
// Create baseline result
accuracy := 0.95
quality := 0.90
latency := 100.0
baseline := &evaluation.EvaluationResult{
Accuracy: &accuracy,
QualityScore: &quality,
AvgLatencyMs: &latency,
}
// Create detector with baseline
detector := evaluation.NewRegressionDetector(nil, baseline)
// Simulate degraded performance
newAccuracy := 0.80
newQuality := 0.85
newLatency := 150.0
current := &evaluation.EvaluationResult{
Accuracy: &newAccuracy,
QualityScore: &newQuality,
AvgLatencyMs: &newLatency,
}
// Detect regressions
regressions := detector.Detect(current, true)
fmt.Printf("Found %d regressions\n", len(regressions))
}
Output: Found 2 regressions
Example (SessionRecorder) ¶
Example_sessionRecorder demonstrates session recording
package main
import (
"context"
"fmt"
"log"
"github.com/scttfrdmn/agenkit/agenkit-go/agenkit"
"github.com/scttfrdmn/agenkit/agenkit-go/evaluation"
)
// ExampleAgent is a simple agent for demonstration
type ExampleAgent struct {
name string
}
func (a *ExampleAgent) Name() string {
return a.name
}
func (a *ExampleAgent) Capabilities() []string {
return []string{"question-answering"}
}
func (a *ExampleAgent) Introspect() *agenkit.IntrospectionResult {
return agenkit.DefaultIntrospectionResult(a)
}
func (a *ExampleAgent) Process(ctx context.Context, msg *agenkit.Message) (*agenkit.Message, error) {
content := msg.Content
var response string
switch content {
case "What is the capital of France?":
response = "The capital of France is Paris."
case "What is 2+2?":
response = "2+2 equals 4."
default:
response = "I don't know the answer to that question."
}
return &agenkit.Message{
Role: "agent",
Content: response,
}, nil
}
func main() {
storage := evaluation.NewInMemoryRecordingStorage()
recorder := evaluation.NewSessionRecorder(storage)
// Start session
recorder.StartSession("session-123", "qa-agent", nil)
// Wrap agent to automatically record
agent := &ExampleAgent{name: "qa-agent"}
wrappedAgent := recorder.Wrap(agent)
// Process message (automatically recorded)
input := &agenkit.Message{
Role: "user",
Content: "What is AI?",
Metadata: map[string]interface{}{
"session_id": "session-123",
},
}
_, _ = wrappedAgent.Process(context.Background(), input)
// Finalize session
recording, err := recorder.FinalizeSession("session-123")
if err != nil {
log.Fatal(err)
}
fmt.Printf("Session %s recorded\n", recording.SessionID)
fmt.Printf("Interactions: %d\n", recording.InteractionCount())
}
Output:
Session session-123 recorded
Interactions: 1
Index ¶
- func CalculateSampleSize(baselineMean, minimumDetectableEffect, alpha, power float64, stdDev *float64) int
- type ABResult
- type ABTest
- type ABVariant
- type AccuracyMetric
- type AcquisitionFunction
- type AgentFactory
- type BayesianOptimizer
- type BayesianOptimizerConfig
- type Benchmark
- type BenchmarkSuite
- func (s *BenchmarkSuite) AddBenchmark(benchmark Benchmark)
- func (s *BenchmarkSuite) GenerateAllTestCases() ([]map[string]interface{}, error)
- func (s *BenchmarkSuite) GetBenchmark(name string) Benchmark
- func (s *BenchmarkSuite) RemoveBenchmark(name string)
- func (s *BenchmarkSuite) ToDict() map[string]interface{}
- type CompressionMetrics
- func (m *CompressionMetrics) Aggregate(measurements []float64) map[string]float64
- func (m *CompressionMetrics) EvaluateAtLengths(agent agenkit.Agent, sessionID string, needleContent []string) (map[int]*CompressionStats, error)
- func (m *CompressionMetrics) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ...) (float64, error)
- func (m *CompressionMetrics) Name() string
- type CompressionStats
- type ContextMetrics
- type ErrorRecord
- type EvaluationResult
- type Evaluator
- type ExtremeScaleBenchmark
- type FileRecordingStorage
- func (s *FileRecordingStorage) DeleteRecording(sessionID string) error
- func (s *FileRecordingStorage) ListRecordings(limit, offset int) ([]*SessionRecording, error)
- func (s *FileRecordingStorage) LoadRecording(sessionID string) (*SessionRecording, error)
- func (s *FileRecordingStorage) SaveRecording(recording *SessionRecording) error
- type InMemoryRecordingStorage
- func (s *InMemoryRecordingStorage) DeleteRecording(sessionID string) error
- func (s *InMemoryRecordingStorage) ListRecordings(limit, offset int) ([]*SessionRecording, error)
- func (s *InMemoryRecordingStorage) LoadRecording(sessionID string) (*SessionRecording, error)
- func (s *InMemoryRecordingStorage) SaveRecording(recording *SessionRecording) error
- type InformationRetentionBenchmark
- type InteractionRecord
- type LatencyMetric
- type Metric
- type MetricMeasurement
- func CreateCostMetric(cost float64, currency string, metadata map[string]interface{}) *MetricMeasurement
- func CreateDurationMetric(durationSeconds float64, metadata map[string]interface{}) *MetricMeasurement
- func CreateQualityMetric(name string, score, maxScore float64, metadata map[string]interface{}) *MetricMeasurement
- func NewMetricMeasurement(name string, value float64, metricType MetricType) *MetricMeasurement
- type MetricType
- type MetricsCollector
- func (mc *MetricsCollector) AddResult(result *SessionResult)
- func (mc *MetricsCollector) Clear()
- func (mc *MetricsCollector) GetMetricAggregates(metricName string) map[string]interface{}
- func (mc *MetricsCollector) GetResults() []SessionResult
- func (mc *MetricsCollector) GetStatistics() map[string]interface{}
- type NeedleInHaystackBenchmark
- type ObjectiveFunc
- type OptimizationResult
- type OptimizationStep
- type OptimizationStrategy
- type Optimizer
- type ParameterSpec
- type ParameterType
- type PrecisionRecallMetric
- type PrecisionRecallStats
- type PromptEvaluation
- type PromptOptimizationResult
- type PromptOptimizer
- func (p *PromptOptimizer) Optimize(ctx context.Context, testCases []map[string]interface{}, strategy string, ...) (*PromptOptimizationResult, error)
- func (p *PromptOptimizer) OptimizeGenetic(ctx context.Context, testCases []map[string]interface{}, populationSize int, ...) (*PromptOptimizationResult, error)
- func (p *PromptOptimizer) OptimizeGrid(ctx context.Context, testCases []map[string]interface{}) (*PromptOptimizationResult, error)
- func (p *PromptOptimizer) OptimizeRandom(ctx context.Context, testCases []map[string]interface{}, nSamples int) (*PromptOptimizationResult, error)
- func (p *PromptOptimizer) SetMaximize(maximize bool)
- type QualityMetrics
- type RandomSearchOptimizer
- type RecordingStorage
- type Regression
- type RegressionDetector
- func (d *RegressionDetector) ClearHistory()
- func (d *RegressionDetector) CompareResults(resultA, resultB *EvaluationResult) map[string]map[string]float64
- func (d *RegressionDetector) Detect(result *EvaluationResult, storeHistory bool) []*Regression
- func (d *RegressionDetector) GetSummary() map[string]interface{}
- func (d *RegressionDetector) GetTrend(metricName string, window int) map[string]interface{}
- func (d *RegressionDetector) SetBaseline(result *EvaluationResult)
- type SearchSpace
- func (s *SearchSpace) AddCategorical(name string, values []string)
- func (s *SearchSpace) AddContinuous(name string, low, high float64)
- func (s *SearchSpace) AddDiscrete(name string, values []interface{})
- func (s *SearchSpace) AddInteger(name string, low, high int)
- func (s *SearchSpace) Sample() map[string]interface{}
- type SessionRecorder
- func (r *SessionRecorder) DeleteRecording(sessionID string) error
- func (r *SessionRecorder) FinalizeSession(sessionID string) (*SessionRecording, error)
- func (r *SessionRecorder) ListRecordings(limit, offset int) ([]*SessionRecording, error)
- func (r *SessionRecorder) LoadRecording(sessionID string) (*SessionRecording, error)
- func (r *SessionRecorder) RecordInteraction(sessionID string, inputMessage, outputMessage *agenkit.Message, ...)
- func (r *SessionRecorder) StartSession(sessionID, agentName string, metadata map[string]interface{})
- func (r *SessionRecorder) Wrap(agent agenkit.Agent) agenkit.Agent
- type SessionRecording
- type SessionReplay
- type SessionResult
- func (sr *SessionResult) AddError(errorType string, message string, details map[string]interface{})
- func (sr *SessionResult) AddMetricMeasurement(measurement *MetricMeasurement)
- func (sr *SessionResult) DurationSeconds() *float64
- func (sr *SessionResult) GetMetric(name string) *MetricMeasurement
- func (sr *SessionResult) GetMetricsByType(metricType MetricType) []MetricMeasurement
- func (sr *SessionResult) SetStatus(status SessionStatus)
- func (sr *SessionResult) ToJSON() (string, error)
- type SessionStatus
- type Severity
- type SignificanceLevel
- type SimpleQABenchmark
- type StatisticalTestType
- type TestCase
- type ValidatorFunc
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CalculateSampleSize ¶
func CalculateSampleSize(baselineMean, minimumDetectableEffect, alpha, power float64, stdDev *float64) int
CalculateSampleSize calculates required sample size for A/B test.
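A usage sketch (the interpretation of baselineMean and minimumDetectableEffect, and whether the result is per variant, are assumptions based on the signature; a nil stdDev presumably falls back to a default estimate):
// Estimate the sample size needed to detect a 0.05 absolute improvement
// over a 0.80 baseline at alpha = 0.05 with 80% power.
stdDev := 0.10
n := evaluation.CalculateSampleSize(0.80, 0.05, 0.05, 0.80, &stdDev)
fmt.Printf("Required sample size: %d\n", n)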
Types ¶
type ABResult ¶
type ABResult struct {
ExperimentName string
ControlVariant *ABVariant
TreatmentVariant *ABVariant
MetricName string
PValue float64
TestType StatisticalTestType
SignificanceLevel SignificanceLevel
EffectSize float64
ConfidenceInterval [2]float64
Timestamp time.Time
}
ABResult contains results of an A/B test with statistical analysis.
func (*ABResult) ImprovementPercent ¶
ImprovementPercent returns percent improvement of treatment over control.
func (*ABResult) IsSignificant ¶
IsSignificant checks if result is statistically significant.
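A sketch of inspecting a stored result (assumes IsSignificant returns bool and ImprovementPercent returns float64, and builds on the abTest from the package overview):
result := abTest.Results["accuracy"]
if result.IsSignificant() {
    fmt.Printf("Significant at p=%.4f, improvement %.1f%%\n",
        result.PValue, result.ImprovementPercent())
}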
type ABTest ¶
type ABTest struct {
Name string
Control *ABVariant
Treatment *ABVariant
Metrics []string
SignificanceLevel SignificanceLevel
TestType StatisticalTestType
Results map[string]*ABResult
}
ABTest orchestrates A/B experiments comparing agent variants.
func NewABTest ¶
func NewABTest(
    name string,
    controlAgent agenkit.Agent,
    treatmentAgent agenkit.Agent,
    metrics []string,
    significanceLevel SignificanceLevel,
    testType StatisticalTestType,
) *ABTest
NewABTest creates a new A/B test.
func (*ABTest) GetSummary ¶
GetSummary returns experiment summary.
type ABVariant ¶
type ABVariant struct {
Name string
Agent agenkit.Agent
Samples []float64
Metadata map[string]interface{}
}
ABVariant represents a variant in an A/B test.
func NewABVariant ¶
NewABVariant creates a new A/B variant.
func (*ABVariant) SampleSize ¶
SampleSize returns the number of samples.
type AccuracyMetric ¶
type AccuracyMetric struct {
// contains filtered or unexported fields
}
AccuracyMetric measures task accuracy.
Compares agent output to expected output to determine correctness. Supports multiple validation methods:
- Exact string matching
- Substring matching (case-insensitive)
- Custom validator functions
Example:
metric := NewAccuracyMetric(nil, false)
score, _ := metric.Measure(
agent,
inputMsg,
outputMsg,
map[string]interface{}{"expected": "Paris"},
)
fmt.Printf("Accuracy: %.0f\n", score) // 0.0 or 1.0
func NewAccuracyMetric ¶
func NewAccuracyMetric(validator ValidatorFunc, caseSensitive bool) *AccuracyMetric
NewAccuracyMetric creates a new accuracy metric.
Args:
validator: Custom validation function(expected, actual) -> bool
caseSensitive: Whether string matching is case-sensitive
Example:
metric := NewAccuracyMetric(nil, false)
func (*AccuracyMetric) Aggregate ¶
func (m *AccuracyMetric) Aggregate(measurements []float64) map[string]float64
Aggregate aggregates accuracy measurements.
Args:
measurements: List of 0.0/1.0 values
Returns:
Accuracy statistics: accuracy, total, correct, incorrect
func (*AccuracyMetric) Measure ¶
func (m *AccuracyMetric) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)
Measure measures accuracy for single interaction.
Args:
agent: Agent being evaluated
inputMessage: Input to agent
outputMessage: Agent's response
ctx: Must contain "expected" key with expected output
Returns:
1.0 if correct, 0.0 if incorrect
type AcquisitionFunction ¶
type AcquisitionFunction string
AcquisitionFunction specifies the acquisition function type for Bayesian optimization.
const (
    // AcquisitionEI represents Expected Improvement
    AcquisitionEI AcquisitionFunction = "ei"
    // AcquisitionUCB represents Upper Confidence Bound
    AcquisitionUCB AcquisitionFunction = "ucb"
    // AcquisitionPI represents Probability of Improvement
    AcquisitionPI AcquisitionFunction = "pi"
)
type AgentFactory ¶
AgentFactory is a function that creates an agent from a prompt string.
type BayesianOptimizer ¶
type BayesianOptimizer struct {
// contains filtered or unexported fields
}
BayesianOptimizer implements Bayesian optimization for hyperparameter tuning.
This implementation uses a simplified surrogate model based on local statistics rather than full Gaussian Process regression. It balances exploration and exploitation through acquisition functions.
Algorithm:
- Sample n_initial random configurations
- Evaluate and build local statistics
- Use acquisition function to select next config
- Evaluate new config
- Update statistics and repeat
func NewBayesianOptimizer ¶
func NewBayesianOptimizer(config BayesianOptimizerConfig) (*BayesianOptimizer, error)
NewBayesianOptimizer creates a new Bayesian optimizer.
func (*BayesianOptimizer) Optimize ¶
func (b *BayesianOptimizer) Optimize(ctx context.Context, nIterations int) (*OptimizationResult, error)
Optimize runs the Bayesian optimization process.
type BayesianOptimizerConfig ¶
type BayesianOptimizerConfig struct {
SearchSpace *SearchSpace
Objective ObjectiveFunc
Maximize bool
Acquisition AcquisitionFunction
NInitial int
Xi float64 // Exploration parameter for EI and PI (default: 0.01)
Kappa float64 // Exploration parameter for UCB (default: 2.576)
}
BayesianOptimizerConfig contains configuration for BayesianOptimizer.
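A construction sketch (field values are illustrative; objectiveFunc and ctx are assumed to be defined by the caller, and Xi/Kappa are left at their documented defaults):
searchSpace := evaluation.NewSearchSpace()
searchSpace.AddContinuous("temperature", 0.0, 1.0)
searchSpace.AddInteger("max_tokens", 64, 1024)
optimizer, err := evaluation.NewBayesianOptimizer(evaluation.BayesianOptimizerConfig{
    SearchSpace: searchSpace,
    Objective:   objectiveFunc,
    Maximize:    true,
    Acquisition: evaluation.AcquisitionEI,
    NInitial:    5,
})
if err != nil {
    log.Fatal(err)
}
result, err := optimizer.Optimize(ctx, 30)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Best score: %.3f\n", result.BestScore)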
type Benchmark ¶
type Benchmark interface {
// Name returns the benchmark name.
Name() string
// Description returns the benchmark description.
Description() string
// GenerateTestCases generates test cases for this benchmark.
//
// Returns:
// List of test cases
GenerateTestCases() ([]*TestCase, error)
}
Benchmark is the interface for benchmarks.
Benchmarks define test suites for evaluating specific capabilities.
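A minimal custom implementation sketch (SmokeTestBenchmark is hypothetical; the fields of TestCase are not shown in this documentation, so GenerateTestCases returns an empty slice as a placeholder):
type SmokeTestBenchmark struct{}

func (b *SmokeTestBenchmark) Name() string        { return "smoke_test" }
func (b *SmokeTestBenchmark) Description() string { return "Minimal sanity checks" }

func (b *SmokeTestBenchmark) GenerateTestCases() ([]*evaluation.TestCase, error) {
    // Populate with *evaluation.TestCase values as needed.
    return []*evaluation.TestCase{}, nil
}

// The custom benchmark can then be added to a suite:
suite := evaluation.NewBenchmarkSuite([]evaluation.Benchmark{&SmokeTestBenchmark{}}, "smoke")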
type BenchmarkSuite ¶
type BenchmarkSuite struct {
// contains filtered or unexported fields
}
BenchmarkSuite is a collection of benchmarks for comprehensive evaluation.
Provides standard and extreme-scale benchmark suites.
func BenchmarkSuiteExtremeScale ¶
func BenchmarkSuiteExtremeScale() *BenchmarkSuite
BenchmarkSuiteExtremeScale creates an extreme-scale benchmark suite for endless.
Tests at 1M-25M+ tokens with compression and retrieval.
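A sketch mirroring the standard-suite example above (the number of generated cases at extreme scale is not specified here):
suite := evaluation.BenchmarkSuiteExtremeScale()
testCases, err := suite.GenerateAllTestCases()
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Generated %d extreme-scale test cases\n", len(testCases))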
func BenchmarkSuiteQuick ¶
func BenchmarkSuiteQuick() *BenchmarkSuite
BenchmarkSuiteQuick creates a quick benchmark suite for fast iteration.
Small test set for rapid feedback during development.
func BenchmarkSuiteStandard ¶
func BenchmarkSuiteStandard() *BenchmarkSuite
BenchmarkSuiteStandard creates a standard benchmark suite.
Includes basic Q&A and small-scale retrieval tests.
func NewBenchmarkSuite ¶
func NewBenchmarkSuite(benchmarks []Benchmark, name string) *BenchmarkSuite
NewBenchmarkSuite creates a new benchmark suite.
Args:
benchmarks: List of benchmarks to include
name: Suite name
Example:
suite := NewBenchmarkSuite([]Benchmark{NewSimpleQABenchmark()}, "custom")
func (*BenchmarkSuite) AddBenchmark ¶
func (s *BenchmarkSuite) AddBenchmark(benchmark Benchmark)
AddBenchmark adds benchmark to suite.
func (*BenchmarkSuite) GenerateAllTestCases ¶
func (s *BenchmarkSuite) GenerateAllTestCases() ([]map[string]interface{}, error)
GenerateAllTestCases generates all test cases from all benchmarks.
Returns:
Combined list of test cases from all benchmarks
func (*BenchmarkSuite) GetBenchmark ¶
func (s *BenchmarkSuite) GetBenchmark(name string) Benchmark
GetBenchmark gets benchmark by name.
func (*BenchmarkSuite) RemoveBenchmark ¶
func (s *BenchmarkSuite) RemoveBenchmark(name string)
RemoveBenchmark removes benchmark from suite.
func (*BenchmarkSuite) ToDict ¶
func (s *BenchmarkSuite) ToDict() map[string]interface{}
ToDict converts suite to dictionary.
type CompressionMetrics ¶
type CompressionMetrics struct {
// contains filtered or unexported fields
}
CompressionMetrics measures compression quality at extreme scale.
Critical for endless and similar systems that use 100x-1000x compression at 25M+ tokens. Measures:
- Compression ratio achieved
- Information retention after compression
- Retrieval accuracy from compressed context
- Quality degradation as context grows
Example:
metrics := NewCompressionMetrics([]int{1000000, 10000000, 25000000}, 10)
stats, _ := metrics.EvaluateAtLengths(agent, "session-123", nil)
for length, stat := range stats {
fmt.Printf("%dM tokens: %.1fx compression\n", length/1e6, stat.CompressionRatio)
}
func NewCompressionMetrics ¶
func NewCompressionMetrics(testLengths []int, needleCount int) *CompressionMetrics
NewCompressionMetrics creates a new compression metrics instance.
Args:
testLengths: Context lengths to test (defaults to 1M, 10M, 25M)
needleCount: Number of "needle" facts to test retrieval
Example:
metrics := NewCompressionMetrics([]int{1000000, 10000000}, 10)
func (*CompressionMetrics) Aggregate ¶
func (m *CompressionMetrics) Aggregate(measurements []float64) map[string]float64
Aggregate aggregates compression ratios.
Args:
measurements: List of compression ratios
Returns:
Statistics: mean, min, max, std
func (*CompressionMetrics) EvaluateAtLengths ¶
func (m *CompressionMetrics) EvaluateAtLengths(agent agenkit.Agent, sessionID string, needleContent []string) (map[int]*CompressionStats, error)
EvaluateAtLengths evaluates compression quality at multiple context lengths.
Tests compression and retrieval at 1M, 10M, 25M tokens to detect quality degradation as context grows.
Args:
agent: Agent with compression capability
sessionID: Session to evaluate
needleContent: Specific facts to test retrieval (optional)
Returns:
Dictionary mapping context_length -> CompressionStats
func (*CompressionMetrics) Measure ¶
func (m *CompressionMetrics) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)
Measure measures compression quality for single interaction.
Returns:
Compression ratio (raw_tokens / compressed_tokens)
func (*CompressionMetrics) Name ¶
func (m *CompressionMetrics) Name() string
Name returns the metric name.
type CompressionStats ¶
type CompressionStats struct {
RawTokens int
CompressedTokens int
CompressionRatio float64
RetrievalAccuracy float64
ContextLengthTested int
Timestamp time.Time
}
CompressionStats contains statistics from compression evaluation.
func (*CompressionStats) ToDict ¶
func (c *CompressionStats) ToDict() map[string]interface{}
ToDict converts stats to dictionary.
type ContextMetrics ¶
type ContextMetrics struct{}
ContextMetrics tracks context length and growth over agent lifecycle.
Essential for extreme-scale systems (endless) that operate at 1M-25M+ token contexts. Measures:
- Raw context token count
- Compressed context token count (if compression used)
- Compression ratio
- Context growth rate
Example:
metrics := NewContextMetrics()
result, _ := metrics.Measure(agent, inputMsg, outputMsg, ctx)
fmt.Printf("Context length: %.0f tokens\n", result)
func NewContextMetrics ¶
func NewContextMetrics() *ContextMetrics
NewContextMetrics creates a new context metrics instance.
func (*ContextMetrics) Aggregate ¶
func (m *ContextMetrics) Aggregate(measurements []float64) map[string]float64
Aggregate aggregates context length measurements.
Args:
measurements: List of context lengths over time
Returns:
Statistics: mean, min, max, final, growth_rate
func (*ContextMetrics) Measure ¶
func (m *ContextMetrics) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)
Measure measures context length metrics.
Args:
agent: Agent being evaluated
inputMessage: Input message
outputMessage: Agent response
ctx: Additional context with session history
Returns:
Current context length in tokens (or compressed tokens if available)
type ErrorRecord ¶
type ErrorRecord struct {
// Type of error
Type string `json:"type"`
// Error message
Message string `json:"message"`
// Additional details
Details map[string]interface{} `json:"details,omitempty"`
// Timestamp when error occurred (RFC3339 format)
Timestamp string `json:"timestamp"`
}
ErrorRecord represents an error that occurred during evaluation.
func NewErrorRecord ¶
func NewErrorRecord(errorType string, message string, details map[string]interface{}) *ErrorRecord
NewErrorRecord creates a new error record with current timestamp.
type EvaluationResult ¶
type EvaluationResult struct {
// Identification
EvaluationID string
AgentName string
Timestamp time.Time
// Metrics
Metrics map[string][]float64
AggregatedMetrics map[string]map[string]float64
// Context information
ContextLength *int
CompressedLength *int
CompressionRatio *float64
// Quality scores
Accuracy *float64
QualityScore *float64
// Performance
AvgLatencyMs *float64
P95LatencyMs *float64
// Test details
TotalTests int
PassedTests int
FailedTests int
// Additional metadata
Metadata map[string]interface{}
}
EvaluationResult contains results from an evaluation run.
Includes metrics, metadata, and analysis.
func (*EvaluationResult) SuccessRate ¶
func (r *EvaluationResult) SuccessRate() float64
SuccessRate calculates test success rate.
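A minimal sketch, assuming SuccessRate returns a 0-1 fraction derived from PassedTests and TotalTests:
result := &evaluation.EvaluationResult{TotalTests: 20, PassedTests: 18, FailedTests: 2}
fmt.Printf("Success rate: %.0f%%\n", result.SuccessRate()*100) // 90%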
func (*EvaluationResult) ToDict ¶
func (r *EvaluationResult) ToDict() map[string]interface{}
ToDict converts result to dictionary.
type Evaluator ¶
type Evaluator struct {
// contains filtered or unexported fields
}
Evaluator is the core evaluation orchestrator.
Runs benchmarks, collects metrics, and aggregates results.
Example:
evaluator := NewEvaluator(agent, nil, "")
suite := BenchmarkSuiteStandard()
testCases, _ := suite.GenerateAllTestCases()
results, _ := evaluator.Evaluate(testCases, "")
fmt.Printf("Accuracy: %.2f\n", results.Accuracy)
func NewEvaluator ¶
NewEvaluator creates a new evaluator.
Args:
agent: Agent to evaluate
metrics: List of metrics to collect (defaults to empty)
sessionID: Optional session ID for context tracking
Example:
evaluator := NewEvaluator(agent, []Metric{NewAccuracyMetric(nil, false)}, "eval-123")
func (*Evaluator) Evaluate ¶
func (e *Evaluator) Evaluate(testCases []map[string]interface{}, evaluationID string) (*EvaluationResult, error)
Evaluate evaluates agent on test cases.
Args:
testCases: List of test cases, each with 'input' and 'expected' keys
evaluationID: Optional evaluation ID
Returns:
EvaluationResult with metrics and analysis
Example:
testCases := []map[string]interface{}{
{"input": "What is 2+2?", "expected": "4"},
}
result, err := evaluator.Evaluate(testCases, "")
func (*Evaluator) EvaluateSingle ¶
func (e *Evaluator) EvaluateSingle(inputMessage *agenkit.Message, expectedOutput interface{}) (map[string]float64, error)
EvaluateSingle evaluates single interaction.
Args:
inputMessage: Input to agent
expectedOutput: Expected output (optional)
Returns:
Dictionary of metric values
Example:
inputMsg := &agenkit.Message{Role: "user", Content: "Hello"}
metrics, err := evaluator.EvaluateSingle(inputMsg, "Hi there")
type ExtremeScaleBenchmark ¶
type ExtremeScaleBenchmark struct {
// contains filtered or unexported fields
}
ExtremeScaleBenchmark is an extreme-scale benchmark for testing at 1M-25M+ tokens.
Designed specifically for endless and similar systems that operate at unprecedented context lengths.
func NewExtremeScaleBenchmark ¶
func NewExtremeScaleBenchmark(testLengths []int, needlesPerLength int) *ExtremeScaleBenchmark
NewExtremeScaleBenchmark creates a new extreme-scale benchmark.
Args:
testLengths: Context lengths to test (defaults to 1M, 10M, 25M)
needlesPerLength: Number of needles per context length
Example:
benchmark := NewExtremeScaleBenchmark([]int{1000000, 10000000}, 10)
func (*ExtremeScaleBenchmark) Description ¶
func (b *ExtremeScaleBenchmark) Description() string
Description returns the benchmark description.
func (*ExtremeScaleBenchmark) GenerateTestCases ¶
func (b *ExtremeScaleBenchmark) GenerateTestCases() ([]*TestCase, error)
GenerateTestCases generates extreme-scale test cases.
func (*ExtremeScaleBenchmark) Name ¶
func (b *ExtremeScaleBenchmark) Name() string
Name returns the benchmark name.
type FileRecordingStorage ¶
type FileRecordingStorage struct {
// contains filtered or unexported fields
}
FileRecordingStorage provides file-based recording storage.
Stores recordings as JSON files on disk.
func NewFileRecordingStorage ¶
func NewFileRecordingStorage(recordingsDir string) *FileRecordingStorage
NewFileRecordingStorage creates a new file storage.
Args:
recordingsDir: Directory to store recordings
Example:
storage := NewFileRecordingStorage("./recordings")
func (*FileRecordingStorage) DeleteRecording ¶
func (s *FileRecordingStorage) DeleteRecording(sessionID string) error
DeleteRecording deletes recording file.
func (*FileRecordingStorage) ListRecordings ¶
func (s *FileRecordingStorage) ListRecordings(limit, offset int) ([]*SessionRecording, error)
ListRecordings lists all recordings.
func (*FileRecordingStorage) LoadRecording ¶
func (s *FileRecordingStorage) LoadRecording(sessionID string) (*SessionRecording, error)
LoadRecording loads recording from file.
func (*FileRecordingStorage) SaveRecording ¶
func (s *FileRecordingStorage) SaveRecording(recording *SessionRecording) error
SaveRecording saves recording to file.
type InMemoryRecordingStorage ¶
type InMemoryRecordingStorage struct {
// contains filtered or unexported fields
}
InMemoryRecordingStorage provides in-memory recording storage for testing.
Does not persist recordings across restarts.
func NewInMemoryRecordingStorage ¶
func NewInMemoryRecordingStorage() *InMemoryRecordingStorage
NewInMemoryRecordingStorage creates a new in-memory storage.
func (*InMemoryRecordingStorage) DeleteRecording ¶
func (s *InMemoryRecordingStorage) DeleteRecording(sessionID string) error
DeleteRecording deletes recording from memory.
func (*InMemoryRecordingStorage) ListRecordings ¶
func (s *InMemoryRecordingStorage) ListRecordings(limit, offset int) ([]*SessionRecording, error)
ListRecordings lists recordings from memory.
func (*InMemoryRecordingStorage) LoadRecording ¶
func (s *InMemoryRecordingStorage) LoadRecording(sessionID string) (*SessionRecording, error)
LoadRecording loads recording from memory.
func (*InMemoryRecordingStorage) SaveRecording ¶
func (s *InMemoryRecordingStorage) SaveRecording(recording *SessionRecording) error
SaveRecording saves recording to memory.
type InformationRetentionBenchmark ¶
type InformationRetentionBenchmark struct {
// contains filtered or unexported fields
}
InformationRetentionBenchmark tests information retention across long conversations.
Verifies that agents remember and can recall information from earlier in the conversation, even after compression.
func NewInformationRetentionBenchmark ¶
func NewInformationRetentionBenchmark(conversationLength int, recallPoints []int) *InformationRetentionBenchmark
NewInformationRetentionBenchmark creates a new information retention benchmark.
Args:
conversationLength: Number of conversation turns
recallPoints: Turns at which to test recall (defaults to checkpoints)
Example:
benchmark := NewInformationRetentionBenchmark(100, []int{10, 25, 50, 75, 100})
func (*InformationRetentionBenchmark) Description ¶
func (b *InformationRetentionBenchmark) Description() string
Description returns the benchmark description.
func (*InformationRetentionBenchmark) GenerateTestCases ¶
func (b *InformationRetentionBenchmark) GenerateTestCases() ([]*TestCase, error)
GenerateTestCases generates information retention test cases.
func (*InformationRetentionBenchmark) Name ¶
func (b *InformationRetentionBenchmark) Name() string
Name returns the benchmark name.
type InteractionRecord ¶
type InteractionRecord struct {
InteractionID string
SessionID string
InputMessage map[string]interface{}
OutputMessage map[string]interface{}
Timestamp time.Time
LatencyMs float64
Metadata map[string]interface{}
}
InteractionRecord represents a record of single agent interaction.
Contains input, output, timing, and metadata.
func InteractionRecordFromDict ¶
func InteractionRecordFromDict(data map[string]interface{}) (*InteractionRecord, error)
InteractionRecordFromDict creates record from dictionary.
func (*InteractionRecord) ToDict ¶
func (r *InteractionRecord) ToDict() map[string]interface{}
ToDict converts record to dictionary.
type LatencyMetric ¶
type LatencyMetric struct{}
LatencyMetric measures agent response latency.
Tracks processing time per interaction. Critical for production systems where response time matters.
func NewLatencyMetric ¶
func NewLatencyMetric() *LatencyMetric
NewLatencyMetric creates a new latency metric instance.
func (*LatencyMetric) Aggregate ¶
func (m *LatencyMetric) Aggregate(measurements []float64) map[string]float64
Aggregate aggregates latency measurements.
Returns:
mean, p50, p95, p99, min, max
type Metric ¶
type Metric interface {
// Name returns the metric name.
Name() string
// Measure measures metric for a single agent interaction.
//
// Args:
// agent: The agent being evaluated
// inputMessage: Input to the agent
// outputMessage: Agent's response
// ctx: Additional context (session history, etc.)
//
// Returns:
// Metric value (typically 0.0 to 1.0)
Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)
// Aggregate aggregates multiple measurements.
//
// Args:
// measurements: List of individual measurements
//
// Returns:
// Aggregated statistics (mean, std, min, max, etc.)
Aggregate(measurements []float64) map[string]float64
}
Metric is the interface for evaluation metrics.
Metrics measure specific aspects of agent performance:
- Accuracy
- Latency
- Context usage
- Quality scores
- etc.
type MetricMeasurement ¶
type MetricMeasurement struct {
// Name of the metric
Name string `json:"name"`
// Value of the measurement
Value float64 `json:"value"`
// Type categorizes the metric
Type MetricType `json:"type"`
// Timestamp when measurement was taken (RFC3339 format)
Timestamp string `json:"timestamp"`
// Metadata for additional context
Metadata map[string]interface{} `json:"metadata,omitempty"`
}
MetricMeasurement represents a single metric measurement.
Note: This is distinct from the Metric interface in core.go. Metric interface defines how to measure, MetricMeasurement stores the measurement.
func CreateCostMetric ¶
func CreateCostMetric(cost float64, currency string, metadata map[string]interface{}) *MetricMeasurement
CreateCostMetric creates a cost metric measurement.
Helper to create a cost metric with currency information.
Args:
cost: Cost amount
currency: Currency code (default: "USD")
metadata: Additional metadata
Returns:
Cost metric measurement
Example:
metric := evaluation.CreateCostMetric(0.0042, "USD", nil)
func CreateDurationMetric ¶
func CreateDurationMetric(durationSeconds float64, metadata map[string]interface{}) *MetricMeasurement
CreateDurationMetric creates a duration metric measurement.
Helper to create a duration metric with hours conversion.
Args:
durationSeconds: Duration in seconds
metadata: Additional metadata
Returns:
Duration metric measurement
Example:
metric := evaluation.CreateDurationMetric(125.5, nil)
func CreateQualityMetric ¶
func CreateQualityMetric(name string, score, maxScore float64, metadata map[string]interface{}) *MetricMeasurement
CreateQualityMetric creates a quality score metric measurement.
Helper to create a quality score metric with normalized score (0.0-1.0).
Args:
name: Metric name
score: Raw score
maxScore: Maximum possible score (default: 10.0)
metadata: Additional metadata
Returns:
Metric measurement with normalized score
Example:
metric := evaluation.CreateQualityMetric("response_quality", 8.5, 10.0, nil)
func NewMetricMeasurement ¶
func NewMetricMeasurement(name string, value float64, metricType MetricType) *MetricMeasurement
NewMetricMeasurement creates a new metric measurement with current timestamp.
type MetricType ¶
type MetricType string
MetricType categorizes different types of metrics.
const (
    // MetricTypeSuccessRate measures success/failure rates
    MetricTypeSuccessRate MetricType = "success_rate"
    // MetricTypeQualityScore measures output quality
    MetricTypeQualityScore MetricType = "quality_score"
    // MetricTypeCost measures token/API costs
    MetricTypeCost MetricType = "cost"
    // MetricTypeDuration measures time taken
    MetricTypeDuration MetricType = "duration"
    // MetricTypeErrorRate measures error frequency
    MetricTypeErrorRate MetricType = "error_rate"
    // MetricTypeTaskCompletion measures task completion
    MetricTypeTaskCompletion MetricType = "task_completion"
    // MetricTypeCustom for custom metrics
    MetricTypeCustom MetricType = "custom"
)
type MetricsCollector ¶
type MetricsCollector struct {
// contains filtered or unexported fields
}
MetricsCollector aggregates metrics across multiple evaluation sessions.
Useful for analyzing agent performance over time and across different scenarios. Thread-safe for concurrent access.
Example:
collector := evaluation.NewMetricsCollector()
collector.AddResult(result1)
collector.AddResult(result2)
stats := collector.GetStatistics()
fmt.Printf("Success rate: %.2f%%\n", stats["success_rate"]*100)
func NewMetricsCollector ¶
func NewMetricsCollector() *MetricsCollector
NewMetricsCollector creates a new metrics collector.
func (*MetricsCollector) AddResult ¶
func (mc *MetricsCollector) AddResult(result *SessionResult)
AddResult adds a session result to the collector. Thread-safe for concurrent access.
func (*MetricsCollector) Clear ¶
func (mc *MetricsCollector) Clear()
Clear removes all collected results. Thread-safe for concurrent access.
func (*MetricsCollector) GetMetricAggregates ¶
func (mc *MetricsCollector) GetMetricAggregates(metricName string) map[string]interface{}
GetMetricAggregates computes aggregated statistics for a specific metric across all sessions. Thread-safe for concurrent access.
Args:
metricName: Name of the metric to aggregate
Returns:
Map with statistics: count, sum, mean, min, max
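A usage sketch building on the collector example above ("accuracy" is an illustrative metric name):
agg := collector.GetMetricAggregates("accuracy")
fmt.Printf("accuracy: count=%v mean=%v min=%v max=%v\n",
    agg["count"], agg["mean"], agg["min"], agg["max"])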
func (*MetricsCollector) GetResults ¶
func (mc *MetricsCollector) GetResults() []SessionResult
GetResults returns all collected session results. Thread-safe for concurrent access.
func (*MetricsCollector) GetStatistics ¶
func (mc *MetricsCollector) GetStatistics() map[string]interface{}
GetStatistics computes aggregated statistics across all collected results. Thread-safe for concurrent access.
Returns a map with statistics including:
- session_count: Total number of sessions
- completed_count: Number of completed sessions
- failed_count: Number of failed sessions
- success_rate: Ratio of completed to total sessions
- avg_duration: Average session duration in seconds
- total_errors: Total number of errors across all sessions
- avg_errors_per_session: Average errors per session
type NeedleInHaystackBenchmark ¶
type NeedleInHaystackBenchmark struct {
// contains filtered or unexported fields
}
NeedleInHaystackBenchmark is a needle-in-haystack benchmark for context retrieval.
Tests ability to retrieve specific information from large contexts. Essential for extreme-scale systems like endless.
func NewNeedleInHaystackBenchmark ¶
func NewNeedleInHaystackBenchmark(contextLength, needleCount, haystackMultiplier int) *NeedleInHaystackBenchmark
NewNeedleInHaystackBenchmark creates a new needle-in-haystack benchmark.
Args:
contextLength: Target context length in tokens
needleCount: Number of needles to hide
haystackMultiplier: How much filler per needle
Example:
benchmark := NewNeedleInHaystackBenchmark(10000, 5, 10)
func (*NeedleInHaystackBenchmark) Description ¶
func (b *NeedleInHaystackBenchmark) Description() string
Description returns the benchmark description.
func (*NeedleInHaystackBenchmark) GenerateTestCases ¶
func (b *NeedleInHaystackBenchmark) GenerateTestCases() ([]*TestCase, error)
GenerateTestCases generates needle-in-haystack test cases.
func (*NeedleInHaystackBenchmark) Name ¶
func (b *NeedleInHaystackBenchmark) Name() string
Name returns the benchmark name.
type ObjectiveFunc ¶
ObjectiveFunc evaluates a configuration and returns a score.
type OptimizationResult ¶
type OptimizationResult struct {
BestConfig map[string]interface{}
BestScore float64
History []OptimizationStep
NIterations int
StartTime time.Time
EndTime time.Time
Metadata map[string]interface{}
}
OptimizationResult contains the results of an optimization run.
func (*OptimizationResult) Duration ¶
func (r *OptimizationResult) Duration() time.Duration
Duration returns the total optimization duration.
func (*OptimizationResult) GetImprovement ¶
func (r *OptimizationResult) GetImprovement() float64
GetImprovement returns the improvement from initial to best score.
type OptimizationStep ¶
OptimizationStep represents a single evaluation in the optimization.
type OptimizationStrategy ¶
type OptimizationStrategy string
OptimizationStrategy represents prompt optimization strategies.
const (
    // StrategyGrid exhaustive grid search
    StrategyGrid OptimizationStrategy = "grid"
    // StrategyRandom random sampling
    StrategyRandom OptimizationStrategy = "random"
    // StrategyGenetic genetic algorithm
    StrategyGenetic OptimizationStrategy = "genetic"
)
type Optimizer ¶
type Optimizer interface {
// Optimize runs the optimization process.
//
// Args:
// ctx: Context for cancellation
// nIterations: Number of iterations to run
//
// Returns:
// OptimizationResult with best configuration and history
Optimize(ctx context.Context, nIterations int) (*OptimizationResult, error)
}
Optimizer is the base interface for optimization algorithms.
Implementations should provide intelligent search over configuration spaces using various strategies (random search, Bayesian optimization, genetic algorithms, etc.).
type ParameterSpec ¶
type ParameterSpec struct {
Type ParameterType
Low float64 // For continuous/integer
High float64 // For continuous/integer
Values []interface{} // For discrete/categorical
}
ParameterSpec defines a parameter in the search space.
type ParameterType ¶
type ParameterType string
ParameterType specifies the type of a hyperparameter.
const (
    // ParamTypeContinuous represents a continuous parameter (float)
    ParamTypeContinuous ParameterType = "continuous"
    // ParamTypeInteger represents an integer parameter
    ParamTypeInteger ParameterType = "integer"
    // ParamTypeDiscrete represents a discrete set of values
    ParamTypeDiscrete ParameterType = "discrete"
    // ParamTypeCategorical represents categorical values
    ParamTypeCategorical ParameterType = "categorical"
)
type PrecisionRecallMetric ¶
type PrecisionRecallMetric struct {
// contains filtered or unexported fields
}
PrecisionRecallMetric measures precision and recall for classification tasks.
Useful for agents that categorize, filter, or make binary decisions.
Example:
metric := NewPrecisionRecallMetric()
// Agent classifies documents as relevant/not relevant
for _, doc := range testDocs {
score, _ := metric.Measure(agent, doc, output, ctx)
}
func NewPrecisionRecallMetric ¶
func NewPrecisionRecallMetric() *PrecisionRecallMetric
NewPrecisionRecallMetric creates a new precision/recall metric.
func (*PrecisionRecallMetric) Aggregate ¶
func (m *PrecisionRecallMetric) Aggregate(measurements []float64) map[string]float64
Aggregate aggregates precision/recall metrics.
Returns:
Precision, recall, F1 score, and confusion matrix counts
func (*PrecisionRecallMetric) Measure ¶
func (m *PrecisionRecallMetric) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)
Measure measures precision/recall for single classification.
Context must contain:
- "true_label": Ground truth (True/False or 1/0)
- "predicted_label": Agent's prediction (True/False or 1/0)
Returns:
1.0 if correct classification, 0.0 if incorrect
func (*PrecisionRecallMetric) Name ¶
func (m *PrecisionRecallMetric) Name() string
Name returns the metric name.
func (*PrecisionRecallMetric) Reset ¶
func (m *PrecisionRecallMetric) Reset()
Reset resets confusion matrix counts.
type PrecisionRecallStats ¶
type PrecisionRecallStats struct {
TruePositives int
FalsePositives int
FalseNegatives int
TrueNegatives int
}
PrecisionRecallStats contains precision and recall statistics.
func (*PrecisionRecallStats) F1Score ¶
func (s *PrecisionRecallStats) F1Score() float64
F1Score calculates F1 score.
func (*PrecisionRecallStats) Precision ¶
func (s *PrecisionRecallStats) Precision() float64
Precision calculates precision.
func (*PrecisionRecallStats) Recall ¶
func (s *PrecisionRecallStats) Recall() float64
Recall calculates recall.
func (*PrecisionRecallStats) ToDict ¶
func (s *PrecisionRecallStats) ToDict() map[string]float64
ToDict converts stats to dictionary.
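For illustration, the derived scores relate to the raw counts as follows (the counts are invented for the example):
stats := &PrecisionRecallStats{
    TruePositives:  8,
    FalsePositives: 2,
    FalseNegatives: 4,
    TrueNegatives:  6,
}
fmt.Printf("precision=%.2f\n", stats.Precision()) // 8/(8+2) = 0.80
fmt.Printf("recall=%.2f\n", stats.Recall())       // 8/(8+4) = 0.67
fmt.Printf("f1=%.2f\n", stats.F1Score())          // 2PR/(P+R) = 0.73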
type PromptEvaluation ¶
PromptEvaluation represents a single prompt evaluation.
type PromptOptimizationResult ¶
type PromptOptimizationResult struct {
// BestPrompt is the best prompt found
BestPrompt string
// BestConfig is the best variable configuration
BestConfig map[string]string
// BestScores contains best metric scores
BestScores map[string]float64
// History contains all (prompt, config, scores) tuples
History []PromptEvaluation
// NEvaluated is the number of prompts evaluated
NEvaluated int
// Strategy used
Strategy string
// StartTime in milliseconds
StartTime int64
// EndTime in milliseconds
EndTime int64
}
PromptOptimizationResult contains results from prompt optimization.
func (*PromptOptimizationResult) DurationSeconds ¶
func (r *PromptOptimizationResult) DurationSeconds() float64
DurationSeconds returns the duration in seconds.
func (*PromptOptimizationResult) ToDict ¶
func (r *PromptOptimizationResult) ToDict() map[string]interface{}
ToDict converts result to dictionary.
type PromptOptimizer ¶
type PromptOptimizer struct {
// contains filtered or unexported fields
}
PromptOptimizer optimizes prompts through systematic variation.
Supports multiple optimization strategies:
- Grid search: Exhaustive evaluation of all combinations
- Random search: Random sampling of combinations
- Genetic algorithm: Evolutionary optimization
func NewPromptOptimizer ¶
func NewPromptOptimizer(
	template string,
	variations map[string][]string,
	agentFactory AgentFactory,
	metrics []string,
	objectiveMetric *string,
) *PromptOptimizer
NewPromptOptimizer creates a new prompt optimizer.
Args:
template: Prompt template with {variable} placeholders
variations: Map of variable names to possible values
agentFactory: Function that creates agent from prompt string
metrics: List of metrics to evaluate
objectiveMetric: Primary metric for optimization (nil = first metric)
Whether the objective is maximized or minimized is configured separately via SetMaximize.
func (*PromptOptimizer) Optimize ¶
func (p *PromptOptimizer) Optimize(
	ctx context.Context,
	testCases []map[string]interface{},
	strategy string,
	options map[string]interface{},
) (*PromptOptimizationResult, error)
Optimize runs prompt optimization with the specified strategy.
Args:
ctx: Context for cancellation
testCases: Test cases for evaluation
strategy: Optimization strategy ("grid", "random", "genetic")
options: Strategy-specific options (nSamples, populationSize, etc.)
func (*PromptOptimizer) OptimizeGenetic ¶
func (p *PromptOptimizer) OptimizeGenetic(
	ctx context.Context,
	testCases []map[string]interface{},
	populationSize int,
	nGenerations int,
	mutationRate float64,
) (*PromptOptimizationResult, error)
OptimizeGenetic performs genetic algorithm optimization.
func (*PromptOptimizer) OptimizeGrid ¶
func (p *PromptOptimizer) OptimizeGrid(
	ctx context.Context,
	testCases []map[string]interface{},
) (*PromptOptimizationResult, error)
OptimizeGrid performs grid search by evaluating all possible combinations.
func (*PromptOptimizer) OptimizeRandom ¶
func (p *PromptOptimizer) OptimizeRandom(
	ctx context.Context,
	testCases []map[string]interface{},
	nSamples int,
) (*PromptOptimizationResult, error)
OptimizeRandom performs random search by sampling random combinations.
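For example, sampling 20 random prompt variations (the sample count is arbitrary, and optimizer is assumed to be a *PromptOptimizer built with NewPromptOptimizer):
result, err := optimizer.OptimizeRandom(ctx, testCases, 20)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Evaluated %d prompts\n", result.NEvaluated)
fmt.Printf("Best scores: %v\n", result.BestScores)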
func (*PromptOptimizer) SetMaximize ¶
func (p *PromptOptimizer) SetMaximize(maximize bool)
SetMaximize sets whether to maximize (true) or minimize (false) the objective.
type QualityMetrics ¶
type QualityMetrics struct {
// contains filtered or unexported fields
}
QualityMetrics provides comprehensive quality scoring.
Evaluates multiple quality dimensions:
- Relevance: How relevant is response to query?
- Completeness: Does response answer all parts?
- Coherence: Is response logically structured?
- Accuracy: Is information factually correct?
Uses rule-based scoring.
Example:
metric := NewQualityMetrics(false, "", nil)
score, _ := metric.Measure(agent, inputMsg, outputMsg, nil)
fmt.Printf("Quality: %.2f\n", score) // 0.0 to 1.0
func NewQualityMetrics ¶
func NewQualityMetrics(useLLMJudge bool, judgeModel string, weights map[string]float64) *QualityMetrics
NewQualityMetrics creates a new quality metrics instance.
Args:
useLLMJudge: Use LLM to judge quality (not yet implemented)
judgeModel: Model to use for judging (e.g., "claude-sonnet-4")
weights: Weights for each dimension (relevance, completeness, etc.)
Example:
metric := NewQualityMetrics(false, "", nil)
func (*QualityMetrics) Aggregate ¶
func (m *QualityMetrics) Aggregate(measurements []float64) map[string]float64
Aggregate aggregates quality measurements.
Args:
measurements: List of quality scores
Returns:
Statistics: mean, min, max, std
func (*QualityMetrics) Measure ¶
func (m *QualityMetrics) Measure(agent agenkit.Agent, inputMessage, outputMessage *agenkit.Message, ctx map[string]interface{}) (float64, error)
Measure measures response quality.
Args:
agent: Agent being evaluated
inputMessage: Input query
outputMessage: Agent response
ctx: Optional context
Returns:
Quality score (0.0 to 1.0)
type RandomSearchOptimizer ¶
type RandomSearchOptimizer struct {
// contains filtered or unexported fields
}
RandomSearchOptimizer implements baseline random search optimization.
Randomly samples configurations from the search space and evaluates them. Useful as a baseline for comparison with more sophisticated algorithms.
Example:
optimizer := evaluation.NewRandomSearchOptimizer(
objectiveFunc,
searchSpace,
true, // maximize
)
result, err := optimizer.Optimize(ctx, 20)
func NewRandomSearchOptimizer ¶
func NewRandomSearchOptimizer(
	objective ObjectiveFunc,
	searchSpace *SearchSpace,
	maximize bool,
) *RandomSearchOptimizer
NewRandomSearchOptimizer creates a new random search optimizer.
Args:
objective: Function to evaluate configurations
searchSpace: SearchSpace defining parameter space
maximize: Whether to maximize (true) or minimize (false) objective
Returns:
RandomSearchOptimizer instance
func (*RandomSearchOptimizer) GetHistory ¶
func (r *RandomSearchOptimizer) GetHistory() []OptimizationStep
GetHistory returns the optimization history.
func (*RandomSearchOptimizer) Optimize ¶
func (r *RandomSearchOptimizer) Optimize(ctx context.Context, nIterations int) (*OptimizationResult, error)
Optimize runs random search optimization.
Randomly samples nIterations configurations from the search space, evaluates each, and tracks the best configuration found.
Args:
ctx: Context for cancellation
nIterations: Number of configurations to sample and evaluate
Returns:
OptimizationResult with best config, score, and history
error if optimization fails
type RecordingStorage ¶
type RecordingStorage interface {
// SaveRecording saves recording.
SaveRecording(recording *SessionRecording) error
// LoadRecording loads recording by session ID.
LoadRecording(sessionID string) (*SessionRecording, error)
// ListRecordings lists recordings.
ListRecordings(limit, offset int) ([]*SessionRecording, error)
// DeleteRecording deletes recording.
DeleteRecording(sessionID string) error
}
RecordingStorage is the interface for recording storage backends.
Implement this to create custom storage (Redis, S3, Postgres, etc.).
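As a rough sketch (not part of the package), an in-memory backend satisfying this interface could look like the following; pagination in ListRecordings is deliberately omitted:
type memoryStorage struct {
    recordings map[string]*SessionRecording
}

func newMemoryStorage() *memoryStorage {
    return &memoryStorage{recordings: map[string]*SessionRecording{}}
}

func (m *memoryStorage) SaveRecording(rec *SessionRecording) error {
    m.recordings[rec.SessionID] = rec
    return nil
}

func (m *memoryStorage) LoadRecording(sessionID string) (*SessionRecording, error) {
    rec, ok := m.recordings[sessionID]
    if !ok {
        return nil, fmt.Errorf("recording %q not found", sessionID)
    }
    return rec, nil
}

func (m *memoryStorage) ListRecordings(limit, offset int) ([]*SessionRecording, error) {
    all := make([]*SessionRecording, 0, len(m.recordings))
    for _, rec := range m.recordings {
        all = append(all, rec)
    }
    return all, nil // a real backend would honor limit and offset
}

func (m *memoryStorage) DeleteRecording(sessionID string) error {
    delete(m.recordings, sessionID)
    return nil
}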
type Regression ¶
type Regression struct {
MetricName string
BaselineValue float64
CurrentValue float64
DegradationPercent float64
Severity Severity
Timestamp time.Time
Context map[string]interface{}
}
Regression represents a detected regression in agent performance.
Contains information about what degraded and by how much.
func (*Regression) IsRegression ¶
func (r *Regression) IsRegression() bool
IsRegression checks if this is a real regression (not improvement).
func (*Regression) ToDict ¶
func (r *Regression) ToDict() map[string]interface{}
ToDict converts regression to dictionary.
type RegressionDetector ¶
type RegressionDetector struct {
// contains filtered or unexported fields
}
RegressionDetector detects performance regressions by comparing results.
Monitors agent quality over time and alerts when performance degrades beyond acceptable thresholds.
Example:
detector := NewRegressionDetector(nil, nil)
detector.SetBaseline(baselineResult)
// Later, after changes
regressions := detector.Detect(currentResult, true)
if len(regressions) > 0 {
fmt.Printf("Found %d regressions!\n", len(regressions))
for _, r := range regressions {
fmt.Printf(" %s: %.1f%% worse\n", r.MetricName, r.DegradationPercent)
}
}
func NewRegressionDetector ¶
func NewRegressionDetector(thresholds map[string]float64, baseline *EvaluationResult) *RegressionDetector
NewRegressionDetector creates a new regression detector.
Args:
thresholds: Acceptable degradation per metric (default 10%)
baseline: Baseline evaluation result to compare against
Example:
detector := NewRegressionDetector(nil, baselineResult)
func (*RegressionDetector) ClearHistory ¶
func (d *RegressionDetector) ClearHistory()
ClearHistory clears evaluation history.
func (*RegressionDetector) CompareResults ¶
func (d *RegressionDetector) CompareResults(resultA, resultB *EvaluationResult) map[string]map[string]float64
CompareResults compares two evaluation results.
Args:
resultA: First result (baseline)
resultB: Second result (comparison)
Returns:
Dictionary of metric comparisons
func (*RegressionDetector) Detect ¶
func (d *RegressionDetector) Detect(result *EvaluationResult, storeHistory bool) []*Regression
Detect detects regressions in evaluation result.
Compares current result to baseline and identifies metrics that have degraded beyond acceptable thresholds.
Args:
result: Current evaluation result
storeHistory: Whether to store result in history
Returns:
List of detected regressions (empty if no regressions)
func (*RegressionDetector) GetSummary ¶
func (d *RegressionDetector) GetSummary() map[string]interface{}
GetSummary gets summary of detector state.
Returns:
Summary with baseline info and history count
func (*RegressionDetector) GetTrend ¶
func (d *RegressionDetector) GetTrend(metricName string, window int) map[string]interface{}
GetTrend gets trend for metric over recent history.
Args:
metricName: Metric to analyze
window: Number of recent results to analyze
Returns:
Trend statistics (slope, direction, variance)
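For example (the returned map is printed as-is, since the exact key names are not spelled out here):
trend := detector.GetTrend("accuracy", 10)
fmt.Printf("accuracy trend over last 10 results: %v\n", trend)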
func (*RegressionDetector) SetBaseline ¶
func (d *RegressionDetector) SetBaseline(result *EvaluationResult)
SetBaseline sets baseline for comparison.
Args:
result: Evaluation result to use as baseline
type SearchSpace ¶
type SearchSpace struct {
Parameters map[string]ParameterSpec
}
SearchSpace defines the hyperparameter search space.
func (*SearchSpace) AddCategorical ¶
func (s *SearchSpace) AddCategorical(name string, values []string)
AddCategorical adds a categorical parameter with specific values.
func (*SearchSpace) AddContinuous ¶
func (s *SearchSpace) AddContinuous(name string, low, high float64)
AddContinuous adds a continuous parameter with range [low, high].
func (*SearchSpace) AddDiscrete ¶
func (s *SearchSpace) AddDiscrete(name string, values []interface{})
AddDiscrete adds a discrete parameter with specific values.
func (*SearchSpace) AddInteger ¶
func (s *SearchSpace) AddInteger(name string, low, high int)
AddInteger adds an integer parameter with range [low, high].
func (*SearchSpace) Sample ¶
func (s *SearchSpace) Sample() map[string]interface{}
Sample generates a random configuration from the search space.
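A short sketch of building a space and drawing one random configuration (parameter names and ranges are arbitrary):
space := NewSearchSpace()
space.AddContinuous("temperature", 0.0, 1.0)
space.AddInteger("max_tokens", 100, 2000)
space.AddCategorical("style", []string{"concise", "detailed"})
config := space.Sample()
fmt.Printf("Sampled config: %v\n", config)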
type SessionRecorder ¶
type SessionRecorder struct {
// contains filtered or unexported fields
}
SessionRecorder records agent sessions for replay and analysis.
Automatically records all interactions with an agent, storing inputs, outputs, timing, and metadata.
Example:
recorder := NewSessionRecorder(NewFileRecordingStorage("./recordings"))
wrappedAgent := recorder.Wrap(agent)
// Use agent normally (automatically recorded)
response, _ := wrappedAgent.Process(ctx, message)
// Save recording
recorder.FinalizeSession("test-123")
func NewSessionRecorder ¶
func NewSessionRecorder(storage RecordingStorage) *SessionRecorder
NewSessionRecorder creates a new session recorder.
Args:
storage: Storage backend (nil = in-memory)
Example:
recorder := NewSessionRecorder(nil)
func (*SessionRecorder) DeleteRecording ¶
func (r *SessionRecorder) DeleteRecording(sessionID string) error
DeleteRecording deletes recording.
func (*SessionRecorder) FinalizeSession ¶
func (r *SessionRecorder) FinalizeSession(sessionID string) (*SessionRecording, error)
FinalizeSession finalizes and saves session recording.
Args:
sessionID: Session to finalize
Returns:
Session recording
func (*SessionRecorder) ListRecordings ¶
func (r *SessionRecorder) ListRecordings(limit, offset int) ([]*SessionRecording, error)
ListRecordings lists all recordings.
func (*SessionRecorder) LoadRecording ¶
func (r *SessionRecorder) LoadRecording(sessionID string) (*SessionRecording, error)
LoadRecording loads recording from storage.
Args:
sessionID: Session to load
Returns:
Session recording if found
func (*SessionRecorder) RecordInteraction ¶
func (r *SessionRecorder) RecordInteraction(sessionID string, inputMessage, outputMessage *agenkit.Message, latencyMs float64, metadata map[string]interface{})
RecordInteraction records single interaction.
Args:
sessionID: Session identifier
inputMessage: Input to agent
outputMessage: Agent response
latencyMs: Processing time in milliseconds
metadata: Optional interaction metadata
func (*SessionRecorder) StartSession ¶
func (r *SessionRecorder) StartSession(sessionID, agentName string, metadata map[string]interface{})
StartSession starts recording session.
Args:
sessionID: Session identifier
agentName: Name of agent being recorded
metadata: Optional session metadata
type SessionRecording ¶
type SessionRecording struct {
SessionID string
AgentName string
StartTime time.Time
EndTime *time.Time
Interactions []*InteractionRecord
Metadata map[string]interface{}
}
SessionRecording represents a recording of entire session.
Contains all interactions and session metadata.
func SessionRecordingFromDict ¶
func SessionRecordingFromDict(data map[string]interface{}) (*SessionRecording, error)
SessionRecordingFromDict creates recording from dictionary.
func (*SessionRecording) DurationSeconds ¶
func (r *SessionRecording) DurationSeconds() float64
DurationSeconds calculates session duration in seconds.
func (*SessionRecording) InteractionCount ¶
func (r *SessionRecording) InteractionCount() int
InteractionCount gets number of interactions.
func (*SessionRecording) ToDict ¶
func (r *SessionRecording) ToDict() map[string]interface{}
ToDict converts recording to dictionary.
func (*SessionRecording) TotalLatencyMs ¶
func (r *SessionRecording) TotalLatencyMs() float64
TotalLatencyMs gets total latency across all interactions.
type SessionReplay ¶
type SessionReplay struct{}
SessionReplay replays recorded sessions for analysis and A/B testing.
Takes recorded session and replays it through a (possibly different) agent to compare behavior.
Example:
replay := NewSessionReplay()
recording, _ := recorder.LoadRecording("test-123")
// Replay with original agent
resultsA, _ := replay.Replay(recording, agentV1, "")
// Replay with new agent (A/B test)
resultsB, _ := replay.Replay(recording, agentV2, "")
// Compare
comparison := replay.Compare(resultsA, resultsB)
func NewSessionReplay ¶
func NewSessionReplay() *SessionReplay
NewSessionReplay creates a new session replay.
func (*SessionReplay) Compare ¶
func (r *SessionReplay) Compare(resultsA, resultsB map[string]interface{}) map[string]interface{}
Compare compares two replay results.
Useful for A/B testing different agent versions.
Args:
resultsA: First replay results
resultsB: Second replay results
Returns:
Comparison metrics
func (*SessionReplay) Replay ¶
func (r *SessionReplay) Replay(recording *SessionRecording, agent agenkit.Agent, sessionID string) (map[string]interface{}, error)
Replay replays session through agent.
Args:
recording: Session recording to replay
agent: Agent to replay through
sessionID: Optional session ID (defaults to original)
Returns:
Replay results with outputs and metrics
type SessionResult ¶
type SessionResult struct {
// SessionID uniquely identifies this session
SessionID string `json:"session_id"`
// AgentName identifies the agent being evaluated
AgentName string `json:"agent_name"`
// Status of the session
Status SessionStatus `json:"status"`
// StartTime when session started (RFC3339 format)
StartTime string `json:"start_time"`
// EndTime when session ended (RFC3339 format, nil if still running)
EndTime *string `json:"end_time,omitempty"`
// Measurements collected during session
Measurements []MetricMeasurement `json:"measurements"`
// Errors that occurred during session
Errors []ErrorRecord `json:"errors"`
// Metadata for additional context
Metadata map[string]interface{} `json:"metadata,omitempty"`
}
SessionResult contains results from evaluating an agent session with enhanced tracking.
This extends the core EvaluationResult with session status, error tracking, and richer metadata for long-running agent evaluations.
func FromJSON ¶
func FromJSON(jsonStr string) (*SessionResult, error)
FromJSON deserializes a session result from JSON.
func NewSessionResult ¶
func NewSessionResult(sessionID string, agentName string) *SessionResult
NewSessionResult creates a new session result.
func (*SessionResult) AddError ¶
func (sr *SessionResult) AddError(errorType string, message string, details map[string]interface{})
AddError records an error that occurred during the session.
func (*SessionResult) AddMetricMeasurement ¶
func (sr *SessionResult) AddMetricMeasurement(measurement *MetricMeasurement)
AddMetricMeasurement adds a metric measurement to this session.
func (*SessionResult) DurationSeconds ¶
func (sr *SessionResult) DurationSeconds() *float64
DurationSeconds calculates session duration in seconds.
Returns nil if session hasn't ended yet.
func (*SessionResult) GetMetric ¶
func (sr *SessionResult) GetMetric(name string) *MetricMeasurement
GetMetric retrieves a specific metric measurement by name (returns first match).
func (*SessionResult) GetMetricsByType ¶
func (sr *SessionResult) GetMetricsByType(metricType MetricType) []MetricMeasurement
GetMetricsByType retrieves all measurements of a specific type.
func (*SessionResult) SetStatus ¶
func (sr *SessionResult) SetStatus(status SessionStatus)
SetStatus updates the session status.
func (*SessionResult) ToJSON ¶
func (sr *SessionResult) ToJSON() (string, error)
ToJSON serializes the session result to JSON.
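A round trip through JSON might look like this:
jsonStr, err := result.ToJSON()
if err != nil {
    log.Fatal(err)
}
restored, err := FromJSON(jsonStr)
if err != nil {
    log.Fatal(err)
}
fmt.Println(restored.SessionID, restored.Status)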
type SessionStatus ¶
type SessionStatus string
SessionStatus represents the status of an evaluation session.
const (
	// SessionStatusRunning indicates session is currently running
	SessionStatusRunning SessionStatus = "running"
	// SessionStatusCompleted indicates session completed successfully
	SessionStatusCompleted SessionStatus = "completed"
	// SessionStatusFailed indicates session failed
	SessionStatusFailed SessionStatus = "failed"
	// SessionStatusTimeout indicates session timed out
	SessionStatusTimeout SessionStatus = "timeout"
	// SessionStatusCancelled indicates session was cancelled
	SessionStatusCancelled SessionStatus = "cancelled"
)
type SignificanceLevel ¶
type SignificanceLevel float64
SignificanceLevel represents statistical significance thresholds.
const (
	// SignificanceLevel0001 represents 99.9% confidence
	SignificanceLevel0001 SignificanceLevel = 0.001
	// SignificanceLevel001 represents 99% confidence
	SignificanceLevel001 SignificanceLevel = 0.01
	// SignificanceLevel005 represents 95% confidence (default)
	SignificanceLevel005 SignificanceLevel = 0.05
	// SignificanceLevel010 represents 90% confidence
	SignificanceLevel010 SignificanceLevel = 0.10
)
type SimpleQABenchmark ¶
type SimpleQABenchmark struct{}
SimpleQABenchmark is a simple question-answering benchmark.
Tests basic knowledge and reasoning.
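Typical usage is to generate the benchmark's test cases and hand them to an evaluator:
benchmark := NewSimpleQABenchmark()
cases, err := benchmark.GenerateTestCases()
if err != nil {
    log.Fatal(err)
}
fmt.Printf("%s: %d test cases\n", benchmark.Name(), len(cases))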
func NewSimpleQABenchmark ¶
func NewSimpleQABenchmark() *SimpleQABenchmark
NewSimpleQABenchmark creates a new simple Q&A benchmark.
func (*SimpleQABenchmark) Description ¶
func (b *SimpleQABenchmark) Description() string
Description returns the benchmark description.
func (*SimpleQABenchmark) GenerateTestCases ¶
func (b *SimpleQABenchmark) GenerateTestCases() ([]*TestCase, error)
GenerateTestCases generates simple Q&A test cases.
func (*SimpleQABenchmark) Name ¶
func (b *SimpleQABenchmark) Name() string
Name returns the benchmark name.
type StatisticalTestType ¶
type StatisticalTestType string
StatisticalTestType represents types of statistical tests.
const (
	// TestTypeTTest is a parametric test assuming normal distribution
	TestTypeTTest StatisticalTestType = "t_test"
	// TestTypeMannWhitney is a non-parametric test
	TestTypeMannWhitney StatisticalTestType = "mann_whitney"
)
type TestCase ¶
type TestCase struct {
Input string
Expected interface{} // String or validation function
Metadata map[string]interface{}
Tags []string
}
TestCase represents a single test case for evaluation.
Contains input, expected output, and metadata.
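For example, a minimal test case with a literal expected answer (a ValidatorFunc can be supplied in Expected instead for custom checks):
tc := &TestCase{
    Input:    "What is the capital of France?",
    Expected: "Paris",
    Tags:     []string{"qa", "factual"},
    Metadata: map[string]interface{}{"difficulty": "easy"},
}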
type ValidatorFunc ¶
ValidatorFunc is a custom validation function type.