Documentation ¶
Overview ¶
Package evals provides a comprehensive tracing framework for evaluating and monitoring agent interactions.
The evals package enables detailed tracking of agent execution flows, including prompts, tool calls, results, and timing information. It provides a structured approach to capture evaluation data for analysis, debugging, and performance monitoring of AI agents.
Core Components ¶
The package is built around several key types:
- Tracer[T]: Generic interface for creating and managing traces with result type T
- Trace[T]: Complete agent interaction from prompt to result of type T
- ToolCall[T]: Individual tool invocation within a trace of type T
- Observer: Interface for evaluation observing and grading
- ObservableTraceCallback: Function type for trace evaluation callbacks
- NamespacedObserver: Hierarchical namespace management for evaluations
- ResultCollector: Observer wrapper that collects failure messages and grades
- Grade: Structured grade with score and reasoning
Generic Type Parameters ¶
All core types are generic with type parameter T that serves two purposes:
1. **Type Safety**: The Result field in Trace[T] is strongly typed as T instead of interface{}
2. **Context Disambiguation**: Multiple tracers with different result types can coexist in the same context
**Important**: Only Trace.Result is generic (type T). ToolCall.Result remains interface{} for maximum flexibility, as individual tool calls may return varied data types.
## Type Parameter Usage Patterns
### Simple Text Results
For basic string results from agent interactions:
tracer := agenttrace.ByCode[string]() // No callbacks
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Generate summary")
trace.Complete("Summary: The analysis shows...", nil)
### Structured Results
For complex, type-safe results using custom structs:
type AnalysisResult struct {
	TotalFiles  int     `json:"total_files"`
	IssuesFound int     `json:"issues_found"`
	Confidence  float64 `json:"confidence"`
}
tracer := agenttrace.ByCode[AnalysisResult]() // No callbacks
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Analyze codebase")
trace.Complete(AnalysisResult{
	TotalFiles:  42,
	IssuesFound: 3,
	Confidence:  0.95,
}, nil)
### Multiple Tracers with Different Types
The same context can hold tracers for different result types:
ctx := context.Background()
// String tracer for text summaries
stringTracer := agenttrace.ByCode[string](stringCallback)
ctx = agenttrace.WithTracer[string](ctx, stringTracer)
// Structured tracer for metrics
metricsTracer := agenttrace.ByCode[MetricsData](metricsCallback)
ctx = agenttrace.WithTracer[MetricsData](ctx, metricsTracer)
// Both coexist without conflict
summaryTrace := agenttrace.StartTrace[string](ctx, "Generate summary")
metricsTrace := agenttrace.StartTrace[MetricsData](ctx, "Collect metrics")
**Note**: While Trace[interface{}] provides maximum flexibility when result types vary at runtime, prefer specific types when possible for better type safety and API clarity.
Features ¶
- Thread-safe trace and tool call recording
- Automatic trace completion and recording
- Flexible callback system for custom trace processing
- Context-based tracer management
- Structured trace output with timing information
- Support for both successful and failed tool calls
- Concurrent execution support with proper synchronization
- Built-in validation helpers for common evaluation patterns
- Observer interface for test integration and result collection
- Hierarchical namespacing for organized evaluation reporting
- Integration with Go's testing framework
Usage Patterns ¶
## Basic Trace Creation
All traces must be created using a tracer. The simplest approach uses ByCode with no callbacks:
tracer := agenttrace.ByCode[string]() // No callbacks
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Analyze the security report")
toolCall := trace.StartToolCall("tc1", "file-reader", map[string]interface{}{
	"path": "/var/logs/security.log",
})
toolCall.Complete("File content here", nil)
trace.Complete("Analysis complete", nil)
## Context-Based Tracing
For more sophisticated scenarios, use context-managed tracers:
ctx := context.Background()
tracer := agenttrace.ByCode[string](func(trace *agenttrace.Trace[string]) {
	log.Printf("Trace completed: %s", trace.ID)
})
ctx = agenttrace.WithTracer[string](ctx, tracer)
trace := agenttrace.StartTrace[string](ctx, "Process user request")
// ... perform operations
trace.Complete("Request processed", nil)
## Custom Evaluation Callbacks
Create custom tracers with callback functions for specialized evaluation:
tracer := agenttrace.ByCode[string](
	func(trace *agenttrace.Trace[string]) {
		// Save to database
		saveTraceToDatabase(trace)
	},
	func(trace *agenttrace.Trace[string]) {
		// Send metrics
		recordMetrics(trace.Duration(), len(trace.ToolCalls))
	},
)
## Evaluation Helpers
The package provides built-in validation helpers for common evaluation patterns. All helper functions require explicit type parameters matching your trace result type:
// Validate exact number of tool calls
callback := evals.Inject[string](observer, evals.ExactToolCalls[string](2))
// Validate no errors occurred
callback = evals.Inject[string](observer, evals.NoErrors[string]())
// Validate required tool usage
callback = evals.Inject[string](observer, evals.RequiredToolCalls[string]([]string{"search", "analyze"}))
// Custom tool call validation
callback = evals.Inject[string](observer, evals.ToolCallValidator[string](func(o evals.Observer, tc *agenttrace.ToolCall[string]) error {
	if tc.Name == "search" && tc.Result == nil {
		return fmt.Errorf("search tool must return results")
	}
	return nil
}))
## Result Collection
Use ResultCollector to collect failure messages and grades from evaluations:
// Create a base observer (could be namespaced)
baseObs := evals.NewNamespacedObserver(func(name string) evals.Observer {
	return customObserver(name)
})
// Wrap with result collector to capture evaluation outcomes
collector := evals.NewResultCollector(baseObs)
// Use in evaluation callbacks
callback := func(o evals.Observer, trace *agenttrace.Trace[string]) {
	if len(trace.ToolCalls) == 0 {
		o.Fail("No tool calls found")
	}
	o.Grade(0.85, "Good performance")
}
// Run evaluation
tracer := agenttrace.ByCode[string](evals.Inject[string](collector, callback))
// ... create and complete traces
// Collect results
failures := collector.Failures() // []string of failure messages
grades := collector.Grades() // []Grade with scores and reasoning
## Observer and Namespaced Evaluation
Use the NamespacedObserver for hierarchical evaluation organization:
// Create a custom observer implementation
namespacedObs := evals.NewNamespacedObserver(func(name string) evals.Observer {
	return customObserver(name) // your custom implementation
})
// Use with evaluation helpers in organized namespaces
tracer := agenttrace.ByCode[string](
	evals.Inject[string](namespacedObs.Child("accuracy"), evals.ExactToolCalls[string](1)),
	evals.Inject[string](namespacedObs.Child("reliability"), evals.NoErrors[string]()),
)
Integration Patterns ¶
## Default Logging Integration
The package integrates with chainguard-dev/clog for structured logging:
ctx := context.Background()
tracer := evals.NewDefaultTracer[string](ctx) // Uses clog from context
trace := tracer.NewTrace(ctx, "Execute workflow")
## Error Handling
The package handles both tool-level and trace-level errors:
// Tool call that fails
toolCall := trace.StartToolCall("tc1", "api-call", params)
toolCall.Complete(nil, errors.New("API timeout"))
// Bad tool call (invalid parameters)
trace.BadToolCall("tc2", "unknown-tool", badParams, errors.New("unknown tool"))
// Trace that fails
trace.Complete(nil, errors.New("workflow failed"))
Thread Safety ¶
All operations are thread-safe. Multiple goroutines can safely:
- Create and complete tool calls concurrently
- Access trace duration and other methods
- Record traces through tracer callbacks
The package uses fine-grained locking to ensure data consistency while maintaining performance.
Performance Considerations ¶
- Trace IDs are generated with timestamp and randomness for uniqueness
- Tool call and trace durations are calculated efficiently
- String representations limit output size to prevent memory issues
- Callbacks are executed in parallel using errgroup for better performance
Index ¶
- func BuildCallbacks[T any, O Observer](observer *NamespacedObserver[O], evalMap map[string]ObservableTraceCallback[T]) []agenttrace.TraceCallback[T]
- func BuildTracer[T any, O Observer](observer *NamespacedObserver[O], evalMap map[string]ObservableTraceCallback[T]) agenttrace.Tracer[T]
- func Inject[T any](obs Observer, callback ObservableTraceCallback[T]) agenttrace.TraceCallback[T]
- func NewDefaultTracer[T any](ctx context.Context) agenttrace.Tracer[T]
- type Grade
- type MetricsObserver
- type NamespacedObserver
- func (n *NamespacedObserver[T]) Child(name string) *NamespacedObserver[T]
- func (n *NamespacedObserver[T]) Fail(msg string)
- func (n *NamespacedObserver[T]) Grade(score float64, reasoning string)
- func (n *NamespacedObserver[T]) Increment()
- func (n *NamespacedObserver[T]) Log(msg string)
- func (n *NamespacedObserver[T]) Total() int64
- func (n *NamespacedObserver[T]) Walk(visitor func(string, T))
- type ObservableTraceCallback
- func ExactToolCalls[T any](n int) ObservableTraceCallback[T]
- func MaximumNToolCalls[T any](n int) ObservableTraceCallback[T]
- func MinimumNToolCalls[T any](n int) ObservableTraceCallback[T]
- func NoErrors[T any]() ObservableTraceCallback[T]
- func NoToolCalls[T any]() ObservableTraceCallback[T]
- func OnlyToolCalls[T any](toolNames ...string) ObservableTraceCallback[T]
- func RangeToolCalls[T any](min, max int) ObservableTraceCallback[T]
- func RequiredToolCalls[T any](toolNames []string) ObservableTraceCallback[T]
- func ResultValidator[T any](validator func(result T) error) ObservableTraceCallback[T]
- func ToolCallNamed[T any](name string, validator func(o Observer, tc *agenttrace.ToolCall[T]) error) ObservableTraceCallback[T]
- func ToolCallValidator[T any](validator func(o Observer, tc *agenttrace.ToolCall[T]) error) ObservableTraceCallback[T]
- type Observer
- type ResultCollector
- func (r *ResultCollector) Fail(msg string)
- func (r *ResultCollector) Failures() []string
- func (r *ResultCollector) Grade(score float64, reasoning string)
- func (r *ResultCollector) Grades() []Grade
- func (r *ResultCollector) Increment()
- func (r *ResultCollector) Log(msg string)
- func (r *ResultCollector) Total() int64
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func BuildCallbacks ¶
func BuildCallbacks[T any, O Observer](observer *NamespacedObserver[O], evalMap map[string]ObservableTraceCallback[T]) []agenttrace.TraceCallback[T]
BuildCallbacks creates a list of TraceCallbacks from a namespaced observer and evaluation map. This helper injects each evaluation function with a child observer to create TraceCallbacks that can be used with ByCode or other tracers.
func BuildTracer ¶
func BuildTracer[T any, O Observer](observer *NamespacedObserver[O], evalMap map[string]ObservableTraceCallback[T]) agenttrace.Tracer[T]
BuildTracer creates a ByCode tracer from a namespaced observer and evaluation map. This helper consolidates the common pattern of setting up comprehensive evaluation tracers by injecting each evaluation function with a child observer and building a ByCode tracer from the resulting callbacks.
func Inject ¶
func Inject[T any](obs Observer, callback ObservableTraceCallback[T]) agenttrace.TraceCallback[T]
Inject creates a TraceCallback by injecting an Observer implementation into an ObservableTraceCallback
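The adapter pattern behind Inject can be sketched with simplified, non-generic stand-in types (hypothetical signatures for illustration; the real types are generic over the trace result type T):

```go
package main

import "fmt"

// Observer is a cut-down stand-in for the package's Observer interface.
type Observer interface{ Log(string) }

// Stand-in callback types; real traces are structs, not strings.
type TraceCallback func(trace string)
type ObservableTraceCallback func(o Observer, trace string)

// inject captures the Observer in a closure, adapting an observer-aware
// callback into a plain TraceCallback.
func inject(o Observer, cb ObservableTraceCallback) TraceCallback {
	return func(trace string) { cb(o, trace) }
}

// printObserver records messages it observes.
type printObserver struct{ msgs []string }

func (p *printObserver) Log(msg string) { p.msgs = append(p.msgs, msg) }

func main() {
	obs := &printObserver{}
	cb := inject(obs, func(o Observer, trace string) { o.Log("observed: " + trace) })
	cb("trace-1")
	fmt.Println(obs.msgs[0]) // prints "observed: trace-1"
}
```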
func NewDefaultTracer ¶
func NewDefaultTracer[T any](ctx context.Context) agenttrace.Tracer[T]
NewDefaultTracer creates a new default tracer that logs to clog
Example ¶
ExampleNewDefaultTracer demonstrates the default logging tracer.
package main
import (
	"context"
	"fmt"

	"chainguard.dev/driftlessaf/agents/evals"
)

func main() {
	ctx := context.Background()
	// Create default tracer (uses clog for logging)
	tracer := evals.NewDefaultTracer[string](ctx)
	// Create and complete a trace
	trace := tracer.NewTrace(ctx, "System health check")
	healthCall := trace.StartToolCall("health1", "check-services", nil)
	healthCall.Complete("all services healthy", nil)
	// Completing this trace will log structured information
	trace.Complete("Health check passed", nil)
	fmt.Println("Health check trace completed")
}
Output: Health check trace completed
Types ¶
type MetricsObserver ¶
type MetricsObserver struct {
// contains filtered or unexported fields
}
MetricsObserver implements Observer interface with Prometheus metrics
func NewMetricsObserver ¶
func NewMetricsObserver[T any](namespace string) *MetricsObserver
NewMetricsObserver creates a metrics observer for the given tracer type and namespace
func (*MetricsObserver) Fail ¶
func (m *MetricsObserver) Fail(msg string)
Fail implements Observer.Fail
func (*MetricsObserver) Grade ¶
func (m *MetricsObserver) Grade(score float64, reasoning string)
Grade implements Observer.Grade
func (*MetricsObserver) Increment ¶
func (m *MetricsObserver) Increment()
Increment implements Observer.Increment
func (*MetricsObserver) Log ¶
func (m *MetricsObserver) Log(msg string)
Log implements Observer.Log (no-op for metrics observer)
func (*MetricsObserver) Total ¶
func (m *MetricsObserver) Total() int64
Total implements Observer.Total
type NamespacedObserver ¶
type NamespacedObserver[T Observer] struct {
// contains filtered or unexported fields
}
NamespacedObserver provides hierarchical namespacing for Observer instances
func NewNamespacedObserver ¶
func NewNamespacedObserver[T Observer](factory func(string) T) *NamespacedObserver[T]
NewNamespacedObserver creates a new root NamespacedObserver with the given factory function
func (*NamespacedObserver[T]) Child ¶
func (n *NamespacedObserver[T]) Child(name string) *NamespacedObserver[T]
Child returns the child namespace with the given name, creating it if necessary
func (*NamespacedObserver[T]) Fail ¶
func (n *NamespacedObserver[T]) Fail(msg string)
Fail delegates to the inner Observer instance
func (*NamespacedObserver[T]) Grade ¶
func (n *NamespacedObserver[T]) Grade(score float64, reasoning string)
Grade delegates to the inner Observer instance
func (*NamespacedObserver[T]) Increment ¶
func (n *NamespacedObserver[T]) Increment()
Increment delegates to the inner Observer instance
func (*NamespacedObserver[T]) Log ¶
func (n *NamespacedObserver[T]) Log(msg string)
Log delegates to the inner Observer instance
func (*NamespacedObserver[T]) Total ¶
func (n *NamespacedObserver[T]) Total() int64
Total delegates to the inner Observer instance
func (*NamespacedObserver[T]) Walk ¶
func (n *NamespacedObserver[T]) Walk(visitor func(string, T))
Walk traverses the observer tree in depth-first order, calling the visitor function on the current node first, then on all children in sorted order by name
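The traversal order can be sketched with a bare namespace tree (string payloads stand in for the wrapped Observer instances):

```go
package main

import (
	"fmt"
	"sort"
)

// node is a minimal stand-in for the namespace tree.
type node struct {
	name     string
	children map[string]*node
}

// walk visits the current node first, then children sorted by name,
// mirroring the depth-first order described above.
func (n *node) walk(visit func(string)) {
	visit(n.name)
	names := make([]string, 0, len(n.children))
	for name := range n.children {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		n.children[name].walk(visit)
	}
}

func main() {
	root := &node{name: "root", children: map[string]*node{
		"reliability": {name: "root/reliability"},
		"accuracy":    {name: "root/accuracy"},
	}}
	root.walk(func(name string) { fmt.Println(name) })
	// prints root, then root/accuracy, then root/reliability
}
```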
type ObservableTraceCallback ¶
type ObservableTraceCallback[T any] func(Observer, *agenttrace.Trace[T])
ObservableTraceCallback is a function that receives an Observer interface and completed traces
func ExactToolCalls ¶
func ExactToolCalls[T any](n int) ObservableTraceCallback[T]
ExactToolCalls returns an ObservableTraceCallback that validates the trace has exactly n tool calls.
Example ¶
ExampleExactToolCalls shows how to validate exact tool call counts
// Create a mock observer
obs := &mockObserver{}
// Use ExactToolCalls to validate exactly 2 tool calls
evalCallback := evals.ExactToolCalls[string](2)
// Create tracer with the evaluation
tracer := agenttrace.ByCode[string](evals.Inject(obs, evalCallback))
// Create trace with exactly 2 tool calls
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Analyze logs")
// Add exactly 2 tool calls
tc1 := trace.StartToolCall("tc1", "read_logs", nil)
tc1.Complete("log data", nil)
tc2 := trace.StartToolCall("tc2", "analyze", nil)
tc2.Complete("analysis done", nil)
// Complete trace (triggers evaluation)
trace.Complete("Analysis complete", nil)
if len(obs.failures) == 0 {
	fmt.Println("Validation passed: exactly 2 tool calls")
}
Output: Validation passed: exactly 2 tool calls
func MaximumNToolCalls ¶
func MaximumNToolCalls[T any](n int) ObservableTraceCallback[T]
MaximumNToolCalls returns an ObservableTraceCallback that validates the trace has at most n tool calls.
func MinimumNToolCalls ¶
func MinimumNToolCalls[T any](n int) ObservableTraceCallback[T]
MinimumNToolCalls returns an ObservableTraceCallback that validates the trace has at least n tool calls.
func NoErrors ¶
func NoErrors[T any]() ObservableTraceCallback[T]
NoErrors returns an ObservableTraceCallback that validates no tool calls resulted in errors.
Example ¶
ExampleNoErrors shows how to validate no errors occurred
// Create a mock observer
obs := &mockObserver{}
evalCallback := evals.NoErrors[string]()
// Create tracer with the evaluation
tracer := agenttrace.ByCode[string](evals.Inject(obs, evalCallback))
// Create successful trace with no errors
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Read and analyze")
// Add successful tool calls
tc1 := trace.StartToolCall("tc1", "read_logs", nil)
tc1.Complete("log content", nil)
tc2 := trace.StartToolCall("tc2", "analyze", nil)
tc2.Complete("analysis complete", nil)
// Complete trace successfully (triggers evaluation)
trace.Complete("Processing complete", nil)
if len(obs.failures) == 0 {
	fmt.Println("No errors found")
}
Output: No errors found
func NoToolCalls ¶
func NoToolCalls[T any]() ObservableTraceCallback[T]
NoToolCalls returns an ObservableTraceCallback that validates the trace has no tool calls.
func OnlyToolCalls ¶
func OnlyToolCalls[T any](toolNames ...string) ObservableTraceCallback[T]
OnlyToolCalls returns an ObservableTraceCallback that validates the trace only uses the specified tool names.
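The allow-list check this helper performs can be sketched in stand-alone form (a simplification: the real helper reports failures through the Observer rather than returning an error):

```go
package main

import "fmt"

// onlyToolCalls sketches the allow-list check: every used tool name
// must appear in the allowed set.
func onlyToolCalls(used []string, allowed ...string) error {
	allowSet := make(map[string]bool, len(allowed))
	for _, name := range allowed {
		allowSet[name] = true
	}
	for _, name := range used {
		if !allowSet[name] {
			return fmt.Errorf("unexpected tool call: %s", name)
		}
	}
	return nil
}

func main() {
	fmt.Println(onlyToolCalls([]string{"search", "analyze"}, "search", "analyze")) // <nil>
	fmt.Println(onlyToolCalls([]string{"search", "delete"}, "search", "analyze")) // unexpected tool call: delete
}
```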
func RangeToolCalls ¶
func RangeToolCalls[T any](min, max int) ObservableTraceCallback[T]
RangeToolCalls returns an ObservableTraceCallback that validates the trace has between min and max tool calls (inclusive).
func RequiredToolCalls ¶
func RequiredToolCalls[T any](toolNames []string) ObservableTraceCallback[T]
RequiredToolCalls returns an ObservableTraceCallback that validates the trace uses all of the specified tool names at least once.
Example ¶
ExampleRequiredToolCalls shows how to ensure specific tools are called
// Create a mock observer
obs := &mockObserver{}
// Require both read_logs and analyze to be called
evalCallback := evals.RequiredToolCalls[string]([]string{"read_logs", "analyze"})
// Create tracer with the evaluation
tracer := agenttrace.ByCode[string](evals.Inject(obs, evalCallback))
// Create trace and add required tools (plus extra)
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Process data")
// Add required tools
tc1 := trace.StartToolCall("tc1", "read_logs", nil)
tc1.Complete("log data", nil)
tc2 := trace.StartToolCall("tc2", "analyze", nil)
tc2.Complete("analysis done", nil)
// Add extra tool (should be fine)
tc3 := trace.StartToolCall("tc3", "summarize", nil)
tc3.Complete("summary", nil)
// Complete trace (triggers evaluation)
trace.Complete("Processing complete", nil)
if len(obs.failures) == 0 {
	fmt.Println("All required tools were called")
}
Output: All required tools were called
func ResultValidator ¶
func ResultValidator[T any](validator func(result T) error) ObservableTraceCallback[T]
ResultValidator returns an ObservableTraceCallback that validates the result using a custom validator. The validator is only called if the result is non-nil. T should typically be a pointer type like *MyStruct.
func ToolCallNamed ¶
func ToolCallNamed[T any](name string, validator func(o Observer, tc *agenttrace.ToolCall[T]) error) ObservableTraceCallback[T]
ToolCallNamed returns an ObservableTraceCallback that validates tool calls with a specific name using a custom validator.
func ToolCallValidator ¶
func ToolCallValidator[T any](validator func(o Observer, tc *agenttrace.ToolCall[T]) error) ObservableTraceCallback[T]
ToolCallValidator creates an ObservableTraceCallback that validates individual tool calls using a custom validator function.
Example ¶
ExampleToolCallValidator shows custom validation of tool calls
// Create a mock observer
obs := &mockObserver{}
// Validate that all tool calls have a reasoning parameter
validator := func(o evals.Observer, tc *agenttrace.ToolCall[string]) error {
	if _, ok := tc.Params["reasoning"]; !ok {
		return errors.New("missing reasoning parameter")
	}
	return nil
}
evalCallback := evals.ToolCallValidator[string](validator)
// Create tracer with the evaluation
tracer := agenttrace.ByCode[string](evals.Inject(obs, evalCallback))
// Create trace with proper reasoning parameters
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Analyze logs")
// Add tool calls with reasoning parameters
tc1 := trace.StartToolCall("tc1", "read_logs", map[string]any{
	"reasoning": "need to analyze logs",
})
tc1.Complete("log data", nil)
tc2 := trace.StartToolCall("tc2", "analyze", map[string]any{
	"reasoning": "extract error patterns",
})
tc2.Complete("analysis done", nil)
// Complete trace (triggers evaluation)
trace.Complete("Analysis complete", nil)
if len(obs.failures) == 0 {
	fmt.Println("All tool calls have reasoning")
}
Output: All tool calls have reasoning
type Observer ¶
type Observer interface {
	// Fail marks the evaluation as failed with the given message
	// Should be called at most once per Trace evaluation
	Fail(string)
	// Log logs a message
	// Can be called multiple times per Trace evaluation
	Log(string)
	// Grade assigns a rating (0.0-1.0) with reasoning to the trace result
	// Should be called at most once per Trace evaluation
	Grade(score float64, reasoning string)
	// Increment is called each time a trace is evaluated
	Increment()
	// Total returns the number of observed instances
	Total() int64
}
Observer defines an interface for observing and controlling evaluation execution
type ResultCollector ¶
type ResultCollector struct {
// contains filtered or unexported fields
}
ResultCollector wraps an Observer to collect failure messages and grades
Example ¶
package main
import (
	"context"
	"fmt"
	"sync"

	"chainguard.dev/driftlessaf/agents/agenttrace"
	"chainguard.dev/driftlessaf/agents/evals"
)

// exampleObserver implements Observer for examples with thread-safety
type exampleObserver struct {
	failures []string
	logs     []string
	count    int64
	mu       sync.Mutex
}

func (m *exampleObserver) Fail(msg string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.failures = append(m.failures, msg)
}

func (m *exampleObserver) Log(msg string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.logs = append(m.logs, msg)
}

func (m *exampleObserver) Grade(score float64, reasoning string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.logs = append(m.logs, fmt.Sprintf("Grade: %.2f - %s", score, reasoning))
}

func (m *exampleObserver) Increment() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.count++
}

func (m *exampleObserver) Total() int64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.count
}

func main() {
	// Create a mock observer to demonstrate the pattern
	baseObs := &exampleObserver{}
	// Wrap it with a result collector
	collector := evals.NewResultCollector(baseObs)
	// Define an evaluation callback that validates tool calls
	callback := func(o evals.Observer, trace *agenttrace.Trace[string]) {
		o.Log("Analyzing trace")
		if len(trace.ToolCalls) != 1 {
			o.Fail("Expected exactly 1 tool call")
		}
		if trace.Error != nil {
			o.Fail("Unexpected error: " + trace.Error.Error())
		}
		// Give the trace a grade
		o.Grade(0.85, "Good tool usage")
	}
	// Create tracer with the collector
	tracer := agenttrace.ByCode[string](evals.Inject(collector, callback))
	// Create a trace that will trigger the evaluation
	ctx := context.Background()
	trace := tracer.NewTrace(ctx, "Process data")
	// Add a tool call
	tc := trace.StartToolCall("tc1", "data-processor", map[string]any{
		"input": "some data",
	})
	tc.Complete("processed", nil)
	// Complete the trace (this triggers the evaluation)
	trace.Complete("Processing complete", nil)
	// Check collected results
	failures := collector.Failures()
	grades := collector.Grades()
	fmt.Printf("Failures: %d\n", len(failures))
	fmt.Printf("Grades: %d (score: %.2f)\n", len(grades), grades[0].Score)
}
Output:
Failures: 0
Grades: 1 (score: 0.85)
Example (WithNamespacedObserver) ¶
package main
import (
	"context"
	"errors"
	"fmt"
	"sync"

	"chainguard.dev/driftlessaf/agents/agenttrace"
	"chainguard.dev/driftlessaf/agents/evals"
)

// exampleObserver implements Observer for examples with thread-safety
type exampleObserver struct {
	failures []string
	logs     []string
	count    int64
	mu       sync.Mutex
}

func (m *exampleObserver) Fail(msg string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.failures = append(m.failures, msg)
}

func (m *exampleObserver) Log(msg string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.logs = append(m.logs, msg)
}

func (m *exampleObserver) Grade(score float64, reasoning string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.logs = append(m.logs, fmt.Sprintf("Grade: %.2f - %s", score, reasoning))
}

func (m *exampleObserver) Increment() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.count++
}

func (m *exampleObserver) Total() int64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.count
}

func main() {
	// Create a namespaced observer using mock observers
	namespacedObs := evals.NewNamespacedObserver(func(name string) evals.Observer {
		return &exampleObserver{}
	})
	// Create result collectors for different namespaces
	toolCollector := evals.NewResultCollector(namespacedObs.Child("tools"))
	errorCollector := evals.NewResultCollector(namespacedObs.Child("errors"))
	// Define evaluations for tool calls
	toolEval := func(o evals.Observer, trace *agenttrace.Trace[string]) {
		for _, tc := range trace.ToolCalls {
			if tc.Error != nil {
				o.Fail(fmt.Sprintf("Tool %s failed: %v", tc.Name, tc.Error))
			}
		}
	}
	// Define evaluations for trace errors
	errorEval := func(o evals.Observer, trace *agenttrace.Trace[string]) {
		if trace.Error != nil {
			o.Fail("Trace error: " + trace.Error.Error())
		}
	}
	// Create tracer with multiple collectors
	tracer := agenttrace.ByCode[string](
		evals.Inject(toolCollector, toolEval),
		evals.Inject(errorCollector, errorEval),
	)
	// Create a trace with a failing tool call
	ctx := context.Background()
	trace := tracer.NewTrace(ctx, "Complex analysis")
	tc := trace.StartToolCall("tc1", "analyzer", nil)
	tc.Complete(nil, errors.New("analysis failed"))
	// Complete the trace (this triggers both evaluations)
	trace.Complete("Analysis complete", nil)
	// Check failures by category
	toolFailures := toolCollector.Failures()
	errorFailures := errorCollector.Failures()
	fmt.Printf("Tool failures: %d\n", len(toolFailures))
	fmt.Printf("Error failures: %d\n", len(errorFailures))
}
Output:
Tool failures: 1
Error failures: 0
func NewResultCollector ¶
func NewResultCollector(inner Observer) *ResultCollector
NewResultCollector creates a new ResultCollector that wraps the given Observer
func (*ResultCollector) Fail ¶
func (r *ResultCollector) Fail(msg string)
Fail logs the failure message and stores it in the failures list
func (*ResultCollector) Failures ¶
func (r *ResultCollector) Failures() []string
Failures returns a copy of all collected failure messages
func (*ResultCollector) Grade ¶
func (r *ResultCollector) Grade(score float64, reasoning string)
Grade passes through to the inner observer and stores the grade
func (*ResultCollector) Grades ¶
func (r *ResultCollector) Grades() []Grade
Grades returns a copy of all collected grades
func (*ResultCollector) Increment ¶
func (r *ResultCollector) Increment()
Increment passes through to the inner observer
func (*ResultCollector) Log ¶
func (r *ResultCollector) Log(msg string)
Log passes through to the inner observer
func (*ResultCollector) Total ¶
func (r *ResultCollector) Total() int64
Total passes through to the inner observer
Source Files ¶
Directories ¶
| Path | Synopsis |
|---|---|
| report | Package report provides report generation functionality for evaluation results. |
| testevals | Package testevals provides a testing.T adapter for the evals framework. |