Documentation ¶
Overview ¶
Package evals provides a comprehensive tracing framework for evaluating and monitoring agent interactions.
The evals package enables detailed tracking of agent execution flows, including prompts, tool calls, results, and timing information. It provides a structured approach to capture evaluation data for analysis, debugging, and performance monitoring of AI agents.
Core Components ¶
The package is built around several key types:
- Tracer[T]: Generic interface for creating and managing traces with result type T
- Trace[T]: Complete agent interaction from prompt to result of type T
- ToolCall[T]: Individual tool invocation within a trace of type T
- Observer: Interface for evaluation observing and grading
- ObservableTraceCallback: Function type for trace evaluation callbacks
- NamespacedObserver: Hierarchical namespace management for evaluations
- ResultCollector: Observer wrapper that collects failure messages and grades
- Grade: Structured grade with score and reasoning
Generic Type Parameters ¶
All core types are generic with type parameter T that serves two purposes:
1. **Type Safety**: The Result field in Trace[T] is strongly typed as T instead of interface{}
2. **Context Disambiguation**: Multiple tracers with different result types can coexist in the same context
**Important**: Only Trace.Result is generic (type T). ToolCall.Result remains interface{} for maximum flexibility, as individual tool calls may return varied data types.
## Type Parameter Usage Patterns
### Simple Text Results
For basic string results from agent interactions:
tracer := agenttrace.ByCode[string]() // No callbacks
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Generate summary")
trace.Complete("Summary: The analysis shows...", nil)
### Structured Results
For complex, type-safe results using custom structs:
type AnalysisResult struct {
	TotalFiles  int     `json:"total_files"`
	IssuesFound int     `json:"issues_found"`
	Confidence  float64 `json:"confidence"`
}
tracer := agenttrace.ByCode[AnalysisResult]() // No callbacks
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Analyze codebase")
trace.Complete(AnalysisResult{
	TotalFiles:  42,
	IssuesFound: 3,
	Confidence:  0.95,
}, nil)
### Multiple Tracers with Different Types
The same context can hold tracers for different result types:
ctx := context.Background()
// String tracer for text summaries
stringTracer := agenttrace.ByCode[string](stringCallback)
ctx = agenttrace.WithTracer[string](ctx, stringTracer)
// Structured tracer for metrics
metricsTracer := agenttrace.ByCode[MetricsData](metricsCallback)
ctx = agenttrace.WithTracer[MetricsData](ctx, metricsTracer)
// Both coexist without conflict
summaryTrace := agenttrace.StartTrace[string](ctx, "Generate summary")
metricsTrace := agenttrace.StartTrace[MetricsData](ctx, "Collect metrics")
**Note**: While Trace[interface{}] provides maximum flexibility when result types vary at runtime, prefer specific types when possible for better type safety and API clarity.
Features ¶
- Thread-safe trace and tool call recording
- Automatic trace completion and recording
- Flexible callback system for custom trace processing
- Context-based tracer management
- Structured trace output with timing information
- Support for both successful and failed tool calls
- Concurrent execution support with proper synchronization
- Built-in validation helpers for common evaluation patterns
- Observer interface for test integration and result collection
- Hierarchical namespacing for organized evaluation reporting
- Integration with Go's testing framework
Usage Patterns ¶
## Basic Trace Creation
All traces must be created using a tracer. The simplest approach uses ByCode with no callbacks:
tracer := agenttrace.ByCode[string]() // No callbacks
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Analyze the security report")
toolCall := trace.StartToolCall("tc1", "file-reader", map[string]interface{}{
	"path": "/var/logs/security.log",
})
toolCall.Complete("File content here", nil)
trace.Complete("Analysis complete", nil)
## Context-Based Tracing
For more sophisticated scenarios, use context-managed tracers:
ctx := context.Background()
tracer := agenttrace.ByCode[string](func(trace *agenttrace.Trace[string]) {
	log.Printf("Trace completed: %s", trace.ID)
})
ctx = agenttrace.WithTracer[string](ctx, tracer)
trace := agenttrace.StartTrace[string](ctx, "Process user request")
// ... perform operations
trace.Complete("Request processed", nil)
## Custom Evaluation Callbacks
Create custom tracers with callback functions for specialized evaluation:
tracer := agenttrace.ByCode[string](
	func(trace *agenttrace.Trace[string]) {
		// Save to database
		saveTraceToDatabase(trace)
	},
	func(trace *agenttrace.Trace[string]) {
		// Send metrics
		recordMetrics(trace.Duration(), len(trace.ToolCalls))
	},
)
## Evaluation Helpers
The package provides built-in validation helpers for common evaluation patterns. All helper functions require explicit type parameters matching your trace result type:
// Validate exact number of tool calls
callback := evals.Inject[string](observer, evals.ExactToolCalls[string](2))
// Validate no errors occurred
callback = evals.Inject[string](observer, evals.NoErrors[string]())
// Validate required tool usage
callback = evals.Inject[string](observer, evals.RequiredToolCalls[string]([]string{"search", "analyze"}))
// Custom tool call validation
callback = evals.Inject[string](observer, evals.ToolCallValidator[string](func(o evals.Observer, tc *agenttrace.ToolCall[string]) error {
	if tc.Name == "search" && tc.Result == nil {
		return fmt.Errorf("search tool must return results")
	}
	return nil
}))
## Result Collection
Use ResultCollector to collect failure messages and grades from evaluations:
// Create a base observer (could be namespaced)
baseObs := evals.NewNamespacedObserver(func(name string) evals.Observer {
	return customObserver(name)
})
// Wrap with result collector to capture evaluation outcomes
collector := evals.NewResultCollector(baseObs)
// Use in evaluation callbacks
callback := func(o evals.Observer, trace *agenttrace.Trace[string]) {
	if len(trace.ToolCalls) == 0 {
		o.Fail("No tool calls found")
	}
	o.Grade(0.85, "Good performance")
}
// Run evaluation
tracer := agenttrace.ByCode[string](evals.Inject[string](collector, callback))
// ... create and complete traces
// Collect results
failures := collector.Failures() // []string of failure messages
grades := collector.Grades() // []Grade with scores and reasoning
## Observer and Namespaced Evaluation
Use the NamespacedObserver for hierarchical evaluation organization:
// Create a custom observer implementation
namespacedObs := evals.NewNamespacedObserver(func(name string) evals.Observer {
	return customObserver(name) // your custom implementation
})
// Use with evaluation helpers in organized namespaces
tracer := agenttrace.ByCode[string](
	evals.Inject[string](namespacedObs.Child("accuracy"), evals.ExactToolCalls[string](1)),
	evals.Inject[string](namespacedObs.Child("reliability"), evals.NoErrors[string]()),
)
Integration Patterns ¶
## Default Logging Integration
The package integrates with chainguard-dev/clog for structured logging:
ctx := context.Background()
tracer := evals.NewDefaultTracer[string](ctx) // Uses clog from context
trace := tracer.NewTrace(ctx, "Execute workflow")
## Error Handling
The package handles both tool-level and trace-level errors:
// Tool call that fails
toolCall := trace.StartToolCall("tc1", "api-call", params)
toolCall.Complete(nil, errors.New("API timeout"))
// Bad tool call (invalid parameters)
trace.BadToolCall("tc2", "unknown-tool", badParams, errors.New("unknown tool"))
// Trace that fails
trace.Complete(nil, errors.New("workflow failed"))
Thread Safety ¶
All operations are thread-safe. Multiple goroutines can safely:
- Create and complete tool calls concurrently
- Access trace duration and other methods
- Record traces through tracer callbacks
The package uses fine-grained locking to ensure data consistency while maintaining performance.
Performance Considerations ¶
- Trace IDs are generated with timestamp and randomness for uniqueness
- Tool call and trace durations are calculated efficiently
- String representations limit output size to prevent memory issues
- Callbacks are executed in parallel using errgroup for better performance
Index ¶
- func BuildCallbacks[T any, O Observer](observer *NamespacedObserver[O], evalMap map[string]ObservableTraceCallback[T]) []agenttrace.TraceCallback[T]
- func BuildTracer[T any, O Observer](observer *NamespacedObserver[O], evalMap map[string]ObservableTraceCallback[T]) agenttrace.Tracer[T]
- func Inject[T any](obs Observer, callback ObservableTraceCallback[T]) agenttrace.TraceCallback[T]
- func NewDefaultTracer[T any](ctx context.Context) agenttrace.Tracer[T]
- type Grade
- type MetricsObserver
- type NamespacedObserver
- func (n *NamespacedObserver[T]) Child(name string) *NamespacedObserver[T]
- func (n *NamespacedObserver[T]) Fail(msg string)
- func (n *NamespacedObserver[T]) Grade(score float64, reasoning string)
- func (n *NamespacedObserver[T]) Increment()
- func (n *NamespacedObserver[T]) Log(msg string)
- func (n *NamespacedObserver[T]) Total() int64
- func (n *NamespacedObserver[T]) Walk(visitor func(string, T))
- type ObservableTraceCallback
- func ExactToolCalls[T any](n int) ObservableTraceCallback[T]
- func MaximumNToolCalls[T any](n int) ObservableTraceCallback[T]
- func MinimumNToolCalls[T any](n int) ObservableTraceCallback[T]
- func NoErrors[T any]() ObservableTraceCallback[T]
- func NoToolCalls[T any]() ObservableTraceCallback[T]
- func OnlyToolCalls[T any](toolNames ...string) ObservableTraceCallback[T]
- func RangeToolCalls[T any](min, max int) ObservableTraceCallback[T]
- func RequiredToolCalls[T any](toolNames []string) ObservableTraceCallback[T]
- func ResultValidator[T any](validator func(result T) error) ObservableTraceCallback[T]
- func ToolCallNamed[T any](name string, validator func(o Observer, tc *agenttrace.ToolCall[T]) error) ObservableTraceCallback[T]
- func ToolCallValidator[T any](validator func(o Observer, tc *agenttrace.ToolCall[T]) error) ObservableTraceCallback[T]
- type Observer
- type ResultCollector
- func (r *ResultCollector) Fail(msg string)
- func (r *ResultCollector) Failures() []string
- func (r *ResultCollector) Grade(score float64, reasoning string)
- func (r *ResultCollector) Grades() []Grade
- func (r *ResultCollector) Increment()
- func (r *ResultCollector) Log(msg string)
- func (r *ResultCollector) Total() int64
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func BuildCallbacks ¶
func BuildCallbacks[T any, O Observer](observer *NamespacedObserver[O], evalMap map[string]ObservableTraceCallback[T]) []agenttrace.TraceCallback[T]
BuildCallbacks creates a list of TraceCallbacks from a namespaced observer and evaluation map. This helper injects each evaluation function with a child observer to create TraceCallbacks that can be used with ByCode or other tracers.
func BuildTracer ¶
func BuildTracer[T any, O Observer](observer *NamespacedObserver[O], evalMap map[string]ObservableTraceCallback[T]) agenttrace.Tracer[T]
BuildTracer creates a ByCode tracer from a namespaced observer and evaluation map. This helper consolidates the common pattern of setting up comprehensive evaluation tracers by injecting each evaluation function with a child observer and building a ByCode tracer from the resulting callbacks.
func Inject ¶
func Inject[T any](obs Observer, callback ObservableTraceCallback[T]) agenttrace.TraceCallback[T]
Inject creates a TraceCallback by injecting an Observer implementation into an ObservableTraceCallback
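The adapter pattern behind Inject can be sketched with simplified, non-generic stand-in types (hypothetical signatures for illustration; the real types are generic over the trace result type T):

```go
package main

import "fmt"

// Observer is a cut-down stand-in for the package's Observer interface.
type Observer interface{ Log(string) }

// Stand-in callback types; real traces are structs, not strings.
type TraceCallback func(trace string)
type ObservableTraceCallback func(o Observer, trace string)

// inject captures the Observer in a closure, adapting an observer-aware
// callback into a plain TraceCallback.
func inject(o Observer, cb ObservableTraceCallback) TraceCallback {
	return func(trace string) { cb(o, trace) }
}

// printObserver records messages it observes.
type printObserver struct{ msgs []string }

func (p *printObserver) Log(msg string) { p.msgs = append(p.msgs, msg) }

func main() {
	obs := &printObserver{}
	cb := inject(obs, func(o Observer, trace string) { o.Log("observed: " + trace) })
	cb("trace-1")
	fmt.Println(obs.msgs[0]) // prints "observed: trace-1"
}
```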
func NewDefaultTracer ¶
func NewDefaultTracer[T any](ctx context.Context) agenttrace.Tracer[T]
NewDefaultTracer creates a new default tracer that logs to clog
Example ¶
ExampleNewDefaultTracer demonstrates the default logging tracer.
package main
import (
	"context"
	"fmt"

	"chainguard.dev/driftlessaf/agents/evals"
)

func main() {
	ctx := context.Background()
	// Create default tracer (uses clog for logging)
	tracer := evals.NewDefaultTracer[string](ctx)
	// Create and complete a trace
	trace := tracer.NewTrace(ctx, "System health check")
	healthCall := trace.StartToolCall("health1", "check-services", nil)
	healthCall.Complete("all services healthy", nil)
	// Completing this trace will log structured information
	trace.Complete("Health check passed", nil)
	fmt.Println("Health check trace completed")
}
Output: Health check trace completed
Types ¶
type MetricsObserver ¶
type MetricsObserver struct {
// contains filtered or unexported fields
}
MetricsObserver implements Observer interface with Prometheus metrics
func NewMetricsObserver ¶
func NewMetricsObserver[T any](namespace string) *MetricsObserver
NewMetricsObserver creates a metrics observer for the given tracer type and namespace
func (*MetricsObserver) Fail ¶
func (m *MetricsObserver) Fail(msg string)
Fail implements Observer.Fail
func (*MetricsObserver) Grade ¶
func (m *MetricsObserver) Grade(score float64, reasoning string)
Grade implements Observer.Grade
func (*MetricsObserver) Increment ¶
func (m *MetricsObserver) Increment()
Increment implements Observer.Increment
func (*MetricsObserver) Log ¶
func (m *MetricsObserver) Log(msg string)
Log implements Observer.Log (no-op for metrics observer)
func (*MetricsObserver) Total ¶
func (m *MetricsObserver) Total() int64
Total implements Observer.Total
type NamespacedObserver ¶
type NamespacedObserver[T Observer] struct {
// contains filtered or unexported fields
}
NamespacedObserver provides hierarchical namespacing for Observer instances
func NewNamespacedObserver ¶
func NewNamespacedObserver[T Observer](factory func(string) T) *NamespacedObserver[T]
NewNamespacedObserver creates a new root NamespacedObserver with the given factory function
func (*NamespacedObserver[T]) Child ¶
func (n *NamespacedObserver[T]) Child(name string) *NamespacedObserver[T]
Child returns the child namespace with the given name, creating it if necessary
func (*NamespacedObserver[T]) Fail ¶
func (n *NamespacedObserver[T]) Fail(msg string)
Fail delegates to the inner Observer instance
func (*NamespacedObserver[T]) Grade ¶
func (n *NamespacedObserver[T]) Grade(score float64, reasoning string)
Grade delegates to the inner Observer instance
func (*NamespacedObserver[T]) Increment ¶
func (n *NamespacedObserver[T]) Increment()
Increment delegates to the inner Observer instance
func (*NamespacedObserver[T]) Log ¶
func (n *NamespacedObserver[T]) Log(msg string)
Log delegates to the inner Observer instance
func (*NamespacedObserver[T]) Total ¶
func (n *NamespacedObserver[T]) Total() int64
Total delegates to the inner Observer instance
func (*NamespacedObserver[T]) Walk ¶
func (n *NamespacedObserver[T]) Walk(visitor func(string, T))
Walk traverses the observer tree in depth-first order, calling the visitor function on the current node first, then on all children in sorted order by name
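The traversal order can be sketched with a bare namespace tree (string payloads stand in for the wrapped Observer instances):

```go
package main

import (
	"fmt"
	"sort"
)

// node is a minimal stand-in for the namespace tree.
type node struct {
	name     string
	children map[string]*node
}

// walk visits the current node first, then children sorted by name,
// mirroring the depth-first order described above.
func (n *node) walk(visit func(string)) {
	visit(n.name)
	names := make([]string, 0, len(n.children))
	for name := range n.children {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		n.children[name].walk(visit)
	}
}

func main() {
	root := &node{name: "root", children: map[string]*node{
		"reliability": {name: "root/reliability"},
		"accuracy":    {name: "root/accuracy"},
	}}
	root.walk(func(name string) { fmt.Println(name) })
	// prints root, then root/accuracy, then root/reliability
}
```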
type ObservableTraceCallback ¶
type ObservableTraceCallback[T any] func(Observer, *agenttrace.Trace[T])
ObservableTraceCallback is a function that receives an Observer interface and completed traces
func ExactToolCalls ¶
func ExactToolCalls[T any](n int) ObservableTraceCallback[T]
ExactToolCalls returns an ObservableTraceCallback that validates the trace has exactly n tool calls.
Example ¶
ExampleExactToolCalls shows how to validate exact tool call counts
// Create a mock observer
obs := &mockObserver{}
// Use ExactToolCalls to validate exactly 2 tool calls
evalCallback := evals.ExactToolCalls[string](2)
// Create tracer with the evaluation
tracer := agenttrace.ByCode[string](evals.Inject(obs, evalCallback))
// Create trace with exactly 2 tool calls
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Analyze logs")
// Add exactly 2 tool calls
tc1 := trace.StartToolCall("tc1", "read_logs", nil)
tc1.Complete("log data", nil)
tc2 := trace.StartToolCall("tc2", "analyze", nil)
tc2.Complete("analysis done", nil)
// Complete trace (triggers evaluation)
trace.Complete("Analysis complete", nil)
if len(obs.failures) == 0 {
	fmt.Println("Validation passed: exactly 2 tool calls")
}
Output: Validation passed: exactly 2 tool calls
func MaximumNToolCalls ¶
func MaximumNToolCalls[T any](n int) ObservableTraceCallback[T]
MaximumNToolCalls returns an ObservableTraceCallback that validates the trace has at most n tool calls.
func MinimumNToolCalls ¶
func MinimumNToolCalls[T any](n int) ObservableTraceCallback[T]
MinimumNToolCalls returns an ObservableTraceCallback that validates the trace has at least n tool calls.
func NoErrors ¶
func NoErrors[T any]() ObservableTraceCallback[T]
NoErrors returns an ObservableTraceCallback that validates no tool calls resulted in errors.
Example ¶
ExampleNoErrors shows how to validate no errors occurred
// Create a mock observer
obs := &mockObserver{}
evalCallback := evals.NoErrors[string]()
// Create tracer with the evaluation
tracer := agenttrace.ByCode[string](evals.Inject(obs, evalCallback))
// Create successful trace with no errors
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Read and analyze")
// Add successful tool calls
tc1 := trace.StartToolCall("tc1", "read_logs", nil)
tc1.Complete("log content", nil)
tc2 := trace.StartToolCall("tc2", "analyze", nil)
tc2.Complete("analysis complete", nil)
// Complete trace successfully (triggers evaluation)
trace.Complete("Processing complete", nil)
if len(obs.failures) == 0 {
	fmt.Println("No errors found")
}
Output: No errors found
func NoToolCalls ¶
func NoToolCalls[T any]() ObservableTraceCallback[T]
NoToolCalls returns an ObservableTraceCallback that validates the trace has no tool calls.
func OnlyToolCalls ¶
func OnlyToolCalls[T any](toolNames ...string) ObservableTraceCallback[T]
OnlyToolCalls returns an ObservableTraceCallback that validates the trace only uses the specified tool names.
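The allow-list check this helper performs can be sketched in stand-alone form (a simplification: the real helper reports failures through the Observer rather than returning an error):

```go
package main

import "fmt"

// onlyToolCalls sketches the allow-list check: every used tool name
// must appear in the allowed set.
func onlyToolCalls(used []string, allowed ...string) error {
	allowSet := make(map[string]bool, len(allowed))
	for _, name := range allowed {
		allowSet[name] = true
	}
	for _, name := range used {
		if !allowSet[name] {
			return fmt.Errorf("unexpected tool call: %s", name)
		}
	}
	return nil
}

func main() {
	fmt.Println(onlyToolCalls([]string{"search", "analyze"}, "search", "analyze")) // <nil>
	fmt.Println(onlyToolCalls([]string{"search", "delete"}, "search", "analyze")) // unexpected tool call: delete
}
```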
func RangeToolCalls ¶
func RangeToolCalls[T any](min, max int) ObservableTraceCallback[T]
RangeToolCalls returns an ObservableTraceCallback that validates the trace has between min and max tool calls (inclusive).
func RequiredToolCalls ¶
func RequiredToolCalls[T any](toolNames []string) ObservableTraceCallback[T]
RequiredToolCalls returns an ObservableTraceCallback that validates the trace uses all of the specified tool names at least once.
Example ¶
ExampleRequiredToolCalls shows how to ensure specific tools are called
// Create a mock observer
obs := &mockObserver{}
// Require both read_logs and analyze to be called
evalCallback := evals.RequiredToolCalls[string]([]string{"read_logs", "analyze"})
// Create tracer with the evaluation
tracer := agenttrace.ByCode[string](evals.Inject(obs, evalCallback))
// Create trace and add required tools (plus extra)
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Process data")
// Add required tools
tc1 := trace.StartToolCall("tc1", "read_logs", nil)
tc1.Complete("log data", nil)
tc2 := trace.StartToolCall("tc2", "analyze", nil)
tc2.Complete("analysis done", nil)
// Add extra tool (should be fine)
tc3 := trace.StartToolCall("tc3", "summarize", nil)
tc3.Complete("summary", nil)
// Complete trace (triggers evaluation)
trace.Complete("Processing complete", nil)
if len(obs.failures) == 0 {
	fmt.Println("All required tools were called")
}
Output: All required tools were called
func ResultValidator ¶
func ResultValidator[T any](validator func(result T) error) ObservableTraceCallback[T]
ResultValidator returns an ObservableTraceCallback that validates the result using a custom validator. The validator is only called if the result is non-nil. T should typically be a pointer type like *MyStruct.
func ToolCallNamed ¶
func ToolCallNamed[T any](name string, validator func(o Observer, tc *agenttrace.ToolCall[T]) error) ObservableTraceCallback[T]
ToolCallNamed returns an ObservableTraceCallback that validates tool calls with a specific name using a custom validator.
func ToolCallValidator ¶
func ToolCallValidator[T any](validator func(o Observer, tc *agenttrace.ToolCall[T]) error) ObservableTraceCallback[T]
ToolCallValidator creates an ObservableTraceCallback that validates individual tool calls using a custom validator function.
Example ¶
ExampleToolCallValidator shows custom validation of tool calls
// Create a mock observer
obs := &mockObserver{}
// Validate that all tool calls have a reasoning parameter
validator := func(o evals.Observer, tc *agenttrace.ToolCall[string]) error {
	if _, ok := tc.Params["reasoning"]; !ok {
		return errors.New("missing reasoning parameter")
	}
	return nil
}
evalCallback := evals.ToolCallValidator[string](validator)
// Create tracer with the evaluation
tracer := agenttrace.ByCode[string](evals.Inject(obs, evalCallback))
// Create trace with proper reasoning parameters
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Analyze logs")
// Add tool calls with reasoning parameters
tc1 := trace.StartToolCall("tc1", "read_logs", map[string]any{
	"reasoning": "need to analyze logs",
})
tc1.Complete("log data", nil)
tc2 := trace.StartToolCall("tc2", "analyze", map[string]any{
	"reasoning": "extract error patterns",
})
tc2.Complete("analysis done", nil)
// Complete trace (triggers evaluation)
trace.Complete("Analysis complete", nil)
if len(obs.failures) == 0 {
	fmt.Println("All tool calls have reasoning")
}
Output: All tool calls have reasoning
type Observer ¶
type Observer interface {
	// Fail marks the evaluation as failed with the given message
	// Should be called at most once per Trace evaluation
	Fail(string)
	// Log logs a message
	// Can be called multiple times per Trace evaluation
	Log(string)
	// Grade assigns a rating (0.0-1.0) with reasoning to the trace result
	// Should be called at most once per Trace evaluation
	Grade(score float64, reasoning string)
	// Increment is called each time a trace is evaluated
	Increment()
	// Total returns the number of observed instances
	Total() int64
}
Observer defines an interface for observing and controlling evaluation execution
type ResultCollector ¶
type ResultCollector struct {
// contains filtered or unexported fields
}
ResultCollector wraps an Observer to collect failure messages and grades
Example ¶
package main
import (
	"context"
	"fmt"
	"sync"

	"chainguard.dev/driftlessaf/agents/agenttrace"
	"chainguard.dev/driftlessaf/agents/evals"
)

// exampleObserver implements Observer for examples with thread-safety
type exampleObserver struct {
	failures []string
	logs     []string
	count    int64
	mu       sync.Mutex
}

func (m *exampleObserver) Fail(msg string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.failures = append(m.failures, msg)
}

func (m *exampleObserver) Log(msg string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.logs = append(m.logs, msg)
}

func (m *exampleObserver) Grade(score float64, reasoning string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.logs = append(m.logs, fmt.Sprintf("Grade: %.2f - %s", score, reasoning))
}

func (m *exampleObserver) Increment() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.count++
}

func (m *exampleObserver) Total() int64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.count
}

func main() {
	// Create a mock observer to demonstrate the pattern
	baseObs := &exampleObserver{}
	// Wrap it with a result collector
	collector := evals.NewResultCollector(baseObs)
	// Define an evaluation callback that validates tool calls
	callback := func(o evals.Observer, trace *agenttrace.Trace[string]) {
		o.Log("Analyzing trace")
		if len(trace.ToolCalls) != 1 {
			o.Fail("Expected exactly 1 tool call")
		}
		if trace.Error != nil {
			o.Fail("Unexpected error: " + trace.Error.Error())
		}
		// Give the trace a grade
		o.Grade(0.85, "Good tool usage")
	}
	// Create tracer with the collector
	tracer := agenttrace.ByCode[string](evals.Inject(collector, callback))
	// Create a trace that will trigger the evaluation
	ctx := context.Background()
	trace := tracer.NewTrace(ctx, "Process data")
	// Add a tool call
	tc := trace.StartToolCall("tc1", "data-processor", map[string]any{
		"input": "some data",
	})
	tc.Complete("processed", nil)
	// Complete the trace (this triggers the evaluation)
	trace.Complete("Processing complete", nil)
	// Check collected results
	failures := collector.Failures()
	grades := collector.Grades()
	fmt.Printf("Failures: %d\n", len(failures))
	fmt.Printf("Grades: %d (score: %.2f)\n", len(grades), grades[0].Score)
}
Output:
Failures: 0
Grades: 1 (score: 0.85)
Example (WithNamespacedObserver) ¶
package main
import (
	"context"
	"errors"
	"fmt"
	"sync"

	"chainguard.dev/driftlessaf/agents/agenttrace"
	"chainguard.dev/driftlessaf/agents/evals"
)

// exampleObserver implements Observer for examples with thread-safety
type exampleObserver struct {
	failures []string
	logs     []string
	count    int64
	mu       sync.Mutex
}

func (m *exampleObserver) Fail(msg string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.failures = append(m.failures, msg)
}

func (m *exampleObserver) Log(msg string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.logs = append(m.logs, msg)
}

func (m *exampleObserver) Grade(score float64, reasoning string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.logs = append(m.logs, fmt.Sprintf("Grade: %.2f - %s", score, reasoning))
}

func (m *exampleObserver) Increment() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.count++
}

func (m *exampleObserver) Total() int64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.count
}

func main() {
	// Create a namespaced observer using mock observers
	namespacedObs := evals.NewNamespacedObserver(func(name string) evals.Observer {
		return &exampleObserver{}
	})
	// Create result collectors for different namespaces
	toolCollector := evals.NewResultCollector(namespacedObs.Child("tools"))
	errorCollector := evals.NewResultCollector(namespacedObs.Child("errors"))
	// Define evaluations for tool calls
	toolEval := func(o evals.Observer, trace *agenttrace.Trace[string]) {
		for _, tc := range trace.ToolCalls {
			if tc.Error != nil {
				o.Fail(fmt.Sprintf("Tool %s failed: %v", tc.Name, tc.Error))
			}
		}
	}
	// Define evaluations for trace errors
	errorEval := func(o evals.Observer, trace *agenttrace.Trace[string]) {
		if trace.Error != nil {
			o.Fail("Trace error: " + trace.Error.Error())
		}
	}
	// Create tracer with multiple collectors
	tracer := agenttrace.ByCode[string](
		evals.Inject(toolCollector, toolEval),
		evals.Inject(errorCollector, errorEval),
	)
	// Create a trace with a failing tool call
	ctx := context.Background()
	trace := tracer.NewTrace(ctx, "Complex analysis")
	tc := trace.StartToolCall("tc1", "analyzer", nil)
	tc.Complete(nil, errors.New("analysis failed"))
	// Complete the trace (this triggers both evaluations)
	trace.Complete("Analysis complete", nil)
	// Check failures by category
	toolFailures := toolCollector.Failures()
	errorFailures := errorCollector.Failures()
	fmt.Printf("Tool failures: %d\n", len(toolFailures))
	fmt.Printf("Error failures: %d\n", len(errorFailures))
}
Output:
Tool failures: 1
Error failures: 0
func NewResultCollector ¶
func NewResultCollector(inner Observer) *ResultCollector
NewResultCollector creates a new ResultCollector that wraps the given Observer
func (*ResultCollector) Fail ¶
func (r *ResultCollector) Fail(msg string)
Fail logs the failure message and stores it in the failures list
func (*ResultCollector) Failures ¶
func (r *ResultCollector) Failures() []string
Failures returns a copy of all collected failure messages
func (*ResultCollector) Grade ¶
func (r *ResultCollector) Grade(score float64, reasoning string)
Grade passes through to the inner observer and stores the grade
func (*ResultCollector) Grades ¶
func (r *ResultCollector) Grades() []Grade
Grades returns a copy of all collected grades
func (*ResultCollector) Increment ¶
func (r *ResultCollector) Increment()
Increment passes through to the inner observer
func (*ResultCollector) Log ¶
func (r *ResultCollector) Log(msg string)
Log passes through to the inner observer
func (*ResultCollector) Total ¶
func (r *ResultCollector) Total() int64
Total passes through to the inner observer
Source Files ¶
Directories ¶
| Path | Synopsis |
|---|---|
| report | Package report provides report generation functionality for evaluation results. |
| testevals | Package testevals provides a testing.T adapter for the evals framework. |