evals

package v0.0.0-...-89b5c40
Published: Feb 26, 2026 License: Apache-2.0 Imports: 11 Imported by: 0

Documentation

Overview

Package evals provides a comprehensive tracing framework for evaluating and monitoring agent interactions.

Overview

The evals package enables detailed tracking of agent execution flows, including prompts, tool calls, results, and timing information. It provides a structured approach to capture evaluation data for analysis, debugging, and performance monitoring of AI agents.

Core Components

The package is built around several key types:

  • Tracer[T]: Generic interface for creating and managing traces with result type T
  • Trace[T]: Complete agent interaction from prompt to result of type T
  • ToolCall[T]: Individual tool invocation within a trace of type T
  • Observer: Interface for evaluation observing and grading
  • ObservableTraceCallback: Function type for trace evaluation callbacks
  • NamespacedObserver: Hierarchical namespace management for evaluations
  • ResultCollector: Observer wrapper that collects failure messages and grades
  • Grade: Structured grade with score and reasoning

Generic Type Parameters

All core types are generic with type parameter T that serves two purposes:

1. **Type Safety**: The Result field in Trace[T] is strongly typed as T instead of interface{}
2. **Context Disambiguation**: Multiple tracers with different result types can coexist in the same context

**Important**: Only Trace.Result is generic (type T). ToolCall.Result remains interface{} for maximum flexibility, as individual tool calls may return varied data types.

## Type Parameter Usage Patterns

### Simple Text Results

For basic string results from agent interactions:

tracer := agenttrace.ByCode[string]() // No callbacks
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Generate summary")
trace.Complete("Summary: The analysis shows...", nil)

### Structured Results

For complex, type-safe results using custom structs:

type AnalysisResult struct {
	TotalFiles   int     `json:"total_files"`
	IssuesFound  int     `json:"issues_found"`
	Confidence   float64 `json:"confidence"`
}

tracer := agenttrace.ByCode[AnalysisResult]() // No callbacks
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Analyze codebase")
trace.Complete(AnalysisResult{
	TotalFiles:  42,
	IssuesFound: 3,
	Confidence:  0.95,
}, nil)

### Multiple Tracers with Different Types

The same context can hold tracers for different result types:

ctx := context.Background()

// String tracer for text summaries
stringTracer := agenttrace.ByCode[string](stringCallback)
ctx = agenttrace.WithTracer[string](ctx, stringTracer)

// Structured tracer for metrics
metricsTracer := agenttrace.ByCode[MetricsData](metricsCallback)
ctx = agenttrace.WithTracer[MetricsData](ctx, metricsTracer)

// Both coexist without conflict
summaryTrace := agenttrace.StartTrace[string](ctx, "Generate summary")
metricsTrace := agenttrace.StartTrace[MetricsData](ctx, "Collect metrics")

**Note**: While Trace[interface{}] provides maximum flexibility when result types vary at runtime, prefer specific types when possible for better type safety and API clarity.

Features

  • Thread-safe trace and tool call recording
  • Automatic trace completion and recording
  • Flexible callback system for custom trace processing
  • Context-based tracer management
  • Structured trace output with timing information
  • Support for both successful and failed tool calls
  • Concurrent execution support with proper synchronization
  • Built-in validation helpers for common evaluation patterns
  • Observer interface for test integration and result collection
  • Hierarchical namespacing for organized evaluation reporting
  • Integration with Go's testing framework

Usage Patterns

## Basic Trace Creation

All traces must be created using a tracer. The simplest approach uses ByCode with no callbacks:

tracer := agenttrace.ByCode[string]() // No callbacks
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Analyze the security report")
toolCall := trace.StartToolCall("tc1", "file-reader", map[string]interface{}{
	"path": "/var/logs/security.log",
})
toolCall.Complete("File content here", nil)
trace.Complete("Analysis complete", nil)

## Context-Based Tracing

For more sophisticated scenarios, use context-managed tracers:

ctx := context.Background()
tracer := agenttrace.ByCode[string](func(trace *agenttrace.Trace[string]) {
	log.Printf("Trace completed: %s", trace.ID)
})
ctx = agenttrace.WithTracer[string](ctx, tracer)

trace := agenttrace.StartTrace[string](ctx, "Process user request")
// ... perform operations
trace.Complete("Request processed", nil)

## Custom Evaluation Callbacks

Create custom tracers with callback functions for specialized evaluation:

tracer := agenttrace.ByCode[string](
	func(trace *agenttrace.Trace[string]) {
		// Save to database
		saveTraceToDatabase(trace)
	},
	func(trace *agenttrace.Trace[string]) {
		// Send metrics
		recordMetrics(trace.Duration(), len(trace.ToolCalls))
	},
)

## Evaluation Helpers

The package provides built-in validation helpers for common evaluation patterns. All helper functions require explicit type parameters matching your trace result type:

// Validate exact number of tool calls
callback := evals.Inject[string](observer, evals.ExactToolCalls[string](2))

// Validate no errors occurred
callback = evals.Inject[string](observer, evals.NoErrors[string]())

// Validate required tool usage
callback = evals.Inject[string](observer, evals.RequiredToolCalls[string]([]string{"search", "analyze"}))

// Custom tool call validation
callback = evals.Inject[string](observer, evals.ToolCallValidator[string](func(o evals.Observer, tc *agenttrace.ToolCall[string]) error {
	if tc.Name == "search" && tc.Result == nil {
		return fmt.Errorf("search tool must return results")
	}
	return nil
}))

## Result Collection

Use ResultCollector to collect failure messages and grades from evaluations:

// Create a base observer (could be namespaced)
baseObs := evals.NewNamespacedObserver(func(name string) evals.Observer {
	return customObserver(name)
})

// Wrap with result collector to capture evaluation outcomes
collector := evals.NewResultCollector(baseObs)

// Use in evaluation callbacks
callback := func(o evals.Observer, trace *agenttrace.Trace[string]) {
	if len(trace.ToolCalls) == 0 {
		o.Fail("No tool calls found")
	}
	o.Grade(0.85, "Good performance")
}

// Run evaluation
tracer := agenttrace.ByCode[string](evals.Inject[string](collector, callback))
// ... create and complete traces

// Collect results
failures := collector.Failures()  // []string of failure messages
grades := collector.Grades()      // []Grade with scores and reasoning

## Observer and Namespaced Evaluation

Use the NamespacedObserver for hierarchical evaluation organization:

// Create a custom observer implementation
namespacedObs := evals.NewNamespacedObserver(func(name string) evals.Observer {
	return customObserver(name)  // your custom implementation
})

// Use with evaluation helpers in organized namespaces
tracer := agenttrace.ByCode[string](
	evals.Inject[string](namespacedObs.Child("accuracy"), evals.ExactToolCalls[string](1)),
	evals.Inject[string](namespacedObs.Child("reliability"), evals.NoErrors[string]()),
)

Integration Patterns

## Default Logging Integration

The package integrates with chainguard-dev/clog for structured logging:

ctx := context.Background()
tracer := evals.NewDefaultTracer[string](ctx) // Uses clog from context
trace := tracer.NewTrace(ctx, "Execute workflow")

## Error Handling

The package handles both tool-level and trace-level errors:

// Tool call that fails
toolCall := trace.StartToolCall("tc1", "api-call", params)
toolCall.Complete(nil, errors.New("API timeout"))

// Bad tool call (invalid parameters)
trace.BadToolCall("tc2", "unknown-tool", badParams, errors.New("unknown tool"))

// Trace that fails
trace.Complete(nil, errors.New("workflow failed"))

Thread Safety

All operations are thread-safe. Multiple goroutines can safely:

  • Create and complete tool calls concurrently
  • Access trace duration and other methods
  • Record traces through tracer callbacks

The package uses fine-grained locking to ensure data consistency while maintaining performance.

Performance Considerations

  • Trace IDs are generated with timestamp and randomness for uniqueness
  • Tool call and trace durations are calculated efficiently
  • String representations limit output size to prevent memory issues
  • Callbacks are executed in parallel using errgroup for better performance

Constants

This section is empty.

Variables

This section is empty.

Functions

func BuildCallbacks

func BuildCallbacks[T any, O Observer](observer *NamespacedObserver[O], evalMap map[string]ObservableTraceCallback[T]) []agenttrace.TraceCallback[T]

BuildCallbacks creates a list of TraceCallbacks from a namespaced observer and evaluation map. This helper injects each evaluation function with a child observer to create TraceCallbacks that can be used with ByCode or other tracers.

func BuildTracer

func BuildTracer[T any, O Observer](observer *NamespacedObserver[O], evalMap map[string]ObservableTraceCallback[T]) agenttrace.Tracer[T]

BuildTracer creates a ByCode tracer from a namespaced observer and evaluation map. This helper consolidates the common pattern of setting up comprehensive evaluation tracers by injecting each evaluation function with a child observer and building a ByCode tracer from the resulting callbacks.

func Inject

func Inject[T any](obs Observer, callback ObservableTraceCallback[T]) agenttrace.TraceCallback[T]

Inject creates a TraceCallback by injecting an Observer implementation into an ObservableTraceCallback

func NewDefaultTracer

func NewDefaultTracer[T any](ctx context.Context) agenttrace.Tracer[T]

NewDefaultTracer creates a new default tracer that logs to clog

Example

ExampleNewDefaultTracer demonstrates the default logging tracer.

package main

import (
	"context"
	"fmt"

	"chainguard.dev/driftlessaf/agents/evals"
)

func main() {
	ctx := context.Background()

	// Create default tracer (uses clog for logging)
	tracer := evals.NewDefaultTracer[string](ctx)

	// Create and complete a trace
	trace := tracer.NewTrace(ctx, "System health check")

	healthCall := trace.StartToolCall("health1", "check-services", nil)
	healthCall.Complete("all services healthy", nil)

	// Completing this trace will log structured information
	trace.Complete("Health check passed", nil)

	fmt.Println("Health check trace completed")
}
Output:

Health check trace completed

Types

type Grade

type Grade struct {
	Score     float64
	Reasoning string
}

Grade represents a grade with score and reasoning

type MetricsObserver

type MetricsObserver struct {
	// contains filtered or unexported fields
}

MetricsObserver implements Observer interface with Prometheus metrics

func NewMetricsObserver

func NewMetricsObserver[T any](namespace string) *MetricsObserver

NewMetricsObserver creates a metrics observer for the given tracer type and namespace

func (*MetricsObserver) Fail

func (m *MetricsObserver) Fail(msg string)

Fail implements Observer.Fail

func (*MetricsObserver) Grade

func (m *MetricsObserver) Grade(score float64, reasoning string)

Grade implements Observer.Grade

func (*MetricsObserver) Increment

func (m *MetricsObserver) Increment()

Increment implements Observer.Increment

func (*MetricsObserver) Log

func (m *MetricsObserver) Log(msg string)

Log implements Observer.Log (no-op for metrics observer)

func (*MetricsObserver) Total

func (m *MetricsObserver) Total() int64

Total implements Observer.Total

type NamespacedObserver

type NamespacedObserver[T Observer] struct {
	// contains filtered or unexported fields
}

NamespacedObserver provides hierarchical namespacing for Observer instances

func NewNamespacedObserver

func NewNamespacedObserver[T Observer](factory func(string) T) *NamespacedObserver[T]

NewNamespacedObserver creates a new root NamespacedObserver with the given factory function

func (*NamespacedObserver[T]) Child

func (n *NamespacedObserver[T]) Child(name string) *NamespacedObserver[T]

Child returns the child namespace with the given name, creating it if necessary

func (*NamespacedObserver[T]) Fail

func (n *NamespacedObserver[T]) Fail(msg string)

Fail delegates to the inner Observer instance

func (*NamespacedObserver[T]) Grade

func (n *NamespacedObserver[T]) Grade(score float64, reasoning string)

Grade delegates to the inner Observer instance

func (*NamespacedObserver[T]) Increment

func (n *NamespacedObserver[T]) Increment()

Increment delegates to the inner Observer instance

func (*NamespacedObserver[T]) Log

func (n *NamespacedObserver[T]) Log(msg string)

Log delegates to the inner Observer instance

func (*NamespacedObserver[T]) Total

func (n *NamespacedObserver[T]) Total() int64

Total delegates to the inner Observer instance

func (*NamespacedObserver[T]) Walk

func (n *NamespacedObserver[T]) Walk(visitor func(string, T))

Walk traverses the observer tree in depth-first order, calling the visitor function on the current node first, then on all children in sorted order by name

type ObservableTraceCallback

type ObservableTraceCallback[T any] func(Observer, *agenttrace.Trace[T])

ObservableTraceCallback is a function that receives an Observer interface and completed traces

func ExactToolCalls

func ExactToolCalls[T any](n int) ObservableTraceCallback[T]

ExactToolCalls returns an ObservableTraceCallback that validates the trace has exactly n tool calls.

Example

ExampleExactToolCalls shows how to validate exact tool call counts

// Create a mock observer
obs := &mockObserver{}

// Use ExactToolCalls to validate exactly 2 tool calls
evalCallback := evals.ExactToolCalls[string](2)

// Create tracer with the evaluation
tracer := agenttrace.ByCode[string](evals.Inject(obs, evalCallback))

// Create trace with exactly 2 tool calls
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Analyze logs")

// Add exactly 2 tool calls
tc1 := trace.StartToolCall("tc1", "read_logs", nil)
tc1.Complete("log data", nil)

tc2 := trace.StartToolCall("tc2", "analyze", nil)
tc2.Complete("analysis done", nil)

// Complete trace (triggers evaluation)
trace.Complete("Analysis complete", nil)

if len(obs.failures) == 0 {
	fmt.Println("Validation passed: exactly 2 tool calls")
}
Output:

Validation passed: exactly 2 tool calls

func MaximumNToolCalls

func MaximumNToolCalls[T any](n int) ObservableTraceCallback[T]

MaximumNToolCalls returns an ObservableTraceCallback that validates the trace has at most n tool calls.

func MinimumNToolCalls

func MinimumNToolCalls[T any](n int) ObservableTraceCallback[T]

MinimumNToolCalls returns an ObservableTraceCallback that validates the trace has at least n tool calls.

func NoErrors

func NoErrors[T any]() ObservableTraceCallback[T]

NoErrors returns an ObservableTraceCallback that validates no tool calls resulted in errors.

Example

ExampleNoErrors shows how to validate no errors occurred

// Create a mock observer
obs := &mockObserver{}

evalCallback := evals.NoErrors[string]()

// Create tracer with the evaluation
tracer := agenttrace.ByCode[string](evals.Inject(obs, evalCallback))

// Create successful trace with no errors
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Read and analyze")

// Add successful tool calls
tc1 := trace.StartToolCall("tc1", "read_logs", nil)
tc1.Complete("log content", nil)

tc2 := trace.StartToolCall("tc2", "analyze", nil)
tc2.Complete("analysis complete", nil)

// Complete trace successfully (triggers evaluation)
trace.Complete("Processing complete", nil)

if len(obs.failures) == 0 {
	fmt.Println("No errors found")
}
Output:

No errors found

func NoToolCalls

func NoToolCalls[T any]() ObservableTraceCallback[T]

NoToolCalls returns an ObservableTraceCallback that validates the trace has no tool calls.

func OnlyToolCalls

func OnlyToolCalls[T any](toolNames ...string) ObservableTraceCallback[T]

OnlyToolCalls returns an ObservableTraceCallback that validates the trace only uses the specified tool names.

func RangeToolCalls

func RangeToolCalls[T any](min, max int) ObservableTraceCallback[T]

RangeToolCalls returns an ObservableTraceCallback that validates the trace has between min and max tool calls (inclusive).

func RequiredToolCalls

func RequiredToolCalls[T any](toolNames []string) ObservableTraceCallback[T]

RequiredToolCalls returns an ObservableTraceCallback that validates the trace uses all of the specified tool names at least once.

Example

ExampleRequiredToolCalls shows how to ensure specific tools are called

// Create a mock observer
obs := &mockObserver{}

// Require both read_logs and analyze to be called
evalCallback := evals.RequiredToolCalls[string]([]string{"read_logs", "analyze"})

// Create tracer with the evaluation
tracer := agenttrace.ByCode[string](evals.Inject(obs, evalCallback))

// Create trace and add required tools (plus extra)
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Process data")

// Add required tools
tc1 := trace.StartToolCall("tc1", "read_logs", nil)
tc1.Complete("log data", nil)

tc2 := trace.StartToolCall("tc2", "analyze", nil)
tc2.Complete("analysis done", nil)

// Add extra tool (should be fine)
tc3 := trace.StartToolCall("tc3", "summarize", nil)
tc3.Complete("summary", nil)

// Complete trace (triggers evaluation)
trace.Complete("Processing complete", nil)

if len(obs.failures) == 0 {
	fmt.Println("All required tools were called")
}
Output:

All required tools were called

func ResultValidator

func ResultValidator[T any](validator func(result T) error) ObservableTraceCallback[T]

ResultValidator returns an ObservableTraceCallback that validates the result using a custom validator. The validator is only called if the result is non-nil. T should typically be a pointer type like *MyStruct.

func ToolCallNamed

func ToolCallNamed[T any](name string, validator func(o Observer, tc *agenttrace.ToolCall[T]) error) ObservableTraceCallback[T]

ToolCallNamed returns an ObservableTraceCallback that validates tool calls with a specific name using a custom validator.

func ToolCallValidator

func ToolCallValidator[T any](validator func(o Observer, tc *agenttrace.ToolCall[T]) error) ObservableTraceCallback[T]

ToolCallValidator creates an ObservableTraceCallback that validates individual tool calls using a custom validator function.

Example

ExampleToolCallValidator shows custom validation of tool calls

// Create a mock observer
obs := &mockObserver{}

// Validate that all tool calls have a reasoning parameter
validator := func(o evals.Observer, tc *agenttrace.ToolCall[string]) error {
	if _, ok := tc.Params["reasoning"]; !ok {
		return errors.New("missing reasoning parameter")
	}
	return nil
}

evalCallback := evals.ToolCallValidator[string](validator)

// Create tracer with the evaluation
tracer := agenttrace.ByCode[string](evals.Inject(obs, evalCallback))

// Create trace with proper reasoning parameters
ctx := context.Background()
trace := tracer.NewTrace(ctx, "Analyze logs")

// Add tool calls with reasoning parameters
tc1 := trace.StartToolCall("tc1", "read_logs", map[string]any{
	"reasoning": "need to analyze logs",
})
tc1.Complete("log data", nil)

tc2 := trace.StartToolCall("tc2", "analyze", map[string]any{
	"reasoning": "extract error patterns",
})
tc2.Complete("analysis done", nil)

// Complete trace (triggers evaluation)
trace.Complete("Analysis complete", nil)

if len(obs.failures) == 0 {
	fmt.Println("All tool calls have reasoning")
}
Output:

All tool calls have reasoning

type Observer

type Observer interface {
	// Fail marks the evaluation as failed with the given message
	// Should be called at most once per Trace evaluation
	Fail(string)
	// Log logs a message
	// Can be called multiple times per Trace evaluation
	Log(string)
	// Grade assigns a rating (0.0-1.0) with reasoning to the trace result
	// Should be called at most once per Trace evaluation
	Grade(score float64, reasoning string)
	// Increment is called each time a trace is evaluated
	Increment()
	// Total returns the number of observed instances
	Total() int64
}

Observer defines an interface for observing and controlling evaluation execution

type ResultCollector

type ResultCollector struct {
	// contains filtered or unexported fields
}

ResultCollector wraps an Observer to collect failure messages and grades

Example
package main

import (
	"context"
	"fmt"
	"sync"

	"chainguard.dev/driftlessaf/agents/agenttrace"
	"chainguard.dev/driftlessaf/agents/evals"
)

// exampleObserver implements Observer for examples with thread-safety
type exampleObserver struct {
	failures []string
	logs     []string
	count    int64
	mu       sync.Mutex
}

func (m *exampleObserver) Fail(msg string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.failures = append(m.failures, msg)
}

func (m *exampleObserver) Log(msg string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.logs = append(m.logs, msg)
}

func (m *exampleObserver) Grade(score float64, reasoning string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.logs = append(m.logs, fmt.Sprintf("Grade: %.2f - %s", score, reasoning))
}

func (m *exampleObserver) Increment() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.count++
}

func (m *exampleObserver) Total() int64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.count
}

func main() {
	// Create a mock observer to demonstrate the pattern
	baseObs := &exampleObserver{}

	// Wrap it with a result collector
	collector := evals.NewResultCollector(baseObs)

	// Define an evaluation callback that validates tool calls
	callback := func(o evals.Observer, trace *agenttrace.Trace[string]) {
		o.Log("Analyzing trace")

		if len(trace.ToolCalls) != 1 {
			o.Fail("Expected exactly 1 tool call")
		}

		if trace.Error != nil {
			o.Fail("Unexpected error: " + trace.Error.Error())
		}

		// Give the trace a grade
		o.Grade(0.85, "Good tool usage")
	}

	// Create tracer with the collector
	tracer := agenttrace.ByCode[string](evals.Inject(collector, callback))

	// Create a trace that will trigger the evaluation
	ctx := context.Background()
	trace := tracer.NewTrace(ctx, "Process data")

	// Add a tool call
	tc := trace.StartToolCall("tc1", "data-processor", map[string]any{
		"input": "some data",
	})
	tc.Complete("processed", nil)

	// Complete the trace (this triggers the evaluation)
	trace.Complete("Processing complete", nil)

	// Check collected results
	failures := collector.Failures()
	grades := collector.Grades()

	fmt.Printf("Failures: %d\n", len(failures))
	fmt.Printf("Grades: %d (score: %.2f)\n", len(grades), grades[0].Score)
}
Output:

Failures: 0
Grades: 1 (score: 0.85)
Example (WithNamespacedObserver)
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"

	"chainguard.dev/driftlessaf/agents/agenttrace"
	"chainguard.dev/driftlessaf/agents/evals"
)

// exampleObserver implements Observer for examples with thread-safety
type exampleObserver struct {
	failures []string
	logs     []string
	count    int64
	mu       sync.Mutex
}

func (m *exampleObserver) Fail(msg string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.failures = append(m.failures, msg)
}

func (m *exampleObserver) Log(msg string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.logs = append(m.logs, msg)
}

func (m *exampleObserver) Grade(score float64, reasoning string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.logs = append(m.logs, fmt.Sprintf("Grade: %.2f - %s", score, reasoning))
}

func (m *exampleObserver) Increment() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.count++
}

func (m *exampleObserver) Total() int64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.count
}

func main() {
	// Create a namespaced observer using mock observers
	namespacedObs := evals.NewNamespacedObserver(func(name string) evals.Observer {
		return &exampleObserver{}
	})

	// Create result collectors for different namespaces
	toolCollector := evals.NewResultCollector(namespacedObs.Child("tools"))
	errorCollector := evals.NewResultCollector(namespacedObs.Child("errors"))

	// Define evaluations for tool calls
	toolEval := func(o evals.Observer, trace *agenttrace.Trace[string]) {
		for _, tc := range trace.ToolCalls {
			if tc.Error != nil {
				o.Fail(fmt.Sprintf("Tool %s failed: %v", tc.Name, tc.Error))
			}
		}
	}

	// Define evaluations for trace errors
	errorEval := func(o evals.Observer, trace *agenttrace.Trace[string]) {
		if trace.Error != nil {
			o.Fail("Trace error: " + trace.Error.Error())
		}
	}

	// Create tracer with multiple collectors
	tracer := agenttrace.ByCode[string](
		evals.Inject(toolCollector, toolEval),
		evals.Inject(errorCollector, errorEval),
	)

	// Create a trace with a failing tool call
	ctx := context.Background()
	trace := tracer.NewTrace(ctx, "Complex analysis")

	tc := trace.StartToolCall("tc1", "analyzer", nil)
	tc.Complete(nil, errors.New("analysis failed"))

	// Complete the trace (this triggers both evaluations)
	trace.Complete("Analysis complete", nil)

	// Check failures by category
	toolFailures := toolCollector.Failures()
	errorFailures := errorCollector.Failures()

	fmt.Printf("Tool failures: %d\n", len(toolFailures))
	fmt.Printf("Error failures: %d\n", len(errorFailures))
}
Output:

Tool failures: 1
Error failures: 0

func NewResultCollector

func NewResultCollector(inner Observer) *ResultCollector

NewResultCollector creates a new ResultCollector that wraps the given Observer

func (*ResultCollector) Fail

func (r *ResultCollector) Fail(msg string)

Fail logs the failure message and stores it in the failures list

func (*ResultCollector) Failures

func (r *ResultCollector) Failures() []string

Failures returns a copy of all collected failure messages

func (*ResultCollector) Grade

func (r *ResultCollector) Grade(score float64, reasoning string)

Grade passes through to the inner observer and stores the grade

func (*ResultCollector) Grades

func (r *ResultCollector) Grades() []Grade

Grades returns a copy of all collected grades

func (*ResultCollector) Increment

func (r *ResultCollector) Increment()

Increment passes through to the inner observer

func (*ResultCollector) Log

func (r *ResultCollector) Log(msg string)

Log passes through to the inner observer

func (*ResultCollector) Total

func (r *ResultCollector) Total() int64

Total passes through to the inner observer

Directories

Path	Synopsis
report	Package report provides report generation functionality for evaluation results.
testevals	Package testevals provides a testing.T adapter for the evals framework.
