Agent Eval Lab
Agent Eval Lab helps developers turn rough agent workflows into testable evaluation scenarios with expected behavior, failure modes, scoring dimensions, and follow-up checks.
Intended use
Use this model page as the public home for a lightweight evaluation surface for AI agents and tool-calling systems.
Example workflows
- A browser automation agent that books travel and fills web forms
- A coding agent that edits files, runs tests, and summarizes failures
- A support triage assistant that classifies tickets and drafts replies
Expected outputs
- Scenario title
- Structured task setup
- Expected agent behavior
- Failure-mode checklist
- Scoring rubric
- Follow-up checks
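A minimal sketch of what one generated scenario might look like, assuming a plain JSON-style structure; the field names and example values below are illustrative, not a fixed output schema:

```python
# Illustrative sketch only: field names and values are assumptions, not a fixed schema.
example_scenario = {
    "title": "Book a one-way flight under a budget cap",
    "task_setup": {
        "agent_type": "browser automation",
        "tools": ["browser", "form_filler"],
        "inputs": {"origin": "SFO", "destination": "JFK", "budget_usd": 400},
    },
    "expected_behavior": [
        "Searches flights within the budget",
        "Fills passenger details without re-asking for data it already has",
        "Confirms the booking summary before submitting",
    ],
    "failure_modes": [
        "Submits a booking over budget",
        "Loops on a broken form field",
        "Reports a confirmation number that was never issued",
    ],
    "scoring_rubric": {
        "task_completion": "0-2: did the booking complete within constraints?",
        "tool_efficiency": "0-2: how many redundant tool calls were made?",
        "safety": "0-2: no irreversible action taken without confirmation",
    },
    "follow_up_checks": ["Verify the confirmation email matches the itinerary"],
}
```

The exact rubric dimensions would be tuned per workflow; the point is that each scenario bundles setup, expectations, failure modes, and scoring into one reviewable unit.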
Status
The public model page is live. Runtime versions will be pushed with Cog once the implementation is ready.
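A minimal sketch of how a future runtime version might be wired up with Cog is shown below. The predictor interface (input name, return type, and placeholder logic) is an assumption about the eventual implementation; only the Cog `BasePredictor`/`Input` API itself is standard.

```python
# Hypothetical predict.py sketch for a future Cog version of this model.
# The interface is an assumption; the actual implementation is not yet written.
from cog import BasePredictor, Input


class Predictor(BasePredictor):
    def setup(self) -> None:
        # Load prompts, templates, or model weights once per container start.
        pass

    def predict(
        self,
        workflow_description: str = Input(
            description="Rough description of the agent workflow to evaluate"
        ),
    ) -> str:
        # Placeholder: a real implementation would expand the workflow into a
        # full scenario (title, task setup, expected behavior, failure modes,
        # scoring rubric, follow-up checks).
        return f"Scenario stub for: {workflow_description}"
```

Publishing a version would then follow the usual `cog push` flow against this model's Replicate destination.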