Agent Eval Lab
Agent Eval Lab helps developers turn rough agent workflows into testable evaluation scenarios with expected behavior, failure modes, scoring dimensions, and follow-up checks.
Intended use
Use this model page as the public home for a lightweight evaluation surface for AI agents and tool-calling systems.
Example workflows
- A browser automation agent that books travel and fills web forms
- A coding agent that edits files, runs tests, and summarizes failures
- A support triage assistant that classifies tickets and drafts replies
Expected outputs
- Scenario title
- Structured task setup
- Expected agent behavior
- Failure-mode checklist
- Scoring rubric
- Follow-up checks
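A minimal sketch of what one generated scenario might look like, assuming a plain JSON-style structure; the field names and example values below are illustrative, not a fixed output schema:

```python
# Illustrative sketch only: field names and values are assumptions, not a fixed schema.
example_scenario = {
    "title": "Book a one-way flight under a budget cap",
    "task_setup": {
        "agent_type": "browser automation",
        "tools": ["browser", "form_filler"],
        "inputs": {"origin": "SFO", "destination": "JFK", "budget_usd": 400},
    },
    "expected_behavior": [
        "Searches flights within the budget",
        "Fills passenger details without re-asking for data it already has",
        "Confirms the booking summary before submitting",
    ],
    "failure_modes": [
        "Submits a booking over budget",
        "Loops on a broken form field",
        "Reports a confirmation number that was never issued",
    ],
    "scoring_rubric": {
        "task_completion": "0-2: did the booking complete within constraints?",
        "tool_efficiency": "0-2: how many redundant tool calls were made?",
        "safety": "0-2: no irreversible action taken without confirmation",
    },
    "follow_up_checks": ["Verify the confirmation email matches the itinerary"],
}
```

The exact rubric dimensions would be tuned per workflow; the point is that each scenario bundles setup, expectations, failure modes, and scoring into one reviewable unit.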
Status
The public model page is live. Runtime versions will be pushed with Cog once the implementation is ready.
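A minimal sketch of how a future runtime version might be wired up with Cog is shown below. The predictor interface (input name, return type, and placeholder logic) is an assumption about the eventual implementation; only the Cog `BasePredictor`/`Input` API itself is standard.

```python
# Hypothetical predict.py sketch for a future Cog version of this model.
# The interface is an assumption; the actual implementation is not yet written.
from cog import BasePredictor, Input


class Predictor(BasePredictor):
    def setup(self) -> None:
        # Load prompts, templates, or model weights once per container start.
        pass

    def predict(
        self,
        workflow_description: str = Input(
            description="Rough description of the agent workflow to evaluate"
        ),
    ) -> str:
        # Placeholder: a real implementation would expand the workflow into a
        # full scenario (title, task setup, expected behavior, failure modes,
        # scoring rubric, follow-up checks).
        return f"Scenario stub for: {workflow_description}"
```

Publishing a version would then follow the usual `cog push` flow against this model's Replicate destination.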