Acme Learning/

Dashboard

Eval Studio

Track Fdemo

Write evals, golden sets, and regression tests for AI agents. Run them, see what breaks.

running
Eval suites
5
active
310 goldens total
Pass rate
90.3%
+1.4
Across all suites
Avg runtime
1m 41s
/run
Cached + parallel
Regressions caught
2
today
Caught before merge
Eval suites
Each suite is a versioned set of goldens. Pass requires all hard rubrics to clear and 0 regressions vs the last green run.
Agentssupport-agent / golden / v3
119
4
1
38s4m ago
RAGrag-recall / chunk-strategy
65
12
3
2m 12s11m ago
Safetyprompt-injection / red-team
26
5
1
1m 04s1h ago
Tool Usetool-call / arg-validation
56
0
0
22s2h ago
Voicevoice-agent / latency budget
14
2
2
4m 30s3h ago
Regressions caught today
Cases that were passing on the last green run and are now failing. The grader explains *why*.
rag-recall·long-table-mid-page regression
chunk size raised from 800→1200 broke table boundaries
support-agent·angry-billing-tier-2 regression
escalation threshold drift in v3 prompt