Quality regresses silently
A model upgrade. A prompt change. A new doc in the knowledge base. Any of these can quietly degrade performance on questions your team thinks are working. Without measurement, the first signal is a customer complaint.
Seven evaluators score every agent version against your test suite before it reaches production. Promotion gates block anything that drops below your threshold.
Production failures become future tests. The system learns when something quietly breaks.
7 built-in evaluators · Custom evaluators via SDK · Eval-gated promotion · Red team test packs
AI systems regress in ways your eyes won't catch. The model provider quietly updated. The prompt got tweaked. A new doc landed in the knowledge base that confused retrieval. Yesterday's "looks fine" is today's "why is the agent giving wrong answers to a question it answered correctly last week?"
Reviewing 50 outputs by hand to validate a prompt change worked is one thing. Doing it for 50 prompts across 12 agents every release is another. Manual review is the bottleneck that turns weekly releases into quarterly ones.
What looks great in a Friday demo doesn't survive the weekend traffic. Real users ask weird questions, push edge cases, hit the long tail. Without an eval suite that captures it, you ship optimistic.
The RAG four cover the places retrieval-augmented agents go wrong: hallucinating beyond context, drifting off question, retrieving noise, missing sources. Trajectory and safety cover how multi-step agents misbehave. LLM-as-Judge is the escape hatch for anything domain-specific. A scoring sketch for two of these follows the cards below.
Faithfulness
Does the answer stick to the retrieved context, or is it making things up? Claims in the response are decomposed and cross-checked against retrieved chunks. Anything unsupported drops the score.
Metric · 0.0 - 1.0
Answer relevancy
Does the answer address the question that was asked? The judge generates questions the answer would answer, compares them to the original, and scores the semantic overlap.
Metric · 0.0 - 1.0
Context precision
How much of the retrieved context was actually useful? High precision means the retriever didn't drag in noise. A signal that retrieval is clean.
Metric · 0.0 - 1.0
Context recall
Did the retriever find the information needed to answer correctly? High recall means the right sources made it into context. A signal that the knowledge base isn't missing coverage.
Metric · 0.0 - 1.0
Trajectory eval
For multi-step agents: did it call the right tools in the right order with the right arguments? Compares the actual trajectory against a reference or rule set. Catches loops, skipped steps, wrong sequencing.
Metric · pass / fail
Safety check
Did the response pass content safety, harm detection, and policy compliance? Runs the same checks as the guardrails engine. Catches regressions before they reach production users.
Metric · pass / fail
LLM-as-Judge
Configurable criteria evaluated by a judge model. Write your own rubric: tone, structure, domain-specific correctness, policy adherence. The escape hatch when the six fixed evaluators miss something specific to your business.
Metric · 1 - 5
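Under the hood these are scoring functions, not black magic. Here is a minimal sketch of two of them - faithfulness and the trajectory check - with naive stand-ins where a real evaluator would call a judge model. Illustrative only, not the platform's implementation.

```python
from dataclasses import dataclass

def decompose_claims(answer: str) -> list[str]:
    # Naive stand-in: treat each sentence as one claim. A real evaluator
    # would prompt a judge model to extract atomic claims.
    return [s.strip() for s in answer.split(".") if s.strip()]

def is_supported(claim: str, chunks: list[str]) -> bool:
    # Naive stand-in: lexical containment. A real evaluator would ask a
    # judge model whether any retrieved chunk entails the claim.
    return any(claim.lower() in chunk.lower() for chunk in chunks)

def faithfulness(answer: str, retrieved_chunks: list[str]) -> float:
    """Fraction of the answer's claims supported by retrieved context (0.0-1.0)."""
    claims = decompose_claims(answer)
    if not claims:
        return 1.0  # nothing asserted, nothing unsupported
    supported = sum(is_supported(c, retrieved_chunks) for c in claims)
    return supported / len(claims)

@dataclass
class ToolCall:
    name: str
    args: dict

def trajectory_pass(actual: list[ToolCall], reference: list[ToolCall]) -> bool:
    """Pass/fail: right tools, right order, right arguments."""
    if len(actual) != len(reference):
        return False  # loops or skipped steps change the length
    return all(a.name == r.name and a.args == r.args
               for a, r in zip(actual, reference))
```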
Register your own evaluator against the same API. A compliance scorer, a brand-voice detector, a domain-specific grader. Called on every run alongside the seven built-ins.
Seven is the floor, not the ceiling.
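What registration looks like depends on the SDK, but the shape is simple: a scoring function registered under a name, returning the same kind of result the built-ins do. Everything below - the decorator, the registry, the `EvalResult` fields - is an illustrative assumption, not the documented API.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    score: float   # normalized 0.0-1.0, like the built-in metrics
    passed: bool
    reason: str

# Hypothetical registry the runner would consult on every run,
# alongside the seven built-in evaluators.
CUSTOM_EVALUATORS: dict[str, callable] = {}

def evaluator(name: str):
    def register(fn):
        CUSTOM_EVALUATORS[name] = fn
        return fn
    return register

@evaluator("brand-voice")
def brand_voice(question: str, answer: str, context: list[str]) -> EvalResult:
    # Toy rubric: penalize phrasing the brand guide forbids.
    banned = ["guarantee", "100% safe"]
    hits = [w for w in banned if w in answer.lower()]
    return EvalResult(
        score=max(1.0 - 0.5 * len(hits), 0.0),
        passed=not hits,
        reason=f"banned phrases: {hits}" if hits else "clean",
    )
```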
Live runs at /studio/evals. Versioned datasets you can re-run. The 12 evaluators that score every run. Toggle between views to see the chrome your team will actually use.
Evaluations
Run the same eval suites your CI does. 12 evaluators · 5 suites · score against any dataset.
Runs today · 47 · 8 currently running
Pass rate · 94.2% · 44 of 47 passed
Avg score Δ · +0.03 · vs last successful baseline
Avg run time · 2m 41s · across 12 evaluators
| Run ID | Agent | Version | Started | Duration | Score | Δ vs baseline | Samples | Status |
|---|---|---|---|---|---|---|---|---|
| run_2847 | HR Policy Assistant | v0.7-candidate | 2m ago | 2m 14s | 0.94 | +0.04 | 127 | PASSED |
| run_2846 | Refund Triage | v1.3-candidate | 8m ago | 3m 41s | 0.89 | +0.02 | 84 | PASSED |
| run_2845 | Customer Support | v2.1-candidate | 12m ago | 4m 56s | 0.71 | -0.08 | 156 | FAILED |
| run_2844 | Sales Pipeline | v0.4-candidate | 21m ago | 1m 47s | - | - | 92 | RUNNING |
| run_2843 | Legal Doc Review | v0.2-candidate | 33m ago | 5m 12s | 0.92 | +0.07 | 64 | PASSED |
| run_2842 | IT Helpdesk | v1.8-candidate | 1h ago | 2m 28s | - | - | 108 | QUEUED |
Showing 6 of 2,847 runs · click any row for the full sample-by-sample breakdown
Explore /studio/evals in your sandbox today. Read on for what an end-to-end run looks like and how the eval gate stops bad versions from reaching production.

When you change a prompt, swap a model, or add a knowledge source, you trigger a run. The platform executes your test suite against the new version, scores it across the seven evaluators, and compares it against the last successful baseline. Regressions are flagged, sample failures are surfaced, and the gate either passes or doesn't.
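In CI this reduces to one gating script. A minimal sketch against a hypothetical REST API - the endpoint paths, response fields, and environment variables are assumptions for illustration, not documented ones:

```python
import os, sys, time
import requests

EVAL_API = os.environ["EVAL_API"]    # hypothetical, e.g. https://platform.example/api
HEADERS = {"Authorization": f"Bearer {os.environ['EVAL_TOKEN']}"}

# Kick off a run for the candidate version against its test suite.
run = requests.post(f"{EVAL_API}/evals/runs", headers=HEADERS, json={
    "agent": "hr-policy-assistant",
    "version": os.environ["CANDIDATE_VERSION"],
    "suite": "hr-core",
}).json()

# Poll until the run finishes, then gate the pipeline on the result.
while True:
    status = requests.get(f"{EVAL_API}/evals/runs/{run['id']}", headers=HEADERS).json()
    if status["state"] in ("passed", "failed"):
        break
    time.sleep(10)

print(f"score {status['score']} (Δ {status['delta_vs_baseline']:+.2f} vs baseline)")
sys.exit(0 if status["state"] == "passed" else 1)  # non-zero exit blocks the release
```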
Test cases · 127 · all completed
Pass rate · 98% · 125 of 127
vs baseline · +0.04 · improved
Run time · 2m 14s · across 7 evaluators
Cost · $0.84 · judge model calls
SCORE BY EVALUATOR · CANDIDATE vs BASELINE v0.6 · thresholds shown

2 FAILURES · INSPECT EACH BEFORE PROMOTION · View all 127 cases →

"What's the maximum carry-over for unused vacation days?"
Cited HR-2023-04 (superseded). Should have used HR-2024-08 amendment.
"Can a contractor request remote work approval?"
Missed the contractor-specific clause in HR-2024-12. Retrieval gap.
Connect evaluators to environments. Set thresholds per evaluator. An agent can't promote from dev to test, or from test to prod, if its scores drop below the bar. The platform doesn't trust your judgment alone, and that's by design - it doesn't trust its own either. A sketch of the gate logic follows the threshold card below.
Gate to PROD · thresholds · HR Assistant
Faithfulness · minimum ≥ 0.90
Relevancy · minimum ≥ 0.85
Context recall · minimum ≥ 0.85
Trajectory eval · pass rate ≥ 95%
Safety check · 100% must pass
LLM Judge · minimum ≥ 4.0
Δ vs baseline · no regression
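The gate itself is easy to reason about: every check must hold or the candidate stays put. A minimal sketch using the thresholds from the card above; the score-dict shape is an illustrative assumption, not the platform's data model.

```python
# Threshold values come from the gate card above.
GATE_TO_PROD = {
    "faithfulness":   lambda s: s["faithfulness"] >= 0.90,
    "relevancy":      lambda s: s["relevancy"] >= 0.85,
    "context_recall": lambda s: s["context_recall"] >= 0.85,
    "trajectory":     lambda s: s["trajectory_pass_rate"] >= 0.95,
    "safety":         lambda s: s["safety_pass_rate"] == 1.0,  # 100% must pass
    "llm_judge":      lambda s: s["llm_judge"] >= 4.0,
    "no_regression":  lambda s: s["delta_vs_baseline"] >= 0.0,
}

def can_promote(scores: dict) -> tuple[bool, list[str]]:
    """Every check must hold, or the candidate stays in its current environment."""
    failed = [name for name, check in GATE_TO_PROD.items() if not check(scores)]
    return (not failed, failed)

ok, failures = can_promote({
    "faithfulness": 0.94, "relevancy": 0.91, "context_recall": 0.88,
    "trajectory_pass_rate": 0.97, "safety_pass_rate": 1.0,
    "llm_judge": 4.3, "delta_vs_baseline": 0.04,
})
print("PROMOTE" if ok else f"BLOCKED on: {', '.join(failures)}")
```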
The promotion lifecycle
Six maintained attack packs covering the categories that matter for production AI. Run them against any agent. Get a per-category pass rate. Failures are linked back to the guardrail or prompt change that needs attention. Updated quarterly with new attacks discovered in the wild.
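Scoring a pack run is straightforward aggregation. A minimal sketch, assuming each attack result carries a category and a blocked flag - the result shape is an assumption, not the platform's schema:

```python
from collections import defaultdict

def category_pass_rates(results: list[dict]) -> dict[str, float]:
    """results: [{"category": str, "blocked": bool}, ...] -> per-category pass rate."""
    totals = defaultdict(int)
    blocked = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        blocked[r["category"]] += r["blocked"]  # True counts as 1
    return {cat: blocked[cat] / totals[cat] for cat in totals}

# e.g. category_pass_rates([{"category": "injection", "blocked": True},
#                           {"category": "injection", "blocked": False}])
# -> {"injection": 0.5}
```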
Red team report · HR Assistant v0.7
328 attacks · 6 categories · run yesterday

| Pass rate | Cases | Attack types |
|---|---|---|
| 98% | 84 cases · 82 blocked | Indirect injection via knowledge content, system prompt override attempts, role confusion attacks. |
| 96% | 62 cases · 60 blocked | DAN-style prompts, hypothetical reframes, encoding tricks, multilingual evasion attempts. |
| 100% | 48 cases · 48 blocked | Coercion to reveal training data, social engineering for user details, name extraction probes. |
| 94% | 37 cases · 35 blocked | Coerce the agent to call destructive tools, exfiltrate via API calls, escalate privileges. |
| 92% | 55 cases · 51 blocked | Questions designed to elicit confident wrong answers, fictional product features, made-up policies. |
| 100% | 42 cases · 42 blocked | Discrimination probes, harmful content elicitation, protected-class evaluation, harassment scripts. |
When a user gives a thumbs-down in production, that conversation goes into a triage queue. Confirmed failures get added to your eval suite. The next agent version is tested against them automatically. The system gets better at noticing the things it used to miss. A sketch of that conversion follows the four stages below.
Four stages · Each one closes the loop
Production conversations are logged
Every question, every answer, every tool call. With permissioned access for the people who need to review them.
Failures bubble up via thumbs-down
User feedback creates a review queue. Confirmed failures get reviewed by your team or by a triage agent. Real evidence of what's broken.
Confirmed failures join the eval suite
A bad answer becomes a test case with the correct expected response. The next version of the agent is tested against it automatically.
Better prompts, better retrieval, better fine-tunes
Every quarter, a fine-tune run uses the accumulated failure data to improve the model itself. The flywheel compounds.
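Stage three is the mechanical heart of the loop: a confirmed failure becomes a test case. A minimal sketch of the conversion, with illustrative field names rather than a documented schema:

```python
def to_test_case(conversation: dict, corrected_answer: str) -> dict:
    # Field names are illustrative assumptions, not the platform's schema.
    return {
        "input": conversation["question"],
        "ground_truth": corrected_answer,        # supplied during triage review
        "source": f"prod:{conversation['id']}",  # traceable back to the real failure
        "criteria": {"faithfulness": 0.90, "relevancy": 0.85},
    }
```

Appended to the versioned suite, the case runs against every future candidate automatically.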
How do I know if my prompt change broke something?
Every change triggers an eval run. Seven evaluators score the candidate against your test suite, the platform compares it to the last successful baseline, and any score that drops below threshold is flagged. The run is gated - bad versions cannot promote.
Can we ship faster than weekly?
Yes. Eval automation replaces the manual review bottleneck. A prompt change, a model swap, or a new knowledge source kicks off a run that completes in under five minutes. Engineering teams move from quarterly releases to daily.
Where does my test suite live?
In the platform, version-controlled alongside the agent. Test cases include input, ground truth, and pass criteria per evaluator. Add cases manually, import from CSV, or auto-generate from production logs that got thumbs-down.
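For CSV import, a minimal sketch - the column names and the criteria encoding are assumptions for illustration, not the platform's required format:

```python
import csv

def load_cases(path: str) -> list[dict]:
    # Assumed columns: input, ground_truth, criteria
    # where criteria is e.g. "faithfulness>=0.9;relevancy>=0.85".
    with open(path, newline="") as f:
        return [{
            "input": row["input"],
            "ground_truth": row["ground_truth"],
            "criteria": {name: float(val)
                         for name, val in (c.split(">=")
                                           for c in row["criteria"].split(";"))},
        } for row in csv.DictReader(f)]
```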
Can we evidence the safety controls are working?
Yes. The Safety check evaluator runs on every test case. Eval gates require Safety = 100% pass to promote to production. Red team packs run weekly. Failures and remediations are logged with timestamps - audit-ready evidence.
The bottleneck that ate your release cadence.
Smart engineers review 50 outputs per change. They catch what they catch. They miss what they miss. They burn hours on every release. The release slows. The platform calcifies.
Months of work on a non-product capability.
Pick a metric library, build the runner, wire up the dashboard, integrate with deployment. Half a year of platform engineering before a single agent gets evaluated systematically.
Seven evaluators. Gates. Flywheel. Day one.
7 built-in evaluators across RAG, agent, safety, and custom rubrics. Run dashboard with comparison vs baseline. Eval-gated promotion to test and prod. Six maintained red team packs. Production feedback closes the loop automatically.
Production AI without evaluation is production AI on a dare. The model gets quietly worse, the prompt subtly drifts, the retrieval slowly degrades, and the first warning is a customer escalation. Eval suites are not a feature. They are the prerequisite to trusting any AI system in production. If you cannot defend the score, you are guessing.
Sandbox access in 24 hours. Comes pre-loaded with a sample agent, a 50-case test suite, and the seven evaluators wired up. Edit the prompt, hit Run, watch the scorecard update.
Bring your own test suite when you're ready. Import from CSV or generate from production logs.
