Evaluations & Quality

Delphina auto-generates test cases from your knowledge base and runs them against your data; an LLM judge then scores each response. A failed test points directly to a gap in your documentation: fix the gap with /knowledge, and the next run should pass. The critic agent also reviews every chat response in real time, flagging missing knowledge and assumptions as inline annotations.

How evaluations work

An evaluation consists of test cases — each with a prompt (e.g., “What was MRR last month?”), expected SQL, and acceptance criteria that the judge scores against.

1. The agent answers each question.
2. An LLM judge scores each response against the criteria.
3. Results are collected into an experiment with an overall score and per-question details.

Evaluations run automatically once a week. To trigger one manually, go to Context > Evaluations > Run Evaluation.
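
The loop described in this section can be sketched in a few lines of Python. This is an illustrative model only, not Delphina's implementation: the names EvalCase, Result, Experiment, and run_evaluation are hypothetical, the agent and judge are simple stand-ins, and the customers query is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    prompt: str        # question posed to the agent
    expected_sql: str  # reference query
    criteria: str      # acceptance criteria the judge scores against


@dataclass
class Result:
    prompt: str
    passed: bool
    explanation: str   # the judge's reasoning, shown per question


@dataclass
class Experiment:
    results: List[Result]

    @property
    def pass_rate(self) -> float:
        if not self.results:
            return 0.0
        return sum(r.passed for r in self.results) / len(self.results)


def run_evaluation(
    cases: List[EvalCase],
    agent: Callable[[str], str],
    judge: Callable[[EvalCase, str], Tuple[bool, str]],
) -> Experiment:
    results = []
    for case in cases:
        answer = agent(case.prompt)              # step 1: agent answers
        passed, why = judge(case, answer)        # step 2: judge scores
        results.append(Result(case.prompt, passed, why))
    return Experiment(results)                   # step 3: collect results


def stub_agent(prompt: str) -> str:
    # Stand-in for the real agent: returns the SQL it would run.
    if "MRR" in prompt:
        return "SELECT SUM(amount) FROM payments"
    return "SELECT COUNT(*) FROM customers"


def stub_judge(case: EvalCase, answer: str) -> Tuple[bool, str]:
    # Stand-in for the LLM judge: a crude whitespace- and case-insensitive
    # comparison with the expected SQL. A real judge weighs the criteria.
    def normalize(sql: str) -> str:
        return " ".join(sql.lower().split())

    if normalize(answer) == normalize(case.expected_sql):
        return True, "Matches the expected query."
    return False, "Query differs from the expected SQL."


cases = [
    EvalCase(
        prompt="What was MRR last month?",
        expected_sql="SELECT SUM(monthly_price) FROM subscriptions WHERE status = 'active'",
        criteria="Uses subscriptions table with status = 'active'.",
    ),
    EvalCase(
        prompt="How many customers do we have?",
        expected_sql="SELECT COUNT(*) FROM customers",
        criteria="Counts rows in the customers table.",
    ),
]

experiment = run_evaluation(cases, stub_agent, stub_judge)
print(f"pass rate: {experiment.pass_rate:.0%}")  # pass rate: 50%
```

Keeping the agent and judge as plain callables makes the three steps easy to see; in the product, both are LLM-backed and the judge returns a richer explanation.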

Creating evaluations

You can create evaluations automatically, from a chat, or with more control over the process.

Automatic

Go to Context > Evaluations and click Create Evals. The agent analyzes your knowledge base and builds a plan of test cases. Once the plan completes, click Build to generate the cases. If you have existing evaluations or specific questions you want tested, upload them at Context > Sources > File Uploads before running Create Evals. The agent incorporates these alongside what it discovers from your knowledge base.

From a chat

Create or update individual cases using /knowledge:

/knowledge Create an eval case that tests whether the agent correctly
calculates MRR using the subscriptions table with status = 'active'.

/knowledge Update the churn rate eval case — the definition changed
to include customers inactive for 60 days instead of 90.

You can also share a file with test ideas directly in the chat.

Custom

For more control and custom context, use /evals-update plan to have the agent build a prioritized plan based on your inputs. Once the plan completes, click Build to generate the cases.

/evals-update plan Create evals from the questions and SQL queries in [[raw/eval-ideas/]]

/evals-update plan Focus on the marketing domain — we just onboarded
campaign spend and attribution tables. Create ~10 new eval cases from these metrics.

Reviewing results

Each experiment shows a pass rate and per-question results with the judge’s explanation. Example failure:

Test case: “What was MRR last month?”
Expected: Uses subscriptions table with status = 'active' and sums monthly_price.
Actual: Agent used payments table. Result was $2.1M instead of $1.8M.
Judge: Wrong source table — no documented MRR metric, so the agent inferred from payments.
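
To see why the wrong source table changes the answer, here is a small sketch using Python's built-in sqlite3 module with toy data. The table names and the status and monthly_price columns come from the example above; the customer and amount columns and all row values are assumptions made up for illustration, and the "last month" date filter is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subscriptions (customer TEXT, status TEXT, monthly_price REAL);
    CREATE TABLE payments (customer TEXT, amount REAL);

    INSERT INTO subscriptions VALUES
        ('acme',    'active',    100.0),
        ('globex',  'active',    250.0),
        ('initech', 'cancelled',  80.0);

    -- Toy payments: include a one-time fee and a charge from a
    -- cancelled customer, neither of which is recurring revenue.
    INSERT INTO payments VALUES
        ('acme',    100.0),
        ('globex',  250.0),
        ('globex',  500.0),
        ('initech',  80.0);
""")

# The query the test case expects.
expected_mrr = conn.execute(
    "SELECT SUM(monthly_price) FROM subscriptions WHERE status = 'active'"
).fetchone()[0]

# The kind of query an agent might infer without a documented MRR metric.
inferred_mrr = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]

print(expected_mrr)  # 350.0
print(inferred_mrr)  # 930.0
```

Both queries run without error, which is why this kind of failure is only caught by comparing against documented criteria rather than by checking that the SQL executes.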

Inbox

When a test case fails, Delphina creates an issue automatically. View issues at Context Layer > Inbox. To resolve an issue, read the failure details, use /knowledge to fix the documentation, and re-run the evaluation.
