Delphina auto-generates test cases from your knowledge base, runs them against your data, and has an LLM judge score each response. When a test fails, it points directly to a gap in your documentation. Fix the gap with /knowledge, and the next run passes.
The critic agent also reviews every chat response in real time, flagging missing knowledge and assumptions as inline annotations.
How evaluations work
An evaluation consists of test cases, each with a prompt (e.g., “What was MRR last month?”), expected SQL, and acceptance criteria that the judge scores against.
- The agent answers each question.
- An LLM judge scores each response against the criteria.
- Results are collected into an experiment with an overall score and per-question details.
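The loop above can be sketched in a few lines of Python. This is only an illustration of the flow, not Delphina's API: `EvalCase`, `Result`, `Experiment`, `run_evaluation`, `stub_agent`, and `stub_judge` are all hypothetical names, and the stub judge is a plain substring check standing in for the LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    prompt: str        # question posed to the agent
    expected_sql: str  # reference query for the answer
    criteria: str      # acceptance criteria the judge scores against


@dataclass
class Result:
    prompt: str
    passed: bool
    explanation: str   # judge's reasoning, kept as a per-question detail


@dataclass
class Experiment:
    results: List[Result]

    @property
    def pass_rate(self) -> float:
        """Overall score: fraction of cases the judge passed."""
        if not self.results:
            return 0.0
        return sum(r.passed for r in self.results) / len(self.results)


def run_evaluation(
    cases: List[EvalCase],
    agent: Callable[[str], str],
    judge: Callable[[EvalCase, str], Tuple[bool, str]],
) -> Experiment:
    results = []
    for case in cases:
        answer = agent(case.prompt)              # the agent answers each question
        passed, why = judge(case, answer)        # the judge scores the response
        results.append(Result(case.prompt, passed, why))
    return Experiment(results)                   # results collected into an experiment


# Hypothetical stand-ins: a canned agent and a toy judge (not an LLM).
def stub_agent(prompt: str) -> str:
    canned = {
        "What was MRR last month?": "SELECT SUM(amount) FROM payments",
        "How many active subscriptions are there?":
            "SELECT COUNT(*) FROM subscriptions WHERE status = 'active'",
    }
    return canned[prompt]


def stub_judge(case: EvalCase, answer: str) -> Tuple[bool, str]:
    ok = case.criteria in answer
    return ok, "uses expected table" if ok else "wrong source table"


cases = [
    EvalCase(
        "What was MRR last month?",
        "SELECT SUM(monthly_price) FROM subscriptions WHERE status = 'active'",
        "subscriptions",
    ),
    EvalCase(
        "How many active subscriptions are there?",
        "SELECT COUNT(*) FROM subscriptions WHERE status = 'active'",
        "subscriptions",
    ),
]

experiment = run_evaluation(cases, stub_agent, stub_judge)
print(experiment.pass_rate)
for r in experiment.results:
    print(r.prompt, r.passed, r.explanation)
```

Keeping the judge's explanation on each result is what makes a failure actionable: the overall score says how many cases failed, and the per-question detail says why.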
Creating evaluations
You can create evaluations automatically, from a chat, or with more control over the process.
Automatic
Go to Context > Evaluations and click Create Evals. The agent analyzes your knowledge base and builds a plan of test cases. Once the plan completes, click Build to generate the cases. If you have existing evaluations or specific questions you want tested, upload them at Context > Sources > File Uploads before running Create Evals. The agent incorporates these alongside what it discovers from your knowledge base.
From a chat
Create or update individual cases using /knowledge.
Custom
For more control and custom context, use /evals-update plan to have the agent build a prioritized plan based on your inputs. Once the plan completes, click Build to generate the cases.
Reviewing results
Each experiment shows a pass rate and per-question results with the judge’s explanation. Example failure:
- Test case: “What was MRR last month?”
- Expected: Uses subscriptions table with status = 'active' and sums monthly_price.
- Actual: Agent used payments table. Result was $2.1M instead of $1.8M.
- Judge: Wrong source table; no documented MRR metric, so the agent inferred from payments.
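To see why the choice of source table changes the answer, here is a small sketch using Python's built-in sqlite3 module with toy data. The schema is an assumption made for illustration: only the subscriptions table, its status and monthly_price columns, and the payments table name come from the example above, while the `amount` column and every row value are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscriptions (id INTEGER, status TEXT, monthly_price REAL);
CREATE TABLE payments (id INTEGER, subscription_id INTEGER, amount REAL);

-- Toy rows (invented): two active subscriptions, one canceled.
INSERT INTO subscriptions VALUES
    (1, 'active', 100.0), (2, 'active', 50.0), (3, 'canceled', 80.0);

-- Payments include the canceled subscription and a one-off charge.
INSERT INTO payments VALUES
    (1, 1, 100.0), (2, 2, 50.0), (3, 3, 80.0), (4, 1, 20.0);
""")

# Expected approach: sum monthly_price over active subscriptions.
expected_mrr = conn.execute(
    "SELECT SUM(monthly_price) FROM subscriptions WHERE status = 'active'"
).fetchone()[0]

# What an agent might infer without a documented MRR metric: sum payments.
inferred_mrr = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]

print(expected_mrr)  # 150.0
print(inferred_mrr)  # 250.0
```

In this toy data the payments total is higher because it counts a canceled subscription and a one-off charge. A documented MRR definition removes the ambiguity that led the agent to the wrong table.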
Issues
When a test case fails, Delphina creates an issue automatically. View issues at Context > Issues. To resolve: read the failure details, use /knowledge to fix the documentation, and re-run the evaluation.