
Commit 208ca9b

seirastoplanetf1 and Nigel Jones authored
docs: test based eval documentation (#916)
* documentation update
* small wording
* Update docs/docs/how-to/unit-test-generative-code.md
  Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
* Update docs/docs/how-to/unit-test-generative-code.md
  Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
* formatting improvements
* updates based on PR suggestions
* example and comments fix
* link fix
* cli alternative
* Update docs/docs/how-to/unit-test-generative-code.md
  Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
* Update docs/docs/how-to/unit-test-generative-code.md
  Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
* markdownlint fix

---------

Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
1 parent 365f863 commit 208ca9b

1 file changed

docs/docs/how-to/unit-test-generative-code.md
Lines changed: 46 additions & 16 deletions
@@ -18,13 +18,16 @@ fast and reliable.
 
 ## Three levels of assertion
 
-Every test for a `@generative` function falls into one of three levels:
+Every test for a `@generative` function falls into one of four levels:
 
 | Level | What you assert | Deterministic? |
 | ----- | --------------- | -------------- |
 | **Type check** | `isinstance(result, bool)` | Yes — constrained decoding always returns the declared type |
 | **Structural check** | `result in ["positive", "negative"]` or field names present | Yes — schema enforcement is deterministic |
 | **Qualitative check** | `assert result is True` | No — depends on the model and prompt |
+| **Semantic evaluation** | Judge model scores output against reference responses | No — run separately, not a pytest assertion |
+
+*For levels 1–3, use pytest with the patterns below. For semantic evaluation against reference examples — where you want a judge model to score your model's outputs in bulk — see [The `unit_test_eval` component for Generative Unit Tests](#the-unit_test_eval-component-for-generative-unit-tests) at the end of this page.*
 
 Type and structural checks run in CI. Qualitative checks carry
 `@pytest.mark.qualitative` and are skipped in CI when `CICD=1` is set.
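
*Editorial aside (not part of the commit): a minimal sketch of how the first three levels in the table above typically look as pytest tests. The `classify_sentiment` function, its module path, its call signature, and the explicit `skipif` guard are illustrative assumptions only; the `session` fixture mirrors the one used elsewhere in the documented examples.*

```python
# Hypothetical example of the three pytest-level checks described above.
# `classify_sentiment` stands in for any @generative function under test.
import os

import pytest

from my_project.generators import classify_sentiment  # hypothetical module


def test_returns_declared_type(session):
    # Level 1: type check. Constrained decoding should return the declared type.
    result = classify_sentiment(session, "I love this product")
    assert isinstance(result, str)


def test_returns_allowed_label(session):
    # Level 2: structural check. The output must be one of the allowed values.
    result = classify_sentiment(session, "I love this product")
    assert result in ["positive", "negative"]


@pytest.mark.qualitative
@pytest.mark.skipif(os.environ.get("CICD") == "1", reason="model-dependent")
def test_positive_review_is_labelled_positive(session):
    # Level 3: qualitative check. Depends on the model and prompt, so skipped in CI.
    # The skipif guard is one possible way to honor CICD=1; a project may instead
    # configure this centrally for the qualitative marker.
    result = classify_sentiment(session, "I love this product")
    assert result == "positive"
```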
@@ -298,12 +301,21 @@ def test_with_simple_validate_requirement(session):
     assert isinstance(res.value, str)
 ```
 
-## The `unit_test_eval` component
+## The `unit_test_eval` component for Generative Unit Tests
 
 `mellea.stdlib.components.unit_test_eval` provides `TestBasedEval`, a
-`Component` that formats an LLM-as-a-judge evaluation task. You load test cases
+`Component` that formats an LLM-as-a-judge evaluation task for generative unit testing. You load test cases
 from a JSON file and pass them to a judge session. This is useful for offline
-evaluation pipelines, not for individual pytest assertions.
+evaluation pipelines,
+not for individual pytest assertions.
+
+Given a task, you provide test cases that consist of evaluation instructions
+and a set of examples, along with associated metadata. Each example, in conversational format, consists of an input
+and (optional) target / reference response(s), which can be used to guide evaluation along with the evaluation instructions.
+They can either be instantiated inline or in JSON format, with a separate JSON file per task.
+
+There are no limitations on the number of test examples per task, and each input can have multiple reference responses.
+The evaluation instructions apply to all the test cases in your task.
 
 ### JSON file format
 
@@ -312,15 +324,23 @@ Each entry in the JSON array defines one test:
 ```json
 [
   {
-    "source": "email-classifier",
-    "name": "positive_case_001",
-    "instructions": "Evaluate whether the prediction correctly identifies the category.",
+    "source": "professional-email-writing",
+    "name": "case_001",
+    "instructions": "The email should follow the instructions in the input.",
     "id": "tc-001",
     "examples": [
       {
         "input_id": "ex-001",
-        "input": [{"role": "user", "content": "Is this email spam?"}],
-        "targets": [{"role": "assistant", "content": "no"}]
+        "input": [{"role": "user", "content": "Write a brief professional follow-up email after a job interview"}],
+        "targets": [
+          {"role": "assistant", "content": "Thank you for taking the time to speak with me yesterday. I look forward to hearing about next steps at your convenience."},
+          {"role": "assistant", "content": "I appreciate the opportunity to interview yesterday. Looking forward to hearing about next steps."}
+        ]
+      },
+      {
+        "input_id": "ex-002",
+        "input": [{"role": "user", "content": "I just finished a client demo can you create a formal thank-you email"}],
+        "targets": [{"role": "assistant", "content": "It was a pleasure sharing a product demo with you. Thank you for meeting with us."}]
       }
     ]
   }
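
*Editorial aside (not part of the commit): before pointing an evaluation run at a file like this, it can help to sanity-check its shape with the standard library alone. A rough sketch, assuming the `tests/eval_data/email_writer.json` path that the usage example below loads:*

```python
# Stdlib-only sanity check for the JSON format shown above; illustrative,
# not a documented API. The path matches the usage example later in this diff.
import json

with open("tests/eval_data/email_writer.json") as f:
    tasks = json.load(f)

for task in tasks:
    # Every task carries shared evaluation instructions and at least one example.
    assert task["instructions"], f"{task['name']}: missing instructions"
    assert task["examples"], f"{task['name']}: no examples"
    for example in task["examples"]:
        # Each example is an input conversation plus optional reference targets.
        assert isinstance(example["input"], list) and example["input"]
        for target in example.get("targets", []):
            assert target.get("role") == "assistant"
```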
@@ -332,28 +352,36 @@ Each entry in the JSON array defines one test:
 ```python
 # Requires: mellea
 # Returns: None
-from mellea import MelleaSession, start_session
+from mellea import start_session
+from mellea.stdlib.components import SimpleComponent
 from mellea.stdlib.components.unit_test_eval import TestBasedEval
 
-# Load one TestBasedEval per test definition in the file.
-test_evals = TestBasedEval.from_json_file("tests/eval_data/email_classifier.json")
+test_evals = TestBasedEval.from_json_file("tests/eval_data/email_writer.json")
 
-judge_session = start_session()
+judge_session = start_session(backend_name="ollama", model_id="granite4:micro")
+generation_session = start_session(backend_name="ollama", model_id="granite4:micro")
 
 for eval_case in test_evals:
     for idx, input_text in enumerate(eval_case.inputs):
-        # Generate the prediction from the system under test.
-        prediction = "no"  # replace with your actual model call
+        prediction = generation_session.act(
+            SimpleComponent(instruction=input_text)
+        ).value
 
         targets = eval_case.targets[idx] if eval_case.targets else []
         eval_case.set_judge_context(input_text, prediction, targets)
 
-        verdict = judge_session.instruct(eval_case)
+        verdict = judge_session.act(eval_case)
+        # Note: verdict.value is the raw JSON string returned by the judge — {"score": 0|1,
+        # "justification": "..."}. Score 0 means the guidelines were violated; score 1 means the
+        # output is well aligned. Parse it to use the score programmatically:
         print(f"{eval_case.name}: {verdict.value}")
 ```
 
 > **Note:** `TestBasedEval` calls the judge model once per input. For large
 > evaluation sets, consider batching or running evaluations asynchronously.
+> **CLI alternative:** The same evaluation can be run without writing Python:
+> `m eval run tests/eval_data/email_writer.json --backend ollama --model granite4:micro`
+> See `m eval run --help` for full options.
 
 ## CI strategy
 
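*Editorial aside (not part of the commit): the new comment in the snippet above says `verdict.value` is a raw JSON string of the form {"score": 0|1, "justification": "..."}, so individual verdicts can be parsed and rolled up into a pass rate. A rough sketch; the hard-coded sample verdict strings are placeholders for real judge output:*

```python
# Illustrative post-processing of judge verdicts, assuming the verdict.value
# format described in the diff above. Sample strings stand in for real output.
import json


def parse_verdict(raw: str) -> tuple[int, str]:
    """Extract the 0/1 score and the justification from a judge response."""
    data = json.loads(raw)
    return int(data["score"]), data.get("justification", "")


sample_verdicts = [
    ("case_001", '{"score": 1, "justification": "Follows the input instructions."}'),
    ("case_001", '{"score": 0, "justification": "Tone is too informal."}'),
]

scores = []
for name, raw in sample_verdicts:
    score, why = parse_verdict(raw)
    scores.append(score)
    print(f"{name}: score={score} ({why})")

# Score 1 counts as a pass, score 0 as a failure against the guidelines.
print(f"Pass rate: {sum(scores) / len(scores):.0%} across {len(scores)} examples")
```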

@@ -407,3 +435,5 @@ pytest -m qualitative
   `Requirement`, `simple_validate`, and `check` interact with the IVR loop
 - [Handling Exceptions](../how-to/handling-exceptions)
   catch and diagnose errors that occur during generation
+- [Evaluate with LLM-as-a-Judge](../evaluation-and-observability/evaluate-with-llm-as-a-judge)
+  the `Requirement`-based approach for inline judge evaluation
