
Commit 208ca9b

seirastoplanetf1 and Nigel Jones authored
docs: test based eval documentation (#916)
* documentation update
* small wording
* Update docs/docs/how-to/unit-test-generative-code.md
  Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
* Update docs/docs/how-to/unit-test-generative-code.md
  Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
* formatting improvements
* updates based on PR suggestions
* example and comments fix
* link fix
* cli alternative
* Update docs/docs/how-to/unit-test-generative-code.md
  Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
* Update docs/docs/how-to/unit-test-generative-code.md
  Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
* markdownlint fix

---------

Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
1 parent 365f863 commit 208ca9b

1 file changed

docs/docs/how-to/unit-test-generative-code.md
Lines changed: 46 additions & 16 deletions
@@ -18,13 +18,16 @@ fast and reliable.
 
 ## Three levels of assertion
 
-Every test for a `@generative` function falls into one of three levels:
+Every test for a `@generative` function falls into one of four levels:
 
 | Level | What you assert | Deterministic? |
 | ----- | --------------- | -------------- |
 | **Type check** | `isinstance(result, bool)` | Yes — constrained decoding always returns the declared type |
 | **Structural check** | `result in ["positive", "negative"]` or field names present | Yes — schema enforcement is deterministic |
 | **Qualitative check** | `assert result is True` | No — depends on the model and prompt |
+| **Semantic evaluation** | Judge model scores output against reference responses | No — run separately, not a pytest assertion |
+
+*For levels 1–3, use pytest with the patterns below. For semantic evaluation against reference examples — where you want a judge model to score your model's outputs in bulk — see [The `unit_test_eval` component for Generative Unit Tests](#the-unit_test_eval-component-for-generative-unit-tests) at the end of this page.*
 
 Type and structural checks run in CI. Qualitative checks carry
 `@pytest.mark.qualitative` and are skipped in CI when `CICD=1` is set.
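
*Editorial aside (not part of the commit): a minimal sketch of how the first three levels in the table above typically look as pytest tests. The `classify_sentiment` function, its module path, its call signature, and the explicit `skipif` guard are illustrative assumptions only; the `session` fixture mirrors the one used elsewhere in the documented examples.*

```python
# Hypothetical example of the three pytest-level checks described above.
# `classify_sentiment` stands in for any @generative function under test.
import os

import pytest

from my_project.generators import classify_sentiment  # hypothetical module


def test_returns_declared_type(session):
    # Level 1: type check. Constrained decoding should return the declared type.
    result = classify_sentiment(session, "I love this product")
    assert isinstance(result, str)


def test_returns_allowed_label(session):
    # Level 2: structural check. The output must be one of the allowed values.
    result = classify_sentiment(session, "I love this product")
    assert result in ["positive", "negative"]


@pytest.mark.qualitative
@pytest.mark.skipif(os.environ.get("CICD") == "1", reason="model-dependent")
def test_positive_review_is_labelled_positive(session):
    # Level 3: qualitative check. Depends on the model and prompt, so skipped in CI.
    # The skipif guard is one possible way to honor CICD=1; a project may instead
    # configure this centrally for the qualitative marker.
    result = classify_sentiment(session, "I love this product")
    assert result == "positive"
```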
@@ -298,12 +301,21 @@ def test_with_simple_validate_requirement(session):
     assert isinstance(res.value, str)
 ```
 
-## The `unit_test_eval` component
+## The `unit_test_eval` component for Generative Unit Tests
 
 `mellea.stdlib.components.unit_test_eval` provides `TestBasedEval`, a
-`Component` that formats an LLM-as-a-judge evaluation task. You load test cases
+`Component` that formats an LLM-as-a-judge evaluation task for generative unit testing. You load test cases
 from a JSON file and pass them to a judge session. This is useful for offline
-evaluation pipelines, not for individual pytest assertions.
+evaluation pipelines,
+not for individual pytest assertions.
+
+Given a task, you provide test cases that consist of evaluation instructions
+and a set of examples, along with associated metadata. Each example, in conversational format, consists of an input
+and (optional) target / reference response(s), which can be used to guide evaluation along with the evaluation instructions.
+They can either be instantiated inline or in JSON format, with a separate JSON file per task.
+
+There are no limitations on the number of test examples per task, and each input can have multiple reference responses.
+The evaluation instructions apply to all the test cases in your task.
 
 ### JSON file format
 
@@ -312,15 +324,23 @@ Each entry in the JSON array defines one test:
 ```json
 [
   {
-    "source": "email-classifier",
-    "name": "positive_case_001",
-    "instructions": "Evaluate whether the prediction correctly identifies the category.",
+    "source": "professional-email-writing",
+    "name": "case_001",
+    "instructions": "The email should follow the instructions in the input.",
     "id": "tc-001",
     "examples": [
       {
         "input_id": "ex-001",
-        "input": [{"role": "user", "content": "Is this email spam?"}],
-        "targets": [{"role": "assistant", "content": "no"}]
+        "input": [{"role": "user", "content": "Write a brief professional follow-up email after a job interview"}],
+        "targets": [
+          {"role": "assistant", "content": "Thank you for taking the time to speak with me yesterday. I look forward to hearing about next steps at your convenience."},
+          {"role": "assistant", "content": "I appreciate the opportunity to interview yesterday. Looking forward to hearing about next steps."}
+        ]
+      },
+      {
+        "input_id": "ex-002",
+        "input": [{"role": "user", "content": "I just finished a client demo can you create a formal thank-you email"}],
+        "targets": [{"role": "assistant", "content": "It was a pleasure sharing a product demo with you. Thank you for meeting with us."}]
       }
     ]
   }
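
*Editorial aside (not part of the commit): before pointing an evaluation run at a file like this, it can help to sanity-check its shape with the standard library alone. A rough sketch, assuming the `tests/eval_data/email_writer.json` path that the usage example below loads:*

```python
# Stdlib-only sanity check for the JSON format shown above; illustrative,
# not a documented API. The path matches the usage example later in this diff.
import json

with open("tests/eval_data/email_writer.json") as f:
    tasks = json.load(f)

for task in tasks:
    # Every task carries shared evaluation instructions and at least one example.
    assert task["instructions"], f"{task['name']}: missing instructions"
    assert task["examples"], f"{task['name']}: no examples"
    for example in task["examples"]:
        # Each example is an input conversation plus optional reference targets.
        assert isinstance(example["input"], list) and example["input"]
        for target in example.get("targets", []):
            assert target.get("role") == "assistant"
```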
@@ -332,28 +352,36 @@ Each entry in the JSON array defines one test:
 ```python
 # Requires: mellea
 # Returns: None
-from mellea import MelleaSession, start_session
+from mellea import start_session
+from mellea.stdlib.components import SimpleComponent
 from mellea.stdlib.components.unit_test_eval import TestBasedEval
 
-# Load one TestBasedEval per test definition in the file.
-test_evals = TestBasedEval.from_json_file("tests/eval_data/email_classifier.json")
+test_evals = TestBasedEval.from_json_file("tests/eval_data/email_writer.json")
 
-judge_session = start_session()
+judge_session = start_session(backend_name="ollama", model_id="granite4:micro")
+generation_session = start_session(backend_name="ollama", model_id="granite4:micro")
 
 for eval_case in test_evals:
     for idx, input_text in enumerate(eval_case.inputs):
-        # Generate the prediction from the system under test.
-        prediction = "no"  # replace with your actual model call
+        prediction = generation_session.act(
+            SimpleComponent(instruction=input_text)
+        ).value
 
         targets = eval_case.targets[idx] if eval_case.targets else []
         eval_case.set_judge_context(input_text, prediction, targets)
 
-        verdict = judge_session.instruct(eval_case)
+        verdict = judge_session.act(eval_case)
+        # Note: verdict.value is the raw JSON string returned by the judge — {"score": 0|1,
+        # "justification": "..."}. Score 0 means the guidelines were violated; score 1 means the
+        # output is well aligned. Parse it to use the score programmatically:
         print(f"{eval_case.name}: {verdict.value}")
 ```
 
 > **Note:** `TestBasedEval` calls the judge model once per input. For large
 > evaluation sets, consider batching or running evaluations asynchronously.
+> **CLI alternative:** The same evaluation can be run without writing Python:
+> `m eval run tests/eval_data/email_writer.json --backend ollama --model granite4:micro`
+> See `m eval run --help` for full options.
 
 ## CI strategy
 
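*Editorial aside (not part of the commit): the new comment in the snippet above says `verdict.value` is a raw JSON string of the form {"score": 0|1, "justification": "..."}, so individual verdicts can be parsed and rolled up into a pass rate. A rough sketch; the hard-coded sample verdict strings are placeholders for real judge output:*

```python
# Illustrative post-processing of judge verdicts, assuming the verdict.value
# format described in the diff above. Sample strings stand in for real output.
import json


def parse_verdict(raw: str) -> tuple[int, str]:
    """Extract the 0/1 score and the justification from a judge response."""
    data = json.loads(raw)
    return int(data["score"]), data.get("justification", "")


sample_verdicts = [
    ("case_001", '{"score": 1, "justification": "Follows the input instructions."}'),
    ("case_001", '{"score": 0, "justification": "Tone is too informal."}'),
]

scores = []
for name, raw in sample_verdicts:
    score, why = parse_verdict(raw)
    scores.append(score)
    print(f"{name}: score={score} ({why})")

# Score 1 counts as a pass, score 0 as a failure against the guidelines.
print(f"Pass rate: {sum(scores) / len(scores):.0%} across {len(scores)} examples")
```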

@@ -407,3 +435,5 @@ pytest -m qualitative
   `Requirement`, `simple_validate`, and `check` interact with the IVR loop
 - [Handling Exceptions](../how-to/handling-exceptions)
   catch and diagnose errors that occur during generation
+- [Evaluate with LLM-as-a-Judge](../evaluation-and-observability/evaluate-with-llm-as-a-judge)
+  the `Requirement`-based approach for inline judge evaluation
