`docs/docs/how-to/unit-test-generative-code.md`
## Four levels of assertion
Every test for a `@generative` function falls into one of four levels:
| Level | What you assert | Deterministic? |
| ----- | --------------- | -------------- |
| **Type check** | `isinstance(result, bool)` | Yes — constrained decoding always returns the declared type |
| **Structural check** | `result in ["positive", "negative"]` or field names present | Yes — schema enforcement is deterministic |
| **Qualitative check** | `assert result is True` | No — depends on the model and prompt |
| **Semantic evaluation** | Judge model scores output against reference responses | No — run separately, not a pytest assertion |

*For levels 1–3, use pytest with the patterns below. For semantic evaluation against reference examples — where you want a judge model to score your model's outputs in bulk — see [The `unit_test_eval` component for Generative Unit Tests](#the-unit_test_eval-component-for-generative-unit-tests) at the end of this page.*

Type and structural checks run in CI. Qualitative checks carry `@pytest.mark.qualitative` and are skipped in CI when `CICD=1` is set.
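The three pytest-level checks and the CI skip can be sketched as follows. This is a minimal illustration, not mellea's API: `is_positive` is a hypothetical stand-in for a real `@generative` function, and a `skipif` marker stands in for the project's custom `@pytest.mark.qualitative` marker.

```python
import os

import pytest

# Stand-in for the custom @pytest.mark.qualitative marker described above:
# skip qualitative checks whenever CICD=1 is set in the environment.
qualitative = pytest.mark.skipif(
    os.environ.get("CICD") == "1",
    reason="qualitative checks are skipped in CI",
)


def is_positive(text: str) -> bool:
    # Hypothetical stand-in for a @generative function; a real one
    # would call the model with constrained decoding.
    return "great" in text.lower()


def test_type_check():
    # Level 1: constrained decoding always returns the declared type.
    assert isinstance(is_positive("The demo went great"), bool)


def test_structural_check():
    # Level 2: the value is one of the allowed results.
    assert is_positive("The demo went great") in (True, False)


@qualitative
def test_qualitative_check():
    # Level 3: depends on the model and prompt, so not run when CICD=1.
    assert is_positive("The demo went great") is True
```

Because only the third test is behind the marker, the type and structural checks still run on every CI invocation.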
"input": [{"role": "user", "content": "Write a brief professional follow-up email after a job interview"}],
335
+
"targets": [
336
+
{"role": "assistant", "content": "Thank you for taking the time to speak with me yesterday. I look forward to hearing about next steps at your convenience."},
337
+
{"role": "assistant", "content": "I appreciate the opportunity to interview yesterday. Looking forward to hearing about next steps."}
338
+
]
339
+
},
340
+
{
341
+
"input_id": "ex-002",
342
+
"input": [{"role": "user", "content": "I just finished a client demo can you create a formal thank-you email"}],
343
+
"targets": [{"role": "assistant", "content": "It was a pleasure sharing a product demo with you. Thank you for meeting with us."}]
324
344
}
325
345
]
326
346
}
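A file of this shape can be sanity-checked with the standard library before handing it to the eval harness. This is a minimal sketch: the required-key list is inferred from the example above, not an official mellea schema.

```python
import json

# Structural check for test definitions of the shape shown above: each
# entry needs an "input_id", a non-empty "input" message list, and at
# least one "targets" reference response. (Keys inferred from the
# example file; not an official mellea schema.)
def validate_test_entries(entries):
    for entry in entries:
        assert isinstance(entry.get("input_id"), str)
        assert isinstance(entry.get("input"), list) and entry["input"]
        assert isinstance(entry.get("targets"), list) and entry["targets"]
        for msg in entry["input"] + entry["targets"]:
            assert {"role", "content"} <= msg.keys()


entries = json.loads("""[
  {
    "input_id": "ex-002",
    "input": [{"role": "user", "content": "I just finished a client demo can you create a formal thank-you email"}],
    "targets": [{"role": "assistant", "content": "It was a pleasure sharing a product demo with you. Thank you for meeting with us."}]
  }
]""")
validate_test_entries(entries)  # raises AssertionError on malformed entries
```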
Each entry in the JSON array defines one test:
```python
# Requires: mellea
# Returns: None
from mellea import start_session
from mellea.stdlib.components import SimpleComponent
from mellea.stdlib.components.unit_test_eval import TestBasedEval

# Load one TestBasedEval per test definition in the file.
```