feat: groundedness requirement #773
akihikokuroda wants to merge 19 commits into generative-computing:main
Conversation
The PR description has been updated. Please fill out the template for your PR to be reviewed.
jakelorocco
left a comment
This is a very interesting requirement. I think it's a good opportunity to show off Mellea intrinsics and requirement checking. I'm not sure we have many other requirements with as many LLM calls.
One broader suggestion than the comments I left below: Could we parallelize the steps? Could we generate citations at the same time we check spans for needing citations? As we generate spans that need to be checked, could we check each in parallel or as they are given? If so, I think we should make this requirement work more asynchronously and have an early exit mode if a span fails the check (even if not all citations have been generated / not all spans have been checked).
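The parallel, early-exit shape suggested here could be sketched roughly as follows. This is purely illustrative: the function and parameter names are hypothetical, not Mellea APIs, and a real implementation would plug the actual span check in place of `check`.

```python
import asyncio


async def check_spans_early_exit(spans, check) -> bool:
    """Run one check per span concurrently; return False as soon as any
    span fails, cancelling the remaining in-flight checks."""
    tasks = [asyncio.ensure_future(check(span)) for span in spans]
    for finished in asyncio.as_completed(tasks):
        if not await finished:
            for t in tasks:  # early exit: stop outstanding checks
                t.cancel()
            return False
    return True
```

The same pattern would let citation generation overlap with span checking, since both sides can be scheduled as tasks on the same event loop.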
cc @generative-computing/mellea-intrinsics
@jakelorocco Thanks for the review. I addressed all your comments except "Could we parallelize the steps?". I'm working on it.
@jakelorocco There are two ideas to improve the requirement. This one does not parallelize the processing, but it makes a batch call for the citation support step.
The "parallelize" approach seems to need some further work/investigation, so for now I improved the "citation support" step to make only one LLM call instead of calling the LLM for each span.
cc @yannisk2
@planetf1 Thanks for the review. I addressed all your comments.
planetf1
left a comment
LGTM, and thanks for addressing the comments. I'll leave this as a comment since there are some items outstanding from @jakelorocco that I think may need addressing.
@psschwei this is the PR.
yannisk2
left a comment
@akihikokuroda Thank you for putting this together! Please see below for a few additional comments on improvements/changes.
```python
if self.documents is not None:
    documents = self.documents
else:
    documents = last_message._docs or []
```
For the case where the documents are not directly provided to the requirement checker and are read off the context instead, do we have a standard design pattern for RAG showing where the documents should be attached to? I see at least three options:
- Documents are attached to the last assistant message (which the code in this PR assumes)
- Documents are attached to the last user message
- Documents are attached using the `grounding_context` (which is what the RAG example at https://github.com/generative-computing/mellea/tree/main/docs/examples/rag uses).
I think that it would help to standardize the way that documents are passed in RAG scenarios and then align the code in this PR to read them from the corresponding location, so that the users that employ the requirement checker as part of an RAG IVR pattern would not have to pass the documents twice (once for response generation and a second time for the requirement check).
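If the document sources do get standardized, one possible fallback order could look like the sketch below. This is purely illustrative: the function and parameter names are assumptions, not Mellea's API, and the actual priority order would come out of the design discussion.

```python
def resolve_documents(explicit_docs, grounding_context_docs, last_message_docs):
    """Return the first non-empty document source in a fixed priority order:
    constructor-provided docs, then grounding context, then the last message."""
    for candidate in (explicit_docs, grounding_context_docs, last_message_docs):
        if candidate:
            return list(candidate)
    return []
```

A single resolution helper like this would let the requirement checker and the response generator read documents from the same place, avoiding the double-passing described above.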
I agree that standardizing document passing in RAG scenarios would improve the developer experience and avoid requiring documents to be passed twice (once for response generation and once for requirement validation).
The current design supports documents in the constructor or attached to the assistant message, which works but isn't aligned with the grounding_context pattern used in existing RAG examples like simple_rag_with_filter.py.
I'd like to defer this design alignment to a follow-on PR. That work should:
- Standardize on the grounding_context pattern for document passing
- Refactor ChatContext (or add grounding context support) to make documents available throughout the pipeline
- Update GroundednessRequirement to read documents from the context instead of requiring explicit constructor/message passing
- Update examples and documentation to show the unified pattern
This deserves dedicated attention, discussion with the core team and testing to ensure it works well across all RAG use cases. I'll open a tracking issue to capture this improvement.
I agree that this will require a broader discussion and coordination with the core team. If we can capture this as a separate issue, I am completely fine with addressing it in a separate PR.
```python
try:
    # Step 1: Citation Generation
    # Call intrinsic directly for explicit control over model options
    from ..components.intrinsic._util import call_intrinsic
```
Can the import statement be moved to the top of the file?
An inline comment was added to explain. The lazy import is actually necessary due to a circular dependency:
- `mellea.stdlib.requirements.rag` imports from `mellea.stdlib.components`
- which transitively imports from `mellea.backends`
- if the import of `call_intrinsic` is at module level, it triggers loading the backends module before its initialization completes
```python
citation_context = context_before_response.add(
    Message("assistant", response, documents=list(documents))
)
citations: list[dict] = call_intrinsic(
```
I would suggest using the find_citations function instead of the call_intrinsic function (which is internally called by the former) to avoid replicating here the code of the find_citations function. This should also make the code cleaner, as if we were to call the find_citations function, we would not need to add back to the context the last assistant message that we have just separated from the original context.
I'll make changes. Thanks!
```python
covered_ranges.sort()
merged_ranges: list[tuple[int, int]] = []
for begin, end in covered_ranges:
    if merged_ranges and begin <= merged_ranges[-1][1]:
```
Ensure that consecutive non-overlapping spans are not merged (this may require replacing `<=` with `<` above).
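The suggested fix can be illustrated with a minimal sketch, assuming the ranges are half-open `(begin, end)` character offsets; whether `<` is the right comparison depends on that convention, which is why the comment above is hedged.

```python
def merge_ranges(covered_ranges):
    """Merge strictly overlapping ranges only; with '<' instead of '<=',
    adjacent ranges like (0, 5) and (5, 8) stay separate."""
    merged: list[tuple[int, int]] = []
    for begin, end in sorted(covered_ranges):
        if merged and begin < merged[-1][1]:
            last_begin, last_end = merged[-1]
            merged[-1] = (last_begin, max(last_end, end))
        else:
            merged.append((begin, end))
    return merged
```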
```python
current_span_start = 0
current_is_covered = is_covered(0) if response else False

for i in range(1, len(response) + 1):
```
Identifying the spans could be done more efficiently by iterating over the merged_ranges instead of iterating over every single character in the response.
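A sketch of the suggested approach, walking the merged ranges directly instead of testing every character (the function name and span tuple shape are illustrative, not the PR's code):

```python
def spans_from_merged(response: str, merged_ranges):
    """Emit (start, end, is_covered) spans by alternating between the gaps
    and the merged covered ranges, in O(len(merged_ranges)) steps."""
    spans = []
    pos = 0
    for begin, end in merged_ranges:
        if pos < begin:
            spans.append((pos, begin, False))  # uncovered gap before the range
        spans.append((begin, end, True))       # covered range
        pos = end
    if pos < len(response):
        spans.append((pos, len(response), False))  # trailing uncovered text
    return spans
```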
```python
result, _ = await backend.generate_from_context(
    action,
    context,
    model_options={"temperature": 0.0, "max_new_tokens": 500},
```
Could we have a higher default of max_new_tokens (or make it configurable)?
I'll make it configurable. Thanks!
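One way to make the budget configurable is a constructor parameter that feeds into the model options. The class below is a simplified stand-in to show the shape, not the actual GroundednessRequirement code, and the default of 2048 is an assumption.

```python
class GroundednessCheckOptions:
    """Simplified stand-in showing a configurable token budget with a
    higher default than the hard-coded 500."""

    def __init__(self, max_new_tokens: int = 2048, temperature: float = 0.0):
        self.max_new_tokens = max_new_tokens
        self.temperature = temperature

    def model_options(self) -> dict:
        # Passed through to backend.generate_from_context(..., model_options=...)
        return {"temperature": self.temperature,
                "max_new_tokens": self.max_new_tokens}
```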
```python
result, _ = await backend.generate_from_context(
    action,
    context,
    model_options={"temperature": 0.0, "max_new_tokens": 500},
```
Could we have a higher default of max_new_tokens (or make it configurable)?
I'll make it configurable. Thanks!
```diff
@@ -0,0 +1,514 @@
+"""Tests for GroundednessRequirement."""
```
In addition to the current tests, can we also add a few tests that check end-to-end the correctness of the requirement checker? Test cases could include simple examples of grounded, ungrounded, or partially grounded responses, responses that do not need citations (e.g., I-do-not-know), etc.
I agree that end-to-end correctness tests validating real grounded/ungrounded/partially-grounded responses would strengthen the test suite.
I'd like to defer adding those tests to a follow-on PR. The reason is that comprehensive correctness tests require careful data engineering to craft responses that are genuinely grounded vs. ungrounded by the provided documents, which deserves focused attention.
I'll open a follow-on issue/PR to track adding:
- Tests for fully grounded responses (should pass)
- Tests for ungrounded responses (should fail)
- Tests for partially grounded responses
- Tests for responses that don't need citations (I-don't-know, disclaimers, etc.)
These can be marked with `@pytest.mark.slow` since they'll require GPU inference and real backend validation.
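The deferred cases could eventually take a shape like the sketch below. The checker here is a trivial keyword-based stand-in, only meant to show the case structure; the real tests would run GroundednessRequirement against a live backend, and the document text is borrowed from the review example above.

```python
DOCUMENT = "IBM ... Its headquarters are in Armonk, NY"


def stand_in_checker(response: str) -> bool:
    """Placeholder groundedness check: accept refusals, otherwise require
    every longer content word to appear in the document."""
    if "don't know" in response.lower():
        return True
    words = [w.strip(".,") for w in response.split()]
    return all(w.lower() in DOCUMENT.lower() for w in words if len(w) > 3)


cases = [
    ("Its headquarters are in Armonk", True),   # fully grounded
    ("IBM moved to Paris", False),              # ungrounded
    ("I don't know.", True),                    # needs no citations
]
```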
```python
    )
    return prompt


def _build_batch_support_prompt(
```
This prompt should also include the documents, as in order for an LLM to decide if a citation supports a response span, it may need to reason about the citation in the context of the document in which the citation appears. For instance, consider the following example:
- Document: "IBM ... Its headquarters are in Armonk, NY"
- Response sentence: "IBM is headquartered in Armonk, NY"
- Citation for response sentence: "Its headquarters are in Armonk, NY"
In this example, for an LLM to verify that the citation supports the response sentence, it has to be aware of the document where the citation appears, so that it can verify that the word "its" in the citation indeed refers to IBM.
Based on the above, the support prompt has to also include the documents.
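A sketch of what a document-aware batch prompt might look like. The structure, names, and wording are assumptions for illustration, not the PR's actual prompt:

```python
def build_batch_support_prompt(documents, span_citation_pairs) -> str:
    """Assemble one prompt that shows the source documents first, so the
    judge can resolve references ('its' -> IBM) inside each citation."""
    lines = ["Source documents:"]
    for i, doc in enumerate(documents, start=1):
        lines.append(f"[Doc {i}] {doc}")
    lines.append("")
    lines.append("For each pair below, answer yes or no: does the citation,")
    lines.append("read in the context of its document, support the span?")
    for j, (span, citation) in enumerate(span_citation_pairs, start=1):
        lines.append(f"{j}. Span: {span}")
        lines.append(f"   Citation: {citation}")
    return "\n".join(lines)
```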
I'll fix it. Thanks!
Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>