
feat: groundedness requirement #773

Open
akihikokuroda wants to merge 19 commits into generative-computing:main from akihikokuroda:citation

Conversation

@akihikokuroda
Member

@akihikokuroda akihikokuroda commented Apr 1, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

@akihikokuroda akihikokuroda requested a review from a team as a code owner April 1, 2026 20:07
@github-actions
Contributor

github-actions Bot commented Apr 1, 2026

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@akihikokuroda akihikokuroda changed the title groundedness requirement feat: groundedness requirement Apr 1, 2026
@github-actions github-actions Bot added the enhancement New feature or request label Apr 1, 2026
Contributor

@jakelorocco jakelorocco left a comment


This is a very interesting requirement. I think it's a good opportunity to show off Mellea intrinsics and requirement checking. I'm not sure we have many other requirements with as many LLM calls.

One suggestion broader than the comments I left below: could we parallelize the steps? Could we generate citations at the same time we check spans for needing citations? As we generate spans that need to be checked, could we check each in parallel, or as they are produced? If so, I think we should make this requirement work more asynchronously and have an early-exit mode if a span fails the check (even if not all citations have been generated / not all spans have been checked).
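The early-exit idea sketched in this comment could look roughly like the following. This is an illustrative sketch only, not Mellea's API: `check_span` is a stand-in for an LLM-backed groundedness check of a single span.

```python
import asyncio


async def check_span(span: str) -> bool:
    """Stand-in for an LLM-backed groundedness check of one span."""
    await asyncio.sleep(0)  # a real check would await a backend call here
    return "ungrounded" not in span


async def check_all_spans(spans: list[str]) -> bool:
    """Check spans concurrently and exit early as soon as any span fails."""
    tasks = [asyncio.create_task(check_span(s)) for s in spans]
    try:
        for done in asyncio.as_completed(tasks):
            if not await done:
                return False  # early exit: one span already failed
        return True
    finally:
        for t in tasks:
            t.cancel()  # cancel any checks still in flight
```

Because `as_completed` yields results in completion order, the requirement can report failure as soon as the first failing span finishes, without waiting for the remaining LLM calls.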

Comment thread docs/examples/groundedness_requirement_example.py Outdated
Comment thread mellea/stdlib/requirements/rag.py Outdated
Comment thread mellea/stdlib/requirements/rag.py Outdated
Comment thread mellea/stdlib/requirements/rag.py
Comment thread mellea/stdlib/requirements/rag.py Outdated
Comment thread mellea/stdlib/requirements/rag.py
Comment thread mellea/stdlib/requirements/rag.py
Comment thread mellea/stdlib/requirements/rag.py
Comment thread mellea/stdlib/requirements/rag.py
Comment thread mellea/stdlib/requirements/rag.py Outdated
@psschwei
Member

psschwei commented Apr 2, 2026

cc @generative-computing/mellea-intrinsics

@akihikokuroda
Member Author

@jakelorocco Thanks for the review. I addressed all your comments except "Could we parallelize the steps?". I'm working on it.

@akihikokuroda
Member Author

akihikokuroda commented Apr 2, 2026

@jakelorocco There are 2 ideas to improve the requirement.
For this one:
OPTIMIZED_PIPELINE_DESIGN.md
I'm checking whether the citation intrinsic works in this usage.

This one does not parallelize the processing, but it makes a batch call for the citation support step:
COMBINED_SUPPORT_ASSESSMENT_DESIGN.md

@akihikokuroda
Member Author

Parallelizing seems to need some more work/investigation, so I improved the "citation support" step to make only one LLM call instead of calling the LLM for each span.

@psschwei
Member

psschwei commented Apr 7, 2026

cc @yannisk2

Comment thread mellea/stdlib/requirements/rag.py
Comment thread test/stdlib/requirements/test_groundedness_requirement.py Outdated
Comment thread mellea/stdlib/requirements/rag.py Outdated
Comment thread mellea/stdlib/requirements/rag.py
@akihikokuroda akihikokuroda requested a review from a team as a code owner April 7, 2026 19:37
@akihikokuroda akihikokuroda requested a review from planetf1 April 7, 2026 19:37
@akihikokuroda
Member Author

@planetf1 Thanks for the review. I addressed all your comments.

@akihikokuroda akihikokuroda self-assigned this Apr 8, 2026
Contributor

@planetf1 planetf1 left a comment


LGTM, and thanks for addressing the comments. I'll leave this as a comment, as there are some outstanding items from @jakelorocco that I think may need addressing.

@akihikokuroda
Member Author

@psschwei this is the PR.

Contributor

@yannisk2 yannisk2 left a comment


@akihikokuroda Thank you for putting this together! Please see below for a few additional comments on improvements/changes.

if self.documents is not None:
    documents = self.documents
else:
    documents = last_message._docs or []
Contributor


For the case where the documents are not directly provided to the requirement checker and are read off the context instead, do we have a standard design pattern for RAG showing where the documents should be attached to? I see at least three options:

I think that it would help to standardize the way that documents are passed in RAG scenarios and then align the code in this PR to read them from the corresponding location, so that the users that employ the requirement checker as part of an RAG IVR pattern would not have to pass the documents twice (once for response generation and a second time for the requirement check).

Member Author


I agree that standardizing document passing in RAG scenarios would improve the developer experience and avoid requiring documents to be passed twice (once for response generation and once for requirement validation).
The current design supports documents in the constructor or attached to the assistant message, which works but isn't aligned with the grounding_context pattern used in existing RAG examples like simple_rag_with_filter.py.

I'd like to defer this design alignment to a follow-on PR. That work should:

  1. Standardize on the grounding_context pattern for document passing
  2. Refactor ChatContext (or add grounding context support) to make documents available throughout the pipeline
  3. Update GroundednessRequirement to read documents from the context instead of requiring explicit constructor/message passing
  4. Update examples and documentation to show the unified pattern

This deserves dedicated attention, discussion with the core team and testing to ensure it works well across all RAG use cases. I'll open a tracking issue to capture this improvement.

Contributor


I agree that this will require a broader discussion and coordination with the core team. If we can capture this as a separate issue, I am completely fine with addressing it in a separate PR.

Comment thread mellea/stdlib/requirements/rag.py Outdated
try:
    # Step 1: Citation Generation
    # Call intrinsic directly for explicit control over model options
    from ..components.intrinsic._util import call_intrinsic
Contributor


Can the import statement be moved to the top of the file?

Member Author


Inline comment added to explain.

The lazy import is actually necessary due to a circular dependency:

  • mellea.stdlib.requirements.rag imports from mellea.stdlib.components
  • Which transitively imports from mellea.backends
  • If the import of call_intrinsic is at module level, it triggers loading of the backends module before its initialization completes

Comment thread mellea/stdlib/requirements/rag.py Outdated
citation_context = context_before_response.add(
    Message("assistant", response, documents=list(documents))
)
citations: list[dict] = call_intrinsic(
Contributor


I would suggest using the find_citations function instead of call_intrinsic (which find_citations calls internally), to avoid replicating the code of find_citations here. It should also make the code cleaner: with find_citations, we would not need to add back to the context the last assistant message that we had just separated from the original context.

Member Author


I'll make changes. Thanks!

Comment thread mellea/stdlib/requirements/rag.py Outdated
covered_ranges.sort()
merged_ranges: list[tuple[int, int]] = []
for begin, end in covered_ranges:
    if merged_ranges and begin <= merged_ranges[-1][1]:
Contributor


Ensure that consecutive non-overlapping spans are not merged (may have to replace <= with < above).
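A minimal sketch of the suggested fix, assuming spans where equal endpoints mean adjacency rather than overlap (the function name is illustrative, not the PR's exact code):

```python
def merge_ranges(covered: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping ranges; keep consecutive non-overlapping ones apart."""
    merged: list[tuple[int, int]] = []
    for begin, end in sorted(covered):
        # Strict `<`: (0, 5) and (5, 9) are adjacent, not overlapping,
        # so they stay separate; `<=` would incorrectly fuse them.
        if merged and begin < merged[-1][1]:
            prev_begin, prev_end = merged[-1]
            merged[-1] = (prev_begin, max(prev_end, end))
        else:
            merged.append((begin, end))
    return merged


print(merge_ranges([(0, 5), (5, 9), (3, 4)]))  # → [(0, 5), (5, 9)]
```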

Member Author


I fixed it. Thanks!

Comment thread mellea/stdlib/requirements/rag.py Outdated
current_span_start = 0
current_is_covered = is_covered(0) if response else False

for i in range(1, len(response) + 1):
Contributor


Identifying the spans could be done more efficiently by iterating over the merged_ranges instead of iterating over every single character in the response.
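One way to implement this suggestion, sketched with illustrative names: since the merged ranges are sorted and disjoint, the uncovered spans are just the gaps between consecutive ranges, plus the head and tail of the response.

```python
def uncovered_spans(
    response_len: int, merged_ranges: list[tuple[int, int]]
) -> list[tuple[int, int]]:
    """Compute the gaps not covered by sorted, disjoint merged ranges."""
    spans: list[tuple[int, int]] = []
    cursor = 0
    for begin, end in merged_ranges:
        if cursor < begin:
            spans.append((cursor, begin))  # gap before this covered range
        cursor = max(cursor, end)
    if cursor < response_len:
        spans.append((cursor, response_len))  # uncovered tail of the response
    return spans


print(uncovered_spans(20, [(2, 5), (9, 12)]))  # → [(0, 2), (5, 9), (12, 20)]
```

This runs in O(number of ranges) rather than O(length of response).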

Comment thread mellea/stdlib/requirements/rag.py Outdated
result, _ = await backend.generate_from_context(
    action,
    context,
    model_options={"temperature": 0.0, "max_new_tokens": 500},
Contributor


Could we have a higher default of max_new_tokens (or make it configurable)?

Member Author


Made it configurable. Thanks!

Comment thread mellea/stdlib/requirements/rag.py Outdated
result, _ = await backend.generate_from_context(
    action,
    context,
    model_options={"temperature": 0.0, "max_new_tokens": 500},
Contributor


Could we have a higher default of max_new_tokens (or make it configurable)?

Member Author


Made it configurable. Thanks!

@@ -0,0 +1,514 @@
"""Tests for GroundednessRequirement."""
Contributor


In addition to the current tests, can we also add a few tests that check the correctness of the requirement checker end to end? Test cases could include simple examples of grounded, ungrounded, or partially grounded responses, responses that do not need citations (e.g., I-do-not-know), etc.

Member Author


I agree that end-to-end correctness tests validating real grounded/ungrounded/partially-grounded responses would strengthen the test suite.

I'd like to defer adding those tests to a follow-on PR. The reason is that comprehensive correctness tests require careful data engineering to craft responses that are genuinely grounded vs. ungrounded by the provided documents, which deserves focused attention.

I'll open a follow-on issue/PR to track adding:

  • Tests for fully grounded responses (should pass)
  • Tests for ungrounded responses (should fail)
  • Tests for partially grounded responses
  • Tests for responses that don't need citations (I-don't-know, disclaimers, etc.)

These can be marked with `@pytest.mark.slow` since they'll require GPU inference and real backend validation.

)
return prompt

def _build_batch_support_prompt(
Contributor


This prompt should also include the documents, as in order for an LLM to decide if a citation supports a response span, it may need to reason about the citation in the context of the document in which the citation appears. For instance, consider the following example:

  • Document: "IBM ... Its headquarters are in Armonk, NY"
  • Response sentence: "IBM is headquartered in Armonk, NY"
  • Citation for response sentence: "Its headquarters are in Armonk, NY"

In this example, for an LLM to verify that the citation supports the response sentence, it has to be aware of the document where the citation appears, so that it can verify that the word "its" in the citation indeed refers to IBM.

Based on the above, the support prompt has to also include the documents.
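For illustration only, a batch prompt builder that prepends the documents could look like this; the function name and the prompt wording are hypothetical, not the PR's actual implementation:

```python
def build_batch_support_prompt(
    documents: list[str], pairs: list[tuple[str, str]]
) -> str:
    """Build a batch support prompt that includes the source documents."""
    lines = ["Documents:"]
    for i, doc in enumerate(documents):
        lines.append(f"[{i}] {doc}")
    lines.append("")
    lines.append("For each pair, answer YES if the citation supports the span:")
    for j, (span, citation) in enumerate(pairs, start=1):
        lines.append(f"{j}. Span: {span}")
        lines.append(f"   Citation: {citation}")
    return "\n".join(lines)
```

With the IBM example above, the model would then see document [0] alongside the span/citation pair and could resolve "Its" to IBM when judging support.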

Member Author


I'll fix it. Thanks!

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>
@akihikokuroda akihikokuroda requested a review from yannisk2 April 30, 2026 16:40

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

IVR Groundedness Validator Using Citations

5 participants