
refactor(tests): framework-agnostic test infrastructure#1790

Open
Pouyanpi wants to merge 9 commits into develop from
refactor/langchain-decouple/stack-8-test-infrastructure

Conversation

Collaborator

@Pouyanpi Pouyanpi commented Apr 14, 2026

Summary

  • Add FakeLLMModel in tests/utils.py as a framework-agnostic test double implementing LLMModel protocol (replaces LangChain FakeLLM in core tests)
  • Move approximately 20 LangChain-specific test files to tests/integrations/langchain/
  • Add conftest.py auto-skip for LangChain tests when langchain is not installed
  • Rewrite core test configs (with_custom_llm, with_custom_chat_model) to use LLMModel protocol, keep LangChain copies under tests/integrations/langchain/test_configs/
  • Fix FakeLLM import paths in runnable_rails integration tests
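For reviewers unfamiliar with the new test double, a minimal sketch of its shape, assuming names from this description (the real FakeLLMModel in tests/utils.py takes more configuration, and the real LLMResponse lives in the library):

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List, Optional

@dataclass
class LLMResponse:
    # Illustrative stand-in for the canonical response type named in this PR.
    content: str

class FakeLLMModel:
    """Framework-agnostic test double: returns queued responses in order."""

    def __init__(self, responses: List[str]):
        self.responses = list(responses)
        # Protocol surface the review guidance asks to verify.
        self.model_name = "fake-model"
        self.provider_name = "fake-provider"
        self.provider_url: Optional[str] = None

    async def generate_async(self, prompt: str) -> LLMResponse:
        # Pop the next canned response, like the LangChain FakeLLM it replaces.
        return LLMResponse(content=self.responses.pop(0))

    async def stream_async(self, prompt: str) -> AsyncIterator[LLMResponse]:
        # Stream the next canned response as word-sized chunks.
        for chunk in self.responses.pop(0).split():
            yield LLMResponse(content=chunk)

fake = FakeLLMModel(responses=["Hello there!"])
print(asyncio.run(fake.generate_async("Hi")).content)  # Hello there!
```

Because the fake depends only on plain Python types, core tests written against it need no LangChain import at all.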

Why the large diffs

Tests that used LangChain types directly are split into two versions:

  • Core (tests/test_*.py): rewritten to use FakeLLMModel and canonical
    types (LLMResponse, ToolCall). No LangChain dependency.
  • Integration (tests/integrations/langchain/test_*.py): copy of the
    original LangChain version, preserved for LangChain-specific coverage.

This means some files show as full rewrites (core) and their counterparts show as new files (integration copy). The test logic is the same, only the types and test doubles changed.

Review guidance

No new tests were added in this PR. The work is purely structural: moving, copying, and rewriting existing tests to use framework-agnostic types. The test logic and assertions are unchanged.

When reviewing:

  • New files under tests/integrations/langchain/: these are copies of
    existing tests, preserved as-is. A quick scan to confirm they match the
    originals is sufficient.
  • Rewritten core files (tests/test_*.py): focus on the type changes
    (AIMessage -> LLMResponse, FakeLLM -> FakeLLMModel, LangChain
    tool call dicts -> ToolCall/ToolCallFunction). The test structure
    and assertions should be equivalent.
  • tests/utils.py: the FakeLLMModel implementation is the key new
    code. Verify it correctly implements the LLMModel protocol (
    generate_async, stream_async, model_name, provider_name,
    provider_url).

Test plan

  • poetry run pytest tests/ -x --ignore=tests/integrations/langchain (core tests pass without LangChain)
  • poetry run pytest tests/integrations/langchain/ -x (LangChain tests pass)
  • poetry run pytest tests/ -x (full suite)

Summary by CodeRabbit

  • Tests
    • Added automatic skipping of integration tests when optional dependencies are unavailable.
    • Expanded test coverage for tool calling, reasoning trace extraction, and output rail validation in LangChain scenarios.
    • Refactored test infrastructure for improved maintainability and consistency.

@Pouyanpi Pouyanpi modified the milestones: v0.23.0, v0.22.0 Apr 14, 2026
@Pouyanpi Pouyanpi self-assigned this Apr 14, 2026

greptile-apps bot commented Apr 14, 2026

Greptile Summary

This PR decouples the test infrastructure from LangChain by introducing FakeLLMModel — a framework-agnostic test double implementing the LLMModel protocol — and reorganising ~20 LangChain-specific test files under tests/integrations/langchain/. Core tests are rewritten to use LLMResponse/ToolCall protocol types; LangChain copies are preserved as integration tests with their original LangChain types intact.

Confidence Score: 5/5

Safe to merge; only one P2 style concern remains (unused streaming flag on FakeLLMModel).

All prior P0/P1 concerns from previous threads have been addressed (copy-mutation fixed with copy.copy, collect_ignore_glob extended to cover both flat and nested patterns). The sole remaining finding is a no-op streaming attribute — dead code that does not affect correctness or test coverage.

tests/utils.py — streaming stored but never consulted in generate_async/stream_async.

Important Files Changed

Filename Overview
tests/utils.py New FakeLLMModel cleanly implements the LLMModel protocol; streaming flag is stored but unused in both async methods — minor dead-code concern.
tests/conftest.py Adds collect_ignore_glob to skip LangChain tests when langchain_core is absent; prior-thread concerns were partially addressed by adding both *.py and **/*.py patterns.
tests/integrations/langchain/utils.py Relocated FakeLLM (LangChain) and get_bound_llm_magic_mock from tests/utils.py — now correctly isolated to the LangChain integration layer.
tests/test_llmrails.py All FakeLLM → FakeLLMModel swaps; assertion updated to assert llm_rails.llm is injected_llm, reflecting no-adapter wrapping for protocol-compliant models.
tests/test_output_rails_tool_calls.py Rewritten to use FakeLLMModel with LLMResponse/ToolCall protocol types instead of LangChain AIMessage; logic unchanged.
tests/integrations/langchain/runnable_rails/test_streaming.py New comprehensive streaming test suite for LangChain RunnableRails; correctly imports FakeLLM from the integration utils.py.

Sequence Diagram

sequenceDiagram
    participant CoreTest as Core Test (tests/test_*.py)
    participant IntegTest as Integration Test (tests/integrations/langchain/)
    participant FakeLLMModel as FakeLLMModel (tests/utils.py)
    participant FakeLLM as FakeLLM (tests/integrations/langchain/utils.py)
    participant LLMRails as LLMRails
    participant LLMModel as LLMModel Protocol

    CoreTest->>FakeLLMModel: FakeLLMModel(responses=[...])
    FakeLLMModel-->>CoreTest: instance (implements LLMModel)
    CoreTest->>LLMRails: LLMRails(config, llm=fake_llm)
    LLMRails->>LLMModel: generate_async(prompt) / stream_async(prompt)
    LLMModel-->>LLMRails: LLMResponse / AsyncIterator[LLMResponseChunk]
    LLMRails-->>CoreTest: result (no LangChain dependency)

    IntegTest->>FakeLLM: FakeLLM(responses=[...])
    FakeLLM-->>IntegTest: instance (LangChain BaseLLM)
    IntegTest->>LLMRails: RunnableRails(config, llm=fake_llm)
    LLMRails->>FakeLLM: ainvoke / _astream (LangChain API)
    FakeLLM-->>LLMRails: AIMessage / GenerationChunk
    LLMRails-->>IntegTest: result
Prompt To Fix All With AI
This is a comment left during a code review.
Path: tests/utils.py
Line: 45-61

Comment:
**`streaming` flag is stored but never consulted**

`FakeLLMModel.__init__` accepts and stores `self.streaming`, but neither `generate_async` nor `stream_async` reads it. In the original `FakeLLM`, `streaming=True` activated LangChain's `_astream` code path; in the new design the framework is responsible for choosing which method to call, so the flag has no effect. Any test that passes `streaming=True` expecting a behavioural difference (e.g., chunked output instead of a full string) will silently exercise the same path as `streaming=False`.

If the attribute is intentionally a no-op (because the framework decides independently), a doc-comment to that effect would prevent confusion for future contributors.

How can I resolve this? If you propose a fix, please make it concise.
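One concise resolution is the doc-comment option the finding itself suggests; a sketch (constructor shape assumed from the review, not the actual tests/utils.py):

```python
class FakeLLMModel:
    def __init__(self, responses, streaming: bool = False):
        self.responses = list(responses)
        # Intentional no-op: unlike LangChain's FakeLLM, where streaming=True
        # switched generation onto the _astream code path, here the framework
        # chooses generate_async vs stream_async directly. The attribute is
        # kept only for signature compatibility with existing tests and has
        # no behavioural effect.
        self.streaming = streaming

fake = FakeLLMModel(["hi"], streaming=True)
print(fake.streaming)  # True
```

The alternative resolution would be to delete the parameter entirely, at the cost of touching every call site that still passes it.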

Reviews (7): Last reviewed commit: "apply coderabbit review suggestion"

@Pouyanpi Pouyanpi force-pushed the refactor/langchain-decouple/stack-8-test-infrastructure branch 3 times, most recently from 9c17d45 to a77a7f6 Compare April 14, 2026 13:49

codecov bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@Pouyanpi Pouyanpi force-pushed the feat/langchain-decouple/stack-11-streaming-parity branch from a89e626 to fbf78bb Compare April 16, 2026 09:00
Base automatically changed from feat/langchain-decouple/stack-11-streaming-parity to develop April 16, 2026 09:09
…o-skip

Add FakeLLMModel implementing LLMModel protocol to tests/utils.py.
TestChat uses FakeLLMModel instead of LangChain FakeLLM. Add pytest
conftest to auto-skip tests in tests/integrations/langchain/ when
LangChain is not installed.
…langchain/

Pure file moves with git mv, no content changes. GitHub should
detect these as renames.
…ixes

- Create tests/integrations/langchain/utils.py with LangChain FakeLLM
- Add core-friendly rewrites using FakeLLMModel for tool call,
  output rail, reasoning, and passthrough tests
- Migrate FakeLLM→FakeLLMModel in all core test files
- Remove LangChainLLMAdapter wrapping from test_llama_guard,
  test_patronus_lynx
- Update integration test imports to use FakeLLM from new location
- Fix Windows timer resolution in test_logging
@Pouyanpi Pouyanpi force-pushed the refactor/langchain-decouple/stack-8-test-infrastructure branch from 3739452 to e19d879 Compare April 16, 2026 09:13

coderabbitai bot commented Apr 16, 2026

📝 Walkthrough

This PR refactors test infrastructure to decouple from LangChain-specific test doubles by introducing a framework-agnostic FakeLLMModel. It adds pytest hooks to conditionally skip LangChain integration tests when the dependency is unavailable, creates a dedicated LangChain integration test utilities module, and introduces comprehensive new integration tests covering tool calling, reasoning trace extraction, and output rail validation.

Changes

Cohort / File(s) Summary
Test Configuration & Pytest Hooks
tests/conftest.py
Added pytest_collection_modifyitems hook to conditionally skip LangChain integration tests when langchain_core is not installed.
LangChain Integration Test Package Structure
tests/integrations/langchain/**/__init__.py, tests/integrations/langchain/utils.py
Created directory structure with four new __init__.py files and a new utils.py module exporting FakeLLM (LangChain-specific) and get_bound_llm_magic_mock() for integration tests.
LangChain Runnable Rails Integration Tests
tests/integrations/langchain/runnable_rails/__init__.py, tests/integrations/langchain/runnable_rails/test_*.py
Added package structure and updated nine runnable rails test files to import FakeLLM from tests.integrations.langchain.utils instead of tests.utils.
New LangChain Integration Tests
tests/integrations/langchain/test_output_rails_tool_calls.py, test_reasoning_trace_extraction.py, test_tool_calling_passthrough_only.py, test_tool_calls_event_extraction.py, test_tool_output_rails.py
Added five comprehensive integration test modules (1,563 lines total) covering LangChain tool-call handling, reasoning trace extraction and storage, tool-call passthrough gating, event-driven tool call extraction, and tool output rail validation with system actions and complex scenarios.
Core Test Utilities Refactoring
tests/utils.py
Removed LangChain-specific FakeLLM class and added framework-agnostic FakeLLMModel class supporting both string and strongly-typed LLMResponse backends; updated TestChat.__init__ to accept optional llm parameter; removed get_bound_llm_magic_mock() helper.
Existing Tests Migration to FakeLLMModel
tests/test_content_safety_*.py, test_context_updates.py, test_execute_action.py, test_guardrails_ai_e2e_v1.py, test_integration_cache.py, test_jailbreak_cache.py, test_llama_guard.py, test_llmrails.py, test_output_rails_tool_calls.py, test_patronus_lynx.py, test_prompt_generation.py, test_reasoning_trace_extraction.py, test_system_message_conversion.py, test_tool_calling_passthrough_only.py, test_tool_calls_event_extraction.py, test_tool_output_rails.py, tests/tracing/spans/test_span_v2_integration.py
Updated 31 test files to use FakeLLMModel instead of FakeLLM, removing LangChainLLMAdapter wrapping where applicable; adjusted assertions and tool-call representations to match new framework-agnostic API (dict-based results, typed LLMResponse/ToolCall objects); updated one test assertion from BaseLLM type to LLMModel type.

Sequence Diagram(s)

No sequence diagrams generated. While the changes introduce new integration tests with multiple components, the scope is primarily test infrastructure refactoring (import path changes, test double replacement) and new test modules that implement testing logic rather than new production features with significant control flow interactions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The PR involves substantial cross-cutting changes: refactoring the core test infrastructure with two competing LLM test doubles, updating 31+ test files with systematic (but mixed-pattern) import and instantiation changes, and introducing 5 new integration test modules with moderately complex test logic. While many individual changes are repetitive imports/instantiations (lower effort), the heterogeneity of change patterns, the number of affected files, and the need to verify test isolation and correctness across the new integration test suite warrant careful review.

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 29.26% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Results For Major Changes ⚠️ Warning PR lacks documented test results, coverage metrics, and CI/CD verification for major test infrastructure refactoring affecting 35+ files. Include test execution summary with pass/fail counts, coverage changes, and link to CI/CD pipeline results confirming all tests pass.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main change: refactoring test infrastructure to be framework-agnostic, which is the primary objective of moving core tests away from LangChain and introducing FakeLLMModel.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (2)
tests/test_content_safety_integration.py (1)

353-353: Use the public action-param registration API instead of mutating runtime internals.

Assigning into runtime.registered_action_params makes this test depend on internal storage details. Registering the fake through register_action_param keeps the setup aligned with the supported API and is less brittle.

♻️ Suggested cleanup
-        chat.app.runtime.registered_action_params["llms"] = {"content_safety_reasoning": content_safety_llm}
+        chat.app.register_action_param(
+            "llms",
+            {"content_safety_reasoning": content_safety_llm},
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_content_safety_integration.py` at line 353, The test directly
mutates internal storage via chat.app.runtime.registered_action_params["llms"] =
{"content_safety_reasoning": content_safety_llm}; instead use the public API by
calling chat.app.runtime.register_action_param (or the app-level helper if
available) to register the fake LLM param under the "llms" namespace with key
"content_safety_reasoning" and value content_safety_llm so the test relies on
the supported registration mechanism rather than internal dict structure.
tests/test_llmrails.py (1)

769-775: Assert the full LLMModel contract here.

This regression test now checks only generate_async, so a partial fake could still pass while missing other members the core pipeline relies on. Since this file is validating the framework-agnostic migration, make the assertion match the protocol more directly.

♻️ Suggested assertion tightening
     from nemoguardrails.types import LLMModel

     @action(name="test_llm_action")
     async def test_llm_action(llm: LLMModel):
         assert llm is not None
-        assert hasattr(llm, "generate_async")
+        assert isinstance(llm, LLMModel)
         return "llm_action_success"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_llmrails.py` around lines 769 - 775, The test currently only
checks for generate_async on the LLMModel and can be bypassed by a partial fake;
update test_llm_action to assert the full LLMModel contract by iterating over
the members declared on the LLMModel type and asserting the provided llm has
each member (e.g. use LLMModel.__annotations__ or dir(LLMModel) to enumerate
expected attribute names) and for those that should be callables assert
callable(getattr(llm, name)); keep references to LLMModel and test_llm_action so
the check explicitly validates the protocol surface rather than a single method.
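If the protocol is (or can be made) runtime-checkable, the whole member surface collapses into one isinstance check; a sketch with an illustrative protocol, since the real nemoguardrails.types.LLMModel definition is not shown here:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class LLMModel(Protocol):
    # Illustrative members, mirroring those listed in the review guidance.
    model_name: str
    provider_name: str

    async def generate_async(self, prompt: str): ...
    async def stream_async(self, prompt: str): ...

class FullFake:
    model_name = "fake"
    provider_name = "fake"
    async def generate_async(self, prompt): ...
    async def stream_async(self, prompt): ...

class PartialFake:
    async def generate_async(self, prompt): ...

# isinstance on a runtime-checkable Protocol checks member presence only
# (not signatures), which is exactly the "partial fake" gap described above.
print(isinstance(FullFake(), LLMModel))     # True
print(isinstance(PartialFake(), LLMModel))  # False
```

Note that runtime-checkable protocols verify that members exist, not that their signatures match, so this is a stronger check than hasattr on one method but still not a full contract test.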
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/conftest.py`:
- Around line 38-50: pytest_collection_modifyitems runs too late to prevent
top-level imports from failing; add a collection-time gate (e.g., implement
pytest_ignore_collect or collect_ignore_glob) that checks for langchain
availability and skips matching modules before they are imported. Specifically,
implement a pytest_ignore_collect(path) function that tries to import
langchain_core (or uses pytest.importorskip) and returns True to ignore
collection for any path whose parts include "integrations" and "langchain" or
the helper config files under tests/test_configs/with_custom_llm,
tests/test_configs/with_custom_chat_model, and
tests/test_configs/with_custom_llm_prompt_action_v2_x when langchain_core is
missing; keep pytest_collection_modifyitems only for post-collection markers if
desired.
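A minimal sketch of the collection-time gate described above (directory layout and the langchain_core probe are taken from this PR's description; exact patterns may differ):

```python
# tests/conftest.py (sketch): drop LangChain test modules at collection time,
# before pytest imports them and fails on a missing langchain_core.
import importlib.util

HAS_LANGCHAIN = importlib.util.find_spec("langchain_core") is not None

if not HAS_LANGCHAIN:
    # Both patterns are needed: a single `*` does not cross "/" in glob
    # patterns, so nested runnable_rails/ modules require `**/*.py` too.
    collect_ignore_glob = [
        "integrations/langchain/*.py",
        "integrations/langchain/**/*.py",
    ]
```

Module-level collect_ignore_glob runs during collection, which is why it avoids the too-late import failures that pytest_collection_modifyitems cannot prevent.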

In `@tests/integrations/langchain/test_reasoning_trace_extraction.py`:
- Around line 289-307: The test exposes that llm_call currently returns raw
"<think>...</think>" markup when additional_kwargs["reasoning_content"] is
present; change llm_call (used with LangChainLLMAdapter and AIMessage) so it
still prefers storing additional_kwargs["reasoning_content"] into
reasoning_trace_var but always strips any <think>...</think> segments from the
returned AIMessage.content before returning it; locate the logic in llm_call
that determines stored_trace vs output content and ensure stored_trace selection
(from additional_kwargs or embedded tags) is kept, then apply a sanitize step to
the message content (removing <think>...</think> blocks) so user-visible content
never contains raw think tags.
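The sanitize step this prompt describes amounts to a small regex helper; a sketch (the real llm_call logic in nemoguardrails.actions.llm.utils is not reproduced here):

```python
import re

# DOTALL so reasoning that spans multiple lines is still removed.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think_tags(content: str) -> str:
    """Remove <think>...</think> blocks from user-visible output."""
    return _THINK_RE.sub("", content)

mixed = "<think>This should be ignored</think>Response"
print(strip_think_tags(mixed))  # Response
```

Applying such a helper to the message content after the trace-precedence decision keeps the stored reasoning trace intact while preventing the tag leakage the test currently locks in.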

In `@tests/integrations/langchain/test_tool_calls_event_extraction.py`:
- Around line 119-138: The test currently only asserts call_count >= 1 so
multiple reads would still pass; modify the assertion to enforce the single-pop
contract by checking call_count == 1 for the mocked
get_and_clear_tool_calls_contextvar (mock_get_and_clear), keeping the rest of
the expectations (result from TestChat.generate_async and result["tool_calls"]
assertions) unchanged so the test verifies exactly one read/clear occurred.
- Around line 167-189: The test mutates the shared ContextVar tool_calls_var and
never restores it; change test_tool_rails_cannot_clear_context_variable to
capture the token returned by tool_calls_var.set(...) and ensure
tool_calls_var.reset(token) is called in a finally block (or teardown) so the
ContextVar is restored even on assertion failures; reference the tool_calls_var
and validate_tool_parameters symbols when making this change.
- Around line 27-49: The helper functions validate_tool_parameters and
self_check_tool_calls currently only handle a nested function.arguments schema
and miss LangChain's top-level tool-call shape (with top-level "name" and
"args"), so dangerous strings never get inspected; update both functions to
normalize the incoming call shape first by accepting either call.get("function",
{}).get("arguments", {}) or call.get("args") (and likewise accept
call.get("function", {}).get("name") or call.get("name")), then iterate
parameter values from that normalized args mapping when checking dangerous
patterns in validate_tool_parameters, and update self_check_tool_calls to
consider a call valid if it has either a "function" dict with "id"/"function" or
the top-level "name" and "args" fields (and still require an "id" if expected),
ensuring both schemas are supported before performing the existing validations.

In `@tests/integrations/langchain/test_tool_output_rails.py`:
- Around line 27-49: The validators assume an OpenAI nested format but tests use
LangChain-style tool_calls; update validate_tool_parameters and
self_check_tool_calls to accept both formats by normalizing tool_calls entries:
for each call in validate_tool_parameters extract parameters from either
call.get("function", {}).get("arguments", {}) or call.get("args", {}) (falling
back appropriately) and continue checking strings against dangerous_patterns,
and in self_check_tool_calls return True when each call is a dict that contains
either the OpenAI keys ("function" and "id") or the LangChain keys ("name",
"args", and "id"); reference the existing functions validate_tool_parameters,
self_check_tool_calls, the dangerous_patterns list, and the tool_calls variable
when making the changes.
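Both schema findings above reduce to one normalization helper; a sketch (shapes taken from the review's descriptions of the OpenAI-nested and LangChain top-level formats):

```python
def normalize_tool_call(call: dict) -> tuple:
    """Return (name, args) for either tool-call schema.

    OpenAI-nested:       {"id": ..., "function": {"name": ..., "arguments": {...}}}
    LangChain top-level: {"id": ..., "name": ..., "args": {...}}
    """
    func = call.get("function", {})
    name = func.get("name") or call.get("name")
    args = func.get("arguments") or call.get("args") or {}
    return name, args

openai_style = {"id": "1", "function": {"name": "run", "arguments": {"cmd": "rm -rf /"}}}
langchain_style = {"id": "2", "name": "run", "args": {"cmd": "rm -rf /"}}

for call in (openai_style, langchain_style):
    name, args = normalize_tool_call(call)
    # The dangerous-pattern scan now inspects both schemas.
    assert any("rm -" in v for v in args.values() if isinstance(v, str))
```

With the shapes normalized up front, validate_tool_parameters and self_check_tool_calls can share one code path instead of silently skipping LangChain-style calls.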

In `@tests/integrations/langchain/utils.py`:
- Around line 104-118: get_bound_llm_magic_mock is incorrectly wrapping dict
responses with MagicMock(**...), which makes dicts behave like objects; change
both places handling the dict branch so they return a real dict instead (e.g.,
use dict(ainvoke_return_value) or ainvoke_return_value directly) for
bound_llm_mock.ainvoke.return_value and for mock_llm.ainvoke = AsyncMock(...),
while keeping the existing AsyncMock/AIMessage branch behavior; update
references to bound_llm_mock, mock_llm, and ainvoke_return_value accordingly.

In `@tests/test_tool_calls_event_extraction.py`:
- Around line 321-335: The test sets provider_metadata on the fake LLM response
but never asserts it, so add an assertion to verify metadata propagation: after
calling LLMRails.generate_async (using FakeLLMModel, LLMResponse, LLMRails)
assert that the returned result includes the provider metadata (e.g.,
result["tool_calls"][0] contains the same provider_metadata dict or a public
metadata field) to ensure metadata isn't dropped; update or add an assertion
comparing the expected {"model": "test-model", "usage": {"tokens": 50}} to the
value returned by result so the test actually verifies metadata propagation.

In `@tests/utils.py`:
- Around line 158-172: The constructor logic in TestChat silently sets self.llm
= None when both llm and llm_completions are missing, causing tests to hit the
real LLM; change the else branch to either raise a clear error or instantiate a
default stubbed FakeLLMModel instead. Specifically, in the TestChat
initialization where llm and llm_completions are checked, replace "else:
self.llm = None" with one of: (a) raise ValueError("No llm or llm_completions
provided to TestChat; tests must supply a fake"), or (b) self.llm =
FakeLLMModel(responses=[], streaming=False, exception=None, token_usage=None,
should_return_token_usage=False) so tests fail-fast or use an empty fake
respectively; refer to the TestChat constructor and the FakeLLMModel symbol when
applying the change.
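Option (a) above, failing fast, might look like this; TestChat and FakeLLMModel are simplified stand-ins for the real classes in tests/utils.py:

```python
class FakeLLMModel:
    # Simplified stand-in; the real fake takes more configuration.
    def __init__(self, responses):
        self.responses = list(responses)

class TestChat:
    def __init__(self, config=None, llm=None, llm_completions=None):
        if llm is not None:
            self.llm = llm
        elif llm_completions is not None:
            self.llm = FakeLLMModel(responses=llm_completions)
        else:
            # Fail fast instead of silently falling through to a real LLM.
            raise ValueError(
                "No llm or llm_completions provided to TestChat; "
                "tests must supply a fake"
            )

chat = TestChat(llm_completions=["Hello!"])
print(chat.llm.responses)  # ['Hello!']
```

Raising here turns an expensive, flaky real-LLM call into an immediate, self-explanatory test failure.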

---

Nitpick comments:
In `@tests/test_content_safety_integration.py`:
- Line 353: The test directly mutates internal storage via
chat.app.runtime.registered_action_params["llms"] = {"content_safety_reasoning":
content_safety_llm}; instead use the public API by calling
chat.app.runtime.register_action_param (or the app-level helper if available) to
register the fake LLM param under the "llms" namespace with key
"content_safety_reasoning" and value content_safety_llm so the test relies on
the supported registration mechanism rather than internal dict structure.

In `@tests/test_llmrails.py`:
- Around line 769-775: The test currently only checks for generate_async on the
LLMModel and can be bypassed by a partial fake; update test_llm_action to assert
the full LLMModel contract by iterating over the members declared on the
LLMModel type and asserting the provided llm has each member (e.g. use
LLMModel.__annotations__ or dir(LLMModel) to enumerate expected attribute names)
and for those that should be callables assert callable(getattr(llm, name)); keep
references to LLMModel and test_llm_action so the check explicitly validates the
protocol surface rather than a single method.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d2e27196-c546-46d7-8416-3b256fa8d9af

📥 Commits

Reviewing files that changed from the base of the PR and between f2d4492 and e19d879.

📒 Files selected for processing (60)
  • tests/conftest.py
  • tests/integrations/langchain/llm/__init__.py
  • tests/integrations/langchain/llm/models/__init__.py
  • tests/integrations/langchain/llm/models/test_langchain_init_scenarios.py
  • tests/integrations/langchain/llm/models/test_langchain_initialization_methods.py
  • tests/integrations/langchain/llm/models/test_langchain_initializer.py
  • tests/integrations/langchain/llm/models/test_langchain_special_cases.py
  • tests/integrations/langchain/llm/providers/__init__.py
  • tests/integrations/langchain/llm/providers/test_deprecated_providers.py
  • tests/integrations/langchain/llm/providers/test_providers.py
  • tests/integrations/langchain/llm/providers/test_trtllm_provider.py
  • tests/integrations/langchain/llm/test_langchain_integration.py
  • tests/integrations/langchain/llm/test_version_compatibility.py
  • tests/integrations/langchain/runnable_rails/__init__.py
  • tests/integrations/langchain/runnable_rails/test_basic_operations.py
  • tests/integrations/langchain/runnable_rails/test_batch_as_completed.py
  • tests/integrations/langchain/runnable_rails/test_batching.py
  • tests/integrations/langchain/runnable_rails/test_composition.py
  • tests/integrations/langchain/runnable_rails/test_format_output.py
  • tests/integrations/langchain/runnable_rails/test_history.py
  • tests/integrations/langchain/runnable_rails/test_message_utils.py
  • tests/integrations/langchain/runnable_rails/test_metadata.py
  • tests/integrations/langchain/runnable_rails/test_piping.py
  • tests/integrations/langchain/runnable_rails/test_runnable_rails.py
  • tests/integrations/langchain/runnable_rails/test_streaming.py
  • tests/integrations/langchain/runnable_rails/test_tool_calling.py
  • tests/integrations/langchain/runnable_rails/test_transform_input.py
  • tests/integrations/langchain/runnable_rails/test_types.py
  • tests/integrations/langchain/test_actions_llm_utils.py
  • tests/integrations/langchain/test_langchain_llm_adapter.py
  • tests/integrations/langchain/test_output_rails_tool_calls.py
  • tests/integrations/langchain/test_reasoning_trace_extraction.py
  • tests/integrations/langchain/test_server_streaming.py
  • tests/integrations/langchain/test_streaming.py
  • tests/integrations/langchain/test_tool_calling_passthrough_only.py
  • tests/integrations/langchain/test_tool_calling_utils.py
  • tests/integrations/langchain/test_tool_calls_event_extraction.py
  • tests/integrations/langchain/test_tool_output_rails.py
  • tests/integrations/langchain/utils.py
  • tests/test_content_safety_actions.py
  • tests/test_content_safety_cache.py
  • tests/test_content_safety_integration.py
  • tests/test_context_updates.py
  • tests/test_execute_action.py
  • tests/test_guardrails_ai_e2e_v1.py
  • tests/test_integration_cache.py
  • tests/test_jailbreak_cache.py
  • tests/test_llama_guard.py
  • tests/test_llmrails.py
  • tests/test_output_rails_tool_calls.py
  • tests/test_patronus_lynx.py
  • tests/test_prompt_generation.py
  • tests/test_reasoning_trace_extraction.py
  • tests/test_system_message_conversion.py
  • tests/test_tool_calling_passthrough_only.py
  • tests/test_tool_calls_event_extraction.py
  • tests/test_tool_output_rails.py
  • tests/test_topic_safety_cache.py
  • tests/tracing/spans/test_span_v2_integration.py
  • tests/utils.py

Comment on lines +289 to +307

    async def test_llm_call_prefers_additional_kwargs_over_think_tags(self):
        reasoning_from_kwargs = "This should be used"
        reasoning_from_tags = "This should be ignored"

        mock_llm = AsyncMock()
        mock_response = AIMessage(
            content=f"<think>{reasoning_from_tags}</think>Response",
            additional_kwargs={"reasoning_content": reasoning_from_kwargs},
        )
        mock_llm.ainvoke = AsyncMock(return_value=mock_response)

        from nemoguardrails.actions.llm.utils import llm_call

        reasoning_trace_var.set(None)
        result = await llm_call(LangChainLLMAdapter(mock_llm), "Query")

        assert result.content == f"<think>{reasoning_from_tags}</think>Response"
        stored_trace = reasoning_trace_var.get()
        assert stored_trace == reasoning_from_kwargs

⚠️ Potential issue | 🟠 Major

Don't lock in <think> tags as user-visible output.

This assertion makes the mixed case preserve raw <think>...</think> content whenever additional_kwargs["reasoning_content"] is present. That hardens the exact leakage path the neighboring tests sanitize. The precedence rule should only decide which trace gets stored; the rendered content should still be stripped.

Possible fix
-        assert result.content == f"<think>{reasoning_from_tags}</think>Response"
+        assert result.content == "Response"
+        assert "<think>" not in result.content
         stored_trace = reasoning_trace_var.get()
         assert stored_trace == reasoning_from_kwargs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/langchain/test_reasoning_trace_extraction.py` around lines
289 - 307, The test exposes that llm_call currently returns raw
"<think>...</think>" markup when additional_kwargs["reasoning_content"] is
present; change llm_call (used with LangChainLLMAdapter and AIMessage) so it
still prefers storing additional_kwargs["reasoning_content"] into
reasoning_trace_var but always strips any <think>...</think> segments from the
returned AIMessage.content before returning it; locate the logic in llm_call
that determines stored_trace vs output content and ensure stored_trace selection
(from additional_kwargs or embedded tags) is kept, then apply a sanitize step to
the message content (removing <think>...</think> blocks) so user-visible content
never contains raw think tags.
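The sanitize step the prompt describes can be sketched as a small standalone helper. This is illustrative only: the name strip_think_tags is hypothetical, and the real logic lives inside llm_call in nemoguardrails.actions.llm.utils.

```python
import re

# Matches <think>...</think> blocks, including multi-line reasoning content.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think_tags(content: str) -> str:
    """Strip reasoning markup from user-visible content.

    The reasoning trace itself should be captured separately (e.g. into
    reasoning_trace_var) before this runs; this only cleans the output.
    """
    return _THINK_RE.sub("", content).strip()
```

With this in place, the stored trace can still come from additional_kwargs while the returned content never carries raw tags.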

Comment on lines +27 to +49
@action(is_system_action=True)
async def validate_tool_parameters(tool_calls, context=None, **kwargs):
    """Test implementation of tool parameter validation."""
    tool_calls = tool_calls or (context.get("tool_calls", []) if context else [])

    dangerous_patterns = ["eval", "exec", "system", "../", "rm -", "DROP", "DELETE"]

    for tool_call in tool_calls:
        func = tool_call.get("function", {})
        args = func.get("arguments", {})
        for param_value in args.values():
            if isinstance(param_value, str):
                if any(pattern.lower() in param_value.lower() for pattern in dangerous_patterns):
                    return False
    return True


@action(is_system_action=True)
async def self_check_tool_calls(tool_calls, context=None, **kwargs):
    """Test implementation of tool call validation."""
    tool_calls = tool_calls or (context.get("tool_calls", []) if context else [])

    return all(isinstance(call, dict) and "function" in call and "id" in call for call in tool_calls)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

In langchain-core 0.2.14, what is the schema of each item in AIMessage.tool_calls? Does it use top-level "name"/"args"/"id"/"type", or a nested "function": {"name","arguments"} object?

💡 Result:

In langchain-core 0.2.14, each item in AIMessage.tool_calls is a ToolCall (TypedDict) with top-level keys: "name" (str), "args" (dict[str, Any]), "id" (str | None), and optionally "type" (Literal["tool_call"]). It does not use a nested "function": {"name", "arguments"} object. This schema was introduced in the PR adding tool calls to AIMessage (merged April 2024), and issues from late 2024 (e.g., #27938 with langchain-core==0.2.33) confirm LangChain's ToolCall uses top-level "name"/"args"/"id", distinguishing it from OpenAI's nested format. Source code examples across versions consistently show this structure.


🏁 Script executed:

# First, let's find the test file and read it
find . -name "test_tool_calls_event_extraction.py" -type f

Repository: NVIDIA-NeMo/Guardrails



🏁 Script executed:

# Let's read the test file that's referenced in the review
wc -l ./tests/integrations/langchain/test_tool_calls_event_extraction.py



🏁 Script executed:

# Read the helper functions at lines 27-49
sed -n '27,49p' ./tests/integrations/langchain/test_tool_calls_event_extraction.py



🏁 Script executed:

# Check the test at line 54
sed -n '50,80p' ./tests/integrations/langchain/test_tool_calls_event_extraction.py



🏁 Script executed:

# Check the test at line 196
sed -n '190,220p' ./tests/integrations/langchain/test_tool_calls_event_extraction.py



🏁 Script executed:

# Check the test at line 384
sed -n '378,410p' ./tests/integrations/langchain/test_tool_calls_event_extraction.py



Fix helper functions to handle LangChain's top-level tool-call schema.

The helpers at lines 35–36 and 49 read a nested function.arguments structure, but the tests at lines 54, 196, and 384 use LangChain's actual schema with top-level name/args fields. Since the helpers look for the wrong schema, the dangerous argument strings are never inspected—these blocking-path tests don't actually exercise the security checks they claim to validate.

The fix requires normalizing both schemas:

Proposed fix
 @action(is_system_action=True)
 async def validate_tool_parameters(tool_calls, context=None, **kwargs):
     """Test implementation of tool parameter validation."""
     tool_calls = tool_calls or (context.get("tool_calls", []) if context else [])
 
     dangerous_patterns = ["eval", "exec", "system", "../", "rm -", "DROP", "DELETE"]
 
     for tool_call in tool_calls:
-        func = tool_call.get("function", {})
-        args = func.get("arguments", {})
+        if "function" in tool_call:
+            args = tool_call.get("function", {}).get("arguments", {}) or {}
+        else:
+            args = tool_call.get("args", {}) or {}
         for param_value in args.values():
             if isinstance(param_value, str):
                 if any(pattern.lower() in param_value.lower() for pattern in dangerous_patterns):
                     return False
     return True
@@
 @action(is_system_action=True)
 async def self_check_tool_calls(tool_calls, context=None, **kwargs):
     """Test implementation of tool call validation."""
     tool_calls = tool_calls or (context.get("tool_calls", []) if context else [])
 
-    return all(isinstance(call, dict) and "function" in call and "id" in call for call in tool_calls)
+    return all(
+        isinstance(call, dict)
+        and "id" in call
+        and ("function" in call or ("name" in call and "args" in call))
+        for call in tool_calls
+    )
📝 Committable suggestion


Suggested change
@action(is_system_action=True)
async def validate_tool_parameters(tool_calls, context=None, **kwargs):
    """Test implementation of tool parameter validation."""
    tool_calls = tool_calls or (context.get("tool_calls", []) if context else [])

    dangerous_patterns = ["eval", "exec", "system", "../", "rm -", "DROP", "DELETE"]

    for tool_call in tool_calls:
        if "function" in tool_call:
            args = tool_call.get("function", {}).get("arguments", {}) or {}
        else:
            args = tool_call.get("args", {}) or {}
        for param_value in args.values():
            if isinstance(param_value, str):
                if any(pattern.lower() in param_value.lower() for pattern in dangerous_patterns):
                    return False
    return True


@action(is_system_action=True)
async def self_check_tool_calls(tool_calls, context=None, **kwargs):
    """Test implementation of tool call validation."""
    tool_calls = tool_calls or (context.get("tool_calls", []) if context else [])

    return all(
        isinstance(call, dict)
        and "id" in call
        and ("function" in call or ("name" in call and "args" in call))
        for call in tool_calls
    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/langchain/test_tool_calls_event_extraction.py` around
lines 27 - 49, The helper functions validate_tool_parameters and
self_check_tool_calls currently only handle a nested function.arguments schema
and miss LangChain's top-level tool-call shape (with top-level "name" and
"args"), so dangerous strings never get inspected; update both functions to
normalize the incoming call shape first by accepting either call.get("function",
{}).get("arguments", {}) or call.get("args") (and likewise accept
call.get("function", {}).get("name") or call.get("name")), then iterate
parameter values from that normalized args mapping when checking dangerous
patterns in validate_tool_parameters, and update self_check_tool_calls to
consider a call valid if it has either a "function" dict with "id"/"function" or
the top-level "name" and "args" fields (and still require an "id" if expected),
ensuring both schemas are supported before performing the existing validations.
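The normalization the prompt asks for can be factored into two tiny helpers. A minimal sketch, assuming the two schemas described above; the names extract_tool_args and is_well_formed are illustrative, not from the PR:

```python
def extract_tool_args(tool_call: dict) -> dict:
    """Return the arguments mapping from either tool-call schema.

    OpenAI-style:    {"id": ..., "function": {"name": ..., "arguments": {...}}}
    LangChain-style: {"id": ..., "name": ..., "args": {...}}
    """
    if "function" in tool_call:
        return tool_call.get("function", {}).get("arguments", {}) or {}
    return tool_call.get("args", {}) or {}


def is_well_formed(tool_call: dict) -> bool:
    """Accept either schema, requiring an id in both cases."""
    return (
        isinstance(tool_call, dict)
        and "id" in tool_call
        and ("function" in tool_call or ("name" in tool_call and "args" in tool_call))
    )
```

Both validators can then share these helpers instead of hardcoding one schema each.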

Comment on lines +119 to +138
call_count = 0

def mock_get_and_clear():
    nonlocal call_count
    call_count += 1
    if call_count == 1:
        return test_tool_calls
    return None

with patch(
    "nemoguardrails.actions.llm.utils.get_and_clear_tool_calls_contextvar",
    side_effect=mock_get_and_clear,
):
    chat = TestChat(config, llm_completions=[""])

    result = await chat.app.generate_async(messages=[{"role": "user", "content": "Test"}])

assert call_count >= 1, "get_and_clear_tool_calls_contextvar should be called"
assert result["tool_calls"] is not None
assert result["tool_calls"][0]["name"] == "test_tool"

⚠️ Potential issue | 🟡 Minor

Assert the “once” contract explicitly.

Line 136 only checks that the hook was called at least once, so an implementation that reads and clears the context multiple times would still pass. If this test is meant to lock down single-pop behavior, it should assert call_count == 1.

Proposed fix
-        assert call_count >= 1, "get_and_clear_tool_calls_contextvar should be called"
+        assert call_count == 1, "get_and_clear_tool_calls_contextvar should only be called once"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/langchain/test_tool_calls_event_extraction.py` around
lines 119 - 138, The test currently only asserts call_count >= 1 so multiple
reads would still pass; modify the assertion to enforce the single-pop contract
by checking call_count == 1 for the mocked get_and_clear_tool_calls_contextvar
(mock_get_and_clear), keeping the rest of the expectations (result from
TestChat.generate_async and result["tool_calls"] assertions) unchanged so the
test verifies exactly one read/clear occurred.
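The single-pop contract the review wants pinned down amounts to this ContextVar pattern. A standalone sketch, not the actual nemoguardrails implementation:

```python
import contextvars

_tool_calls_var = contextvars.ContextVar("tool_calls", default=None)


def get_and_clear():
    """Read the stored tool calls exactly once; subsequent reads see None."""
    value = _tool_calls_var.get()
    _tool_calls_var.set(None)
    return value


_tool_calls_var.set([{"name": "test_tool"}])
first = get_and_clear()   # pops the stored value
second = get_and_clear()  # already cleared
```

An implementation that reads the variable twice would return the value only on the first call, which is exactly what asserting call_count == 1 locks down.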

Comment on lines +167 to +189
@pytest.mark.asyncio
async def test_tool_rails_cannot_clear_context_variable():
    from nemoguardrails.context import tool_calls_var

    test_tool_calls = [
        {
            "id": "call_blocked",
            "type": "function",
            "function": {
                "name": "blocked_tool",
                "arguments": {"param": "rm -rf /"},
            },
        }
    ]

    tool_calls_var.set(test_tool_calls)

    context = {"tool_calls": test_tool_calls}
    result = await validate_tool_parameters(test_tool_calls, context=context)

    assert result is False
    assert tool_calls_var.get() is not None, "Context variable should not be cleared by tool rails"
    assert tool_calls_var.get()[0]["function"]["name"] == "blocked_tool"

⚠️ Potential issue | 🟠 Major

Restore tool_calls_var after the test.

Line 182 mutates shared ContextVar state and never resets it. That can leak stale tool calls into later tests that read tool_calls_var, making failures order-dependent.

Proposed fix
-    tool_calls_var.set(test_tool_calls)
-
-    context = {"tool_calls": test_tool_calls}
-    result = await validate_tool_parameters(test_tool_calls, context=context)
-
-    assert result is False
-    assert tool_calls_var.get() is not None, "Context variable should not be cleared by tool rails"
-    assert tool_calls_var.get()[0]["function"]["name"] == "blocked_tool"
+    token = tool_calls_var.set(test_tool_calls)
+    try:
+        context = {"tool_calls": test_tool_calls}
+        result = await validate_tool_parameters(test_tool_calls, context=context)
+
+        assert result is False
+        assert tool_calls_var.get() is not None, "Context variable should not be cleared by tool rails"
+        assert tool_calls_var.get()[0]["function"]["name"] == "blocked_tool"
+    finally:
+        tool_calls_var.reset(token)
📝 Committable suggestion


Suggested change
@pytest.mark.asyncio
async def test_tool_rails_cannot_clear_context_variable():
    from nemoguardrails.context import tool_calls_var

    test_tool_calls = [
        {
            "id": "call_blocked",
            "type": "function",
            "function": {
                "name": "blocked_tool",
                "arguments": {"param": "rm -rf /"},
            },
        }
    ]

    token = tool_calls_var.set(test_tool_calls)
    try:
        context = {"tool_calls": test_tool_calls}
        result = await validate_tool_parameters(test_tool_calls, context=context)

        assert result is False
        assert tool_calls_var.get() is not None, "Context variable should not be cleared by tool rails"
        assert tool_calls_var.get()[0]["function"]["name"] == "blocked_tool"
    finally:
        tool_calls_var.reset(token)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/langchain/test_tool_calls_event_extraction.py` around
lines 167 - 189, The test mutates the shared ContextVar tool_calls_var and never
restores it; change test_tool_rails_cannot_clear_context_variable to capture the
token returned by tool_calls_var.set(...) and ensure tool_calls_var.reset(token)
is called in a finally block (or teardown) so the ContextVar is restored even on
assertion failures; reference the tool_calls_var and validate_tool_parameters
symbols when making this change.
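The token-based restore pattern from the proposed fix generalizes to any ContextVar. A self-contained sketch of the idiom:

```python
import contextvars

var = contextvars.ContextVar("demo", default=None)


def run_with_temporary_value(value):
    """Set the var for the duration of a block, restoring it even on failure."""
    token = var.set(value)
    try:
        return var.get()
    finally:
        var.reset(token)  # restores the previous value (here: the default)


seen_inside = run_with_temporary_value([{"id": "call_blocked"}])
seen_after = var.get()
```

Because the reset runs in finally, the variable is restored even when an assertion inside the block fails, which removes the order-dependence between tests.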

Comment on lines +27 to +49
@action(is_system_action=True)
async def validate_tool_parameters(tool_calls, context=None, **kwargs):
    """Test implementation of tool parameter validation."""
    tool_calls = tool_calls or (context.get("tool_calls", []) if context else [])

    dangerous_patterns = ["eval", "exec", "system", "../", "rm -", "DROP", "DELETE"]

    for tool_call in tool_calls:
        func = tool_call.get("function", {})
        args = func.get("arguments", {})
        for param_value in args.values():
            if isinstance(param_value, str):
                if any(pattern.lower() in param_value.lower() for pattern in dangerous_patterns):
                    return False
    return True


@action(is_system_action=True)
async def self_check_tool_calls(tool_calls, context=None, **kwargs):
    """Test implementation of tool call validation."""
    tool_calls = tool_calls or (context.get("tool_calls", []) if context else [])

    return all(isinstance(call, dict) and "function" in call and "id" in call for call in tool_calls)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

In langchain-core 0.2.14, what is the schema of each item in AIMessage.tool_calls? Does it use top-level "name"/"args"/"id"/"type", or a nested "function": {"name","arguments"} object?

💡 Result:

In langchain-core 0.2.14, each item in AIMessage.tool_calls is a ToolCall object (TypedDict) with top-level keys: name (str), args (dict[str, Any]), id (str | None), and optionally type (Literal["tool_call"]). It does not use a nested "function" object with "name" and "arguments".



🏁 Script executed:

fd "test_tool_output_rails.py" -t f



🏁 Script executed:

cat -n tests/integrations/langchain/test_tool_output_rails.py



Fix validators to handle LangChain-style tool_calls format.

The test fixtures at lines 56, 107, and 167 provide LangChain-style tool_calls with top-level "name", "args", "id", and "type" keys. However, validate_tool_parameters (line 35–36) and self_check_tool_calls (line 49) are hardcoded for OpenAI-style nested format with "function" and "arguments". This causes:

  • validate_tool_parameters to silently skip dangerous patterns (always returns True)
  • self_check_tool_calls to reject valid tool_calls (always returns False)

Tests currently assert the opposite of intended behavior.

Proposed fix
 @action(is_system_action=True)
 async def validate_tool_parameters(tool_calls, context=None, **kwargs):
     """Test implementation of tool parameter validation."""
     tool_calls = tool_calls or (context.get("tool_calls", []) if context else [])
 
     dangerous_patterns = ["eval", "exec", "system", "../", "rm -", "DROP", "DELETE"]
 
     for tool_call in tool_calls:
-        func = tool_call.get("function", {})
-        args = func.get("arguments", {})
+        if "function" in tool_call:
+            args = tool_call.get("function", {}).get("arguments", {}) or {}
+        else:
+            args = tool_call.get("args", {}) or {}
         for param_value in args.values():
             if isinstance(param_value, str):
                 if any(pattern.lower() in param_value.lower() for pattern in dangerous_patterns):
                     return False
     return True
@@
 @action(is_system_action=True)
 async def self_check_tool_calls(tool_calls, context=None, **kwargs):
     """Test implementation of tool call validation."""
     tool_calls = tool_calls or (context.get("tool_calls", []) if context else [])
 
-    return all(isinstance(call, dict) and "function" in call and "id" in call for call in tool_calls)
+    return all(
+        isinstance(call, dict)
+        and "id" in call
+        and ("function" in call or ("name" in call and "args" in call))
+        for call in tool_calls
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/langchain/test_tool_output_rails.py` around lines 27 - 49,
The validators assume an OpenAI nested format but tests use LangChain-style
tool_calls; update validate_tool_parameters and self_check_tool_calls to accept
both formats by normalizing tool_calls entries: for each call in
validate_tool_parameters extract parameters from either call.get("function",
{}).get("arguments", {}) or call.get("args", {}) (falling back appropriately)
and continue checking strings against dangerous_patterns, and in
self_check_tool_calls return True when each call is a dict that contains either
the OpenAI keys ("function" and "id") or the LangChain keys ("name", "args", and
"id"); reference the existing functions validate_tool_parameters,
self_check_tool_calls, the dangerous_patterns list, and the tool_calls variable
when making the changes.

Comment on lines +104 to +118
def get_bound_llm_magic_mock(ainvoke_return_value: Union[AIMessage, dict]) -> MagicMock:
    mock_llm = MagicMock()
    mock_llm.return_value = mock_llm

    bound_llm_mock = AsyncMock()
    if isinstance(ainvoke_return_value, dict):
        bound_llm_mock.ainvoke.return_value = MagicMock(**ainvoke_return_value)
    else:
        bound_llm_mock.ainvoke.return_value = ainvoke_return_value

    mock_llm.bind.return_value = bound_llm_mock
    if isinstance(ainvoke_return_value, dict):
        mock_llm.ainvoke = AsyncMock(return_value=MagicMock(**ainvoke_return_value))
    else:
        mock_llm.ainvoke = AsyncMock(return_value=ainvoke_return_value)

⚠️ Potential issue | 🟡 Minor

Return a real dict from the helper's dict path.

MagicMock(**ainvoke_return_value) turns dict keys into attributes, not mapping behavior. Any code path that does isinstance(res, dict), res["key"], or res.get(...) will see a different shape than production, so this helper can mask bugs in the exact branch it claims to support.

Possible fix
 def get_bound_llm_magic_mock(ainvoke_return_value: Union[AIMessage, dict]) -> MagicMock:
     mock_llm = MagicMock()
     mock_llm.return_value = mock_llm

     bound_llm_mock = AsyncMock()
     if isinstance(ainvoke_return_value, dict):
-        bound_llm_mock.ainvoke.return_value = MagicMock(**ainvoke_return_value)
+        bound_llm_mock.ainvoke.return_value = ainvoke_return_value
     else:
         bound_llm_mock.ainvoke.return_value = ainvoke_return_value

     mock_llm.bind.return_value = bound_llm_mock
     if isinstance(ainvoke_return_value, dict):
-        mock_llm.ainvoke = AsyncMock(return_value=MagicMock(**ainvoke_return_value))
+        mock_llm.ainvoke = AsyncMock(return_value=ainvoke_return_value)
     else:
         mock_llm.ainvoke = AsyncMock(return_value=ainvoke_return_value)
     return mock_llm
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/langchain/utils.py` around lines 104 - 118,
get_bound_llm_magic_mock is incorrectly wrapping dict responses with
MagicMock(**...), which makes dicts behave like objects; change both places
handling the dict branch so they return a real dict instead (e.g., use
dict(ainvoke_return_value) or ainvoke_return_value directly) for
bound_llm_mock.ainvoke.return_value and for mock_llm.ainvoke = AsyncMock(...),
while keeping the existing AsyncMock/AIMessage branch behavior; update
references to bound_llm_mock, mock_llm, and ainvoke_return_value accordingly.
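The shape difference is easy to demonstrate with the standard library alone: MagicMock(**d) exposes the keys as attributes, while production dict handling needs mapping behavior.

```python
from unittest.mock import MagicMock

payload = {"content": "hi", "tool_calls": []}

mocked = MagicMock(**payload)   # keys become attributes, not mapping entries
real = dict(payload)            # what the helper's dict path should return

attr_ok = mocked.content == "hi"       # attribute access works on the mock
is_mapping = isinstance(mocked, dict)  # but the mock is not a dict
real_ok = real["content"] == "hi" and isinstance(real, dict)
```

Any code under test that branches on isinstance(res, dict) or subscripts the result sees a different shape from the mock than from production data.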

Comment on lines +321 to +335
fake_llm = FakeLLMModel(
    llm_responses=[
        LLMResponse(
            content="Processing with metadata.",
            tool_calls=test_tool_calls,
            provider_metadata={"model": "test-model", "usage": {"tokens": 50}},
        )
    ]
)
rails = LLMRails(config, llm=fake_llm)
result = await rails.generate_async(messages=[{"role": "user", "content": "Process with metadata"}])

assert result.tool_calls is not None
assert result.tool_calls[0]["name"] == "preserve_test"
assert result.content == ""
assert hasattr(result, "response_metadata")
assert result["tool_calls"] is not None
assert result["tool_calls"][0]["function"]["name"] == "preserve_test"
assert result["content"] == ""

⚠️ Potential issue | 🟡 Minor

This test no longer verifies metadata propagation.

provider_metadata is part of the fixture here, but none of the assertions inspect it. A regression that drops metadata entirely would still pass, so the test name and the behavior it actually covers have drifted apart. Either assert the public metadata output in this test or remove/rename the metadata setup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_tool_calls_event_extraction.py` around lines 321 - 335, The test
sets provider_metadata on the fake LLM response but never asserts it, so add an
assertion to verify metadata propagation: after calling LLMRails.generate_async
(using FakeLLMModel, LLMResponse, LLMRails) assert that the returned result
includes the provider metadata (e.g., result["tool_calls"][0] contains the same
provider_metadata dict or a public metadata field) to ensure metadata isn't
dropped; update or add an assertion comparing the expected {"model":
"test-model", "usage": {"tokens": 50}} to the value returned by result so the
test actually verifies metadata propagation.

Comment thread tests/utils.py
Comment on lines +158 to +172
     if llm is not None:
         self.llm = llm
     elif llm_completions is not None:
         main_model = next((model for model in config.models if model.type == "main"), None)
         should_return_token_usage = bool(main_model and main_model.engine in _TEST_PROVIDERS_WITH_TOKEN_USAGE)

-        self.llm = FakeLLM(
+        self.llm = FakeLLMModel(
             responses=llm_completions,
             streaming=streaming,
             exception=llm_exception,
             token_usage=token_usage,
             should_return_token_usage=should_return_token_usage,
         )
         if llm_exception:
             self.llm.exception = llm_exception
     else:
         self.llm = None

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import ast
from pathlib import Path

for path in Path(".").rglob("*.py"):
    try:
        tree = ast.parse(path.read_text())
    except Exception:
        continue

    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and getattr(node.func, "id", None) == "TestChat":
            kw_names = {kw.arg for kw in node.keywords if kw.arg}
            if len(node.args) == 1 and "llm" not in kw_names and "llm_completions" not in kw_names:
                print(f"{path}:{node.lineno}: TestChat called without llm or llm_completions")
PY



Address TestChat silent fallback to None LLM.

The new else: self.llm = None at line 171 means instances created without llm or llm_completions silently bypass the test double entirely. This affects at least 11 existing test calls:

  • tests/test_ai_defense.py (lines 72, 642, 1431, 1465, 1502, 1540, 1584, 1625)
  • tests/test_jailbreak_heuristics.py (lines 52, 61)
  • tests/test_jailbreak_models.py (line 56)

Instead of fail-fast behavior, tests now delegate to real LLMRails without a fake, hiding unexpected model calls and changing failure modes from "missing stub" to config/provider-dependent errors. Either provide an empty FakeLLMModel as the default or raise an error when both parameters are missing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/utils.py` around lines 158 - 172, The constructor logic in TestChat
silently sets self.llm = None when both llm and llm_completions are missing,
causing tests to hit the real LLM; change the else branch to either raise a
clear error or instantiate a default stubbed FakeLLMModel instead. Specifically,
in the TestChat initialization where llm and llm_completions are checked,
replace "else: self.llm = None" with one of: (a) raise ValueError("No llm or
llm_completions provided to TestChat; tests must supply a fake"), or (b)
self.llm = FakeLLMModel(responses=[], streaming=False, exception=None,
token_usage=None, should_return_token_usage=False) so tests fail-fast or use an
empty fake respectively; refer to the TestChat constructor and the FakeLLMModel
symbol when applying the change.
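Option (a), fail-fast, can be sketched as follows. TestChatSketch and FakeLLMStandIn are stand-ins for illustration, not the real tests/utils.py code:

```python
class FakeLLMStandIn:
    """Placeholder for FakeLLMModel; just records the canned responses."""

    def __init__(self, responses):
        self.responses = responses


class TestChatSketch:
    def __init__(self, llm=None, llm_completions=None):
        if llm is not None:
            self.llm = llm
        elif llm_completions is not None:
            self.llm = FakeLLMStandIn(responses=llm_completions)
        else:
            # Fail fast instead of silently delegating to a real LLM.
            raise ValueError(
                "No llm or llm_completions provided to TestChat; tests must supply a fake"
            )
```

With this shape, the 11 affected call sites fail immediately with a clear message rather than producing config- or provider-dependent errors later.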

Collaborator

@tgasser-nv tgasser-nv left a comment


There are some tests outside of tests/integrations/langchain so they're not covered by conftest.py skip logic. These would need moving to the new location.

  • tests/test_configs/with_custom_llm/config.py → imports CustomLLM
  • tests/test_configs/with_custom_llm/custom_llm.py → extends langchain_core.language_models.BaseLLM
  • tests/test_configs/with_custom_chat_model/config.py → imports CustomChatModel
  • tests/test_configs/with_custom_chat_model/custom_chat_model.py → extends langchain_core.language_models.BaseChatModel

Could you also take a look at the conftest.py comment from CodeRabbit

Comment thread tests/test_llmrails.py
@Pouyanpi
Collaborator Author

There are some tests outside of tests/integrations/langchain so they're not covered by conftest.py skip logic. These would need moving to the new location.

  • tests/test_configs/with_custom_llm/config.py → imports CustomLLM
  • tests/test_configs/with_custom_llm/custom_llm.py → extends langchain_core.language_models.BaseLLM
  • tests/test_configs/with_custom_chat_model/config.py → imports CustomChatModel
  • tests/test_configs/with_custom_chat_model/custom_chat_model.py → extends langchain_core.language_models.BaseChatModel

These would surface when changing the default framework. I've fixed these in the last stack. I'll bring them here.

Could you also take a look at the conftest.py comment from CodeRabbit

Yes. This one is critical. Fixed it, will push 👍🏻

…integrations/langchain/

test_custom_llm.py is LangChain-specific — it asserts that custom_llm
and custom_chat_model hooks register against the LangChain provider
registries. It belongs alongside the other LangChain-only tests
already under tests/integrations/langchain/.

The three LangChain-based test configs (with_custom_llm,
with_custom_chat_model, with_custom_llm_prompt_action_v2_x) are
duplicated into tests/integrations/langchain/test_configs/ so the
moved test continues to resolve them via its sibling ./test_configs
path. The originals under tests/test_configs/ remain in place as the
default-framework versions — they will be rewritten for the LLMModel
protocol in the framework cutover that introduces the new default.
Comment thread tests/conftest.py Outdated
Replace pytest_collection_modifyitems with collect_ignore_glob so LangChain-gated test modules are skipped before pytest imports them.
@Pouyanpi Pouyanpi force-pushed the refactor/langchain-decouple/stack-8-test-infrastructure branch from ca11b55 to 344c8f7 Compare April 17, 2026 14:01
