create video doc example and unit test for it by jperez999 · Pull Request #1972 · NVIDIA/NeMo-Retriever

jperez999 · 2026-05-05T21:34:25Z

Description

This PR adds the video OCR documentation example.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

greptile-apps · 2026-05-05T21:37:53Z

Greptile Summary

This PR adds a video OCR documentation example to the nemo_retriever README and a corresponding unit test file that validates the GraphIngestor video pipeline shape (MediaChunk → ASR → embed), VDB upload, and the extract_video/extract_audio fallback behaviour.

README (nemo_retriever/README.md): New "Video with GraphIngestor" section documents extract_video, optional intermediate stages (split/dedup/caption/store), embedding, and VDB upload via IngestVdbOperator; includes a graceful fallback note for checkouts without extract_video.
Test file (test_readme_graphingestor_extract_video.py): Five unit tests cover graph topology assertions, VDB mock invocation, and an end-to-end mocked ingest→VDB flow; VDB backends are faked with plain MagicMock() rather than the specced FakeVDB(VDB) pattern already established in test_nv_ingest_vdb_operator.py.

Confidence Score: 5/5

Safe to merge — adds documentation and test coverage only; no production logic is changed.

The README addition is purely additive documentation. The new test file exercises graph topology and VDB wiring via mocks; all five tests are scoped to already-public APIs and the existing GraphIngestor/IngestVdbOperator code paths are unchanged.

No files require special attention; the test file is the only new code and its changes are self-contained.

Important Files Changed

Filename	Overview
nemo_retriever/tests/test_readme_graphingestor_extract_video.py	New test file covering README video/GraphIngestor example; imports private `_BatchEmbedActor`, VDB mocks use unspecced `MagicMock()` instead of the concrete `FakeVDB(VDB)` pattern established elsewhere in the test suite.
nemo_retriever/README.md	Adds a new Video with GraphIngestor section documenting extract_video/extract_audio fallback, optional pipeline stages, and VDB upload; prose and code example look correct.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["GraphIngestor.files([clip.mp4])"] --> B{"extract_video defined?"}
    B -- Yes --> C["extract_video(AudioChunkParams, ASRParams)"]
    B -- No --> D["extract_audio(AudioChunkParams, ASRParams)"]
    C --> E["embed(EmbedParams)"]
    D --> E
    E --> F["ingestor.ingest() → DataFrame"]
    F --> G["df.to_dict('records')"]
    G --> H["IngestVdbOperator(lancedb)(records)"]
    H --> I["_construct_vdb → vdb.run(records)"]

    subgraph "Internal Graph (build_graph)"
        J["MediaChunkActor"] --> K["ASRActor"] --> L["_BatchEmbedActor"]
    end
    E -.builds.-> J

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nemo_retriever/tests/test_readme_graphingestor_extract_video.py:94-108
**Unspecced `MagicMock` won't catch VDB interface drift**

Both VDB mock setups (here and at line 143) create a plain `MagicMock()` without `spec=`. If `VDB.run` is ever renamed or removed, `mock_vdb.run.assert_called_once()` still passes because an unspecced mock creates attributes on access. The existing VDB test (`test_nv_ingest_vdb_operator.py`) solves this with a concrete `FakeVDB(VDB)` subclass — using the same pattern here, or at minimum `MagicMock(spec=VDB)`, would make API drift detectable at test time.

_{Reviews (4): Last reviewed commit: "fix example use extract video" | Re-trigger Greptile}

greptile-apps · 2026-05-05T21:37:56Z

+
+    assert ingestor._extraction_mode == "image"
+    assert ingestor._documents == frame_globs
+    ep = ingestor._extract_params
+    assert isinstance(ep, ExtractParams)
+    assert ep.method == "ocr"
+    assert ep.dpi == 300
+    assert ep.extract_text is True
+    assert ep.extract_tables is True
+    assert ep.extract_charts is True
+    assert ep.extract_infographics is True
+    assert isinstance(ingestor._embed_params, EmbedParams)
+
+    post_extract = tuple(s for s in ingestor._stage_order if s != "extract")
+    graph = build_graph(
+        extraction_mode=ingestor._extraction_mode,
+        extract_params=ingestor._extract_params,
+        embed_params=ingestor._embed_params,
+        stage_order=post_extract,
+    )
+    assert graph.roots[0].name == "MultiTypeExtractOperator"


Test couples to private implementation details

The entire test body reads from _extraction_mode, _documents, _extract_params, _embed_params, and _stage_order — all private attributes with leading underscores. If any of these attribute names are refactored, the test breaks silently without any real functional regression. The testing standard requires asserting on behavior rather than inspecting internal state.

A more resilient approach would be to mock the operators at their boundaries and call ingestor.ingest(), or at minimum use the public files(), extract_image_files(), and embed() chainable builder return values to verify the pipeline is configured correctly. The current approach also duplicates the post_extract filter logic from GraphIngestor.ingest() on line 32, so any change to that private method will leave the test stale.

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/tests/test_readme_video_ocr_example.py Line: 18-38 Comment: **Test couples to private implementation details** The entire test body reads from `_extraction_mode`, `_documents`, `_extract_params`, `_embed_params`, and `_stage_order` — all private attributes with leading underscores. If any of these attribute names are refactored, the test breaks silently without any real functional regression. The testing standard requires asserting on behavior rather than inspecting internal state. A more resilient approach would be to mock the operators at their boundaries and call `ingestor.ingest()`, or at minimum use the public `files()`, `extract_image_files()`, and `embed()` chainable builder return values to verify the pipeline is configured correctly. The current approach also duplicates the `post_extract` filter logic from `GraphIngestor.ingest()` on line 32, so any change to that private method will leave the test stale. How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-05-05T21:37:59Z

@@ -0,0 +1,5 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.


The SPDX copyright header uses 2024-25, but per the spdx-license-header rule new files should use the current year. The sibling vdb/operators.py already uses 2026.

Suggested change

# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.

# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.

Rule Used: Python files added in this PR must include the SPD... (source)

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/examples/__init__.py Line: 1 Comment: The SPDX copyright header uses `2024-25`, but per the `spdx-license-header` rule new files should use the current year. The sibling `vdb/operators.py` already uses `2026`. ```suggestion # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. ``` **Rule Used:** Python files added in this PR must include the SPD... ([source](https://app.greptile.com/review/custom-context?memory=spdx-license-header)) How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-05T21:38:00Z

@@ -0,0 +1,56 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.


Same copyright year issue as the __init__.py — new files added in this PR should carry the current year (2026) to match the rest of the codebase.

Suggested change

# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.

# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.

Rule Used: Python files added in this PR must include the SPD... (source)

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/examples/readme_video_ocr.py Line: 1 Comment: Same copyright year issue as the `__init__.py` — new files added in this PR should carry the current year (`2026`) to match the rest of the codebase. ```suggestion # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. ``` **Rule Used:** Python files added in this PR must include the SPD... ([source](https://app.greptile.com/review/custom-context?memory=spdx-license-header)) How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-05T21:38:03Z

@@ -0,0 +1,38 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.


Copyright year in the test file should also be 2026.

Suggested change

# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.

# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.

Rule Used: Python files added in this PR must include the SPD... (source)

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/tests/test_readme_video_ocr_example.py Line: 1 Comment: Copyright year in the test file should also be `2026`. ```suggestion # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. ``` **Rule Used:** Python files added in this PR must include the SPD... ([source](https://app.greptile.com/review/custom-context?memory=spdx-license-header)) How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-05T21:59:24Z

+def _linear_node_names(graph) -> list[str]:
+    node = graph.roots[0]
+    names: list[str] = []
+    while True:
+        names.append(node.name)
+        if not node.children:
+            return names
+        node = node.children[0]


_linear_node_names silently picks children[0] without asserting that each node has exactly one child, and takes roots[0] without asserting there is exactly one root. If build_graph ever produces a branching or multi-root graph, the traversal quietly follows only the first branch, so names[0..2] still passes while the unexpected extra nodes are never checked. Add explicit linearity assertions to make this a true structural guard.

Suggested change

def _linear_node_names(graph) -> list[str]:

node = graph.roots[0]

names: list[str] = []

while True:

names.append(node.name)

if not node.children:

return names

node = node.children[0]

def _linear_node_names(graph) -> list[str]:

assert len(graph.roots) == 1, f"Expected a single-root graph, got roots: {[r.name for r in graph.roots]}"

node = graph.roots[0]

names: list[str] = []

while True:

names.append(node.name)

if not node.children:

return names

assert len(node.children) == 1, (

f"Expected linear graph at '{node.name}', got children: {[c.name for c in node.children]}"

)

node = node.children[0]

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/tests/test_readme_video_mp4_example.py Line: 23-30 Comment: `_linear_node_names` silently picks `children[0]` without asserting that each node has exactly one child, and takes `roots[0]` without asserting there is exactly one root. If `build_graph` ever produces a branching or multi-root graph, the traversal quietly follows only the first branch, so `names[0..2]` still passes while the unexpected extra nodes are never checked. Add explicit linearity assertions to make this a true structural guard. ```suggestion def _linear_node_names(graph) -> list[str]: assert len(graph.roots) == 1, f"Expected a single-root graph, got roots: {[r.name for r in graph.roots]}" node = graph.roots[0] names: list[str] = [] while True: names.append(node.name) if not node.children: return names assert len(node.children) == 1, ( f"Expected linear graph at '{node.name}', got children: {[c.name for c in node.children]}" ) node = node.children[0] ``` How can I resolve this? If you propose a fix, please make it concise.

randerzander

Approving for now, so we have it written down

But we have followup work to clean up the snippets to be more gist like and reflect the fewer LOC necessary for typical usage

create doc example and unit test for it

9f09911

jperez999 self-assigned this May 5, 2026

jperez999 requested review from a team as code owners May 5, 2026 21:34

jperez999 requested a review from jioffe502 May 5, 2026 21:34

greptile-apps Bot reviewed May 5, 2026

View reviewed changes

jperez999 added 2 commits May 5, 2026 14:39

remove added example

29d0521

remove format change

dfbd4bc

jperez999 changed the title ~~create doc example and unit test for it~~ create video doc example and unit test for it May 5, 2026

fix to use mp4 inputs

67faf6a

greptile-apps Bot reviewed May 5, 2026

View reviewed changes

fix example use extract video

5568f41

randerzander approved these changes May 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

create video doc example and unit test for it#1972

create video doc example and unit test for it#1972
jperez999 wants to merge 5 commits intoNVIDIA:mainfrom
jperez999:vid-doc-up

jperez999 commented May 5, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 5, 2026 •

edited

Loading

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot May 5, 2026

Uh oh!

greptile-apps Bot May 5, 2026

Uh oh!

greptile-apps Bot May 5, 2026

Uh oh!

greptile-apps Bot May 5, 2026

Uh oh!

Uh oh!

greptile-apps Bot May 5, 2026

Uh oh!

randerzander left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -0,0 +1,5 @@
		# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.

		@@ -0,0 +1,56 @@
		# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.

		@@ -0,0 +1,38 @@
		# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.

Conversation

jperez999 commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Uh oh!

greptile-apps Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

greptile-apps Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

randerzander left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jperez999 commented May 5, 2026 •

edited

Loading

greptile-apps Bot commented May 5, 2026 •

edited

Loading