create video doc example and unit test for it#1972

Open
jperez999 wants to merge 5 commits into NVIDIA:main from jperez999:vid-doc-up

Conversation

@jperez999
Collaborator

@jperez999 jperez999 commented May 5, 2026

Description

This PR adds the video OCR documentation example.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

@jperez999 jperez999 self-assigned this May 5, 2026
@jperez999 jperez999 requested review from a team as code owners May 5, 2026 21:34
@jperez999 jperez999 requested a review from jioffe502 May 5, 2026 21:34
@greptile-apps
Contributor

greptile-apps Bot commented May 5, 2026

Greptile Summary

This PR adds a video OCR documentation example to the nemo_retriever README and a corresponding unit test file that validates the GraphIngestor video pipeline shape (MediaChunk → ASR → embed), VDB upload, and the extract_video/extract_audio fallback behaviour.

  • README (nemo_retriever/README.md): New "Video with GraphIngestor" section documents extract_video, optional intermediate stages (split/dedup/caption/store), embedding, and VDB upload via IngestVdbOperator; includes a graceful fallback note for checkouts without extract_video.
  • Test file (test_readme_graphingestor_extract_video.py): Five unit tests cover graph topology assertions, VDB mock invocation, and an end-to-end mocked ingest→VDB flow; VDB backends are faked with plain MagicMock() rather than the specced FakeVDB(VDB) pattern already established in test_nv_ingest_vdb_operator.py.

Confidence Score: 5/5

Safe to merge — adds documentation and test coverage only; no production logic is changed.

The README addition is purely additive documentation. The new test file exercises graph topology and VDB wiring via mocks; all five tests are scoped to already-public APIs and the existing GraphIngestor/IngestVdbOperator code paths are unchanged.

No files require special attention; the test file is the only new code and its changes are self-contained.

Important Files Changed

| Filename | Overview |
|---|---|
| nemo_retriever/tests/test_readme_graphingestor_extract_video.py | New test file covering README video/GraphIngestor example; imports private _BatchEmbedActor, VDB mocks use unspecced MagicMock() instead of the concrete FakeVDB(VDB) pattern established elsewhere in the test suite. |
| nemo_retriever/README.md | Adds a new Video with GraphIngestor section documenting extract_video/extract_audio fallback, optional pipeline stages, and VDB upload; prose and code example look correct. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["GraphIngestor.files([clip.mp4])"] --> B{"extract_video defined?"}
    B -- Yes --> C["extract_video(AudioChunkParams, ASRParams)"]
    B -- No --> D["extract_audio(AudioChunkParams, ASRParams)"]
    C --> E["embed(EmbedParams)"]
    D --> E
    E --> F["ingestor.ingest() → DataFrame"]
    F --> G["df.to_dict('records')"]
    G --> H["IngestVdbOperator(lancedb)(records)"]
    H --> I["_construct_vdb → vdb.run(records)"]

    subgraph "Internal Graph (build_graph)"
        J["MediaChunkActor"] --> K["ASRActor"] --> L["_BatchEmbedActor"]
    end
    E -.builds.-> J
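
The fallback branch in the flowchart can be sketched without the real library. The two classes below are hypothetical stand-ins, not the GraphIngestor API, and `chunk_seconds` is an invented parameter used only for illustration. Only the method names `extract_video` and `extract_audio` come from the summary above.

```python
class LegacyIngestor:
    """Hypothetical stand-in for a checkout that only has extract_audio."""

    def __init__(self):
        self.stages = []

    def extract_audio(self, **params):
        self.stages.append(("extract_audio", params))
        return self  # chainable, like a builder


class CurrentIngestor(LegacyIngestor):
    """Hypothetical stand-in for a checkout that also has extract_video."""

    def extract_video(self, **params):
        self.stages.append(("extract_video", params))
        return self


def extract_media(ingestor, **params):
    """Prefer extract_video when the ingestor defines it, else fall back."""
    extract = getattr(ingestor, "extract_video", None)
    if extract is None:
        extract = ingestor.extract_audio
    return extract(**params)


# chunk_seconds is an assumed parameter name, not one from the library.
new_stages = extract_media(CurrentIngestor(), chunk_seconds=30).stages
old_stages = extract_media(LegacyIngestor(), chunk_seconds=30).stages
```

Looking the method up with `getattr` keeps the call site identical for both checkouts, which is one plausible way to implement the "extract_video defined?" decision node.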
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nemo_retriever/tests/test_readme_graphingestor_extract_video.py:94-108
**Unspecced `MagicMock` won't catch VDB interface drift**

Both VDB mock setups (here and at line 143) create a plain `MagicMock()` without `spec=`. If `VDB.run` is ever renamed or removed, `mock_vdb.run.assert_called_once()` still passes because an unspecced mock creates attributes on access. The existing VDB test (`test_nv_ingest_vdb_operator.py`) solves this with a concrete `FakeVDB(VDB)` subclass — using the same pattern here, or at minimum `MagicMock(spec=VDB)`, would make API drift detectable at test time.
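
A minimal sketch of the drift described in this issue, using only `unittest.mock`. The `VDB` class here is a hypothetical stand-in in which `run` has already been renamed to `write`, and `upload` is an invented caller that still uses the stale name; neither is the project's real code.

```python
from unittest.mock import MagicMock


class VDB:
    """Hypothetical stand-in: imagine run() was renamed to write()."""

    def write(self, records):
        raise NotImplementedError


def upload(vdb, records):
    """Hypothetical caller that still uses the stale method name."""
    vdb.run(records)


records = [{"text": "hello"}]

# Unspecced mock: the attribute is created on access, so the stale
# call is accepted and the assertion passes despite the rename.
loose = MagicMock()
upload(loose, records)
loose.run.assert_called_once()
loose_accepted_stale_call = loose.run.called

# Specced mock: accessing a name that VDB lacks raises AttributeError.
strict = MagicMock(spec=VDB)
try:
    upload(strict, records)
    strict_raised = False
except AttributeError:
    strict_raised = True
```

With `spec=VDB` the test fails at the stale call site, which is the detection the review asks for. A concrete fake subclass gives the same protection and can also record the records it receives.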

Reviews (4): Last reviewed commit: "fix example use extract video" | Re-trigger Greptile

Comment on lines +18 to +38

assert ingestor._extraction_mode == "image"
assert ingestor._documents == frame_globs
ep = ingestor._extract_params
assert isinstance(ep, ExtractParams)
assert ep.method == "ocr"
assert ep.dpi == 300
assert ep.extract_text is True
assert ep.extract_tables is True
assert ep.extract_charts is True
assert ep.extract_infographics is True
assert isinstance(ingestor._embed_params, EmbedParams)

post_extract = tuple(s for s in ingestor._stage_order if s != "extract")
graph = build_graph(
    extraction_mode=ingestor._extraction_mode,
    extract_params=ingestor._extract_params,
    embed_params=ingestor._embed_params,
    stage_order=post_extract,
)
assert graph.roots[0].name == "MultiTypeExtractOperator"

P1 Test couples to private implementation details

The entire test body reads from _extraction_mode, _documents, _extract_params, _embed_params, and _stage_order, all private attributes with leading underscores. If any of these attributes is renamed in a refactor, the test fails even though nothing has functionally regressed. The testing standard requires asserting on behavior rather than inspecting internal state.

A more resilient approach would be to mock the operators at their boundaries and call ingestor.ingest(), or at minimum use the public files(), extract_image_files(), and embed() chainable builder return values to verify the pipeline is configured correctly. The current approach also duplicates the post_extract filter logic from GraphIngestor.ingest() on line 32, so any change to that private method will leave the test stale.
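
One way to picture the boundary-mocking approach is the sketch below. `ToyIngestor` is a hypothetical builder that takes its operators by injection, which the real GraphIngestor may not do (a real test would more likely use `unittest.mock.patch` on the operator classes). The point is only that every assertion goes through public calls and return values.

```python
from unittest.mock import MagicMock


class ToyIngestor:
    """Hypothetical chainable builder; not the real GraphIngestor."""

    def __init__(self, extract_op, embed_op):
        self._extract_op = extract_op
        self._embed_op = embed_op
        self._docs = []

    def files(self, paths):
        self._docs = list(paths)
        return self

    def ingest(self):
        # Extract first, then embed whatever extraction produced.
        return self._embed_op(self._extract_op(self._docs))


extract_op = MagicMock(return_value=["page text"])
embed_op = MagicMock(return_value=[{"text": "page text", "vector": [0.1]}])

result = ToyIngestor(extract_op, embed_op).files(["frame_001.png"]).ingest()

# Assertions touch only the operator boundaries and the return value,
# never the underscore-prefixed attributes.
extract_op.assert_called_once_with(["frame_001.png"])
embed_op.assert_called_once_with(["page text"])
```

Renaming `_docs` or reordering internal bookkeeping leaves this test green, while dropping a stage or passing the wrong input to it still fails.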

Path: nemo_retriever/tests/test_readme_video_ocr_example.py
Line: 18-38

@@ -0,0 +1,5 @@
# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.

P2 The SPDX copyright header uses 2024-25, but per the spdx-license-header rule new files should use the current year. The sibling vdb/operators.py already uses 2026.

Suggested change
- # SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.
+ # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.

Rule Used: Python files added in this PR must include the SPD... (source)
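
A check like this could be automated. The sketch below is not the project's actual lint; it assumes the rule means "the header's year field equals the current year exactly", which is an interpretation of the comment above rather than something the truncated rule text confirms.

```python
import re

# Matches the header format quoted in this review and captures the year field.
SPDX_RE = re.compile(
    r"^# SPDX-FileCopyrightText: Copyright \(c\) (?P<years>[\d-]+), "
    r"NVIDIA CORPORATION & AFFILIATES\.$"
)


def spdx_year_ok(first_line: str, current_year: int) -> bool:
    """Return True when the header names exactly the given year."""
    match = SPDX_RE.match(first_line.strip())
    return bool(match) and match.group("years") == str(current_year)


old_header = "# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES."
new_header = "# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES."

old_ok = spdx_year_ok(old_header, 2026)
new_ok = spdx_year_ok(new_header, 2026)
```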

Path: nemo_retriever/src/nemo_retriever/examples/__init__.py
Line: 1

@@ -0,0 +1,56 @@
# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.

P2 Same copyright year issue as the __init__.py — new files added in this PR should carry the current year (2026) to match the rest of the codebase.

Suggested change
- # SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.
+ # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.

Rule Used: Python files added in this PR must include the SPD... (source)

Path: nemo_retriever/src/nemo_retriever/examples/readme_video_ocr.py
Line: 1

@@ -0,0 +1,38 @@
# SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.

P2 Copyright year in the test file should also be 2026.

Suggested change
- # SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES.
+ # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.

Rule Used: Python files added in this PR must include the SPD... (source)

Path: nemo_retriever/tests/test_readme_video_ocr_example.py
Line: 1

Comment thread nemo_retriever/README.md
@jperez999 jperez999 changed the title from "create doc example and unit test for it" to "create video doc example and unit test for it" May 5, 2026
Comment on lines +23 to +30
def _linear_node_names(graph) -> list[str]:
    node = graph.roots[0]
    names: list[str] = []
    while True:
        names.append(node.name)
        if not node.children:
            return names
        node = node.children[0]

P1 _linear_node_names silently picks children[0] without asserting that each node has exactly one child, and takes roots[0] without asserting there is exactly one root. If build_graph ever produces a branching or multi-root graph, the traversal quietly follows only the first branch, so names[0..2] still passes while the unexpected extra nodes are never checked. Add explicit linearity assertions to make this a true structural guard.

Suggested change
def _linear_node_names(graph) -> list[str]:
node = graph.roots[0]
names: list[str] = []
while True:
names.append(node.name)
if not node.children:
return names
node = node.children[0]
def _linear_node_names(graph) -> list[str]:
assert len(graph.roots) == 1, f"Expected a single-root graph, got roots: {[r.name for r in graph.roots]}"
node = graph.roots[0]
names: list[str] = []
while True:
names.append(node.name)
if not node.children:
return names
assert len(node.children) == 1, (
f"Expected linear graph at '{node.name}', got children: {[c.name for c in node.children]}"
)
node = node.children[0]
Path: nemo_retriever/tests/test_readme_video_mp4_example.py
Line: 23-30
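
To see what the linearity assertions buy, the guarded traversal can be run against a toy graph. `Node` and `Graph` below are hypothetical dataclasses standing in for whatever `build_graph` returns, and `UnexpectedActor` is an invented node name; only the traversal logic follows the suggestion in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical graph node; the real node type may differ."""

    name: str
    children: list["Node"] = field(default_factory=list)


@dataclass
class Graph:
    roots: list[Node]


def linear_node_names(graph: Graph) -> list[str]:
    """Walk a graph that must be a single chain and return node names."""
    assert len(graph.roots) == 1, "expected a single-root graph"
    node = graph.roots[0]
    names: list[str] = []
    while True:
        names.append(node.name)
        if not node.children:
            return names
        assert len(node.children) == 1, f"expected linear graph at {node.name!r}"
        node = node.children[0]


linear = Graph(
    [Node("MediaChunkActor", [Node("ASRActor", [Node("_BatchEmbedActor")])])]
)
branching = Graph(
    [Node("MediaChunkActor", [Node("ASRActor"), Node("UnexpectedActor")])]
)

names = linear_node_names(linear)
try:
    linear_node_names(branching)
    branch_detected = False
except AssertionError:
    branch_detected = True
```

Without the second assertion the branching graph would return `["MediaChunkActor", "ASRActor"]` and the extra child would never be inspected.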

Collaborator

@randerzander randerzander left a comment

Approving for now, so we have it written down.

But we have follow-up work to clean up the snippets to be more gist-like and to reflect the fewer LOC necessary for typical usage.
