create video doc example and unit test for it#1972
create video doc example and unit test for it#1972jperez999 wants to merge 5 commits intoNVIDIA:mainfrom
Conversation
Greptile SummaryThis PR adds a video OCR documentation example to the
|
| Filename | Overview |
|---|---|
| nemo_retriever/tests/test_readme_graphingestor_extract_video.py | New test file covering README video/GraphIngestor example; imports private _BatchEmbedActor, VDB mocks use unspecced MagicMock() instead of the concrete FakeVDB(VDB) pattern established elsewhere in the test suite. |
| nemo_retriever/README.md | Adds a new Video with GraphIngestor section documenting extract_video/extract_audio fallback, optional pipeline stages, and VDB upload; prose and code example look correct. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["GraphIngestor.files([clip.mp4])"] --> B{"extract_video defined?"}
B -- Yes --> C["extract_video(AudioChunkParams, ASRParams)"]
B -- No --> D["extract_audio(AudioChunkParams, ASRParams)"]
C --> E["embed(EmbedParams)"]
D --> E
E --> F["ingestor.ingest() → DataFrame"]
F --> G["df.to_dict('records')"]
G --> H["IngestVdbOperator(lancedb)(records)"]
H --> I["_construct_vdb → vdb.run(records)"]
subgraph "Internal Graph (build_graph)"
J["MediaChunkActor"] --> K["ASRActor"] --> L["_BatchEmbedActor"]
end
E -.builds.-> J
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
nemo_retriever/tests/test_readme_graphingestor_extract_video.py:94-108
**Unspecced `MagicMock` won't catch VDB interface drift**
Both VDB mock setups (here and at line 143) create a plain `MagicMock()` without `spec=`. If `VDB.run` is ever renamed or removed, `mock_vdb.run.assert_called_once()` still passes because an unspecced mock creates attributes on access. The existing VDB test (`test_nv_ingest_vdb_operator.py`) solves this with a concrete `FakeVDB(VDB)` subclass — using the same pattern here, or at minimum `MagicMock(spec=VDB)`, would make API drift detectable at test time.
Reviews (4): Last reviewed commit: "fix example use extract video" | Re-trigger Greptile
|
|
||
| assert ingestor._extraction_mode == "image" | ||
| assert ingestor._documents == frame_globs | ||
| ep = ingestor._extract_params | ||
| assert isinstance(ep, ExtractParams) | ||
| assert ep.method == "ocr" | ||
| assert ep.dpi == 300 | ||
| assert ep.extract_text is True | ||
| assert ep.extract_tables is True | ||
| assert ep.extract_charts is True | ||
| assert ep.extract_infographics is True | ||
| assert isinstance(ingestor._embed_params, EmbedParams) | ||
|
|
||
| post_extract = tuple(s for s in ingestor._stage_order if s != "extract") | ||
| graph = build_graph( | ||
| extraction_mode=ingestor._extraction_mode, | ||
| extract_params=ingestor._extract_params, | ||
| embed_params=ingestor._embed_params, | ||
| stage_order=post_extract, | ||
| ) | ||
| assert graph.roots[0].name == "MultiTypeExtractOperator" |
There was a problem hiding this comment.
Test couples to private implementation details
The entire test body reads from _extraction_mode, _documents, _extract_params, _embed_params, and _stage_order — all private attributes with leading underscores. If any of these attribute names are refactored, the test breaks silently without any real functional regression. The testing standard requires asserting on behavior rather than inspecting internal state.
A more resilient approach would be to mock the operators at their boundaries and call ingestor.ingest(), or at minimum use the public files(), extract_image_files(), and embed() chainable builder return values to verify the pipeline is configured correctly. The current approach also duplicates the post_extract filter logic from GraphIngestor.ingest() on line 32, so any change to that private method will leave the test stale.
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/tests/test_readme_video_ocr_example.py
Line: 18-38
Comment:
**Test couples to private implementation details**
The entire test body reads from `_extraction_mode`, `_documents`, `_extract_params`, `_embed_params`, and `_stage_order` — all private attributes with leading underscores. If any of these attribute names are refactored, the test breaks silently without any real functional regression. The testing standard requires asserting on behavior rather than inspecting internal state.
A more resilient approach would be to mock the operators at their boundaries and call `ingestor.ingest()`, or at minimum use the public `files()`, `extract_image_files()`, and `embed()` chainable builder return values to verify the pipeline is configured correctly. The current approach also duplicates the `post_extract` filter logic from `GraphIngestor.ingest()` on line 32, so any change to that private method will leave the test stale.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| @@ -0,0 +1,5 @@ | |||
| # SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES. | |||
There was a problem hiding this comment.
The SPDX copyright header uses
2024-25, but per the spdx-license-header rule new files should use the current year. The sibling vdb/operators.py already uses 2026.
| # SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES. | |
| # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. |
Rule Used: Python files added in this PR must include the SPD... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/examples/__init__.py
Line: 1
Comment:
The SPDX copyright header uses `2024-25`, but per the `spdx-license-header` rule new files should use the current year. The sibling `vdb/operators.py` already uses `2026`.
```suggestion
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.
```
**Rule Used:** Python files added in this PR must include the SPD... ([source](https://app.greptile.com/review/custom-context?memory=spdx-license-header))
How can I resolve this? If you propose a fix, please make it concise.| @@ -0,0 +1,56 @@ | |||
| # SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES. | |||
There was a problem hiding this comment.
Same copyright year issue as the
__init__.py — new files added in this PR should carry the current year (2026) to match the rest of the codebase.
| # SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES. | |
| # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. |
Rule Used: Python files added in this PR must include the SPD... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/examples/readme_video_ocr.py
Line: 1
Comment:
Same copyright year issue as the `__init__.py` — new files added in this PR should carry the current year (`2026`) to match the rest of the codebase.
```suggestion
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.
```
**Rule Used:** Python files added in this PR must include the SPD... ([source](https://app.greptile.com/review/custom-context?memory=spdx-license-header))
How can I resolve this? If you propose a fix, please make it concise.| @@ -0,0 +1,38 @@ | |||
| # SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES. | |||
There was a problem hiding this comment.
Copyright year in the test file should also be
2026.
| # SPDX-FileCopyrightText: Copyright (c) 2024-25, NVIDIA CORPORATION & AFFILIATES. | |
| # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. |
Rule Used: Python files added in this PR must include the SPD... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/tests/test_readme_video_ocr_example.py
Line: 1
Comment:
Copyright year in the test file should also be `2026`.
```suggestion
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.
```
**Rule Used:** Python files added in this PR must include the SPD... ([source](https://app.greptile.com/review/custom-context?memory=spdx-license-header))
How can I resolve this? If you propose a fix, please make it concise.| def _linear_node_names(graph) -> list[str]: | ||
| node = graph.roots[0] | ||
| names: list[str] = [] | ||
| while True: | ||
| names.append(node.name) | ||
| if not node.children: | ||
| return names | ||
| node = node.children[0] |
There was a problem hiding this comment.
_linear_node_names silently picks children[0] without asserting that each node has exactly one child, and takes roots[0] without asserting there is exactly one root. If build_graph ever produces a branching or multi-root graph, the traversal quietly follows only the first branch, so names[0..2] still passes while the unexpected extra nodes are never checked. Add explicit linearity assertions to make this a true structural guard.
| def _linear_node_names(graph) -> list[str]: | |
| node = graph.roots[0] | |
| names: list[str] = [] | |
| while True: | |
| names.append(node.name) | |
| if not node.children: | |
| return names | |
| node = node.children[0] | |
| def _linear_node_names(graph) -> list[str]: | |
| assert len(graph.roots) == 1, f"Expected a single-root graph, got roots: {[r.name for r in graph.roots]}" | |
| node = graph.roots[0] | |
| names: list[str] = [] | |
| while True: | |
| names.append(node.name) | |
| if not node.children: | |
| return names | |
| assert len(node.children) == 1, ( | |
| f"Expected linear graph at '{node.name}', got children: {[c.name for c in node.children]}" | |
| ) | |
| node = node.children[0] |
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/tests/test_readme_video_mp4_example.py
Line: 23-30
Comment:
`_linear_node_names` silently picks `children[0]` without asserting that each node has exactly one child, and takes `roots[0]` without asserting there is exactly one root. If `build_graph` ever produces a branching or multi-root graph, the traversal quietly follows only the first branch, so `names[0..2]` still passes while the unexpected extra nodes are never checked. Add explicit linearity assertions to make this a true structural guard.
```suggestion
def _linear_node_names(graph) -> list[str]:
assert len(graph.roots) == 1, f"Expected a single-root graph, got roots: {[r.name for r in graph.roots]}"
node = graph.roots[0]
names: list[str] = []
while True:
names.append(node.name)
if not node.children:
return names
assert len(node.children) == 1, (
f"Expected linear graph at '{node.name}', got children: {[c.name for c in node.children]}"
)
node = node.children[0]
```
How can I resolve this? If you propose a fix, please make it concise.
randerzander
left a comment
There was a problem hiding this comment.
Approving for now, so we have it written down
But we have followup work to clean up the snippets to be more gist like and reflect the fewer LOC necessary for typical usage
Description
This PR adds the video OCR documentation example.
Checklist