docs: add AGENTS.md for AI agent and contributor guidance#1769
docs: add AGENTS.md for AI agent and contributor guidance#1769
Conversation
Provides a structured reference for AI coding agents covering environment setup, test commands, the Task/ProcessingStage/Pipeline/Executor architecture, stage authoring rules, commit/DCO requirements, and PR expectations. Also removes AGENTS.md from .gitignore (was mistakenly added by a template). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Greptile SummaryAdds Confidence Score: 5/5Safe to merge; all prior P1 concerns have been addressed and the one remaining finding is a minor self-contradiction (P2 style). All three previously flagged issues (docstring rule, pip install pre-commit, coverage enforcement) have been resolved. The only remaining issue is the AGENTS.md line 22-23 (pip install uv bootstrap self-contradiction)
|
| Filename | Overview |
|---|---|
| AGENTS.md | New agent/contributor guidance file; well-structured with accurate coverage enforcement, corrected pre-commit setup, and nuanced docstring note — one self-contradiction remains with pip install uv vs. the "never use bare pip" rule. |
| .gitignore | Removes erroneously template-generated AGENTS.md entry — correct change. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Pick location in stages modality/category] --> B[Subclass ProcessingStage]
B --> C[Declare class attributes: name, resources, batch_size]
C --> D[Implement process, inputs, outputs methods]
D --> E[Export via __init__.py]
E --> F[Write mirrored tests in tests/stages/modality/category]
F --> G[pre-commit run --all-files]
G --> H{Coverage >= 80pct?}
H -->|Yes| I[git commit -sS]
H -->|No| F
I --> J[Open PR targeting main]
J --> K[Codecov patch check: target 80pct, if_ci_failed=error]
Reviews (5): Last reviewed commit: "docs: address Sarah's review comments on..." | Re-trigger Greptile
Replace `pip install pre-commit` with `uv sync --extra all` (which already includes pre-commit via the dev group) and `uv run pre-commit` to stay consistent with the "never use bare pip" rule stated earlier in the same file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Arham Mehta <141266146+arhamm1@users.noreply.github.com>
pydocstyle rules ("D") are entirely disabled in ruff's ignore list, so
docstrings are never CI-enforced. Clarify that docstrings are a
CONTRIBUTING.md convention but ruff will not flag their absence.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Remove the duplicate "Set up pre-commit hooks" block introduced when rebasing over an external commit that also addressed the same review comment. Keep the uv sync --extra all approach and restore the missing section separator before Running Tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Arham Mehta <141266146+arhamm1@users.noreply.github.com>
Mention the Ray actor pool executor alongside RayData and Xenna, with a note that it is intended for specific workflows like deduplication and most pipelines should prefer RayDataExecutor or XennaExecutor. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Arham Mehta <arhamm@nvidia.com>
An external commit's patch was applied as literal text during rebase, leaving raw diff markers (- / +) in the Architecture Overview section. Remove the stale lines; the correct docstring clarification is already present in the Linting section above. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Arham Mehta <arhamm@nvidia.com>
|
|
||
| **Test layout mirrors source layout.** New stage in `nemo_curator/stages/text/filters/` → tests in `tests/stages/text/filters/`. GPU-only tests are marked `@pytest.mark.gpu`. CPU tests must pass without any GPU. | ||
|
|
||
| **Coverage requirement: 80% on changed lines.** CI will block PRs that don't meet this. |
There was a problem hiding this comment.
80% coverage threshold is not enforced by CI
The document states "CI will block PRs that don't meet this," but the actual CI workflow (cicd-main.yml) runs coverage report purely for reporting — there is no --fail-under=80 flag and no fail_under key in pyproject.toml's [tool.coverage.report] section (which doesn't exist). An agent that believes this will be surprised when a PR with low coverage passes CI, or will waste time padding tests to hit an imaginary gate. Either remove the blocking-CI claim or add an actual enforcement step to keep the docs honest.
There was a problem hiding this comment.
@thomasdhc what do you suggest we do here?
There was a problem hiding this comment.
Codecov configuration lives here: codecov.yml not in pyproject.toml
| | Don't | Do instead | | ||
| |-------|-----------| | ||
| | `pip install ...` | `uv sync --extra <group>` | | ||
| | `python script.py` | `uv run python script.py` | |
There was a problem hiding this comment.
uv run python script.py works within any uv-managed project directory (i.e. anywhere under the repo root after uv sync has been run). It invokes Python from the project virtual environment without requiring the venv to be manually activated — which is the recommended uv pattern. Happy to change the wording if you prefer uv run -- python or a note clarifying the prerequisite.
There was a problem hiding this comment.
Probably more context is needed here. I thought we had had a ticket open because we weren't sure if uv run would always work?
There was a problem hiding this comment.
in that case; should we stick to uv pip install "nemo-curator"?
There was a problem hiding this comment.
Nope this (uv run) doesn't work for us always; we still need to learn why it doesn't work.. For now we've been doing source .venv/bin/activate && python script.py
In fact uv run led to CI/CD failures, that Onur helped resolve here https://github.com/NVIDIA-NeMo/Curator/pull/1557/changes#diff-fc966eacbe7f865fa8fee5be263e78011c386a7bd0be696a28271e169def612dR98-R101
- Add math_cpu and math_cuda12 to available extras list - Move inputs()/outputs() from Optional to Expected overrides; every stage should implement them - Clarify setup_on_node (download/verify models on disk) vs setup (load models into memory) - Add note that process() must propagate _stage_perf and _metadata - Update stage example to propagate _stage_perf and _metadata - Correct executor ordering: XennaExecutor is the current default, RayDataExecutor is the common alternative Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Arham Mehta <arhamm@nvidia.com>
| uv sync --extra audio_cpu --extra video_cpu | ||
| ``` | ||
|
|
||
| Available extras: `text_cpu`, `text_cuda12`, `audio_cpu`, `audio_cuda12`, `image_cpu`, `image_cuda12`, `video_cpu`, `video_cuda12`, `deduplication_cuda12`, `sdg_cpu`, `sdg_cuda12`, `interleaved_cpu`, `interleaved_cuda12`, `math_cpu`, `math_cuda12`, `all`. |
There was a problem hiding this comment.
I don't think deduplication should be included here, users should never install deduplication by itself:
| Available extras: `text_cpu`, `text_cuda12`, `audio_cpu`, `audio_cuda12`, `image_cpu`, `image_cuda12`, `video_cpu`, `video_cuda12`, `deduplication_cuda12`, `sdg_cpu`, `sdg_cuda12`, `interleaved_cpu`, `interleaved_cuda12`, `math_cpu`, `math_cuda12`, `all`. | |
| Available extras: `text_cpu`, `text_cuda12`, `audio_cpu`, `audio_cuda12`, `image_cpu`, `image_cuda12`, `video_cpu`, `video_cuda12`, `sdg_cpu`, `sdg_cuda12`, `interleaved_cpu`, `interleaved_cuda12`, `math_cpu`, `math_cuda12`, `all`. |
| | Don't | Do instead | | ||
| |-------|-----------| | ||
| | `pip install ...` | `uv sync --extra <group>` | | ||
| | `python script.py` | `uv run python script.py` | |
There was a problem hiding this comment.
Probably more context is needed here. I thought we had had a ticket open because we weren't sure if uv run would always work?
There was a problem hiding this comment.
Let's add instructions about the copyright header too. The copyright should be added to all newly added, non-empty Python files.
| # Single GPU stage | ||
| resources = Resources(gpu_memory_gb=40.0) |
There was a problem hiding this comment.
We haven't tested this code path as much so let's remove this
| ## Quick Reference | ||
|
|
||
| ```bash | ||
| uv sync --extra all # install all dependencies |
There was a problem hiding this comment.
| uv sync --extra all # install all dependencies | |
| uv sync --all-extras --all-groups # install all dependencies and development dependencies | |
| source .venv/bin/activate # to activate the uv venv |
Summary
Adds
AGENTS.md— a structured reference for AI coding agents (Claude Code, Copilot, Cursor, etc.) contributing to NeMo Curator.What it covers
uv(all extras)ruffvia pre-commitTask,ProcessingStage,CompositeStage,Pipeline,Executorname/resourcesmust be class attributes, not@property;with_()pattern; idempotency requirement)-sflag)Co-authored-bytrailerpip,@propertyon stage attrs, skipping signoff, etc.)Also removes
AGENTS.mdfrom.gitignorewhere it was mistakenly added by a template generator.🤖 Generated with Claude Code