docs: add AGENTS.md for AI agent and contributor guidance by arhamm1 · Pull Request #1769 · NVIDIA-NeMo/Curator

arhamm1 · 2026-04-08T18:37:09Z

Summary

Adds AGENTS.md — a structured reference for AI coding agents (Claude Code, Copilot, Cursor, etc.) contributing to NeMo Curator.

What it covers

Environment setup with uv (all extras)
Test commands, GPU vs CPU split, 80% coverage requirement
Linting with ruff via pre-commit
Architecture overview: Task, ProcessingStage, CompositeStage, Pipeline, Executor
Step-by-step guide to adding a new stage with code examples
Stage authoring rules (name/resources must be class attributes, not @property; with_() pattern; idempotency requirement)
Commit requirements (DCO -s flag)
PR requirements including AI assistance disclosure and Co-authored-by trailer
Common mistakes table (bare pip, @property on stage attrs, skipping signoff, etc.)
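One reason a rule like "class attributes, not @property" can matter in Python is that a property only resolves on instances, so any code that inspects the class itself sees a property object instead of the value. The sketch below illustrates this with hypothetical stand-in classes (`StageBase`, `Resources`); they are not NeMo Curator's `ProcessingStage`, and the `with_()` shown here (return a modified copy) is an assumed shape of that pattern, not the library's implementation.

```python
import copy
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical stand-in for a stage resource spec."""
    cpus: float = 1.0


class StageBase:
    """Hypothetical base class; not NeMo Curator's ProcessingStage."""
    name = "base"
    resources = Resources()

    def with_(self, **overrides):
        # Assumed pattern: return a shallow copy with overrides applied,
        # leaving the original instance and the class untouched.
        new = copy.copy(self)
        for key, value in overrides.items():
            setattr(new, key, value)
        return new


class GoodStage(StageBase):
    # Plain class attributes: readable without creating an instance.
    name = "good_stage"
    resources = Resources(cpus=2.0)


class BadStage(StageBase):
    # A property resolves only on instances; the class exposes a property object.
    @property
    def name(self):
        return "bad_stage"


good_is_str = isinstance(GoodStage.name, str)  # True
bad_is_str = isinstance(BadStage.name, str)    # False: it is a property object
tuned = GoodStage().with_(resources=Resources(cpus=4.0))
```

In this sketch `tuned` carries the overridden resources while `GoodStage.resources` keeps its class-level default.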

Also removes AGENTS.md from .gitignore where it was mistakenly added by a template generator.

🤖 Generated with Claude Code

Provides a structured reference for AI coding agents covering environment setup, test commands, the Task/ProcessingStage/Pipeline/Executor architecture, stage authoring rules, commit/DCO requirements, and PR expectations. Also removes AGENTS.md from .gitignore (was mistakenly added by a template).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>

copy-pr-bot · 2026-04-08T18:37:14Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-04-08T18:39:32Z

Greptile Summary

Adds AGENTS.md — a structured guide for AI coding agents contributing to NeMo Curator — and removes it from .gitignore where a template generator had erroneously added it. Previous review feedback has been addressed: the docstring section now correctly reflects that ruff ignores the D group while CONTRIBUTING.md still asks for them, the pre-commit setup block no longer uses pip install pre-commit, and the 80% coverage claim is accurate (enforced via codecov.yml with target: 80% / if_ci_failed: error on patch coverage).

Confidence Score: 5/5

Safe to merge; all prior P1 concerns have been addressed and the one remaining finding is a minor self-contradiction (P2 style).

All three previously flagged issues (docstring rule, pip install pre-commit, coverage enforcement) have been resolved. The only remaining issue is the pip install uv self-contradiction, which is a P2 style suggestion and does not block merging.

AGENTS.md lines 22-23 (pip install uv bootstrap self-contradiction)

Vulnerabilities

No security concerns identified. The file is documentation only and contains no code that processes user input, handles credentials, or interacts with external systems.

Important Files Changed

| Filename | Overview |
|----------|----------|
| AGENTS.md | New agent/contributor guidance file; well-structured with accurate coverage enforcement, corrected pre-commit setup, and nuanced docstring note. One self-contradiction remains: `pip install uv` vs. the "never use bare pip" rule. |
| .gitignore | Removes erroneously template-generated `AGENTS.md` entry; correct change. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Pick location in stages modality/category] --> B[Subclass ProcessingStage]
    B --> C[Declare class attributes: name, resources, batch_size]
    C --> D[Implement process, inputs, outputs methods]
    D --> E[Export via __init__.py]
    E --> F[Write mirrored tests in tests/stages/modality/category]
    F --> G[pre-commit run --all-files]
    G --> H{Coverage >= 80pct?}
    H -->|Yes| I[git commit -sS]
    H -->|No| F
    I --> J[Open PR targeting main]
    J --> K[Codecov patch check: target 80pct, if_ci_failed=error]

Reviews (5): Last reviewed commit: "docs: address Sarah's review comments on..."

Replace `pip install pre-commit` with `uv sync --extra all` (which already includes pre-commit via the dev group) and `uv run pre-commit` to stay consistent with the "never use bare pip" rule stated earlier in the same file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Arham Mehta <141266146+arhamm1@users.noreply.github.com>

pydocstyle rules ("D") are entirely disabled in ruff's ignore list, so docstrings are never CI-enforced. Clarify that docstrings are a CONTRIBUTING.md convention but ruff will not flag their absence.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>

Remove the duplicate "Set up pre-commit hooks" block introduced when rebasing over an external commit that also addressed the same review comment. Keep the uv sync --extra all approach and restore the missing section separator before Running Tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Arham Mehta <141266146+arhamm1@users.noreply.github.com>

Mention the Ray actor pool executor alongside RayData and Xenna, with a note that it is intended for specific workflows like deduplication and most pipelines should prefer RayDataExecutor or XennaExecutor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>

An external commit's patch was applied as literal text during rebase, leaving raw diff markers (- / +) in the Architecture Overview section. Remove the stale lines; the correct docstring clarification is already present in the Linting section above.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>

greptile-apps · 2026-04-08T18:50:32Z

+
+**Test layout mirrors source layout.** New stage in `nemo_curator/stages/text/filters/` → tests in `tests/stages/text/filters/`. GPU-only tests are marked `@pytest.mark.gpu`. CPU tests must pass without any GPU.
+
+**Coverage requirement: 80% on changed lines.** CI will block PRs that don't meet this.


80% coverage threshold is not enforced by CI

The document states "CI will block PRs that don't meet this," but the actual CI workflow (cicd-main.yml) runs coverage report purely for reporting — there is no --fail-under=80 flag and no fail_under key in pyproject.toml's [tool.coverage.report] section (which doesn't exist). An agent that believes this will be surprised when a PR with low coverage passes CI, or will waste time padding tests to hit an imaginary gate. Either remove the blocking-CI claim or add an actual enforcement step to keep the docs honest.

@thomasdhc what do you suggest we do here?

The Codecov configuration lives in codecov.yml, not in pyproject.toml.
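For context, "patch coverage" means coverage measured only over the lines a PR changes. The sketch below shows that arithmetic against the 80% figure quoted in this PR. The function name and the sample line sets are hypothetical, and this is an illustration of the idea, not a description of how Codecov computes its status check.

```python
def patch_coverage(changed_lines: set[int], covered_lines: set[int]) -> float:
    """Percentage of changed lines executed by tests (hypothetical helper)."""
    if not changed_lines:
        return 100.0  # nothing changed, so nothing to cover
    hit = changed_lines & covered_lines
    return 100.0 * len(hit) / len(changed_lines)


TARGET = 80.0  # the 80% patch target quoted in this PR

changed = {10, 11, 12, 13, 14}       # lines touched by the PR (made-up data)
covered = {10, 11, 12, 13, 40, 41}   # lines executed by tests (made-up data)
pct = patch_coverage(changed, covered)
passes = pct >= TARGET
```

Here four of the five changed lines are covered, so `pct` sits exactly at the target and `passes` is true; lines 40 and 41 are covered but unchanged, so they do not count toward the patch figure.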

sarahyurick · 2026-04-08T18:56:57Z

+| Don't | Do instead |
+|-------|-----------|
+| `pip install ...` | `uv sync --extra <group>` |
+| `python script.py` | `uv run python script.py` |


Does this always work?

uv run python script.py works within any uv-managed project directory (i.e. anywhere under the repo root after uv sync has been run). It invokes Python from the project virtual environment without requiring the venv to be manually activated — which is the recommended uv pattern. Happy to change the wording if you prefer uv run -- python or a note clarifying the prerequisite.

Probably more context is needed here. I thought we had had a ticket open because we weren't sure if uv run would always work?

In that case, should we stick to `uv pip install "nemo-curator"`?

No, `uv run` doesn't always work for us, and we still need to learn why. For now we've been running `source .venv/bin/activate && python script.py`.
In fact, `uv run` led to CI/CD failures, which Onur helped resolve here: https://github.com/NVIDIA-NeMo/Curator/pull/1557/changes#diff-fc966eacbe7f865fa8fee5be263e78011c386a7bd0be696a28271e169def612dR98-R101

- Add math_cpu and math_cuda12 to available extras list
- Move inputs()/outputs() from Optional to Expected overrides; every stage should implement them
- Clarify setup_on_node (download/verify models on disk) vs setup (load models into memory)
- Add note that process() must propagate _stage_perf and _metadata
- Update stage example to propagate _stage_perf and _metadata
- Correct executor ordering: XennaExecutor is the current default, RayDataExecutor is the common alternative

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
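The `_stage_perf` / `_metadata` propagation mentioned in this commit can be sketched roughly as follows. `Task` is a hypothetical dataclass standing in for the library's task type, and `uppercase_stage` is an illustrative function, not a NeMo Curator API. Only the two field names come from the commit message; what they hold is assumed here.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Task:
    """Hypothetical stand-in for a pipeline task."""
    data: list[str]
    # Assumed to hold per-stage performance records.
    _stage_perf: list[Any] = field(default_factory=list)
    # Assumed to hold free-form info attached upstream.
    _metadata: dict[str, Any] = field(default_factory=dict)


def uppercase_stage(task: Task) -> Task:
    """Illustrative process() body: transform data, carry bookkeeping forward."""
    return Task(
        data=[text.upper() for text in task.data],
        _stage_perf=task._stage_perf,  # propagate records from earlier stages
        _metadata=task._metadata,      # propagate upstream metadata
    )


incoming = Task(
    data=["a", "b"],
    _stage_perf=["reader: 0.1s"],
    _metadata={"source": "x.jsonl"},
)
outgoing = uppercase_stage(incoming)
```

If the output task were built with only `data`, both fields would silently reset to their empty defaults, which is the kind of loss the note in the commit guards against.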

sarahyurick

Thanks!

sarahyurick · 2026-04-08T20:53:19Z

+uv sync --extra audio_cpu --extra video_cpu
+```
+
+Available extras: `text_cpu`, `text_cuda12`, `audio_cpu`, `audio_cuda12`, `image_cpu`, `image_cuda12`, `video_cpu`, `video_cuda12`, `deduplication_cuda12`, `sdg_cpu`, `sdg_cuda12`, `interleaved_cpu`, `interleaved_cuda12`, `math_cpu`, `math_cuda12`, `all`.


I don't think deduplication should be included here; users should never install deduplication by itself:

Suggested change

-Available extras: `text_cpu`, `text_cuda12`, `audio_cpu`, `audio_cuda12`, `image_cpu`, `image_cuda12`, `video_cpu`, `video_cuda12`, `deduplication_cuda12`, `sdg_cpu`, `sdg_cuda12`, `interleaved_cpu`, `interleaved_cuda12`, `math_cpu`, `math_cuda12`, `all`.
+Available extras: `text_cpu`, `text_cuda12`, `audio_cpu`, `audio_cuda12`, `image_cpu`, `image_cuda12`, `video_cpu`, `video_cuda12`, `sdg_cpu`, `sdg_cuda12`, `interleaved_cpu`, `interleaved_cuda12`, `math_cpu`, `math_cuda12`, `all`.

sarahyurick · 2026-04-08T20:55:33Z

+| Don't | Do instead |
+|-------|-----------|
+| `pip install ...` | `uv sync --extra <group>` |
+| `python script.py` | `uv run python script.py` |


Probably more context is needed here. I thought we had had a ticket open because we weren't sure if uv run would always work?

sarahyurick · 2026-04-09T15:55:51Z

Let's add instructions about the copyright header too. The copyright should be added to all newly added, non-empty Python files.

praateekmahajan · 2026-04-09T16:26:38Z

+# Single GPU stage
+resources = Resources(gpu_memory_gb=40.0)


We haven't tested this code path as much, so let's remove this.

praateekmahajan · 2026-04-09T16:32:11Z

+## Quick Reference
+
+```bash
+uv sync --extra all          # install all dependencies


Suggested change

-uv sync --extra all # install all dependencies
+uv sync --all-extras --all-groups # install all dependencies and development dependencies
+source .venv/bin/activate # to activate the uv venv

arhamm1 requested a review from a team as a code owner April 8, 2026 18:37

arhamm1 requested review from sarahyurick and removed request for a team April 8, 2026 18:37

arhamm1 and others added 7 commits April 8, 2026 14:41

Update AGENTS.md

53843c4

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Arham Mehta <141266146+arhamm1@users.noreply.github.com>

Update AGENTS.md

9cbb503

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Arham Mehta <141266146+arhamm1@users.noreply.github.com>

Merge branch 'main' into docs/add-agents-md

31bc807
