Skip to content

docs: add AGENTS.md for AI agent and contributor guidance#1769

Open
arhamm1 wants to merge 10 commits intomainfrom
docs/add-agents-md
Open

docs: add AGENTS.md for AI agent and contributor guidance#1769
arhamm1 wants to merge 10 commits intomainfrom
docs/add-agents-md

Conversation

@arhamm1
Copy link
Copy Markdown
Contributor

@arhamm1 arhamm1 commented Apr 8, 2026

Summary

Adds AGENTS.md — a structured reference for AI coding agents (Claude Code, Copilot, Cursor, etc.) contributing to NeMo Curator.

What it covers

  • Environment setup with uv (all extras)
  • Test commands, GPU vs CPU split, 80% coverage requirement
  • Linting with ruff via pre-commit
  • Architecture overview: Task, ProcessingStage, CompositeStage, Pipeline, Executor
  • Step-by-step guide to adding a new stage with code examples
  • Stage authoring rules (name/resources must be class attributes, not @property; with_() pattern; idempotency requirement)
  • Commit requirements (DCO -s flag)
  • PR requirements including AI assistance disclosure and Co-authored-by trailer
  • Common mistakes table (bare pip, @property on stage attrs, skipping signoff, etc.)

Also removes AGENTS.md from .gitignore where it was mistakenly added by a template generator.

🤖 Generated with Claude Code

Provides a structured reference for AI coding agents covering environment
setup, test commands, the Task/ProcessingStage/Pipeline/Executor architecture,
stage authoring rules, commit/DCO requirements, and PR expectations.

Also removes AGENTS.md from .gitignore (was mistakenly added by a template).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
@arhamm1 arhamm1 requested a review from a team as a code owner April 8, 2026 18:37
@arhamm1 arhamm1 requested review from sarahyurick and removed request for a team April 8, 2026 18:37
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

Adds AGENTS.md — a structured guide for AI coding agents contributing to NeMo Curator — and removes it from .gitignore where a template generator had erroneously added it. Previous review feedback has been addressed: the docstring section now correctly reflects that ruff ignores the D group while CONTRIBUTING.md still asks for them, the pre-commit setup block no longer uses pip install pre-commit, and the 80% coverage claim is accurate (enforced via codecov.yml with target: 80% / if_ci_failed: error on patch coverage).

Confidence Score: 5/5

Safe to merge; all prior P1 concerns have been addressed and the one remaining finding is a minor self-contradiction (P2 style).

All three previously flagged issues (docstring rule, pip install pre-commit, coverage enforcement) have been resolved. The only remaining issue is the pip install uv self-contradiction, which is a P2 style suggestion and does not block merging.

AGENTS.md line 22-23 (pip install uv bootstrap self-contradiction)

Vulnerabilities

No security concerns identified. The file is documentation only and contains no code that processes user input, handles credentials, or interacts with external systems.

Important Files Changed

Filename Overview
AGENTS.md New agent/contributor guidance file; well-structured with accurate coverage enforcement, corrected pre-commit setup, and nuanced docstring note — one self-contradiction remains with pip install uv vs. the "never use bare pip" rule.
.gitignore Removes erroneously template-generated AGENTS.md entry — correct change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Pick location in stages modality/category] --> B[Subclass ProcessingStage]
    B --> C[Declare class attributes: name, resources, batch_size]
    C --> D[Implement process, inputs, outputs methods]
    D --> E[Export via __init__.py]
    E --> F[Write mirrored tests in tests/stages/modality/category]
    F --> G[pre-commit run --all-files]
    G --> H{Coverage >= 80pct?}
    H -->|Yes| I[git commit -sS]
    H -->|No| F
    I --> J[Open PR targeting main]
    J --> K[Codecov patch check: target 80pct, if_ci_failed=error]
Loading

Reviews (5): Last reviewed commit: "docs: address Sarah's review comments on..." | Re-trigger Greptile

Comment thread AGENTS.md
Comment thread AGENTS.md
Comment thread AGENTS.md
arhamm1 and others added 7 commits April 8, 2026 14:41
Replace `pip install pre-commit` with `uv sync --extra all` (which
already includes pre-commit via the dev group) and `uv run pre-commit`
to stay consistent with the "never use bare pip" rule stated earlier
in the same file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Arham Mehta <141266146+arhamm1@users.noreply.github.com>
pydocstyle rules ("D") are entirely disabled in ruff's ignore list, so
docstrings are never CI-enforced. Clarify that docstrings are a
CONTRIBUTING.md convention but ruff will not flag their absence.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Remove the duplicate "Set up pre-commit hooks" block introduced when
rebasing over an external commit that also addressed the same review
comment. Keep the uv sync --extra all approach and restore the missing
section separator before Running Tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Arham Mehta <141266146+arhamm1@users.noreply.github.com>
Mention the Ray actor pool executor alongside RayData and Xenna, with
a note that it is intended for specific workflows like deduplication
and most pipelines should prefer RayDataExecutor or XennaExecutor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
An external commit's patch was applied as literal text during rebase,
leaving raw diff markers (- / +) in the Architecture Overview section.
Remove the stale lines; the correct docstring clarification is already
present in the Linting section above.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Comment thread AGENTS.md

**Test layout mirrors source layout.** New stage in `nemo_curator/stages/text/filters/` → tests in `tests/stages/text/filters/`. GPU-only tests are marked `@pytest.mark.gpu`. CPU tests must pass without any GPU.

**Coverage requirement: 80% on changed lines.** CI will block PRs that don't meet this.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 80% coverage threshold is not enforced by CI

The document states "CI will block PRs that don't meet this," but the actual CI workflow (cicd-main.yml) runs coverage report purely for reporting — there is no --fail-under=80 flag and no fail_under key in pyproject.toml's [tool.coverage.report] section (which doesn't exist). An agent that believes this will be surprised when a PR with low coverage passes CI, or will waste time padding tests to hit an imaginary gate. Either remove the blocking-CI claim or add an actual enforcement step to keep the docs honest.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasdhc what do you suggest we do here?

Copy link
Copy Markdown
Contributor

@thomasdhc thomasdhc Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codecov configuration lives here: codecov.yml not in pyproject.toml

Comment thread AGENTS.md Outdated
Comment thread AGENTS.md
Comment thread AGENTS.md Outdated
Comment thread AGENTS.md Outdated
Comment thread AGENTS.md Outdated
Comment thread AGENTS.md Outdated
Comment thread AGENTS.md
Comment thread AGENTS.md
| Don't | Do instead |
|-------|-----------|
| `pip install ...` | `uv sync --extra <group>` |
| `python script.py` | `uv run python script.py` |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this always work?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uv run python script.py works within any uv-managed project directory (i.e. anywhere under the repo root after uv sync has been run). It invokes Python from the project virtual environment without requiring the venv to be manually activated — which is the recommended uv pattern. Happy to change the wording if you prefer uv run -- python or a note clarifying the prerequisite.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably more context is needed here. I thought we had had a ticket open because we weren't sure if uv run would always work?

Copy link
Copy Markdown
Contributor Author

@arhamm1 arhamm1 Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case; should we stick to uv pip install "nemo-curator"?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope this (uv run) doesn't work for us always; we still need to learn why it doesn't work.. For now we've been doing source .venv/bin/activate && python script.py
In fact uv run led to CI/CD failures, that Onur helped resolve here https://github.com/NVIDIA-NeMo/Curator/pull/1557/changes#diff-fc966eacbe7f865fa8fee5be263e78011c386a7bd0be696a28271e169def612dR98-R101

- Add math_cpu and math_cuda12 to available extras list
- Move inputs()/outputs() from Optional to Expected overrides; every
  stage should implement them
- Clarify setup_on_node (download/verify models on disk) vs setup
  (load models into memory)
- Add note that process() must propagate _stage_perf and _metadata
- Update stage example to propagate _stage_perf and _metadata
- Correct executor ordering: XennaExecutor is the current default,
  RayDataExecutor is the common alternative

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Copy link
Copy Markdown
Contributor

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment thread AGENTS.md
uv sync --extra audio_cpu --extra video_cpu
```

Available extras: `text_cpu`, `text_cuda12`, `audio_cpu`, `audio_cuda12`, `image_cpu`, `image_cuda12`, `video_cpu`, `video_cuda12`, `deduplication_cuda12`, `sdg_cpu`, `sdg_cuda12`, `interleaved_cpu`, `interleaved_cuda12`, `math_cpu`, `math_cuda12`, `all`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think deduplication should be included here, users should never install deduplication by itself:

Suggested change
Available extras: `text_cpu`, `text_cuda12`, `audio_cpu`, `audio_cuda12`, `image_cpu`, `image_cuda12`, `video_cpu`, `video_cuda12`, `deduplication_cuda12`, `sdg_cpu`, `sdg_cuda12`, `interleaved_cpu`, `interleaved_cuda12`, `math_cpu`, `math_cuda12`, `all`.
Available extras: `text_cpu`, `text_cuda12`, `audio_cpu`, `audio_cuda12`, `image_cpu`, `image_cuda12`, `video_cpu`, `video_cuda12`, `sdg_cpu`, `sdg_cuda12`, `interleaved_cpu`, `interleaved_cuda12`, `math_cpu`, `math_cuda12`, `all`.

Comment thread AGENTS.md
| Don't | Do instead |
|-------|-----------|
| `pip install ...` | `uv sync --extra <group>` |
| `python script.py` | `uv run python script.py` |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably more context is needed here. I thought we had had a ticket open because we weren't sure if uv run would always work?

Comment thread AGENTS.md
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add instructions about the copyright header too. The copyright should be added to all newly added, non-empty Python files.

Comment thread AGENTS.md
Comment on lines +257 to +258
# Single GPU stage
resources = Resources(gpu_memory_gb=40.0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haven't tested this code path as much so let's remove this

Comment thread AGENTS.md
## Quick Reference

```bash
uv sync --extra all # install all dependencies
Copy link
Copy Markdown
Contributor

@praateekmahajan praateekmahajan Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uv sync --extra all # install all dependencies
uv sync --all-extras --all-groups # install all dependencies and development dependencies
source .venv/bin/activate # to activate the uv venv

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants