docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte) by dhruvnathawani · Pull Request #544 · NVIDIA-NeMo/DataDesigner

dhruvnathawani · 2026-04-14T16:22:27Z

📋 Summary

Adds three new recipes implementing SDG pipelines used for Nemotron Nano training: structured data generation (multi-format schemas), prompt sensitivity (regex-verified preamble diversity), and InfiniByte (cross-source problem augmentation). Introduces a new "Model Usability" recipe category.

🔄 Changes

Added the following:

docs/assets/recipes/model_usability/structured_data.py — Five-stage pipeline: samplers → schema generation → user prompt → conversation pairs → best-of-3 structured output across JSON, YAML, XML, Markdown. Demonstrates SubcategorySamplerParams for conditional topic sampling.
docs/assets/recipes/model_usability/prompt_sensitivity.py — Seed-driven pipeline with 10 regex answer formats × 30 preambles, 7 diversity samplers, 3 LLM paraphrasing stages, and 4 LLM judges (format compliance, regex alignment, order coherence, preamble quality).
docs/assets/recipes/code_generation/infinibyte.py — Cross-source problem generation using HF streaming, random cross-join, LLMStructuredColumnConfig with Pydantic models for candidate generation/selection/evaluation, and solution generation.
docs/recipes/model_usability/structured_data.md — recipe doc page
docs/recipes/model_usability/prompt_sensitivity.md — recipe doc page
docs/recipes/code_generation/infinibyte.md — recipe doc page
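
The "random cross-join" step listed for infinibyte.py is named here but not shown. A minimal sketch of the general idea, assuming it means pairing randomly sampled rows from two sources instead of materializing the full Cartesian product. The helper name `random_cross_join`, the `a_`/`b_` column prefixes, and the sample rows are hypothetical and not taken from the recipe:

```python
import random


def random_cross_join(left, right, num_pairs, seed=0):
    """Pair randomly chosen rows from two lists of dicts.

    Hypothetical helper: samples pairs rather than building the full
    len(left) * len(right) product.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        a = rng.choice(left)
        b = rng.choice(right)
        # Prefix keys so columns from the two sources cannot collide.
        row = {f"a_{k}": v for k, v in a.items()}
        row.update({f"b_{k}": v for k, v in b.items()})
        pairs.append(row)
    return pairs


# Illustrative rows only; the recipe streams its sources from Hugging Face.
source_a = [
    {"id": "a1", "problem": "Reverse a linked list."},
    {"id": "a2", "problem": "Find the median of a stream."},
]
source_b = [{"id": "b1", "problem": "Count islands in a grid."}]

seed_rows = random_cross_join(source_a, source_b, num_pairs=4)
```

Sampling with a fixed seed keeps the seed dataset reproducible across runs, which may matter when comparing generations.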

🔧 Changed

docs/recipes/cards.md — three new recipe cards added
mkdocs.yml — nav entries for new Model Usability category and InfiniByte under Code Generation

🧪 Testing

structured_data.py --num-records 2 — 2/2 records, all columns generated
prompt_sensitivity.py --num-records 2 — 2/2 records, all 4 judges scored
infinibyte.py --num-records 2 --limit 100 — pipeline stages execute correctly (streaming + cross-join + structured columns all work; default nvidia-text model times out on long coding problems, documented in prerequisites)
uv run mkdocs build — no errors for new recipe files
make check-all-fix — all checks passed

✅ Checklist

Follows commit message conventions
Commits are signed off (DCO)
Architecture docs updated (if applicable)


github-actions · 2026-04-14T16:23:47Z

Docs preview: https://fa8b923f.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

greptile-apps · 2026-04-15T05:16:57Z

Greptile Summary

Adds three new Nemotron Nano recipe files (structured data, prompt sensitivity, InfiniByte), their documentation pages, recipe cards, and mkdocs.yml nav entries. The InfiniByte and structured-data recipes look correct; the prompt-sensitivity recipe has three bugs in its FORMAT_TEMPLATES regex patterns that will corrupt the Regex Alignment judge's ground-truth signal.

P1 — fmt_05 mismatched delimiter (prompt_sensitivity.py line 103): regex opens with \[ (escaped bracket) but closes with \) (escaped paren), so the pattern matches [Answer: X) instead of [Answer: X].
P2 — fmt_00/fmt_09 broken LaTeX regex (lines 78, 123): in r"\boxed{([.*?])}", \b is the word-boundary anchor, not a literal backslash, and [.*?] is a three-character class, not a wildcard. Neither will match \boxed{answer}.
P2 — fmt_08 wrong wildcard (line 118): ([.*?]) is a character class for ., *, ? only; should be (.*?) to capture arbitrary answer content.

Confidence Score: 4/5

Safe to merge after fixing the fmt_05 regex mismatch; the other two regex bugs are also worth correcting before this becomes a reference recipe.

One clear P1 bug (mismatched bracket/paren delimiter in fmt_05) will cause the Regex Alignment judge to evaluate against an incorrect ground-truth pattern for that format, producing wrong training signal. Two additional P2 regex bugs (broken LaTeX pattern and wrong char-class wildcard) compound the issue in the same file.

docs/assets/recipes/model_usability/prompt_sensitivity.py — all three regex issues are in FORMAT_TEMPLATES (lines 78, 103, 118, 123).

Important Files Changed

Filename	Overview
docs/assets/recipes/model_usability/prompt_sensitivity.py	New prompt-sensitivity recipe with seed format templates × 30 preambles; contains three regex bugs: fmt_05 has mismatched [...) delimiters (P1), fmt_00/fmt_09 use word-boundary \b instead of literal backslash and wrong [.*?] char class, and fmt_08 has same [.*?] char-class issue.
docs/assets/recipes/code_generation/infinibyte.py	New InfiniByte recipe implementing cross-source problem augmentation via HF streaming, random cross-join sampling, and a 5-stage structured LLM pipeline; logic is correct and follows established patterns.
docs/assets/recipes/model_usability/structured_data.py	New structured-data recipe with SubcategorySamplerParams conditional topic sampling and best-of-3 candidate generation; no issues found.
docs/recipes/cards.md	Adds three recipe cards (Structured Data, Prompt Sensitivity, InfiniByte) with correct links and download buttons.
mkdocs.yml	Adds nav entries for new Model Usability category and InfiniByte under Code Generation; structure is correct.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph IB["InfiniByte Pipeline"]
        IB0[HF Streaming Download] --> IB1[Random Cross-Join to CSV seed]
        IB1 --> IB2[Stage 1: combination_type sampler]
        IB2 --> IB3[Stage 2: LLMStructured Candidate generation x2]
        IB3 --> IB4[Stage 3: LLMStructured Best problem selection]
        IB4 --> IB5[Stage 4: LLMStructured Evaluation]
        IB5 --> IB6[Stage 5: LLMText Solution generation]
    end
    subgraph PS["Prompt Sensitivity Pipeline"]
        PS0[Seed: 10 formats x 30 preambles] --> PS1[Stage 1: 7 diversity samplers]
        PS1 --> PS2[Stage 2: LLMText Preamble paraphrase]
        PS2 --> PS3[Stage 3: LLMText Format instruction paraphrase]
        PS3 --> PS4[Stage 4: LLMText User prompt composition]
        PS4 --> PS5[Stage 5: 4 LLM judges]
    end
    subgraph SD["Structured Data Pipeline"]
        SD1[Stage 1: Samplers] --> SD2[Stage 2: LLMText Schema generation]
        SD2 --> SD3[Stage 3: LLMText User prompt]
        SD3 --> SD4[Stage 4: LLMText Conversation pairs]
        SD4 --> SD5[Stage 5: LLMText x3 Best-of-3 outputs]
    end
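
The seed node of the Prompt Sensitivity pipeline above ("10 formats x 30 preambles") reads like a full cross product. A rough sketch under that assumption, which this PR page does not confirm. The keys `format_key` and `preamble_id` and the generated placeholder values are illustrative only; the real templates and preambles live in prompt_sensitivity.py:

```python
from itertools import product

# Placeholder identifiers standing in for the real seed data.
format_keys = [f"fmt_{i:02d}" for i in range(10)]
preamble_ids = [f"preamble_{i:02d}" for i in range(30)]

# One seed row per (format, preamble) combination.
seed_rows = [
    {"format_key": fmt, "preamble_id": pre}
    for fmt, pre in product(format_keys, preamble_ids)
]
```

Under this assumption the seed table would have 300 rows before any sampler or LLM stage runs.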

Prompt To Fix All With AI

This is a comment left during a code review.
Path: docs/assets/recipes/model_usability/prompt_sensitivity.py
Line: 103

Comment:
**Mismatched closing delimiter in `fmt_05` regex**

The regex opens with `\[` (escaped left square bracket) but closes with `\)` (escaped right parenthesis), so it matches `[Answer: X)` instead of `[Answer: X]`. The seed instruction explicitly says "end with [Answer: X]", so the regex and instruction are inconsistent — the LLM judge that checks `Regex Alignment` will evaluate against a pattern that doesn't match what it was designed to produce.

```suggestion
        "output_regex": r"\[Answer:\s*([A-Za-z])\]",
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: docs/assets/recipes/model_usability/prompt_sensitivity.py
Line: 78

Comment:
**Incorrect regex for LaTeX `\boxed{}` format**

`r"\boxed{([.*?])}"` has two problems: `\b` in a raw string is the regex word-boundary anchor (not a literal backslash + `b`), and `[.*?]` is a character class matching only the three characters `.`, `*`, or `?`. The regex never matches the intended `\boxed{<answer>}` LaTeX output. The same issue appears in `fmt_09` (line 123). The correct pattern to match a literal `\boxed{…}` is:

```suggestion
        "output_regex": r"\\boxed\{(.*?)\}",
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: docs/assets/recipes/model_usability/prompt_sensitivity.py
Line: 118

Comment:
**`[.*?]` character class captures only `.`, `*`, or `?`**

`([.*?])` is a capture group containing a character class that matches exactly one of the three literal characters `.`, `*`, `?`. It won't capture any real answer content inside `<final_answer>…</final_answer>`. The intended lazy-match wildcard should be outside the brackets:

```suggestion
        "output_regex": r"<final_answer>\s*(.*?)\s*</final_answer>",
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Merge branch 'main' into dhruv/recipes/n..."

greptile-apps · 2026-04-15T05:17:01Z

docs/assets/recipes/model_usability/prompt_sensitivity.py

+    },
+    {
+        "format_key": "fmt_05",
+        "output_regex": r"\[Answer:\s*([A-Za-z])\)",


Mismatched closing delimiter in fmt_05 regex

The regex opens with \[ (escaped left square bracket) but closes with \) (escaped right parenthesis), so it matches [Answer: X) instead of [Answer: X]. The seed instruction explicitly says "end with [Answer: X]", so the regex and instruction are inconsistent — the LLM judge that checks Regex Alignment will evaluate against a pattern that doesn't match what it was designed to produce.
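
The mismatch described above can be reproduced with Python's `re` module. This is a small check, not code from the recipe; the sample sentence is made up for illustration:

```python
import re

# Pattern as submitted: opens with "[" but expects a closing ")".
broken = re.compile(r"\[Answer:\s*([A-Za-z])\)")
# Pattern from the suggested change: closes with "]".
fixed = re.compile(r"\[Answer:\s*([A-Za-z])\]")

sample = "The capital is Paris. [Answer: B]"

broken_match = broken.search(sample)  # no match on the intended format
fixed_match = fixed.search(sample)    # match; group(1) holds the letter

# The submitted pattern only fires on the malformed variant.
malformed_match = broken.search("[Answer: B)")
```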

Suggested change

-        "output_regex": r"\[Answer:\s*([A-Za-z])\)",
+        "output_regex": r"\[Answer:\s*([A-Za-z])\]",


greptile-apps · 2026-04-15T05:17:02Z

docs/assets/recipes/model_usability/prompt_sensitivity.py

+FORMAT_TEMPLATES = [
+    {
+        "format_key": "fmt_00",
+        "output_regex": r"\boxed{([.*?])}",


Incorrect regex for LaTeX \boxed{} format

r"\boxed{([.*?])}" has two problems: \b in a raw string is the regex word-boundary anchor (not a literal backslash + b), and [.*?] is a character class matching only the three characters ., *, or ?. The regex never matches the intended \boxed{<answer>} LaTeX output. The same issue appears in fmt_09 (line 123). The correct pattern to match a literal \boxed{…} is:
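
Both problems named above can be observed directly with Python's `re` module. The sample strings below are illustrative, not taken from the recipe:

```python
import re

# Pattern as submitted: "\b" is a word boundary and "[.*?]" is a character class.
broken = re.compile(r"\boxed{([.*?])}")
# Pattern from the suggested change: escaped backslash and braces, lazy wildcard.
fixed = re.compile(r"\\boxed\{(.*?)\}")

latex = r"The result is \boxed{42}."

broken_match = broken.search(latex)  # no match on real LaTeX output
fixed_match = fixed.search(latex)    # match; group(1) is the boxed content

# The submitted pattern only matches odd strings such as "oxed{?}".
odd_match = broken.search("oxed{?}")

# Caveat: the lazy wildcard stops at the first "}", so nested braces are cut short.
nested_match = fixed.search(r"\boxed{\frac{1}{2}}")
```

If answers can contain nested braces, such as `\boxed{\frac{1}{2}}`, the suggested pattern captures only up to the first closing brace, so a brace-aware extractor may be needed in that case.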

Suggested change

-        "output_regex": r"\boxed{([.*?])}",
+        "output_regex": r"\\boxed\{(.*?)\}",


greptile-apps · 2026-04-15T05:17:03Z

docs/assets/recipes/model_usability/prompt_sensitivity.py

+    },
+    {
+        "format_key": "fmt_08",
+        "output_regex": r"<final_answer>\s*([.*?])\s*</final_answer>",


[.*?] character class captures only ., *, or ?

([.*?]) is a capture group containing a character class that matches exactly one of the three literal characters ., *, ?. It won't capture any real answer content inside <final_answer>…</final_answer>. The intended lazy-match wildcard should be outside the brackets:
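
A short check of the character-class behaviour described above, plus one related point: without `re.DOTALL`, `.` does not match newlines. Whether answers in this recipe can span lines is an assumption here, and the sample strings are invented:

```python
import re

broken = re.compile(r"<final_answer>\s*([.*?])\s*</final_answer>")
fixed = re.compile(r"<final_answer>\s*(.*?)\s*</final_answer>")
# Assumption: if answers may span several lines, DOTALL lets "." cross newlines.
fixed_multiline = re.compile(
    r"<final_answer>\s*(.*?)\s*</final_answer>", re.DOTALL
)

one_line = "<final_answer> 3.14 </final_answer>"
two_lines = "<final_answer>\nx = 1\ny = 2\n</final_answer>"

broken_one = broken.search(one_line)          # no match: "3" is not ., * or ?
fixed_one = fixed.search(one_line)            # captures "3.14"
fixed_two = fixed.search(two_lines)           # no match without DOTALL
dotall_two = fixed_multiline.search(two_lines)  # captures both lines
```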

Suggested change

-        "output_regex": r"<final_answer>\s*([.*?])\s*</final_answer>",
+        "output_regex": r"<final_answer>\s*(.*?)\s*</final_answer>",


github-actions · 2026-04-15T05:21:03Z

Code Review: PR #544 — docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte)

Summary

This PR adds three new recipe scripts and accompanying documentation for Nemotron Nano training pipelines: Structured Data (multi-format schema generation), Prompt Sensitivity (regex-verified preamble diversity), and InfiniByte (cross-source problem augmentation). It also introduces a new "Model Usability" recipe category in the docs navigation. The changes are entirely in docs/ — no library code is modified.

Files changed: 8 (3 Python recipe scripts, 3 Markdown doc pages, cards.md, mkdocs.yml)
Lines: +1568, -0

Findings

High — Regex Patterns in `prompt_sensitivity.py` Have Multiple Issues

File: docs/assets/recipes/model_usability/prompt_sensitivity.py (lines 63–111 in the diff, the FORMAT_TEMPLATES list)

Several output_regex patterns appear incorrect. Since these are passed to LLM judges to evaluate "regex alignment," incorrect patterns will produce unreliable evaluation scores:

\boxed is not escaped correctly (fmt_00, fmt_09): In Python regex, \b is the word-boundary anchor, so r"\boxed{...}" matches word-boundary + oxed{...}, not the literal string \boxed{...}. To match the LaTeX \boxed{} literally, escape the backslash and the braces, e.g. r"\\boxed\{...\}".
[.*?] is a character class, not a wildcard (fmt_00, fmt_08, fmt_09): [.*?] matches a single character that is ., *, or ?. The likely intent is (.*?) (non-greedy capture group) or .+?.
fmt_05 has mismatched brackets: The regex r"\[Answer:\s*([A-Za-z])\)" opens with \[ (literal [) but closes with \) (literal )). The seed_format_instruction says "end with [Answer: X]" — so the closing delimiter should be \], not \).
fmt_00 and fmt_09 are duplicates: Both use the identical regex r"\boxed{([.*?])}". Their seed_format_instruction values differ, but the regex and format_key should likely differ too, or one should be removed.

Impact: These patterns are seed data for an LLM pipeline. The regex_alignment LLM judge evaluates whether generated format instructions match the output_regex. If the regex itself is wrong, the judge's scoring reference is unreliable, degrading the quality signal in the generated dataset.

Recommendation: Verify these patterns against the original Nemotron Nano pipeline. If they were copied verbatim from the training codebase, document that in a comment. If they are new to this recipe, fix them.
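
One way to act on the recommendation above is a small self-check that runs each template's regex against an output written in the intended format. This is only a sketch: the `example_output` field and the `find_mismatched` helper are hypothetical additions, and the three entries are a hand-picked subset, with fmt_05 deliberately left in its submitted form to show a failure:

```python
import re

# Hypothetical subset of FORMAT_TEMPLATES; "example_output" is an added,
# illustrative field that does not exist in the recipe.
templates = [
    {
        "format_key": "fmt_00",
        "output_regex": r"\\boxed\{(.*?)\}",
        "example_output": r"\boxed{C}",
    },
    {
        "format_key": "fmt_05",
        "output_regex": r"\[Answer:\s*([A-Za-z])\)",  # submitted, still broken
        "example_output": "[Answer: C]",
    },
    {
        "format_key": "fmt_08",
        "output_regex": r"<final_answer>\s*(.*?)\s*</final_answer>",
        "example_output": "<final_answer>C</final_answer>",
    },
]


def find_mismatched(items):
    """Return format keys whose regex fails on its own example output."""
    bad = []
    for item in items:
        if re.search(item["output_regex"], item["example_output"]) is None:
            bad.append(item["format_key"])
    return bad


mismatched = find_mismatched(templates)
```

A check like this runs without any LLM calls, so it could sit in front of the judges as a cheap guard on the seed data.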

Low — Doc Page Format Inconsistency

Files: docs/recipes/code_generation/infinibyte.md, docs/recipes/model_usability/structured_data.md, docs/recipes/model_usability/prompt_sensitivity.md

The new recipe doc pages include a # heading and a description paragraph before the download button:

# Nemotron Nano InfiniByte

Generate more diverse and complex training problems...

[Download Code ...]

Most existing recipe doc pages (e.g., text_to_python.md, product_info_qa.md) have only the download button and code include — no heading or description. Some newer ones (e.g., enterprise_text_to_sql.md) do include a heading and notes.

Impact: Minor visual inconsistency in docs. The added heading/description is arguably an improvement, providing better context to readers. Not blocking.

Informational — `hashlib.md5` in `infinibyte.py`

File: docs/assets/recipes/code_generation/infinibyte.py, line ~102 (within fetch_hf_dataset_to_df)

rec_id = rec.get("id") or hashlib.md5(text.encode("utf-8")).hexdigest()

MD5 is used as a fallback ID when a HuggingFace record has no id field. This is fine for deduplication/identification (not security), but some linters and security scanners flag hashlib.md5 usage. Consider hashlib.sha256 for forward-compatibility if the pipeline is adapted to stricter environments.
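
The sha256 suggestion above could look like the following sketch. The function name `record_id` is hypothetical; only the `rec.get("id") or ...` fallback expression mirrors the quoted line:

```python
import hashlib


def record_id(rec, text):
    """Use the record's own id when present, else a hash of its text."""
    # sha256 replaces md5 here; the digest is used for identification only.
    return rec.get("id") or hashlib.sha256(text.encode("utf-8")).hexdigest()


with_id = record_id({"id": "abc"}, "some problem text")
without_id = record_id({}, "some problem text")
```

Because the fallback uses `or`, a falsy id such as an empty string also falls through to the hash. On Python 3.9+, `hashlib.md5(..., usedforsecurity=False)` is another way to signal non-security use, though whether a given scanner accepts it is not something this PR establishes.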

Informational — Single Strategy Defined in InfiniByte

File: docs/assets/recipes/code_generation/infinibyte.py, lines ~55–57

STRATEGIES = {
    "ocr_omr": ("ocr", "omr"),
}

The --strategy CLI arg accepts choices=list(STRATEGIES.keys()) but only one strategy (ocr_omr) is defined. This is fine for a recipe — it demonstrates extensibility — but the CLI help could note that additional strategies can be added by extending the STRATEGIES dict.
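
A sketch of how the help text could point at the dict, assuming a plain `argparse` setup. The second `STRATEGIES` entry, the default value, the help wording, and the `build_parser` name are hypothetical and only illustrate the extension point:

```python
import argparse

STRATEGIES = {
    "ocr_omr": ("ocr", "omr"),
    # Hypothetical second entry, only to show how the dict could be extended.
    "omr_ocr": ("omr", "ocr"),
}


def build_parser():
    parser = argparse.ArgumentParser(description="InfiniByte recipe (sketch)")
    parser.add_argument(
        "--strategy",
        choices=list(STRATEGIES.keys()),
        default="ocr_omr",  # assumed default, not confirmed by the PR
        help="Source pair to combine. Add entries to STRATEGIES to enable more.",
    )
    return parser


args = build_parser().parse_args(["--strategy", "omr_ocr"])
source_a, source_b = STRATEGIES[args.strategy]
```

Deriving `choices` from the dict keys means a new strategy becomes selectable on the command line as soon as it is added to `STRATEGIES`.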

Positive Observations

Well-structured pipeline designs with clear ASCII architecture diagrams in each recipe's docstring.
Proper use of DataDesigner APIs: LLMStructuredColumnConfig with Pydantic models (infinibyte), LLMJudgeColumnConfig with Score rubrics (prompt_sensitivity), SubcategorySamplerParams for conditional sampling (structured_data), ExpressionColumnConfig for extracting structured fields.
SPDX license headers present on all new files.
PEP 723 inline script metadata (# /// script) correctly specified for uv run compatibility.
from __future__ import annotations included in all three recipe files (consistent with project style guide).
Consistent CLI interface across all three recipes (--model-alias, --num-records, --artifact-path).
Recipe cards in cards.md follow the established grid pattern with icons, descriptions, "Demonstrates" sections, and action buttons.
mkdocs.yml nav entries are properly structured, creating a new "Model Usability" category cleanly.

Verdict

Approve with suggestions. The recipes are well-crafted, demonstrate advanced DataDesigner features, and follow established patterns. The regex pattern issues in prompt_sensitivity.py are the primary concern — they should be verified against the original Nemotron Nano pipeline and corrected or annotated. The other findings are minor and non-blocking.

docs: add Nemotron Nano recipes (structured data, prompt sensitivity,…

38c2559

… infinibyte)

Merge branch 'main' into dhruv/recipes/nano

b1b34ec

dhruvnathawani changed the title ~~[DRAFT] docs: add Nemotron Nano recipes (structured data, prompt sensitivity,…)~~ docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte) Apr 15, 2026

dhruvnathawani marked this pull request as ready for review April 15, 2026 05:13

dhruvnathawani requested a review from a team as a code owner April 15, 2026 05:13

greptile-apps bot reviewed Apr 15, 2026

dhruvnathawani wants to merge 2 commits into main from dhruv/recipes/nano
