docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte) #544
dhruvnathawani wants to merge 2 commits into main
Docs preview: https://fa8b923f.dd-docs-preview.pages.dev
Greptile Summary
Adds three new Nemotron Nano recipe files (structured data, prompt sensitivity, InfiniByte), their documentation pages, recipe cards, and
| Filename | Overview |
|---|---|
| docs/assets/recipes/model_usability/prompt_sensitivity.py | New prompt-sensitivity recipe with 10 seed format templates × 30 preambles; contains three regex bugs: `fmt_05` has mismatched `[…)` delimiters (P1); `fmt_00`/`fmt_09` use `\b` (a word-boundary anchor) where a literal backslash is intended, plus a wrong `[.*?]` character class; and `fmt_08` has the same `[.*?]` character-class issue. |
| docs/assets/recipes/code_generation/infinibyte.py | New InfiniByte recipe implementing cross-source problem augmentation via HF streaming, random cross-join sampling, and a 5-stage structured LLM pipeline; logic is correct and follows established patterns. |
| docs/assets/recipes/model_usability/structured_data.py | New structured-data recipe with SubcategorySamplerParams conditional topic sampling and best-of-3 candidate generation; no issues found. |
| docs/recipes/cards.md | Adds three recipe cards (Structured Data, Prompt Sensitivity, InfiniByte) with correct links and download buttons. |
| mkdocs.yml | Adds nav entries for new Model Usability category and InfiniByte under Code Generation; structure is correct. |
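The regex issues summarized in the table could be caught by a small self-check that runs each template's pattern against a sample output written to follow its instruction. The sketch below is illustrative only: the `example_output` field and the `find_broken_templates` helper are assumptions, not part of the recipe, and `fmt_05` deliberately keeps the pattern flagged in the review so the check has something to report.

```python
import re

# Hypothetical templates: "example_output" is an assumed field added for this check.
FORMAT_TEMPLATES = [
    {
        "format_key": "fmt_00",
        "output_regex": r"\\boxed\{(.*?)\}",
        "example_output": r"\boxed{7}",
    },
    {
        "format_key": "fmt_05",
        # Pattern as flagged in the review (closes with an escaped parenthesis).
        "output_regex": r"\[Answer:\s*([A-Za-z])\)",
        "example_output": "[Answer: C]",
    },
    {
        "format_key": "fmt_08",
        "output_regex": r"<final_answer>\s*(.*?)\s*</final_answer>",
        "example_output": "<final_answer>7</final_answer>",
    },
]


def find_broken_templates(templates):
    """Return the format keys whose regex does not match their own example output."""
    broken = []
    for template in templates:
        if re.search(template["output_regex"], template["example_output"]) is None:
            broken.append(template["format_key"])
    return broken


broken_keys = find_broken_templates(FORMAT_TEMPLATES)
print(broken_keys)  # ['fmt_05']
```

A check of this shape only proves that a pattern accepts one well-formed sample; it says nothing about whether the pattern also rejects malformed outputs.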
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
subgraph IB["InfiniByte Pipeline"]
IB0[HF Streaming Download] --> IB1[Random Cross-Join to CSV seed]
IB1 --> IB2[Stage 1: combination_type sampler]
IB2 --> IB3[Stage 2: LLMStructured Candidate generation x2]
IB3 --> IB4[Stage 3: LLMStructured Best problem selection]
IB4 --> IB5[Stage 4: LLMStructured Evaluation]
IB5 --> IB6[Stage 5: LLMText Solution generation]
end
subgraph PS["Prompt Sensitivity Pipeline"]
PS0[Seed: 10 formats x 30 preambles] --> PS1[Stage 1: 7 diversity samplers]
PS1 --> PS2[Stage 2: LLMText Preamble paraphrase]
PS2 --> PS3[Stage 3: LLMText Format instruction paraphrase]
PS3 --> PS4[Stage 4: LLMText User prompt composition]
PS4 --> PS5[Stage 5: 4 LLM judges]
end
subgraph SD["Structured Data Pipeline"]
SD1[Stage 1: Samplers] --> SD2[Stage 2: LLMText Schema generation]
SD2 --> SD3[Stage 3: LLMText User prompt]
SD3 --> SD4[Stage 4: LLMText Conversation pairs]
SD4 --> SD5[Stage 5: LLMText x3 Best-of-3 outputs]
end
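The "Random Cross-Join to CSV seed" step in the InfiniByte subgraph can be pictured as sampling pairs from two problem sources without materializing the full cross product. The following is a minimal standard-library sketch under stated assumptions: the function name, the column names `problem_a` and `problem_b`, and the sample problems are all hypothetical, and the recipe's actual sampling logic may differ.

```python
import csv
import io
import random


def random_cross_join(left, right, n_pairs, seed=0):
    """Sample distinct (left, right) pairs without building the full cross product."""
    rng = random.Random(seed)
    total = len(left) * len(right)
    # Draw distinct indices into the virtual cross product, then decode each one.
    picks = rng.sample(range(total), k=min(n_pairs, total))
    return [(left[i // len(right)], right[i % len(right)]) for i in picks]


# Hypothetical problem statements standing in for two streamed sources.
source_a = ["sort a linked list", "parse a CSV line", "detect a cycle in a graph"]
source_b = ["compute a running median", "validate balanced brackets"]

pairs = random_cross_join(source_a, source_b, n_pairs=4)

# Write the sampled pairs as CSV seed rows (to an in-memory buffer here).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["problem_a", "problem_b"])
writer.writerows(pairs)
print(buffer.getvalue())
```

Sampling indices rather than pairs keeps memory use proportional to the number of requested rows, which matters when both sources are large.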
Prompt To Fix All With AI
This is a comment left during a code review.
Path: docs/assets/recipes/model_usability/prompt_sensitivity.py
Line: 103
Comment:
**Mismatched closing delimiter in `fmt_05` regex**
The regex opens with `\[` (escaped left square bracket) but closes with `\)` (escaped right parenthesis), so it matches `[Answer: X)` instead of `[Answer: X]`. The seed instruction explicitly says "end with [Answer: X]", so the regex and instruction are inconsistent — the LLM judge that checks `Regex Alignment` will evaluate against a pattern that doesn't match what it was designed to produce.
```suggestion
"output_regex": r"\[Answer:\s*([A-Za-z])\]",
```
How can I resolve this? If you propose a fix, please make it concise.
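
To make the delimiter mismatch concrete, the sketch below runs both patterns against a sample string. The sample text is a hypothetical model output written to follow the "end with [Answer: X]" instruction; only the two patterns come from the review.

```python
import re

# Pattern as currently written in the recipe (closes with an escaped parenthesis).
buggy = re.compile(r"\[Answer:\s*([A-Za-z])\)")
# Pattern from the suggestion (closes with an escaped square bracket).
fixed = re.compile(r"\[Answer:\s*([A-Za-z])\]")

# Hypothetical model output that follows the seed instruction.
sample = "The second option fits best. [Answer: B]"

buggy_match = buggy.search(sample)
fixed_match = fixed.search(sample)

print(buggy_match)           # None: the text never contains "B)"
print(fixed_match.group(1))  # B
```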
---
This is a comment left during a code review.
Path: docs/assets/recipes/model_usability/prompt_sensitivity.py
Line: 78
Comment:
**Incorrect regex for LaTeX `\boxed{}` format**
`r"\boxed{([.*?])}"` has two problems: `\b` in a raw string is the regex word-boundary anchor (not a literal backslash + `b`), and `[.*?]` is a character class matching only the three characters `.`, `*`, or `?`. The regex never matches the intended `\boxed{<answer>}` LaTeX output. The same issue appears in `fmt_09` (line 123). The correct pattern to match a literal `\boxed{…}` is:
```suggestion
"output_regex": r"\\boxed\{(.*?)\}",
```
How can I resolve this? If you propose a fix, please make it concise.
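
The word-boundary behavior described above is easy to confirm with Python's `re` module. In this sketch the sample string is a hypothetical model output; the two patterns are the ones quoted in the comment.

```python
import re

sample = r"The final result is \boxed{42}."  # hypothetical model output

# As written in the recipe: \b is a word boundary and [.*?] is a one-character class.
buggy = re.compile(r"\boxed{([.*?])}")
# As suggested: escaped backslash and braces, lazy wildcard inside the group.
fixed = re.compile(r"\\boxed\{(.*?)\}")

print(buggy.search(sample))           # None
print(fixed.search(sample).group(1))  # 42

# The buggy pattern only matches odd strings such as "oxed{?}".
odd_match = buggy.search("oxed{?}")
print(odd_match.group(1))             # ?
```

Note that the lazy `(.*?)` stops at the first closing brace, so an answer containing nested braces would be cut short; whether that matters depends on the outputs the recipe expects.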
---
This is a comment left during a code review.
Path: docs/assets/recipes/model_usability/prompt_sensitivity.py
Line: 118
Comment:
**`[.*?]` character class captures only `.`, `*`, or `?`**
`([.*?])` is a capture group containing a character class that matches exactly one of the three literal characters `.`, `*`, `?`. It won't capture any real answer content inside `<final_answer>…</final_answer>`. The intended lazy-match wildcard should be outside the brackets:
```suggestion
"output_regex": r"<final_answer>\s*(.*?)\s*</final_answer>",
```
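
A quick comparison of the character-class form and the lazy-wildcard form, using hypothetical sample outputs. The `re.DOTALL` variant at the end is an extra illustration rather than part of the suggestion: it only matters if answers are expected to span several lines, which is an assumption.

```python
import re

# Hypothetical model output using the <final_answer> tag format.
sample = "Reasoning goes here.\n<final_answer> Paris </final_answer>"

buggy = re.compile(r"<final_answer>\s*([.*?])\s*</final_answer>")
fixed = re.compile(r"<final_answer>\s*(.*?)\s*</final_answer>")

print(buggy.search(sample))           # None
print(fixed.search(sample).group(1))  # Paris

# Without re.DOTALL, "." does not match a newline, so a multi-line answer is missed.
multi_line = "<final_answer>\nline one\nline two\n</final_answer>"
fixed_dotall = re.compile(r"<final_answer>\s*(.*?)\s*</final_answer>", re.DOTALL)

print(fixed.search(multi_line))       # None
dotall_answer = fixed_dotall.search(multi_line).group(1)
print(repr(dotall_answer))            # 'line one\nline two'
```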
How can I resolve this? If you propose a fix, please make it concise.
Reviews (1): Last reviewed commit: "Merge branch 'main' into dhruv/recipes/n..."
Suggested change for `fmt_05` (line 103):

        },
        {
            "format_key": "fmt_05",
    -       "output_regex": r"\[Answer:\s*([A-Za-z])\)",
    +       "output_regex": r"\[Answer:\s*([A-Za-z])\]",
Suggested change for `fmt_00` (line 78; the same pattern appears in `fmt_09`, line 123):

    FORMAT_TEMPLATES = [
        {
            "format_key": "fmt_00",
    -       "output_regex": r"\boxed{([.*?])}",
    +       "output_regex": r"\\boxed\{(.*?)\}",
Suggested change for `fmt_08` (line 118):

        },
        {
            "format_key": "fmt_08",
    -       "output_regex": r"<final_answer>\s*([.*?])\s*</final_answer>",
    +       "output_regex": r"<final_answer>\s*(.*?)\s*</final_answer>",
Code Review: PR #544 - docs: add Nemotron Nano recipes (structured data, prompt sensitivity, infinibyte)
Summary
This PR adds three new recipe scripts and accompanying documentation for Nemotron Nano training pipelines: Structured Data (multi-format schema generation), Prompt Sensitivity (regex-verified preamble diversity), and InfiniByte (cross-source problem augmentation). It also introduces a new "Model Usability" recipe category in the docs navigation. The changes are entirely in
Files changed: 8 (3 Python recipe scripts, 3 Markdown doc pages,
Findings
High - Regex Patterns in
📋 Summary
Adds three new recipes implementing the SDG pipelines used for Nemotron Nano training: structured data generation (multi-format schemas), prompt sensitivity (regex-verified preamble diversity), and InfiniByte (cross-source problem augmentation). Introduces a new "Model Usability" recipe category.
🔄 Changes
Added the following:
SubcategorySamplerParams for conditional topic sampling.
coherence, preamble quality).
🔧 Changed
🧪 Testing
✅ Checklist