docs: add HF ecosystem context to push-to-hub dev notes #474
Add section on what datasets get on the Hub (Dataset Viewer, streaming, Viewer API), link to Hub search for DataDesigner datasets, and note that private datasets can be flipped to public.
All contributors have signed the DCO ✍️ ✅
Greptile Summary
This PR enriches the "Push Datasets to Hugging Face Hub" dev-note with Hugging Face ecosystem context contributed by a Hugging Face ML Librarian, and adds them as a co-author. The changes are purely documentation.

| Filename | Overview |
|---|---|
| docs/devnotes/.authors.yml | Adds Daniel van Strien (Hugging Face) as a new author entry — correct YAML structure, valid GitHub avatar URL. |
| docs/devnotes/posts/push-datasets-to-hugging-face-hub.md | Adds "What You Get on the Hub" section with Dataset Viewer, streaming snippet (with import), and Dataset Viewer API; updates Hub search URL to the clean ?library=datadesigner form; expands the private=True note. Both previously flagged issues (missing import, doubled URL prefix) are resolved in this revision. |
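
The Dataset Viewer API mentioned in the overview above can be queried over plain HTTP once a dataset is on the Hub. The following is a minimal sketch that only builds request URLs for the `/rows`, `/search`, and `/statistics` endpoints; the repo id `your-username/my-dataset`, the `default` config name, and the `train` split are assumptions for illustration and are not taken from the PR.

```python
from urllib.parse import urlencode

# Public base URL of the Hugging Face Dataset Viewer API.
VIEWER_API = "https://datasets-server.huggingface.co"


def viewer_url(endpoint: str, **params) -> str:
    """Join an endpoint name and query parameters into a full URL."""
    return f"{VIEWER_API}/{endpoint}?{urlencode(params)}"


def rows_url(dataset, config="default", split="train", offset=0, length=100):
    """URL for one page of rows; the API caps a page at 100 rows."""
    if not 1 <= length <= 100:
        raise ValueError("length must be between 1 and 100")
    return viewer_url("rows", dataset=dataset, config=config,
                      split=split, offset=offset, length=length)


def search_url(dataset, query, config="default", split="train"):
    """URL for a full-text search over one split."""
    return viewer_url("search", dataset=dataset, config=config,
                      split=split, query=query)


def statistics_url(dataset, config="default", split="train"):
    """URL for per-column statistics of one split."""
    return viewer_url("statistics", dataset=dataset, config=config, split=split)


if __name__ == "__main__":
    # Hypothetical repo id; replace with a dataset you pushed.
    print(rows_url("your-username/my-dataset", offset=100))
```

Fetching any of these URLs with an HTTP client returns JSON; private datasets additionally need an authorization header with a Hub token.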
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[DataDesigner push_to_hub] --> B[Hugging Face Hub]
B --> C[Dataset Viewer\nbrowsable in browser]
B --> D[Parquet files\non HF storage]
D --> E[Streaming\nload_dataset with streaming=True]
D --> F[Dataset Viewer API\nrow pagination / search / stats]
B --> G[Hub Search\n?library=datadesigner]
Reviews (3): Last reviewed commit: "fix: remove doubled library: prefix in H..."
  Tags default to `["synthetic", "datadesigner"]` plus whatever you pass in.
- Size category (`n<1K`, `1K<n<10K`, etc.) is auto-computed.
+ Size category (`n<1K`, `1K<n<10K`, etc.) is auto-computed. These tags make your
+ dataset discoverable in [Hub search](https://huggingface.co/datasets?library=library:datadesigner&sort=trending)
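
For readers unfamiliar with the tags and size category described in the lines above, here is a rough sketch of how such values could be derived. It is not DataDesigner's actual implementation: the function names `size_category` and `build_tags` are hypothetical, and the handling of exact boundaries (for example, whether 1,000 rows counts as `n<1K`) is an assumption.

```python
# Exclusive upper bounds paired with the Hub's size_categories labels.
_SIZE_BUCKETS = [
    (1_000, "n<1K"),
    (10_000, "1K<n<10K"),
    (100_000, "10K<n<100K"),
    (1_000_000, "100K<n<1M"),
    (10_000_000, "1M<n<10M"),
    (100_000_000, "10M<n<100M"),
    (1_000_000_000, "100M<n<1B"),
    (10_000_000_000, "1B<n<10B"),
    (100_000_000_000, "10B<n<100B"),
    (1_000_000_000_000, "100B<n<1T"),
]

# Defaults quoted in the dev note.
DEFAULT_TAGS = ["synthetic", "datadesigner"]


def size_category(num_rows: int) -> str:
    """Map a row count to a size label (boundary handling is an assumption)."""
    for upper_bound, label in _SIZE_BUCKETS:
        if num_rows < upper_bound:
            return label
    return "n>1T"


def build_tags(extra_tags=None) -> list:
    """Default tags first, then user tags, dropping duplicates in order."""
    merged = []
    for tag in DEFAULT_TAGS + list(extra_tags or []):
        if tag not in merged:
            merged.append(tag)
    return merged


if __name__ == "__main__":
    print(size_category(5_000), build_tags(["finance"]))
```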
Potentially doubled `library:` prefix in Hub search URL
The library query parameter appears to embed the full internal tag name library:datadesigner as its value, which may be redundant:
https://huggingface.co/datasets?library=library:datadesigner&sort=trending
HF Hub typically strips the library: prefix in the URL query parameter — the standard pattern used elsewhere is just the library name as the value, e.g. ?library=datadesigner. Worth verifying the link resolves to the intended filtered view, since an incorrect URL would return an empty result set to readers.
Path: docs/devnotes/posts/push-datasets-to-hugging-face-hub.md
Line: 246
Both https://huggingface.co/datasets?library=library:datadesigner&sort=trending and https://huggingface.co/datasets?library=datadesigner&sort=trending resolve to the same thing. We should perhaps keep the latter.
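
To make the two URL forms discussed above concrete, the sketch below builds the Hub search link and always emits the shorter `?library=datadesigner` form, whichever spelling is passed in. The helper name `hub_search_url` is hypothetical and exists only for illustration; it needs Python 3.9+ for `str.removeprefix`.

```python
from urllib.parse import urlencode


def hub_search_url(library: str, sort: str = "trending") -> str:
    """Build a Hub dataset-search URL filtered by library.

    Accepts either "datadesigner" or the internal tag form
    "library:datadesigner" and emits the clean form in both cases.
    """
    name = library.removeprefix("library:")
    query = urlencode({"library": name, "sort": sort})
    return f"https://huggingface.co/datasets?{query}"


if __name__ == "__main__":
    print(hub_search_url("library:datadesigner"))
    print(hub_search_url("datadesigner"))
```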
Thanks @davanstrien! Added one small nit, but lgtm!
You'll need to comment with "I have read the DCO document and I hereby sign the DCO." before you can merge though!
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
I have read the DCO document and I hereby sign the DCO.
9a352b8 into NVIDIA-NeMo:nmulepati/docs/dev-notes-push-to-huggingface-hub
Summary
`private=True` datasets can be flipped to public later
cc @nabinchha