Add .tar.gz / .tgz archive support for extension, preset, and workflow installation#2395

Open
Copilot wants to merge 15 commits into main from copilot/add-tar-gz-support
Conversation

Contributor

Copilot AI commented Apr 28, 2026

The extension, preset, and workflow download pipelines only accepted ZIP archives, blocking use of npm registries and CI artifact stores that serve tarballs natively.

Core utilities (extensions.py)

  • detect_archive_format(url, content_type="") — infers format from URL path extension (.zip, .tar.gz, .tgz) with Content-Type header fallback (application/gzip, application/x-gzip, application/x-tar+gzip)
  • safe_extract_tarball(archive_path, dest_dir, error_class) — safe extraction with:
    • Pre-extraction validation of all members: rejects absolute paths, .. traversal, symlinks, hard links, devices, and FIFOs
    • PAX headers (XHDTYPE, XGLTYPE, SOLARIS_XHDTYPE) and GNU metadata entries (GNU_TYPES: longname/longlink/sparse) are silently skipped — they carry no extractable payload and are emitted by many common archiving tools
    • Python 3.11: passes pre-validated safe_members list to extractall()
    • Python 3.12+: uses tarfile.data_filter for additional OS-level protection
    • tarfile.TarError/OSError are caught and re-raised as the caller-supplied error_class for consistent error handling

Both helpers are public (no underscore prefix) and imported directly by presets.py and __init__.py.
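
Going by the detection rules described above, the format inference can be sketched like this (illustrative only; the real helper in extensions.py may differ in detail):

```python
from urllib.parse import urlparse

# Content-Type values listed in the PR description as gzip indicators.
_GZIP_CONTENT_TYPES = {
    "application/gzip",
    "application/x-gzip",
    "application/x-tar+gzip",
}

def detect_archive_format(url: str, content_type: str = "") -> str:
    """Return "zip" or "tar.gz", or "" when the format is unknown."""
    # Inspect only the URL path so query strings don't confuse the check.
    path = urlparse(url).path.lower()
    if path.endswith(".zip"):
        return "zip"
    if path.endswith((".tar.gz", ".tgz")):
        return "tar.gz"
    # Fall back to the Content-Type header for extension-less URLs.
    base = content_type.split(";", 1)[0].strip().lower()
    if base in _GZIP_CONTENT_TYPES:
        return "tar.gz"
    return ""
```

Returning an empty string for unknown formats matches the call sites shown later in the review (`if not archive_fmt:` re-runs detection with the response's Content-Type).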

Extensions & presets

  • install_from_zip() on both managers now detects archive format from the file extension and dispatches to ZIP or tarball extraction accordingly — existing callers are unaffected
  • download_extension() / download_pack() detect format from the download URL (or Content-Type fallback) and persist the archive with the correct extension (.zip or .tar.gz); unknown formats are rejected with a clear error rather than silently defaulting to ZIP

__init__.py call sites

  • extension add --from: detects format from URL/Content-Type before saving; unknown formats rejected
  • preset add --from: same
  • extension update: inline manifest peek handles both ZIP and tar.gz; cache filename sanitized via Path(extension).name to prevent path traversal
  • workflow add (URL): extracts workflow.yml from the archive when the URL points to one; temp-file paths initialized before the write to avoid UnboundLocalError on disk-full
  • workflow add (local): accepts local .tar.gz/.tgz/.zip archive files (case-insensitive detection)
  • workflow add (catalog): same archive detection for catalog-sourced URLs

A shared _extract_workflow_yml(archive_path, fmt) helper handles both root-level and single-nested-directory layouts for both formats; tarfile.extractfile() handles are properly closed via context managers.

Tests

30 new tests across test_extensions.py, test_presets.py, and test_workflows.py covering:

  • test_extensions.py: format detection (URL + Content-Type), flat and nested tarball install, missing manifest errors, path traversal rejection, symlink rejection
  • test_presets.py: flat and nested tarball install, missing manifest errors, path traversal rejection, symlink rejection
  • test_workflows.py (TestWorkflowAddArchive, 9 CLI-level tests): local ZIP (flat/nested), local .tar.gz (flat/nested), .tgz alias, missing workflow.yml error cases, URL-based archive download for both ZIP and tar.gz formats
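
Negative tests like the path-traversal and symlink rejections above typically need a fixture that builds a malicious tarball on the fly; a minimal sketch (hypothetical helper, not from the PR):

```python
import io
import tarfile
import tempfile
from pathlib import Path

def make_tarball_with_member(member_name: str, data: bytes = b"x") -> Path:
    """Build a .tar.gz containing one arbitrarily named member."""
    path = Path(tempfile.mkdtemp()) / "fixture.tar.gz"
    with tarfile.open(path, "w:gz") as tf:
        info = tarfile.TarInfo(member_name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
    return path
```

A test can then assert that feeding `make_tarball_with_member("../../escape.txt")` to the install path raises the expected validation error.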

Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:04
Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:06
Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:09
Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:12
Copilot AI requested review from Copilot and removed request for Copilot April 28, 2026 18:14
Copilot AI changed the title from "[WIP] Add support for .tar.gz and .tgz archives" to "Add .tar.gz / .tgz archive support for extension, preset, and workflow installation" Apr 28, 2026
Copilot AI requested a review from mnriem April 28, 2026 18:16
Comment thread src/specify_cli/__init__.py Fixed
@mnriem mnriem marked this pull request as ready for review May 5, 2026 22:58
@mnriem mnriem requested review from Copilot and removed request for Copilot May 5, 2026 22:58
…mlink rejection test

Agent-Logs-Url: https://github.com/github/spec-kit/sessions/0469bac5-296a-46b6-b84e-eb33b0dc0fce

Co-authored-by: mnriem <15701806+mnriem@users.noreply.github.com>
Contributor Author

Copilot AI commented May 6, 2026

@copilot apply changes based on the comments in this thread

Applied both changes in commit e0495eb:

  • arc_tmp_path unbound local (__init__.py:5082, 5130, 5234): In all three workflow add code paths, the archive path variable is now initialized to None and assigned before the write() call, then the finally guard uses if arc_tmp_path is not None: ...unlink() to ensure safe cleanup even when the write fails.
  • Missing preset symlink test (test_presets.py): Added test_install_from_tar_gz_rejects_symlinks to TestPresetManager, mirroring the extension-side test — verifies that install_from_zip() raises PresetValidationError with "Symlinks" when the tarball contains a symlink entry.
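
The first fix above follows a standard init-before-write pattern, which can be sketched like this (function name and temp-file handling are illustrative):

```python
import os
import tempfile
from pathlib import Path

def stage_archive(archive_data: bytes) -> bytes:
    # Initialize the path variable before any write so the finally block
    # can guard on None instead of hitting an UnboundLocalError when
    # write_bytes() fails (e.g. on a full disk).
    arc_tmp_path = None
    try:
        fd, tmp_name = tempfile.mkstemp(suffix=".tar.gz")
        os.close(fd)
        arc_tmp_path = Path(tmp_name)
        arc_tmp_path.write_bytes(archive_data)  # may raise OSError
        return arc_tmp_path.read_bytes()
    finally:
        if arc_tmp_path is not None:
            arc_tmp_path.unlink(missing_ok=True)
```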

Contributor

Copilot AI left a comment


Copilot's findings

  • Files reviewed: 5/5 changed files
  • Comments generated: 2

Comment thread src/specify_cli/extensions.py Outdated
Comment on lines +111 to +116
def _detect_archive_format(url: str, content_type: str = "") -> str:
    """Detect archive format from URL path extension or Content-Type header.

    Args:
        url: URL or file path to inspect.
        content_type: Optional ``Content-Type`` header value from the HTTP response.
Comment on lines +4926 to +4931
def _extract_workflow_yml(archive_path: Path, archive_fmt: str) -> bytes:
    """Extract ``workflow.yml`` from a ZIP or ``.tar.gz`` archive.

    Searches the archive root and a single nested top-level subdirectory
    (e.g., ``repo-name-1.0/workflow.yml``).

@mnriem
Collaborator

mnriem commented May 6, 2026

@copilot apply changes based on the comments in this thread

…add archive CLI tests

Agent-Logs-Url: https://github.com/github/spec-kit/sessions/845e41d1-75e3-49fb-a580-a7fb805dd716

Co-authored-by: mnriem <15701806+mnriem@users.noreply.github.com>
Contributor Author

Copilot AI commented May 6, 2026

@copilot apply changes based on the comments in this thread

Applied both changes in commit 0a02369:

  • Public API (extensions.py): Renamed _detect_archive_format to detect_archive_format and _safe_extract_tarball to safe_extract_tarball (dropped the underscore prefix). All callers in extensions.py, presets.py, __init__.py, and tests/test_extensions.py have been updated accordingly.
  • CLI-level tests (test_workflows.py): Added TestWorkflowAddArchive with 9 tests covering local ZIP (flat/nested), local .tar.gz (flat/nested), .tgz alias, missing workflow.yml error cases, and URL-based download for both ZIP and tar.gz formats.

Contributor

Copilot AI left a comment


Copilot's findings

Comments suppressed due to low confidence (1)

src/specify_cli/__init__.py:3669

  • archive_path.write_bytes(archive_data) can raise OSError/IOError (disk full, permission issues), but this block only handles URLError. That means specify extension add --from ... may crash with a traceback instead of emitting a controlled CLI error. Add except OSError (or broader I/O handling) around the write/install steps and surface a clear message before exiting.
                    suffix = ".tar.gz" if archive_fmt == "tar.gz" else ".zip"
                    safe_name = Path(extension).name or "extension"
                    archive_path = download_dir / f"{safe_name}-url-download{suffix}"
                    archive_path.write_bytes(archive_data)
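
A hardening along the lines the reviewer suggests could look like this (function name and error message are illustrative, not taken from the PR):

```python
from pathlib import Path

def write_archive_or_exit(archive_path: Path, archive_data: bytes) -> None:
    try:
        archive_path.write_bytes(archive_data)
    except OSError as exc:
        # Disk-full / permission errors become a clean CLI failure
        # instead of an unhandled traceback.
        raise SystemExit(f"Error: failed to write {archive_path.name}: {exc}")
```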

  • Files reviewed: 6/6 changed files
  • Comments generated: 2

        if f is not None:
            with f:
                manifest_data = yaml.safe_load(f.read()) or {}
    except KeyError:
Comment thread src/specify_cli/__init__.py Outdated
Comment on lines +2653 to +2654
archive_path.write_bytes(archive_data)
manifest = manager.install_from_zip(archive_path, speckit_version, priority)
- Fix tar.gz extractfile() None fallback in extension_update: nested-directory
  search now runs whenever manifest_data is still None, not only on KeyError
- Add OSError handling around write_bytes in preset --from URL path
- Add OSError handling around write_bytes in extension --from URL path
Comment thread src/specify_cli/__init__.py Fixed
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 7, 2026 16:39
Contributor

Copilot AI left a comment


Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 2

Comment thread src/specify_cli/extensions.py Outdated
Comment on lines +204 to +208
# Only allow regular files and directories.
if not (member.isreg() or member.isdir()):
    raise error_class(
        f"Non-regular file in archive: {member.name}"
    )
Comment thread tests/test_workflows.py Outdated
Comment on lines +1891 to +1893
with __import__("unittest.mock", fromlist=["patch"]).patch.object(
    __import__("pathlib", fromlist=["Path"]).Path, "cwd", return_value=project_dir
):
@mnriem
Collaborator

mnriem commented May 7, 2026

@copilot apply changes based on the comments in this thread

…ock imports in workflow tests

Agent-Logs-Url: https://github.com/github/spec-kit/sessions/c1fcc1ff-8766-4d97-90a5-368447980acf

Co-authored-by: mnriem <15701806+mnriem@users.noreply.github.com>
Contributor Author

Copilot AI commented May 7, 2026

@copilot apply changes based on the comments in this thread

Applied both changes in commit 05798a9:

  • safe_extract_tarball() metadata members: PAX headers (XHDTYPE, XGLTYPE, SOLARIS_XHDTYPE) and GNU metadata entries (GNU_TYPES: longname/longlink/sparse) are now skipped instead of rejected. The _TAR_METADATA_TYPES tuple is defined once before the loop for efficiency. Symlinks, hard links, devices, and FIFOs are still rejected; only regular files and directories are extracted.
  • Test imports: Replaced the __import__(...) inline pattern in all 9 TestWorkflowAddArchive tests with from unittest.mock import patch / patch.object(Path, "cwd", ...) — consistent with the rest of the file.

Contributor

Copilot AI left a comment


Copilot's findings

Comments suppressed due to low confidence (2)

src/specify_cli/extensions.py:231

  • In the Python 3.12+ branch, safe_extract_tarball() validates members into safe_members (and intentionally skips PAX/GNU metadata entries), but then calls tf.extractall(dest_dir, filter="data") without restricting extraction to safe_members. This makes extraction behavior inconsistent across Python versions and can reintroduce metadata entries (or other unexpected members) into the extraction step. Pass members=safe_members in the 3.12+ branch as well (and keep the filter="data" hardening), or extract only the validated members explicitly.
            # Extract — use the "data" filter on Python 3.12+ for extra hardening.
            # On older versions pass only the pre-validated members so that no
            # unvetted entry (added concurrently or via a race) slips through.
            if sys.version_info >= (3, 12):
                tf.extractall(dest_dir, filter="data")  # type: ignore[call-arg]
            else:
                tf.extractall(dest_dir, members=safe_members)  # noqa: S202 — validated above
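
A fix along the lines the reviewer suggests (a sketch, not the merged code) keeps both branches restricted to the pre-validated members and layers the "data" filter on top where available:

```python
import sys
import tarfile
from pathlib import Path

def extract_validated(tf: tarfile.TarFile, dest_dir: Path, safe_members) -> None:
    # Always restrict extraction to the pre-validated members so behavior
    # is consistent across Python versions; on 3.12+ add the "data"
    # filter for OS-level hardening on top.
    if sys.version_info >= (3, 12):
        tf.extractall(dest_dir, members=safe_members, filter="data")
    else:
        tf.extractall(dest_dir, members=safe_members)
```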

src/specify_cli/__init__.py:3663

  • extension add --from detects archive format from the original URL and optional Content-Type, but does not validate the final URL after redirects. An HTTPS URL can redirect to non-HTTPS, and format detection can be wrong if the redirect changes the effective filename/extension. Consider validating response.geturl() against the HTTPS/localhost policy and running detect_archive_format() against that final URL.
                    with urllib.request.urlopen(from_url, timeout=60) as response:
                        if not archive_fmt:
                            content_type = response.headers.get("Content-Type", "")
                            archive_fmt = detect_archive_format(from_url, content_type)
                        archive_data = response.read()

  • Files reviewed: 6/6 changed files
  • Comments generated: 4

Comment on lines +176 to +181
_TAR_METADATA_TYPES = (
    tarfile.XHDTYPE,          # PAX extended header
    tarfile.XGLTYPE,          # PAX global extended header
    tarfile.SOLARIS_XHDTYPE,  # Solaris PAX extended header
    *tarfile.GNU_TYPES,       # GNU longname / longlink / sparse
)
Comment on lines +2635 to +2640
archive_fmt = _det_fmt(from_url)
try:
    with urllib.request.urlopen(from_url, timeout=60) as response:
        if not archive_fmt:
            content_type = response.headers.get("Content-Type", "")
            archive_fmt = _det_fmt(from_url, content_type)
Comment on lines +2155 to +2164
# Detect archive format from URL; resolve via Content-Type when needed.
archive_fmt = detect_archive_format(download_url)

# Download the archive
try:
    with self._open_url(download_url, timeout=60) as response:
        if not archive_fmt:
            content_type = response.headers.get("Content-Type", "")
            archive_fmt = detect_archive_format(download_url, content_type)
        archive_data = response.read()
Comment on lines +2316 to +2324
# Detect archive format from URL; resolve via Content-Type when needed.
archive_fmt = detect_archive_format(download_url)

try:
    with self._open_url(download_url, timeout=60) as response:
        if not archive_fmt:
            content_type = response.headers.get("Content-Type", "")
            archive_fmt = detect_archive_format(download_url, content_type)
        archive_data = response.read()
@mnriem
Collaborator

mnriem commented May 7, 2026

@copilot apply changes based on the comments in this thread


Development

Successfully merging this pull request may close these issues.

[Enhancement] Support .tar.gz / .tgz archives for extension, preset, and workflow installation

3 participants