TarReader: implement GNU sparse format 1.0 (PAX) #125283
…rse.name and GNU.sparse.realsize Co-authored-by: lewing <24063+lewing@users.noreply.github.com>
…sertions for data stream integrity Co-authored-by: lewing <24063+lewing@users.noreply.github.com>
|
@copilot What happens when we attempt to extract such archives? If we are to transparently handle the real name and size, then we should also expand the contents when reading the entry's DataStream. |
Extracting GNU sparse format 1.0 PAX entries with the current code produces a file with the correct name (from our fix) but incorrect content. Here's what happens: the
When Properly expanding the DataStream would require:
This is significantly more complex than the current fix. The current PR is scoped to fix the metadata issues ( I can revert the |
|
Tagging subscribers to this area: @dotnet/area-system-formats-tar |
|
@copilot Simply fixing the exposed metadata does not address the originally reported issue. You need to implement complete handling of the PAX 1.0 sparse format (only for reading; there is no need to be able to produce such archives). I think it is okay to avoid implementing the older sparse file formats, though. |
…SparseStream Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
…ed offsets, fix CanSeek order, fix async allocation Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
Done in commit
Code review fixes applied: |
|
@copilot There have been test failures with |
… after PaxTarEntry construction PaxTarEntry constructor calls ReplaceNormalAttributesWithExtended which now validates GNU.sparse.realsize, causing BuildRawSparseArchive(-1L) to throw InvalidDataException during archive construction rather than during GetNextEntry(). Fix: omit GNU.sparse.realsize from the constructor's attribute dict and inject it directly into the ExtendedAttributes dictionary after construction. This bypasses constructor validation while still writing the attribute correctly to the archive, so TarReader still throws for both negative and valid realsize values. Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/dad003fe-7b19-4842-8a68-43549b5b054a Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
Head branch was pushed to by a user without write access
Fixed in Root cause: Fix: omit
|
|
@copilot There are test failures; there seems to be an invalid cast in the test code. |
…arse.realsize Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/16bb63f4-599a-49e3-8d2e-38df5d863734 Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
Head branch was pushed to by a user without write access
Fixed in Fixed by using reflection to access |
… programmatic archive construction" The tar files pax-nil-sparse-data.tar, pax-nil-sparse-hole.tar, and pax-sparse-big.tar DO exist in the System.Formats.Tar.TestData NuGet package. This reverts commit d35983c. Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
Done in |
|
/ba-g wasm failures are unrelated |
TarReader was not handling GNU sparse format 1.0 PAX entries, causing ~46% of entries from bsdtar-created archives (e.g., .NET SDK tarballs built on macOS/APFS) to expose internal placeholder paths like GNUSparseFile.0/real-file.dll, incorrect sizes, and corrupted extracted content.
Changes
Added read-only support for GNU sparse format 1.0 (PAX). When TarReader encounters PAX extended attributes GNU.sparse.major=1 and GNU.sparse.minor=0, it resolves the real file name from GNU.sparse.name, reports the expanded size from GNU.sparse.realsize, and wraps the raw data stream with GnuSparseStream, which presents the expanded virtual file content (zeros for holes, packed data at correct offsets).
The sparse map embedded in the data section is parsed lazily on first Read, so _dataStream remains unconsumed during entry construction. This allows TarWriter.WriteEntry to round-trip the condensed sparse data correctly for both seekable and non-seekable source archives.
Older GNU sparse formats (0.0, 0.1) and write support are not addressed.
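To make the expansion step concrete, here is a minimal, language-neutral sketch in Python of how a GNU sparse 1.0 data section can be decoded. It assumes the layout described in the GNU tar manual: a newline-delimited decimal map (entry count, then offset/length pairs) padded to a 512-byte boundary, followed by the packed data. The names parse_sparse_map and expand_sparse are hypothetical and are not taken from this PR; the PR's GnuSparseStream is a C# stream that serves reads lazily, whereas this sketch materializes the whole file in memory for clarity.

```python
BLOCK_SIZE = 512  # tar archives are organized in 512-byte blocks


def parse_sparse_map(raw: bytes):
    """Parse the newline-delimited decimal sparse map at the start of `raw`.

    Returns (segments, data_start): segments is a list of (offset, length)
    pairs, and data_start is the index of the first packed data byte, i.e.
    the map length rounded up to the next block boundary.
    """
    pos = 0

    def next_number() -> int:
        nonlocal pos
        end = raw.index(b"\n", pos)  # ValueError if the map is truncated
        value = int(raw[pos:end].decode("ascii"))
        pos = end + 1
        return value

    count = next_number()
    segments = []
    for _ in range(count):
        offset = next_number()
        length = next_number()
        segments.append((offset, length))

    data_start = -(-pos // BLOCK_SIZE) * BLOCK_SIZE  # ceiling to a block
    return segments, data_start


def expand_sparse(raw: bytes, real_size: int) -> bytes:
    """Rebuild the expanded file: zeros for holes, packed data at its offsets."""
    segments, cursor = parse_sparse_map(raw)
    expanded = bytearray(real_size)  # zero-filled, so holes need no work
    for offset, length in segments:
        chunk = raw[cursor:cursor + length]
        if offset < 0 or len(chunk) != length or offset + length > real_size:
            raise ValueError("sparse segment does not fit the entry")
        expanded[offset:offset + length] = chunk
        cursor += length
    return bytes(expanded)
```

For example, a map of "2\n0\n3\n10\n2\n" followed (after padding) by the five bytes "abcXY" and a real size of 16 expands to "abc", seven zero bytes, "XY", and four zero bytes.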
Additional correctness and robustness improvements based on code review:
- GnuSparseStream now overrides DisposeAsync to properly await async disposal of the underlying raw stream.
- TarHeader.Read now throws InvalidDataException if GNU.sparse.realsize is negative, consistent with validation of the regular _size field.
- … offset > _realSize || length > _realSize - offset).
- FindSegmentFromCurrent uses binary search (O(log n)) for backward seeks, preserving the O(1) amortized forward scan for the common sequential-read case.
Testing
All existing tests pass. New TarReader.SparseFile.Tests.cs covers:
- … copyData × sync/async
- … GNU.sparse.realsize value throws InvalidDataException (sync and async). The test helper WriteSparseEntry omits GNU.sparse.realsize from the PaxTarEntry constructor's attribute dictionary (to avoid constructor-level validation) and instead injects it via reflection into the internal TarHeader.ExtendedAttributes dictionary after construction, so the archive can be built while ensuring TarReader.GetNextEntry() is the one that throws
- golang_tar test data files (pax-nil-sparse-data.tar, pax-nil-sparse-hole.tar, pax-sparse-big.tar) from the System.Formats.Tar.TestData NuGet package, plus programmatically constructed archives for additional coverage
- AdvancePastEntry_DoesNotCorruptNextEntry and CopySparseEntryToNewArchive_PreservesExpandedContent now share archive construction helpers (WriteSparseEntry, BuildSparseArchive, BuildRawSparseArchive) with the rest of the test suite
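The bounds condition and the binary-search lookup mentioned among the code-review improvements can be illustrated with a short Python sketch. This is not the PR's C# code: validate_segments and find_segment are hypothetical names, and the sketch assumes segments are sorted by offset and do not overlap. The comparison is written as `length > real_size - offset` rather than `offset + length > real_size` because, with fixed-width integers, the addition could overflow; Python integers do not overflow, so the form is kept only to mirror the condition quoted in the description.

```python
import bisect


def validate_segments(segments, real_size):
    """Reject a negative real size and segments outside [0, real_size]."""
    if real_size < 0:
        raise ValueError("real size must not be negative")
    for offset, length in segments:
        if offset < 0 or length < 0:
            raise ValueError("segment offset and length must not be negative")
        # Subtraction form: never computes offset + length.
        if offset > real_size or length > real_size - offset:
            raise ValueError("segment extends past the real size")


def find_segment(segments, position):
    """Return the index of the segment containing `position`, or None for a hole.

    Assumes `segments` is sorted by offset. A real stream would cache the
    list of start offsets instead of rebuilding it on every call.
    """
    starts = [offset for offset, _ in segments]
    index = bisect.bisect_right(starts, position) - 1  # last start <= position
    if index >= 0:
        offset, length = segments[index]
        if position < offset + length:
            return index
    return None
```

With segments [(0, 3), (10, 2)] and a real size of 16, positions 0 through 2 fall in segment 0, positions 10 and 11 fall in segment 1, and every other position is a hole that reads as zeros.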