feat: Add support for RollingManifestWriter#3009
Conversation
jayceslesar
left a comment
There was a problem hiding this comment.
matches the java it looks like to me
| class RollingManifestWriter: | ||
| """As opposed to ManifestWriter, a rolling writer could produce multiple manifest files.""" | ||
|
|
||
| _ROWS_DIVISOR = 250 |
There was a problem hiding this comment.
In Java we target at the size, controlled by commit.manifest.target-size-bytes. Any chance we can do something similar here? Or at least have some rationale behind this number.
There was a problem hiding this comment.
Yeah this is a direct port of the java implementations RollingmanifestWriter.ROW_DIVISOR.
this added divisor is basically a interval where we check the file size against target-size-bytes every 250 entries rather than on every write. But after some digging it looks like in PyIceberg tell() on both the PyArrow and fsspec OutputStream implementations is essentially free and use their own counters so maybe we can simplify this check and ignore the ROW_DIVISOR, but still doesn't hurt to have wdyt?
Rationale for this change
Adds support for the
RollingManifestWriterto split large commits into multiple manifest files instead of one giant file. This PR wraps the ManifestWriter and follows the Java implemenation of checking the size every 250 entries. The bulk of this work was done in #650.Next step is wiring this into
fast_append, but it's also useful for manifest repair operations, like deduplicating entries and rewriting manifests without blowing up the output sizesAre these changes tested?
Yes
Are there any user-facing changes?
No