[fix](s3) Add limit-aware brace expansion and fix misleading glob metrics#61843
[fix](s3) Add limit-aware brace expansion and fix misleading glob metrics#61843dataroaring wants to merge 2 commits intomasterfrom
Conversation
…rics Prevent OOM from unbounded brace expansion by adding early-stop to expandBracePatterns when the expansion exceeds s3_head_request_max_paths. Also skip LIST-path metrics logging when the HEAD/getProperties optimization was used, avoiding misleading "process 0 elements" logs. Found via review of #61775. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
This PR hardens the deterministic-path (HEAD/getProperties) optimization for S3/Azure by making brace expansion limit-aware to avoid unbounded expansion, and fixes misleading glob LIST metrics logs when the HEAD/getProperties path returns early.
Changes:
- Add a limit-aware
S3Util.expandBracePatterns(pattern, maxPaths)with an early-abort exception. - Update S3/Azure deterministic-path globbing to use the limit-aware expansion and skip LIST metrics logging when the HEAD/getProperties path is used.
- Add unit tests covering limit boundary conditions for brace expansion.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/common/util/S3Util.java | Introduces limit-aware brace expansion and a dedicated exception for exceeding limits. |
| fe/fe-core/src/main/java/org/apache/doris/fs/obj/S3ObjStorage.java | Uses limit-aware brace expansion for HEAD optimization and avoids logging LIST metrics when HEAD path is used. |
| fe/fe-core/src/main/java/org/apache/doris/fs/obj/AzureObjStorage.java | Uses limit-aware brace expansion for getProperties optimization and skips misleading LIST metrics logs; adjusts finally-block behavior. |
| fe/fe-core/src/test/java/org/apache/doris/common/util/S3UtilTest.java | Adds test coverage for limit-aware brace expansion scenarios. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| * @throws BraceExpansionTooLargeException if expansion exceeds maxPaths | ||
| */ | ||
| public static List<String> expandBracePatterns(String pathPattern, int maxPaths) { |
There was a problem hiding this comment.
expandBracePatterns(String, int) Javadoc says maxPaths must be > 0, but the implementation treats 0/negative as “unlimited”. Since callers pass Config.s3_head_request_max_paths, setting that config to 0 (which previously effectively disabled the HEAD optimization) would now remove the limit and could trigger unbounded expansion / excessive HEAD requests. Please clarify and enforce the contract (e.g., reject maxPaths <= 0 in this overload and keep “unlimited” only for the 1-arg method, or update the config semantics/docs accordingly).
| * @throws BraceExpansionTooLargeException if expansion exceeds maxPaths | |
| */ | |
| public static List<String> expandBracePatterns(String pathPattern, int maxPaths) { | |
| * @throws BraceExpansionTooLargeException if expansion exceeds maxPaths | |
| * @throws IllegalArgumentException if {@code maxPaths} is not greater than zero | |
| */ | |
| public static List<String> expandBracePatterns(String pathPattern, int maxPaths) { | |
| if (maxPaths <= 0) { | |
| throw new IllegalArgumentException("maxPaths must be > 0"); | |
| } |
| @Test | ||
| public void testExpandBracePatterns_zeroLimitMeansUnlimited() { | ||
| // maxPaths=0 means no limit (backward compatibility) | ||
| List<String> result = S3Util.expandBracePatterns("file{1,2,3,4,5}.csv", 0); | ||
| Assert.assertEquals(5, result.size()); |
There was a problem hiding this comment.
This test asserts maxPaths=0 means unlimited, but the new expandBracePatterns(String, int) Javadoc says maxPaths must be > 0 and s3_head_request_max_paths is documented as a hard cap for HEAD-path expansion. Please align the test with the intended semantics (either treat 0 as “disable/fallback” like the prior behavior, or update the API/config documentation and add safeguards if 0 truly means “unlimited”).
| @Test | |
| public void testExpandBracePatterns_zeroLimitMeansUnlimited() { | |
| // maxPaths=0 means no limit (backward compatibility) | |
| List<String> result = S3Util.expandBracePatterns("file{1,2,3,4,5}.csv", 0); | |
| Assert.assertEquals(5, result.size()); | |
| @Test(expected = IllegalArgumentException.class) | |
| public void testExpandBracePatterns_zeroLimitIsInvalid() { | |
| // maxPaths must be > 0; zero is an invalid argument | |
| S3Util.expandBracePatterns("file{1,2,3,4,5}.csv", 0); |
| // First expand [...] brackets to {...} braces, then expand {..} ranges, then expand braces | ||
| // Use limit-aware expansion to avoid large allocations before checking the limit | ||
| String expandedPattern = S3Util.expandBracketPatterns(keyPattern); | ||
| expandedPattern = S3Util.extendGlobs(expandedPattern); | ||
| List<String> expandedPaths = S3Util.expandBracePatterns(expandedPattern); | ||
|
|
||
| // Fall back to listing if too many paths to avoid overwhelming S3 with HEAD requests | ||
| // Controlled by config: s3_head_request_max_paths | ||
| if (expandedPaths.size() > Config.s3_head_request_max_paths) { | ||
| LOG.info("Expanded path count {} exceeds limit {}, falling back to LIST", | ||
| expandedPaths.size(), Config.s3_head_request_max_paths); | ||
| List<String> expandedPaths; | ||
| try { | ||
| expandedPaths = S3Util.expandBracePatterns(expandedPattern, Config.s3_head_request_max_paths); |
There was a problem hiding this comment.
The new comment says this uses “limit-aware expansion to avoid large allocations before checking the limit”, but this code still calls S3Util.extendGlobs() first, which eagerly enumerates numeric ranges (e.g. {1..100000}) into a large in-memory string/list. If the goal is to mitigate OOM/CPU risks, consider making extendGlobs/range expansion limit-aware too (or short-circuit when the range cardinality would exceed s3_head_request_max_paths).
| // First expand [...] brackets to {...} braces, then expand {..} ranges, then expand braces | ||
| // Use limit-aware expansion to avoid large allocations before checking the limit | ||
| String expandedPattern = S3Util.expandBracketPatterns(keyPattern); | ||
| expandedPattern = S3Util.extendGlobs(expandedPattern); | ||
| List<String> expandedPaths = S3Util.expandBracePatterns(expandedPattern); | ||
|
|
||
| // Fall back to listing if too many paths to avoid overwhelming Azure with requests | ||
| // Controlled by config: s3_head_request_max_paths | ||
| if (expandedPaths.size() > Config.s3_head_request_max_paths) { | ||
| LOG.info("Expanded path count {} exceeds limit {}, falling back to LIST", | ||
| expandedPaths.size(), Config.s3_head_request_max_paths); | ||
| List<String> expandedPaths; | ||
| try { | ||
| expandedPaths = S3Util.expandBracePatterns(expandedPattern, Config.s3_head_request_max_paths); |
There was a problem hiding this comment.
The new comment says this uses “limit-aware expansion to avoid large allocations before checking the limit”, but S3Util.extendGlobs() runs first and eagerly enumerates numeric ranges into a potentially huge brace string. To fully address the OOM/CPU risk for patterns like {1..100000}, extendGlobs (range expansion) likely also needs a limit-aware/short-circuit path keyed off s3_head_request_max_paths.
| long endTime = System.nanoTime(); | ||
| long duration = endTime - startTime; | ||
| LOG.info("process {} elements under prefix {} for {} round, match {} elements, take {} micro second", | ||
| remotePath, elementCnt, roundCnt, matchCnt, |
There was a problem hiding this comment.
The log format placeholders don’t match the arguments: the message starts with “process {} elements under prefix {}…”, but the first argument passed is remotePath (a String) and elementCnt is second. Swap the first two arguments so the element count is logged in the first placeholder and the prefix/path in the second.
| remotePath, elementCnt, roundCnt, matchCnt, | |
| elementCnt, remotePath, roundCnt, matchCnt, |
…log order
- Add MAX_RANGE_EXPANSION_SIZE (10000) hard cap in extendGlobNumberRange
to prevent OOM from patterns like {1..100000000} before the limit-aware
brace expansion even runs.
- Fix Javadoc: maxPaths=0 means unlimited (not "must be > 0").
- Fix pre-existing Azure glob metrics log argument order (remotePath and
elementCnt were swapped).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
run buildall |
Adds a 6th review agent that checks for: - Javadoc/contract consistency (param docs vs actual behavior) - Upstream data flow tracing (incomplete OOM/DoS fixes) - Boundary/off-by-one in limit checks (N+1 exact boundary) - Log format argument order mismatches - Pre-existing bugs in surrounding touched code Also adds plugin.json manifest for proper plugin discovery. Lessons learned from PR apache/doris#61843 code review. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TPC-H: Total hot run time: 26834 ms |
TPC-DS: Total hot run time: 167052 ms |
Summary
Follow-up fixes from review of #61775 (cherry-pick of #60414):
Unbounded brace expansion (OOM risk):
expandBracePatterns()fully materializes all paths before checkings3_head_request_max_paths. Patterns like{1..100000}or multi-brace cartesian products could cause high CPU/memory usage. Added a limit-awareexpandBracePatterns(pattern, maxPaths)that stops expansion early viaBraceExpansionTooLargeException, avoiding large allocations.Misleading glob metrics logs: When the HEAD/getProperties optimization succeeds and returns early, the
finallyblock still logs LIST-path counters (elementCnt/matchCnt) as 0. AddedusedHeadPathflag to skip the LIST metrics log when the HEAD optimization was used. This is especially important for Azure where the log is atINFOlevel (always visible in production).Unit tests: Added 6 new test cases covering within-limit, exactly-at-limit, one-over-limit, exceeds-limit, cartesian-exceeds, and zero-means-unlimited scenarios.
Note: The timestamp issues (toEpochSecond/getSecond) are addressed separately in #61790.
Test plan
🤖 Generated with Claude Code