[opt](segment) Ignore not-found segments in query and load paths#61844
[opt](segment) Ignore not-found segments in query and load paths#61844dataroaring wants to merge 8 commits intoapache:masterfrom
Conversation
When a segment file is missing (e.g., removed by GC or external cause), queries and loads now skip the missing segment instead of failing with IO error. Controlled by mutable config `ignore_not_found_segment` (default true), togglable at runtime via BE HTTP API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Add regression test that verifies ignore_not_found_segment behavior end-to-end using debug point injection on a real BE cluster. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR adds a BE runtime config to tolerate missing native OLAP segment files by skipping NOT_FOUND segments in several segment-loading paths, aiming to avoid user-visible query/load failures when segment files are missing.
Changes:
- Add mutable BE config
ignore_not_found_segment(defaulttrue) to control skipping behavior. - Skip NOT_FOUND segments in
SegmentLoader::load_segments,BetaRowset::load_segments, andLazyInitSegmentIterator::init/next_batch. - Add UT coverage via
IgnoreNotFoundSegmentTestusing a debug-point injection inBetaRowset::load_segment.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| be/src/common/config.h | Declares new mutable config ignore_not_found_segment. |
| be/src/common/config.cpp | Defines new mutable config with default true. |
| be/src/storage/segment/segment_loader.cpp | Skips NOT_FOUND segments during bulk segment loading. |
| be/src/storage/segment/lazy_init_segment_iterator.h | Makes next_batch()/current_block_row_locations() return EOF when inner iterator is null. |
| be/src/storage/segment/lazy_init_segment_iterator.cpp | Ignores NOT_FOUND in init() when config enabled (leaves inner iterator null). |
| be/src/storage/rowset/beta_rowset.cpp | Skips NOT_FOUND in load_segments(); adds debug-point injection for NOT_FOUND in load_segment(). |
| be/test/storage/segment/ignore_not_found_segment_test.cpp | New UTs covering config on/off behaviors for the three paths. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) { | ||
| LOG(WARNING) << "segment not found, skip it. seg_id=" << _segment_id; | ||
| // _inner_iterator remains nullptr, next_batch() will return EOF | ||
| return Status::OK(); |
There was a problem hiding this comment.
LazyInitSegmentIterator::init() logs only seg_id when skipping NOT_FOUND, which makes correlating the warning to a specific tablet/rowset difficult in production. Please include at least the rowset id (and ideally tablet id / segment path if available) in the warning, and consider rate-limiting if this can be hit repeatedly.
| if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) { | ||
| LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset_id() | ||
| << ", seg_id=" << seg_id; | ||
| seg_id++; | ||
| continue; |
There was a problem hiding this comment.
Like SegmentLoader, BetaRowset::load_segments() now logs a WARNING per missing segment. If a rowset has many missing segments (or the scan is retried), this can generate a large volume of logs. Consider rate limiting and/or logging a summary once per rowset (e.g., number of skipped segments) to reduce operational noise.
| RowsetMetaPB pb; | ||
| json2pb::JsonToProtoMessage(json, &pb); | ||
| pb.set_start_version(0); | ||
| pb.set_end_version(1); | ||
| pb.set_num_segments(num_segments); | ||
| rsm->init_from_pb(pb); |
There was a problem hiding this comment.
The return value from json2pb::JsonToProtoMessage() is ignored here. Other tests in the repo treat this as a bool and assert success; if the JSON format changes, silently proceeding can make failures harder to diagnose. Please capture the return value and ASSERT_TRUE/EXPECT_TRUE it (or use a helper that returns Status).
| void TearDown() override { | ||
| DebugPoints::instance()->clear(); | ||
| config::ignore_not_found_segment = _saved_ignore; | ||
| config::enable_debug_points = _saved_debug_points; | ||
|
|
||
| ExecEnv::GetInstance()->set_segment_loader(_saved_segment_loader); | ||
| delete _segment_loader; | ||
| _segment_loader = nullptr; |
There was a problem hiding this comment.
The test fixture uses DebugPoints::instance()->clear() in TearDown(), which removes all debug points for the entire process. This can create test-order coupling if other tests in the same binary rely on debug points. Prefer removing only the points added by this test (e.g., remove("BetaRowset::load_segment.return_not_found")) or using an RAII helper that adds/removes a named debug point and restores config::enable_debug_points.
| ASSERT_EQ(0, handle.get_segments().size()); | ||
| ASSERT_TRUE(handle.is_inited()); | ||
| } | ||
|
|
There was a problem hiding this comment.
The new UTs validate that SegmentLoader::load_segments() returns OK when segments are missing, but they don't cover any real caller that relies on SegmentCacheHandle::get_segments() being indexable by seg_id (e.g., code paths that do segments[seg_id]). Given the change in semantics, please add a regression test exercising such a caller (or explicitly validate the seg_id->vector-index contract you intend to provide).
| // Regression test: validate that SegmentCacheHandle::get_segments() can be indexed by seg_id | |
| // after a successful SegmentLoader::load_segments() call. This mimics real callers that do | |
| // `segments[seg_id]` and relies on the seg_id -> vector-index contract. | |
| TEST_F(IgnoreNotFoundSegmentTest, SegmentLoaderSegmentsIndexableBySegId) { | |
| config::ignore_not_found_segment = true; | |
| auto rowset = create_rowset(3); | |
| // Do not inject NOT_FOUND for this test; we want all segments to load successfully | |
| SegmentLoader loader(1024 * 1024, 100); | |
| SegmentCacheHandle handle; | |
| auto st = loader.load_segments(rowset, &handle, false); | |
| ASSERT_TRUE(st.ok()) << st; | |
| ASSERT_TRUE(handle.is_inited()); | |
| const auto& segments = handle.get_segments(); | |
| // Expect that we can index by seg_id in [0, 3) | |
| ASSERT_GE(segments.size(), 3); | |
| for (int seg_id = 0; seg_id < 3; ++seg_id) { | |
| // Real callers rely on segments[seg_id] being valid for each seg_id | |
| ASSERT_NE(nullptr, segments[seg_id]); | |
| } | |
| } |
| if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) { | ||
| LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset->rowset_id() | ||
| << ", seg_id=" << i; | ||
| continue; | ||
| } |
There was a problem hiding this comment.
SegmentLoader::load_segments() now skips NOT_FOUND segments and still returns OK, but many call sites implicitly assume cache_handle->get_segments() is indexed by seg_id and has size()==rowset->num_segments(). For example, BaseTablet::lookup_row_key() indexes segments[id] where id is a seg_id; with skipped entries this can become out-of-bounds or dereference the wrong segment in release builds. To make skipping safe, either (a) preserve the seg_id->index contract by resizing the segment vector to num_segments and storing loaded segments at segments[seg_id] (leaving nullptr for missing) and update callers to handle nullptr, or (b) change the API to return a mapping and update all callers to lookup by Segment::id() instead of positional indexing.
| if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) { | |
| LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset->rowset_id() | |
| << ", seg_id=" << i; | |
| continue; | |
| } |
| if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) { | ||
| LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset->rowset_id() | ||
| << ", seg_id=" << i; | ||
| continue; |
There was a problem hiding this comment.
The warning log inside the segment-load loop can spam logs when multiple segments are missing (and load_segments() is called frequently during query execution). Consider rate limiting (e.g., LOG_EVERY_N / LOG_EVERY_N_SECONDS) and/or aggregating counts per rowset to reduce operational noise while still preserving debuggability.
|
|
||
| #pragma once | ||
|
|
||
| #include "common/config.h" |
There was a problem hiding this comment.
lazy_init_segment_iterator.h adds #include "common/config.h", but this header doesn't reference config:: anywhere. Keeping this include in a widely-used header increases compile-time dependencies; it should be removed (the .cpp already includes what it needs transitively).
| #include "common/config.h" |
- Add rowset_id to LazyInitSegmentIterator skip log for better debuggability - Remove unused #include "common/config.h" from lazy_init_segment_iterator.h - Use specific DebugPoints::remove() instead of clear() in test TearDown - Assert json2pb::JsonToProtoMessage return value in test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
run buildall |
The forward declaration of BetaRowset is insufficient for calling rowset_id() in the log message. Add the full include. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
run buildall |
TPC-H: Total hot run time: 26832 ms |
TPC-DS: Total hot run time: 169589 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
The segment cache serves cached segments from baseline query, bypassing BetaRowset::load_segment entirely. Disable it so the debug point is hit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend the skip condition to cover both NOT_FOUND and IO_ERROR, so EIO errors from damaged/inaccessible segment files are also tolerated when ignore_not_found_segment is enabled. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
run buildall |
Summary
ignore_not_found_segment(defaulttrue), togglable at runtime via HTTP API.SegmentLoader::load_segments,LazyInitSegmentIterator::init, andBetaRowset::load_segments.Changes
config.h/cppignore_not_found_segmentconfig (mutable bool, default true)segment_loader.cppload_segments()catches NOT_FOUND and skips with warninglazy_init_segment_iterator.cpp/hinit()catches NOT_FOUND, returns OK with null iterator;next_batch()/current_block_row_locations()return EOF on nullbeta_rowset.cppload_segments()catches NOT_FOUND and skips;load_segment()gets DBUG injection pointignore_not_found_segment_test.cppTest plan
IgnoreNotFoundSegmentTest(9 cases) covering BetaRowset, SegmentLoader, and LazyInitSegmentIterator paths🤖 Generated with Claude Code