[opt](segment) Ignore not-found segments in query and load paths#61844

Open
dataroaring wants to merge 8 commits into apache:master from freemandealer:opt/ignore-not-found-segment

@dataroaring
Contributor

Summary

  • When a segment file is missing (e.g., removed by GC or by an external cause), queries and loads now skip the missing segment instead of failing with an IO error reported to the user.
  • Controlled by mutable BE config ignore_not_found_segment (default true), togglable at runtime via HTTP API.
  • Covers all three segment-loading paths: SegmentLoader::load_segments, LazyInitSegmentIterator::init, and BetaRowset::load_segments.
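The skip behavior described in the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `Status`, `Code`, `Segment`, `OpenFn`, and the global `ignore_not_found_segment` are simplified stand-ins invented here so the example compiles on its own.

```cpp
#include <functional>
#include <iostream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in types for illustration only; these are NOT the Doris classes.
enum class Code { OK, NOT_FOUND, IO_ERROR };

struct Status {
    Code code;
    std::string msg;
    static Status OK() { return Status{Code::OK, ""}; }
    bool ok() const { return code == Code::OK; }
};

struct Segment {
    int id;
};

// Hypothetical stand-in for the mutable BE config flag.
bool ignore_not_found_segment = true;

// Opens one segment; reports NOT_FOUND when the file is missing.
using OpenFn = std::function<Status(int seg_id, std::shared_ptr<Segment>* out)>;

// Loads segments [0, num_segments). A NOT_FOUND segment is skipped with a
// warning when the flag is on; any other error, or NOT_FOUND with the flag
// off, aborts the load and is returned to the caller.
Status load_segments(int num_segments, const OpenFn& open,
                     std::vector<std::shared_ptr<Segment>>* segments) {
    for (int seg_id = 0; seg_id < num_segments; ++seg_id) {
        std::shared_ptr<Segment> seg;
        Status st = open(seg_id, &seg);
        if (st.ok()) {
            segments->push_back(std::move(seg));
            continue;
        }
        if (st.code == Code::NOT_FOUND && ignore_not_found_segment) {
            std::cerr << "segment not found, skip it. seg_id=" << seg_id << '\n';
            continue;
        }
        return st;
    }
    return Status::OK();
}
```

Note that in this sketch the output vector ends up shorter than `num_segments` when a segment is skipped, so position `i` no longer implies segment id `i`.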

Changes

| File | Change |
| --- | --- |
| config.h/cpp | New ignore_not_found_segment config (mutable bool, default true) |
| segment_loader.cpp | load_segments() catches NOT_FOUND and skips with warning |
| lazy_init_segment_iterator.cpp/h | init() catches NOT_FOUND, returns OK with null iterator; next_batch()/current_block_row_locations() return EOF on null |
| beta_rowset.cpp | load_segments() catches NOT_FOUND and skips; load_segment() gets DBUG injection point |
| ignore_not_found_segment_test.cpp | 9 test cases covering all paths with config on/off |
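For the lazy-iterator row above, the "null inner iterator means EOF" idea might look like the sketch below. It assumes simplified stand-in types (`Code`, `InnerIterator`, `LazyIterator`, and a global flag) that are invented for this example, and it calls `init()` explicitly, which may differ from how the real class triggers initialization.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Stand-in types for illustration only; these are NOT the Doris classes.
enum class Code { OK, NOT_FOUND, END_OF_FILE };

// Hypothetical stand-in for the mutable BE config flag.
bool ignore_not_found_segment = true;

// Minimal inner iterator that yields a fixed number of one-row batches.
class InnerIterator {
public:
    explicit InnerIterator(int batches) : _remaining(batches) {}
    Code next_batch(std::vector<int>* block) {
        if (_remaining == 0) return Code::END_OF_FILE;
        --_remaining;
        block->push_back(42);
        return Code::OK;
    }

private:
    int _remaining;
};

// Builds the inner iterator in init(); the factory reports NOT_FOUND when
// the segment file is missing.
class LazyIterator {
public:
    using Factory = std::function<Code(std::unique_ptr<InnerIterator>*)>;
    explicit LazyIterator(Factory f) : _factory(std::move(f)) {}

    Code init() {
        Code c = _factory(&_inner);
        if (c == Code::NOT_FOUND && ignore_not_found_segment) {
            _inner.reset();   // leave the inner iterator null
            return Code::OK;  // the caller sees an empty segment
        }
        return c;
    }

    Code next_batch(std::vector<int>* block) {
        if (_inner == nullptr) return Code::END_OF_FILE;  // skipped segment
        return _inner->next_batch(block);
    }

private:
    Factory _factory;
    std::unique_ptr<InnerIterator> _inner;
};
```

The design choice here is that a skipped segment is indistinguishable from an empty one at the call site, which keeps callers unchanged but also hides the data loss from them.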

Test plan

  • New UT: IgnoreNotFoundSegmentTest (9 cases) covering BetaRowset, SegmentLoader, and LazyInitSegmentIterator paths
  • Verify config toggle works at runtime via BE HTTP API
  • Regression: existing segment_cache_test still passes

🤖 Generated with Claude Code

When a segment file is missing (e.g., removed by GC or external cause),
queries and loads now skip the missing segment instead of failing with
IO error. Controlled by mutable config `ignore_not_found_segment`
(default true), togglable at runtime via BE HTTP API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 28, 2026 07:12
@Thearas
Contributor

Thearas commented Mar 28, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Add regression test that verifies ignore_not_found_segment behavior
end-to-end using debug point injection on a real BE cluster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR adds a BE runtime config to tolerate missing native OLAP segment files by skipping NOT_FOUND segments in several segment-loading paths, aiming to avoid user-visible query/load failures when segment files are missing.

Changes:

  • Add mutable BE config ignore_not_found_segment (default true) to control skipping behavior.
  • Skip NOT_FOUND segments in SegmentLoader::load_segments, BetaRowset::load_segments, and LazyInitSegmentIterator::init/next_batch.
  • Add UT coverage via IgnoreNotFoundSegmentTest using a debug-point injection in BetaRowset::load_segment.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.

Show a summary per file
| File | Description |
| --- | --- |
| be/src/common/config.h | Declares new mutable config ignore_not_found_segment. |
| be/src/common/config.cpp | Defines new mutable config with default true. |
| be/src/storage/segment/segment_loader.cpp | Skips NOT_FOUND segments during bulk segment loading. |
| be/src/storage/segment/lazy_init_segment_iterator.h | Makes next_batch()/current_block_row_locations() return EOF when inner iterator is null. |
| be/src/storage/segment/lazy_init_segment_iterator.cpp | Ignores NOT_FOUND in init() when config enabled (leaves inner iterator null). |
| be/src/storage/rowset/beta_rowset.cpp | Skips NOT_FOUND in load_segments(); adds debug-point injection for NOT_FOUND in load_segment(). |
| be/test/storage/segment/ignore_not_found_segment_test.cpp | New UTs covering config on/off behaviors for the three paths. |

Comment on lines +45 to +48
    if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) {
        LOG(WARNING) << "segment not found, skip it. seg_id=" << _segment_id;
        // _inner_iterator remains nullptr, next_batch() will return EOF
        return Status::OK();

Copilot AI Mar 28, 2026

LazyInitSegmentIterator::init() logs only seg_id when skipping NOT_FOUND, which makes correlating the warning to a specific tablet/rowset difficult in production. Please include at least the rowset id (and ideally tablet id / segment path if available) in the warning, and consider rate-limiting if this can be hit repeatedly.

Copilot uses AI. Check for mistakes.
Comment on lines +255 to +259
    if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) {
        LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset_id()
                     << ", seg_id=" << seg_id;
        seg_id++;
        continue;

Copilot AI Mar 28, 2026

Like SegmentLoader, BetaRowset::load_segments() now logs a WARNING per missing segment. If a rowset has many missing segments (or the scan is retried), this can generate a large volume of logs. Consider rate limiting and/or logging a summary once per rowset (e.g., number of skipped segments) to reduce operational noise.
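The "summary once per rowset" idea in this comment could be sketched as below: collect the skipped segment ids during the loop and emit a single line afterwards. `summarize_skipped` is a hypothetical helper named for this example, and the message format is an assumption, not what the PR logs.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Builds one summary line for a rowset instead of one warning per segment.
// Returns an empty string when nothing was skipped, so the caller can avoid
// logging at all in the common case.
std::string summarize_skipped(const std::string& rowset_id,
                              const std::vector<int>& skipped_seg_ids) {
    if (skipped_seg_ids.empty()) return "";
    std::ostringstream os;
    os << "segments not found, skipped " << skipped_seg_ids.size()
       << " segment(s). rowset_id=" << rowset_id << ", seg_ids=[";
    for (std::size_t i = 0; i < skipped_seg_ids.size(); ++i) {
        if (i > 0) os << ',';
        os << skipped_seg_ids[i];
    }
    os << ']';
    return os.str();
}
```

A caller would push each skipped `seg_id` into a local vector inside the load loop and log the returned string once after the loop, if it is non-empty.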

Comment on lines +78 to +83
    RowsetMetaPB pb;
    json2pb::JsonToProtoMessage(json, &pb);
    pb.set_start_version(0);
    pb.set_end_version(1);
    pb.set_num_segments(num_segments);
    rsm->init_from_pb(pb);

Copilot AI Mar 28, 2026

The return value from json2pb::JsonToProtoMessage() is ignored here. Other tests in the repo treat this as a bool and assert success; if the JSON format changes, silently proceeding can make failures harder to diagnose. Please capture the return value and ASSERT_TRUE/EXPECT_TRUE it (or use a helper that returns Status).

Comment on lines +46 to +53
    void TearDown() override {
        DebugPoints::instance()->clear();
        config::ignore_not_found_segment = _saved_ignore;
        config::enable_debug_points = _saved_debug_points;

        ExecEnv::GetInstance()->set_segment_loader(_saved_segment_loader);
        delete _segment_loader;
        _segment_loader = nullptr;

Copilot AI Mar 28, 2026

The test fixture uses DebugPoints::instance()->clear() in TearDown(), which removes all debug points for the entire process. This can create test-order coupling if other tests in the same binary rely on debug points. Prefer removing only the points added by this test (e.g., remove("BetaRowset::load_segment.return_not_found")) or using an RAII helper that adds/removes a named debug point and restores config::enable_debug_points.
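An RAII helper of the kind suggested here might look like the following sketch. `Registry` is a hypothetical stand-in for a process-wide debug-point table, written only so the example is self-contained; the real `DebugPoints` class in Doris has its own interface, and `ScopedDebugPoint` is a name invented for illustration.

```cpp
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for a process-wide debug-point registry.
class Registry {
public:
    static Registry& instance() {
        static Registry r;
        return r;
    }
    void add(const std::string& name) { _points.insert(name); }
    void remove(const std::string& name) { _points.erase(name); }
    bool is_enabled(const std::string& name) const { return _points.count(name) > 0; }

private:
    std::set<std::string> _points;
};

// Adds one named debug point on construction and removes only that point on
// destruction, so points registered by other tests are left untouched.
class ScopedDebugPoint {
public:
    explicit ScopedDebugPoint(std::string name) : _name(std::move(name)) {
        Registry::instance().add(_name);
    }
    ~ScopedDebugPoint() { Registry::instance().remove(_name); }

    ScopedDebugPoint(const ScopedDebugPoint&) = delete;
    ScopedDebugPoint& operator=(const ScopedDebugPoint&) = delete;

private:
    std::string _name;
};
```

Declaring a `ScopedDebugPoint` as a local in a test body scopes the injection to that test, even when an assertion fails and the test returns early.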

        ASSERT_EQ(0, handle.get_segments().size());
        ASSERT_TRUE(handle.is_inited());
    }

Copilot AI Mar 28, 2026

The new UTs validate that SegmentLoader::load_segments() returns OK when segments are missing, but they don't cover any real caller that relies on SegmentCacheHandle::get_segments() being indexable by seg_id (e.g., code paths that do segments[seg_id]). Given the change in semantics, please add a regression test exercising such a caller (or explicitly validate the seg_id->vector-index contract you intend to provide).

Suggested change
    // Regression test: validate that SegmentCacheHandle::get_segments() can be indexed by seg_id
    // after a successful SegmentLoader::load_segments() call. This mimics real callers that do
    // `segments[seg_id]` and relies on the seg_id -> vector-index contract.
    TEST_F(IgnoreNotFoundSegmentTest, SegmentLoaderSegmentsIndexableBySegId) {
        config::ignore_not_found_segment = true;
        auto rowset = create_rowset(3);
        // Do not inject NOT_FOUND for this test; we want all segments to load successfully
        SegmentLoader loader(1024 * 1024, 100);
        SegmentCacheHandle handle;
        auto st = loader.load_segments(rowset, &handle, false);
        ASSERT_TRUE(st.ok()) << st;
        ASSERT_TRUE(handle.is_inited());
        const auto& segments = handle.get_segments();
        // Expect that we can index by seg_id in [0, 3)
        ASSERT_GE(segments.size(), 3);
        for (int seg_id = 0; seg_id < 3; ++seg_id) {
            // Real callers rely on segments[seg_id] being valid for each seg_id
            ASSERT_NE(nullptr, segments[seg_id]);
        }
    }

Comment on lines +96 to +100
    if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) {
        LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset->rowset_id()
                     << ", seg_id=" << i;
        continue;
    }

Copilot AI Mar 28, 2026

SegmentLoader::load_segments() now skips NOT_FOUND segments and still returns OK, but many call sites implicitly assume cache_handle->get_segments() is indexed by seg_id and has size()==rowset->num_segments(). For example, BaseTablet::lookup_row_key() indexes segments[id] where id is a seg_id; with skipped entries this can become out-of-bounds or dereference the wrong segment in release builds. To make skipping safe, either (a) preserve the seg_id->index contract by resizing the segment vector to num_segments and storing loaded segments at segments[seg_id] (leaving nullptr for missing) and update callers to handle nullptr, or (b) change the API to return a mapping and update all callers to lookup by Segment::id() instead of positional indexing.

Suggested change
    if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) {
        LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset->rowset_id()
                     << ", seg_id=" << i;
        continue;
    }
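Option (a) from the comment above, keeping `segments[seg_id]` valid by leaving a null slot for each missing segment, could be sketched like this. The types and the helpers `load_indexed` and `find_segment` are hypothetical and exist only for this example; they are not part of the Doris API.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Stand-in segment type for illustration only.
struct Segment {
    int id;
};

// Opens one segment; returns nullptr when the file is missing.
using OpenFn = std::function<std::shared_ptr<Segment>(int seg_id)>;

// Sizes the vector to num_segments up front so that segments[seg_id] always
// refers to seg_id; a missing segment is represented by nullptr.
std::vector<std::shared_ptr<Segment>> load_indexed(int num_segments, const OpenFn& open) {
    std::vector<std::shared_ptr<Segment>> segments(static_cast<std::size_t>(num_segments));
    for (int seg_id = 0; seg_id < num_segments; ++seg_id) {
        segments[static_cast<std::size_t>(seg_id)] = open(seg_id);
    }
    return segments;
}

// Bounds- and null-checked lookup that callers would use instead of
// indexing the vector directly.
const Segment* find_segment(const std::vector<std::shared_ptr<Segment>>& segments, int seg_id) {
    if (seg_id < 0 || static_cast<std::size_t>(seg_id) >= segments.size()) return nullptr;
    return segments[static_cast<std::size_t>(seg_id)].get();
}
```

The trade-off is that every caller must now handle a null entry, but positional indexing by `seg_id` can no longer silently land on the wrong segment.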

Comment on lines +96 to +99
    if (st.is<ErrorCode::NOT_FOUND>() && config::ignore_not_found_segment) {
        LOG(WARNING) << "segment not found, skip it. rowset_id=" << rowset->rowset_id()
                     << ", seg_id=" << i;
        continue;

Copilot AI Mar 28, 2026

The warning log inside the segment-load loop can spam logs when multiple segments are missing (and load_segments() is called frequently during query execution). Consider rate limiting (e.g., LOG_EVERY_N / LOG_EVERY_N_SECONDS) and/or aggregating counts per rowset to reduce operational noise while still preserving debuggability.
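As a rough illustration of count-based rate limiting, the sketch below reproduces the pattern glog's `LOG_EVERY_N` follows (emit on the 1st, (n+1)th, (2n+1)th ... occurrence) without depending on glog. `EveryN` is a hypothetical helper written for this example; in the BE code the glog macro itself would presumably be used at the log site.

```cpp
#include <atomic>
#include <cstdint>

// Returns true on the 1st, (n+1)th, (2n+1)th ... call. Thread-safe via an
// atomic counter; n == 0 is treated as 1 so every call logs.
class EveryN {
public:
    explicit EveryN(std::uint64_t n) : _n(n == 0 ? 1 : n) {}

    bool should_log() { return _count.fetch_add(1, std::memory_order_relaxed) % _n == 0; }

    // Total number of occurrences seen so far, logged or not.
    std::uint64_t total() const { return _count.load(std::memory_order_relaxed); }

private:
    const std::uint64_t _n;
    std::atomic<std::uint64_t> _count{0};
};
```

Including `total()` in the emitted message keeps the suppressed occurrences visible, which preserves some debuggability while cutting log volume.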


    #pragma once

    #include "common/config.h"

Copilot AI Mar 28, 2026

lazy_init_segment_iterator.h adds #include "common/config.h", but this header doesn't reference config:: anywhere. Keeping this include in a widely-used header increases compile-time dependencies; it should be removed (the .cpp already includes what it needs transitively).

Suggested change
#include "common/config.h"

- Add rowset_id to LazyInitSegmentIterator skip log for better debuggability
- Remove unused #include "common/config.h" from lazy_init_segment_iterator.h
- Use specific DebugPoints::remove() instead of clear() in test TearDown
- Assert json2pb::JsonToProtoMessage return value in test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
gavinchou previously approved these changes Mar 28, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) Mar 28, 2026
@github-actions
Contributor

PR approved by anyone and no changes requested.

@dataroaring
Contributor Author

run buildall

The forward declaration of BetaRowset is insufficient for calling
rowset_id() in the log message. Add the full include.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot removed the "approved" label (Indicates a PR has been approved by one committer.) Mar 30, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dataroaring
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26832 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 47f21911665268c40cbbd2b9d5ef4087bb831f35, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17637	4553	4290	4290
q2	q3	10639	835	533	533
q4	4682	362	258	258
q5	7566	1215	1030	1030
q6	177	174	148	148
q7	789	840	677	677
q8	9307	1483	1376	1376
q9	4913	4774	4792	4774
q10	6248	1950	1669	1669
q11	459	260	251	251
q12	739	593	462	462
q13	18019	2737	1941	1941
q14	229	224	215	215
q15	q16	742	736	674	674
q17	727	859	422	422
q18	6020	5452	5357	5357
q19	1124	993	636	636
q20	559	497	389	389
q21	4444	1861	1420	1420
q22	438	453	310	310
Total cold run time: 95458 ms
Total hot run time: 26832 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4783	4577	4649	4577
q2	q3	3903	4399	3862	3862
q4	899	1214	769	769
q5	4128	4390	4364	4364
q6	187	179	143	143
q7	1750	1679	1529	1529
q8	2498	2719	2619	2619
q9	7788	7426	7541	7426
q10	3819	4035	3591	3591
q11	511	430	441	430
q12	507	594	470	470
q13	2537	2884	2069	2069
q14	296	318	283	283
q15	q16	741	761	708	708
q17	1179	1396	1366	1366
q18	7142	7008	6694	6694
q19	916	931	1005	931
q20	2084	2177	1997	1997
q21	4027	3577	3282	3282
q22	463	432	463	432
Total cold run time: 50158 ms
Total hot run time: 47542 ms

@doris-robot

TPC-DS: Total hot run time: 169589 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 47f21911665268c40cbbd2b9d5ef4087bb831f35, data reload: false

query5	4369	632	524	524
query6	351	230	208	208
query7	4222	473	265	265
query8	344	242	253	242
query9	8695	2659	2711	2659
query10	507	427	340	340
query11	6988	5080	4842	4842
query12	177	127	129	127
query13	1279	490	350	350
query14	5844	3683	3526	3526
query14_1	2852	2852	2826	2826
query15	209	203	177	177
query16	1011	512	432	432
query17	876	733	598	598
query18	2437	439	342	342
query19	214	204	179	179
query20	134	124	124	124
query21	214	145	109	109
query22	13480	14958	14529	14529
query23	16529	16422	15672	15672
query23_1	15680	15621	15682	15621
query24	7222	1633	1228	1228
query24_1	1239	1267	1239	1239
query25	606	507	438	438
query26	1241	268	153	153
query27	2785	487	301	301
query28	4493	1855	1834	1834
query29	841	607	514	514
query30	300	227	195	195
query31	1014	969	885	885
query32	94	79	76	76
query33	534	367	302	302
query34	897	884	535	535
query35	657	692	610	610
query36	1097	1138	1071	1071
query37	143	103	91	91
query38	2943	2883	2871	2871
query39	850	848	815	815
query39_1	816	806	791	791
query40	241	160	143	143
query41	68	67	65	65
query42	261	260	258	258
query43	241	246	227	227
query44	
query45	206	193	188	188
query46	961	999	613	613
query47	2138	2135	2023	2023
query48	308	318	233	233
query49	653	498	407	407
query50	703	287	227	227
query51	4142	4163	4041	4041
query52	264	278	258	258
query53	297	335	287	287
query54	338	313	291	291
query55	104	89	89	89
query56	348	351	335	335
query57	1938	1734	1704	1704
query58	293	282	276	276
query59	2808	2926	2772	2772
query60	335	332	331	331
query61	162	163	156	156
query62	641	602	533	533
query63	310	283	277	277
query64	4997	1319	1061	1061
query65	
query66	1464	459	363	363
query67	24335	24324	24240	24240
query68	
query69	401	322	284	284
query70	992	969	964	964
query71	339	310	297	297
query72	2844	2744	2485	2485
query73	545	559	311	311
query74	9573	9533	9428	9428
query75	2867	2767	2471	2471
query76	2307	1032	681	681
query77	359	404	316	316
query78	10993	11181	10497	10497
query79	1083	849	578	578
query80	1446	648	556	556
query81	545	263	232	232
query82	1364	150	126	126
query83	389	265	259	259
query84	304	125	110	110
query85	1085	526	488	488
query86	428	313	302	302
query87	3178	3179	2992	2992
query88	3537	2660	2632	2632
query89	429	383	346	346
query90	1867	185	179	179
query91	180	174	141	141
query92	83	78	73	73
query93	916	862	498	498
query94	596	324	305	305
query95	598	408	324	324
query96	651	531	231	231
query97	2477	2516	2457	2457
query98	241	220	215	215
query99	1042	1005	920	920
Total cold run time: 250436 ms
Total hot run time: 169589 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.90% (19959/37729)
Line Coverage 36.44% (187227/513755)
Region Coverage 32.70% (145248/444177)
Branch Coverage 33.86% (63634/187927)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.78% (26523/36949)
Line Coverage 54.63% (279801/512208)
Region Coverage 51.64% (231484/448297)
Branch Coverage 53.11% (100111/188493)

dataroaring and others added 3 commits March 30, 2026 20:10
The segment cache serves cached segments from baseline query, bypassing
BetaRowset::load_segment entirely. Disable it so the debug point is hit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend the skip condition to cover both NOT_FOUND and IO_ERROR,
so EIO errors from damaged/inaccessible segment files are also
tolerated when ignore_not_found_segment is enabled.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
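The widened skip condition described in the commit message above can be expressed as a single predicate. This is a sketch with a stand-in `Code` enum and a hypothetical `is_skippable` helper; the PR itself checks the Doris `Status` type inline at each call site.

```cpp
// Stand-in error codes for illustration only.
enum class Code { OK, NOT_FOUND, IO_ERROR, CORRUPTION };

// True when a failed segment load may be skipped rather than propagated:
// the flag must be on and the error must be NOT_FOUND or IO_ERROR.
constexpr bool is_skippable(Code code, bool ignore_not_found_segment) {
    return ignore_not_found_segment &&
           (code == Code::NOT_FOUND || code == Code::IO_ERROR);
}
```

Centralizing the check in one place would keep the three loading paths consistent if the set of tolerated errors changes again.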
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dataroaring
Contributor Author

run buildall

6 participants