
[fix](metric) Change partition near-limit metrics from counters to gauges#61845

Merged
dataroaring merged 2 commits into master from fix/partition-near-limit-gauge on Mar 30, 2026

Conversation

@dataroaring
Contributor

Summary

  • Changed auto_partition_near_limit_count and dynamic_partition_near_limit_count from LongCounterMetric (monotonically increasing) to GaugeMetricImpl<Long> so they correctly decrease when the near-limit condition resolves
  • Moved metric computation from inline event-driven increments (in FrontendServiceImpl and DynamicPartitionUtil) to TabletStatMgr's periodic all-table scan, which already iterates all tables and partitions under read locks
  • Metric names are preserved for monitoring compatibility; semantics changed from "cumulative event count" to "current number of tables near the limit"
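The counter-to-gauge distinction described above can be sketched in plain Java. This is an illustrative sketch only (hypothetical class names, not the Doris `MetricRepo` API): a counter only accumulates, so a resolved near-limit condition never lowers the reported value, while a gauge is overwritten with the freshly computed count on every scan cycle and can drop back to 0.

```java
import java.util.concurrent.atomic.AtomicLong;

public class NearLimitMetricSketch {
    // Old semantics: monotonically increasing event count.
    static final AtomicLong counter = new AtomicLong();
    // New semantics: overwritten with the current state each scan.
    static final AtomicLong gauge = new AtomicLong();

    // Called once per periodic scan with the number of tables
    // currently above the near-limit threshold.
    static void onScan(long tablesNearLimit) {
        counter.addAndGet(tablesNearLimit); // keeps growing
        gauge.set(tablesNearLimit);         // tracks current state
    }

    public static void main(String[] args) {
        onScan(3); // three tables near the limit
        onScan(0); // condition resolved
        System.out.println("counter=" + counter.get()); // counter=3
        System.out.println("gauge=" + gauge.get());     // gauge=0
    }
}
```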

Test plan

  • Verify auto_partition_near_limit_count increases when an auto-partition table exceeds 80% of max_auto_partition_num
  • Verify the gauge decreases back to 0 after dropping partitions below the 80% threshold (within one tablet_stat_update_interval_second cycle)
  • Verify dynamic_partition_near_limit_count behaves the same for dynamic partition tables
  • Verify existing Prometheus/Grafana dashboards continue to scrape the metric names without changes
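The 80% threshold used in the test plan mirrors the PR's integer check `tablePartitionNum > limit * 8L / 10`; the `8L` keeps the multiply in long arithmetic so large limits cannot overflow `int`. A minimal sketch with a hypothetical limit value (the actual `max_auto_partition_num` setting may differ):

```java
public class ThresholdSketch {
    // Same integer arithmetic as the check added in TabletStatMgr.
    static boolean isNearLimit(int partitionNum, int limit) {
        return partitionNum > limit * 8L / 10;
    }

    public static void main(String[] args) {
        int limit = 2000; // hypothetical max_auto_partition_num
        System.out.println(isNearLimit(1600, limit)); // false: exactly 80% does not trigger
        System.out.println(isNearLimit(1601, limit)); // true
    }
}
```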

🤖 Generated with Claude Code

…uges

The auto_partition_near_limit_count and dynamic_partition_near_limit_count
metrics were LongCounterMetric (monotonically increasing) and never
decreased, even when the near-limit condition resolved. Changed them to
GaugeMetricImpl updated by TabletStatMgr's periodic table scan, so they
reflect the current number of tables near the partition limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 28, 2026 07:13
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring
Contributor Author

run buildall


Copilot AI left a comment


Pull request overview

Updates FE partition near-limit metrics so they reflect the current number of tables approaching partition-count limits, rather than a monotonically increasing event count, by moving computation into the existing periodic TabletStatMgr scan.

Changes:

  • Changed auto_partition_near_limit_count / dynamic_partition_near_limit_count from counters to gauges (names preserved).
  • Removed event-driven metric increments from FrontendServiceImpl and DynamicPartitionUtil.
  • Added near-limit table counting to TabletStatMgr's periodic all-table scan, publishing the results to MetricRepo gauges.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java Removes counter increment on auto-partition near-limit warning.
fe/fe-core/src/main/java/org/apache/doris/metric/MetricRepo.java Changes near-limit metrics to gauges while preserving metric names and registers them.
fe/fe-core/src/main/java/org/apache/doris/common/util/DynamicPartitionUtil.java Removes counter increment on dynamic-partition near-limit warning during property analysis.
fe/fe-core/src/main/java/org/apache/doris/catalog/TabletStatMgr.java Computes and sets the new near-limit gauge values during periodic stats scan.


Comment on lines +166 to +172
List<Partition> allPartitions = olapTable.getAllPartitions();
partitionCount += allPartitions.size();
int tablePartitionNum = allPartitions.size();
partitionCount += tablePartitionNum;
// Check if this table's partition count is near the limit (>80%)
if (olapTable.getPartitionInfo().enableAutomaticPartition()) {
int limit = Config.max_auto_partition_num;
if (tablePartitionNum > limit * 8L / 10) {

Copilot AI Mar 28, 2026


OlapTable.getAllPartitions() includes temp partitions, but partition limit enforcement/warnings use getPartitionNum() (non-temp). Using allPartitions.size() here can inflate the near-limit gauges (and partitionCount) due to temp partitions and make the metric inconsistent with the actual limit checks. Consider computing near-limit using olapTable.getPartitionNum() / getPartitions() (or otherwise excluding temp partitions) while still iterating getAllPartitions() for size/stat aggregation if needed.

@doris-robot

TPC-H: Total hot run time: 26423 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4a81380ddef69c1021e85718d7110a2fa7445dc7, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17588	4424	4315	4315
q2	q3	10647	785	517	517
q4	4670	356	252	252
q5	7560	1200	1014	1014
q6	176	178	153	153
q7	774	857	680	680
q8	9284	1527	1330	1330
q9	4841	4757	4741	4741
q10	6238	1908	1657	1657
q11	470	268	239	239
q12	696	582	471	471
q13	18021	2706	1935	1935
q14	224	226	210	210
q15	q16	731	735	671	671
q17	729	841	493	493
q18	6072	5340	5101	5101
q19	1108	970	642	642
q20	549	509	371	371
q21	4559	1857	1381	1381
q22	341	294	250	250
Total cold run time: 95278 ms
Total hot run time: 26423 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	5006	4729	4693	4693
q2	q3	3902	4329	3848	3848
q4	855	1195	783	783
q5	4057	4410	4327	4327
q6	196	175	143	143
q7	1787	1654	1509	1509
q8	2512	2733	2536	2536
q9	7664	7363	7548	7363
q10	3801	3958	3642	3642
q11	506	434	436	434
q12	481	587	448	448
q13	2441	2931	1997	1997
q14	269	291	263	263
q15	q16	704	757	727	727
q17	1170	1347	1361	1347
q18	7225	6929	6541	6541
q19	888	888	889	888
q20	2084	2179	2102	2102
q21	4023	3500	3323	3323
q22	525	494	414	414
Total cold run time: 50096 ms
Total hot run time: 47328 ms

@dataroaring
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26444 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4a81380ddef69c1021e85718d7110a2fa7445dc7, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17576	4493	4259	4259
q2	q3	10645	775	523	523
q4	4675	357	252	252
q5	7554	1189	1029	1029
q6	182	176	144	144
q7	774	830	673	673
q8	9327	1469	1304	1304
q9	4941	4787	4729	4729
q10	6313	1933	1635	1635
q11	478	252	243	243
q12	746	578	457	457
q13	18084	2695	1929	1929
q14	226	227	214	214
q15	q16	732	754	665	665
q17	739	816	452	452
q18	6152	5467	5235	5235
q19	1188	982	625	625
q20	527	502	370	370
q21	4448	1838	1399	1399
q22	342	307	395	307
Total cold run time: 95649 ms
Total hot run time: 26444 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4859	4724	4678	4678
q2	q3	3879	4356	3862	3862
q4	877	1219	783	783
q5	4074	4407	4304	4304
q6	198	181	143	143
q7	1836	1706	1554	1554
q8	2471	2710	2590	2590
q9	7533	7321	7389	7321
q10	3806	3968	3621	3621
q11	510	435	414	414
q12	511	606	470	470
q13	2493	3006	2049	2049
q14	279	303	286	286
q15	q16	724	768	886	768
q17	1298	1405	1382	1382
q18	7170	6815	6754	6754
q19	923	901	880	880
q20	2067	2245	2051	2051
q21	3948	3493	3349	3349
q22	462	437	396	396
Total cold run time: 49918 ms
Total hot run time: 47655 ms

@doris-robot

TPC-DS: Total hot run time: 168787 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 4a81380ddef69c1021e85718d7110a2fa7445dc7, data reload: false

query5	4339	639	496	496
query6	324	232	214	214
query7	4237	466	265	265
query8	351	240	225	225
query9	8678	2656	2679	2656
query10	542	388	319	319
query11	6967	5115	4864	4864
query12	213	128	122	122
query13	1279	459	341	341
query14	5808	3697	3443	3443
query14_1	2832	2826	2808	2808
query15	206	200	178	178
query16	990	488	455	455
query17	918	730	632	632
query18	2457	455	368	368
query19	217	219	192	192
query20	131	125	127	125
query21	213	133	112	112
query22	13331	15360	14495	14495
query23	16688	16330	15848	15848
query23_1	16478	15794	15615	15615
query24	7208	1609	1217	1217
query24_1	1224	1225	1225	1225
query25	545	454	402	402
query26	1248	257	152	152
query27	2799	476	293	293
query28	4466	1838	1829	1829
query29	849	562	487	487
query30	307	245	196	196
query31	1004	960	868	868
query32	78	75	70	70
query33	517	327	284	284
query34	902	876	517	517
query35	642	679	602	602
query36	1090	1140	953	953
query37	131	94	83	83
query38	2927	2912	2855	2855
query39	870	832	816	816
query39_1	777	816	788	788
query40	231	148	173	148
query41	64	59	61	59
query42	263	255	258	255
query43	235	245	232	232
query44	
query45	195	203	178	178
query46	899	1012	618	618
query47	2120	2126	2083	2083
query48	318	332	230	230
query49	637	468	390	390
query50	707	276	206	206
query51	4044	4085	4076	4076
query52	263	278	259	259
query53	292	342	308	308
query54	297	264	261	261
query55	90	86	88	86
query56	314	317	319	317
query57	1930	1801	1456	1456
query58	291	272	268	268
query59	2811	2954	2732	2732
query60	344	344	320	320
query61	158	153	154	153
query62	616	589	544	544
query63	314	276	284	276
query64	5138	1304	1006	1006
query65	
query66	1510	461	356	356
query67	24406	24462	24227	24227
query68	
query69	396	310	289	289
query70	941	971	960	960
query71	339	298	298	298
query72	2828	2923	2557	2557
query73	532	548	325	325
query74	9625	9548	9407	9407
query75	2863	2751	2459	2459
query76	2302	1027	684	684
query77	357	404	301	301
query78	10945	11133	10401	10401
query79	1090	758	576	576
query80	1333	609	539	539
query81	544	259	230	230
query82	984	156	121	121
query83	332	263	250	250
query84	299	115	101	101
query85	919	501	450	450
query86	415	337	289	289
query87	3171	3086	3008	3008
query88	3513	2626	2638	2626
query89	423	366	343	343
query90	2014	188	171	171
query91	173	164	139	139
query92	74	76	67	67
query93	1033	840	504	504
query94	646	312	304	304
query95	596	346	382	346
query96	632	510	230	230
query97	2461	2474	2384	2384
query98	256	232	226	226
query99	999	989	931	931
Total cold run time: 250000 ms
Total hot run time: 168787 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 66.67% (14/21) 🎉
Increment coverage report
Complete coverage report

@gavinchou
Contributor

Review: Cloud Mode Not Handled

Critical Bug: Cloud mode not handling partition near-limit metrics

The PR changes only update TabletStatMgr.java, but CloudTabletStatMgr.java is missing the same changes. In cloud mode, Doris uses CloudTabletStatMgr instead of TabletStatMgr (see CloudEnvFactory.java:198-199).

The Issue

CloudTabletStatMgr.updateStatInfo() (lines 156-334) has nearly identical logic to TabletStatMgr.runAfterCatalogReady(), but it is missing:

  1. Counter variables (lines 131-132 in TabletStatMgr):
long autoPartitionNearLimitCount = 0L;
long dynamicPartitionNearLimitCount = 0L;
  2. Partition limit check logic (lines 167-181 in TabletStatMgr):
int tablePartitionNum = allPartitions.size();
partitionCount += tablePartitionNum;
// Check if this table's partition count is near the limit (>80%)
if (olapTable.getPartitionInfo().enableAutomaticPartition()) {
    int limit = Config.max_auto_partition_num;
    if (tablePartitionNum > limit * 8L / 10) {
        autoPartitionNearLimitCount++;
    }
}
if (olapTable.dynamicPartitionExists()
        && olapTable.getTableProperty().getDynamicPartitionProperty().getEnable()) {
    int limit = Config.max_dynamic_partition_num;
    if (tablePartitionNum > limit * 8L / 10) {
        dynamicPartitionNearLimitCount++;
    }
}
  3. Gauge metric updates (lines 314-315 in TabletStatMgr):
MetricRepo.GAUGE_AUTO_PARTITION_NEAR_LIMIT.setValue(autoPartitionNearLimitCount);
MetricRepo.GAUGE_DYNAMIC_PARTITION_NEAR_LIMIT.setValue(dynamicPartitionNearLimitCount);

Impact

  • Non-cloud mode: Metrics work correctly (updated by TabletStatMgr)
  • Cloud mode: Metrics stay at 0 forever (never updated by CloudTabletStatMgr)

This defeats the purpose of the PR for cloud deployments, as users cannot monitor tables approaching partition limits.

Fix Required

Apply the same changes to CloudTabletStatMgr.updateStatInfo():

  • Add the counter variables
  • Add the partition limit check logic inside the table iteration loop
  • Add the gauge metric updates before the logging statement

Contributor

@gavinchou gavinchou left a comment



… temp partitions

Address review comments:
- Add partition near-limit gauge logic to CloudTabletStatMgr.updateStatInfo()
  so cloud mode also reports these metrics (previously stayed at 0).
- Use getPartitionNum() instead of getAllPartitions().size() for the
  near-limit check to exclude temp partitions, consistent with how
  partition limits are enforced elsewhere.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dataroaring
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26642 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 0909b4caf628d4bea8150a50571b0c8127ceb1e3, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17647	4457	4295	4295
q2	q3	10648	764	521	521
q4	4676	353	256	256
q5	7552	1200	1025	1025
q6	185	173	150	150
q7	788	846	666	666
q8	9402	1497	1337	1337
q9	5191	4773	4729	4729
q10	6347	1905	1652	1652
q11	471	248	237	237
q12	752	612	461	461
q13	18033	2675	1913	1913
q14	229	236	210	210
q15	q16	738	755	660	660
q17	731	910	449	449
q18	5958	5389	5283	5283
q19	1495	960	636	636
q20	546	492	373	373
q21	4461	1837	1483	1483
q22	522	373	306	306
Total cold run time: 96372 ms
Total hot run time: 26642 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4850	4528	4546	4528
q2	q3	3847	4305	3869	3869
q4	870	1188	800	800
q5	4076	4425	4436	4425
q6	227	179	153	153
q7	1825	1688	1536	1536
q8	2488	2711	2588	2588
q9	7567	7471	7437	7437
q10	3824	4122	3602	3602
q11	503	439	416	416
q12	489	581	445	445
q13	2435	3020	2120	2120
q14	286	330	288	288
q15	q16	770	768	722	722
q17	1157	1320	1401	1320
q18	7258	6884	6931	6884
q19	985	963	949	949
q20	2111	2154	2019	2019
q21	3917	3502	3326	3326
q22	425	429	403	403
Total cold run time: 49910 ms
Total hot run time: 47830 ms

@doris-robot

TPC-DS: Total hot run time: 168191 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 0909b4caf628d4bea8150a50571b0c8127ceb1e3, data reload: false

query5	4325	649	513	513
query6	326	223	200	200
query7	4235	472	260	260
query8	339	255	229	229
query9	8690	2697	2688	2688
query10	486	389	350	350
query11	6969	5056	4845	4845
query12	182	133	124	124
query13	1294	451	354	354
query14	5818	3715	3472	3472
query14_1	2869	2826	2786	2786
query15	207	195	182	182
query16	982	472	447	447
query17	902	728	631	631
query18	2456	458	361	361
query19	221	232	186	186
query20	136	128	124	124
query21	214	135	109	109
query22	13255	13334	13159	13159
query23	16199	15794	16753	15794
query23_1	16258	16066	16124	16066
query24	7617	1687	1263	1263
query24_1	1283	1306	1282	1282
query25	597	586	516	516
query26	1006	282	166	166
query27	2736	477	304	304
query28	4490	1853	1827	1827
query29	859	575	509	509
query30	306	227	192	192
query31	1038	935	864	864
query32	89	72	68	68
query33	509	338	285	285
query34	880	884	523	523
query35	659	721	609	609
query36	1100	1115	982	982
query37	138	102	84	84
query38	2988	2939	2936	2936
query39	871	836	841	836
query39_1	810	813	802	802
query40	236	153	136	136
query41	63	60	59	59
query42	273	263	271	263
query43	256	263	230	230
query44	
query45	201	190	192	190
query46	900	1042	628	628
query47	2076	2141	2046	2046
query48	305	319	247	247
query49	634	473	393	393
query50	710	284	224	224
query51	4114	4082	4017	4017
query52	260	268	262	262
query53	291	335	285	285
query54	297	276	279	276
query55	93	89	85	85
query56	326	323	309	309
query57	1928	1800	1556	1556
query58	291	263	271	263
query59	2805	2944	2716	2716
query60	347	332	320	320
query61	152	154	150	150
query62	607	590	537	537
query63	303	279	277	277
query64	5093	1289	999	999
query65	
query66	1460	448	355	355
query67	24122	24251	24126	24126
query68	
query69	406	309	288	288
query70	930	1003	949	949
query71	343	314	288	288
query72	2878	2711	2467	2467
query73	549	555	321	321
query74	9622	9588	9420	9420
query75	2844	2751	2464	2464
query76	2287	1027	656	656
query77	389	378	306	306
query78	10859	11159	10424	10424
query79	2036	769	574	574
query80	1637	616	544	544
query81	550	258	226	226
query82	996	158	130	130
query83	337	264	273	264
query84	298	129	103	103
query85	898	512	451	451
query86	418	352	327	327
query87	3193	3119	3023	3023
query88	3579	2679	2655	2655
query89	432	362	344	344
query90	2033	179	172	172
query91	165	177	141	141
query92	78	74	79	74
query93	980	829	504	504
query94	646	326	289	289
query95	599	411	323	323
query96	641	519	227	227
query97	2441	2495	2407	2407
query98	230	238	220	220
query99	1017	987	911	911
Total cold run time: 251073 ms
Total hot run time: 168191 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 38.24% (13/34) 🎉
Increment coverage report
Complete coverage report

@dataroaring
Contributor Author

/review

Contributor

@github-actions github-actions bot left a comment


Code Review Summary

Overview

This PR converts partition near-limit metrics from monotonically increasing LongCounterMetric to GaugeMetricImpl<Long>, and moves computation from inline event-driven sites (FrontendServiceImpl, DynamicPartitionUtil) to the periodic TabletStatMgr/CloudTabletStatMgr scan. The counter-to-gauge conversion is a sound design improvement — gauges correctly reflect the current number of tables near the limit rather than a cumulative event count.

Critical Checkpoint Conclusions

1. Goal & Correctness:

  • The auto-partition metric (auto_partition_near_limit_count) is correct: both the original enforcement site (FrontendServiceImpl) and the new gauge compare olapTable.getPartitionNum() against Config.max_auto_partition_num.
  • Issue found: The dynamic-partition metric has a semantic mismatch (see inline comment). The original DynamicPartitionUtil compared end - start (the configured partition span) against max_dynamic_partition_num, but the new code compares total partition count against the same limit. These measure different things.

2. Modification scope: Focused and minimal — 5 files, well-scoped to the metric change.

3. Concurrency: No new concurrency concerns. The gauge values are computed in a single-threaded periodic scan and set atomically via setValue(). The computation reads under existing table read locks.

4. Lifecycle / static init: No issues. Gauge metrics are initialized in MetricRepo.init() before use.

5. Configuration: No new configs added. Existing max_auto_partition_num and max_dynamic_partition_num are reused appropriately.

6. Incompatible changes: Metric names are preserved (auto_partition_near_limit_count, dynamic_partition_near_limit_count). However, semantics changed from cumulative counter to current gauge — any Prometheus alert rules using rate() or increase() on these metrics will silently break. The PR description acknowledges this but there should be a release note.

7. Parallel code paths: Both TabletStatMgr (shared-nothing) and CloudTabletStatMgr (cloud mode) are updated — good.

8. Test coverage: No tests are included. The PR description lists manual test items but no regression or unit tests are added. Given this is a metrics-only change, this is acceptable but not ideal.

9. Observability: The LOG.warn calls are preserved in both FrontendServiceImpl and DynamicPartitionUtil — good.

10. Performance: No concerns. The checks are trivial integer comparisons added to an already-existing iteration loop.

Issues Found

  1. (Medium) Semantic mismatch in dynamic_partition_near_limit_count — see inline comments on both TabletStatMgr.java and CloudTabletStatMgr.java.
  2. (Minor) Code duplication between TabletStatMgr and CloudTabletStatMgr — the near-limit computation logic is copy-pasted. Consider extracting to a shared utility method.
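The minor issue above suggests extracting the copy-pasted near-limit computation into a shared utility. A minimal sketch of what such a helper could look like, with hypothetical names (not actual Doris code) and the same 80% integer check as the PR, callable once per scanned table from either manager:

```java
public class NearLimitCounts {
    long autoNearLimit;
    long dynamicNearLimit;

    // One call per scanned table; both TabletStatMgr and
    // CloudTabletStatMgr could share this accumulator.
    void accumulate(int partitionNum, boolean autoPartition, boolean dynamicPartition,
                    int autoLimit, int dynamicLimit) {
        if (autoPartition && partitionNum > autoLimit * 8L / 10) {
            autoNearLimit++;
        }
        if (dynamicPartition && partitionNum > dynamicLimit * 8L / 10) {
            dynamicNearLimit++;
        }
    }

    public static void main(String[] args) {
        NearLimitCounts c = new NearLimitCounts();
        c.accumulate(1700, true, false, 2000, 500); // auto table over 80% of 2000
        c.accumulate(300, false, true, 2000, 500);  // dynamic table under 80% of 500
        System.out.println(c.autoNearLimit + " " + c.dynamicNearLimit); // 1 0
    }
}
```

At the end of the scan, the two totals would be published to the gauges in one place instead of two.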

}
}
if (olapTable.dynamicPartitionExists()
&& olapTable.getTableProperty().getDynamicPartitionProperty().getEnable()) {
Contributor


Semantic mismatch for dynamic partition near-limit check.

The original metric in DynamicPartitionUtil compared expectCreatePartitionNum (which is end - start, the configured partition span/window) against Config.max_dynamic_partition_num. Per the config's own Javadoc:

Used to limit the maximum number of partitions that can be created when creating a dynamic partition table [...] The number is determined by "start" and "end" in the dynamic partition parameters.

However, the new code here compares nonTempPartitionNum (total current partition count on the table) against max_dynamic_partition_num. These are semantically different:

  • A table with start=-5, end=3 (span=8) but 17,000 manually-added partitions would trigger this gauge even though the dynamic partition config is well within the limit.
  • Conversely, a table with start=-15000, end=5000 (span=20,000) but only 100 currently-existing partitions would NOT trigger the gauge, even though the dynamic partition span already hits the hard limit.

Consider comparing the configured span (end - start) instead:

DynamicPartitionProperty dpProp = olapTable.getTableProperty().getDynamicPartitionProperty();
long span = (long) dpProp.getEnd() - dpProp.getStart();
int limit = Config.max_dynamic_partition_num;
if (span > limit * 8L / 10) {
    dynamicPartitionNearLimitCount++;
}

This would make the gauge semantically consistent with the enforcement in DynamicPartitionUtil.

Note: The auto-partition check above is correct — FrontendServiceImpl.createPartition() does compare total partition count against max_auto_partition_num.

}
}
if (olapTable.dynamicPartitionExists()
&& olapTable.getTableProperty().getDynamicPartitionProperty().getEnable()) {
Contributor


Same semantic mismatch as in TabletStatMgr.java — the dynamic partition near-limit check should compare the configured span (end - start) against max_dynamic_partition_num, not the total partition count. See the detailed comment on the TabletStatMgr.java counterpart.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 30, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

Collaborator

@deardeng deardeng left a comment


LGTM

@dataroaring dataroaring merged commit 87d592e into master Mar 30, 2026
29 of 31 checks passed
github-actions bot pushed a commit that referenced this pull request Mar 30, 2026
…uges (#61845)


Labels

approved Indicates a PR has been approved by one committer. dev/4.0.x dev/4.0.x-conflict dev/4.1.x reviewed
