HIVE-28755: Statistics Management Task #6438

Open
DanielZhu58 wants to merge 11 commits into apache:master from DanielZhu58:HIVE-28755

Contributor

@DanielZhu58 DanielZhu58 commented Apr 17, 2026

What changes were proposed in this pull request?

Added StatisticsManagementTask, a new MetastoreTaskThread that periodically scans TAB_COL_STATS and PART_COL_STATS and deletes column statistics whose lastAnalyzed timestamp is older than the configured retention period. Three new configs are introduced: metastore.column.statistics.auto.deletion (default: false), metastore.column.statistics.retention.period (default: 365 days), and metastore.column.statistics.management.task.frequency (default: 7 days). Individual tables can opt out by setting the table property statistics.auto.deletion.exclude.
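The retention check described above can be sketched roughly as follows. This is only an illustration, not the code in the PR: the class and method names are hypothetical, and it assumes lastAnalyzed is stored in epoch seconds, as the Javadoc quoted later in this thread states.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not part of the PR. */
final class StatsExpirySketch {
  private StatsExpirySketch() {}

  /** Cutoff in epoch seconds: stats with lastAnalyzed below this value count as expired. */
  static long expiryThresholdSeconds(long nowMillis, long retentionMillis) {
    return TimeUnit.MILLISECONDS.toSeconds(nowMillis - retentionMillis);
  }

  /** Assumes lastAnalyzed is in epoch seconds (an assumption taken from the quoted Javadoc). */
  static boolean isExpired(long lastAnalyzedSeconds, long nowMillis, long retentionMillis) {
    // A non-positive retention period is treated as "never delete" in this sketch.
    if (retentionMillis <= 0) {
      return false;
    }
    return lastAnalyzedSeconds < expiryThresholdSeconds(nowMillis, retentionMillis);
  }
}
```

With the default 365-day retention, a row analyzed one second before the cutoff would be treated as expired, while a row analyzed exactly at the cutoff would be kept.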

Why are the changes needed?

Column statistics can become stale over time and consume unnecessary storage. There was no existing mechanism to automatically clean them up based on age, requiring manual intervention.

Does this PR introduce any user-facing change?

Yes. Three new metastore configuration properties are introduced. The feature is opt-in and disabled by default (metastore.column.statistics.auto.deletion=false), so existing deployments are unaffected after upgrade.

How was this patch tested?

Added TestStatisticsManagement with several unit tests. One verifies that expired table-level column statistics are deleted when the task runs, another verifies that tables marked with the exclude property are left untouched, and a few more cover corner cases.

Copilot AI left a comment

Pull request overview

Adds a new Metastore background task intended to automatically delete expired column statistics, and wires it into the metastore task thread list. The PR also introduces a benchmark hook and a new unit test for the statistics-management behavior.

Changes:

  • Introduce StatisticsManagementTask (a MetastoreTaskThread) that deletes column stats older than a configured retention window.
  • Add new metastore configuration knobs for stats-management frequency/retention/enablement and register the task in the task thread list.
  • Add a micro-benchmark entry and a new unit test class for statistics management.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/HMSBenchmarks.java | Adds a benchmark for StatisticsManagementTask (currently simulates/validates the wrong thing). |
| standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/BenchmarkTool.java | Wires the new benchmark into the benchmark suite (one suite name is inconsistent). |
| standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestStatisticsManagement.java | New unit tests for stats auto-deletion (currently missing required statsDesc on the stats object). |
| standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java | New background task implementation (contains multiple correctness/compilation/runtime issues). |
| standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java | Adds new conf vars and registers the task; docstrings currently don’t match how the feature is enabled. |

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.


Contributor

@soumyakanti3578 soumyakanti3578 left a comment

I haven't reviewed StatisticsManagementTask.java yet because there are too many sonar/checkstyle warnings.

Please follow the hive coding conventions to get rid of those warnings: https://hive.apache.org/community/resources/howtocontribute/#coding-conventions

You can also see the checkstyle file for standalone-metastore at standalone-metastore/checkstyle/checkstyle.xml which has the default values that sonar checks against.

I think you can also use mvn checkstyle from the console or download the checkstyle plugin for IntelliJ IDEA to catch these warnings before submitting the PR.

In general, if you're creating a new file, it's a good idea to follow the official coding conventions, and if you're updating small parts of a file then it's okay to follow the existing style to maintain readability. :)

"Automatic partition management will look for tables using the specified table pattern"),

STATISTICS_MANAGEMENT_TASK_FREQUENCY("metastore.statistics.management.task.frequency",
"metastore.statistics.management.task.frequency",
Contributor

The first two arguments are the same. Should the second one (hiveName) be something else, like, hive.column.statistics.management.task.frequency? Not sure if it even matters tbh.

Similar issue for the other two props too.

Contributor Author

The second argument (hiveName) exists for backward compatibility: when a config previously lived under a hive. prefix in older Hive versions, HMS needs to recognize both names. Since these three configs are brand new and have no prior hive.-prefixed equivalent, it's ok to keep them the same.
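The two-name lookup described in this reply can be pictured with a small sketch. It is not MetastoreConf's actual implementation: the ConfKey type, the lookup method, and the example key names in the usage note are hypothetical and only illustrate why identical names are harmless for brand-new configs.

```java
import java.util.Map;

/** Hypothetical illustration of a config key with a legacy alias; not Hive code. */
final class ConfAliasSketch {
  private ConfAliasSketch() {}

  static final class ConfKey {
    final String name;        // current name
    final String legacyName;  // older alias; may equal name for new configs
    final String defaultValue;

    ConfKey(String name, String legacyName, String defaultValue) {
      this.name = name;
      this.legacyName = legacyName;
      this.defaultValue = defaultValue;
    }
  }

  /** Looks up the current name first, then the legacy alias, then falls back to the default. */
  static String lookup(Map<String, String> props, ConfKey key) {
    String value = props.get(key.name);
    if (value == null && !key.legacyName.equals(key.name)) {
      value = props.get(key.legacyName);
    }
    return value != null ? value : key.defaultValue;
  }
}
```

In this sketch, a key declared as ("metastore.example.flag", "hive.example.flag", "false") would still pick up a value set under the old hive.-prefixed name, whereas a key whose two names are identical simply skips the second lookup.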

+ "table.database.name, "
+ "table.tableName, "
+ "colName, "
+ "table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")");
Member

Could add a filter for the tblQuery: "table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")"

Contributor Author

Acknowledged. Added a filter to filter out the excluded tables.

if (!MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.COLUMN_STATISTICS_AUTO_DELETION)
|| MetastoreConf.getTimeVar(conf, MetastoreConf.ConfVars.COLUMN_STATISTICS_RETENTION_PERIOD,
TimeUnit.MILLISECONDS) <= 0) {
return Long.MAX_VALUE;
Member

return 0

Contributor Author

Acknowledged and changed.
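The thread does not spell out why 0 is preferred over Long.MAX_VALUE. One plausible reading, stated here as an assumption rather than something the reviewers confirm, is that the scheduler only schedules tasks whose frequency is positive, so 0 means "never schedule" while Long.MAX_VALUE would still register the task with an enormous period. The interface and method names below are hypothetical stand-ins, not Hive classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical scheduling sketch; assumes frequency <= 0 means "do not schedule". */
final class TaskSchedulingSketch {
  private TaskSchedulingSketch() {}

  /** Stand-in for a periodic background task. */
  interface PeriodicTask {
    long runFrequencyMillis();
  }

  /** Returns 0 when the feature is off or the retention period is non-positive. */
  static long effectiveFrequencyMillis(boolean autoDeletionEnabled, long retentionMillis,
      long configuredFrequencyMillis) {
    if (!autoDeletionEnabled || retentionMillis <= 0) {
      return 0L;
    }
    return configuredFrequencyMillis;
  }

  /** Keeps only tasks with a positive frequency (the assumed scheduler rule). */
  static List<PeriodicTask> schedulable(List<PeriodicTask> tasks) {
    List<PeriodicTask> result = new ArrayList<>();
    for (PeriodicTask task : tasks) {
      if (task.runFrequencyMillis() > 0) {
        result.add(task);
      }
    }
    return result;
  }
}
```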

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 14 comments.

Comment on lines +140 to +155
try (Query tblQuery = pm.newQuery(MTableColumnStatistics.class)) {
tblQuery.setFilter(
"lastAnalyzed < threshold "
+ "&& table.parameters.get(\""
+ STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\") != \"true\"");
tblQuery.declareParameters("long threshold");
tblQuery.setRange(0, 1000);
tblQuery.setResult(
"table.database.catalogName, "
+ "table.database.name, "
+ "table.tableName, "
+ "colName");
@SuppressWarnings("unchecked")
List<Object[]> rows = (List<Object[]>) tblQuery.execute(lastAnalyzedThreshold);
return new ArrayList<>(rows);
}
Contributor Author

Acknowledged. I added a filter that only collects Hive tables and Hive partitions. We do not want to touch any Impala/Spark tables or partitions.

boolean committed = false;
openTransaction();
try {
deleteTableColumnStatistics(coords[0], coords[1], coords[2], entry.getValue(), "hive");
Contributor Author

Same issue as the previous one. Resolved.
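The deletion call quoted above takes table coordinates plus a list of column names, which suggests the projected rows are grouped per table before deleting. A minimal sketch of that grouping step follows, assuming rows shaped as [catName, dbName, tblName, colName] as in the table query's setResult projection. The class name is hypothetical, and using a List<String> as the map key is a choice made for this sketch (arrays lack value equality), not necessarily what the PR does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical grouping helper for illustration; not part of the PR. */
final class StatsRowGroupingSketch {
  private StatsRowGroupingSketch() {}

  /**
   * Groups projected rows [catName, dbName, tblName, colName] by table so that
   * one delete call per table can cover all of its expired columns.
   */
  static Map<List<String>, List<String>> groupColumnsByTable(List<Object[]> rows) {
    Map<List<String>, List<String>> byTable = new LinkedHashMap<>();
    for (Object[] row : rows) {
      // Key on (catalog, database, table); lists compare by value, unlike arrays.
      List<String> coords = Arrays.asList((String) row[0], (String) row[1], (String) row[2]);
      byTable.computeIfAbsent(coords, k -> new ArrayList<>()).add((String) row[3]);
    }
    return byTable;
  }
}
```

Each map entry then corresponds to one deletion: the key supplies the catalog, database, and table name, and the value supplies the column list.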

Comment on lines +211 to +227
try (Query partQuery = pm.newQuery(MPartitionColumnStatistics.class)) {
partQuery.setFilter(
"lastAnalyzed < threshold "
+ "&& partition.table.parameters.get(\""
+ STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\") != \"true\"");
partQuery.declareParameters("long threshold");
partQuery.setRange(0, 1000);
partQuery.setResult(
"partition.table.database.catalogName, "
+ "partition.table.database.name, "
+ "partition.table.tableName, "
+ "partition.partitionName, "
+ "colName");
@SuppressWarnings("unchecked")
List<Object[]> rows = (List<Object[]>) partQuery.execute(lastAnalyzedThreshold);
return new ArrayList<>(rows);
}
Contributor Author

Same issue as the previous one. Resolved.

Comment on lines +260 to +261
deletePartitionColumnStatistics(coords[0], coords[1], coords[2],
Collections.singletonList(coords[3]), entry.getValue(), "hive");
Contributor Author

Same issue as the previous one. Resolved.

*
* @param pm the JDO persistence manager to use for the query
* @param lastAnalyzedThreshold epoch seconds; rows with lastAnalyzed below this value are expired
* @return list of projected rows: [catName, dbName, tblName, colName, excludeVal]
Contributor Author

Acknowledged and changed.

private void assertHasPartitionColStats(String db, String tbl, String partName,
String col) throws TException {
Map<String, List<ColumnStatisticsObj>> statsMap = client.getPartitionColumnStatistics(
db, tbl, Collections.singletonList(partName), Collections.singletonList(col), "hive");
Contributor Author

Acknowledged and changed. Will leave the engine as "hive" for now.

Comment on lines +446 to +449
q.setFilter("table.tableName == t && table.database.name == d");
q.declareParameters("java.lang.String t, java.lang.String d");
@SuppressWarnings("unchecked")
List<MTableColumnStatistics> rows = (List<MTableColumnStatistics>) q.execute(tbl, db);
Comment on lines +473 to +474
q.setFilter("partition.table.tableName == t && partition.table.database.name == d");
q.declareParameters("java.lang.String t, java.lang.String d");
Comment on lines +1296 to +1308
7, TimeUnit.DAYS, "Frequency at which timer task runs to do automatic statistics \n" +
"management for tables. Statistics management include 2 configs. \n" +
"One is 'metastore.column.statistics.auto.deletion', and the other is 'metastore.column.statistics.retention.period'. \n" +
"When 'metastore.column.statistics.auto.deletion'='true' is set, statistics management will look for tables which their\n" +
"column statistics are over the retention period, and then delete the column stats. \n"),

COLUMN_STATISTICS_RETENTION_PERIOD("metastore.column.statistics.retention.period",
"metastore.column.statistics.retention.period", 365, TimeUnit.DAYS, "The retention period " +
"that we want to keep the stats for each table, which means if the stats are older than this period\n" +
"of time, the stats will be automatically deleted. \n"),

COLUMN_STATISTICS_AUTO_DELETION("metastore.column.statistics.auto.deletion", "metastore.column.statistics.auto.deletion", false,
"Whether table/partition column statistics will be auto deleted after retention period"),
Contributor Author

The description is very clear, no need to change.

Comment on lines +108 to +109
@Test
public void testExpiredTableColStatsAreDeleted() throws Exception {
Contributor Author

Acknowledged and changed.

"that we want to keep the stats for each table, which means if the stats are older than this period\n" +
"of time, the stats will be automatically deleted. \n"),

COLUMN_STATISTICS_AUTO_DELETION("metastore.column.statistics.auto.deletion", "metastore.column.statistics.auto.deletion", false,
Member

Since we can use metastore.column.statistics.management.task.frequency to turn the management task on or off, I think we don't need an extra COLUMN_STATISTICS_AUTO_DELETION to configure the task.

Contributor Author

I think it's ok to keep COLUMN_STATISTICS_AUTO_DELETION. The two configs have distinct responsibilities: the boolean controls whether the feature is active at all, while the frequency controls how often it runs. Conflating them makes the configuration less readable. The boolean also gives users a simple toggle to temporarily disable deletion without having to remember and restore their frequency setting.

* statistics deletion regardless of the global retention setting.
*/
public static final String STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY =
"statistics.auto.deletion.exclude";
Member

metastore.statistics.auto.deletion.exclude?
Can we add this parameter to the database or table, so we don't need to set it explicitly per table/partition?

Contributor Author

That's a good point, and I agree that database-level exclusion would be more convenient when users want to exclude all tables in a database. However, that's a very rare case. The Jira was created to delete stale column stats; the exclusion is only there so users can prevent deletion for specific tables. Since this feature usually deletes column stats created more than a year ago, it is rare that users want to exclude a particular table from auto deletion. We can defer that to a follow-up Jira.
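For reference, the table-level check and the deferred database-level fallback could look roughly like the sketch below. The property name comes from the constant quoted above; everything else (class name, method names, and the rule that an explicit table setting overrides the database setting) is a hypothetical illustration of the follow-up idea, not behavior implemented in this PR.

```java
import java.util.Map;

/** Hypothetical exclusion check for illustration; the database fallback is not in the PR. */
final class ExclusionSketch {
  static final String EXCLUDE_PROP = "statistics.auto.deletion.exclude";

  private ExclusionSketch() {}

  /** Table-level check: excluded only when the property is exactly "true". */
  static boolean isExcluded(Map<String, String> params) {
    return params != null && "true".equals(params.get(EXCLUDE_PROP));
  }

  /** Possible follow-up: fall back to the database parameters when the table sets nothing. */
  static boolean isExcluded(Map<String, String> tableParams, Map<String, String> dbParams) {
    if (tableParams != null && tableParams.containsKey(EXCLUDE_PROP)) {
      return isExcluded(tableParams);
    }
    return isExcluded(dbParams);
  }
}
```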

createDbAndTable(dbName, tableName, false);
writeTableLevelColStats(dbName, tableName, "c1");
assertHasTableColStats(dbName, tableName, "c1");
makeAllTableColStatsOlderThanRetention(dbName, tableName);
Member

Create some fresh stats so we can check those stats are not deleted after runStatisticsManagementTask

Contributor Author

Acknowledged. I added 2 new tests to check that fresh column stats are not deleted, for both tables and partitions.
