HIVE-28755: Statistics Management Task#6438
Pull request overview
Adds a new Metastore background task intended to automatically delete expired column statistics, and wires it into the metastore task thread list. The PR also introduces a benchmark hook and a new unit test for the statistics-management behavior.
Changes:
- Introduce StatisticsManagementTask (a MetastoreTaskThread) that deletes column stats older than a configured retention window.
- Add new metastore configuration knobs for stats-management frequency/retention/enablement and register the task in the task thread list.
- Add a micro-benchmark entry and a new unit test class for statistics management.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/HMSBenchmarks.java | Adds a benchmark for StatisticsManagementTask (currently simulates/validates the wrong thing). |
| standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/BenchmarkTool.java | Wires the new benchmark into the benchmark suite (one suite name is inconsistent). |
| standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestStatisticsManagement.java | New unit tests for stats auto-deletion (currently missing required statsDesc on the stats object). |
| standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java | New background task implementation (contains multiple correctness/compilation/runtime issues). |
| standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java | Adds new conf vars and registers the task; docstrings currently don’t match how the feature is enabled. |
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
soumyakanti3578 left a comment
I haven't reviewed StatisticsManagementTask.java yet because there are too many sonar/checkstyle warnings.
Please follow the Hive coding conventions to get rid of those warnings: https://hive.apache.org/community/resources/howtocontribute/#coding-conventions
You can also see the checkstyle file for standalone-metastore at standalone-metastore/checkstyle/checkstyle.xml which has the default values that sonar checks against.
I think you can also use mvn checkstyle from the console or download the checkstyle plugin for IntelliJ IDEA to catch these warnings before submitting the PR.
In general, if you're creating a new file, it's a good idea to follow the official coding conventions, and if you're updating small parts of a file then it's okay to follow the existing style to maintain readability. :)
| "Automatic partition management will look for tables using the specified table pattern"), | ||
|
|
||
| STATISTICS_MANAGEMENT_TASK_FREQUENCY("metastore.statistics.management.task.frequency", | ||
| "metastore.statistics.management.task.frequency", |
The first two arguments are the same. Should the second one (hiveName) be something else, like, hive.column.statistics.management.task.frequency? Not sure if it even matters tbh.
Similar issue for the other two props too.
The second argument (hiveName) exists for backward compatibility: when a config previously lived under a hive. prefix in older Hive versions, HMS needs to recognize both names. Since these three configs are brand new and have no prior hive.-prefixed equivalent, it's OK to keep them the same.
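The two-name lookup described in this reply can be sketched roughly as follows. This is a minimal, self-contained illustration and not the real MetastoreConf code: the class DualNameConfVar, its fields, and the example keys are hypothetical, and the lookup order (current name first, then the legacy name) is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a metastore config variable; not the real MetastoreConf.ConfVars.
final class DualNameConfVar {
  final String varname;    // current "metastore."-prefixed key
  final String hiveName;   // legacy key; identical to varname when no legacy key exists
  final String defaultVal;

  DualNameConfVar(String varname, String hiveName, String defaultVal) {
    this.varname = varname;
    this.hiveName = hiveName;
    this.defaultVal = defaultVal;
  }

  // Assumed order: current name, then legacy name, then the default.
  String resolve(Map<String, String> props) {
    String value = props.get(varname);
    if (value == null && !hiveName.equals(varname)) {
      value = props.get(hiveName);
    }
    return value != null ? value : defaultVal;
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    // Example keys are made up for illustration.
    DualNameConfVar legacy = new DualNameConfVar("metastore.example.key", "hive.example.key", "d");
    props.put("hive.example.key", "old");
    System.out.println(legacy.resolve(props)); // value found under the legacy key
  }
}
```

When both names are identical, as with the three new configs, the second lookup is simply skipped, so duplicating the name has no effect in this sketch.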
| + "table.database.name, " | ||
| + "table.tableName, " | ||
| + "colName, " | ||
| + "table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")"); |
Could add a filter for the tblQuery: "table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")"
Acknowledged. Added a filter to filter out the excluded tables.
| if (!MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.COLUMN_STATISTICS_AUTO_DELETION) | ||
| || MetastoreConf.getTimeVar(conf, MetastoreConf.ConfVars.COLUMN_STATISTICS_RETENTION_PERIOD, | ||
| TimeUnit.MILLISECONDS) <= 0) { | ||
| return Long.MAX_VALUE; |
Acknowledged and changed.
| try (Query tblQuery = pm.newQuery(MTableColumnStatistics.class)) { | ||
| tblQuery.setFilter( | ||
| "lastAnalyzed < threshold " | ||
| + "&& table.parameters.get(\"" | ||
| + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\") != \"true\""); | ||
| tblQuery.declareParameters("long threshold"); | ||
| tblQuery.setRange(0, 1000); | ||
| tblQuery.setResult( | ||
| "table.database.catalogName, " | ||
| + "table.database.name, " | ||
| + "table.tableName, " | ||
| + "colName"); | ||
| @SuppressWarnings("unchecked") | ||
| List<Object[]> rows = (List<Object[]>) tblQuery.execute(lastAnalyzedThreshold); | ||
| return new ArrayList<>(rows); | ||
| } |
Acknowledged. I added a filter that collects only Hive tables and Hive partitions. We do not want to touch any Impala/Spark tables or partitions.
| boolean committed = false; | ||
| openTransaction(); | ||
| try { | ||
| deleteTableColumnStatistics(coords[0], coords[1], coords[2], entry.getValue(), "hive"); |
Same issue as the previous one. Resolved.
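The deleteTableColumnStatistics call in the hunk above takes table coordinates plus a list of column names, which suggests the projected rows are grouped per table before deletion. A minimal sketch of such a grouping step, assuming rows shaped like [catName, dbName, tblName, colName]; the class and method names are hypothetical and the PR's actual grouping code is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; illustrates grouping only, not the PR's implementation.
final class ExpiredStatsGrouping {

  // Groups rows of [catName, dbName, tblName, colName] into
  // (catName, dbName, tblName) -> list of expired column names.
  static Map<List<String>, List<String>> groupByTable(List<Object[]> rows) {
    Map<List<String>, List<String>> grouped = new LinkedHashMap<>();
    for (Object[] row : rows) {
      List<String> coords = Arrays.asList((String) row[0], (String) row[1], (String) row[2]);
      grouped.computeIfAbsent(coords, k -> new ArrayList<>()).add((String) row[3]);
    }
    return grouped;
  }

  public static void main(String[] args) {
    List<Object[]> rows = new ArrayList<>();
    // Sample data, made up for illustration.
    rows.add(new Object[] {"hive", "db1", "t1", "c1"});
    rows.add(new Object[] {"hive", "db1", "t1", "c2"});
    rows.add(new Object[] {"hive", "db1", "t2", "c1"});
    groupByTable(rows).forEach((coords, cols) -> System.out.println(coords + " -> " + cols));
  }
}
```

Grouping this way lets each table be handled with one delete call and one transaction instead of one per column.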
| try (Query partQuery = pm.newQuery(MPartitionColumnStatistics.class)) { | ||
| partQuery.setFilter( | ||
| "lastAnalyzed < threshold " | ||
| + "&& partition.table.parameters.get(\"" | ||
| + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\") != \"true\""); | ||
| partQuery.declareParameters("long threshold"); | ||
| partQuery.setRange(0, 1000); | ||
| partQuery.setResult( | ||
| "partition.table.database.catalogName, " | ||
| + "partition.table.database.name, " | ||
| + "partition.table.tableName, " | ||
| + "partition.partitionName, " | ||
| + "colName"); | ||
| @SuppressWarnings("unchecked") | ||
| List<Object[]> rows = (List<Object[]>) partQuery.execute(lastAnalyzedThreshold); | ||
| return new ArrayList<>(rows); | ||
| } |
Same issue as the previous one. Resolved.
| deletePartitionColumnStatistics(coords[0], coords[1], coords[2], | ||
| Collections.singletonList(coords[3]), entry.getValue(), "hive"); |
Same issue as the previous one. Resolved.
| * | ||
| * @param pm the JDO persistence manager to use for the query | ||
| * @param lastAnalyzedThreshold epoch seconds; rows with lastAnalyzed below this value are expired | ||
| * @return list of projected rows: [catName, dbName, tblName, colName, excludeVal] |
Acknowledged and changed.
| private void assertHasPartitionColStats(String db, String tbl, String partName, | ||
| String col) throws TException { | ||
| Map<String, List<ColumnStatisticsObj>> statsMap = client.getPartitionColumnStatistics( | ||
| db, tbl, Collections.singletonList(partName), Collections.singletonList(col), "hive"); |
Acknowledged and changed. Will leave the engine as "hive" for now.
| q.setFilter("table.tableName == t && table.database.name == d"); | ||
| q.declareParameters("java.lang.String t, java.lang.String d"); | ||
| @SuppressWarnings("unchecked") | ||
| List<MTableColumnStatistics> rows = (List<MTableColumnStatistics>) q.execute(tbl, db); |
| q.setFilter("partition.table.tableName == t && partition.table.database.name == d"); | ||
| q.declareParameters("java.lang.String t, java.lang.String d"); |
| 7, TimeUnit.DAYS, "Frequency at which timer task runs to do automatic statistics \n" + | ||
| "management for tables. Statistics management include 2 configs. \n" + | ||
| "One is 'metastore.column.statistics.auto.deletion', and the other is 'metastore.column.statistics.retention.period'. \n" + | ||
| "When 'metastore.column.statistics.auto.deletion'='true' is set, statistics management will look for tables which their\n" + | ||
| "column statistics are over the retention period, and then delete the column stats. \n"), | ||
|
|
||
| COLUMN_STATISTICS_RETENTION_PERIOD("metastore.column.statistics.retention.period", | ||
| "metastore.column.statistics.retention.period", 365, TimeUnit.DAYS, "The retention period " + | ||
| "that we want to keep the stats for each table, which means if the stats are older than this period\n" + | ||
| "of time, the stats will be automatically deleted. \n"), | ||
|
|
||
| COLUMN_STATISTICS_AUTO_DELETION("metastore.column.statistics.auto.deletion", "metastore.column.statistics.auto.deletion", false, | ||
| "Whether table/partition column statistics will be auto deleted after retention period"), |
The description is very clear, no need to change.
| @Test | ||
| public void testExpiredTableColStatsAreDeleted() throws Exception { |
Acknowledged and changed.
| "that we want to keep the stats for each table, which means if the stats are older than this period\n" + | ||
| "of time, the stats will be automatically deleted. \n"), | ||
|
|
||
| COLUMN_STATISTICS_AUTO_DELETION("metastore.column.statistics.auto.deletion", "metastore.column.statistics.auto.deletion", false, |
Since we can use metastore.column.statistics.management.task.frequency to turn the management task on or off, I don't think we need an extra COLUMN_STATISTICS_AUTO_DELETION to configure the task.
I think it's OK to keep COLUMN_STATISTICS_AUTO_DELETION. The two configs have distinct responsibilities: the boolean controls whether the feature is active at all, while the frequency controls how often it runs. Conflating them makes the configuration less readable. The boolean also gives users a simple toggle to temporarily disable deletion without having to remember and restore their frequency setting.
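The separation argued for here can be illustrated with a small sketch. The class StatsManagementSettings and its methods are hypothetical; only the three settings and their defaults (false, 7 days, 365 days) come from the PR, and the "retention must be positive" check mirrors the guard shown in the earlier diff hunk.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical value holder for the three settings discussed in this thread.
final class StatsManagementSettings {
  final boolean autoDeletion; // metastore.column.statistics.auto.deletion
  final long frequencyMs;     // task frequency
  final long retentionMs;     // metastore.column.statistics.retention.period

  StatsManagementSettings(boolean autoDeletion, long frequencyMs, long retentionMs) {
    this.autoDeletion = autoDeletion;
    this.frequencyMs = frequencyMs;
    this.retentionMs = retentionMs;
  }

  // Deletion runs only when the toggle is on and the retention period is positive.
  boolean deletionActive() {
    return autoDeletion && retentionMs > 0;
  }

  // Flipping the toggle leaves frequency and retention untouched.
  StatsManagementSettings withAutoDeletion(boolean enabled) {
    return new StatsManagementSettings(enabled, frequencyMs, retentionMs);
  }

  public static void main(String[] args) {
    StatsManagementSettings defaults = new StatsManagementSettings(
        false, TimeUnit.DAYS.toMillis(7), TimeUnit.DAYS.toMillis(365));
    StatsManagementSettings enabled = defaults.withAutoDeletion(true);
    System.out.println(defaults.deletionActive() + " " + enabled.deletionActive());
  }
}
```

With a single frequency knob doubling as the on/off switch, a user who disables the task would have to remember the previous frequency to restore it; the separate boolean avoids that.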
| * statistics deletion regardless of the global retention setting. | ||
| */ | ||
| public static final String STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY = | ||
| "statistics.auto.deletion.exclude"; |
metastore.statistics.auto.deletion.exclude ?
can we add this parameter to the database or table, so we don't need to set it explicitly per table/partition?
That's a good point, and I agree that database-level exclusion would be more convenient when users want to exclude all tables in a database. However, that's a very rare case. The Jira was created to delete stale column stats; the exclusion is just a way for users to prevent deletion for specific tables. Since this feature usually deletes column stats created a year or more ago, it is rare that users want to exclude a table from auto deletion. We can defer that to a follow-up Jira.
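For reference, the table-level opt-out amounts to a check like the one below. This is a sketch only: the class and method names are hypothetical, and the exact-match comparison against "true" is assumed from the JDOQL filter shown earlier in the thread.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical helper mirroring the table-property opt-out discussed above.
final class StatsDeletionExclusion {
  static final String EXCLUDE_PROP = "statistics.auto.deletion.exclude";

  // A table is skipped only when the property is exactly "true";
  // a missing property or any other value leaves it eligible for cleanup.
  static boolean isExcluded(Map<String, String> tableParams) {
    return tableParams != null && "true".equals(tableParams.get(EXCLUDE_PROP));
  }

  public static void main(String[] args) {
    System.out.println(isExcluded(Collections.singletonMap(EXCLUDE_PROP, "true")));
    System.out.println(isExcluded(Collections.emptyMap()));
  }
}
```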
| createDbAndTable(dbName, tableName, false); | ||
| writeTableLevelColStats(dbName, tableName, "c1"); | ||
| assertHasTableColStats(dbName, tableName, "c1"); | ||
| makeAllTableColStatsOlderThanRetention(dbName, tableName); |
Create some fresh stats so we can check those stats are not deleted after runStatisticsManagementTask
Acknowledged. I added two new tests that check, for both tables and partitions, that fresh column stats are not deleted.
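The fresh-versus-expired distinction these tests exercise comes down to a threshold comparison. A minimal sketch, assuming lastAnalyzed is stored in epoch seconds (as the Javadoc in the earlier hunk states) and the retention period is read in milliseconds; the class and method names are hypothetical.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper; shows the expiry arithmetic only.
final class StatsExpiry {

  // Cutoff in epoch seconds: anything analyzed before this instant is expired.
  static long thresholdSeconds(long nowMs, long retentionMs) {
    return TimeUnit.MILLISECONDS.toSeconds(nowMs - retentionMs);
  }

  static boolean isExpired(long lastAnalyzedSec, long nowMs, long retentionMs) {
    return lastAnalyzedSec < thresholdSeconds(nowMs, retentionMs);
  }

  public static void main(String[] args) {
    long nowMs = System.currentTimeMillis();
    long retentionMs = TimeUnit.DAYS.toMillis(365);
    long freshSec = TimeUnit.MILLISECONDS.toSeconds(nowMs) - TimeUnit.DAYS.toSeconds(1);
    long staleSec = TimeUnit.MILLISECONDS.toSeconds(nowMs) - TimeUnit.DAYS.toSeconds(400);
    System.out.println(isExpired(freshSec, nowMs, retentionMs)); // analyzed a day ago
    System.out.println(isExpired(staleSec, nowMs, retentionMs)); // analyzed 400 days ago
  }
}
```

Mixing units is the easy mistake here: comparing a seconds-based lastAnalyzed against a millisecond cutoff would mark every row as expired.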
What changes were proposed in this pull request?
Added StatisticsManagementTask, a new MetastoreTaskThread that periodically scans TAB_COL_STATS and PART_COL_STATS and deletes column statistics whose lastAnalyzed timestamp is older than the configured retention period. Three new configs are introduced: metastore.column.statistics.auto.deletion (default: false), metastore.column.statistics.retention.period (default: 365 days), and metastore.column.statistics.management.task.frequency (default: 7 days). Individual tables can opt out by setting the table property statistics.auto.deletion.exclude.
Why are the changes needed?
Column statistics can become stale over time and consume unnecessary storage. There was no existing mechanism to automatically clean them up based on age, requiring manual intervention.
Does this PR introduce any user-facing change?
Yes. Three new metastore configuration properties are introduced. The feature is opt-in and disabled by default (metastore.column.statistics.auto.deletion=false), so existing deployments are unaffected after upgrade.
How was this patch tested?
Added TestStatisticsManagement with several unit tests. One verifies that expired table-level column statistics are deleted when the task runs, and another verifies that tables marked with the exclude property are left untouched. A few more cover corner cases.