HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge… by tanishq-chugh · Pull Request #6443 · apache/hive

tanishq-chugh · 2026-04-18T13:13:18Z

…tColumnTypes

What changes were proposed in this pull request?

Improve time complexity in ColumnStatsSemanticAnalyzer#getColumnTypes

Why are the changes needed?

Performance Improvement

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual Testing + CI

Aggarwal-Raghav · 2026-04-18T17:43:58Z

+        if (typeInfo.getCategory() != ObjectInspector.Category.PRIMITIVE) {
+          logTypeWarning(colName, type);
+        } else {
+          nonPrimColNames.add(colName);


the variable name should be PrimColNames instead of nonPrimColNames. As the primitve type will enter the else flow.

@Aggarwal-Raghav My bad, i validated the columnTypes/names being returned for primitive types and used the wrong variable name. Updated in commit - 4a6804d .
Thanks for pointing this out !

thomasrebele · 2026-04-21T12:09:27Z

-          } else {
-            colTypes.add(type);
-          }
+    Map<String, String> colTypeMap = new HashMap<>();


Thanks for the PR! When I created HIVE-29551, I had in mind do it without a HashMap if possible. There are two types of usages, depending on where the column names came from:

ColumnStatsSemanticAnalyzer#getColumnName

Utilities.getColumnNamesFromFieldSchema
The latter iterates over a list of FieldSchema, so the type info can be obtained from these items as well.

The HashMap is only needed when the ASTNode has 3 children.

thomasrebele

Thank you for the refactoring! I've got some ideas to simplify the code, aiming to make it easier to maintain the code of ColumnStatsSemanticAnalyzer in the future.

thomasrebele · 2026-04-28T10:14:24Z

    return rwt;
  }

+  private record StatsEligibleColumns(List<String> columnNames, List<String> columnTypes) {


Instead of creating a new type, could you please use List<FieldSchema>, which contains both the name and the type of the column?

Made this change in commit - 84d81f9

thomasrebele · 2026-04-28T10:16:38Z

+    return new StatsEligibleColumns(colNames, colTypes);
  }

  private List<String> getColumnName(ASTNode tree) throws SemanticException {


I would suggest to rename the function, maybe "getExplictColumnNames", though there may be a better name.

Renamed the function to getExplicitColumnNamesFromAst in commit - 84d81f9

thomasrebele · 2026-04-28T10:17:25Z

+    colNames.clear();
+    colNames.addAll(primColNames);


Modifying the argument can be avoided when implementing my other comments.

Yes, the code has been updated such that modifying this argument is avoided, in commit - 84d81f9

thomasrebele · 2026-04-28T10:26:43Z

  }

-  protected static List<String> getColumnTypes(Table tbl, List<String> colNames) {
+  protected static List<String> getColumnTypesByName(Table tbl, List<String> colNames) {


I recommend to refactor getColumnTypesByName to return List<FieldSchema>.

Made this change in commit - 84d81f9

thomasrebele · 2026-04-28T10:44:30Z

+        colNames = statsCols.columnNames();
+      } else {
+        colNames = getColumnName(ast);
+      }


The handling of the AST should stay at once place to avoid code duplication here and in #rewriteAST. Maybe a new method List<FieldSchema> getColumns(ASTNode). To keep the behavior the same, I would do roughly the following:

Collect the column names as string using the original method

Verify the names with checkForPartitionColumns and validateSpecifiedColumnNames (and removing the calls to these functions in ColumnStatsSemanticAnalyzer#rewriteAST and ColumnStatsSemanticAnalyzer#analyze)

Collect the columns as List<FieldSchema>

The caller extracts the names (with org.apache.hadoop.hive.ql.exec.Utilities#getColumnNamesFromFieldSchema) and the types (I don't know of an existing function, at least I couldn't find one in Utilities).

This approach avoids the need to modify the column names later, and should make the code easier to understand. It would be nice (if that optimization does not make the code too complex) to optimize the case ast.getChildCount() == 2, so that step 1 and 3 only collect the columns once.

Thanks for pointing this out @thomasrebele !
And, yes this definitely makes more sense and helps to keep code clean. I have made all these changes in commit - 84d81f9

thomasrebele · 2026-04-28T10:55:41Z

-    default:
+    if (tree.getChildCount() != 3) {
      throw new SemanticException("Internal error. Expected number of children of ASTNode to be"
          + " either 2 or 3. Found : " + tree.getChildCount());


If we modify the method that way, the expected number of children is 3, so the exception message would need to be changed.

Updated the exception message in commit - 84d81f9

abstractdog · 2026-05-11T13:22:26Z

-    public void setColName(List<String> colName) {
-      this.colName = colName;
+    public List<String> getColName() {
+      return columnSchemas == null ? null : Utilities.getColumnNamesFromFieldSchema(columnSchemas);


is there a chance you can go one level further?
we cannot be sure how hot is getColName, so running getColumnNamesFromFieldSchema all the time can be even worse...I mean without investigating this, we can even simply check the callers and adapt them accordingly, like in ColumnStatsDesc

@abstractdog I see only one call of AnalyzeRewriteContext#getColName() from TaskCompiler before initialising ColumnStatsDesc, and no other calls which would lead to multiple calls to Utilities.getColumnNamesFromFieldSchema().
I am afraid i did not fully understand about how could this lead to issues and rather removing getColName/getColType and introducing a new constructor in ColumnStatsDesc using columnSchemas instead of colName/colType which internally calls same Utilities methods wouldn't be behaving the same? could you please elaborate on this? Thanks!

if by static code analysis we can be sure that getColName is not called e.g. 100 times during compilation, this is going to be fine, I'm inclined to accept your analysis here

my point was only to completely remove the separate colName and colType variables from method signatures where it's possible, and only have them as separate variables where it's really needed (e.g. where EXPLAIN plan is used), e.g.:

@Explain(displayName = "Columns") public List<String> getColName() { return colName; }

so I can see that TaskCompiler uses them separately only for passing them to ColumnStatsDesc, which just complicates code, so what about passing schemas to ColumnStatsDesc, and we can resolve names/types there like:

@Explain(displayName = "Column Types") public List<String> getColType() { return ...the logic parsing types for schemas...; }

Yeah, that definitely makes sense that we should remove colName and colType usage wherever possible to ensure consistency.
So, what about introducing a new constructor in ColumnStatsDesc like:

public ColumnStatsDesc(String tableName, List<FieldSchema> columnSchemas, boolean isTblLevel, int numBitVector, FetchWork fWork1) { this(tableName, columnSchemas == null ? null : Utilities.getColumnNamesFromFieldSchema(columnSchemas), columnSchemas == null ? null : Utilities.getColumnTypesFromFieldSchema(columnSchemas), isTblLevel, numBitVector, fWork1); }

To compute and store the colNames and colTypes in ColumnStatsDesc itself.

Thinking of changing the ColumnStatsDesc constructor to have only columnSchemas instead of colNames / colTypes, the getColName / getColType will have to be changed to either:

@Explain(displayName = "Columns") public List<String> getColName() { return columnSchemas == null ? null : Utilities.getColumnNamesFromFieldSchema(columnSchemas); }

or

@Explain(displayName = "Columns") public List<String> getColName() { if (colName != null) { return colName; } colName = columnSchemas == null ? null : Utilities.getColumnTypesFromFieldSchema(columnSchemas); return colName; }

In the first case, we will have multiple calls to getColumnNamesFromFieldSchema() as i see from static code, multiple usages of getColName() here.

In the second case, we will have columnSchemas as class variable in addition to colName and colType, increasing the complexity in my opinion.

Let me know what do u think of introducing the new constructor as above and keeping the getColName / getColType same as is? Thanks!

I feel we should have a single constructor, so simply change the existing one

regarding complexity: right, I don't know this codepath in general, so I'm not sure either if we need to be extra cautious about storing colName separately, but there is a good way to handle this complexity: refactor it to a new class :D I mean, really, what about FieldSchemas, a simple container to handle this mess, and it can act as FieldSchemas.getColName() and FieldSchemas.getColType(): might seem overengineering, but already has some value, and you can introduce it early, so instead of:

private static List<FieldSchema> getStatsEligibleFieldSchemas(Table tbl) { List<FieldSchema> result = new ArrayList<>(); ... return result; }

you can do:

private static List<FieldSchema> getStatsEligibleFieldSchemas(Table tbl) { List<FieldSchema> result = new ArrayList<>(); ... return new FieldSchemas(result); }

this can already eliminate some functions from the already huge Utilities class

Right, i have refactored the approach into a wrapper FieldSchemas as discussed in this commit: 2b1f556

abstractdog · 2026-05-11T19:42:38Z

  }

-  private static String genRewrittenQuery(Table tbl, List<String> colNames, List<String> colTypes,
+  private static String genRewrittenQuery(Table tbl, List<FieldSchema> columnSchemas,


nit: I like FieldSchemas as it is, so what about using it here too (as you already touched the signature)? it can also have a size() and get(i) method to fulfill the requirements in this method

Nice catch, have made this change in commit: c030988

abstractdog

LGTM
in the current form, this is a good optimization and refactoring of the same area, thanks!

Aggarwal-Raghav · 2026-05-12T08:37:20Z

+      columnNames = getExplicitColumnNamesFromAst(ast);
+    }
+
+    checkForPartitionColumns(columnNames, Utilities.getColumnNamesFromFieldSchema(tbl.getPartitionKeys()));


@tanishq-chugh , can you please check checkForPartitionColumns it also has nested for loop and validateSpecifiedColumnNames is using for loop + contains() which is also O(N*M). Both are used in getColumnsFromAst()

I haven't gone though the full PR yet, just wanted to hightlight.

Made this change in commit: 68446bd

sonarqubecloud · 2026-05-12T19:12:30Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.3% Duplication on New Code

See analysis details on SonarQube Cloud

asf-ci-hive added tests pending tests passed and removed tests pending labels Apr 18, 2026

Aggarwal-Raghav reviewed Apr 18, 2026

View reviewed changes

asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels Apr 19, 2026

thomasrebele reviewed Apr 21, 2026

View reviewed changes

tanishq-chugh force-pushed the HIVE-29551 branch from 4a6804d to 85c0ebe Compare April 26, 2026 19:34

asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels Apr 26, 2026

thomasrebele reviewed Apr 28, 2026

View reviewed changes

asf-ci-hive added tests pending tests unstable and removed tests passed tests pending tests unstable labels Apr 28, 2026

asf-ci-hive added tests pending and removed tests passed labels May 11, 2026

abstractdog reviewed May 11, 2026

View reviewed changes

asf-ci-hive added tests passed and removed tests pending labels May 11, 2026

tanishq-chugh force-pushed the HIVE-29551 branch from 5b73f20 to 9813bb5 Compare May 11, 2026 17:27

Refactor approach to use a container FieldSchemas

2b1f556

tanishq-chugh force-pushed the HIVE-29551 branch from 9813bb5 to 2b1f556 Compare May 11, 2026 17:45

asf-ci-hive added tests pending and removed tests passed labels May 11, 2026

abstractdog reviewed May 11, 2026

View reviewed changes

asf-ci-hive added tests unstable and removed tests pending labels May 11, 2026

Refactor genRewrittenQuery to use FieldSchemas

c030988

asf-ci-hive added tests pending tests passed and removed tests unstable tests pending labels May 12, 2026

abstractdog approved these changes May 12, 2026

View reviewed changes

Aggarwal-Raghav reviewed May 12, 2026

View reviewed changes

Refactor validateSpecifiedColumnNames & checkForPartitionColumns

68446bd

asf-ci-hive added tests pending tests unstable and removed tests passed tests pending tests unstable labels May 12, 2026

asf-ci-hive added tests passed and removed tests pending labels May 12, 2026

Conversation

tanishq-chugh commented Apr 18, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

thomasrebele left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tanishq-chugh May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

abstractdog May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tanishq-chugh May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

abstractdog left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud Bot commented May 12, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

tanishq-chugh May 11, 2026 •

edited

Loading

abstractdog May 11, 2026 •

edited

Loading

tanishq-chugh May 11, 2026 •

edited

Loading