Skip to content

HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge…#6443

Open
tanishq-chugh wants to merge 15 commits into
apache:masterfrom
tanishq-chugh:HIVE-29551
Open

HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge…#6443
tanishq-chugh wants to merge 15 commits into
apache:masterfrom
tanishq-chugh:HIVE-29551

Conversation

@tanishq-chugh
Copy link
Copy Markdown
Contributor

…tColumnTypes

What changes were proposed in this pull request?

Improve time complexity in ColumnStatsSemanticAnalyzer#getColumnTypes

Why are the changes needed?

Performance Improvement

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual Testing + CI

if (typeInfo.getCategory() != ObjectInspector.Category.PRIMITIVE) {
logTypeWarning(colName, type);
} else {
nonPrimColNames.add(colName);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the variable name should be PrimColNames instead of nonPrimColNames. As the primitve type will enter the else flow.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Aggarwal-Raghav My bad, i validated the columnTypes/names being returned for primitive types and used the wrong variable name. Updated in commit - 4a6804d .
Thanks for pointing this out !

} else {
colTypes.add(type);
}
Map<String, String> colTypeMap = new HashMap<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! When I created HIVE-29551, I had in mind do it without a HashMap if possible. There are two types of usages, depending on where the column names came from:

  • ColumnStatsSemanticAnalyzer#getColumnName
  • Utilities.getColumnNamesFromFieldSchema
    The latter iterates over a list of FieldSchema, so the type info can be obtained from these items as well.

The HashMap is only needed when the ASTNode has 3 children.

Copy link
Copy Markdown
Contributor

@thomasrebele thomasrebele left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the refactoring! I've got some ideas to simplify the code, aiming to make it easier to maintain the code of ColumnStatsSemanticAnalyzer in the future.

return rwt;
}

private record StatsEligibleColumns(List<String> columnNames, List<String> columnTypes) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating a new type, could you please use List<FieldSchema>, which contains both the name and the type of the column?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change in commit - 84d81f9

return new StatsEligibleColumns(colNames, colTypes);
}

private List<String> getColumnName(ASTNode tree) throws SemanticException {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to rename the function, maybe "getExplictColumnNames", though there may be a better name.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed the function to getExplicitColumnNamesFromAst in commit - 84d81f9

Comment on lines +245 to +246
colNames.clear();
colNames.addAll(primColNames);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifying the argument can be avoided when implementing my other comments.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the code has been updated such that modifying this argument is avoided, in commit - 84d81f9

}

protected static List<String> getColumnTypes(Table tbl, List<String> colNames) {
protected static List<String> getColumnTypesByName(Table tbl, List<String> colNames) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend to refactor getColumnTypesByName to return List<FieldSchema>.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change in commit - 84d81f9

colNames = statsCols.columnNames();
} else {
colNames = getColumnName(ast);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The handling of the AST should stay at once place to avoid code duplication here and in #rewriteAST. Maybe a new method List<FieldSchema> getColumns(ASTNode). To keep the behavior the same, I would do roughly the following:

  1. Collect the column names as string using the original method
  2. Verify the names with checkForPartitionColumns and validateSpecifiedColumnNames (and removing the calls to these functions in ColumnStatsSemanticAnalyzer#rewriteAST and ColumnStatsSemanticAnalyzer#analyze)
  3. Collect the columns as List<FieldSchema>
  4. The caller extracts the names (with org.apache.hadoop.hive.ql.exec.Utilities#getColumnNamesFromFieldSchema) and the types (I don't know of an existing function, at least I couldn't find one in Utilities).

This approach avoids the need to modify the column names later, and should make the code easier to understand. It would be nice (if that optimization does not make the code too complex) to optimize the case ast.getChildCount() == 2, so that step 1 and 3 only collect the columns once.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out @thomasrebele !
And, yes this definitely makes more sense and helps to keep code clean. I have made all these changes in commit - 84d81f9

default:
if (tree.getChildCount() != 3) {
throw new SemanticException("Internal error. Expected number of children of ASTNode to be"
+ " either 2 or 3. Found : " + tree.getChildCount());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we modify the method that way, the expected number of children is 3, so the exception message would need to be changed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the exception message in commit - 84d81f9

public void setColName(List<String> colName) {
this.colName = colName;
public List<String> getColName() {
return columnSchemas == null ? null : Utilities.getColumnNamesFromFieldSchema(columnSchemas);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a chance you can go one level further?
we cannot be sure how hot is getColName, so running getColumnNamesFromFieldSchema all the time can be even worse...I mean without investigating this, we can even simply check the callers and adapt them accordingly, like in ColumnStatsDesc

Copy link
Copy Markdown
Contributor Author

@tanishq-chugh tanishq-chugh May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abstractdog I see only one call of AnalyzeRewriteContext#getColName() from TaskCompiler before initialising ColumnStatsDesc, and no other calls which would lead to multiple calls to Utilities.getColumnNamesFromFieldSchema().
I am afraid i did not fully understand about how could this lead to issues and rather removing getColName/getColType and introducing a new constructor in ColumnStatsDesc using columnSchemas instead of colName/colType which internally calls same Utilities methods wouldn't be behaving the same? could you please elaborate on this? Thanks!

Copy link
Copy Markdown
Contributor

@abstractdog abstractdog May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if by static code analysis we can be sure that getColName is not called e.g. 100 times during compilation, this is going to be fine, I'm inclined to accept your analysis here

my point was only to completely remove the separate colName and colType variables from method signatures where it's possible, and only have them as separate variables where it's really needed (e.g. where EXPLAIN plan is used), e.g.:

  @Explain(displayName = "Columns")
  public List<String> getColName() {
    return colName;
  }

so I can see that TaskCompiler uses them separately only for passing them to ColumnStatsDesc, which just complicates code, so what about passing schemas to ColumnStatsDesc, and we can resolve names/types there like:

  @Explain(displayName = "Column Types")
  public List<String> getColType() {
    return ...the logic parsing types for schemas...;
  }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that definitely makes sense that we should remove colName and colType usage wherever possible to ensure consistency.
So, what about introducing a new constructor in ColumnStatsDesc like:

public ColumnStatsDesc(String tableName, List<FieldSchema> columnSchemas, boolean isTblLevel,
      int numBitVector, FetchWork fWork1) {
    this(tableName,
        columnSchemas == null ? null : Utilities.getColumnNamesFromFieldSchema(columnSchemas),
        columnSchemas == null ? null : Utilities.getColumnTypesFromFieldSchema(columnSchemas),
        isTblLevel, numBitVector, fWork1);
  }

To compute and store the colNames and colTypes in ColumnStatsDesc itself.

Thinking of changing the ColumnStatsDesc constructor to have only columnSchemas instead of colNames / colTypes, the getColName / getColType will have to be changed to either:

  @Explain(displayName = "Columns")
  public List<String> getColName() {
    return columnSchemas == null ? null : Utilities.getColumnNamesFromFieldSchema(columnSchemas);
  }

or

  @Explain(displayName = "Columns")
  public List<String> getColName() {
    if (colName != null) {
      return colName;
    }
    colName = columnSchemas == null ? null : Utilities.getColumnTypesFromFieldSchema(columnSchemas);
    return colName;
  }

In the first case, we will have multiple calls to getColumnNamesFromFieldSchema() as i see from static code, multiple usages of getColName() here.

In the second case, we will have columnSchemas as class variable in addition to colName and colType, increasing the complexity in my opinion.

Let me know what do u think of introducing the new constructor as above and keeping the getColName / getColType same as is? Thanks!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we should have a single constructor, so simply change the existing one

regarding complexity: right, I don't know this codepath in general, so I'm not sure either if we need to be extra cautious about storing colName separately, but there is a good way to handle this complexity: refactor it to a new class :D I mean, really, what about FieldSchemas, a simple container to handle this mess, and it can act as FieldSchemas.getColName() and FieldSchemas.getColType(): might seem overengineering, but already has some value, and you can introduce it early, so instead of:

  private static List<FieldSchema> getStatsEligibleFieldSchemas(Table tbl) {
  List<FieldSchema> result = new ArrayList<>();
  ...
  return result;
}

you can do:

  private static List<FieldSchema> getStatsEligibleFieldSchemas(Table tbl) {
  List<FieldSchema> result = new ArrayList<>();
  ...
  return new FieldSchemas(result);
}

this can already eliminate some functions from the already huge Utilities class

Copy link
Copy Markdown
Contributor Author

@tanishq-chugh tanishq-chugh May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, i have refactored the approach into a wrapper FieldSchemas as discussed in this commit: 2b1f556

}

private static String genRewrittenQuery(Table tbl, List<String> colNames, List<String> colTypes,
private static String genRewrittenQuery(Table tbl, List<FieldSchema> columnSchemas,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I like FieldSchemas as it is, so what about using it here too (as you already touched the signature)? it can also have a size() and get(i) method to fulfill the requirements in this method

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, have made this change in commit: c030988

Copy link
Copy Markdown
Contributor

@abstractdog abstractdog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
in the current form, this is a good optimization and refactoring of the same area, thanks!

columnNames = getExplicitColumnNamesFromAst(ast);
}

checkForPartitionColumns(columnNames, Utilities.getColumnNamesFromFieldSchema(tbl.getPartitionKeys()));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tanishq-chugh , can you please check checkForPartitionColumns it also has nested for loop and validateSpecifiedColumnNames is using for loop + contains() which is also O(N*M). Both are used in getColumnsFromAst()

I haven't gone though the full PR yet, just wanted to hightlight.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change in commit: 68446bd

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants