HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge…#6443
HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge…#6443tanishq-chugh wants to merge 15 commits into
Conversation
| if (typeInfo.getCategory() != ObjectInspector.Category.PRIMITIVE) { | ||
| logTypeWarning(colName, type); | ||
| } else { | ||
| nonPrimColNames.add(colName); |
There was a problem hiding this comment.
the variable name should be PrimColNames instead of nonPrimColNames. As the primitve type will enter the else flow.
There was a problem hiding this comment.
@Aggarwal-Raghav My bad, i validated the columnTypes/names being returned for primitive types and used the wrong variable name. Updated in commit - 4a6804d .
Thanks for pointing this out !
| } else { | ||
| colTypes.add(type); | ||
| } | ||
| Map<String, String> colTypeMap = new HashMap<>(); |
There was a problem hiding this comment.
Thanks for the PR! When I created HIVE-29551, I had in mind do it without a HashMap if possible. There are two types of usages, depending on where the column names came from:
- ColumnStatsSemanticAnalyzer#getColumnName
- Utilities.getColumnNamesFromFieldSchema
The latter iterates over a list of FieldSchema, so the type info can be obtained from these items as well.
The HashMap is only needed when the ASTNode has 3 children.
4a6804d to
85c0ebe
Compare
thomasrebele
left a comment
There was a problem hiding this comment.
Thank you for the refactoring! I've got some ideas to simplify the code, aiming to make it easier to maintain the code of ColumnStatsSemanticAnalyzer in the future.
| return rwt; | ||
| } | ||
|
|
||
| private record StatsEligibleColumns(List<String> columnNames, List<String> columnTypes) { |
There was a problem hiding this comment.
Instead of creating a new type, could you please use List<FieldSchema>, which contains both the name and the type of the column?
| return new StatsEligibleColumns(colNames, colTypes); | ||
| } | ||
|
|
||
| private List<String> getColumnName(ASTNode tree) throws SemanticException { |
There was a problem hiding this comment.
I would suggest to rename the function, maybe "getExplictColumnNames", though there may be a better name.
There was a problem hiding this comment.
Renamed the function to getExplicitColumnNamesFromAst in commit - 84d81f9
| colNames.clear(); | ||
| colNames.addAll(primColNames); |
There was a problem hiding this comment.
Modifying the argument can be avoided when implementing my other comments.
There was a problem hiding this comment.
Yes, the code has been updated such that modifying this argument is avoided, in commit - 84d81f9
| } | ||
|
|
||
| protected static List<String> getColumnTypes(Table tbl, List<String> colNames) { | ||
| protected static List<String> getColumnTypesByName(Table tbl, List<String> colNames) { |
There was a problem hiding this comment.
I recommend to refactor getColumnTypesByName to return List<FieldSchema>.
| colNames = statsCols.columnNames(); | ||
| } else { | ||
| colNames = getColumnName(ast); | ||
| } |
There was a problem hiding this comment.
The handling of the AST should stay at once place to avoid code duplication here and in #rewriteAST. Maybe a new method List<FieldSchema> getColumns(ASTNode). To keep the behavior the same, I would do roughly the following:
- Collect the column names as string using the original method
- Verify the names with checkForPartitionColumns and validateSpecifiedColumnNames (and removing the calls to these functions in ColumnStatsSemanticAnalyzer#rewriteAST and ColumnStatsSemanticAnalyzer#analyze)
- Collect the columns as
List<FieldSchema> - The caller extracts the names (with
org.apache.hadoop.hive.ql.exec.Utilities#getColumnNamesFromFieldSchema) and the types (I don't know of an existing function, at least I couldn't find one in Utilities).
This approach avoids the need to modify the column names later, and should make the code easier to understand. It would be nice (if that optimization does not make the code too complex) to optimize the case ast.getChildCount() == 2, so that step 1 and 3 only collect the columns once.
There was a problem hiding this comment.
Thanks for pointing this out @thomasrebele !
And, yes this definitely makes more sense and helps to keep code clean. I have made all these changes in commit - 84d81f9
| default: | ||
| if (tree.getChildCount() != 3) { | ||
| throw new SemanticException("Internal error. Expected number of children of ASTNode to be" | ||
| + " either 2 or 3. Found : " + tree.getChildCount()); |
There was a problem hiding this comment.
If we modify the method that way, the expected number of children is 3, so the exception message would need to be changed.
There was a problem hiding this comment.
Updated the exception message in commit - 84d81f9
| public void setColName(List<String> colName) { | ||
| this.colName = colName; | ||
| public List<String> getColName() { | ||
| return columnSchemas == null ? null : Utilities.getColumnNamesFromFieldSchema(columnSchemas); |
There was a problem hiding this comment.
is there a chance you can go one level further?
we cannot be sure how hot is getColName, so running getColumnNamesFromFieldSchema all the time can be even worse...I mean without investigating this, we can even simply check the callers and adapt them accordingly, like in ColumnStatsDesc
There was a problem hiding this comment.
@abstractdog I see only one call of AnalyzeRewriteContext#getColName() from TaskCompiler before initialising ColumnStatsDesc, and no other calls which would lead to multiple calls to Utilities.getColumnNamesFromFieldSchema().
I am afraid i did not fully understand about how could this lead to issues and rather removing getColName/getColType and introducing a new constructor in ColumnStatsDesc using columnSchemas instead of colName/colType which internally calls same Utilities methods wouldn't be behaving the same? could you please elaborate on this? Thanks!
There was a problem hiding this comment.
if by static code analysis we can be sure that getColName is not called e.g. 100 times during compilation, this is going to be fine, I'm inclined to accept your analysis here
my point was only to completely remove the separate colName and colType variables from method signatures where it's possible, and only have them as separate variables where it's really needed (e.g. where EXPLAIN plan is used), e.g.:
@Explain(displayName = "Columns")
public List<String> getColName() {
return colName;
}
so I can see that TaskCompiler uses them separately only for passing them to ColumnStatsDesc, which just complicates code, so what about passing schemas to ColumnStatsDesc, and we can resolve names/types there like:
@Explain(displayName = "Column Types")
public List<String> getColType() {
return ...the logic parsing types for schemas...;
}
There was a problem hiding this comment.
Yeah, that definitely makes sense that we should remove colName and colType usage wherever possible to ensure consistency.
So, what about introducing a new constructor in ColumnStatsDesc like:
public ColumnStatsDesc(String tableName, List<FieldSchema> columnSchemas, boolean isTblLevel,
int numBitVector, FetchWork fWork1) {
this(tableName,
columnSchemas == null ? null : Utilities.getColumnNamesFromFieldSchema(columnSchemas),
columnSchemas == null ? null : Utilities.getColumnTypesFromFieldSchema(columnSchemas),
isTblLevel, numBitVector, fWork1);
}
To compute and store the colNames and colTypes in ColumnStatsDesc itself.
Thinking of changing the ColumnStatsDesc constructor to have only columnSchemas instead of colNames / colTypes, the getColName / getColType will have to be changed to either:
@Explain(displayName = "Columns")
public List<String> getColName() {
return columnSchemas == null ? null : Utilities.getColumnNamesFromFieldSchema(columnSchemas);
}
or
@Explain(displayName = "Columns")
public List<String> getColName() {
if (colName != null) {
return colName;
}
colName = columnSchemas == null ? null : Utilities.getColumnTypesFromFieldSchema(columnSchemas);
return colName;
}
In the first case, we will have multiple calls to getColumnNamesFromFieldSchema() as i see from static code, multiple usages of getColName() here.
In the second case, we will have columnSchemas as class variable in addition to colName and colType, increasing the complexity in my opinion.
Let me know what do u think of introducing the new constructor as above and keeping the getColName / getColType same as is? Thanks!
There was a problem hiding this comment.
I feel we should have a single constructor, so simply change the existing one
regarding complexity: right, I don't know this codepath in general, so I'm not sure either if we need to be extra cautious about storing colName separately, but there is a good way to handle this complexity: refactor it to a new class :D I mean, really, what about FieldSchemas, a simple container to handle this mess, and it can act as FieldSchemas.getColName() and FieldSchemas.getColType(): might seem overengineering, but already has some value, and you can introduce it early, so instead of:
private static List<FieldSchema> getStatsEligibleFieldSchemas(Table tbl) {
List<FieldSchema> result = new ArrayList<>();
...
return result;
}
you can do:
private static List<FieldSchema> getStatsEligibleFieldSchemas(Table tbl) {
List<FieldSchema> result = new ArrayList<>();
...
return new FieldSchemas(result);
}
this can already eliminate some functions from the already huge Utilities class
There was a problem hiding this comment.
Right, i have refactored the approach into a wrapper FieldSchemas as discussed in this commit: 2b1f556
| } | ||
|
|
||
| private static String genRewrittenQuery(Table tbl, List<String> colNames, List<String> colTypes, | ||
| private static String genRewrittenQuery(Table tbl, List<FieldSchema> columnSchemas, |
There was a problem hiding this comment.
nit: I like FieldSchemas as it is, so what about using it here too (as you already touched the signature)? it can also have a size() and get(i) method to fulfill the requirements in this method
There was a problem hiding this comment.
Nice catch, have made this change in commit: c030988
abstractdog
left a comment
There was a problem hiding this comment.
LGTM
in the current form, this is a good optimization and refactoring of the same area, thanks!
| columnNames = getExplicitColumnNamesFromAst(ast); | ||
| } | ||
|
|
||
| checkForPartitionColumns(columnNames, Utilities.getColumnNamesFromFieldSchema(tbl.getPartitionKeys())); |
There was a problem hiding this comment.
@tanishq-chugh , can you please check checkForPartitionColumns it also has nested for loop and validateSpecifiedColumnNames is using for loop + contains() which is also O(N*M). Both are used in getColumnsFromAst()
I haven't gone though the full PR yet, just wanted to hightlight.
|



…tColumnTypes
What changes were proposed in this pull request?
Improve time complexity in ColumnStatsSemanticAnalyzer#getColumnTypes
Why are the changes needed?
Performance Improvement
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manual Testing + CI