Skip to content

[WIP] HIVE-28931: Convert IN to SEARCH in RexNodeConverter#6495

Draft
rubenada wants to merge 4 commits into
apache:masterfrom
rubenada:HIVE-28931
Draft

[WIP] HIVE-28931: Convert IN to SEARCH in RexNodeConverter#6495
rubenada wants to merge 4 commits into
apache:masterfrom
rubenada:HIVE-28931

Conversation

@rubenada
Copy link
Copy Markdown

What changes were proposed in this pull request?

In RexNodeConverter, convert IN to SEARCH operator.

Why are the changes needed?

Doing an early conversion from IN to SEARCH (instead of OR) in RexNodeConverter seems beneficial for enabling further simplifications and it could possibly reduce compilation latency since we would avoid the intermediate conversion to OR when that is not necessary. In a nutshell instead of doing IN -> OR -> SEARCH we could do IN -> SEARCH and we could possibly exploit the RexBuilder.makeIn method.

Does this PR introduce any user-facing change?

No

How was this patch tested?

tbd

Copy link
Copy Markdown
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few quick comments as I skimmed through the PR.

Comment on lines +585 to +587
if (childRexNodeLst.size() == 2) {
return rexBuilder.makeCall(SqlStdOperatorTable.EQUALS, childRexNodeLst);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this become part of RexBuilder#makeIn API if not already the case?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know... Calcite's code just creates an IN with one element in this case, which is not incorrect (and I guess it's the expected behavior). I've just kept this special case on Hive's side for the moment to minimize side effects by this refactoring.

Comment on lines +595 to +600
// Calcite SEARCH conversion was not possible: generate our own OR expression
List<RexNode> newInputs = RexNodeConverter.transformInToOrOperands(childRexNodeLst, rexBuilder);
if (newInputs == null) {
return null;
}
return newInputs.size() == 1 ? newInputs.get(0) : rexBuilder.makeCall(SqlStdOperatorTable.OR, newInputs);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RexBuilder#makeIn method also generates an OR when it cannot create a SEARCH. Is there any reason of why we opt for the Hive specific IN to OR transformation?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hive's OR conversion is more thorough: it checks HiveCalciteUtil.isDeterministic (and bails out if any expression is not), and also includes a field by field comparison for ROW types. Calcite's code is simply a "blind" OR of EQUALS

}

// TODO remove?
public static List<RexNode> rewriteInClauseChildren(SqlOperator op, List<RexNode> childRexNodeLst,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not use then yes lets drop it.

// the threshold configured
childRexNodeLst = rewriteInClauseChildren(calciteOp, childRexNodeLst, rexBuilder);
calciteOp = SqlStdOperatorTable.OR;
if (childRexNodeLst.size() == 2 || RexUtil.isReferenceOrAccess(childRexNodeLst.get(0), true)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we limit the transformation? What are the pros/cons of doing an unconditional rewrite?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just keeping the original logic to minimize the side effects of the change. I'll investigate why this condition was added in the first place and whether it can be removed...

Comment on lines +271 to +276
if (rewritten != null) {
assert rewritten instanceof RexCall;
RexCall call = (RexCall) rewritten;
if (call.getKind() == SqlKind.OR && maxNodesForInToOrTransformation != 0) {
// Rewrite to OR is done only if number of operands are less than the threshold configured
if (call.getOperands().size() <= maxNodesForInToOrTransformation) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's strange to have threshold blockers after performing the transformation. I think the initial purpose of the threshold was to prevent a performance hit and the exponential explosion of disjunctions generated by the transformation.

As part of this PR we should determine what happens with hive.optimize.transform.in.maxnodes property. Is it still relevant/necessary or are we planning to deprecate and remove it eventually?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I'll try to move this logic into RexNodeConverter (makes more sense in there), to perform the check before actually generating the OR.
I guess we should keep this "maxnodes check" since this property was introduced due to perf issues (HIVE-22074: Slow compilation due to IN to OR transformation)

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants