Disallow order by within ordered-set aggregate functions argument lists#20421
Disallow order by within ordered-set aggregate functions argument lists#20421cj-zhukov wants to merge 3 commits intoapache:mainfrom
Conversation
High-level overviewThis PR fixes incorrect handling of ordered-set aggregate syntax. Specifically:
|
|
@Jefffrey Since you originally opened this issue, could you please take a look at the proposed refactor? |
There was a problem hiding this comment.
Thank you @cj-zhukov and @Jefffrey
The only thing I worry about this PR is that it will cause some queries that used to work to stop. However, on more review it seems like what it does is just improve the error message.
Can someone confirm that this is the case?
| select quantile_cont(col0, 0.75 order by col0) | ||
| from values (1, 3), (2, 2), (3, 1) t(col0, col1); | ||
|
|
||
| statement error ORDER BY must be specified using WITHIN GROUP |
There was a problem hiding this comment.
I double checked that these queries fail on main too:
> select quantile_cont(0.75 order by col0)
from values (1, 3), (2, 2), (3, 1) t(col0, col1);
Error during planning: Function 'percentile_cont' expects 2 arguments but received 1 No function matches the given name and argument types 'percentile_cont(Float64)'. You might need to add explicit type casts.
Candidate functions:
percentile_cont(expr: Coercion(TypeSignatureClass::Float, implicit_coercion=ImplicitCoercion([Numeric], default_type=Float64), percentile: Coercion(TypeSignatureClass::Native(LogicalType(Native(Float64), Float64)), implicit_coercion=ImplicitCoercion([Numeric], default_type=Float64))There was a problem hiding this comment.
The query above currently succeeds on main
select quantile_cont(col0, 0.75 order by col0)
from values (1, 3), (2, 2), (3, 1) t(col0, col1);I was thinking this could be a breaking change, but I wonder how often these (ordered set aggregates) are currently used especially with an order by in the argument list like this; if we want to be fully explicit we can add an entry to the upgrade guide
There was a problem hiding this comment.
I double checked and I agree with you
andrewlamb@Andrews-MacBook-Pro-3:~/Software/influxdb_pro/ent$ datafusion-cli
DataFusion CLI v52.1.0
> select quantile_cont(col0, 0.75 order by col0)
from values (1, 3), (2, 2), (3, 1) t(col0, col1);
+-------------------------------------+
| quantile_cont(t.col0,Float64(0.75)) |
+-------------------------------------+
| 2.5 |
+-------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.In general, rather than disallow this syntax, is there some way we can support both syntaxes? I also have no idea how often they are used but if it is currently supported and produces the correct result, then I suspect at least some users would treat them starting to error as a regression 🤔
There was a problem hiding this comment.
I can confirm that this PR does introduce a breaking change. This query works on main:
select quantile_cont(col0, 0.75 order by col0)
from values (1, 3), (2, 2), (3, 1) t(col0, col1);
but now it returns error: ORDER BY must be specified using WITHIN GROUP
There was a problem hiding this comment.
Thanks for raising this suggestion - I started working on supporting both syntaxes instead of returning an error.
After digging into it more deeply, it turns out the change is more involved than I initially expected. It’s starting to drift away from the scope of the original refactor.
Given that, I’m wondering if it would make sense to keep this PR focused and open a separate PR dedicated specifically to supporting both syntaxes. I’m happy to continue working on that and share progress once I have a clean and well-tested solution.
Would that approach make sense?
Probably I am just worried about disallowing syntax that was previously allowed as I worry it will cause downstream issues (we have hit stuff like this in the past at InfluxData) From a customers perspective they are like "my query use to work but no longer does" Unless we have some sort of more compelling story (like "it was getting the wrong answer and now you get an error which is better") it is hard to justify such changes |
Thanks for clarifying - that makes sense. I’ll update this PR to continue supporting both syntaxes rather than introducing a breaking change. I’ll follow up once I have a clean solution in place. |
I believe this might be the case here? As far as I'm aware, the select quantile_cont(col0, 0.75 order by col0 desc)
from values (1, 3), (2, 2), (3, 1) t(col0, col1);Then it is would actually be computed as ascending (the default since |
|
This PR now supports both syntaxes: What do you think about supporting both syntaxes? Happy to adjust if there is a preference to keep only one form. |
alamb
left a comment
There was a problem hiding this comment.
I am still trying to get my head around what this PR does / doesn't do
Here is a query that (still) doesn't work on this PR. Is that expected? Or do you expect that this one will work?
-- Unsupported on both main and on this branch
SELECT approx_percentile_cont_with_weight(1, 0.95 ORDER BY x)
FROM (VALUES (1), (2), (3)) AS t(x);| Ok((exprs, names)) | ||
| } | ||
|
|
||
| #[expect(dead_code)] |
There was a problem hiding this comment.
I think this means the code is not used -- shouldn't we just remove it rather than making #[expect(dead)]?
There was a problem hiding this comment.
I think I may be missing part of the intended scope, so I’d appreciate some clarification.
It looks like we may need broader test coverage here - not only for quantile_cont, but also for other aggregate functions and similar usage patterns, to better define the expected behavior.
Originally, this issue was about rejecting invalid ORDER BY usage inside aggregate argument lists earlier. My initial fix enforced that, but it introduced a breaking change, so I updated the implementation to support both syntaxes for quantile_cont. This works for the cases I’ve added tests for.
Regarding your example:
SELECT approx_percentile_cont_with_weight(1, 0.95 ORDER BY x)
FROM (VALUES (1), (2), (3)) AS t(x);I didn’t consider this kind of usage across other aggregate functions when implementing the change, so I’m not fully sure what the expected behavior should be here.
Would you expect inline ORDER BY to be supported only for quantile_cont (and similar ordered-set aggregates), or should this be extended more generally to other aggregate functions as well?
There was a problem hiding this comment.
Would you expect inline ORDER BY to be supported only for quantile_cont (and similar ordered-set aggregates), or should this be extended more generally to other aggregate functions as well?
I am not sure -- I basically expect that datafusion would follow other implementations rather than creating our own syntax. In the case where previous versions of DataFusion supported syntax that is not supported by other systems, I would expect that syntax to keep working, though we don't have to extend it to other functions that didn't previously support it
There was a problem hiding this comment.
Thanks, that helps clarify the direction.
I’ll keep the inline ORDER BY syntax only for cases where it was already supported. I’ll also add tests to ensure we preserve existing behavior while rejecting unsupported cases more clearly.
There was a problem hiding this comment.
Thank you -- I am sorry for the delay / back and forth on this -- but we have been bitten downstream in the past when stuff that used to work stopped working
…et-agr-fn-arg-lists
Which issue does this PR close?
order bywithin ordered-set aggregate functions argument lists #18281.Rationale for this change
This PR fixes incorrect handling of ordered-set aggregate syntax.
What changes are included in this PR?
datafusion/sql/src/expr/function.rsto return a planning error whenORDER BYis used inlinedatafusion/sqllogictest/test_files/aggregate.sltto ensure these invalid forms now return a syntax/planning errorAre these changes tested?
Yes, these changes are tested.
Are there any user-facing changes?
Yes - this is a public API change.
This query:
now returns an error:
ORDER BY must be specified using WITHIN GROUP