Improve performance of last_value by implementing special GroupsAccumulator #15542
base: main
Thanks @UBarney, I think this is the right direction. I would suggest, though, splitting the PR into smaller ones. The fix itself is important, but there are a bunch of renames, code moves, etc. It would be nice to start with a PR that contains just the fix and a description of the performance benefits.
@@ -291,7 +202,121 @@ impl AggregateUDFImpl for FirstValue {
    }
}

struct FirstPrimitiveGroupsAccumulator<T>
fn create_group_acc(
Suggested change:
- fn create_group_acc(
+ fn create_group_accumulator(
fn create_group_acc(
    args: AccumulatorArgs,
    pick_first_in_group: bool,
) -> std::result::Result<Box<dyn GroupsAccumulator>, DataFusionError> {
We can use `DFResult` instead of `Result` with `DataFusionError`.
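As a rough illustration of this suggestion, assuming `DFResult<T>` is simply an alias for `Result<T, DataFusionError>`: the error enum and the `accumulator_kind` function below are hypothetical stand-ins so the sketch compiles on its own, not DataFusion code.

```rust
/// Hypothetical stand-in for `DataFusionError`, reduced to one variant.
#[derive(Debug, PartialEq)]
enum DataFusionError {
    NotImplemented(String),
}

/// Assumed shape of the alias: a `Result` with the error type fixed.
type DFResult<T> = std::result::Result<T, DataFusionError>;

/// Hypothetical helper with the shorter return type. It only names the kind
/// of accumulator that would be built, to keep the sketch self-contained.
fn accumulator_kind(data_type: &str, pick_first_in_group: bool) -> DFResult<String> {
    let prefix = if pick_first_in_group { "first" } else { "last" };
    match data_type {
        "Int64" | "Float64" => Ok(format!("{prefix}_primitive")),
        other => Err(DataFusionError::NotImplemented(format!(
            "no groups accumulator for {other}"
        ))),
    }
}
```

With the alias, the signature reads `-> DFResult<String>` rather than spelling out `std::result::Result<String, DataFusionError>`.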
@comphead Thanks for reviewing. I have split this PR, and it now contains only the performance improvements. After it is merged, I will open a refactoring PR to handle the renames and code moves.
Which issue does this PR close?
`first` and `last` #13998.
Rationale for this change
This change achieves a significant performance improvement for `last_value` when group cardinality is high.
select id2, id4, last_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4;
select l_shipmode, last_value(l_partkey order by l_orderkey, l_linenumber, l_comment, l_suppkey, l_tax) from 'benchmarks/data/tpch_sf10/lineitem' group by l_shipmode;
What changes are included in this PR?
Add `pick_first_in_group: bool` to `PrimitiveGroupsAccumulator`. If true, take the first element in an aggregation group according to the requested ordering; otherwise, take the last element.
Additional context
#15266
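For illustration only, the effect of the `pick_first_in_group` flag can be sketched with a much simplified accumulator. This is not the code in this PR: the `FirstLastGroups` type, its single `i64` ordering key, and its tie handling are all assumptions made to keep the example small. It does show the general idea of keeping one state slot per group inside a single accumulator and updating all groups in one pass over a batch.

```rust
use std::cmp::Ordering;

/// Hypothetical, simplified groups accumulator: one slot per group holding
/// the ordering key and value of the row currently chosen for that group.
struct FirstLastGroups {
    pick_first_in_group: bool,
    // `None` until the group has seen at least one row.
    state: Vec<Option<(i64, i64)>>,
}

impl FirstLastGroups {
    fn new(pick_first_in_group: bool) -> Self {
        Self { pick_first_in_group, state: Vec::new() }
    }

    /// True if a row with `new_key` should replace the stored row.
    /// Ties keep the stored row (an assumption of this sketch).
    fn should_replace(&self, new_key: i64, cur_key: i64) -> bool {
        match new_key.cmp(&cur_key) {
            Ordering::Less => self.pick_first_in_group,
            Ordering::Greater => !self.pick_first_in_group,
            Ordering::Equal => false,
        }
    }

    /// Processes one batch; `group_indices[i]` is the group of row `i`.
    fn update_batch(
        &mut self,
        group_indices: &[usize],
        keys: &[i64],
        values: &[i64],
        total_num_groups: usize,
    ) {
        assert_eq!(group_indices.len(), keys.len());
        assert_eq!(group_indices.len(), values.len());
        if self.state.len() < total_num_groups {
            self.state.resize(total_num_groups, None);
        }
        for ((&g, &k), &v) in group_indices.iter().zip(keys).zip(values) {
            let replace = match self.state[g] {
                None => true,
                Some((cur_key, _)) => self.should_replace(k, cur_key),
            };
            if replace {
                self.state[g] = Some((k, v));
            }
        }
    }

    /// Returns the chosen value for every group, in group-index order.
    fn evaluate(&self) -> Vec<Option<i64>> {
        self.state.iter().map(|slot| slot.map(|(_, v)| v)).collect()
    }
}
```

Constructing it with `FirstLastGroups::new(true)` gives `first_value`-style behaviour and `FirstLastGroups::new(false)` gives `last_value`-style behaviour, so both aggregates can share one implementation.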
Are these changes tested?
Yes
Are there any user-facing changes?
No