Improve performance of last_value by implementing special GroupsAccumulator #15542

Open
wants to merge 2 commits into
base: main
Conversation

UBarney
Contributor

@UBarney UBarney commented Apr 2, 2025

Which issue does this PR close?

Rationale for this change

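Significantly improves `last_value` performance when group cardinality is high. In the benchmark below, the `group by id2, id4` query drops from 36.546s to 7.276s (about 5x faster), while the `group by l_shipmode` query improves more modestly, from 0.962s to 0.801s.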

| benchmark sql | main | this PR |
|---|---|---|
| `select id2, id4, last_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4;` | 36.546s | 7.276s |
| `select l_shipmode, last_value(l_partkey order by l_orderkey, l_linenumber, l_comment, l_suppkey, l_tax) from 'benchmarks/data/tpch_sf10/lineitem' group by l_shipmode;` | 0.962s | 0.801s |

What changes are included in this PR?

  • Add a field `pick_first_in_group: bool` to PrimitiveGroupsAccumulator. If true, take the first element in an aggregation group according to the requested ordering; otherwise, take the last element.
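As a rough illustration of the flag described above, here is a minimal sketch of a per-group accumulator that keeps either the first or the last row under an ordering. It is not the code in this PR: `FirstLastSketch`, its fields other than `pick_first_in_group`, and the use of a single `i64` ordering key and `i64` values are all simplifying assumptions, and it does not use DataFusion or Arrow types.

```rust
use std::cmp::Ordering;

/// Hypothetical, simplified stand-in for a groups accumulator: one slot per
/// group, holding the ordering key and value of the row currently chosen.
struct FirstLastSketch {
    /// true: keep the row with the smallest ordering key (first_value);
    /// false: keep the row with the largest ordering key (last_value).
    pick_first_in_group: bool,
    /// Indexed by group id; `None` until the group has seen a row.
    /// Each entry is (ordering key, value).
    slots: Vec<Option<(i64, i64)>>,
}

impl FirstLastSketch {
    fn new(pick_first_in_group: bool) -> Self {
        Self { pick_first_in_group, slots: Vec::new() }
    }

    /// Process one batch of rows; `group_indices[i]` is the group of row `i`.
    fn update_batch(
        &mut self,
        values: &[i64],
        order_keys: &[i64],
        group_indices: &[usize],
        total_num_groups: usize,
    ) {
        assert_eq!(values.len(), order_keys.len());
        assert_eq!(values.len(), group_indices.len());
        if self.slots.len() < total_num_groups {
            self.slots.resize(total_num_groups, None);
        }
        for ((&value, &key), &group) in values.iter().zip(order_keys).zip(group_indices) {
            let replace = match self.slots[group] {
                // First row seen for this group: always keep it.
                None => true,
                Some((current_key, _)) => match key.cmp(&current_key) {
                    Ordering::Less => self.pick_first_in_group,
                    Ordering::Greater => !self.pick_first_in_group,
                    // Ties keep the row already stored (an arbitrary choice here).
                    Ordering::Equal => false,
                },
            };
            if replace {
                self.slots[group] = Some((key, value));
            }
        }
    }

    /// One output per group: the chosen value, or `None` for an empty group.
    fn evaluate(&self) -> Vec<Option<i64>> {
        self.slots.iter().map(|&slot| slot.map(|(_, value)| value)).collect()
    }
}
```

Because all groups share one flat `slots` vector, the same update loop serves both aggregates, and no separate accumulator object is allocated per group, which is where high-cardinality queries would be expected to benefit.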

Additional context

#15266

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added labels Apr 2, 2025: core (Core DataFusion crate); sqllogictest (SQL Logic Tests (.slt)); functions (Changes to functions implementation)
@UBarney UBarney marked this pull request as ready for review April 3, 2025 01:34
Contributor

@comphead comphead left a comment


Thanks @UBarney, I think this is the right direction. I would suggest, though, splitting the PR into smaller ones. The fix itself is important, but there are a bunch of renames, code moves, etc. It would be nice to start with a PR containing just the fix and a description of the performance benefits.

@@ -291,7 +202,121 @@ impl AggregateUDFImpl for FirstValue {
}
}

- struct FirstPrimitiveGroupsAccumulator<T>
+ fn create_group_acc(
Contributor


Suggested change
- fn create_group_acc(
+ fn create_group_accumulator(

fn create_group_acc(
args: AccumulatorArgs,
pick_first_in_group: bool,
) -> std::result::Result<Box<dyn GroupsAccumulator>, DataFusionError> {
Contributor


We can use `DFResult` here instead of spelling out `std::result::Result` with `DataFusionError`.
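For readers unfamiliar with the pattern behind this suggestion, the sketch below shows how a result type alias shortens a signature without changing its type. `SketchError`, `SketchResult`, and both functions are hypothetical stand-ins invented for illustration; they are not DataFusion's `DataFusionError` or `DFResult`.

```rust
use std::fmt;

/// Hypothetical stand-in for a crate-wide error type.
#[derive(Debug, PartialEq)]
enum SketchError {
    NotSupported(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::NotSupported(msg) => write!(f, "not supported: {msg}"),
        }
    }
}

impl std::error::Error for SketchError {}

/// Alias that fixes the error type, so signatures only name the success type.
type SketchResult<T> = std::result::Result<T, SketchError>;

/// Long form: the full path and the error type are written out.
fn create_long_form(supported: bool) -> std::result::Result<Box<String>, SketchError> {
    if supported {
        Ok(Box::new("accumulator".to_string()))
    } else {
        Err(SketchError::NotSupported("data type".to_string()))
    }
}

/// Short form: exactly the same return type, expressed through the alias.
fn create_short_form(supported: bool) -> SketchResult<Box<String>> {
    create_long_form(supported)
}
```

Since the alias expands to the same type, the two forms are interchangeable at call sites and `?` works across them unchanged.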

@UBarney UBarney marked this pull request as draft April 5, 2025 14:14
@UBarney UBarney marked this pull request as ready for review April 6, 2025 03:56
@UBarney
Contributor Author

UBarney commented Apr 6, 2025

@comphead Thanks for reviewing. I have split this PR, so it now contains only the performance improvements. After it is merged, I will open a follow-up refactor PR to handle the renames and code moves.

Development

Successfully merging this pull request may close these issues.

Support fast group accumulator for first and last
2 participants