Parallelize Delta Lake checkpoint v2 sidecar files retrieval #25469

Merged

Conversation

Contributor

@zhaner08 zhaner08 commented Apr 1, 2025

Description

Parallelize Delta Lake checkpoint v2 sidecar files retrieval.

This optimization uses a bounded executor backed by the metadata executor service to process sidecar files in parallel. Each file normally takes a few hundred milliseconds to process, depending on its size and on how effective predicate pushdown into the metadata is. Futures are retrieved and splits are generated the same way as before, so splits can still be generated and consumed while the remaining future values are being obtained.
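The idea can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `SidecarReadSketch`, `BoundedExecutor`, and `readAll` are hypothetical names, and the `reader` function stands in for whatever actually parses a sidecar file. It only shows the general pattern of capping concurrency on top of a shared executor and consuming results in submission order while later reads are still running.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical sketch; names are illustrative and not taken from the PR.
final class SidecarReadSketch {
    // Runs tasks on a shared delegate executor, but never more than
    // maxConcurrency of them at the same time.
    static final class BoundedExecutor implements Executor {
        private final Executor delegate;
        private final int maxConcurrency;
        private final Deque<Runnable> pending = new ArrayDeque<>();
        private int running;

        BoundedExecutor(Executor delegate, int maxConcurrency) {
            if (maxConcurrency < 1) {
                throw new IllegalArgumentException("maxConcurrency must be at least 1");
            }
            this.delegate = delegate;
            this.maxConcurrency = maxConcurrency;
        }

        @Override
        public synchronized void execute(Runnable task) {
            pending.add(task);
            startPending();
        }

        // Hands queued tasks to the delegate while there is spare capacity.
        private synchronized void startPending() {
            while (running < maxConcurrency && !pending.isEmpty()) {
                Runnable next = pending.poll();
                running++;
                delegate.execute(() -> {
                    try {
                        next.run();
                    } finally {
                        taskFinished();
                    }
                });
            }
        }

        private synchronized void taskFinished() {
            running--;
            startPending();
        }
    }

    // Submits one read per sidecar path, then collects results in submission
    // order; earlier results are consumed while later reads are in flight.
    static <T> List<T> readAll(List<String> sidecarPaths, Function<String, T> reader,
                               Executor sharedExecutor, int parallelism) {
        Executor bounded = new BoundedExecutor(sharedExecutor, parallelism);
        List<CompletableFuture<T>> futures = new ArrayList<>();
        for (String path : sidecarPaths) {
            futures.add(CompletableFuture.supplyAsync(() -> reader.apply(path), bounded));
        }
        List<T> results = new ArrayList<>();
        for (CompletableFuture<T> future : futures) {
            results.add(future.join());
        }
        return results;
    }

    public static void main(String[] args) {
        ExecutorService shared = Executors.newCachedThreadPool();
        try {
            // String::length is a placeholder for real sidecar parsing.
            List<Integer> lengths = readAll(
                    List.of("sidecar-a.parquet", "sidecar-b.parquet", "sidecar-c.parquet"),
                    String::length, shared, 2);
            System.out.println(lengths);
        } finally {
            shared.shutdown();
        }
    }
}
```

Wrapping a shared pool, rather than creating a dedicated one, keeps the total thread count under the control of the existing executor service while still limiting how much of it a single checkpoint read can occupy.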

This optimization may significantly speed up Delta Lake queries, especially those with a large number of metadata files.

A configuration property, `delta.checkpoint-processing.parallelism`, is provided for users who want to speed up metadata processing further, or who want to reduce the parallelism on a smaller cluster.
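As a rough illustration, the property would be set in the catalog properties file of a Delta Lake catalog. The file name and the value below are assumptions chosen for the example; the description does not state the default, and other required catalog settings are omitted.

```properties
# etc/catalog/example.properties (hypothetical catalog file name)
connector.name=delta_lake
# Illustrative value only; the default is not stated in this PR description.
delta.checkpoint-processing.parallelism=8
```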

This PR also adds more tests for multipart v2 checkpoint validation.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Delta Lake
* Improve performance of scans on Delta Lake tables with v2 checkpoints. ({issue}`25469`)

@cla-bot cla-bot bot added the cla-signed label Apr 1, 2025
@github-actions github-actions bot added the delta-lake Delta Lake connector label Apr 1, 2025
@zhaner08 zhaner08 requested review from ebyhr and raunaqmorarka April 1, 2025 04:50
@zhaner08 zhaner08 self-assigned this Apr 1, 2025
@ebyhr ebyhr requested a review from chenjian2664 April 1, 2025 04:59
@zhaner08
Contributor Author

zhaner08 commented Apr 3, 2025

@raunaqmorarka Thanks for the review, all comments addressed

@zhaner08
Contributor Author

zhaner08 commented Apr 4, 2025

Saw a few tests fail with my new revision, will fix them.

@zhaner08 zhaner08 force-pushed the support_parallel_sidecar_read_1 branch from ca88bb1 to 856364a Compare April 4, 2025 06:31
Member

@raunaqmorarka raunaqmorarka left a comment

Please squash your commits

@zhaner08 zhaner08 force-pushed the support_parallel_sidecar_read_1 branch from 856364a to bd03199 Compare April 4, 2025 14:07
@raunaqmorarka raunaqmorarka merged commit eba2248 into trinodb:master Apr 7, 2025
22 checks passed
@github-actions github-actions bot added this to the 475 milestone Apr 7, 2025
3 participants