Add coerce int96 option for Parquet to support different TimeUnits, test int96_from_spark.parquet from parquet-testing #15537

Draft
wants to merge 10 commits into
base: main

@mbutrovich mbutrovich commented Apr 1, 2025

Note: this PR is a draft because it (1) relies on an unreleased arrow-rs version, so it carries Cargo dependencies that I will remove after DataFusion bumps its arrow-rs dependency, and (2) relies on an unapproved parquet-testing PR. I'm opening it as a draft to signal the change to the other projects involved and to try to get the timing right for upcoming releases.

Which issue does this PR close?

  • N/A.

Rationale for this change

We are adding Spark-compatible int96 support to DataFusion Comet when using arrow-rs's Parquet reader. To achieve this, we first added support in arrow-rs for reading int96 at resolutions other than nanosecond; previously, it would generate nulls for non-null values. Next, this PR adds support in DataFusion for generating the schema that arrow-rs needs to read int96 at the resolution Spark expects. Finally, we will connect everything together in DataFusion Comet for accelerated Parquet reading with int96 values.
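To illustrate why the target resolution matters, here is a minimal standalone sketch. It is not the arrow-rs or DataFusion implementation. It assumes the conventional INT96 timestamp layout (8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes holding the Julian day number). The names `Resolution`, `encode_int96`, and `int96_to_timestamp` are hypothetical and exist only for this example, and returning `None` on overflow is a choice made for the sketch.

```rust
/// Julian day number of the Unix epoch (1970-01-01).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const SECONDS_PER_DAY: i64 = 86_400;
const NANOS_PER_SECOND: i64 = 1_000_000_000;

/// Target resolution for the decoded timestamp (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug)]
enum Resolution {
    Millis,
    Micros,
    Nanos,
}

impl Resolution {
    fn ticks_per_second(self) -> i64 {
        match self {
            Resolution::Millis => 1_000,
            Resolution::Micros => 1_000_000,
            Resolution::Nanos => NANOS_PER_SECOND,
        }
    }
}

/// Build a 12-byte INT96 value: nanoseconds within the day, then the Julian day.
fn encode_int96(julian_day: i32, nanos_of_day: i64) -> [u8; 12] {
    let mut out = [0u8; 12];
    out[..8].copy_from_slice(&nanos_of_day.to_le_bytes());
    out[8..].copy_from_slice(&julian_day.to_le_bytes());
    out
}

/// Decode an INT96 timestamp into ticks since the Unix epoch at `res`.
/// Returns `None` when the value does not fit in an i64 at that resolution.
fn int96_to_timestamp(bytes: [u8; 12], res: Resolution) -> Option<i64> {
    let mut nanos_buf = [0u8; 8];
    nanos_buf.copy_from_slice(&bytes[..8]);
    let mut day_buf = [0u8; 4];
    day_buf.copy_from_slice(&bytes[8..]);

    let nanos_of_day = i64::from_le_bytes(nanos_buf);
    let days_since_epoch = i64::from(i32::from_le_bytes(day_buf)) - JULIAN_DAY_OF_EPOCH;

    let ticks_per_second = res.ticks_per_second();
    // Whole days, converted to ticks with overflow checks.
    let day_ticks = days_since_epoch
        .checked_mul(SECONDS_PER_DAY)?
        .checked_mul(ticks_per_second)?;
    // Time within the day, truncated to the target resolution.
    let sub_day_ticks = nanos_of_day / (NANOS_PER_SECOND / ticks_per_second);
    day_ticks.checked_add(sub_day_ticks)
}

fn main() {
    // 200,000 days after the epoch is beyond what an i64 of nanoseconds can hold.
    let far_future = encode_int96(2_440_588 + 200_000, 0);
    println!("nanos:  {:?}", int96_to_timestamp(far_future, Resolution::Nanos));
    println!("micros: {:?}", int96_to_timestamp(far_future, Resolution::Micros));
}
```

An i64 of nanoseconds only covers roughly the years 1677 to 2262, so a value that overflows at nanosecond resolution can still be represented once the reader targets microseconds, which is the resolution Spark expects.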

What changes are included in this PR?

  • new option in ParquetOptions to coerce int96 resolution, with serialization support (I think I did this correctly)

Are these changes tested?

Added a new test that relies on new int96_from_spark.parquet in parquet-testing.

Are there any user-facing changes?

There is a new field in ParquetOptions. There is also an API change to a pub(crate) test function, which now accepts a provided table schema.

@github-actions bot added the core, common, proto, and datasource labels on Apr 3, 2025
@mbutrovich (Author)

apache/parquet-testing#73 merged so I updated the parquet-testing dependency. Now waiting on an arrow-rs release and DF bumping to that version.
