Add coerce int96 option for Parquet to support different TimeUnits, test int96_from_spark.parquet from parquet-testing #15537
Note: this PR is a draft because it (1) relies on an unreleased arrow-rs version, so there are Cargo dependency entries that I will remove after DataFusion bumps its arrow-rs dependency, and (2) relies on an unapproved parquet-testing PR. I'm opening it as a draft to signal intent to dependent projects and to get the timing right for upcoming releases.
Which issue does this PR close?
Rationale for this change
We are adding Spark-compatible int96 support to DataFusion Comet when using arrow-rs's Parquet reader. To achieve this, we first added support for arrow-rs to read int96 at resolutions other than nanosecond; previously it would generate nulls for non-null values. Next, we will add support to DataFusion to generate the schema arrow-rs needs in order to read int96 at the resolution Spark expects. Finally, we will connect everything together in DataFusion Comet for accelerated Parquet reading of int96 values.
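For background, the core of the coercion is the standard Spark int96 layout: 12 little-endian bytes holding nanoseconds-of-day followed by a Julian day number, which readers convert to a timestamp and then truncate to the target unit. This is a self-contained sketch of that arithmetic, not the arrow-rs implementation (names here are illustrative only):

```rust
/// Illustrative sketch (not arrow-rs code): decode a Spark-style int96
/// timestamp and coerce it to a coarser time unit.
///
/// Layout: bytes 0..8 = nanoseconds within the day (i64 LE),
///         bytes 8..12 = Julian day number (u32 LE).
fn int96_to_nanos(bytes: [u8; 12]) -> i64 {
    // Julian day number of the Unix epoch, 1970-01-01.
    const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
    const NANOS_PER_DAY: i64 = 86_400_000_000_000;
    let nanos_of_day = i64::from_le_bytes(bytes[0..8].try_into().unwrap());
    let julian_day = i64::from(u32::from_le_bytes(bytes[8..12].try_into().unwrap()));
    (julian_day - JULIAN_DAY_OF_EPOCH) * NANOS_PER_DAY + nanos_of_day
}

#[derive(Clone, Copy)]
enum TimeUnit {
    Nanosecond,
    Microsecond,
    Millisecond,
    Second,
}

/// Truncate a nanosecond timestamp to the requested unit. Note that integer
/// division truncates toward zero, so pre-epoch values need extra care in a
/// real implementation.
fn coerce(nanos: i64, unit: TimeUnit) -> i64 {
    match unit {
        TimeUnit::Nanosecond => nanos,
        TimeUnit::Microsecond => nanos / 1_000,
        TimeUnit::Millisecond => nanos / 1_000_000,
        TimeUnit::Second => nanos / 1_000_000_000,
    }
}

fn main() {
    // 1970-01-02T00:00:00: nanos-of-day = 0, Julian day = 2_440_589.
    let mut b = [0u8; 12];
    b[8..12].copy_from_slice(&2_440_589u32.to_le_bytes());
    let nanos = int96_to_nanos(b);
    println!("{}", nanos); // 86_400_000_000_000 ns since epoch
    println!("{}", coerce(nanos, TimeUnit::Microsecond)); // 86_400_000_000 us
}
```

Spark writes int96 with nanosecond precision but reads it back as microsecond timestamps, which is why a reader that only supports nanosecond resolution cannot round-trip Spark data faithfully.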
What changes are included in this PR?
A new option on `ParquetOptions` to coerce the int96 resolution, with serialization support (I think I did this correctly).
Are these changes tested?
Added a new test that relies on the new int96_from_spark.parquet file in parquet-testing.
Are there any user-facing changes?
There is a new field in `ParquetOptions`. There is also an API change to a `pub(crate)` test function so that it accepts a provided table schema.
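For context, if the new field follows DataFusion's usual pattern of surfacing `ParquetOptions` fields under the `datafusion.execution.parquet.*` config namespace, it could be set like this (the exact option name and accepted values shown here are assumptions, not confirmed by this PR):

```sql
-- Hypothetical: coerce int96 timestamps to microsecond resolution,
-- the unit Spark uses when reading int96 data back.
SET datafusion.execution.parquet.coerce_int96 = 'us';
```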