Add coerce int96 option for Parquet to support different TimeUnits, test int96_from_spark.parquet from parquet-testing #15537

mbutrovich · 2025-04-01T17:56:41Z

Note: this PR is a draft because it (1) relies on an unreleased arrow-rs version, so there are cargo dependencies that I will remove after DF bumps its arrow-rs dependency, and (2) relies on an unapproved parquet-testing PR. I'm opening it as a draft to give dependent projects a heads-up and to try to get the timing right for upcoming releases.

Which issue does this PR close?

N/A.

Rationale for this change

We are adding Spark-compatible int96 support to DataFusion Comet when using arrow-rs's Parquet reader. To achieve this, we first added support to arrow-rs for reading int96 at resolutions other than nanosecond; previously, the reader would generate nulls for non-null values. Next, we add support to DataFusion for generating the schema that arrow-rs needs in order to read int96 at the resolution Spark expects. Finally, we will connect everything together in DataFusion Comet for accelerated Parquet reading with int96 values.
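To illustrate why the target resolution matters, here is a minimal Rust sketch of decoding an int96 timestamp at two resolutions. It is not the arrow-rs implementation. It assumes the usual int96 layout (8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes of Julian day number), and all function names are hypothetical.

```rust
// Illustrative sketch only; names are hypothetical, not arrow-rs APIs.
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588; // Julian day number of 1970-01-01
const SECONDS_PER_DAY: i64 = 86_400;

/// Build a 12-byte int96 value from its two components (helper for examples).
fn encode_int96(julian_day: i32, nanos_of_day: i64) -> [u8; 12] {
    let mut bytes = [0u8; 12];
    bytes[0..8].copy_from_slice(&nanos_of_day.to_le_bytes());
    bytes[8..12].copy_from_slice(&julian_day.to_le_bytes());
    bytes
}

/// Split an int96 value into (days since the Unix epoch, nanoseconds within the day).
fn split_int96(bytes: [u8; 12]) -> (i64, i64) {
    let mut nanos_buf = [0u8; 8];
    nanos_buf.copy_from_slice(&bytes[0..8]);
    let mut day_buf = [0u8; 4];
    day_buf.copy_from_slice(&bytes[8..12]);
    let days = i64::from(i32::from_le_bytes(day_buf)) - JULIAN_DAY_OF_EPOCH;
    (days, i64::from_le_bytes(nanos_buf))
}

/// Nanoseconds since the epoch, or None if the value does not fit in an i64.
fn int96_to_nanos(bytes: [u8; 12]) -> Option<i64> {
    let (days, nanos) = split_int96(bytes);
    days.checked_mul(SECONDS_PER_DAY * 1_000_000_000)?
        .checked_add(nanos)
}

/// Microseconds since the epoch (sub-microsecond precision is truncated).
fn int96_to_micros(bytes: [u8; 12]) -> Option<i64> {
    let (days, nanos) = split_int96(bytes);
    days.checked_mul(SECONDS_PER_DAY * 1_000_000)?
        .checked_add(nanos / 1_000)
}
```

For a value 200,000 days after the epoch, `int96_to_nanos` returns `None` because the result overflows an i64, while `int96_to_micros` still returns a valid timestamp. Reading at a coarser resolution is what lets such values survive.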

What changes are included in this PR?

A new option in ParquetOptions to coerce the int96 resolution, with serialization support (I think I did this correctly).
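As a rough sketch of how a string-valued option like this could be mapped to a time unit: the description above does not spell out the option's name or accepted values, so the function name and the strings `"s"`, `"ms"`, `"us"`, `"ns"` below are assumptions, and the local `TimeUnit` enum merely stands in for Arrow's `TimeUnit`.

```rust
// Hypothetical sketch; not the code added in this PR.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Parse a case-insensitive resolution string into a TimeUnit.
/// The accepted spellings are assumed for illustration.
fn parse_coerce_int96(setting: &str) -> Result<TimeUnit, String> {
    match setting.to_lowercase().as_str() {
        "s" => Ok(TimeUnit::Second),
        "ms" => Ok(TimeUnit::Millisecond),
        "us" => Ok(TimeUnit::Microsecond),
        "ns" => Ok(TimeUnit::Nanosecond),
        other => Err(format!("unknown int96 resolution: {}", other)),
    }
}
```

A Spark-compatible reader would presumably request microseconds here, since Spark timestamps have microsecond precision.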

Are these changes tested?

Added a new test that relies on the new int96_from_spark.parquet file in parquet-testing.

Are there any user-facing changes?

There is a new field in ParquetOptions. There is also an API change to a pub(crate) test function, which now accepts a provided table schema.

mbutrovich · 2025-04-03T17:14:30Z

apache/parquet-testing#73 has merged, so I updated the parquet-testing dependency. Now waiting on an arrow-rs release and on DF bumping to that version.

mbutrovich added 10 commits March 31, 2025 12:00

Stash.

4e8b309

Stash.

958050c

Checkpoint.

2dbcfbf

update arrow

7dd593d

Fix after merging main.

ad32c9f

Merge branch 'main' into int96_again

6fad37b

Merge branch 'main' into int96_again

72b2ac7

Add test for int96_from_spark.

4bfdc92

Remove commented out code.

f080f83

Update parquet-testing to include int96_from_spark.parquet.

69ed7d4

github-actions bot added core Core DataFusion crate common Related to common crate proto Related to proto crate datasource Changes to the datasource crate labels Apr 3, 2025
