feat(datafusion): Support insert_into in IcebergTableProvider #1511


Draft

CTTY wants to merge 19 commits into apache:main from CTTY:ctty/df-insert

Contributor

CTTY commented Jul 15, 2025

Which issue does this PR close?

A part of [EPIC] Support for appending data to iceberg table. #1382

What changes are included in this PR?

Are these changes tested?

CTTY added 5 commits

July 15, 2025 15:35


          Support Datafusion insert_into

a5593b4


          cleanup

558b402


          minor

847a2bb


          minor

b067656


          clippy ftw

f52a698

CTTY commented


crates/iceberg/src/arrow/value.rs

@@ @@ -440,10 +440,12 @@ impl PartnerAccessor<ArrayRef> for ArrowArrayAccessor { @@
                       Ok(schema_partner)
                   }
+                  // todo generate field_pos in datafusion instead of passing to here

Contributor Author

CTTY Jul 16, 2025

I found it tricky to handle this case: the input from DataFusion won't have field IDs, so we need to assign them manually. Maybe there is a way to do name mapping here?

Contributor

liurenjie1024 Jul 21, 2025

Could you help me to understand why we need to change this?
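For readers following this thread, here is a minimal sketch of the name-based lookup idea raised above. It assumes top-level columns only, and the helper `assign_field_ids` and its tuple-based schema are hypothetical stand-ins, not the iceberg-rust API.

```rust
use std::collections::HashMap;

/// Hypothetical helper: resolve Iceberg field IDs for incoming column names
/// by looking them up in the table schema's (name, field_id) pairs.
/// Nested fields are deliberately out of scope for this sketch.
fn assign_field_ids(
    table_fields: &[(&str, i32)],
    incoming_columns: &[&str],
) -> Result<Vec<i32>, String> {
    // Build a name -> field ID index from the table schema.
    let by_name: HashMap<&str, i32> = table_fields.iter().copied().collect();

    // Map each incoming column to its ID, failing on unknown names.
    incoming_columns
        .iter()
        .map(|name| {
            by_name
                .get(name)
                .copied()
                .ok_or_else(|| format!("no field id for column '{name}'"))
        })
        .collect()
}
```

A column that is missing from the table schema surfaces as an error instead of silently receiving a made-up ID.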


          minor

d367a7c

CTTY commented


crates/iceberg/src/spec/manifest/_serde.rs Outdated Show resolved Hide resolved

CTTY added 2 commits

July 15, 2025 18:13


          minor

99af430


          i luv cleaning up

2f9efa8

CTTY force-pushed the ctty/df-insert branch from 7843b0d to 2f9efa8 Compare

July 16, 2025 03:37


          fmt not working?

9d7c1c3

liurenjie1024 reviewed


Contributor

liurenjie1024 left a comment

Thanks @CTTY for this PR, I just finished a round of review. My suggestion is to start with unpartitioned tables first.

crates/integrations/datafusion/src/table/mod.rs Show resolved Hide resolved

crates/integrations/datafusion/src/physical_plan/write.rs Show resolved Hide resolved

crates/integrations/datafusion/src/physical_plan/write.rs Outdated

+                      // Define a schema.
+                      Arc::new(ArrowSchema::new(vec![
+                          Field::new("data_files", DataType::Utf8, false),
+                          Field::new("count", DataType::UInt64, false),

Contributor

liurenjie1024 Jul 16, 2025

What's the meaning of count?

Contributor Author

CTTY Jul 16, 2025

DataFusion expects insert_into to return the number of rows (count) it has written: https://datafusion.apache.org/user-guide/sql/dml.html#insert Here I'm sending the count to the commit node and having the commit node return the number of rows eventually.

Technically we don't need to follow DataFusion's convention on insert_into and can return nothing. Do you think that would be better?

Contributor

liurenjie1024 Jul 17, 2025

I think we should still follow DataFusion's convention. But do we really need this? DataFile has a field called record_count, and I think it's enough for the insert-only case?

Contributor Author

CTTY Jul 17, 2025

Yeah using record_count makes more sense, I'll fix this
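A minimal sketch of the approach agreed on above: derive the inserted row count from each data file's record count, with no separate count column. `DataFileSummary` is a hypothetical stand-in for the real DataFile type; only the `record_count` field name comes from this thread.

```rust
/// Hypothetical stand-in for the metadata of one written data file.
#[derive(Debug, Clone)]
struct DataFileSummary {
    file_path: String,
    record_count: u64,
}

/// Total number of rows written, derived from the per-file record counts.
/// This is the value an INSERT would report back to the caller.
fn total_rows_written(files: &[DataFileSummary]) -> u64 {
    files.iter().map(|file| file.record_count).sum()
}
```

With this shape the write node only has to pass the data files downstream, and the commit node can compute the count itself.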

crates/integrations/datafusion/src/physical_plan/write.rs Outdated Show resolved Hide resolved

crates/iceberg/src/spec/manifest/mod.rs Outdated Show resolved Hide resolved

crates/integrations/datafusion/src/physical_plan/write.rs Outdated Show resolved Hide resolved

crates/integrations/datafusion/tests/integration_datafusion_test.rs

@@ @@ -432,3 +433,69 @@ async fn test_metadata_table() -> Result<()> { @@
                   Ok(())
               }
+              #[tokio::test]
+              async fn test_insert_into() -> Result<()> {

Contributor

liurenjie1024 Jul 16, 2025

I'm not a big fan of adding this kind of integration tests. How about adding sqllogictests?

crates/integrations/datafusion/src/physical_plan/write.rs Outdated Show resolved Hide resolved

CTTY added 6 commits

July 16, 2025 09:34


          Merge branch 'main' into ctty/df-insert

41a75bd


          do not expose serde

e25f888


          cut it down

b554701


          Use stricter wrapper data file wrapper

77b349b


          fix partitioning, and fmt ofc

88afe82


          minor

295e9b6

liurenjie1024 reviewed


crates/integrations/datafusion/src/table/mod.rs Show resolved Hide resolved

crates/integrations/datafusion/src/table/mod.rs Outdated Show resolved Hide resolved

crates/integrations/datafusion/src/physical_plan/write.rs Outdated Show resolved Hide resolved

crates/integrations/datafusion/src/physical_plan/write.rs

+                      PlanProperties::new(
+                          EquivalenceProperties::new(schema),
+                          input.output_partitioning().clone(),
+                          input.pipeline_behavior(),

Contributor

liurenjie1024 Jul 17, 2025

This should be Final?

Contributor Author

CTTY Jul 17, 2025

I was thinking maybe IcebergWriteExec can be used for the streaming case, so the pipeline behavior and boundedness should be the same as the input's. For a normal INSERT INTO query it shouldn't matter either way.

Contributor

liurenjie1024 Jul 18, 2025

I'm not quite familiar with datafusion's streaming mode, but my suggestion is that we should not assume it's executed in streaming for now. We could always change this when we actually add streaming support.

crates/integrations/datafusion/src/physical_plan/write.rs

+                          EquivalenceProperties::new(schema),
+                          input.output_partitioning().clone(),
+                          input.pipeline_behavior(),
+                          input.boundedness(),

Contributor

liurenjie1024 Jul 17, 2025

It should be Bounded.

Contributor

liurenjie1024 Jul 18, 2025

Ditto.

crates/integrations/datafusion/src/physical_plan/write.rs Outdated Show resolved Hide resolved

CTTY added 4 commits

July 17, 2025 14:31


          partitioned shall not pass

92588f5


          implement children and with_new_children for write node, fix fmt

7db9432


          Merge branch 'main' into ctty/df-insert

6bd624c


          get row counts from data files directly

8c78046

liurenjie1024 reviewed


crates/integrations/datafusion/src/physical_plan/write.rs

+                  ) -> DFResult<Arc<dyn ExecutionPlan>> {
+                      if children.len() != 1 {
+                          return Err(DataFusionError::Internal(
+                              "IcebergWriteExec expects exactly one child".to_string(),

Contributor

liurenjie1024 Jul 21, 2025

Suggested change

-                            "IcebergWriteExec expects exactly one child".to_string(),
+                            format!("IcebergWriteExec expects exactly one child, but provided {}", children.len()),
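A minimal sketch of reporting the actual child count in this error. It assumes a plain `String` error in place of DataFusionError, and `expect_single_child` is a hypothetical helper, not code from this PR. A `{}` placeholder is only filled in by `format!`; calling `to_string()` on the literal leaves it as-is.

```rust
/// Hypothetical helper: check that a write node received exactly one child
/// and return it, reporting the actual count otherwise.
fn expect_single_child<T>(mut children: Vec<T>) -> Result<T, String> {
    if children.len() != 1 {
        // format! substitutes the placeholder with the real child count.
        return Err(format!(
            "IcebergWriteExec expects exactly one child, but provided {}",
            children.len()
        ));
    }
    Ok(children.remove(0))
}
```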

crates/integrations/datafusion/src/physical_plan/write.rs

+                      // Create data file writer builder
+                      let data_file_writer_builder = DataFileWriterBuilder::new(
+                          ParquetWriterBuilder::new(

Contributor

liurenjie1024 Jul 21, 2025

This should be RollingFileWriter

crates/integrations/datafusion/src/physical_plan/write.rs

+                  fn make_result_batch(data_files: Vec<String>) -> DFResult<RecordBatch> {
+                      let files_array = Arc::new(StringArray::from(data_files)) as ArrayRef;
+                      RecordBatch::try_from_iter_with_nullable(vec![("data_files", files_array, false)]).map_err(

Contributor

liurenjie1024 Jul 21, 2025

nit: Why not just try_new so that we could reuse the result of make_result_schema?

crates/integrations/datafusion/src/physical_plan/commit.rs

+                                  let batch = batch_result?;
+                                  let files_array = batch
+                                      .column_by_name("data_files")

Contributor

liurenjie1024 Jul 21, 2025

We should define these as constants

crates/integrations/datafusion/src/physical_plan/commit.rs

Comment on lines +252 to +258

+                          // // Apply the action and commit the transaction
+                          // let updated_table = action
+                          //     .apply(tx)
+                          //     .map_err(to_datafusion_error)?
+                          //     .commit(catalog.as_ref())
+                          //     .await
+                          //     .map_err(to_datafusion_error)?;

Contributor

liurenjie1024 Jul 21, 2025

Why is this commented out?

crates/iceberg/src/spec/manifest/mod.rs

+              pub fn serialize_data_file_to_json(
+                  data_file: DataFile,
+                  partition_type: &super::StructType,
+                  is_version_1: bool,

Contributor

liurenjie1024 Jul 21, 2025

We should use TableFormatVersion
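A minimal sketch of why an enum parameter reads better than `is_version_1: bool`. The enum below is a local, hypothetical stand-in named after the type mentioned in this comment, and `data_file_layout` is an illustrative function, not the crate's API.

```rust
/// Hypothetical stand-in for a table format version enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TableFormatVersion {
    V1,
    V2,
}

/// Illustrative function: pick the serialization layout for a data file.
/// The match is exhaustive, so adding a new version later forces every
/// call site like this one to be updated at compile time.
fn data_file_layout(version: TableFormatVersion) -> &'static str {
    match version {
        TableFormatVersion::V1 => "v1 data file layout",
        TableFormatVersion::V2 => "v2 data file layout",
    }
}
```

A call such as `data_file_layout(TableFormatVersion::V2)` is also self-describing at the call site, where a bare `false` is not.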

crates/iceberg/src/arrow/nan_val_cnt_visitor.rs

Comment on lines +162 to +163

		println!("----StructArray from record stream: {:?}", struct_arr);
		println!("----Schema.as_struct from table: {:?}", schema.as_struct());

Contributor

liurenjie1024 Jul 21, 2025

We should use log here.
