Proto commits in apache/datafusion

These are the commits in which the Protocol Buffers files changed (only the last 100 relevant commits are shown):

Commit:4ac9b55
Author:张林伟
Committer:GitHub

Fix `CoalescePartitionsExec` proto serialization (#15824) * add fetch to CoalescePartitionsExecNode * gen proto code * Add test * fix * fix build * Fix test build * remove comments

The documentation is generated from this commit.

Commit:a4d494c
Author:Chen Chongchen
Committer:GitHub

fix: serialize listing table without partition column (#15737) * fix: serialize listing table without partition column * remove unwrap * format * clippy

Commit:7ff6c7e
Author:Matt Butrovich
Committer:GitHub

Add coerce int96 option for Parquet to support different TimeUnits, test int96_from_spark.parquet from parquet-testing (#15537)

Commit:5ab5a03
Author:Andy Grove
Committer:GitHub

Rename protobuf Java package (#15658)

Commit:3269f01
Author:westhide
Committer:GitHub

feat: Support serde for FileScanConfig `batch_size` (#15335)

Commit:722ccb9
Author:westhide
Committer:GitHub

feat: Support serde for JsonSource PhysicalPlan (#15311)

Commit:e221a2c
Author:Chen Chongchen
Committer:GitHub

feat: support customize metadata in alias for dataframe api (#15120) * feat: support customize metadata in alias for dataframe api * update doc * remove clone

Commit:ce14fbc
Author:Andrey Koshchiy
Committer:GitHub

Add `statistics_truncate_length` parquet writer config (#14782) * Add parquet writer config * test fixes --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:a104661
Author:Marko Milenković
Committer:GitHub

feat: add resolved `target` to `DmlStatement` (to eliminate need for table lookup after deserialization) (#14631) * feat: serialize table source to DML proto * Update datafusion/core/src/dataframe/mod.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * remove redundant comment --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:c0e78d2
Author:Sergey Zhukov
Committer:GitHub

Remove use of deprecated dict_id in datafusion-proto (#14173) (#14227) * Remove use of deprecated dict_id in datafusion-proto (#14173) * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * remove accidental file * undo deletion of test in copy.slt * Fix issues causing GitHub checks to fail --------- Co-authored-by: Sergey Zhukov <szhukov@aligntech.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:168fe49
Author:Dmitrii Blaginin
Committer:GitHub

Serialize `parquet_options` in `datafusion-proto` (#14465) * Serialize `parquet_options` * Fix format

Commit:f8063e8
Author:Nicholas Gates
Committer:GitHub

Add `ColumnStatistics::Sum` (#14074) * Add sum statistic * Add sum statistic * Add sum statistic * Add sum statistic * Add sum statistic * Add sum statistic * Add tests and Cargo fmt * fix up --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:25f02a7
Author:Tobias Schwarzinger
Committer:GitHub

Update Logical Types Branch (#14241) * Handle alias when parsing sql(parse_sql_expr) (#12939) * fix: Fix parse_sql_expr not handling alias * cargo fmt * fix parse_sql_expr example(remove alias) * add testing * add SUM udaf to TestContextProvider and modify test_sql_to_expr_with_alias for function * revert change on example `parse_sql_expr` * Improve documentation for TableProvider (#13724) * Reveal implementing type and return type in simple UDF implementations (#13730) Debug trait is useful for understanding what something is and how it's configured, especially if the implementation is behind dyn trait. * minor: Extract tests for `EXTRACT` AND `date_part` to their own file (#13731) * Support unparsing `UNNEST` plan to `UNNEST` table factor SQL (#13660) * add `unnest_as_table_factor` and `UnnestRelationBuilder` * unparse unnest as table factor * fix typo * add tests for the default configs * add a static const for unnest_placeholder * fix tests * fix tests * Update to apache-avro 0.17, fix compatibility changes schema handling (#13727) * Update apache-avro requirement from 0.16 to 0.17 --- updated-dependencies: - dependency-name: apache-avro dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> * Fix compatibility changes schema handling apache-avro 0.17 - Handle ArraySchema struct - Handle MapSchema struct - Map BigDecimal => LargeBinary - Map TimestampNanos => Timestamp(TimeUnit::Nanosecond, None) - Map LocalTimestampNanos => todo!() - Add Default to FixedSchema test * Update Cargo.lock file for apache-avro 0.17 --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Marc Droogh <marc.droogh@imc.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Minor: Add doc example to RecordBatchStreamAdapter (#13725) * Minor: Add doc example to RecordBatchStreamAdapter * Update datafusion/physical-plan/src/stream.rs Co-authored-by: Berkay Şahin <124376117+berkaysynnada@users.noreply.github.com> --------- Co-authored-by: Berkay Şahin <124376117+berkaysynnada@users.noreply.github.com> * Implement GroupsAccumulator for corr(x,y) aggregate function (#13581) * Implement GroupsAccumulator for corr(x,y) * feedbacks * fix CI MSRV * review * avoid collect in accumulation * add back cast * fix union serialisation order in proto (#13709) * fix union serialisation order in proto * clippy * address comments * Minor: make unsupported `nanosecond` part a real (not internal) error (#13733) * Minor: make unsupported `nanosecond` part a real (not internal) error * fmt * Improve wording to refer to date part * Add tests for date_part on columns + timestamps with / without timezones (#13732) * Add tests for date_part on columns + timestamps with / without timezones * Add tests from https://github.com/apache/datafusion/pull/13372 * remove trailing whitespace * Optimize performance of `initcap` function (~2x faster) (#13691) * Optimize performance of initcap (~2x faster) Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> * format --------- Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> * Minor: Add documentation 
explaining that initcap oly works for ASCII (#13749) * Support sqllogictest --complete with postgres (#13746) Before the change, the request to use PostgreSQL was simply ignored when `--complete` flag was present. * doc-gen: migrate window functions documentation to attribute based (#13739) * doc-gen: migrate window functions documentation Signed-off-by: zjregee <zjregee@gmail.com> * fix: update Cargo.lock --------- Signed-off-by: zjregee <zjregee@gmail.com> * Minor: Remove memory reservation in `JoinLeftData` used in HashJoin (#13751) * Refactor JoinLeftData structure by removing unused memory reservation field in hash join implementation * Add Debug and Clone derives for HashJoinStreamState and ProcessProbeBatchState enums This commit enhances the HashJoinStreamState and ProcessProbeBatchState structures by implementing the Debug and Clone traits, allowing for easier debugging and cloning of these state representations in the hash join implementation. * Update to bigdecimal 0.4.7 (#13747) * Add big decimal formatting test cases with potential trailing zeros * Rename and simplify decimal rendering functions - add `decimal` to function name - drop `precision` parameter as it is not supposed to affect the result * Update to bigdecimal 0.4.7 Utilize new `to_plain_string` function * chore: clean up dependencies (#13728) * CI: Warn on unused crates * CI: Warn on unused crates * CI: Warn on unused crates * CI: Warn on unused crates * CI: Clean up dependencies * CI: Clean up dependencies * fix: Implicitly plan `UNNEST` as lateral (#13695) * plan implicit lateral if table factor is UNNEST * check for outer references in `create_relation_subquery` * add sqllogictest * fix lateral constant test to not expect a subquery node * replace sqllogictest in favor of logical plan test * update lateral join sqllogictests * add sqllogictests * fix logical plan test * Minor: improve the Deprecation / API health guidelines (#13701) * Minor: improve the Deprecation / API health policy * 
prettier * Update docs/source/library-user-guide/api-health.md Co-authored-by: Jonah Gao <jonahgao@msn.com> * Add version guidance and make more copy/paste friendly * prettier * better * rename to guidelines --------- Co-authored-by: Jonah Gao <jonahgao@msn.com> * fix: specify roottype in substrait fieldreference (#13647) * fix: specify roottype in fieldreference Signed-off-by: MBWhite <whitemat@uk.ibm.com> * Fix formatting Signed-off-by: MBWhite <whitemat@uk.ibm.com> * review suggestion Signed-off-by: MBWhite <whitemat@uk.ibm.com> --------- Signed-off-by: MBWhite <whitemat@uk.ibm.com> * Simplify type signatures using `TypeSignatureClass` for mixed type function signature (#13372) * add type sig class Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * timestamp Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * date part Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * taplo format Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * tpch test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * msrc issue Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * msrc issue Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * explicit hash Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * Enhance type coercion and function signatures - Added logic to prevent unnecessary casting of string types in `native.rs`. - Introduced `Comparable` variant in `TypeSignature` to define coercion rules for comparisons. - Updated imports in `functions.rs` and `signature.rs` for better organization. - Modified `date_part.rs` to improve handling of timestamp extraction and fixed query tests in `expr.slt`. - Added `datafusion-macros` dependency in `Cargo.toml` and `Cargo.lock`. These changes improve type handling and ensure more accurate function behavior in SQL expressions. 
* fix comment Signed-off-by: Jay Zhan <jayzhan211@gmail.com> * fix signature Signed-off-by: Jay Zhan <jayzhan211@gmail.com> * fix test Signed-off-by: Jay Zhan <jayzhan211@gmail.com> * Enhance type coercion for timestamps to allow implicit casting from strings. Update SQL logic tests to reflect changes in timestamp handling, including expected outputs for queries involving nanoseconds and seconds. * Refactor type coercion logic for timestamps to improve readability and maintainability. Update the `TypeSignatureClass` documentation to clarify its purpose in function signatures, particularly regarding coercible types. This change enhances the handling of implicit casting from strings to timestamps. * Fix SQL logic tests to correct query error handling for timestamp functions. Updated expected outputs for `date_part` and `extract` functions to reflect proper behavior with nanoseconds and seconds. This change improves the accuracy of test cases in the `expr.slt` file. * Enhance timestamp handling in TypeSignature to support timezone specification. Updated the logic to include an additional DataType for timestamps with a timezone wildcard, improving flexibility in timestamp operations. 
* Refactor date_part function: remove redundant imports and add missing not_impl_err import for better error handling --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com> Signed-off-by: Jay Zhan <jayzhan211@gmail.com> * Minor: Add some more blog posts to the readings page (#13761) * Minor: Add some more blog posts to the readings page * prettier * prettier * Update docs/source/user-guide/concepts-readings-events.md --------- Co-authored-by: Oleks V <comphead@users.noreply.github.com> * docs: update GroupsAccumulator instead of GroupAccumulator (#13787) Fixing `GroupsAccumulator` trait name in its docs * Improve Deprecation Guidelines more (#13776) * Improve deprecation guidelines more * prettier * fix: add `null_buffer` length check to `StringArrayBuilder`/`LargeStringArrayBuilder` (#13758) * fix: add `null_buffer` check for `LargeStringArray` Add a safety check to ensure that the alignment of buffers cannot be overflowed. This introduces a panic if they are not aligned through a runtime assertion. * fix: remove value_buffer assertion These buffers can be misaligned and it is not problematic, it is the `null_buffer` which we care about being of the same length. * feat: add `null_buffer` check to `StringArray` This is in a similar vein to `LargeStringArray`, as the code is the same, except for `i32`'s instead of `i64`. * feat: use `row_count` var to avoid drift * Revert the removal of reservation in HashJoin (#13792) * fix: restore memory reservation in JoinLeftData for accurate memory accounting in HashJoin This commit reintroduces the `_reservation` field in the `JoinLeftData` structure to ensure proper tracking of memory resources during join operations. The absence of this field could lead to inconsistent memory usage reporting and potential out-of-memory issues as upstream operators increase their memory consumption. 
* fmt Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> --------- Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * added count aggregate slt (#13790) * Update documentation guidelines for contribution content (#13703) * Update documentation guidelines for contribution content * Apply suggestions from code review Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com> Co-authored-by: Oleks V <comphead@users.noreply.github.com> * clarify discussions and remove requirements note * prettier * Update docs/source/contributor-guide/index.md Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com> --------- Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com> Co-authored-by: Oleks V <comphead@users.noreply.github.com> * Add Round trip tests for Array <--> ScalarValue (#13777) * Add Round trip tests for Array <--> ScalarValue * String dictionary test * remove unecessary value * Improve comments * fix: Limit together with pushdown_filters (#13788) * fix: Limit together with pushdown_filters * Fix format * Address new comments * Fix testing case to hit the problem * Minor: improve Analyzer docs (#13798) * Minor: cargo update in datafusion-cli (#13801) * Update datafusion-cli toml to pin home=0.5.9 * update Cargo.lock * Fix `ScalarValue::to_array_of_size` for DenseUnion (#13797) * fix: enable pruning by bloom filters for dictionary columns (#13768) * Handle empty rows for `array_distinct` (#13810) * handle empty array distinct * ignore * fix --------- Co-authored-by: Cyprien Huet <chuet@palantir.com> * Fix get_type for higher-order array functions (#13756) * Fix get_type for higher-order array functions * Fix recursive flatten The fix is covered by recursive flatten test case in array.slt * Restore "keep LargeList" in Array signature * clarify naming in the test * Chore: Do not return empty record batches from streams (#13794) * do not emit empty record batches in plans * change function signatures to Option<RecordBatch> if empty batches are possible * format code * 
shorten code * change list_unnest_at_level for returning Option value * add documentation take concat_batches into compute_aggregates function again * create unit test for row_hash.rs * add test for unnest * add test for unnest * add test for partial sort * add test for bounded window agg * add test for window agg * apply simplifications and fix typo * apply simplifications and fix typo * Handle possible overflows in StringArrayBuilder / LargeStringArrayBuilder (#13802) * test(13796): reproducer of overflow on capacity * fix(13796): handle overflows with proper max capacity number which is valid for MutableBuffer * refactor: use simple solution and provide panic * fix: Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema (#13750) * fix: Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema * clippy * fix csv and json tests * add testing for parquet * cleanup * fix parquet tests * document describe_partition, add back repartition options to one of the csv empty files tests * Support Null regex override in csv parser options. 
(#13228) Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Minor: Extend ScalarValue::new_zero() (#13828) * Update mod.rs * Update mod.rs * Update mod.rs * Update mod.rs * chore: temporarily disable windows flow (#13833) * feat: `parse_float_as_decimal` supports scientific notation and Decimal256 (#13806) * feat: `parse_float_as_decimal` supports scientific notation and Decimal256 * Fix test * Add test * Add test * Refine negative scales * Update comment * Refine bigint_to_i256 * UT for bigint_to_i256 * Add ut for parse_decimal * Replace `BooleanArray::extend` with `append_n` (#13832) * Rename `TypeSignature::NullAry` --> `TypeSignature::Nullary` and improve comments (#13817) * Rename `TypeSignature::NullAry` --> `TypeSignature::Nullary` and improve comments * Apply suggestions from code review Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com> * improve docs --------- Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com> * [bugfix] ScalarFunctionExpr does not preserve the nullable flag on roundtrip (#13830) * [test] coalesce round trip schema mismatch * [proto] added the nullable flag in PhysicalScalarUdfNode * [bugfix] propagate the nullable flag for serialized scalar UDFS * Add example of interacting with a remote catalog (#13722) * Add example of interacting with a remote catalog * Update datafusion/core/src/execution/session_state.rs Co-authored-by: Berkay Şahin <124376117+berkaysynnada@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Jonah Gao <jonahgao@msn.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> * Use HashMap to hold tables --------- Co-authored-by: Berkay Şahin <124376117+berkaysynnada@users.noreply.github.com> Co-authored-by: Jonah Gao <jonahgao@msn.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> * Update substrait requirement from 0.49 to 0.50 (#13808) * Update substrait requirement from 0.49 to 0.50 Updates the requirements on 
[substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version. - [Release notes](https://github.com/substrait-io/substrait-rs/releases) - [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.49.0...v0.50.0) --- updated-dependencies: - dependency-name: substrait dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * Fix compilation * Add expr test --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <jonahgao@msn.com> * typo: remove extraneous "`" in doc comment, fix header (#13848) * typo: extraneous "`" in doc comment * Update datafusion/execution/src/runtime_env.rs * Update datafusion/execution/src/runtime_env.rs --------- Co-authored-by: Oleks V <comphead@users.noreply.github.com> * typo: remove extra "`" interfering with doc formatting (#13847) * Support n-ary monotonic functions in ordering equivalence (#13841) * Support n-ary monotonic functions in `discover_new_orderings` * Add tests for n-ary monotonic functions in `discover_new_orderings` * Fix tests * Fix non-monotonic test case * Fix unintended simplification * Minor comment changes * Fix tests * Add `preserves_lex_ordering` field * Use `preserves_lex_ordering` on `discover_new_orderings()` * Add `output_ordering` and `output_preserves_lex_ordering` implementations for `ConcatFunc` * Update tests * Move logic to UDF * Cargo fmt * Refactor * Cargo fmt * Simply use false value on default implementation * Remove unnecessary import * Clippy fix * Update Cargo.lock * Move dep to dev-dependencies * Rename output_preserves_lex_ordering to preserves_lex_ordering * minor --------- Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai> * Replace `execution_mode` with `emission_type` and `boundedness` (#13823) * feat: update execution modes 
and add bitflags dependency - Introduced `Incremental` execution mode alongside existing modes in the DataFusion execution plan. - Updated various execution plans to utilize the new `Incremental` mode where applicable, enhancing streaming capabilities. - Added `bitflags` dependency to `Cargo.toml` for better management of execution modes. - Adjusted execution mode handling in multiple files to ensure compatibility with the new structure. * add exec API Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * replace done but has stackoverflow Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * exec API done Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * Refactor execution plan properties to remove execution mode - Removed the `ExecutionMode` parameter from `PlanProperties` across multiple physical plan implementations. - Updated related functions to utilize the new structure, ensuring compatibility with the changes. - Adjusted comments and cleaned up imports to reflect the removal of execution mode handling. This refactor simplifies the execution plan properties and enhances maintainability. * Refactor execution plan to remove `ExecutionMode` and introduce `EmissionType` - Removed the `ExecutionMode` parameter from `PlanProperties` and related implementations across multiple files. - Introduced `EmissionType` to better represent the output characteristics of execution plans. - Updated functions and tests to reflect the new structure, ensuring compatibility and enhancing maintainability. - Cleaned up imports and adjusted comments accordingly. This refactor simplifies the execution plan properties and improves the clarity of memory handling in execution plans. * fix test Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * Refactor join handling and emission type logic - Updated test cases in `sanity_checker.rs` to reflect changes in expected outcomes for bounded and unbounded joins, ensuring accurate test coverage. 
- Simplified the `is_pipeline_breaking` method in `execution_plan.rs` to clarify the conditions under which a plan is considered pipeline-breaking. - Enhanced the emission type determination logic in `execution_plan.rs` to prioritize `Final` over `Both` and `Incremental`, improving clarity in execution plan behavior. - Adjusted join type handling in `hash_join.rs` to classify `Right` joins as `Incremental`, allowing for immediate row emission. These changes improve the accuracy of tests and the clarity of execution plan properties. * Implement emission type for execution plans - Updated multiple execution plan implementations to replace `unimplemented!()` with `EmissionType::Incremental`, ensuring that the emission type is correctly defined for various plans. - This change enhances the clarity and functionality of the execution plans by explicitly specifying their emission behavior. These updates contribute to a more robust execution plan framework within the DataFusion project. * Enhance join type documentation and refine emission type logic - Updated the `JoinType` enum in `join_type.rs` to include detailed descriptions for each join type, improving clarity on their behavior and expected results. - Modified the emission type logic in `hash_join.rs` to ensure that `Right` and `RightAnti` joins are classified as `Incremental`, allowing for immediate row emission when applicable. These changes improve the documentation and functionality of join operations within the DataFusion project. * Refactor emission type logic in join and sort execution plans - Updated the emission type determination in `SortMergeJoinExec` and `SymmetricHashJoinExec` to utilize the `emission_type_from_children` function, enhancing the accuracy of emission behavior based on input characteristics. - Clarified comments in `sort.rs` regarding the conditions under which results are emitted, emphasizing the relationship between input sorting and emission type. 
- These changes improve the clarity and functionality of the execution plans within the DataFusion project, ensuring more robust handling of emission types. * Refactor emission type handling in execution plans - Updated the `emission_type_from_children` function to accept an iterator instead of a slice, enhancing flexibility in how child execution plans are passed. - Modified the `SymmetricHashJoinExec` implementation to utilize the new function signature, improving code clarity and maintainability. These changes streamline the emission type determination process within the DataFusion project, contributing to a more robust execution plan framework. * Enhance execution plan properties with boundedness and emission type - Introduced `boundedness` and `pipeline_behavior` methods to the `ExecutionPlanProperties` trait, improving the handling of execution plan characteristics. - Updated the `CsvExec`, `SortExec`, and related implementations to utilize the new methods for determining boundedness and emission behavior. - Refactored the `ensure_distribution` function to use the new boundedness logic, enhancing clarity in distribution decisions. - These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project. * Refactor execution plans to enhance boundedness and emission type handling - Updated multiple execution plan implementations to incorporate `Boundedness` and `EmissionType`, improving the clarity and functionality of execution plans. - Replaced instances of `unimplemented!()` with appropriate emission types, ensuring that plans correctly define their output behavior. - Refactored the `PlanProperties` structure to utilize the new boundedness logic, enhancing decision-making in execution plans. - These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project. 
* Refactor memory handling in execution plans - Updated the condition for checking memory requirements in execution plans from `has_finite_memory()` to `boundedness().requires_finite_memory()`, improving clarity in memory management. - This change enhances the robustness of execution plans within the DataFusion project by ensuring more accurate assessments of memory constraints. * Refactor boundedness checks in execution plans - Updated conditions for checking boundedness in various execution plans to use `is_unbounded()` instead of `requires_finite_memory()`, enhancing clarity in memory management. - Adjusted the `PlanProperties` structure to reflect these changes, ensuring more accurate assessments of memory constraints across the DataFusion project. - These modifications contribute to a more robust and maintainable execution plan framework, improving the handling of boundedness in execution strategies. * Remove TODO comment regarding unbounded execution plans in `UnboundedExec` implementation - Eliminated the outdated comment suggesting a switch to unbounded execution with finite memory, streamlining the code and improving clarity. - This change contributes to a cleaner and more maintainable codebase within the DataFusion project. * Refactor execution plan boundedness and emission type handling - Updated the `is_pipeline_breaking` method to use `requires_finite_memory()` for improved clarity in determining pipeline behavior. - Enhanced the `Boundedness` enum to include detailed documentation on memory requirements for unbounded streams. - Refactored `compute_properties` methods in `GlobalLimitExec` and `LocalLimitExec` to directly use the input's boundedness, simplifying the logic. - Adjusted emission type determination in `NestedLoopJoinExec` to utilize the `emission_type_from_children` function, ensuring accurate output behavior based on input characteristics. 
These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project, improving clarity and functionality in handling boundedness and emission types. * Refactor emission type and boundedness handling in execution plans - Removed the `OptionalEmissionType` struct from `plan_properties.rs`, simplifying the codebase. - Updated the `is_pipeline_breaking` function in `execution_plan.rs` for improved readability by formatting the condition across multiple lines. - Adjusted the `GlobalLimitExec` implementation in `limit.rs` to directly use the input's boundedness, enhancing clarity in memory management. These changes contribute to a more streamlined and maintainable execution plan framework within the DataFusion project, improving the handling of emission types and boundedness. * Refactor GlobalLimitExec and LocalLimitExec to enhance boundedness handling - Updated the `compute_properties` methods in both `GlobalLimitExec` and `LocalLimitExec` to replace `EmissionType::Final` with `Boundedness::Bounded`, reflecting that limit operations always produce a finite number of rows. - Changed the input's boundedness reference to `pipeline_behavior()` for improved clarity in execution plan properties. These changes contribute to a more streamlined and maintainable execution plan framework within the DataFusion project, enhancing the handling of boundedness in limit operations. * Review Part1 * Update sanity_checker.rs * addressing reviews * Review Part 1 * Update datafusion/physical-plan/src/execution_plan.rs * Update datafusion/physical-plan/src/execution_plan.rs * Shorten imports * Enhance documentation for JoinType and Boundedness enums - Improved descriptions for the Inner and Full join types in join_type.rs to clarify their behavior and examples. - Added explanations regarding the boundedness of output streams and memory requirements in execution_plan.rs, including specific examples for operators like Median and Min/Max. 
--------- Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai> Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com> * Preserve ordering equivalencies on `with_reorder` (#13770) * Preserve ordering equivalencies on `with_reorder` * Add assertions * Return early if filtered_exprs is empty * Add clarify comment * Refactor * Add comprehensive test case * Add comment for exprs_equal * Cargo fmt * Clippy fix * Update properties.rs * Update exprs_equal and add tests * Update properties.rs --------- Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai> * replace CASE expressions in predicate pruning with boolean algebra (#13795) * replace CASE expressions in predicate pruning with boolean algebra * fix merge * update tests * add some more tests * add some more tests * remove duplicate test case * Update datafusion/physical-optimizer/src/pruning.rs * swap NOT for != * replace comments, update docstrings * fix example * update tests * update tests * Apply suggestions from code review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update pruning.rs Co-authored-by: Chunchun Ye <14298407+appletreeisyellow@users.noreply.github.com> * Update pruning.rs Co-authored-by: Chunchun Ye <14298407+appletreeisyellow@users.noreply.github.com> --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Chunchun Ye <14298407+appletreeisyellow@users.noreply.github.com> * enable DF's nested_expressions feature by in datafusion-substrait tests to make them pass (#13857) fixes #13854 Co-authored-by: Arttu Voutilainen <avo@iki.fi> * Add configurable normalization for configuration options and preserve case for S3 paths (#13576) * Do not normalize values * Fix tests & update docs * Prettier * Lowercase config params * Unify transform and parse * Fix tests * Rename `default_transform` and relax boundaries * Make `compression` case-insensitive * Comment to new line * Deprecate and ignore 
`enable_options_value_normalization` * Update datafusion/common/src/config.rs * fix typo --------- Co-authored-by: Oleks V <comphead@users.noreply.github.com> * Improve`Signature` and `comparison_coercion` documentation (#13840) * Improve Signature documentation more * Apply suggestions from code review Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com> --------- Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com> * feat: support normalized expr in CSE (#13315) * feat: support normalized expr in CSE * feat: support normalize_eq in cse optimization * feat: support cumulative binary expr result in normalize_eq --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Upgrade to sqlparser `0.53.0` (#13767) * chore: Udpate to sqlparser 0.53.0 * Update for new sqlparser API * more api updates * Avoid serializing query to SQL string unless it is necessary * Box wildcard options * chore: update datafusion-cli Cargo.lock * Minor: Use `resize` instead of `extend` for adding static values in SortMergeJoin logic (#13861) Thanks @Dandandan * feat(function): add `least` function (#13786) * start adding least fn * feat(function): add least function * update function name * fix scalar smaller function * add tests * run Clippy and Fmt * Generated docs using `./dev/update_function_docs.sh` * add comment why `descending: false` * update comment * Update least.rs Co-authored-by: Bruce Ritchie <bruce.ritchie@veeva.com> * Update scalar_functions.md * run ./dev/update_function_docs.sh to update docs * merge greatest and least implementation to one * add header --------- Co-authored-by: Bruce Ritchie <bruce.ritchie@veeva.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Improve SortPreservingMerge::enable_round_robin_repartition docs (#13826) * Clarify SortPreservingMerge::enable_round_robin_repartition docs * tweaks * Improve comments more * clippy * fix doc link * Minor: Unify `downcast_arg` method (#13865) * Implement `SHOW FUNCTIONS` (#13799) * 
introduce rid for different signature * implement show functions syntax * add syntax example * avoid duplicate join * fix clippy * show function_type instead of routine_type * add some doc and comments * Update bzip2 requirement from 0.4.3 to 0.5.0 (#13740) * Update bzip2 requirement from 0.4.3 to 0.5.0 Updates the requirements on [bzip2](https://github.com/trifectatechfoundation/bzip2-rs) to permit the latest version. - [Release notes](https://github.com/trifectatechfoundation/bzip2-rs/releases) - [Commits](https://github.com/trifectatechfoundation/bzip2-rs/compare/0.4.4...v0.5.0) --- updated-dependencies: - dependency-name: bzip2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * Fix test * Fix CLI cargo.lock --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <jonahgao@msn.com> * Fix build (#13869) * feat(substrait): modular substrait consumer (#13803) * feat(substrait): modular substrait consumer * feat(substrait): include Extension Rel handlers in default consumer Include SerializerRegistry based handlers for Extension Relations in the DefaultSubstraitConsumer * refactor(substrait) _selection -> _field_reference * refactor(substrait): remove SubstraitPlannerState usage from consumer * refactor: get_state() -> get_function_registry() * docs: elide imports from example * test: simplify test * refactor: remove Arc from DefaultSubstraitConsumer * doc: add ticket for API improvements * doc: link DefaultSubstraitConsumer to from_subtrait_plan * refactor: remove redundant Extensions parsing * Minor: fix: Include FetchRel when producing LogicalPlan from Sort (#13862) * include FetchRel when producing LogicalPlan from Sort * add suggested test * address review feedback * Minor: improve error message when ARRAY literals can not be planned (#13859) * Minor: improve error message when ARRAY literals can not be 
planned * fmt * Update datafusion/sql/src/expr/value.rs Co-authored-by: Oleks V <comphead@users.noreply.github.com> --------- Co-authored-by: Oleks V <comphead@users.noreply.github.com> * Add documentation for `SHOW FUNCTIONS` (#13868) * Support unicode character for `initcap` function (#13752) * Support unicode character for 'initcap' function Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> * Update unit tests * Fix clippy warning * Update sqllogictests - initcap * Update scalar_functions.md docs * Add suggestions change Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> --------- Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> * [minor] make recursive package dependency optional (#13778) * make recursive optional * add to default for common package * cargo update * added to readme * make test conditional * reviews * cargo update --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Minor: remove unused async-compression `futures-io` feature (#13875) * Minor: remove unused async-compression feature * Fix cli cargo lock * Consolidate Example: dataframe_output.rs into dataframe.rs (#13877) * Restore `DocBuilder::new()` to avoid breaking API change (#13870) * Fix build * Restore DocBuilder::new(), deprecate * cmt * clippy * Improve error messages for incorrect zero argument signatures (#13881) * Improve error messages for incorrect zero argument signatures * fix errors * fix fmt * Consolidate Example: simplify_udwf_expression.rs into advanced_udwf.rs (#13883) * minor: fix typos in comments / structure names (#13879) * minor: fix typo error in datafusion * fix: fix rebase error * fix: format HashJoinExec doc * doc: recover thiserror/preemptively * fix: other typo error fixed * fix: directories to dir_entries in catalog example * Support 1 or 3 arg in generate_series() UDTF (#13856) * Support 1 or 3 args in generate_series() UDTF * address comment * Support (order by / sort) for DataFrameWriteOptions (#13874) * Support (order by / sort) for 
DataFrameWriteOptions * Fix fmt * Fix import * Add insert into example * Update sort_merge_join.rs (#13894) * Update join_selection.rs (#13893) * Fix `recursive-protection` feature flag (#13887) * Fix recursive-protection feature flag * rename feature flag to be consistent * Make default * taplo format * Fix visibility of swap_hash_join (#13899) * Minor: Avoid emitting empty batches in partial sort (#13895) * Update partial_sort.rs * Update partial_sort.rs * Update partial_sort.rs * Prepare for 44.0.0 release: version and changelog (#13882) * Prepare for 44.0.0 release: version and changelog * update changelog * update configs * update before release * Support unparsing implicit lateral `UNNEST` plan to SQL text (#13824) * support unparsing the implicit lateral unnest plan * cargo clippy and fmt * refactor for `check_unnest_placeholder_with_outer_ref` * add const for the prefix string of unnest and outer refernece column * fix case_column_or_null with nullable when conditions (#13886) * fix case_column_or_null with nullable when conditions * improve sqllogictests for case_column_or_null --------- Co-authored-by: zhangli20 <zhangli20@kuaishou.com> * Fixed Issue #13896 (#13903) The URL to the external website was returning a 404. Presuming recent changes in the external website's structure, the required data has been moved to a different URL. The commit ensures the new URL is used. 
* Introduce `UserDefinedLogicalNodeUnparser` for User-defined Logical Plan unparsing (#13880) * make ast builder public * introduce udlp unparser * add documents * add examples * add negative tests and fmt * fix the doc * rename udlp to extension * apply the first unparsing result only * improve the doc * seperate the enum for the unparsing result * fix the doc --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Preserve constant values across union operations (#13805) * Add value tracking to ConstExpr for improved union optimization * Update PartialEq impl * Minor change * Add docstring for ConstExpr value * Improve constant propagation across union partitions * Add assertion for across_partitions * fix fmt * Update properties.rs * Remove redundant constant removal loop * Remove unnecessary mut * Set across_partitions=true when both sides are constant * Extract and use constant values in filter expressions * Add initial SLT for constant value tracking across UNION ALL * Assign values to ConstExpr where possible * Revert "Set across_partitions=true when both sides are constant" This reverts commit 3051cd470b0ad4a70cd8bd3518813f5ce0b3a449. * Temporarily take value from literal * Lint fixes * Cargo fmt * Add get_expr_constant_value * Make `with_value()` accept optional value * Add todo * Move test to union.slt * Fix changed slt after merge * Simplify constexpr * Update properties.rs --------- Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai> * chore(deps): update sqllogictest requirement from 0.23.0 to 0.24.0 (#13902) * fix RecordBatch size in topK (#13906) * ci improvements, update protoc (#13876) * Fix md5 return_type to only return Utf8 as per current code impl. * ci improvements * Lock taiki-e/install-action to a githash for apache action policy - Release 2.46.19 in the case of this hash. * Lock taiki-e/install-action to a githash for apache action policy - Release 2.46.19 in the case of this hash. 
* Revert nextest change until action is approved. * Exclude requires workspace * Fixing minor typo to verify ci caching of builds is working as expected. * Updates from PR review. * Adding issue link for disabling intel mac build * improve performance of running examples * remove cargo check * Introduce LogicalPlan invariants, begin automatically checking them (#13651) * minor(13525): perform LP validation before and after each possible mutation * minor(13525): validate unique field names on query and subquery schemas, after each optimizer pass * minor(13525): validate union after each optimizer passes * refactor: make explicit what is an invariant of the logical plan, versus assertions made after a given analyzer or optimizer pass * chore: add link to invariant docs * fix: add new invariants module * refactor: move all LP invariant checking into LP, delineate executable (valid semantic plan) vs basic LP invariants * test: update test for slight error message change * fix: push_down_filter optimization pass can push a IN(<subquery>) into a TableScan's filter clause * refactor: move collect_subquery_cols() to common utils crate * refactor: clarify the purpose of assert_valid_optimization(), runs after all optimizer passes, except in debug mode it runs after each pass. 
* refactor: based upon performance tests, run the maximum number of checks without impact: * assert_valid_optimization can run each optimizer pass * remove the recursive cehck_fields, which caused the performance regression * the full LP Invariants::Executable can only run in debug * chore: update error naming and terminology used in code comments * refactor: use proper error methods * chore: more cleanup of error messages * chore: handle option trailer to error message * test: update sqllogictests tests to not use multiline * Correct return type for initcap scalar function with utf8view (#13909) * Set utf8view as return type when input type is the same * Verify that the returned type from call to scalar function matches the return type specified in the return_type function * Match return type to utf8view * Consolidate example: simplify_udaf_expression.rs into advanced_udaf.rs (#13905) * Implement maintains_input_order for AggregateExec (#13897) * Implement maintains_input_order for AggregateExec * Update mod.rs * Improve comments --------- Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai> Co-authored-by: mertak-synnada <mertak67+synaada@gmail.com> Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com> * Move join type input swapping to pub methods on Joins (#13910) * doc-gen: migrate scalar functions (string) documentation 3/4 (#13926) Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917) * Update sqllogictest requirement from 0.24.0 to 0.25.0 Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version. 
- [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases) - [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.25.0) --- updated-dependencies: - dependency-name: sqllogictest dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * Remove labels --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <jonahgao@msn.com> * Consolidate Examples: memtable.rs and parquet_multiple_files.rs (#13913) * doc-gen: migrate scalar functions (crypto) documentation (#13918) * doc-gen: migrate scalar functions (crypto) documentation * doc-gen: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (datetime) documentation 1/2 (#13920) * doc-gen: migrate scalar functions (datetime) documentation 1/2 * fix: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * fix RecordBatch size in hash join (#13916) * doc-gen: migrate scalar functions (array) documentation 1/3 (#13928) * doc-gen: migrate scalar functions (array) documentation 1/3 * fix: remove unsed import, fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (math) documentation 1/2 (#13922) * doc-gen: migrate scalar functions (math) documentation 1/2 * fix: fix typo --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (math) documentation 2/2 (#13923) * doc-gen: migrate scalar functions (math) documentation 2/2 * fix: fix typo --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (array) documentation 3/3 (#13930) * doc-gen: migrate scalar functions (array) 
documentation 3/3 * fix: import doc and macro, fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (array) documentation 2/3 (#13929) * doc-gen: migrate scalar functions (array) documentation 2/3 * fix: import doc and macro, fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * doc-gen: migrate scalar functions (string) documentation 4/4 (#13927) * doc-gen: migrate scalar functions (string) documentation 4/4 * fix: fix typo and update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Support explain query when running dfbench with clickbench (#13942) * Support explain query when running dfbench * Address comments * Consolidate example to_date.rs into dateframe.rs (#13939) * Consolidate example to_date.rs into dateframe.rs * Assert results using assert_batches_eq * clippy * Revert "Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)" (#13945) * Revert "Update sqllogictest requirement from 0.24.0 to 0.25.0 (#13917)" This reverts commit 0989649214a6fe69ffb33ed38c42a8d3df94d6bf. 
* add comment * Implement predicate pruning for `like` expressions (prefix matching) (#12978) * Implement predicate pruning for like expressions * add function docstring * re-order bounds calculations * fmt * add fuzz tests * fix clippy * Update datafusion/core/tests/fuzz_cases/pruning.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * doc-gen: migrate scalar functions (string) documentation 1/4 (#13924) Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * consolidate dataframe_subquery.rs into dataframe.rs (#13950) * migrate btrim to user_doc macro (#13952) * doc-gen: migrate scalar functions (datetime) documentation 2/2 (#13921) * doc-gen: migrate scalar functions (datetime) documentation 2/2 * fix: fix typo and update function docs * doc: update function docs * doc-gen: remove slash --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Add sqlite test files, progress bar, and automatic postgres container management into sqllogictests (#13936) * Fix md5 return_type to only return Utf8 as per current code impl. * Add support for sqlite test files to sqllogictest * Force version 0.24.0 of sqllogictest dependency until issue with labels is fixed. * Removed workaround for bug that was fixed. * Git submodule update ... err update, link to sqlite tests. 
* Git submodule update * Readd submodule --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Supporting writing schema metadata when writing Parquet in parallel (#13866) * refactor: make ParquetSink tests a bit more readable * chore(11770): add new ParquetOptions.skip_arrow_metadata * test(11770): demonstrate that the single threaded ParquetSink is already writing the arrow schema in the kv_meta, and allow disablement * refactor(11770): replace with new method, since the kv_metadata is inherent to TableParquetOptions and therefore we should explicitly make the API apparant that you have to include the arrow schema or not * fix(11770): fix parallel ParquetSink to encode arrow schema into the file metadata, based on the ParquetOptions * refactor(11770): provide deprecation warning for TryFrom * test(11770): update tests with new default to include arrow schema * refactor: including partitioning of arrow schema inserted into kv_metdata * test: update tests for new config prop, as well as the new file partition offsets based upon larger metadata * chore: avoid cloning in tests, and update code docs * refactor: return to the WriterPropertiesBuilder::TryFrom<TableParquetOptions>, and separately add the arrow_schema to the kv_metadata on the TableParquetOptions * refactor: require the arrow_schema key to be present in the kv_metadata, if is required by the configuration * chore: update configs.md * test: update tests to handle the (default) required arrow schema in the kv_metadata * chore: add reference to arrow-rs upstream PR * chore: Create devcontainer.json (#13520) * Create devcontainer.json * update devcontainer * remove useless features * Minor: consolidate ConfigExtension example into API docs (#13954) * Update examples README.md * Minor: consolidate ConfigExtension example into API docs * more docs * Remove update * clippy * Fix issue with ExtensionsOptions docs * Parallelize pruning utf8 fuzz test (#13947) * Add swap_inputs to SMJ (#13984) * 
fix(datafusion-functions-nested): `arrow-distinct` now work with null rows (#13966) * added failing test * fix(datafusion-functions-nested): `arrow-distinct` now work with null rows * Update datafusion/functions-nested/src/set_ops.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update set_ops.rs --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update release instructions for 44.0.0 (#13959) * Update release instructions for 44.0.0 * update macros and order * add functions-table * Add datafusion python 43.1.0 blog post to doc. (#13974) * Include license and notice files in more crates (#13985) * Extract postgres container from sqllogictest, update datafusion-testing pin (#13971) * Add support for sqlite test files to sqllogictest * Removed workaround for bug that was fixed. * Refactor sqllogictest to extract postgres functionality into a separate file. Removed dependency on once_cell in favour of LazyLock. * Add missing license header. * Update rstest requirement from 0.23.0 to 0.24.0 (#13977) Updates the requirements on [rstest](https://github.com/la10736/rstest) to permit the latest version. - [Release notes](https://github.com/la10736/rstest/releases) - [Changelog](https://github.com/la10736/rstest/blob/master/CHANGELOG.md) - [Commits](https://github.com/la10736/rstest/compare/v0.23.0...v0.23.0) --- updated-dependencies: - dependency-name: rstest dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Move hash collision test to run only when merging to main. (#13973) * Update itertools requirement from 0.13 to 0.14 (#13965) * Update itertools requirement from 0.13 to 0.14 Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version. 
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md) - [Commits](https://github.com/rust-itertools/itertools/compare/v0.13.0...v0.13.0) --- updated-dependencies: - dependency-name: itertools dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * Fix build * Simplify * Update CLI lock --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <jonahgao@msn.com> * Change trigger, rename `hash_collision.yml` to `extended.yml` and add comments (#13988) * Rename hash_collision.yml to extended.yml and add comments * Adjust schedule, add comments * Update job, rerun * doc-gen: migrate scalar functions (string) documentation 2/4 (#13925) * doc-gen: migrate scalar functions (string) documentation 2/4 * doc-gen: update function docs * doc: fix related udf order for upper function in documentation * Update datafusion/functions/src/string/concat_ws.rs * Update datafusion/functions/src/string/concat_ws.rs * Update datafusion/functions/src/string/concat_ws.rs * doc-gen: update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> Co-authored-by: Oleks V <comphead@users.noreply.github.com> * Update substrait requirement from 0.50 to 0.51 (#13978) Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version. - [Release notes](https://github.com/substrait-io/substrait-rs/releases) - [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.50.0...v0.51.0) --- updated-dependencies: - dependency-name: substrait dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update release README for datafusion-cli publishing (#13982) * Enhance LastValueAccumulator logic and add SQL logic tests for last_value function (#13980) - Updated LastValueAccumulator to include requirement satisfaction check before updating the last value. - Added SQL logic tests to verify the behavior of the last_value function with merge batches and ensure correct aggregation in various scenarios. * Improve deserialize_to_struct example (#13958) * Cleanup deserialize_to_struct example * prettier * Apply suggestions from code review Co-authored-by: Jonah Gao <jonahgao@msn.com> --------- Co-authored-by: Jonah Gao <jonahgao@msn.com> * Update docs (#14002) * Optimize CASE expression for "expr or expr" usage. (#13953) * Apply optimization for ExprOrExpr. * Implement optimization similar to existing code. * Add sqllogictest. * feat(substrait): introduce consume_rel and consume_expression (#13963) * feat(substrait): introduce consume_rel and consume_expression Route calls to from_substrait_rel and from_substrait_rex through the SubstraitConsumer in order to allow users to provide their own behaviour * feat(substrait): consume nulls of user-defined types * docs(substrait): consume_rel and consume_expression docstrings * Consolidate csv_opener.rs and json_opener.rs into a single example (#… (#13981) * Consolidate csv_opener.rs and json_opener.rs into a single example (#13955) * Update datafusion-examples/examples/csv_json_opener.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion-examples/README.md Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Apply code formatting with cargo fmt --------- Co-authored-by: Sergey Zhukov <szhukov@aligntech.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * FIX : Incorrect NULL handling in BETWEEN expression (#14007) * submodule update * FIX : Incorrect NULL 
handling in BETWEEN expression * Revert "submodule update" This reverts commit 72431aadeaf33a27775a88c41931572a0b66bae3. * fix incorrect unit test * move sqllogictest to expr * feat(substrait): modular substrait producer (#13931) * feat(substrait): modular substrait producer * refactor(substrait): simplify col_ref_offset handling in producer * refactor(substrait): remove column offset tracking from producer * docs(substrait): document SubstraitProducer * refactor: minor cleanup * feature: remove unused SubstraitPlanningState BREAKING CHANGE: SubstraitPlanningState is no longer available * refactor: cargo fmt * refactor(substrait): consume_ -> handle_ * refactor(substrait): expand match blocks * refactor: DefaultSubstraitProducer only needs serializer_registry * refactor: remove unnecessary warning suppression * fix(substrait): route expr conversion through handle_expr * cargo fmt * fix: Avoid re-wrapping planning errors Err(DataFusionError::Plan) for use in plan_datafusion_err (#14000) * fix: unwrapping Err(DataFusionError::Plan) for use in plan_datafusion_err * test: add tests for error formatting during planning * feat: support `RightAnti` for `SortMergeJoin` (#13680) * feat: support `RightAnti` for `SortMergeJoin` * feat: preserve session id when using cxt.enable_url_table() (#14004) * Return error message during planning when inserting into a MemTable with zero partitions. (#14011) * Minor: Rewrite LogicalPlan::max_rows for Join and Union, made it easier to understand (#14012) * Refactor max_rows for join plan, made it easier to understand * Simplified max_rows for Union * Chore: update wasm-supported crates, add tests (#14005) * Chore: update wasm-supported crates * format * Use workspace rust-version for all workspace crates (#14009) * [Minor] refactor: make ArraySort public for broader access (#14006) * refactor: make ArraySort public for broader access Changes the visibility of the ArraySort struct from super to public. 
allows broader access to the struct, enabling its use in other modules and promoting better code reuse. * clippy and docs --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update sqllogictest requirement from =0.24.0 to =0.26.0 (#14017) * Update sqllogictest requirement from =0.24.0 to =0.26.0 Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version. - [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases) - [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.26.0) --- updated-dependencies: - dependency-name: sqllogictest dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * remove version pin and note --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Eduard Karacharov <eduard.karacharov@gmail.com> * `url` dependancy update (#14019) * `url` dependancy update * `url` version update for datafusion-cli * Minor: Improve zero partition check when inserting into `MemTable` (#14024) * Improve zero partition check when inserting into `MemTable` * update err msg * refactor: make structs public and implement Default trait (#14030) * Minor: Remove redundant implementation of `StringArrayType` (#14023) * Minor: Remove redundant implementation of StringArrayType Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> * Deprecate rather than remove StringArrayType --------- Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Added references to IDE documentation for dev containers along with a small note about why one may choose to do development using a dev container. 
(#14014) * Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream (#13995) * Refactor spill handling in GroupedHashAggregateStream to use partial aggregate schema * Implement aggregate functions with spill handling in tests * Add tests for aggregate functions with and without spill handling * Move test related imports into mod test * Rename spill pool test functions for clarity and consistency * Refactor aggregate function imports to use fully qualified paths * Remove outdated comments regarding input batch schema for spilling in GroupedHashAggregateStream * Update aggregate test to use AVG instead of MAX * assert spill count * Refactor partial aggregate schema creation to use create_schema function * Refactor partial aggregation schema creation and remove redundant function * Remove unused import of Schema from arrow::datatypes in row_hash.rs * move spill pool testing for aggregate functions to physical-plan/src/aggregates * Use Arc::clone for schema references in aggregate functions * Encapsulate fields of `EquivalenceProperties` (#14040) * Encapsulate fields of `EquivalenceGroup` (#14039) * Fix error on `array_distinct` when input is empty #13810 (#14034) * fix * add test * oops --------- Co-authored-by: Cyprien Huet <chuet@palantir.com> * Update petgraph requirement from 0.6.2 to 0.7.1 (#14045) * Update petgraph requirement from 0.6.2 to 0.7.1 Updates the requirements on [petgraph](https://github.com/petgraph/petgraph) to permit the latest version. - [Changelog](https://github.com/petgraph/petgraph/blob/master/RELEASES.rst) - [Commits](https://github.com/petgraph/petgraph/compare/petgraph@v0.6.2...petgraph@v0.7.1) --- updated-dependencies: - dependency-name: petgraph dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> * Update datafusion-cli/Cargo.lock --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Encapsulate fields of `OrderingEquivalenceClass` (make field non pub) (#14037) * Complete encapsulatug `OrderingEquivalenceClass` (make fields non pub) * fix doc * Fix: ensure that compression type is also taken into consideration during ListingTableConfig infer_options (#14021) * chore: add test to verify that schema is inferred as expected * chore: add comment to method as suggested * chore: restructure to avoid need to clone * chore: fix flaw in rewrite * feat(optimizer): Enable filter pushdown on window functions (#14026) * feat(optimizer): Enable filter pushdown on window functions Ensures selections can be pushed past window functions similarly to what is already done with aggregations, when possible. 
* fix: Add missing dependency * minor(optimizer): Use 'datafusion-functions-window' as a dev dependency * docs(optimizer): Add example to filter pushdown on LogicalPlan::Window * Unparsing optimized (> 2 inputs) unions (#14031) * tests and optimizer in testing queries * unparse optimized unions * format Cargo.toml * format Cargo.toml * revert test * rewrite test to avoid cyclic dep * remove old test * cleanup * comments and error handling * handle union with lt 2 inputs * Minor: Document output schema of LogicalPlan::Aggregate and LogicalPlan::Window (#14047) * Simplify error handling in case.rs (#13990) (#14033) * Simplify error handling in case.rs (#13990) * Fix issues causing GitHub checks to fail * Update datafusion/physical-expr/src/expressions/case.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> --------- Co-authored-by: Sergey Zhukov <szhukov@aligntech.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * feat: add `AsyncCatalogProvider` helpers for asynchronous catalogs (#13800) * Add asynchronous catalog traits to help users that have asynchronous catalogs * Apply clippy suggestions * Address PR reviews * Remove allow_unused exceptions * Update remote catalog example to demonstrate new helper structs * Move schema_name / catalog_name parameters into resolve function and out of trait * Custom scalar to sql overrides support for DuckDB Unparser dialect (#13915) * Allow adding custom scalar to sql overrides for DuckDB (#68) * Add unit test: custom_scalar_overrides_duckdb * Move `with_custom_scalar_overrides` definition on `Dialect` trait level * Improve perfomance of `reverse` function (#14025) * Improve perfomance of 'reverse' function Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> * Apply sugestion change * Fix typo --------- Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> * docs(ci): use up-to-date protoc with docs.rs (#14048) * fix (#14042) Co-authored-by: Cyprien Huet <chuet@palantir.com> * Re-export TypeSignatureClass from the 
datafusion-expr package (#14051) * Fix clippy for Rust 1.84 (#14065) * fix: incorrect error message of function_length_check (#14056) * minor fix * add ut * remove check for 0 arg * test: Add plan execution during tests for bounded source (#14013) * Bump `ctor` to `0.2.9` (#14069) * Refactor into `LexOrdering::collapse`, `LexRequirement::collapse` avoid clone (#14038) * Move collapse_lex_ordering to Lexordering::collapse * reduce diff * avoid clone, cleanup * Introduce LexRequirement::collapse * Improve performance of collapse, from @akurmustafa https://github.com/alamb/datafusion/pull/26 fix formatting * Revert "Improve performance of collapse, from @akurmustafa" This reverts commit a44acfdb3af5bf0082c277de6ee7e09e92251a49. * remove incorrect comment --------- Co-authored-by: Mustafa Akur <akurmustafa@gmail.com> * Bump `wasm-bindgen` and `wasm-bindgen-futures` (#14068) * update (#14070) * fix: make get_valid_types handle TypeSignature::Numeric correctly (#14060) * fix get_valid_types with TypeSignature::Numeric * fix sqllogictest * Minor: Make `group_schema` as `PhysicalGroupBy` method (#14064) * group shema as method Signed-off-by: Jay Zhan <jayzhan211@gmail.com> * fmt Signed-off-by: Jay Zhan <jayzhan211@gmail.com> --------- Signed-off-by: Jay Zhan <jayzhan211@gmail.com> * Minor: Move `LimitPushdown` tests to be in the same file as the code (#14076) * Minor: move limit_pushdown tests to be with their pass * Fix clippy * cleaup use * fmt * Add comments to physical optimizer tests (#14075) * added "DEFAULT_CLI_FORMAT_OPTIONS" for cli and sqllogic test (#14052) * added "DEFAULT_CLI_FORMAT_OPTIONS" for cli and sqllotic test * cargo fmt fix * fixed few errors * Add H2O.ai Database-like Ops benchmark to dfbench (groupby support) (#13996) * Add H2O.ai Database-like Ops benchmark to dfbench * Fix query and fmt * Change venv * Make sure venv version support falsa * Fix default path * Support groupby only now * fix * Address comments * fix * support python version higher * 
support higer python such as python 3.13 * Addressed new comments * Add specific query example * Add telemetry.sh to list of use cases (#14090) * chore: deprecate `ValuesExec` in favour of `MemoryExec` (#14032) * chore: deprecate `ValuesExec` in favour of `MemoryExec` * clippy fix * Update datafusion/physical-plan/src/values.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * change to memoryexec * Update datafusion/physical-plan/src/memory.rs Co-authored-by: Jay Zhan <jayzhan211@gmail.com> * use compute properties * clippy fix --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Jay Zhan <jayzhan211@gmail.com> * Improve performance of `find_in_set` function (#14020) * Improve performance of 'find_in_set' function Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> * Remove clippy warnings * Support scalar args for 'find_in_set' function Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> --------- Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> * minor: Add link to example in catalog (#14062) * fix set (#14081) Signed-off-by: Jay Zhan <jayzhan211@gmail.com> * Simplify the return type of `sql_select_to_rex()` (#14088) * Minor: Add a link to RecordBatchStreamAdapter to `SendableRecordBatchStream` (#14084) * Update substrait requirement from 0.51 to 0.52 (#14107) * Update substrait requirement from 0.51 to 0.52 Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version. - [Release notes](https://github.com/substrait-io/substrait-rs/releases) - [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.51.0...v0.52.0) --- updated-dependencies: - dependency-name: substrait dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> * Handle addition of `ReadType::IcebergTable` --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * fix null date args in range (#14093) * fix null dates in range * fix --------- Co-authored-by: Cyprien Huet <chuet@palantir.com> * feat: add support for `LogicalPlan::DML(...)` serde (#14079) * Add support for DML serialization to proto closes: #13616 * add round trip test for DML serde * cover all cases in round trip test * minor: change ordering of enum type * Avoid Aliased Window Expr Enter Unreachable Code (#14109) * clarify logic in nth_value window function (#14104) * Move JoinSelection into datafusion-physical-optimizer crate (#14073) (#14085) * Move JoinSelection into datafusion-physical-optimizer crate (#14073) * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Fix issues causing GitHub checks to fail * Lock aws-sdk crates to fix MSRV check * fix comment * fix compilation --------- Co-authored-by: Sergey Zhukov <szhukov@aligntech.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * distinguish none and empty projection (#14116) * Add a hint about normalization in error message (#14089) (#14113) * Add a hint about normalization in error message (#14089) * normalization suggestion is only shown when a column name matches schema --------- Co-authored-by: Sergey Zhukov <szhukov@aligntech.com> * fix: incorrect NATURAL/USING JOIN schema (#14102) * fix: incorrect NATURAL/USING JOIN schema * Add test * Simplify exclude_using_columns * Add more tests * Chore: refactor DataSink traits to avoid duplication (#14121) * add some abstractions to file sinkers and centralize FileSinkConfig based behaviors * satisfy clippy * typo fix * move start_demuxer_task back into demux.rs add file_extension to FileSinkConfig * fix errors * merge get_writer_schema functions add schema() function to 
DataSink trait make FileSink a subtrait for DataSink * Unify write_all for all FileSink implementers * DRY builder/header fetch * Remove more duplication * enrich documentation for spawn_writer_tasks_and_join * fix cargo doc --------- Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com> * Add sqlite sqllogictest run to extended.yml for running sqlite test suite against every push to main. (#14101) * Fix duplicated SharedBitmapBuilder definitions (#14122) * feat: add `alias()` method for DataFrame (#14127) * feat: add `alias()` method for DataFrame * doc-gen: make user_doc to work with predefined consts (#14086) * Minor: Document the rationale for the lack of Cargo.lock (#14071) * Minor: Document the rationale for the lack of Cargo.lock * Update README.md --------- Co-authored-by: Oleks V <comphead@users.noreply.github.com> * Return err if wildcard is not expanded before type coercion (#14130) * Return err if wildcard is not expanded before type coercion * fix test * fix clippy * improve test --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Fix combine with session config (#14139) * fix combine with session config * must use it * Minor: move resolve_overlap a method on OrderingEquivalenceClas (#14138) * doc-gen: migrate scalar functions (encoding & regex) documentation (#13919) * doc-gen: migrate scalar functions (encoding & regex) documentation * fix: fix typo * doc: fix typo --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * bugfix: create view with multi union may get wrong schema (#14132) (#14133) create view with analyzer rule of TypeCoercion Co-authored-by: chenmch <chenmch@diit.cn> * Deduplicate function `get_final_indices_from_shared_bitmap` (#14145) * Deduplicate function get_final_indices_from_shared_bitmap * update * Add tests for PR #14133 (view with multi unions) (#14152) * bugfix: create view with multi union may get wrong schema (#14132) create view with analyzer rule of TypeCoercion * test: Add tests for PR #14133 
(create view with multi unions) --------- Co-authored-by: chenmch <chenmch@diit.cn> * Update datafusion-testing git hash (#14137) * Reuse `on` expressions values in HashJoinExec (#14131) * Reduce duplicated build side experssions evaluations in HashJoinExec * Reuse probe side on expressions values * fix: encode should work with non-UTF-8 binaries (#14087) * fix: encode function should work with strings and binary closes #14055 * chore: address comments, add test * chore: move `SanityChecker` into `physical-optimizer` crate (#14083) * chore: move into crate * chore: move SanityChecker tests out to datafusion/core/tests * chore: update datafusion-cli/Cargo.lock * fix cargo doc --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Propagate table constraints through physical plans to optimize sort operations (#14111) * Add projection to `Constraints` * Add constraints support to `EquivalenceProperties` * Pass constraints to physical plan * Add slt test for primary key sort optimization * Pass constraints to MemoryExec * Update properties.rs * Simplify MemoryExec instantiation * Rename EquivalenceProperties method name for clarity * Refactor projection handling in FileScanConfig * Bug fix * Display constraints on data sources * Bug fix and test improvements * Use different schemas for tests * Lint and visibility fix * Fixes after merge * Review part 1 * Update memory.rs * update dep * update proto * add aggregate distinct * minor * Update order.slt * undo proto * Update properties.rs * Move reserved entry * Update `FileScanConfig` to return a single projected configuration object * Improve constraint based ordering satisfaction logic * Update datafusion/physical-plan/src/aggregates/mod.rs Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com> * Revert "Update `FileScanConfig` to return a single projected configuration object" This reverts commit bbe35d48fb5c4af573fdf0ef81375ea0c72c0327. 
* Refactor MemoryExec constraints display Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com> * Avoid unnecessary clone * Refactor constraint based ordering satisfaction logic * Cargo fmt * Revert "Avoid unnecessary clone" This reverts commit ab93279287311e4f6b5239fde2e8f98a22141c54. * Avoid unnecessary clone * Update properties.rs * Bug fix * Make `update_elements_with_matching_indices` take iterators for proj_indices * Revert "Make `update_elements_with_matching_indices` take iterators for proj_indices" This reverts commit d136860e2eb5054bd0a57588ea62337aa2712035. --------- Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai> Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com> * NestedLoopJoin Projection Pushdown (#14120) * nlj proj pushdown Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * fmt Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * move swap proj to util Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * fmt Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * fix proto Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * fmt Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * use none Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * proto fix Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * fix slt Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * Update projection_pushdown.rs * refactor: streamline projection pushdown logic for join operations * minor * fmt Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> --------- Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai> * Fix: regularize order bys when consuming from substrait (#14125) * Fix: regularize order bys when consuming from substrait * Add window_function_with_range_unit_and_no_order_by test * Fix typo in comment * Remove dependency on physical-optimizer on functions-aggregates (#14134) * Remove dependency on physical-optimizer on functions-aggregates * update lock * doc-gen: migrate scalar functions (other, conditional, and struct) documentation (#14163) * 
chore: fix flaky tests (#14170) * Upgrade arrow-rs, parquet to `54.0.0` and pyo3 to `0.23.3` (#14153) * Upgrade arrow-rs, parquet and pyo3 * Fix fmt CI * Simplify Bloom Filter Check (#14165) * Make `LexOrdering::inner` non pub, add comments, update usages (#14155) * Fix doctests in ScalarValue (#14164) (#14178) Co-authored-by: Sergey Zhukov <szhukov@aligntech.com> * Add `ScalarValue::try_as_str` to get str value from logical strings (#14167) * fix: handle scalar predicates in CASE expressions to prevent internal errors for InfallibleExprOrNull eval method (#14156) * fix: handle scalar predicates in CASE expressions to prevent internal errors for InfallibleExprOrNull eval method * Update to latest datafusion-testing commit --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Minor: Consolidate dataframe tests into core_integration (#14169) * Cut/paste dataframe tests to integration * Fix test issues * clippy * Add a hint about expected extension in error message in register_csv,… (#14168) * Add a hint about expected extension in error message in register_csv, register_parquet, register_json, register_avro (#14144) * Add tests for error * fix test * fmt * Fix issues causing GitHub checks to fail * revert datafusion-testing change --------- Co-authored-by: Sergey Zhukov <szhukov@aligntech.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Test: Validate memory limit for sort queries to extended test (#14142) * External memory limit validation for sort * add bug tracker * cleanup * Update submodule * reviews * fix CI * move feature to module level * refactor: switch BooleanBufferBuilder to NullBufferBuilder in sort function (#14183) Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Add section to howtos.md (#14171) * refactor: switch BooleanBufferBuilder to NullBufferBuilder in correlation function (#14181) Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Rename extended test job name (#14199) * Added job board as a separate header in the 
documentation (#14191) * Added job board as a separate header in the documentation * Update docs/source/contributor-guide/communication.md Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update docs/source/contributor-guide/communication.md Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * prettier --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Support spaceship operator (`<=>`) support (alias for `IS NOT DISTINCT FROM` (#14187) * Mapped the Spaceship operator with IsNotDistinctFrom * Added tests for Spaceship Operator <=> * Added sanity test for Spaceship Operator <=> * Add benchmark for planning sorted unions (#14157) * feat: Use `SchemaRef` in `JoinFilter` (#14182) * feat: Use `SchemaRef` in `JoinFilter` * Update datafusion/core/src/physical_optimizer/projection_pushdown.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/physical-plan/src/joins/join_filter.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/physical-plan/src/joins/join_filter.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/physical-plan/src/joins/join_filter.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * fix --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * refactor: switch BooleanBufferBuilder to NullBufferBuilder in functions-nested functions (#14201) Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> * Improve case expr constant handling, Add .slt test (#14159) * Minor add ticket references to deprecated code (#14174) --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> Signed-off-by: zjregee <zjregee@gmail.com> Signed-off-by: MBWhite <whitemat@uk.ibm.com> Signed-off-by: jayzhan211 <jayzhan211@gmail.com> Signed-off-by: Jay Zhan <jayzhan211@gmail.com> Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> Co-authored-by: Eason <30045503+Eason0729@users.noreply.github.com> Co-authored-by: Andrew Lamb 
<andrew@nerdnetworks.org> Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com> Co-authored-by: Jax Liu <liugs963@gmail.com> Co-authored-by: Marc Droogh <33723117+mdroogh@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Marc Droogh <marc.droogh@imc.com> Co-authored-by: Berkay Şahin <124376117+berkaysynnada@users.noreply.github.com> Co-authored-by: Yongting You <2010youy01@gmail.com> Co-authored-by: Onur Satici <onursatici@users.noreply.github.com> Co-authored-by: Tai Le Manh <manhtai.lmt@gmail.com> Co-authored-by: zjregee <zjregee@gmail.com> Co-authored-by: Jay Zhan <jay.zhan@synnada.ai> Co-authored-by: Oleks V <comphead@users.noreply.github.com> Co-authored-by: Rohan Krishnaswamy <47869999+rkrishn7@users.noreply.github.com> Co-authored-by: Jonah Gao <jonahgao@msn.com> Co-authored-by: Matthew B White <matthew@mh-white.com> Co-authored-by: Jay Zhan <jayzhan211@gmail.com> Co-authored-by: Raz Luvaton <16746759+rluvaton@users.noreply.github.com> Co-authored-by: Jack <56563911+jdockerty@users.noreply.github.com> Co-authored-by: Burak Şen <buraksenb@gmail.com> Co-authored-by: Qi Zhu <821684824@qq.com> Co-authored-by: Kyle Barron <kylebarron2@gmail.com> Co-authored-by: Eduard Karacharov <eduard.karacharov@gmail.com> Co-authored-by: cht42 <42912042+cht42@users.noreply.github.com> Co-authored-by: Cyprien Huet <chuet@palantir.com> Co-authored-by: mertak-synnada <mertak67+synaada@gmail.com> Co-authored-by: wiedld <wiedld@users.noreply.github.com> Co-authored-by: Arttu <Blizzara@users.noreply.github.com> Co-authored-by: Daniel Hegberg <daniel.hegberg@gmail.com> Co-authored-by: Costi Ciudatu <ccciudatu@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: Alex Kesling <alex@kesling.co> Co-authored-by: Goksel Kabadayi <45314116+gokselk@users.noreply.github.com> Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai> Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com> 
Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com> Co-authored-by: Chunchun Ye <14298407+appletreeisyellow@users.noreply.github.com> Co-authored-by: Arttu Voutilainen <avo@iki.fi> Co-authored-by: Dmitrii Blaginin <dmitrii@blaginin.me> Co-authored-by: zhuliquan <zlqlovecode@foxmail.com> Co-authored-by: Bruce Ritchie <bruce.ritchie@veeva.com> Co-authored-by: Victor Barua <victor.barua@datadoghq.com> Co-authored-by: robtandy <rob.tandy@gmail.com> Co-authored-by: Jack Park <xarus01@gmail.com> Co-authored-by: UBarney <UBarney@users.noreply.github.com> Co-authored-by: Zhang Li <richselian@gmail.com> Co-authored-by: zhangli20 <zhangli20@kuaishou.com> Co-authored-by: Spaarsh <67336892+Spaarsh@users.noreply.github.com> Co-authored-by: xudong.w <wxd963996380@gmail.com> Co-authored-by: Namgung Chan <33323415+getChan@users.noreply.github.com> Co-authored-by: Tim Saucer <timsaucer@gmail.com> Co-authored-by: Takahiro Ebato <takahiro.ebato@gmail.com> Co-authored-by: Alihan Çelikcan <alihan.celikcan@synnada.ai> Co-authored-by: Ian Lai <108986288+Chen-Yuan-Lai@users.noreply.github.com> Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> Co-authored-by: delamarch3 <68732277+delamarch3@users.noreply.github.com> Co-authored-by: Andrew Kane <andrew@ankane.org> Co-authored-by: Matthew Turner <matthew.m.turner@outlook.com> Co-authored-by: Andre Weltsch <aweltsch@users.noreply.github.com> Co-authored-by: Sergey Zhukov <62326549+cj-zhukov@users.noreply.github.com> Co-authored-by: Sergey Zhukov <szhukov@aligntech.com> Co-authored-by: Aleksey Kirilishin <54231417+avkirilishin@users.noreply.github.com> Co-authored-by: irenjj <renj.jiang@gmail.com> Co-authored-by: Marko Milenković <milenkovicm@users.noreply.github.com> Co-authored-by: Eugene Marushchenko <maruschin@gmail.com> Co-authored-by: Lordworms <48054792+Lordworms@users.noreply.github.com> Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com> Co-authored-by: Dharan Aditya <dharan.aditya@gmail.com> 
Co-authored-by: Vadim Piven <vadim@piven.tech> Co-authored-by: kosiew <kosiew@gmail.com> Co-authored-by: Tim Van Wassenhove <github@timvw.be> Co-authored-by: nuno-faria <nunofpfaria@gmail.com> Co-authored-by: Mohamed Abdeen <83442793+MohamedAbdeen21@users.noreply.github.com> Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com> Co-authored-by: Wendell Smith <wendell.smith@datadoghq.com> Co-authored-by: niebayes <niebayes@gmail.com> Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com> Co-authored-by: Mustafa Akur <akurmustafa@gmail.com> Co-authored-by: Jagdish Parihar <jatin6972@gmail.com> Co-authored-by: TheBuilderJR <46176773+TheBuilderJR@users.noreply.github.com> Co-authored-by: Jonathan Chen <chenleejonathan@gmail.com> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com> Co-authored-by: 张林伟 <lewiszlw520@gmail.com> Co-authored-by: ding-young <lsyhime@snu.ac.kr> Co-authored-by: Will Golioto <36157286+Curricane@users.noreply.github.com> Co-authored-by: chenmch <chenmch@diit.cn> Co-authored-by: Daniel Mesejo <mesejoleon@gmail.com> Co-authored-by: Mrinal Paliwal <mrinal16164@iiitd.ac.in> Co-authored-by: Gabriel <45515538+gabotechs@users.noreply.github.com> Co-authored-by: Owen Leung <owen.leung2@gmail.com> Co-authored-by: Edmondo Porcu <edmondo.porcu@gmail.com>

Commit:05f4e5a
Author:Jay Zhan
Committer:GitHub

NestedLoopJoin Projection Pushdown (#14120) * nlj proj pushdown Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * fmt Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * move swap proj to util Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * fmt Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * fix proto Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * fmt Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * use none Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * proto fix Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * fix slt Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> * Update projection_pushdown.rs * refactor: streamline projection pushdown logic for join operations * minor * fmt Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> --------- Signed-off-by: Jay Zhan <jay.zhan@synnada.ai> Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai>

Commit:3cd31af
Author:Goksel Kabadayi
Committer:GitHub

Propagate table constraints through physical plans to optimize sort operations (#14111) * Add projection to `Constraints` * Add constraints support to `EquivalenceProperties` * Pass constraints to physical plan * Add slt test for primary key sort optimization * Pass constraints to MemoryExec * Update properties.rs * Simplify MemoryExec instantiation * Rename EquivalenceProperties method name for clarity * Refactor projection handling in FileScanConfig * Bug fix * Display constraints on data sources * Bug fix and test improvements * Use different schemas for tests * Lint and visibility fix * Fixes after merge * Review part 1 * Update memory.rs * update dep * update proto * add aggregate distinct * minor * Update order.slt * undo proto * Update properties.rs * Move reserved entry * Update `FileScanConfig` to return a single projected configuration object * Improve constraint based ordering satisfaction logic * Update datafusion/physical-plan/src/aggregates/mod.rs Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com> * Revert "Update `FileScanConfig` to return a single projected configuration object" This reverts commit bbe35d48fb5c4af573fdf0ef81375ea0c72c0327. * Refactor MemoryExec constraints display Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com> * Avoid unnecessary clone * Refactor constraint based ordering satisfaction logic * Cargo fmt * Revert "Avoid unnecessary clone" This reverts commit ab93279287311e4f6b5239fde2e8f98a22141c54. * Avoid unnecessary clone * Update properties.rs * Bug fix * Make `update_elements_with_matching_indices` take iterators for proj_indices * Revert "Make `update_elements_with_matching_indices` take iterators for proj_indices" This reverts commit d136860e2eb5054bd0a57588ea62337aa2712035. --------- Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai> Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>

Commit:c7c200a
Author:mertak-synnada
Committer:GitHub

Chore: refactor DataSink traits to avoid duplication (#14121) * add some abstractions to file sinkers and centralize FileSinkConfig based behaviors * satisfy clippy * typo fix * move start_demuxer_task back into demux.rs add file_extension to FileSinkConfig * fix errors * merge get_writer_schema functions add schema() function to DataSink trait make FileSink a subtrait for DataSink * Unify write_all for all FileSink implementers * DRY builder/header fetch * Remove more duplication * enrich documentation for spawn_writer_tasks_and_join * fix cargo doc --------- Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>

Commit:722307f
Author:Marko Milenković
Committer:GitHub

feat: add support for `LogicalPlan::DML(...)` serde (#14079) * Add support for DML serialization to proto closes: #13616 * add round trip test for DML serde * cover all cases in round trip test * minor: change ordering of enum type

Commit:b54e648
Author:wiedld
Committer:GitHub

Supporting writing schema metadata when writing Parquet in parallel (#13866) * refactor: make ParquetSink tests a bit more readable * chore(11770): add new ParquetOptions.skip_arrow_metadata * test(11770): demonstrate that the single threaded ParquetSink is already writing the arrow schema in the kv_meta, and allow disablement * refactor(11770): replace with new method, since the kv_metadata is inherent to TableParquetOptions and therefore we should explicitly make the API apparent that you have to include the arrow schema or not * fix(11770): fix parallel ParquetSink to encode arrow schema into the file metadata, based on the ParquetOptions * refactor(11770): provide deprecation warning for TryFrom * test(11770): update tests with new default to include arrow schema * refactor: including partitioning of arrow schema inserted into kv_metadata * test: update tests for new config prop, as well as the new file partition offsets based upon larger metadata * chore: avoid cloning in tests, and update code docs * refactor: return to the WriterPropertiesBuilder::TryFrom<TableParquetOptions>, and separately add the arrow_schema to the kv_metadata on the TableParquetOptions * refactor: require the arrow_schema key to be present in the kv_metadata, if is required by the configuration * chore: update configs.md * test: update tests to handle the (default) required arrow schema in the kv_metadata * chore: add reference to arrow-rs upstream PR

Commit:3467011
Author:Costi Ciudatu
Committer:GitHub

[bugfix] ScalarFunctionExpr does not preserve the nullable flag on roundtrip (#13830) * [test] coalesce round trip schema mismatch * [proto] added the nullable flag in PhysicalScalarUdfNode * [bugfix] propagate the nullable flag for serialized scalar UDFS

Commit:01ffb64
Author:Daniel Hegberg
Committer:GitHub

Support Null regex override in csv parser options. (#13228) Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:023b018
Author:Onur Satici
Committer:GitHub

support unknown col expr in proto (#13603)

Commit:d840e98
Author:Sherin Jacob
Committer:GitHub

fix: serialize user-defined window functions to proto (#13421) * Adds roundtrip physical plan test * Adds enum for udwf to `WindowFunction` * initial fix for serializing udwf * Revives deleted test * Adds codec methods for physical plan * Rewrite error message * Minor: rename binding + formatting fixes * Extends `PhysicalExtensionCodec` for udwf * Minor: formatting * Restricts visibility to tests

Commit:75a27a8
Author:Andrew Lamb
Committer:GitHub

Remove `BuiltInWindowFunction` (LogicalPlans) (#13393) * Remove BuiltInWindowFunction * fix docs * Fix typo

Commit:54ab128
Author:Burak Şen
Committer:GitHub

Convert `nth_value` builtIn function to User Defined Window Function (#13201) * refactored nth_value * continue * test * proto and rustlint * fix datatype * cont * cont * apply jcsherins early validation * docs * doc * Apply suggestions from code review Co-authored-by: Sherin Jacob <jacob@protoship.io> * passes lint but does not have tests * continue * Update roundtrip_physical_plan.rs * udwf, not udaf * fix bounded but not fixed roundtrip * added * Update datafusion/sqllogictest/test_files/errors.slt Co-authored-by: Sherin Jacob <jacob@protoship.io> --------- Co-authored-by: Sherin Jacob <jacob@protoship.io> Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:cd69e37
Author:Leonardo Yvens
Committer:GitHub

support recursive CTEs logical plans in datafusion-proto (#13314) * support LogicalPlan::RecursiveQuery in datafusion-proto * fixed and failing test roundtrip_recursive_query * fix rebase artifact * add node for CteWorkTableScan in datafusion-proto * Use Arc::clone --------- Co-authored-by: jonahgao <jonahgao@msn.com>

Commit:39aa15e
Author:Alihan Çelikcan
Committer:GitHub

Change `schema_infer_max_rec` config to use `Option<usize>` rather than `usize` (#13250) * Make schema_infer_max_rec an Option * Add lifetime parameter to CSV and compression BoxStreams

Commit:e8520ab
Author:Lordworms
Committer:GitHub

fix bugs explain with non-correlated query (#13210) * fix bugs explain with non-correlated query * Use explicit enum for physical errors * fix comments / fmt * strip_backtrace to pass ci --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:2047d7f
Author:Emil Ejbyfeldt
Committer:GitHub

feat: Implement LeftMark join to fix subquery correctness issue (#13134) * Implement LeftMark join In https://github.com/apache/datafusion/pull/12945 the emulation of a mark join has a bug when there are duplicate values in the subquery. This would be fixable by adding a distinct before the join. But this patch instead implements a LeftMark join with the desired semantics and uses that. The LeftMark join will return a row for each row in the left input with an additional column "mark" that is true if there was a match in the right input and false otherwise. Note: This patch does not implement the full null semantics for the mark join described in http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf which will be needed if we add `ANY` subqueries. In the version in this patch the mark column will only be true when there was a match and false when no match was found, never `null`. * Use mark join in decorrelate subqueries This fixes a correctness issue in the current approach. * Add physical plan sqllogictest * fmt * Fix join type in doc comment * Minor clean ups * Add more documentation to LeftMark join * Remove qualification * fix doc --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
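
Illustration (not part of the commit): the LeftMark semantics described in the message above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not DataFusion's implementation. The function name `left_mark_join`, the dict-based rows, and the sample data are all hypothetical, and null handling is deliberately left out.

```python
def left_mark_join(left_rows, right_rows, key):
    """Toy model of a LeftMark join (hypothetical helper, not a DataFusion API).

    Returns every left row exactly once, with an extra boolean "mark" column
    that is True when at least one right row has the same join key.
    """
    # Collapse the right side to a set of keys, so duplicate matches on the
    # right can never duplicate a left row in the output.
    right_keys = {row[key] for row in right_rows}
    return [{**row, "mark": row[key] in right_keys} for row in left_rows]


left = [{"id": 1}, {"id": 2}, {"id": 3}]
right = [{"id": 1}, {"id": 1}, {"id": 3}]  # id 1 appears twice on purpose

result = left_mark_join(left, right, "id")
for row in result:
    print(row)
```

The duplicate `id` 1 on the right side is the case the commit message calls out: an ordinary left join would emit that left row twice, while the mark join emits it once with `mark` set to true.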

Commit:02b9693
Author:Jagdish Parihar
Committer:GitHub

Convert `ntile` builtIn function to UDWF (#13040) * converting to ntile udwf * updated the window functions documentation file * wip: update the ntile udwf function * fix the roundtrip_logical_plan.rs * removed builtIn ntile function * fixed field name issue * fixing the return type of ntile udwf * error if UInt64 conversion fails * handling if null is found * handling if value is zero or less than zero * removed unused import * updated prost.rs file * removed dead code * fixed clippy error * added inner doc comment * minor fixes and added roundtrip logical plan test * removed parse_expr in ntile

Commit:13a4225
Author:Jax Liu
Committer:GitHub

Introduce `binary_as_string` parquet option, upgrade to arrow/parquet `53.2.0` (#12816) * Update to arrow-rs 53.2.0 * introduce binary_as_string parquet option * Fix test --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:a4e6b07
Author:Jonathan Chen
Committer:GitHub

feat: Convert CumeDist to UDWF (#13051) * Transferred cumedist * fixes * remove expr tests * small fix * small fix * check * clippy fix * roundtrip fix

Commit:c7e5d8d
Author:Duong Cong Toai
Committer:GitHub

Improve recursive `unnest` options API (#12836) * refactor * refactor unnest options * more test * resolve comments * add back doc * fix proto * flaky test * clippy * use indexmap * chore: compile err * chore: update cargo * chore: fmt cargotoml --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:efe5708
Author:jcsherin
Committer:GitHub

Convert `BuiltInWindowFunction::{Lead, Lag}` to a user defined window function (#12857) * Move `lead-lag` to `functions-window` package * Builds with warnings * Adds `PartitionEvaluatorArgs` * Extracts `shift_offset` from input expressions * Computes shift offset * Get default value from input expression * Implements `partition_evaluator` * Fixes compiler warnings * Comments out failing tests * Fixes `cargo test` errors and warnings * Minor: taplo formatting * Delete code * Define `lead`, `lag` user-defined window functions * Fixes `cargo build` errors * Export udwf and expression public APIs * Mark result field as nullable * Delete `return_type` tests for `lead` and `lag` * Disables test: window function case insensitive * Fixes: lowercase name in logical plan * Reverts to old methods for computing `shift_offset`, `default_value` * Implements expression reversal * Fixes: lowercase name in logical plans * Fixes: doc test compilation errors Fixes: doc test build errors * Temporarily quite clippy errors * Fixes proto defintion * Minor: fixes formatting * Fixes: doc tests * Uses macro for defining `lag_udwf()` and `leag_udwf()` * Fixes: window fuzz test cases * Copies doc comments verbatim from `BuiltInWindowFunction` enum * Deletes from window function case insensitive test * Deletes `BuiltInWindowFunction` expression APIs * Delete from `create_built_in_window_expr` * Deletes proto serialization * Delete from `BuiltInWindowFunction` enum * Deletes test for finding built-in window function * Fixes build errors + deletes redundant code * Deletes more code * Delete unnecessary structs * Refactors shift offset computation * Passes range unit test * Fixes: clippy::get-first error * Rewrite unit tests for WindowUDF * Fixes: unit test for lag with default value * Consistent input expressions and data types in unit tests * Minor: fixes formatting * Restore original helper method for unit tests * Revert "Refactors shift offset computation" This reverts commit 
000ceb76409e66230f9c5017a30fa3c9bb1e6575. * Moves helper functions into `functions-window-common` package * Uses common helper functions in `{lead, lag}` * Minor: formatting * Revert "Moves helper functions into `functions-window-common` package" This reverts commit ab8a83c9c11ca3a245278f6f300438feaacb0978. * Moves common functions to utils * Minor: formatting fixes * Update lowercase names in explain output * Adds doc for `lead()` and `lag()` expression functions * Add doc for `WindowShiftKind::shift_offset` * Remove `arrow` dev dependency * Minor: formatting * Update inner doc comment * Serialize 1 or more window function arguments * Adds logical plan roundtrip test cases * Refactor: readability of unit tests * Minor: rename variable bindings * Minor: copy edit * Revert "Remove `arrow` dev dependency" This reverts commit 3eb09856c8ec4ddce20472deee2df590c2fd3f35. * Move null argument handling helper to utils * Disable failing sqllogic tests for handling NULL input * Revert "Disable failing sqllogic tests for handling NULL input" This reverts commit 270a2030637012d549c001e973a0a1bb6b3d4dd0. * Fixes: incorrect NULL handling in `lead`/`lag` window function * Adds more tests cases --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:636f433
Author:Haile
Committer:GitHub

Minor: add flags for temporary ddl (#12561) * Minor: add flags for temporary ddl Signed-off-by: Haile Lagi <52631736+hailelagi@users.noreply.github.com> * Update datafusion/proto/src/logical_plan/mod.rs Co-authored-by: Jonah Gao <jonahgao@msn.com> --------- Signed-off-by: Haile Lagi <52631736+hailelagi@users.noreply.github.com> Co-authored-by: Jonah Gao <jonahgao@msn.com>

Commit:939ef9e
Author:Jagdish Parihar
Committer:GitHub

Convert `rank` / `dense_rank` and `percent_rank` builtin functions to UDWF (#12718) * wip: converting rank builtin function to UDWF * commented BuiltInWindowFunction in datafusion.proto and fixed issue related to Datafusion window function * implemented rank.rs, percent_rank.rs and dense_rank.rs in datafusion functions-window * removed a test from built in window function test for percent_rank and updated pbson fields * removed unnecessary code * added window_functions field to the MockSessionState * updated rank, percent_rank and dense_rank udwf to use macros * wip: fix rank functionality in sql integration * fixed rank udwf not found issue in sql_integration.rs * evaluating rank, percent_rank and dense_rank udwf with evaluate_with_rank function * fixed rank projection test * wip: fixing the percent_rank() documentation * fixed the docs error issue * fixed data type of the percent_rank udwf * updated prost.rs file * updated test and documentation * Fix logical conflicts * tweak module documentation --------- Co-authored-by: jatin <jagdish@kunato.io> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:3892499
Author:Fredrik Meringdal
Committer:GitHub

Support REPLACE INTO for INSERT statements (#12516) * Add support for REPLACE INTO statements. This commit introduces an `InsertOp` enum to replace the boolean `overwrite` flag to provide a more clear and flexible control over how data is inserted. This change updates the following APIs and configs to reflect the change: `TableProvider::insert_into`, `FileSinkConfig` and `DataFrameWriteOptions`. * fix clippy and add license * Update vendored code --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
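
The move from a boolean `overwrite` flag to an enum, as the commit above describes, can be sketched roughly as follows. The variant names and both helper functions are assumptions made for illustration; only the name `InsertOp` and the fact that it replaces the boolean flag come from the commit message.

```rust
/// Hypothetical mirror of the `InsertOp` idea; the variant names are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InsertOp {
    Append,
    Overwrite,
    Replace,
}

/// Maps the old boolean flag onto the enum. `Replace` has no boolean
/// equivalent, which is why a two-state flag could not express it.
fn from_overwrite_flag(overwrite: bool) -> InsertOp {
    if overwrite {
        InsertOp::Overwrite
    } else {
        InsertOp::Append
    }
}

/// Returns an illustrative SQL phrase for each mode.
fn sql_keyword(op: InsertOp) -> &'static str {
    match op {
        InsertOp::Append => "INSERT INTO",
        InsertOp::Overwrite => "INSERT OVERWRITE",
        InsertOp::Replace => "REPLACE INTO",
    }
}
```

An exhaustive `match` like the one in `sql_keyword` forces every call site to handle a newly added mode, which a boolean parameter cannot do.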

Commit:b35e720
Author:Duong Cong Toai
Committer:GitHub

Refactor to support recursive unnest in physical plan (#11577) * chore: poc * fix unnest struct * UT for memoization * remove unnessary projection * chore: temp test case * multi depth unnest supported * chore: add map of original column and transformed col * transformation map to physical layer * prototype for recursive array length * chore: some compile err * finalize input type in physical layer * chore: refactor unnest builder * add unnesting type inferred * fix compile err * fail test in builder * Compile err * chore: detect some bugs * some work * support recursive unnest in physical layer * UT for new build batch function * compile err * fix unnesting into empty arrays * some comment * fix unnest struct * some note * chore: fix all test failure * fix projection pushdown * custom rewriter for recursive unnest * simplify * rm unnecessary projection * chore: better comments * more comments * chore: better comments * remove breaking api * rename * more unit test * remove debug * clean up * fix proto * fix dataframe * fix clippy * cargo fmt * fix some test * fix all test * fix unnest in join * fix doc and tests * chore: better doc * better doc * tune comment * rm todo * refactor * chore: reserve test * add a basic test * chore: more document * doc on ColumnUnnestType List * chore: add partialord to new types

Commit:3ece7a7
Author:wiedld
Committer:GitHub

Fix parquet statistics for ListingTable and Utf8View with `schema_force_string_view`, rename config option to `schema_force_view_types` (#12232) * chore: move schema_force_string_view upwards to be listed with other reading props * refactor(12123): have file schema be merged on view types with table schema * test(12123): test for with, and without schema_force_string_view * test(12123): demonstrate current upstream failure when reading page stats * chore(12123): update config.md * chore: cleanup * chore(12123): temporarily remove test until next arrow release * chore(12123): rename all variables to force_view_types * refactor(12123): make interface ParquetFormat::with_force_view_types public * chore(12123): rename helper method which coerces the schema (not merging fields) * chore(12123): add dosc to ParquetFormat to clarify exactly how the view types are used * test(12123): cleanup tests to be more explicit with ForceViews enum * test(12123): update tests to pass now that latest arrow-rs release is in * fix: use proper naming on benchmark

Commit:4659096
Author:Emil Ejbyfeldt
Committer:GitHub

feat: Add projection to FilterExec (#12281) * Implement Partitioning::project to avoid code duplication * feat: Projection inside FilterExec * Fix proto serialization * Update tpch plans * PR comments, improved doc string and spelling

Commit:ddfbf7a
Author:Georgi Krastev
Committer:GitHub

Support encoding and decoding UnnestExec (#12344)

Commit:008c942
Author:Jax Liu
Committer:GitHub

Support the custom terminator for the CSV file format (#12263) * add terminator config to CsvConfig * add test and fix missing builder * remove the debug message and fix the doc * support EscapedStringLiteral * add create external table tests * refactor the error assertion * add issue reference

Commit:85adb6c
Author:Piotr Findeisen
Committer:GitHub

Remove Sort expression (`Expr::Sort`) (#12177) * Take Sort (SortExpr) in file options Part of effort to remove `Expr::Sort`. * Return Sort from Expr.Sort Part of effort to remove `Expr::Sort`. * Accept Sort (SortExpr) in `LogicalPlanBuilder.sort` Take `expr::Sort` in `LogicalPlanBuilder.sort`. Accept any `Expr` in new function, `LogicalPlanBuilder.sort_by` which apply default sort ordering. Part of effort to remove `Expr::Sort`. * Operate on `Sort` in to_substrait_sort_field / from_substrait_sorts Part of effort to remove `Expr::Sort`. * Take Sort (SortExpr) in tests' TopKPlanNode Part of effort to remove `Expr::Sort`. * Remove Sort expression (`Expr::Sort`) Remove sort as an expression, i.e. remove `Expr::Sort` from `Expr` enum. Use `expr::Sort` directly when sorting. The sort expression was used in context of ordering (sort, topk, create table, file sorting). Those places require their sort expression to be of type Sort anyway and no other expression was allowed, so this change improves static typing. Sort as an expression was illegal in other contexts. * use assert_eq just like in LogicalPlan.with_new_exprs * avoid clone in replace_sort_expressions * reduce cloning in EliminateDuplicatedExpr * restore SortExprWrapper this commit is longer than advised in the review comment, but after squashing the diff will be smaller * shorthand SortExprWrapper struct definition

Commit:4c3b744
Author:Huaijin
Committer:GitHub

fix: ser/der fetch in CoalesceBatchesExec (#12107)

Commit:bd2d4ee
Author:jcsherin
Committer:GitHub

Convert built-in `row_number` to user-defined window function (#12030) * Adds new crate for window functions * Moves `row_number` to window functions crate * Fixes build errors * Regenerates protobuf * Makes `row_number` no-op temporarily * Minor: fixes formatting * Implements `WindowUDF` for `row_number` * Minor: fixes formatting * Adds singleton instance of UDWF: `row_number` * Adds partition evaluator * Registers default window functions * Implements `evaluate_all` * Fixes: allow non-uppercase globals * Minor: prefix underscore for unused variable * Minor: fixes formatting * Uses `row_number_udwf` * Fixes: unparser test for `row_number` * Uses row number to represent functional dependency * Minor: fixes formatting * Removes `row_number` from case-insensitive name test * Deletes wrapper for `row_number` window expression * Fixes: lowercase name in error statement * Fixes: `row_number` fields are not nullable * Fixes: lowercase name in explain output * Updates Cargo.lock * Fixes: lowercase name in explain output * Adds support for result ordering * Minor: add newline between methods * Fixes: re-export crate name in doc comments * Adds doc comment for `WindowUDFImpl::nullable` * Minor: renames variable * Minor: update doc comments * Deletes code * Minor: update doc comments * Minor: adds period * Adds doc comment for `row_number` window UDF * Adds fluent API for creating `row_number` expression * Minor: removes unnecessary path prefix * Adds roundtrip logical plan test case * Updates unit tests for `row_number` * Deletes code * Minor: copy edit doc comments * Minor: deletes comment * Minor: copy edits udwf doc comments

Commit:f4e519f
Author:Edmondo Porcu
Committer:GitHub

Move min and max to user defined aggregate function, remove `AggregateFunction` / `AggregateFunctionDefinition::BuiltIn` (#11013) * Moving min and max to new API and removing from protobuf * Using input_type rather than data_type * Adding type coercion * Fixed doctests * Implementing feedback from code review * Implementing feedback from code review * Fixed wrong name * Fixing name

Commit:fa50636
Author:Lordworms
Committer:GitHub

Implement physical plan serialization for parquet Copy plans (#11735) * Implement physical plan serialization for parquet Copy plans * fix clippy

Commit:a591301
Author:Andrew Lamb
Committer:GitHub

Merge `string-view2` branch: reading from parquet up to 2x faster for some ClickBench queries (not on by default) (#11667) * Pin to pre-release version of arrow 52.2.0 * Update for deprecated method * Add a config to force using string view in benchmark (#11514) * add a knob to force string view in benchmark * fix sql logic test * update doc * fix ci * fix ci only test * Update benchmarks/src/util/options.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/common/src/config.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * update tests --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Add String view helper functions (#11517) * add functions * add tests for hash util * Add ArrowBytesViewMap and ArrowBytesViewSet (#11515) * Update `string-view` branch to arrow-rs main (#10966) * Pin to arrow main * Fix clippy with latest arrow * Uncomment test that needs new arrow-rs to work * Update datafusion-cli Cargo.lock * Update Cargo.lock * tapelo * merge * update cast * consistent dep * fix ci * add more tests * make doc happy * update new implementation * fix bug * avoid unused dep * update dep * update * fix cargo check * update doc * pick up the comments change again --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Enable `GroupValueBytesView` for aggregation with StringView types (#11519) * add functions * Update `string-view` branch to arrow-rs main (#10966) * Pin to arrow main * Fix clippy with latest arrow * Uncomment test that needs new arrow-rs to work * Update datafusion-cli Cargo.lock * Update Cargo.lock * tapelo * merge * update cast * consistent dep * fix ci * avoid unused dep * update dep * update * fix cargo check * better group value view aggregation * update --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Initial support for regex_replace on `StringViewArray` (#11556) * initial support for string view regex * update tests * Add support for Utf8View for date/temporal 
codepaths (#11518) * Add StringView support for date_part and make_date funcs * run cargo update in datafusion-cli * cargo fmt --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * GC `StringViewArray` in `CoalesceBatchesStream` (#11587) * gc string view when appropriate * make clippy happy * address comments * make doc happy * update style * Add comments and tests for gc_string_view_batch * better herustic * update test * Update datafusion/physical-plan/src/coalesce_batches.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * [Bug] fix bug in return type inference of `utf8_to_int_type` (#11662) * fix bug in return type inference * update doc * add tests --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Fix clippy * Increase ByteViewMap block size to 2MB (#11674) * better default block size * fix related test * Change `--string-view` to only apply to parquet formats (#11663) * use inferenced schema, don't load schema again * move config to parquet-only * update * update * better format * format * update * Implement native support StringView for character length (#11676) * native support for character length * Update datafusion/functions/src/unicode/character_length.rs --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Remove uneeded patches * cargo fmt --------- Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com> Co-authored-by: Xiangpeng Hao <me@haoxp.xyz> Co-authored-by: Andrew Duffy <a10y@users.noreply.github.com>

Commit:c50fd88
Author:Andrew Lamb
Committer:GitHub

Rename `ColumnOptions` to `ParquetColumnOptions` (#11512) * Rename `ColumnOptions` to `ParquetColumnOptions` * Update error message

Commit:3cfb99d
Author:Lordworms
Committer:GitHub

Implement physical plan serialization for json Copy plans (#11645)

Commit:fc8e7b9
Author:Jay Zhan
Committer:GitHub

Remove ArrayAgg Builtin in favor of UDF (#11611) * rm def Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rewrite test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

Commit:77311a5
Author:Leonardo Yvens
Committer:GitHub

support Decimal256 type in datafusion-proto (#11606)

Commit:2587df0
Author:Chris Connelly
Committer:GitHub

Support `newlines_in_values` CSV option (#11533) * feat!: support `newlines_in_values` CSV option This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values. By default, large CSV files will be repartitioned into multiple parallel range scans. This is great for performance in the common case but when large CSVs contain newlines in values the parallel scan will fail due to splitting on newlines within quotes rather than actual line terminators. With the current implementation, this behaviour can be controlled by the session-level `datafusion.optimizer.repartition_file_scans` and `datafusion.optimizer.repartition_file_min_size` settings. This commit introduces a `newlines_in_values` option to `CsvOptions` and plumbs it through to `CsvExec`, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis. BREAKING CHANGE: This adds new public fields to types with all public fields, which is a breaking change. * docs: normalise `newlines_in_values` documentation * test: add/fix sqllogictests for `newlines_in_values` * docs: document `datafusion.catalog.newlines_in_values` * fix: typo in config.md * chore: suppress lint on too many arguments for `CsvExec::new` * fix: always checkout `*.slt` with LF line endings This is a bit of a stab in the dark, but it might fix multiline tests on Windows. * fix: always checkout `newlines_in_values.csv` with `LF` line endings The default git behaviour of converting line endings for checked out files causes the `csv_files.slt` test to fail when testing `newlines_in_values`. This appears to be due to the quoted newlines being converted to CRLF, which are not then normalised when the CSV is read. Assuming that the sqllogictests do normalise line endings in the expected output, this could then lead to a "spurious" diff from the actual output. 
--------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
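
The failure mode behind `newlines_in_values` can be shown with a small sketch. It only illustrates why a raw newline is not always a record terminator; it is not how DataFusion reads CSV, and `split_records` is a hypothetical helper. Per the commit message, the actual option disables parallel range scans for the affected file rather than changing how lines are split.

```rust
/// Hypothetical helper: splits CSV text into records, treating a newline as
/// a terminator only when it appears outside double quotes.
fn split_records(input: &str) -> Vec<String> {
    let mut records = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    for ch in input.chars() {
        match ch {
            // A doubled quote ("") toggles twice, so the state stays correct.
            '"' => {
                in_quotes = !in_quotes;
                current.push(ch);
            }
            // Only an unquoted newline ends the current record.
            '\n' if !in_quotes => records.push(std::mem::take(&mut current)),
            _ => current.push(ch),
        }
    }
    if !current.is_empty() {
        records.push(current);
    }
    records
}
```

For the input `"a,\"x\ny\"\nb,c\n"`, a naive split on newlines sees three lines, while the quote-aware split sees two records. A range scan that starts in the middle of a file cannot know whether it is inside quotes, which is why such files need to be read without repartitioning.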

Commit:0232699
Author:Georgi Krastev
Committer:GitHub

Add extension hooks for encoding and decoding UDAFs and UDWFs (#11417) * Add extension hooks for encoding and decoding UDAFs and UDWFs * Add tests for encoding and decoding UDAF

Commit:f5d88d1
Author:张林伟
Committer:GitHub

Support serialization/deserialization for custom physical exprs in proto (#11387) * Add PhysicalExtensionExprNode * regen proto * Add ser/de extension expr logic * Add test and fix clippy lint

Commit:4ac1428
Author:jcsherin
Committer:GitHub

Convert `nth_value` to UDAF (#11287) * Copies `NthValueAccumulator` to `functions-aggregate` * Partial implementation of `AggregateUDFImpl` Pending methods are: - `accumulator` - `state_fields` - `reverse_expr` * Implements `accumulator` method * Retains existing comments verbatim * Removes unnecessary path prefix * Implements `reverse_expr` method * Adds `nullable` field to `NthValue` * Revert to existing name * Implements `state_fields` method * Removes `nth_value` from `physical-expr` * Adds default * Exports `nth_value` * Fixes build error in physical plan roundtrip test * Minor: formatting * Parses `N` from input expression * Fixes build error by using `nth_value_udaf` * Fixes `reverse_expr` by passing correct `N` * Update plan with lowercase UDF name * Updates error message for incorrect no. of arguments This error message is manually formatted to remain consistent with existing error statements. It is not formatted by running: ``` cargo test -p datafusion-sqllogictest --test sqllogictests errors -- --complete ``` * Fixes nullable "item" in `state_fields` * Minor: fix formatting after resolving conflicts * Updates multiple existing plans with lowercase name * Implements `retract_batch` for window aggregations * Fixes: regex mismatch for error message in CI * Revert "Updates multiple existing plans with lowercase name" This reverts commit 1913efda49e585816286b54b371d4166ac894d1f. * Revert "Implements `retract_batch` for window aggregations" This reverts commit 4bb204f6ec8028c4e3313db5af3fabfcdaf7fea8. * Fixes: use builtin window function instead of udaf * Revert "Updates error message for incorrect no. of arguments" This reverts commit fa61ce62dcae6eae6f8e9c9900ebf8cff5023bc0. 
* Refactor: renames field and method * Removes hack for nullability * Minor: refactors `reverse_expr` * Minor: removes unncessary path prefix * Minor: cleanup arguments for creating aggregate expr * Refactor: extracts `merge_ordered_arrays` to `physical-expr-common` * Minor: adds todo for configuring nullability * Retrigger CI --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:c049a94
Author:Jax Liu
Committer:GitHub

Implement ScalarValue::Map (#11224) * tmp * introduce ScalarValue::Map * add display test * cargo fmt * address comments and enhance tests

Commit:4bc3228
Author:kamille
Committer:GitHub

Convert grouping to udaf (#11147) * define Grouping udf and impl AggregateUDFImpl for it. * add `grouping` to default list. * remove the old grouping related codes. * continue to remove codes. * regen pbs in proto. * remove built-in grouping in proto codes. * fix sql it. * Add test + export fn --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:90145df
Author:Andrew Lamb
Committer:GitHub

Optionally display schema in explain plan (#11177)

Commit:330ece8
Author:Hector Veiga
Committer:GitHub

feat: Conditionally allow to keep partition_by columns when using PARTITIONED BY enhancement (#11107) * feat: conditionally allow to keep partition_by columns * feat: add flag to file sink config, add tests * this commit contains: - separate options by prefix 'hive.' - add hive_options to CopyTo struct - add more documentation - add session execution flag to enable feature, false by default * do not add hive_options to CopyTo * npx prettier * fmt * change prefix to execution. , update override order for condition. * improve handling of flag, added test for config error * trying to make CI happier * prettier * Update test * update doc --------- Co-authored-by: Héctor Veiga Ortiz <hveigaortiz@tesla.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:5501e8e
Author:Devin D'Angelo
Committer:GitHub

Support COPY TO Externally Defined File Formats, add FileType trait (#11060) * wip create and register ext file types with session * Add contains function, and support in datafusion substrait consumer (#10879) * adding new function contains * adding substrait test * adding doc * adding doc * Update docs/source/user-guide/sql/scalar_functions.md Co-authored-by: Alex Huang <huangweijun1001@gmail.com> * adding entry --------- Co-authored-by: Alex Huang <huangweijun1001@gmail.com> * logical planning updated * compiling * removing filetype enum * compiling * working on tests * fix some tests * test fixes * cli fix * cli fmt * Update datafusion/core/src/datasource/file_format/mod.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/core/src/execution/session_state.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * review comments * review comments * review comments * typo fix * fmt * fix err log style * fmt --------- Co-authored-by: Lordworms <48054792+Lordworms@users.noreply.github.com> Co-authored-by: Alex Huang <huangweijun1001@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:a202a01
Author:Heran Lin
Committer:GitHub

Change wildcard qualifier type from `String` to `TableReference` (#11073) * Change wildcard qualifier type * Update protobuf * Minor update

Commit:d32747d
Author:Kevin Su
Committer:GitHub

Convert Correlation to UDAF (#11064) * init Signed-off-by: Kevin Su <pingsutw@apache.org> * test Signed-off-by: Kevin Su <pingsutw@apache.org> * test Signed-off-by: Kevin Su <pingsutw@apache.org> * test Signed-off-by: Kevin Su <pingsutw@apache.org> * remove files Signed-off-by: Kevin Su <pingsutw@apache.org> --------- Signed-off-by: Kevin Su <pingsutw@apache.org> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:fdd1e3d
Author:Dharan Aditya
Committer:GitHub

Convert Average to UDAF #10942 (#10964) * add avg udaf * remove avg from expr * add test stub * migrate avg udaf * change avg udaf signature remove avg phy expr * fix tests * fix state_fields fn * fix ut in phy-plan aggr * refactor Average to Avg * refactor Average to Avg * fix type coercion tests * fix example and logic tests * fix py expr failing ut * update docs * fix failing tests * formatting examples * remove duplicate code and fix uts * addressing PR comments * add ut for logical avg window * fix physical plan roundtrip_window test case

Commit:08e4e6a
Author:Sava Vranešević
Committer:GitHub

Fix `FormatOptions::CSV` propagation (#10912) * Fix sink output schema being passed in to `FileSinkExec` where input schema was expected * Propagate CSV options (quote, double quote, and escape) through protos * Add test for double quotes * Test quote escape when double quotes are disabled * regen --------- Co-authored-by: svranesevic <svranesevic@users.noreply.github.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:89def2c
Author:jcsherin
Committer:GitHub

Convert `bool_and` & `bool_or` to UDAF (#11009) * Port `bool_and` and `bool_or` to `AggregateUDFImpl` * Remove trait methods with default implementation * Add `bool_or_udaf` * Register `bool_and` and `bool_or` * Remove from `physical-expr` * Add expressions to logical plan roundtrip test * minor: remove methods with default implementation * Removes redundant tests * Removes hard-coded function names

Commit:a873f51
Author:张林伟
Committer:GitHub

Convert `StringAgg` to UDAF (#10945) * Convert StringAgg to UDAF * generate proto code * Fix bug * Fix * Add license * Add doc * Fix clippy * Remove aliases field * Add StringAgg proto test * Add roundtrip_expr_api test

Commit:f373a86
Author:Xiangpeng Hao
Committer:GitHub

Add initial support for Utf8View and BinaryView types (#10925) * add view types * Add slt tests * comment out failing test * update vendored code --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:2daadb7
Author:Dharan Aditya
Committer:GitHub

Convert BitAnd, BitOr, BitXor to UDAF (#10930) * remove bit and or xor from expr * remove bit and or xor from physical expr and proto * add proto regen changes * impl BitAnd, BitOr, BitXor UADF * add support for float * removing support for float * refactor helper macros * clippy'fy * simplify Bitwise operation * add documentation * formatting * fix lint issue * remove XorDistinct * update roundtrip_expr_api test * linting * support groups accumulator

Commit:c884bdb
Author:Jax Liu
Committer:GitHub

Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF (#10917) * pass logical expr of arguments for udaf * implement approx_percentile_cont udaf * register udaf * remove ApproxPercentileCont * convert with_weight to udaf and remove original * fix conflict * fix compile check * fix doc and testing * evaluate args through physical plan * public use Literal * fix tests * rollback the experimental tests * remove unused import * rename args and inline code * remove unnecessary partial eq trait * fix error message

Commit:cc60278
Author:Emil Ejbyfeldt
Committer:GitHub

Move Regr_* functions to use UDAF (#10898) * Move Regr_* functions to use UDAF Closes #10883 and is part of #8708 * Format and regen * tweak error check --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:b627ca3
Author:Jay Zhan
Committer:GitHub

Remove builtin count (#10893) * rm expr fn Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rm function Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix query and fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix example Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * Update datafusion/expr/src/test/function_stub.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:8f718dd
Author:Jay Zhan
Committer:GitHub

Move `Count` to `functions-aggregate`, update MSRV to rust 1.75 (#10484) * mv accumulate indices Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * complete udaf Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * register Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix expr Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * filter distinct count Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * todo: need to move count distinct too Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * move code around Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * move distinct to aggr-crate Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * replace Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * backup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix function name and physical expr Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix physical optimizer Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix all slt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix with args Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add label Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * revert builtin related code back Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix substrait Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix doc Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmy Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix udaf macro for distinct but not apply Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix count distinct and use workspace Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add reverse Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * remove old code Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * backup Signed-off-by: 
jayzhan211 <jayzhan211@gmail.com> * use macro Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * expr builder Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * introduce expr builder Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add example Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * clean agg sta Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * combine agg Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * limit distinct and fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup name Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix ci Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix window Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix ci Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix merged Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix rebase Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * use std Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * update mrsv Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * upd msrv Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * revert test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * downgrade to 1.75 Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * 1.76 Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * ahas Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * revert to 1.75 Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rm count Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix merge Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * clippy Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rm sum in test_no_duplicate_name Signed-off-by: jayzhan211 
<jayzhan211@gmail.com> * fix Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

Commit:11e143c
Author:Lordworms
Committer:GitHub

Convert approx_distinct to UDAF (#10851) * Convert approx_distinct to UDAF proto style fix prost fix clippy fix doc fix clippy * refactor code

Commit:5912025
Author:Benjamin Bannier
Committer:GitHub

Add support for reading CSV files with comments (#10467) This patch adds support for parsing CSV files containing comment lines. Closes #10262.

Commit:3773fb7
Author:Jax Liu
Committer:GitHub

Convert `approx_median` to UDAF (#10840) * move tdigest to physical-expr-common * move approx_percentile_cont_accumulator to function-aggregate * implement approx_median udaf * remove approx_median aggregation function * fix sqllogictests * add removed type tests * cargo fmt and clippy * add logical roundtrip test * fix dataframe test * fix test and proto gen * update lock in datafusion-cli * fix typo * fix test and doc * fix sql_integration * cargo fmt * follow the checking style like other udaf * add comment and modified dependency * update lock and fmt * add missing test annotation

Commit:e8fdc09
Author:Matt Nawara
Committer:GitHub

Convert `VariancePopulation` to UDAF (#10836)

Commit:8b1f06b
Author:Jax Liu
Committer:GitHub

Convert `stddev` and `stddev_pop` to UDAF (#10834) * add stddev and stddev_pop udaf * remove aggregation function stddev and stddev_pop * register func and modified return type * cargo fmt * regen proto * cargo clippy * fix window function support * cargo fmt * throw not_impl_err instead * use default sliding accumulator

Commit:6b70214
Author:Jay Zhan
Committer:GitHub

Remove Built-in sum and Rename to lowercase `sum` (#10831) * rm sum Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * mv stub to df:expr Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix sql example Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * lowercase in slt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rename to lowercase Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rename stl in tpch Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

Commit:cb9068c
Author:Ruihang Xia
Committer:GitHub

build(deps): update Arrow/Parquet to `52.0`, object-store to `0.10` (#10765) * fix compile on default feature config Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix test of common, functions, optimizer and physical-expr Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix other tests Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix one last test Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix clippy warnings Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix datafusion-cli Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * switch to git deps Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * regen proto file Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix pyo3 feature Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix slt Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix symmetric hash join cases Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * update integration result Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * fix up spill test Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * shift to the released packages Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * Update cargo.lock * Update datafusion/optimizer/src/analyzer/type_coercion.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * update document Signed-off-by: Ruihang Xia <waynestxia@gmail.com> * move memory limit to parameter pos Signed-off-by: Ruihang Xia <waynestxia@gmail.com> --------- Signed-off-by: Ruihang Xia <waynestxia@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:70744d5
Author:Yue Yin
Committer:GitHub

Convert variance sample to udaf (#10713) * Without migrating tests * Should fail VAR(DISTINCT) but doesn't * Pass all other tests. * Return error for var(distinct) * Migrate tests * Fix tests * Lint * Fix tests * Fix use

Commit:826331e
Author:张林伟
Committer:GitHub

Cleanup GetIndexedField (#10769) * Cleanup GetIndexedField * Generate pb --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:888504a
Author:Jay Zhan
Committer:GitHub

Introduce Sum UDAF (#10651) * move accumulate Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * move prim_op Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * move test to slt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * remove sum distinct Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * move sum aggregate Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix args Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add sum Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * merge fix Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix sum sig Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * todo: wait ahash merge Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rebase Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * disable ordering req by default Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * check arg count Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rm old workflow Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix failed test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * doc and fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * check udaf first Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix ci Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix ci Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix ci Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix err msg AGAIN Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rm sum in builtin test which covered in sql Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * proto for window with udaf Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix slt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix err msg Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix exprfn Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix ciy Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix ci 
Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rename first/last to lowercase Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * skip sum Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix firstvalue Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * clippy Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add doc Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rm has_ordering_req Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * default hard req Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * insensitive for sum Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup duplicate code Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * Re-introduce check --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com> Co-authored-by: Mustafa Akur <mustafa.akur@synnada.ai>

Commit:5bd8751
Author:张林伟
Committer:GitHub

Separate proto partitioning (#10708)

Commit:70a215b
Author:Kun Liu
Committer:GitHub

support limit in agg exec for ser/deser (#10692)

Commit:5a9712e
Author:Andrey Koshchiy
Committer:GitHub

minor: unnest protobuf serde support (#10681) Signed-off-by: Andrey Koshchiy <an.koshchiy@gmail.com>

Commit:3194a98
Author:Mustafa Akur
Committer:GitHub

Factor out common datafusion types into another proto file (#10649) * Initial commit * Minor changes * Minor changes * Minor changes * Minor changes * Move Column type * Minor changes buggy * Compiles * Add new types to common * Move scalarvalue * Move new types to common * Move additional fields * Move common types * Minor changes * Change proto definition order * Simplifications * Minor changes * REimport protobuf_common from proto * Address reviews

Commit:549cf84
Author:Mustafa Akur
Committer:GitHub

Convert first, last aggregate function to UDAF (#10648) * move out the ordering ruel Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * introduce rule Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * revert test result Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * pass mulit order test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * with new childes Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * revert slt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * revert back Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rm rewrite in new child Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * backup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * only move conversion to optimizer Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * find test that do reverse Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add test for first and last Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * pass all test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * upd test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * upd test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add aggregate test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * final draft Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup again Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * pull out finer ordering code and reuse Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * clippy Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * remove finer in optimize rule Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add comments and clenaup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rename fun Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rename fun Signed-off-by: jayzhan211 
<jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * avoid unnecessary recursion and rename Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * Minor changes * Add new API for aggregate optimization * Minor changes * Minor changes * Remove old code * Minor changes * Minor changes * Minor changes * Minor changes * Minor changes * Minor changes * Update comments * Minor changes * Minor changes * Review Part 1 * TMP * Update display of aggregate fun exprs * TMP * TMP * Update tests * TMP buggy * modify name in place * Minor changes * Tmp * Tmp * Tmp * TMP * Simplifications * Tmp * Tmp * Compiles * Resolve linter errors * Resolve linter errors * Minor changes * Simplifications * Minor chagnes * Move cast to common * Minor changes * Fix test * Minor changes * Simplifications * Review * Address reviews * Address reviews * Update documentation, rename method * Minor changes --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com> Co-authored-by: jayzhan211 <jayzhan211@gmail.com> Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai> Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>

Commit:26b44f4
Author:Jay Zhan
Committer:GitHub

Move Median to `functions-aggregate` and Introduce Numeric signature (#10644) * introduce median udaf Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rm agg median Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rm old median Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * introduce numeric signature Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * address comment Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fix doc Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add proto roundtrip Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

Commit:f807947
Author:张林伟
Committer:GitHub

Migrate testing optimizer rules to use `rewrite` API (#10576) * Migrate testing optimizer rules * Remove error * fmt

Commit:3ebc31d
Author:Jay Zhan
Committer:GitHub

Remove `Expr::GetIndexedField`, replace `Expr::{field,index,range}` with `FieldAccessor`, `IndexAccessor`, and `SliceAccessor` (#10568) * remove expr Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add expr extension Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * doc Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * move test that has struct Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add foc and fix displayed name Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rm test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rebase Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * move doc Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

Commit:e7858ff
Author:Dan Harris
Committer:GitHub

Handle dictionary values in ScalarValue serde (#10563) * Handle dictionary values in ScalarValue serde * Do not panic on failed physical expr decoding (#241) * revert clippy change

Commit:58cc4e1
Author:Berkay Şahin
Committer:GitHub

Make `CREATE EXTERNAL TABLE` format options consistent, remove special syntax for `HEADER ROW`, `DELIMITER` and `COMPRESSION` (#10404) * Simplify format options * Keep PG copy from tests same * Update datafusion/common/src/config.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/core/src/datasource/file_format/csv.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Remove WITH HEADER ROW * Review Part 1 * . * Fix failing tests * Revert "Fix failing tests" This reverts commit 9d816017f2c11d0197c35e4d8a98c249840a6f96. * Final commit * Minor * Review * Update avro.slt * Apply suggestions * Fix imports --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>

Commit:9f0e016
Author:Junhao Liu
Committer:GitHub

Move Covariance (Population) covar_pop to be a User Defined Aggregate Function (#10418) * move covariance * add sqllogictest

Commit:a0fccbf
Author:Jay Zhan
Committer:GitHub

Move `Covariance` (Sample) `covar` / `covar_samp` to be a User Defined Aggregate Function (#10372) * introduce CovarianceSample Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rewrite macro Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rm old statstype Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * register Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * state field Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rm builtin Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * address comments Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

Commit:6d77748
Author:张林伟
Committer:GitHub

Split parquet bloom filter config and enable bloom filter on read by default (#10306) * Split bloom filter config * Fix proto * Set bloom_filter on write as false * Fix tests * fmt md * Fix test * Fix slt tests * clippy * Update datafusion/common/src/config.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/common/src/config.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Remove enabled suffix * Regen proto and fix tests * Update configs.md * Improve bloom filter test --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Commit:7c1c794
Author:Matthew Cramerus
Committer:GitHub

feat: Determine ordering of file groups (#9593) * add statistics to PartitionedFile * just dump work for now * working test case * fix jumbled rebase * forgot to annotate #[test] * more refactoring * add a link * refactor again * whitespace * format debug log * remove useless itertools * refactor test * fix bug * use sort_file_groups in ListingTable * move check into a better place * refactor test a bit * more testing * more testing * better error message * fix log msg * fix again * add sqllogictest and fixes * fix test * Update datafusion/core/src/datasource/listing/mod.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion/core/src/datasource/physical_plan/file_scan_config.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * more unit tests * rename to split_groups_by_statistics * only use groups if there's <= target_partitions * refactor a bit, no need for projected_schema * fix reverse order * save work for now * lots of test cases in new slt * remove output check * fix * fix last test * comment on params * clippy * revert parquet.slt * no need to pass projection separately * Update datafusion/core/src/datasource/listing/mod.rs Co-authored-by: Nga Tran <nga-tran@live.com> * update comment on in * fix test? * un-fix? * add fix back in? * move indices_sorted_by_min to MinMaxStatistics * move MinMaxStatistics to its own module * fix license * add feature flag * update config --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Nga Tran <nga-tran@live.com>

Commit:b41ef20
Author:Andrew Lamb
Committer:GitHub

Minor: Add comments for removed protobuf nodes (#10252) * Minor: Add comments for removed protobuf nodes * Apply suggestions from code review Co-authored-by: Alex Huang <huangweijun1001@gmail.com> * regenerate proto and move location --------- Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

Commit:767bbca
Author:Dmitry Bugakov
Committer:GitHub

[MINOR] Remove ScalarFunction from datafusion.proto #10173 (#10202)

Commit:77f43c5
Author:Bruce Ritchie
Committer:GitHub

Move coalesce to datafusion-functions and remove BuiltInScalarFunction (#10098)

Commit:8730466
Author:Bruce Ritchie
Committer:GitHub

Move concat, concat_ws, ends_with, initcap to datafusion-functions (#10089)