The Protocol Buffers files changed in the following 68 commits:
Commit: 25e39ab
Author: Andy Grove
Committer: GitHub
perf: Add performance tracing capability (#1706)
The documentation is generated from this commit.
Commit: a93d972
Author: Andy Grove
Committer: GitHub
chore: Remove fast encoding option (#1703)
Commit: a31ece9
Author: Kazantsev Maksim
Committer: GitHub
Chore: simplify array related functions impl (#1490)

## Which issue does this PR close?
Related to issue: https://github.com/apache/datafusion-comet/issues/1459

## Rationale for this change
Defined under Issue: https://github.com/apache/datafusion-comet/issues/1459

## What changes are included in this PR?
In functions related to arrays, scalarExprToProtoWithReturnType or scalarExprToProto is used instead of creating a separate proto for each function.

## How are these changes tested?
Regression with available unit tests
Commit: 1160914
Author: Zhen Wang
Committer: GitHub
feat: Support IntegralDivide function (#1428)

## Which issue does this PR close?
Closes #1422.

## Rationale for this change
Support the IntegralDivide function.

## What changes are included in this PR?
Since DataFusion's div operator conforms to the logic of integral division, we only need to convert `IntegralDivide(...)` to `Cast(Divide(...), LongType)` and then convert it to native.

## How are these changes tested?
Added unit test
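The rewrite described above can be sketched in plain Rust (a toy model under assumed semantics, not the project's actual serde code): integer division already truncates toward zero, so casting its result to a 64-bit integer reproduces Spark's `div` behavior, with NULL on division by zero.

```rust
// Toy model of the IntegralDivide(...) -> Cast(Divide(...), LongType)
// rewrite: integer `/` (like DataFusion's Divide on integer types)
// already truncates toward zero, matching Spark's `div`.
// Overflow edge cases (e.g. i32::MIN / -1) are ignored in this sketch.
fn integral_divide(a: i32, b: i32) -> Option<i64> {
    if b == 0 {
        None // Spark's div returns NULL on division by zero
    } else {
        Some((a / b) as i64) // the Cast(..., LongType) step
    }
}

fn main() {
    assert_eq!(integral_divide(7, 2), Some(3));
    assert_eq!(integral_divide(-7, 2), Some(-3)); // truncates toward zero
    assert_eq!(integral_divide(7, 0), None);
    println!("ok");
}
```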
Commit: 527cb57
Author: Matt Butrovich
Committer: GitHub
perf: Use DataFusion FilterExec for experimental native scans (#1395)

* Add boolean field to Filter's proto, set based on Comet native scan implementation. Planner uses that field to construct the correct FilterExec implementation. CometFilterExec does a deep copy of the batch due to logic in Comet Scan, while DF FilterExec can do a shallow copy because native Scans do not reuse batch buffers.
* Refactor to reduce duplicate code.
* Fix native test.
* Address nit.
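The deep-vs-shallow distinction in the commit above can be illustrated with a small Rust sketch (the names and types here are illustrative, not Comet's actual ones): a shallow copy just bumps a reference count and is only safe when the producer never reuses the underlying buffer, while a deep copy duplicates the data.

```rust
use std::sync::Arc;

fn main() {
    // Stand-in for a batch buffer (illustrative only).
    let batch: Arc<Vec<i32>> = Arc::new(vec![1, 2, 3]);

    // Shallow copy: shares the same allocation. Safe only if the
    // producer (the scan) does not reuse the buffer afterwards.
    let shallow = Arc::clone(&batch);
    assert!(Arc::ptr_eq(&batch, &shallow));

    // Deep copy: duplicates the data, so a buffer-reusing producer
    // cannot corrupt it later. This is the cost a deep-copying
    // filter pays for safety.
    let deep: Vec<i32> = batch.as_ref().clone();
    assert_eq!(deep, *batch);
    println!("ok");
}
```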
Commit: a1e6a39
Author: Zhen Wang
Committer: GitHub
perf: improve performance of update metrics (#1329)
Commit: 07274e8
Author: Eren Avsarogullari
Committer: GitHub
Support arrays_overlap function (#1312)
Commit: 9a4e5b5
Author: Eren Avsarogullari
Committer: GitHub
Feat: Support array_join function (#1290)
Commit: 824ad1a
Author: Eren Avsarogullari
Committer: GitHub
Feat: Support array_intersect function (#1271)

* Feat: Support array_intersect
* Address review comment
Commit: c3a552f
Author: Andy Grove
Committer: GitHub
chore: merge comet-parquet-exec branch into main (#1318)
Commit: 32b9338
Author: Parth Chandra
Committer: GitHub
chore: Comet parquet exec merge from main(20250114) (#1293)

* feat: support array_append (#1072)
* feat: support array_append
* formatted code
* rewrite array_append plan to match spark behaviour and fixed bug in QueryPlan serde
* remove unwrap
* Fix for Spark 3.3
* refactor array_append binary expression serde code
* Disabled array_append test for spark 4.0+
* chore: Simplify CometShuffleMemoryAllocator to use Spark unified memory allocator (#1063)
* docs: Update benchmarking.md (#1085)
* feat: Require offHeap memory to be enabled (always use unified memory) (#1062)
* Require offHeap memory
* remove unused import
* use off heap memory in stability tests
* reorder imports
* test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE config (#1087)
* Add changelog for 0.4.0 (#1089)
* chore: Prepare for 0.5.0 development (#1090)
* Update version number for build
* update docs
* build: Skip installation of spark-integration and fuzz testing modules (#1091)
* Add hint for finding the GPG key to use when publishing to maven (#1093)
* docs: Update documentation for 0.4.0 release (#1096)
* update TPC-H results
* update Maven links
* update benchmarking guide and add TPC-DS results
* include q72
* fix: Unsigned type related bugs (#1095) ## Which issue does this PR close? Closes https://github.com/apache/datafusion-comet/issues/1067 ## Rationale for this change Bug fix. A few expressions were failing some unsigned type related tests ## What changes are included in this PR? - For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits - `u64` becomes `Decimal(20, 0)` but there was a bug in `round()` (`>` vs `>=`) ## How are these changes tested? Put back tests for unsigned types
* chore: Include first ScanExec batch in metrics (#1105)
* include first batch in ScanExec metrics
* record row count metric
* fix regression
* chore: Improve CometScan metrics (#1100)
* Add native metrics for plan creation
* make messages consistent
* Include get_next_batch cost in metrics
* formatting
* fix double count of rows
* chore: Add custom metric for native shuffle fetching batches from JVM (#1108)
* feat: support array_insert (#1073)
* Part of the implementation of array_insert
* Missing methods
* Working version
* Reformat code
* Fix code-style
* Add comments about spark's implementation.
* Implement negative indices + fix tests for spark < 3.4
* Fix code-style
* Fix scalastyle
* Fix tests for spark < 3.4
* Fixes & tests - added test for the negative index - added test for the legacy spark mode
* Use assume(isSpark34Plus) in tests
* Test else-branch & improve coverage
* Update native/spark-expr/src/list.rs Co-authored-by: Andy Grove <agrove@apache.org>
* Fix fallback test In one case there is a zero in index and test fails due to spark error
* Adjust the behaviour for the NULL case to Spark
* Move the logic of type checking to the method
* Fix code-style --------- Co-authored-by: Andy Grove <agrove@apache.org>
* feat: enable decimal to decimal cast of different precision and scale (#1086)
* enable decimal to decimal cast of different precision and scale
* add more test cases for negative scale and higher precision
* add check for compatibility for decimal to decimal
* fix code style
* Update spark/src/main/scala/org/apache/comet/expressions/CometCast.scala Co-authored-by: Andy Grove <agrove@apache.org>
* fix the nit in comment --------- Co-authored-by: himadripal <hpal@apple.com> Co-authored-by: Andy Grove <agrove@apache.org>
* docs: fix readme FGPA/FPGA typo (#1117)
* fix: Use RDD partition index (#1112)
* fix: Use RDD partition index
* fix
* fix
* fix
* fix: Various metrics bug fixes and improvements (#1111)
* fix: Don't create CometScanExec for subclasses of ParquetFileFormat (#1129)
* Use exact class comparison for parquet scan
* Add test
* Add comment
* fix: Fix metrics regressions (#1132)
* fix metrics issues
* clippy
* update tests
* docs: Add more technical detail and new diagram to Comet plugin overview (#1119)
* Add more technical detail and new diagram to Comet plugin overview
* update diagram
* add info on Arrow IPC
* update diagram
* update diagram
* update docs
* address feedback
* Stop passing Java config map into native createPlan (#1101)
* feat: Improve ScanExec native metrics (#1133)
* save
* remove shuffle jvm metric and update tuning guide
* docs
* add source for all ScanExecs
* address feedback
* address feedback
* chore: Remove unused StringView struct (#1143)
* Remove unused StringView struct
* remove more dead code
* docs: Add some documentation explaining how shuffle works (#1148)
* add some notes on shuffle
* reads
* improve docs
* test: enable more Spark 4.0 tests (#1145) ## Which issue does this PR close? Part of https://github.com/apache/datafusion-comet/issues/372 and https://github.com/apache/datafusion-comet/issues/551 ## Rationale for this change To be ready for Spark 4.0 ## What changes are included in this PR? This PR enables more Spark 4.0 tests that were fixed by recent changes ## How are these changes tested? tests enabled
* chore: Refactor cast to use SparkCastOptions param (#1146)
* Refactor cast to use SparkCastOptions param
* update tests
* update benches
* update benches
* update benches
* Enable more scenarios in CometExecBenchmark. (#1151)
* chore: Move more expressions from core crate to spark-expr crate (#1152)
* move aggregate expressions to spark-expr crate
* move more expressions
* move benchmark
* normalize_nan
* bitwise not
* comet scalar funcs
* update bench imports
* remove dead code (#1155)
* fix: Spark 4.0-preview1 SPARK-47120 (#1156) ## Which issue does this PR close? Part of https://github.com/apache/datafusion-comet/issues/372 and https://github.com/apache/datafusion-comet/issues/551 ## Rationale for this change To be ready for Spark 4.0 ## What changes are included in this PR? This PR fixes the new test SPARK-47120 added in Spark 4.0 ## How are these changes tested? tests enabled
* chore: Move string kernels and expressions to spark-expr crate (#1164)
* Move string kernels and expressions to spark-expr crate
* remove unused hash kernel
* remove unused dependencies
* chore: Move remaining expressions to spark-expr crate + some minor refactoring (#1165)
* move CheckOverflow to spark-expr crate
* move NegativeExpr to spark-expr crate
* move UnboundColumn to spark-expr crate
* move ExpandExec from execution::datafusion::operators to execution::operators
* refactoring to remove datafusion subpackage
* update imports in benches
* fix
* fix
* chore: Add ignored tests for reading complex types from Parquet (#1167)
* Add ignored tests for reading structs from Parquet
* add basic map test
* add tests for Map and Array
* feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1169)
* Add Spark-compatible SchemaAdapterFactory implementation
* remove prototype code
* fix
* refactor
* implement more cast logic
* implement more cast logic
* add basic test
* improve test
* cleanup
* fmt
* add support for casting unsigned int to signed int
* clippy
* address feedback
* fix test
* fix: Document enabling comet explain plan usage in Spark (4.0) (#1176)
* test: enabling Spark tests with offHeap requirement (#1177) ## Which issue does this PR close? ## Rationale for this change After https://github.com/apache/datafusion-comet/pull/1062 We have not running Spark tests for native execution ## What changes are included in this PR? Removed the off heap requirement for testing ## How are these changes tested? Bringing back Spark tests for native execution
* feat: Improve shuffle metrics (second attempt) (#1175)
* improve shuffle metrics
* docs
* more metrics
* refactor
* address feedback
* fix: stddev_pop should not directly return 0.0 when count is 1.0 (#1184)
* add test
* fix
* fix
* fix
* feat: Make native shuffle compression configurable and respect `spark.shuffle.compress` (#1185)
* Make shuffle compression codec and level configurable
* remove lz4 references
* docs
* update comment
* clippy
* fix benches
* clippy
* clippy
* disable test for miri
* remove lz4 reference from proto
* minor: move shuffle classes from common to spark (#1193)
* minor: refactor decodeBatches to make private in broadcast exchange (#1195)
* minor: refactor prepare_output so that it does not require an ExecutionContext (#1194)
* fix: fix missing explanation for then branch in case when (#1200)
* minor: remove unused source files (#1202)
* chore: Upgrade to DataFusion 44.0.0-rc2 (#1154)
* move aggregate expressions to spark-expr crate
* move more expressions
* move benchmark
* normalize_nan
* bitwise not
* comet scalar funcs
* update bench imports
* save
* save
* save
* remove unused imports
* clippy
* implement more hashers
* implement Hash and PartialEq
* implement Hash and PartialEq
* implement Hash and PartialEq
* benches
* fix ScalarUDFImpl.return_type failure
* exclude test from miri
* ignore correct test
* ignore another test
* remove miri checks
* use return_type_from_exprs
* Revert "use return_type_from_exprs" This reverts commit febc1f1ec1301f9b359fc23ad6a117224fce35b7.
* use DF main branch
* hacky workaround for regression in ScalarUDFImpl.return_type
* fix repo url
* pin to revision
* bump to latest rev
* bump to latest DF rev
* bump DF to rev 9f530dd
* add Cargo.lock
* bump DF version
* no default features
* Revert "remove miri checks" This reverts commit 4638fe3aa5501966cd5d8b53acf26c698b10b3c9.
* Update pin to DataFusion e99e02b9b9093ceb0c13a2dd32a2a89beba47930
* update pin
* Update Cargo.toml Bump to 44.0.0-rc2
* update cargo lock
* revert miri change --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* feat: add support for array_contains expression (#1163)
* feat: add support for array_contains expression
* test: add unit test for array_contains function
* Removes unnecessary case expression for handling null values
* chore: Move more expressions from core crate to spark-expr crate (#1152)
* move aggregate expressions to spark-expr crate
* move more expressions
* move benchmark
* normalize_nan
* bitwise not
* comet scalar funcs
* update bench imports
* remove dead code (#1155)
* fix: Spark 4.0-preview1 SPARK-47120 (#1156) ## Which issue does this PR close? Part of https://github.com/apache/datafusion-comet/issues/372 and https://github.com/apache/datafusion-comet/issues/551 ## Rationale for this change To be ready for Spark 4.0 ## What changes are included in this PR? This PR fixes the new test SPARK-47120 added in Spark 4.0 ## How are these changes tested? tests enabled
* chore: Move string kernels and expressions to spark-expr crate (#1164)
* Move string kernels and expressions to spark-expr crate
* remove unused hash kernel
* remove unused dependencies
* chore: Move remaining expressions to spark-expr crate + some minor refactoring (#1165)
* move CheckOverflow to spark-expr crate
* move NegativeExpr to spark-expr crate
* move UnboundColumn to spark-expr crate
* move ExpandExec from execution::datafusion::operators to execution::operators
* refactoring to remove datafusion subpackage
* update imports in benches
* fix
* fix
* chore: Add ignored tests for reading complex types from Parquet (#1167)
* Add ignored tests for reading structs from Parquet
* add basic map test
* add tests for Map and Array
* feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1169)
* Add Spark-compatible SchemaAdapterFactory implementation
* remove prototype code
* fix
* refactor
* implement more cast logic
* implement more cast logic
* add basic test
* improve test
* cleanup
* fmt
* add support for casting unsigned int to signed int
* clippy
* address feedback
* fix test
* fix: Document enabling comet explain plan usage in Spark (4.0) (#1176)
* test: enabling Spark tests with offHeap requirement (#1177) ## Which issue does this PR close? ## Rationale for this change After https://github.com/apache/datafusion-comet/pull/1062 We have not running Spark tests for native execution ## What changes are included in this PR? Removed the off heap requirement for testing ## How are these changes tested? Bringing back Spark tests for native execution
* feat: Improve shuffle metrics (second attempt) (#1175)
* improve shuffle metrics
* docs
* more metrics
* refactor
* address feedback
* fix: stddev_pop should not directly return 0.0 when count is 1.0 (#1184)
* add test
* fix
* fix
* fix
* feat: Make native shuffle compression configurable and respect `spark.shuffle.compress` (#1185)
* Make shuffle compression codec and level configurable
* remove lz4 references
* docs
* update comment
* clippy
* fix benches
* clippy
* clippy
* disable test for miri
* remove lz4 reference from proto
* minor: move shuffle classes from common to spark (#1193)
* minor: refactor decodeBatches to make private in broadcast exchange (#1195)
* minor: refactor prepare_output so that it does not require an ExecutionContext (#1194)
* fix: fix missing explanation for then branch in case when (#1200)
* minor: remove unused source files (#1202)
* chore: Upgrade to DataFusion 44.0.0-rc2 (#1154)
* move aggregate expressions to spark-expr crate
* move more expressions
* move benchmark
* normalize_nan
* bitwise not
* comet scalar funcs
* update bench imports
* save
* save
* save
* remove unused imports
* clippy
* implement more hashers
* implement Hash and PartialEq
* implement Hash and PartialEq
* implement Hash and PartialEq
* benches
* fix ScalarUDFImpl.return_type failure
* exclude test from miri
* ignore correct test
* ignore another test
* remove miri checks
* use return_type_from_exprs
* Revert "use return_type_from_exprs" This reverts commit febc1f1ec1301f9b359fc23ad6a117224fce35b7.
* use DF main branch
* hacky workaround for regression in ScalarUDFImpl.return_type
* fix repo url
* pin to revision
* bump to latest rev
* bump to latest DF rev
* bump DF to rev 9f530dd
* add Cargo.lock
* bump DF version
* no default features
* Revert "remove miri checks" This reverts commit 4638fe3aa5501966cd5d8b53acf26c698b10b3c9.
* Update pin to DataFusion e99e02b9b9093ceb0c13a2dd32a2a89beba47930
* update pin
* Update Cargo.toml Bump to 44.0.0-rc2
* update cargo lock
* revert miri change --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* update UT Signed-off-by: Dharan Aditya <dharan.aditya@gmail.com>
* fix typo in UT Signed-off-by: Dharan Aditya <dharan.aditya@gmail.com> --------- Signed-off-by: Dharan Aditya <dharan.aditya@gmail.com> Co-authored-by: Andy Grove <agrove@apache.org> Co-authored-by: KAZUYUKI TANIMURA <ktanimura@apple.com> Co-authored-by: Parth Chandra <parthc@apache.org> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Raz Luvaton <16746759+rluvaton@users.noreply.github.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* feat: Add a `spark.comet.exec.memoryPool` configuration for experimenting with various datafusion memory pool setups. (#1021)
* feat: Reenable tests for filtered SMJ anti join (#1211)
* feat: reenable filtered SMJ Anti join tests
* feat: reenable filtered SMJ Anti join tests
* feat: reenable filtered SMJ Anti join tests
* feat: reenable filtered SMJ Anti join tests
* Add CoalesceBatchesExec around SMJ with join filter
* adding `CoalesceBatches`
* adding `CoalesceBatches`
* adding `CoalesceBatches`
* feat: reenable filtered SMJ Anti join tests
* feat: reenable filtered SMJ Anti join tests --------- Co-authored-by: Andy Grove <agrove@apache.org>
* chore: Add safety check to CometBuffer (#1050)
* chore: Add safety check to CometBuffer
* Add CometColumnarToRowExec
* fix
* fix
* more
* Update plan stability results
* fix
* fix
* fix
* Revert "fix" This reverts commit 9bad173c7751f105bf3ded2ebc2fed0737d1b909.
* Revert "Revert "fix"" This reverts commit d527ad1a365d3aff64200ceba6d11cf376f3919f.
* fix BucketedReadWithoutHiveSupportSuite
* fix SparkPlanSuite
* remove unreachable code (#1213)
* test: Enable Comet by default except some tests in SparkSessionExtensionSuite (#1201) ## Which issue does this PR close? Part of https://github.com/apache/datafusion-comet/issues/1197 ## Rationale for this change Since `loadCometExtension` in the diffs were not using `isCometEnabled`, `SparkSessionExtensionSuite` was not using Comet. Once enabled, some test failures discovered ## What changes are included in this PR? `loadCometExtension` now uses `isCometEnabled` that enables Comet by default Temporary ignore the failing tests in SparkSessionExtensionSuite ## How are these changes tested? existing tests
* extract struct expressions to folders based on spark grouping (#1216)
* chore: extract static invoke expressions to folders based on spark grouping (#1217)
* extract static invoke expressions to folders based on spark grouping
* Update native/spark-expr/src/static_invoke/mod.rs Co-authored-by: Andy Grove <agrove@apache.org> --------- Co-authored-by: Andy Grove <agrove@apache.org>
* chore: Follow-on PR to fully enable onheap memory usage (#1210)
* Make datafusion's native memory pool configurable
* save
* fix
* Update memory calculation and add draft documentation
* ready for review
* ready for review
* address feedback
* Update docs/source/user-guide/tuning.md Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
* Update docs/source/user-guide/tuning.md Co-authored-by: Kristin Cowalcijk <bo@wherobots.com>
* Update docs/source/user-guide/tuning.md Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
* Update docs/source/user-guide/tuning.md Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
* remove unused config --------- Co-authored-by: Kristin Cowalcijk <bo@wherobots.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
* feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support (#1192)
* Implement native decoding and decompression
* revert some variable renaming for smaller diff
* fix oom issues?
* make NativeBatchDecoderIterator more consistent with ArrowReaderIterator
* fix oom and prep for review
* format
* Add LZ4 support
* clippy, new benchmark
* rename metrics, clean up lz4 code
* update test
* Add support for snappy
* format
* change default back to lz4
* make metrics more accurate
* format
* clippy
* use faster unsafe version of lz4_flex
* Make compression codec configurable for columnar shuffle
* clippy
* fix bench
* fmt
* address feedback
* address feedback
* address feedback
* minor code simplification
* cargo fmt
* overflow check
* rename compression level config
* address feedback
* address feedback
* rename constant
* chore: extract agg_funcs expressions to folders based on spark grouping (#1224)
* extract agg_funcs expressions to folders based on spark grouping
* fix rebase
* extract datetime_funcs expressions to folders based on spark grouping (#1222) Co-authored-by: Andy Grove <agrove@apache.org>
* chore: use datafusion from crates.io (#1232)
* chore: extract strings file to `strings_func` like in spark grouping (#1215)
* chore: extract predicate_functions expressions to folders based on spark grouping (#1218)
* extract predicate_functions expressions to folders based on spark grouping
* code review changes --------- Co-authored-by: Andy Grove <agrove@apache.org>
* build(deps): bump protobuf version to 3.21.12 (#1234)
* extract json_funcs expressions to folders based on spark grouping (#1220) Co-authored-by: Andy Grove <agrove@apache.org>
* test: Enable shuffle by default in Spark tests (#1240) ## Which issue does this PR close? ## Rationale for this change Because `isCometShuffleEnabled` is false by default, some tests were not reached ## What changes are included in this PR? Removed `isCometShuffleEnabled` and updated spark test diff ## How are these changes tested? existing test
* chore: extract hash_funcs expressions to folders based on spark grouping (#1221)
* extract hash_funcs expressions to folders based on spark grouping
* extract hash_funcs expressions to folders based on spark grouping --------- Co-authored-by: Andy Grove <agrove@apache.org>
* fix: Fall back to Spark for unsupported partition or sort expressions in window aggregates (#1253)
* perf: Improve query planning to more reliably fall back to columnar shuffle when native shuffle is not supported (#1209)
* fix regression (#1259)
* feat: add support for array_remove expression (#1179)
* wip: array remove
* added comet expression test
* updated test cases
* fixed array_remove function for null values
* removed commented code
* remove unnecessary code
* updated the test for 'array_remove'
* added test for array_remove in case the input array is null
* wip: case array is empty
* removed test case for empty array
* fix: Fall back to Spark for distinct aggregates (#1262)
* fall back to Spark for distinct aggregates
* update expected plans for 3.4
* update expected plans for 3.5
* force build
* add comment
* feat: Implement custom RecordBatch serde for shuffle for improved performance (#1190)
* Implement faster encoder for shuffle blocks
* make code more concise
* enable fast encoding for columnar shuffle
* update benches
* test all int types
* test float
* remaining types
* add Snappy and Zstd(6) back to benchmark
* fix regression
* Update native/core/src/execution/shuffle/codec.rs Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
* address feedback
* support nullable flag --------- Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
* docs: Update TPC-H benchmark results (#1257)
* fix: disable initCap by default (#1276)
* fix: disable initCap by default
* Update spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala Co-authored-by: Andy Grove <agrove@apache.org>
* address review comments --------- Co-authored-by: Andy Grove <agrove@apache.org>
* chore: Add changelog for 0.5.0 (#1278)
* Add changelog
* revert accidental change
* move 2 items to performance section
* update TPC-DS results for 0.5.0 (#1277)
* fix: cast timestamp to decimal is unsupported (#1281)
* fix: cast timestamp to decimal is unsupported
* fix style
* revert test name and mark as ignore
* add comment
* Fix build after merge
* Fix tests after merge
* Fix plans after merge
* fix partition id in execute plan after merge (from Andy Grove)

---------
Signed-off-by: Dharan Aditya <dharan.aditya@gmail.com>
Co-authored-by: NoeB <noe.brehm@bluewin.ch>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Raz Luvaton <raz.luvaton@flarion.io>
Co-authored-by: Andy Grove <agrove@apache.org>
Co-authored-by: KAZUYUKI TANIMURA <ktanimura@apple.com>
Co-authored-by: Sem <ssinchenko@apache.org>
Co-authored-by: Himadri Pal <mehimu@gmail.com>
Co-authored-by: himadripal <hpal@apple.com>
Co-authored-by: gstvg <28798827+gstvg@users.noreply.github.com>
Co-authored-by: Adam Binford <adamq43@gmail.com>
Co-authored-by: Matt Butrovich <mbutrovich@users.noreply.github.com>
Co-authored-by: Raz Luvaton <16746759+rluvaton@users.noreply.github.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Dharan Aditya <dharan.aditya@gmail.com>
Co-authored-by: Kristin Cowalcijk <bo@wherobots.com>
Co-authored-by: Oleks V <comphead@users.noreply.github.com>
Co-authored-by: Zhen Wang <643348094@qq.com>
Co-authored-by: Jagdish Parihar <jatin6972@gmail.com>
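The unsigned-type fix (#1095) in the merge above mentions a `>` vs `>=` bug in `round()`. The project's actual code is not shown here, but the general failure mode of such a boundary bug can be sketched as follows (a hypothetical illustration under assumed half-away-from-zero semantics, not Comet's implementation): a fractional part of exactly one half must round up, which requires `>=`.

```rust
// Hypothetical illustration of a `>` vs `>=` rounding boundary bug:
// half-away-from-zero rounding must round a fraction of exactly 0.5 up.
fn round_half_away(x: f64) -> i64 {
    let truncated = x.trunc();
    let frac = (x - truncated).abs();
    // A buggy `frac > 0.5` comparison would leave 2.5 rounded down to 2.
    if frac >= 0.5 {
        (truncated + x.signum()) as i64
    } else {
        truncated as i64
    }
}

fn main() {
    assert_eq!(round_half_away(2.5), 3); // exactly half rounds away from zero
    assert_eq!(round_half_away(2.4), 2);
    assert_eq!(round_half_away(-2.5), -3);
    println!("ok");
}
```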
Commit: d7a7812
Author: Andy Grove
Committer: GitHub
feat: Implement custom RecordBatch serde for shuffle for improved performance (#1190)

* Implement faster encoder for shuffle blocks
* make code more concise
* enable fast encoding for columnar shuffle
* update benches
* test all int types
* test float
* remaining types
* add Snappy and Zstd(6) back to benchmark
* fix regression
* Update native/core/src/execution/shuffle/codec.rs Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
* address feedback
* support nullable flag

---------
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Commit: b49c17b
Author: Parth Chandra
Committer: GitHub
chore: [comet-parquet-exec] Unit test fixes, default scan impl to native_comet (#1265)

* fix: fix tests failing in native_recordbatch but not in native_full
* fix: use session timestamp in native scans
* Revert "fix: use session timestamp in native scans" This reverts commit e601deb472037338a36300992434a987bdb026e8.
* Revert Change to native record batch timezone
* Change stability plans to match original scan.
* fix after rebase
* Update plans; generate distinct plans for full native scan
* generate plans for native_recordbatch
* In struct tests, check Comet operator only for scan types that support complex types
* Revert "Revert Change to native record batch timezone" This reverts commit 4a147f3766b25dc9245448e529e16a781086f3c6.
* Reapply "fix: use session timestamp in native scans" This reverts commit 370f9015d3b0d134737a1864fab8638d6740bb2f.
* Fix previous commit
* Rename configs and default scan impl to 'native_comet'
* add missing change
* fix build
* update plans for spark 3.5
* Add new plans for spark 3.5
* Update plans for Spark 4.0
* Plans updated from Spark 4
Commit: c25060e
Author: Jagdish Parihar
Committer: GitHub
feat: add support for array_remove expression (#1179)

* wip: array remove
* added comet expression test
* updated test cases
* fixed array_remove function for null values
* removed commented code
* remove unnecessary code
* updated the test for 'array_remove'
* added test for array_remove in case the input array is null
* wip: case array is empty
* removed test case for empty array
Commit: 74a6a8d
Author: Andy Grove
Committer: GitHub
feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support (#1192)

* Implement native decoding and decompression
* revert some variable renaming for smaller diff
* fix oom issues?
* make NativeBatchDecoderIterator more consistent with ArrowReaderIterator
* fix oom and prep for review
* format
* Add LZ4 support
* clippy, new benchmark
* rename metrics, clean up lz4 code
* update test
* Add support for snappy
* format
* change default back to lz4
* make metrics more accurate
* format
* clippy
* use faster unsafe version of lz4_flex
* Make compression codec configurable for columnar shuffle
* clippy
* fix bench
* fmt
* address feedback
* address feedback
* address feedback
* minor code simplification
* cargo fmt
* overflow check
* rename compression level config
* address feedback
* address feedback
* rename constant
Commit: 4f8ce75
Author: Dharan Aditya
Committer: GitHub
feat: add support for array_contains expression (#1163)

* feat: add support for array_contains expression
* test: add unit test for array_contains function
* Removes unnecessary case expression for handling null values
* chore: Move more expressions from core crate to spark-expr crate (#1152)
* move aggregate expressions to spark-expr crate
* move more expressions
* move benchmark
* normalize_nan
* bitwise not
* comet scalar funcs
* update bench imports
* remove dead code (#1155)
* fix: Spark 4.0-preview1 SPARK-47120 (#1156)

## Which issue does this PR close?

Part of https://github.com/apache/datafusion-comet/issues/372 and https://github.com/apache/datafusion-comet/issues/551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR fixes the new test SPARK-47120 added in Spark 4.0

## How are these changes tested?

tests enabled

* chore: Move string kernels and expressions to spark-expr crate (#1164)
* Move string kernels and expressions to spark-expr crate
* remove unused hash kernel
* remove unused dependencies
* chore: Move remaining expressions to spark-expr crate + some minor refactoring (#1165)
* move CheckOverflow to spark-expr crate
* move NegativeExpr to spark-expr crate
* move UnboundColumn to spark-expr crate
* move ExpandExec from execution::datafusion::operators to execution::operators
* refactoring to remove datafusion subpackage
* update imports in benches
* fix
* fix
* chore: Add ignored tests for reading complex types from Parquet (#1167)
* Add ignored tests for reading structs from Parquet
* add basic map test
* add tests for Map and Array
* feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1169)
* Add Spark-compatible SchemaAdapterFactory implementation
* remove prototype code
* fix
* refactor
* implement more cast logic
* implement more cast logic
* add basic test
* improve test
* cleanup
* fmt
* add support for casting unsigned int to signed int
* clippy
* address feedback
* fix test
* fix: Document enabling comet explain plan usage in Spark (4.0) (#1176)
* test: enabling Spark tests with offHeap requirement (#1177)

## Which issue does this PR close?

## Rationale for this change

After https://github.com/apache/datafusion-comet/pull/1062 we have not been running Spark tests for native execution

## What changes are included in this PR?

Removed the off-heap requirement for testing

## How are these changes tested?

Bringing back Spark tests for native execution

* feat: Improve shuffle metrics (second attempt) (#1175)
* improve shuffle metrics
* docs
* more metrics
* refactor
* address feedback
* fix: stddev_pop should not directly return 0.0 when count is 1.0 (#1184)
* add test
* fix
* fix
* fix
* feat: Make native shuffle compression configurable and respect `spark.shuffle.compress` (#1185)
* Make shuffle compression codec and level configurable
* remove lz4 references
* docs
* update comment
* clippy
* fix benches
* clippy
* clippy
* disable test for miri
* remove lz4 reference from proto
* minor: move shuffle classes from common to spark (#1193)
* minor: refactor decodeBatches to make private in broadcast exchange (#1195)
* minor: refactor prepare_output so that it does not require an ExecutionContext (#1194)
* fix: fix missing explanation for then branch in case when (#1200)
* minor: remove unused source files (#1202)
* chore: Upgrade to DataFusion 44.0.0-rc2 (#1154)
* move aggregate expressions to spark-expr crate
* move more expressions
* move benchmark
* normalize_nan
* bitwise not
* comet scalar funcs
* update bench imports
* save
* save
* save
* remove unused imports
* clippy
* implement more hashers
* implement Hash and PartialEq
* implement Hash and PartialEq
* implement Hash and PartialEq
* benches
* fix ScalarUDFImpl.return_type failure
* exclude test from miri
* ignore correct test
* ignore another test
* remove miri checks
* use return_type_from_exprs
* Revert "use return_type_from_exprs" (reverts commit febc1f1ec1301f9b359fc23ad6a117224fce35b7)
* use DF main branch
* hacky workaround for regression in ScalarUDFImpl.return_type
* fix repo url
* pin to revision
* bump to latest rev
* bump to latest DF rev
* bump DF to rev 9f530dd
* add Cargo.lock
* bump DF version
* no default features
* Revert "remove miri checks" (reverts commit 4638fe3aa5501966cd5d8b53acf26c698b10b3c9)
* Update pin to DataFusion e99e02b9b9093ceb0c13a2dd32a2a89beba47930
* update pin
* Update Cargo.toml (bump to 44.0.0-rc2)
* update cargo lock
* revert miri change
* update UT
* fix typo in UT

Signed-off-by: Dharan Aditya <dharan.aditya@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andy Grove <agrove@apache.org>
Co-authored-by: KAZUYUKI TANIMURA <ktanimura@apple.com>
Co-authored-by: Parth Chandra <parthc@apache.org>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Raz Luvaton <16746759+rluvaton@users.noreply.github.com>
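The `array_contains` work above (#1163) mentions removing a special case for null handling. Spark's `array_contains` uses SQL three-valued logic: true if the value is found, NULL if it is absent but the array contains a null element, and false otherwise. A minimal sketch of those semantics (illustrative only, not Comet's native kernel):

```rust
// Three-valued `array_contains` semantics a native implementation must match.
// Option<i32> models a nullable element; Option<bool> models a nullable result.
fn array_contains(arr: &[Option<i32>], value: i32) -> Option<bool> {
    let mut saw_null = false;
    for elem in arr {
        match elem {
            Some(v) if *v == value => return Some(true), // found: TRUE
            None => saw_null = true,                     // remember null elements
            _ => {}
        }
    }
    // Not found: NULL if any element was null, otherwise FALSE.
    if saw_null { None } else { Some(false) }
}

fn main() {
    assert_eq!(array_contains(&[Some(1), Some(2)], 2), Some(true));
    assert_eq!(array_contains(&[Some(1), None], 2), None);
    assert_eq!(array_contains(&[Some(1), Some(3)], 2), Some(false));
}
```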
Commit: | ea6d205 | |
---|---|---|
Author: | Andy Grove | |
Committer: | GitHub |
feat: Make native shuffle compression configurable and respect `spark.shuffle.compress` (#1185) * Make shuffle compression codec and level configurable * remove lz4 references * docs * update comment * clippy * fix benches * clippy * clippy * disable test for miri * remove lz4 reference from proto
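The shape of "configurable and respecting `spark.shuffle.compress`" can be sketched as a small resolution step: the Spark flag gates compression entirely, and a separate codec/level setting picks the codec. The function and codec names below are illustrative, not Comet's actual configuration keys or API:

```rust
// Sketch of resolving a shuffle compression codec. `spark.shuffle.compress`
// (a real Spark conf) disables compression when false; the codec name and
// level come from separate, Comet-specific settings (names hypothetical).
#[derive(Debug, PartialEq)]
enum ShuffleCodec {
    None,
    Zstd(i32), // compression level
}

fn resolve_codec(shuffle_compress: bool, codec: &str, level: i32) -> Result<ShuffleCodec, String> {
    if !shuffle_compress {
        // spark.shuffle.compress=false wins over any codec setting.
        return Ok(ShuffleCodec::None);
    }
    match codec {
        "zstd" => Ok(ShuffleCodec::Zstd(level)),
        other => Err(format!("unsupported shuffle compression codec: {other}")),
    }
}

fn main() {
    assert_eq!(resolve_codec(false, "zstd", 1), Ok(ShuffleCodec::None));
    assert_eq!(resolve_codec(true, "zstd", 3), Ok(ShuffleCodec::Zstd(3)));
    assert!(resolve_codec(true, "lz4", 1).is_err()); // lz4 was removed in this commit
}
```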
Commit: | 3c43234 | |
---|---|---|
Author: | Matt Butrovich | |
Committer: | GitHub |
[comet-parquet-exec] Merge upstream/main and resolve conflicts (#1183)

* feat: support array_append (#1072)
* feat: support array_append
* formatted code
* rewrite array_append plan to match spark behaviour and fixed bug in QueryPlan serde
* remove unwrap
* Fix for Spark 3.3
* refactor array_append binary expression serde code
* Disabled array_append test for spark 4.0+
* chore: Simplify CometShuffleMemoryAllocator to use Spark unified memory allocator (#1063)
* docs: Update benchmarking.md (#1085)
* feat: Require offHeap memory to be enabled (always use unified memory) (#1062)
* Require offHeap memory
* remove unused import
* use off heap memory in stability tests
* reorder imports
* test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE config (#1087)
* Add changelog for 0.4.0 (#1089)
* chore: Prepare for 0.5.0 development (#1090)
* Update version number for build
* update docs
* build: Skip installation of spark-integration and fuzz testing modules (#1091)
* Add hint for finding the GPG key to use when publishing to maven (#1093)
* docs: Update documentation for 0.4.0 release (#1096)
* update TPC-H results
* update Maven links
* update benchmarking guide and add TPC-DS results
* include q72
* fix: Unsigned type related bugs (#1095)

## Which issue does this PR close?

Closes https://github.com/apache/datafusion-comet/issues/1067

## Rationale for this change

Bug fix. A few expressions were failing some unsigned type related tests

## What changes are included in this PR?

- For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits
- `u64` becomes `Decimal(20, 0)` but there was a bug in `round()` (`>` vs `>=`)

## How are these changes tested?

Put back tests for unsigned types

* chore: Include first ScanExec batch in metrics (#1105)
* include first batch in ScanExec metrics
* record row count metric
* fix regression
* chore: Improve CometScan metrics (#1100)
* Add native metrics for plan creation
* make messages consistent
* Include get_next_batch cost in metrics
* formatting
* fix double count of rows
* chore: Add custom metric for native shuffle fetching batches from JVM (#1108)
* feat: support array_insert (#1073)
* Part of the implementation of array_insert
* Missing methods
* Working version
* Reformat code
* Fix code-style
* Add comments about spark's implementation.
* Implement negative indices + fix tests for spark < 3.4
* Fix code-style
* Fix scalastyle
* Fix tests for spark < 3.4
* Fixes & tests: added test for the negative index; added test for the legacy spark mode
* Use assume(isSpark34Plus) in tests
* Test else-branch & improve coverage
* Update native/spark-expr/src/list.rs
* Fix fallback test (in one case there is a zero in index and the test fails due to a spark error)
* Adjust the behaviour for the NULL case to Spark
* Move the logic of type checking to the method
* Fix code-style
* feat: enable decimal to decimal cast of different precision and scale (#1086)
* enable decimal to decimal cast of different precision and scale
* add more test cases for negative scale and higher precision
* add check for compatibility for decimal to decimal
* fix code style
* Update spark/src/main/scala/org/apache/comet/expressions/CometCast.scala
* fix the nit in comment
* docs: fix readme FGPA/FPGA typo (#1117)
* fix: Use RDD partition index (#1112)
* fix: Use RDD partition index
* fix
* fix
* fix
* fix: Various metrics bug fixes and improvements (#1111)
* fix: Don't create CometScanExec for subclasses of ParquetFileFormat (#1129)
* Use exact class comparison for parquet scan
* Add test
* Add comment
* fix: Fix metrics regressions (#1132)
* fix metrics issues
* clippy
* update tests
* docs: Add more technical detail and new diagram to Comet plugin overview (#1119)
* Add more technical detail and new diagram to Comet plugin overview
* update diagram
* add info on Arrow IPC
* update diagram
* update diagram
* update docs
* address feedback
* Stop passing Java config map into native createPlan (#1101)
* feat: Improve ScanExec native metrics (#1133)
* save
* remove shuffle jvm metric and update tuning guide
* docs
* add source for all ScanExecs
* address feedback
* address feedback
* chore: Remove unused StringView struct (#1143)
* Remove unused StringView struct
* remove more dead code
* docs: Add some documentation explaining how shuffle works (#1148)
* add some notes on shuffle
* reads
* improve docs
* test: enable more Spark 4.0 tests (#1145)

## Which issue does this PR close?

Part of https://github.com/apache/datafusion-comet/issues/372 and https://github.com/apache/datafusion-comet/issues/551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR enables more Spark 4.0 tests that were fixed by recent changes

## How are these changes tested?

tests enabled

* chore: Refactor cast to use SparkCastOptions param (#1146)
* Refactor cast to use SparkCastOptions param
* update tests
* update benches
* update benches
* update benches
* Enable more scenarios in CometExecBenchmark. (#1151)
* chore: Move more expressions from core crate to spark-expr crate (#1152)
* move aggregate expressions to spark-expr crate
* move more expressions
* move benchmark
* normalize_nan
* bitwise not
* comet scalar funcs
* update bench imports
* remove dead code (#1155)
* fix: Spark 4.0-preview1 SPARK-47120 (#1156)

## Which issue does this PR close?

Part of https://github.com/apache/datafusion-comet/issues/372 and https://github.com/apache/datafusion-comet/issues/551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR fixes the new test SPARK-47120 added in Spark 4.0

## How are these changes tested?

tests enabled

* chore: Move string kernels and expressions to spark-expr crate (#1164)
* Move string kernels and expressions to spark-expr crate
* remove unused hash kernel
* remove unused dependencies
* chore: Move remaining expressions to spark-expr crate + some minor refactoring (#1165)
* move CheckOverflow to spark-expr crate
* move NegativeExpr to spark-expr crate
* move UnboundColumn to spark-expr crate
* move ExpandExec from execution::datafusion::operators to execution::operators
* refactoring to remove datafusion subpackage
* update imports in benches
* fix
* fix
* chore: Add ignored tests for reading complex types from Parquet (#1167)
* Add ignored tests for reading structs from Parquet
* add basic map test
* add tests for Map and Array
* feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1169)
* Add Spark-compatible SchemaAdapterFactory implementation
* remove prototype code
* fix
* refactor
* implement more cast logic
* implement more cast logic
* add basic test
* improve test
* cleanup
* fmt
* add support for casting unsigned int to signed int
* clippy
* address feedback
* fix test
* fix: Document enabling comet explain plan usage in Spark (4.0) (#1176)
* test: enabling Spark tests with offHeap requirement (#1177)

## Which issue does this PR close?

## Rationale for this change

After https://github.com/apache/datafusion-comet/pull/1062 we have not been running Spark tests for native execution

## What changes are included in this PR?

Removed the off-heap requirement for testing

## How are these changes tested?

Bringing back Spark tests for native execution

* feat: Improve shuffle metrics (second attempt) (#1175)
* improve shuffle metrics
* docs
* more metrics
* refactor
* address feedback
* Fix redundancy in Cargo.lock.
* Format, more post-merge cleanup.
* Compiles
* Compiles
* Remove empty file.
* Attempt to fix JNI issue and native test build issues.
* Test Fix
* Update planner.rs (remove println from test)

Co-authored-by: NoeB <noe.brehm@bluewin.ch>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Raz Luvaton <raz.luvaton@flarion.io>
Co-authored-by: Andy Grove <agrove@apache.org>
Co-authored-by: Parth Chandra <parthc@apache.org>
Co-authored-by: KAZUYUKI TANIMURA <ktanimura@apple.com>
Co-authored-by: Sem <ssinchenko@apache.org>
Co-authored-by: Himadri Pal <mehimu@gmail.com>
Co-authored-by: himadripal <hpal@apple.com>
Co-authored-by: gstvg <28798827+gstvg@users.noreply.github.com>
Co-authored-by: Adam Binford <adamq43@gmail.com>
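The unsigned-type fix described above (#1095) hinges on sign-extension versus zero-extension: Parquet stores small unsigned values in signed physical types, so widening them as signed integers corrupts anything with the high bit set. A minimal illustration of the pitfall (not Comet's actual `generate_cast_to_signed!` macro):

```rust
// Widening a stored i8 that actually holds a u8 value:
// sign-extending is wrong, reinterpreting the bits as unsigned first is right.

fn widen_signed_wrong(v: i8) -> i32 {
    v as i32 // sign-extends: the bit pattern 0xFF becomes -1
}

fn widen_unsigned_right(v: i8) -> i32 {
    (v as u8) as i32 // reinterpret as u8 first: 0xFF becomes 255
}

fn main() {
    let stored: i8 = -1; // the bit pattern of the u8 value 255
    assert_eq!(widen_signed_wrong(stored), -1);
    assert_eq!(widen_unsigned_right(stored), 255);
}
```

The same commit notes that `u64` maps to `Decimal(20, 0)`, whose rounding path had a separate `>` vs `>=` boundary bug.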
Commit: | e0d8077 | |
---|---|---|
Author: | Matt Butrovich | |
Committer: | GitHub |
[comet-parquet-exec] Simplify schema logic for CometNativeScan (#1142) * Serialize original data schema and required schema, generate projection vector on the Java side. * Sending over more schema info like column names and nullability. * Using the new stuff in the proto. About to take the old out. * Remove old logic. * remove errant print. * Serialize original data schema and required schema, generate projection vector on the Java side. * Sending over more schema info like column names and nullability. * Using the new stuff in the proto. About to take the old out. * Remove old logic. * remove errant print. * Remove commented print. format. * Remove commented print. format. * Fix projection_vector to include partition_schema cols correctly. * Rename variable.
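The commit above serializes the original data schema and the required schema, then derives a projection vector on the Java side. The core of that derivation is a name-to-index lookup, sketched here with plain string slices (illustrative only, not Comet's code, which also folds in partition columns):

```rust
// For each column in the required (read) schema, find its position in the
// full data schema; the resulting index vector is the projection.
fn projection_vector(data_schema: &[&str], required: &[&str]) -> Option<Vec<usize>> {
    required
        .iter()
        .map(|name| data_schema.iter().position(|c| c == name))
        .collect() // yields None if any required column is missing
}

fn main() {
    let data = ["a", "b", "c", "d"];
    assert_eq!(projection_vector(&data, &["c", "a"]), Some(vec![2, 0]));
    assert_eq!(projection_vector(&data, &["z"]), None);
}
```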
Commit: | ebdde77 | |
---|---|---|
Author: | Andy Grove | |
Committer: | GitHub |
fix: Various metrics bug fixes and improvements (#1111)
Commit: | c3ad26e | |
---|---|---|
Author: | Liang-Chi Hsieh | |
Committer: | GitHub |
fix: Support partition values in feature branch comet-parquet-exec (#1106) * init * more * more * fix clippy * Use Spark and Arrow types for partition schema
Commit: | 9990b34 | |
---|---|---|
Author: | Sem | |
Committer: | GitHub |
feat: support array_insert (#1073) * Part of the implementation of array_insert * Missing methods * Working version * Reformat code * Fix code-style * Add comments about spark's implementation. * Implement negative indices + fix tests for spark < 3.4 * Fix code-style * Fix scalastyle * Fix tests for spark < 3.4 * Fixes & tests - added test for the negative index - added test for the legacy spark mode * Use assume(isSpark34Plus) in tests * Test else-branch & improve coverage * Update native/spark-expr/src/list.rs Co-authored-by: Andy Grove <agrove@apache.org> * Fix fallback test In one case there is a zero in index and test fails due to spark error * Adjust the behaviour for the NULL case to Spark * Move the logic of type checking to the method * Fix code-style --------- Co-authored-by: Andy Grove <agrove@apache.org>
Commit: | 9657b75 | |
---|---|---|
Author: | NoeB | |
Committer: | GitHub |
feat: support array_append (#1072) * feat: support array_append * formatted code * rewrite array_append plan to match spark behaviour and fixed bug in QueryPlan serde * remove unwrap * Fix for Spark 3.3 * refactor array_append binary expression serde code * Disabled array_append test for spark 4.0+
Commit: | eafda43 | |
---|---|---|
Author: | Matt Butrovich | |
Committer: | GitHub |
[comet-parquet-exec] Pass Spark's partitions to DF's ParquetExec (#1081) * I think serde works. Gonna try removing the old stuff. * Fixes after merging in upstream. * Remove previous file_config logic. Clippy. * Temporary assertion for testing. * Remove old path proto value. * Selectively generate projection vector.
Commit: | bd68db8 | |
---|---|---|
Author: | Parth Chandra | |
Committer: | GitHub |
wip - CometNativeScan (#1078)
Commit: | 311bc9e | |
---|---|---|
Author: | Andy Grove |
Revert "wip - CometNativeScan (#1076)" This reverts commit 38e32f7c74c0fa7b79337ea8bc84285f4f167d75.
Commit: | 38e32f7 | |
---|---|---|
Author: | Parth Chandra | |
Committer: | GitHub |
wip - CometNativeScan (#1076) * wip - CometNativeScan * fix and make config internal
Commit: | 16033d9 | |
---|---|---|
Author: | Andy Grove |
upmerge
Commit: | 0b0d6e8 | |
---|---|---|
Author: | Andy Grove |
add partial support for multiple parquet files
Commit: | 22b648d | |
---|---|---|
Author: | Matt Butrovich |
filters?
Commit: | 9535117 | |
---|---|---|
Author: | Matt Butrovich |
Stash changes. Maybe runs TPC-H q1?
Commit: | 562a877 | |
---|---|---|
Author: | Andy Grove | |
Committer: | GitHub |
Refactor UnaryExpr and MathExpr in protobuf (#1056)
Commit: | 69760a3 | |
---|---|---|
Author: | Andy Grove | |
Committer: | GitHub |
minor: Refactor binary expr serde to reduce code duplication (#1053) * Use one BinaryExpr definition in protobuf * refactor And * refactor remaining binary expressions * update test * update test
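The shape of this refactor is a single shared message for all binary operators instead of a bespoke message per operator. The fragment below is a hypothetical sketch of that idea, not the actual contents of `expr.proto`:

```proto
// Illustrative only -- field numbers and message names are not the real ones.
message BinaryExpr {
  Expr left = 1;
  Expr right = 2;
}

message Expr {
  oneof expr_struct {
    BinaryExpr add = 1;      // each operator reuses the one BinaryExpr shape...
    BinaryExpr subtract = 2; // ...instead of defining its own message
  }
}
```

On the serde side, this lets one helper handle every binary operator, which is what "reduce code duplication" refers to.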
Commit: | e3ac6cf | |
---|---|---|
Author: | Matt Butrovich | |
Committer: | GitHub |
feat: Implement bloom_filter_agg (#987) * Add test that invokes bloom_filter_agg. * QueryPlanSerde support for BloomFilterAgg. * Add bloom_filter_agg based on sample UDAF. planner instantiates it now. Added spark_bit_array_tests. * Partial work on Accumulator. Need to finish merge_batch and state. * BloomFilterAgg state, merge_state, and evaluate. Need more tests. * Matches Spark behavior. Need to clean up the code quite a bit, and do `cargo clippy`. * Remove old comment. * Clippy. Increase bloom filter size back to Spark's default. * API cleanup. * API cleanup. * Add BloomFilterAgg benchmark to CometExecBenchmark * Docs. * API cleanup, fix merge_bits to update cardinality. * Refactor merge_bits to update bit_count with the bit merging. * Remove benchmark results file. * Docs. * Add native side benchmarks. * Adjust benchmark parameters to match Spark defaults. * Address review feedback. * Add assertion to merge_batch. * Address some review feedback. * Only generate native BloomFilterAgg if child has LongType. * Add TODO with GitHub issue link.
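One bullet above, "Refactor merge_bits to update bit_count with the bit merging", points at a classic invariant: a bloom filter that caches its set-bit count must refresh that count while OR-ing in another filter's words, or the cached cardinality goes stale. A sketch of that idea (names are hypothetical, not Comet's `spark_bit_array` API):

```rust
// A bit array with a cached count of set bits; merging must keep both in sync.
struct BitArray {
    words: Vec<u64>,
    bit_count: u64, // cached number of set bits
}

impl BitArray {
    fn merge(&mut self, other: &BitArray) {
        assert_eq!(self.words.len(), other.words.len());
        let mut count = 0u64;
        for (w, o) in self.words.iter_mut().zip(&other.words) {
            *w |= *o; // OR in the other filter's bits
            count += w.count_ones() as u64; // recount as we go
        }
        self.bit_count = count; // cardinality updated in the same pass
    }
}

fn main() {
    let mut a = BitArray { words: vec![0b1010], bit_count: 2 };
    let b = BitArray { words: vec![0b0110], bit_count: 2 };
    a.merge(&b);
    assert_eq!(a.words[0], 0b1110);
    assert_eq!(a.bit_count, 3);
}
```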
Commit: | b131cc3 | |
---|---|---|
Author: | Adam Binford | |
Committer: | GitHub |
feat: Support `GetArrayStructFields` expression (#993) * Start working on GetArrayStructFIelds * Almost have working * Working * Add another test * Remove unused * Remove unused sql conf
Commit: | 459b2b0 | |
---|---|---|
Author: | Huaxin Gao | |
Committer: | GitHub |
fix: window function range offset should be long instead of int (#733) * fix: window function range offset should be long instead of int * fix error * fall back to Spark if range offset is not int or long * uncomment tests * rebase * fix offset datatype * fix data type * address comments * throw Err for WindowFrameUnits::Groups * formatting
Commit: | 0f80df2 | |
---|---|---|
Author: | Adam Binford | |
Committer: | GitHub |
feat: Array element extraction (#899)
Commit: | e57ead4 | |
---|---|---|
Author: | Liang-Chi Hsieh | |
Committer: | GitHub |
feat: Support sort merge join with a join condition (#553) * Init * test * test * test * Use specified commit to test * Fix format * fix clippy * fix * fix * Fix * Change to SQL syntax * Disable SMJ LeftAnti with join filter * Fix * Add test * Add test * Update to last DataFusion commit * fix format * fix * Update diffs
Commit: | cd530f8 | |
---|---|---|
Author: | Andy Grove | |
Committer: | GitHub |
feat: Implement to_json for subset of types (#805) * add skeleton for StructsToJson * first test passes * add support for nested structs * add support for strings and improve test * clippy * format * prepare for review * remove perf results * update user guide * add microbenchmark * remove comment * update docs * reduce size of diff * add failing test for quotes in field names and values * test passes * clippy * revert a docs change * Update native/spark-expr/src/to_json.rs Co-authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@gmail.com> * address feedback * support tabs * newlines * backspace * clippy * fix test regression * cargo fmt --------- Co-authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@gmail.com>
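Several bullets above (quotes in field names and values, tabs, newlines, backspace) are about JSON string escaping. A minimal sketch of that escaping step, illustrative rather than Comet's actual `to_json` implementation:

```rust
// Quote and escape a string for embedding in JSON output, covering the
// characters the PR calls out: quotes, backslashes, tabs, newlines, backspace.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\t' => out.push_str("\\t"),
            '\n' => out.push_str("\\n"),
            '\u{0008}' => out.push_str("\\b"), // backspace
            _ => out.push(c),
        }
    }
    out.push('"');
    out
}

fn main() {
    assert_eq!(escape_json("a\"b"), "\"a\\\"b\"");
    assert_eq!(escape_json("x\ty"), "\"x\\ty\"");
}
```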
Commit: | 5f41063 | |
---|---|---|
Author: | Andy Grove | |
Committer: | GitHub |
feat: Implement basic version of string to float/double/decimal (#870) * basic version of string to float/double/decimal * docs * update benches * update benches * rust doc
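A "basic version" of Spark-compatible string-to-double casting comes down to trimming the input and, in non-ANSI mode, producing NULL rather than an error for malformed values. A simplified sketch (Spark's full rules for special values and decimals are not modeled here):

```rust
// Non-ANSI Spark cast semantics, simplified: trim, parse, NULL on failure.
fn cast_string_to_double(s: &str) -> Option<f64> {
    s.trim().parse::<f64>().ok()
}

fn main() {
    assert_eq!(cast_string_to_double("  3.5 "), Some(3.5));
    assert_eq!(cast_string_to_double("abc"), None);
}
```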
Commit: | e8d94d6 | |
---|---|---|
Author: | Emil Ejbyfeldt | |
Committer: | GitHub |
feat: Optimize CreateNamedStruct preserve dictionaries (#789) * feat: Optimize CreateNamedStruct preserve dictionaries. Instead of serializing the return data_type we just serialize the field names. The original implementation was done that way because it led to a slightly simpler implementation, but it is clear from #750 that this was the wrong choice and leads to issues with the physical data_type. * Support dictionary data_types in StructVector and MapVector * Add length checks
Commit: | 1d1479e | |
---|---|---|
Author: | Andy Grove | |
Committer: | GitHub |
feat: Improve native explain (#795)
Commit: | 5b5142b | |
---|---|---|
Author: | Adam Binford | |
Committer: | GitHub |
feat: Add GetStructField expression (#731) * Add GetStructField support * Add custom types to CometBatchScanExec * Remove test explain * Rust fmt * Fix struct type support checks * Support converting StructArray to native * fix style * Attempt to fix scalar subquery issue * Fix other unit test * Cleanup * Default query plan supporting complex type to false * Migrate struct expressions to spark-expr * Update shouldApplyRowToColumnar comment * Add nulls to test * Rename to allowStruct * Add DataTypeSupport trait * Fix parquet datatype test
Commit: | e33d560 | |
---|---|---|
Author: | Andy Grove | |
Committer: | GitHub |
feat: Implement basic version of RLIKE (#734)
Commit: | de8c55e | |
---|---|---|
Author: | Andy Grove | |
Committer: | GitHub |
chore: Move protobuf files to separate crate (#661) * move protobuf files to separate crate * format * revert accidental delete * Update native/proto/README.md Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> --------- Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Commit: | a8433b5 | |
---|---|---|
Author: | Andy Grove | |
Committer: | GitHub |
chore: Convert Rust project into a workspace (#637) * convert into workspace project * update GitHub actions * update Makefile * fix regression * update target path * update protobuf path in pom.xml * update more paths
Commit: | d8fef2b | |
---|---|---|
Author: | Emil Ejbyfeldt | |
Committer: | GitHub |
feat: Add support for CreateNamedStruct (#620) * Add support for CreateNamedStruct * Exclude HashJoins using struct keys as this is currently unsupported in datafusion * Add CreateNamedStruct to docs * Add message
Commit: | 0d2fcbc | |
---|---|---|
Author: | Huaxin Gao | |
Committer: | GitHub |
feat: Initial support for Window function (#599) * feat: initial support for Window function Co-authored-by: comphead <comphead@ukr.net> * fix style * fix style * address comments * abs()->unsigned_abs() * address comments --------- Co-authored-by: comphead <comphead@ukr.net>
Commit: | 2e4ec70 | |
---|---|---|
Author: | Huaxin Gao | |
Committer: | GitHub |
feat: add nullOnDivideByZero for Covariance (#564) * feat: add nullOnDivideByZero for Covariance * change test * fix style * refactor test to reuse code * Trigger Build
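The `nullOnDivideByZero` flag decides what sample covariance returns when the divisor `n - 1` is zero: NULL by default, or NaN under Spark's legacy statistical-aggregate behavior. A sketch of that finishing step (illustrative only, not Comet's accumulator):

```rust
// Finishing step for covar_samp: co_moment / (count - 1), with the flag
// choosing NULL (None) vs NaN when count <= 1 makes the divisor zero.
fn covar_samp(co_moment: f64, count: f64, null_on_divide_by_zero: bool) -> Option<f64> {
    if count <= 1.0 {
        if null_on_divide_by_zero {
            None // SQL NULL
        } else {
            Some(f64::NAN) // legacy behavior
        }
    } else {
        Some(co_moment / (count - 1.0))
    }
}

fn main() {
    assert_eq!(covar_samp(6.0, 4.0, true), Some(2.0));
    assert_eq!(covar_samp(0.0, 1.0, true), None);
    assert!(covar_samp(0.0, 1.0, false).unwrap().is_nan());
}
```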
Commit: | e07f24c | |
---|---|---|
Author: | Pablo Langa | |
Committer: | GitHub |
feat: Support Ansi mode in abs function (#500) * change proto msg * QueryPlanSerde with eval mode * Move eval mode * Add abs in planner * CometAbsFunc wrapper * Add error management * Add tests * Add license * spotless apply * format * Fix clippy * error msg for all spark versions * Fix benches * Use enum to ansi mode * Fix format * Add more tests * Format * Refactor * refactor * fix merge * fix merge
Commit: | 32c61f5 | |
---|---|---|
Author: | Liang-Chi Hsieh | |
Committer: | GitHub |
feat: Add HashJoin support for BuildRight (#437) * feat: Add HashJoin support for BuildRight * Enable test * Update plan stability * More * Update plan stability * Refine * Fix * Update diffs to fix Spark tests * Update diff * Update Spark 3.4.3 diff * Use BuildSide enum * Update diffs * Update plan stability for Spark 4.0 * Update q5a plan
Commit: | f75aeef | |
---|---|---|
Author: | Andy Grove | |
Committer: | GitHub |
docs: Improve user documentation for supported operators and expressions (#520) * Improve documentation about supported operators and expressions * Improve documentation about supported operators and expressions * more notes * Add more supported expressions * rename protobuf Negative to UnaryMinus for consistency * format * remove duplicate ASF header * SMJ not disabled by default * Update docs/source/user-guide/operators.md Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> * Update docs/source/user-guide/operators.md Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> * remove RLike --------- Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Commit: | edd63ef | |
---|---|---|
Author: | Vipul Vaibhaw | |
Committer: | GitHub |
feat: Implement ANSI support for UnaryMinus (#471) * checking for invalid inputs for unary minus * adding eval mode to expressions and proto message * extending evaluate function for negative expression * remove print statements * fix format errors * removing units * fix clippy errors * expect instead of unwrap, map_err instead of match and removing Float16 * adding test case for unary negative integer overflow * added a function to make the code more readable * adding comet sql ansi config * using withTempDir and checkSparkAnswerAndOperator * adding macros to improve code readability * using withParquetTable * adding scalar tests * adding more test cases and bug fix * using failonerror and removing eval_mode * bug fix * removing checks for float64 and monthdaynano * removing checks of float and monthday nano * adding checks while evalute bounds * IntervalDayTime splitting i64 and then checking * Adding interval test * fix ci errors
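The "checking for invalid inputs" bullet above is about the one case where integer negation can overflow: the minimum value of a signed type has no positive counterpart, so ANSI mode must raise an error instead of wrapping. A minimal sketch for `i32` (Comet's real code covers more types, including the interval handling mentioned above):

```rust
// ANSI-mode unary minus: error on overflow instead of two's-complement wrap.
fn unary_minus_ansi(v: i32) -> Result<i32, String> {
    v.checked_neg()
        .ok_or_else(|| format!("arithmetic overflow: -({v})"))
}

fn main() {
    assert_eq!(unary_minus_ansi(5), Ok(-5));
    // -i32::MIN does not fit in i32, so ANSI mode must fail here.
    assert!(unary_minus_ansi(i32::MIN).is_err());
}
```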
Commit: | 3a83133 | |
---|---|---|
Author: | Prashant K. Sharma | |
Committer: | GitHub |
feat: Use enum to represent CAST eval_mode in expr.proto (#415) * Fixes Issue #361: Use enum to represent CAST eval_mode in expr.proto * Update expr.proto and QueryPlanSerde.scala for handling enum EvalMode for cast message * issue 361 fixed type issue for eval_mode in planner.rs * issue 361 refactored QueryPlanSerde.scala for enhanced type safety and localization compliance, including a new string-to-enum conversion function and improved import organization. * Updated planner.rs, expr.proto, QueryPlanSerde.scala for enum support * Update spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala --------- Co-authored-by: Prashant K. Sharma <prakush@foundation.local> Co-authored-by: Andy Grove <andygrove73@gmail.com>
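Spark's cast supports three evaluation modes (LEGACY, TRY, ANSI), and this change moved that from a loosely-typed representation to a proto enum. The fragment below sketches the shape of such a change; it is illustrative, not the actual `expr.proto` definitions:

```proto
// Illustrative only -- names and field numbers are hypothetical.
enum EvalMode {
  LEGACY = 0; // permissive default semantics
  TRY = 1;    // return NULL on invalid input
  ANSI = 2;   // raise an error on invalid input
}

message Cast {
  Expr child = 1;
  EvalMode eval_mode = 2;
}
```

An enum gives compile-time type safety on both the Scala serde side and the Rust planner side, which is what the commit's references to type-safety fixes in `QueryPlanSerde.scala` and `planner.rs` are about.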
Commit: | 6d23e28 | |
---|---|---|
Author: | Huaxin Gao | |
Committer: | GitHub |
feat: correlation support (#456) * feat: correlation support * fmt * remove un-used import * address comments * address comment --------- Co-authored-by: Huaxin Gao <huaxin.gao@apple.com>
Commit: | c40bc7c | |
---|---|---|
Author: | Huaxin Gao | |
Committer: | GitHub |
feat: Supports Stddev (#348) * feat: Supports Stddev * fix fmt * update q39a.sql.out * address comments * disable q93a and q93b for now * address comments --------- Co-authored-by: Huaxin Gao <huaxin.gao@apple.com>
Commit: | 49bf503 | |
---|---|---|
Author: | Huaxin Gao | |
Committer: | GitHub |
feat: Support Variance (#297) * feat: Support Variance * Add StatisticsType in expr.poto * add explainPlan info and fix fmt * remove iunnecessary cast * remove unused import --------- Co-authored-by: Huaxin Gao <huaxin.gao@apple.com>
Commit: | 7fa23d5 | |
---|---|---|
Author: | Andy Grove | |
Committer: | GitHub |
feat: Support ANSI mode in CAST from String to Bool (#290)
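Spark's string-to-boolean cast trims and lowercases the input and accepts a fixed set of tokens; ANSI mode turns anything else into an error rather than NULL. A simplified sketch of those semantics (illustrative, not Comet's implementation):

```rust
// Spark-style string-to-boolean cast, simplified. The accepted token sets
// mirror Spark's; the error message text is made up for illustration.
fn cast_string_to_bool(s: &str, ansi: bool) -> Result<Option<bool>, String> {
    match s.trim().to_lowercase().as_str() {
        "t" | "true" | "y" | "yes" | "1" => Ok(Some(true)),
        "f" | "false" | "n" | "no" | "0" => Ok(Some(false)),
        other => {
            if ansi {
                Err(format!("invalid input syntax for type boolean: {other}"))
            } else {
                Ok(None) // non-ANSI cast yields NULL
            }
        }
    }
}

fn main() {
    assert_eq!(cast_string_to_bool(" TRUE ", false), Ok(Some(true)));
    assert_eq!(cast_string_to_bool("maybe", false), Ok(None));
    assert!(cast_string_to_bool("maybe", true).is_err());
}
```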
Commit: | 4710d62 | |
---|---|---|
Author: | Huaxin Gao | |
Committer: | GitHub |
feat: Port Datafusion Covariance to Comet (#234) * feat: Port Datafusion Covariance to Comet * feat: Port Datafusion Covariance to Comet * fmt * update EXPRESSIONS.md * combine COVAR_SAMP and COVAR_POP * fix fmt * address comment --------- Co-authored-by: Huaxin Gao <huaxin.gao@apple.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Commit: | ce38812 | |
---|---|---|
Author: | Liang-Chi Hsieh | |
Committer: | GitHub |
feat: Support HashJoin operator (#194) * feat: Support HashJoin * Add comment * Clean up test * Fix join filter * Fix clippy * Use consistent function with sort merge join * Add note about left semi and left anti joins * For review * Merging * Move tests * Add a function to parse join parameters
Commit: | 8aab44c | |
---|---|---|
Author: | Liang-Chi Hsieh | |
Committer: | GitHub |
feat: Support sort merge join (#178) * feat: Support sort merge join * Update PlanStability * Update Spark diff * For review * Fix format * For review * Add CometJoinSuite * Remove uuid * Fix diff
Commit: | 969f683 | |
---|---|---|
Author: | advancedxy | |
Committer: | GitHub |
feat: Support BloomFilterMightContain expr (#179)
Commit: | ed5de4b | |
---|---|---|
Author: | Huaxin Gao | |
Committer: | GitHub |
feat: Support bitwise aggregate functions (#197)
Commit: | a131c44 | |
---|---|---|
Author: | Liang-Chi Hsieh | |
Committer: | GitHub |
fix: Final aggregation should not bind to the input of partial aggregation (#155) This patch adds the check of the index of bound reference. The aggregate expressions of final aggregation are not bound to the input of partial aggregation anymore but sent to native side as unbound expressions.
Commit: | fbe7f80 | |
---|---|---|
Author: | Huaxin Gao | |
Committer: | GitHub |
feat: Support `First`/`Last` aggregate functions (#97) Co-authored-by: Huaxin Gao <huaxin.gao@apple.com>
Commit: | e738d46 | |
---|---|---|
Author: | Liang-Chi Hsieh | |
Committer: | GitHub |
feat: Nested map support for columnar shuffle (#51)
Commit: | c5aee56 | |
---|---|---|
Author: | Liang-Chi Hsieh | |
Committer: | GitHub |
feat: Add native shuffle and columnar shuffle (#30) * feat: Add native shuffle and columnar shuffle * For review
Commit: | 383c8fd | |
---|---|---|
Author: | Chao Sun | |
Committer: | GitHub |
Initial PR (#1) * Initial PR * Add license header to Makefile. Remove unnecessary file core/.lldbinit. * Update DEBUGGING.md * add license and address comments --------- Co-authored-by: Liang-Chi Hsieh <liangchi@apple.com> Co-authored-by: Kazuyuki Tanimura <ktanimura@apple.com> Co-authored-by: Steve Vaughan Jr <s_vaughan@apple.com> Co-authored-by: Huaxin Gao <huaxin_gao@apple.com> Co-authored-by: Parth Chandra <parthc@apple.com> Co-authored-by: Oleksandr Voievodin <ovoievodin@apple.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>