Proto commits in kwai/blaze

These are the commits in which the Protocol Buffers files changed (only the last 100 relevant commits are shown):

Commit:8dfef9f
Author:gaoouyenizi
Committer:GitHub

add expr string to SparkUDFWrapper (#967) Co-authored-by: lihao29 <lihao29@kuaishou.com>

The documentation is generated from this commit.

Commit:97ff610
Author:lihao29

add expr string to SparkUDFWrapper

The documentation is generated from this commit.

Commit:f6de2b6
Author:Harvey Yue
Committer:GitHub

Expect sha2 function result will be consistent with spark (#966)

Commit:6845c44
Author:Zhang Li
Committer:GitHub

supports WindowGroupLimitExec (#957) Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:f4e26b5
Author:zhangli20
Committer:zhangli20

supports WindowGroupLimitExec

Commit:6e9098e
Author:Zhang Li
Committer:GitHub

code refactoring and bug fixes (#952)
* fix rss writer: do not commit when native shuffle writing is not succeeded.
  fix unsafe lifetime error in spark UDAF wrapper.
  fix hash join and aggregate exec metrics when failed back to sorted
  refactor UnionExec to support PartitionerAwareUnionRDD.
  refactor celeborn reader: remove hard copied code from celeborn.
* fix EmptyRDD error with empty local table scan
---------
Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:c4c57da
Author:zhangli20
Committer:zhangli20

fix rss writer: do not commit when native shuffle writing is not succeeded. fix unsafe lifetime error in spark UDAF wrapper. fix hash join and aggregate exec metrics when failed back to sorted refactor UnionExec to support PartitionerAwareUnionRDD. refactor celeborn reader: remove hard copied code from celeborn.

Commit:c343f0b
Author:zhangli20
Committer:zhangli20

refactor UnionExec to support PartitionerAwareUnionRDD. fix unsafe lifetime error in spark UDAF wrapper.

Commit:26eef94
Author:zhangli20
Committer:zhangli20

refactor UnionExec to support PartitionerAwareUnionRDDPartition

Commit:3316f0d
Author:Zhang Li
Committer:GitHub

convert scalar value using arrow ipc, fixing unsupported map type (#938) fix GetIndexField with scalar value fix arrow writer with map type Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:cb655ec
Author:zhangli20
Committer:zhangli20

convert scalar value using arrow ipc, fixing unsupported map type fix GetIndexField with scalar value fix arrow writer with map type

Commit:5952072
Author:Zhang Li
Committer:GitHub

NativeConverters adds aggregate function return type (#930) Coalesce adds params casting Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:c918c08
Author:zhangli20
Committer:zhangli20

NativeConverters adds aggregate function return type Coalesce adds params casting

Commit:3c40f39
Author:Zhang Li
Committer:GitHub

rewrite UnionExec and support auto type casting (#927) Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:660b30c
Author:zhangli20
Committer:zhangli20

rewrite UnionExec and support auto type casting

Commit:09ce344
Author:Zhang Li
Committer:GitHub

ProjectExec adds cast automatically when data types not matched (#916) Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:de4c260
Author:zhangli20
Committer:zhangli20

ProjectExec adds cast automatically when data types not matched

Commit:2c2f0a9
Author:gy11233
Committer:GitHub

Supports UDAF and other aggregate functions not implemented (#848)

Commit:c5a1ea8
Author:guoying06

format name and schema

Commit:9d88168
Author:guoying06

optimize

Commit:e3ba97b
Author:guoying06
Committer:guoying06

Dev udaf new merge NativeConverters

Commit:f8f2097
Author:guoying06
Committer:guoying06

init DeclarativeAggregate udaf Resolving conflicted_file ArrowFFIExporter

Commit:a110e53
Author:gy11233
Committer:GitHub

Supports range partitioning (#734) Co-authored-by: guoying06 <guoying06@kuaishou.com>

Commit:d53d1b3
Author:guoying06

update range partition

Commit:01108e3
Author:guoying06

update range partition

Commit:9e459f8
Author:gy11233
Committer:GitHub

Dev repartitioning (#693)
* add round robin shuffle
* update roundrobin
* update roundrobin
* update roundrobin
* update roundrobin
* update roundrobin log info
* update roundrobin
* update roundrobin
* update roundrobin
* update roundrobin
* update roundrobin
* update roundrobin
* update partition proto
* update partition proto
* update sort before round robin
* update sort before round robin
* update round robin
* update round robin
* update sort_batch_by_partition_id test
* update sort_batch_by_partition_id test
---------
Co-authored-by: guoying06 <guoying06@kuaishou.com>

Commit:b83338f
Author:guoying06

update partition proto

Commit:b9af073
Author:guoying06

update partition proto

Commit:009d904
Author:Zhang Li
Committer:GitHub

optimize bloom filter (#620) Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:dd59e4d
Author:zhangli20
Committer:zhangli20

optimize bloom filter

Commit:4027465
Author:Zhang Li
Committer:GitHub

update to datafusion-v42/arrow-v53/arrow-java-v16 (#574) Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:6a68b11
Author:zhangli20
Committer:zhangli20

update to datafusion-v42/arrow-v53/arrow-java-v16

Commit:a0d4b20
Author:Harvey Yue
Committer:GitHub

support native scan orc format (#544) Co-authored-by: Zhang Li <richselian@gmail.com>

Commit:c2cc15f
Author:Zhang Li
Committer:GitHub

supports bloom filter join (#532) fix decimal issue supports spark351 Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:a023a41
Author:zhangli20
Committer:zhangli20

supports bloom filter join fix decimal issue supports spark351

Commit:baa06d6
Author:Zhang Li
Committer:GitHub

supports native shuffled hash join (#509)
* supports native shuffled hash join
* supports distinct hash map during semi/anti join
---------
Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:a153dac
Author:zhangli20
Committer:zhangli20

supports native shuffled hash join

Commit:5fff824
Author:Zhang Li
Committer:GitHub

release version 3.0.0 (#506)
* debug
* supports using spark.io.compression.codec for shuffle/broadcast compression
* use cached parquet metadata
* remove out-of-date codes
* refactor native broadcast to avoid duplicated broadcast jobs
* fix in_list conversion in from_proto.rs
* supports date type casting
* release version 3.0.0
  refactor join implementations to support existence joins and BHJ building hash map on driver side.
  supports spark333 batch shuffle reading.
  update rust-toolchain to latest nightly version.
  other minor improvements.
  update docs.
---------
Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:8e4b4cd
Author:zhangli20
Committer:zhangli20

release version 3.0.0 refactor join implementations to support existence joins and BHJ building hash map on driver side. supports spark333 batch shuffle reading. update rust-toolchain to latest nightly version. other minor improvements. update docs.

Commit:ca48cd1
Author:zhangli20
Committer:zhangli20

release version 3.0.0

Commit:5e33279
Author:zhangli20
Committer:zhangli20

joins refactoring

Commit:9fbff78
Author:zhangli20
Committer:zhangli20

joins refactoring

Commit:abd0308
Author:zhangli20
Committer:zhangli20

supports BHJ in blaze

Commit:a905014
Author:zhangli20
Committer:zhangli20

refactor sort-merge join

Commit:696a6a3
Author:Zhang Li
Committer:GitHub

supports UDTF fallbacks to spark (#483) Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:438de89
Author:zhangli20
Committer:zhangli20

supports UDTF fallbacks to spark

Commit:753be42
Author:Zhang Li
Committer:GitHub

supports brickhouse UDF/UDAFs: array_union, collect, combine_unique (#479) Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:f064b9c
Author:zhangli20
Committer:zhangli20

supports brickhouse UDF/UDAFs: array_union, collect, combine_unique

Commit:480148a
Author:Zhang Li
Committer:GitHub

supports parquet sink with zstd compression level. (#471) use stable sorting for dynamic partitions. fix parquet sink metrics. code formatting. Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:59fae25
Author:zhangli20
Committer:zhangli20

supports parquet sink with zstd compression level. use stable sorting for dynamic partitions. fix parquet sink metrics. code formatting.

Commit:c84ad6a
Author:Zhang Li
Committer:GitHub

update datafusion/arrow version to v36/v50. (#428) delegate sort-merge join to datafusion implementation. Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:b7beb9e
Author:zhangli20
Committer:zhangli20

update datafusion/arrow version to v36/v50. delegate sort-merge join to datafusion implementation.

Commit:a218cd6
Author:Zhang Li
Committer:GitHub

fix acc loader/saver for AggDynScalar (#413) Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:f9657b9
Author:zhangli20
Committer:zhangli20

fix acc loader/saver for AggDynScalar

Commit:07cc20d
Author:Zhang Li
Committer:GitHub

implement get_json udf with sonic-rs (with serde-json fail-backing). (#399) implement json_tuple udtf. get_json flats nested arrays (keep consistent with hive UDFJson). Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:093747d
Author:zhangli20
Committer:zhangli20

implement get_json udf with sonic-rs (with serde-json fail-backing). implement json_tuple udtf. get_json flats nested arrays (keep consistent with hive UDFJson).

Commit:aa89239
Author:zhangli20
Committer:zhangli20

implement get_json udf with sonic-rs (with serde-json fail-backing). implement json_tuple udtf. get_json flats nested arrays (keep consistent with hive UDFJson).

Commit:5dda9a0
Author:zhangli20
Committer:zhangli20

supports compressing multiple batches in ipc format. improved batch_serde format for better compression. implement radix tournament tree for linear time k-way merging. implement accumulator store. other minor refactoring and optimizing.

Commit:e7d9082
Author:Zhang Li
Committer:GitHub

supports compressing multiple batches in ipc format. (#392) improved batch_serde format for better compression. implement radix tournament tree for linear time k-way merging. implement accumulator store. other minor refactoring and optimizing. Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:7d0b947
Author:zhangli20
Committer:zhangli20

supports compressing multiple batches in ipc format. improved batch_serde format for better compression. implement radix tournament tree for linear time k-way merging. implement accumulator store. other minor refactoring and optimizing.

Commit:25c2ce3
Author:zhangli20
Committer:zhangli20

supports compressing multiple batches in ipc format. improved batch_serde format for better compression. implement radix tournament tree for linear time k-way merging. implement accumulator store. other minor refactoring and optimizing.

Commit:53d2cca
Author:Zhang Li
Committer:GitHub

close FSDataOutputStream after writing finished (#363) supports writing parquet table with dynamic partitions Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:b557acf
Author:zhangli20
Committer:zhangli20

close FSDataOutputStream after writing finished supports writing parquet table with dynamic partitions

Commit:50fd3d0
Author:Zhang Li
Committer:GitHub

minor fixes and refactoring (#333)
* code refactoring
  supports partial aggregate skipping
  fix incorrect assertion when converting concat_ws function
  fix ffi "not all nodes and buffers were consumed" issues
* supports nested type hashing
  supports nested type array() function
* use arrow snapshot version
---------
Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:4947ed5
Author:zhangli20
Committer:zhangli20

code refactoring supports partial aggregate skipping fix incorrect assertion when converting concat_ws function fix ffi "not all nodes and buffers were consumed" issues

Commit:3440789
Author:Zhang Li
Committer:GitHub

supports native broadcast nested loop join (#282) Co-authored-by: zhangli20 <zhangli20@kuaishou.com>

Commit:92cc27b
Author:zhangli20
Committer:zhangli20

supports native broadcast nested loop join

Commit:3b48eaa
Author:zhangli20
Committer:zhangli20

add shims spark-3.3.3

Commit:f6d50a1
Author:zhangli20
Committer:zhangli20

refactor SMJ to support post filters add BlazeConf

Commit:a5c4cdc
Author:zhangli20
Committer:zhangli20

native parquet writer

Commit:c29917b
Author:lihao29
Committer:zhangli20

supports combining filter and projection exec. supports cached exprs evaluator.

Commit:fb3d74d
Author:zhangli20
Committer:zhangli20

supports native first() aggregator. supports scalar map value. refactor agg_buf by adding AggDynScalar.

Commit:3214fd5
Author:zhangli20
Committer:zhangli20

convert If to CaseWhen expr

Commit:a3711ff
Author:zhangli20
Committer:zhangli20

use sc_and/or from datafusion v25-blaze

Commit:85de466
Author:zhangli20
Committer:zhangli20

remove datafusion-ext-file-format and move parquet-exec to datafusion-ext-plans

Commit:7ba14bb
Author:zhangli20
Committer:zhangli20

supports converting literal array values

Commit:dd42e51
Author:zhangli20
Committer:zhangli20

new broadcast join

Commit:7a4bac0
Author:zhangli20
Committer:zhangli20

supports name_batch in named_struct

Commit:6383958
Author:lihao29
Committer:zhangli20

Map entries

Commit:f7a25ed
Author:zhangli20
Committer:zhangli20

supports all timestamp units use postcard to serialize/deserialize data types

Commit:737ab73
Author:zhangli20
Committer:zhangli20

next KDev_MR_link:https://ksurl.cn/SMbgH7by

Commit:8f37ecc
Author:zhangli20
Committer:zhangli20

reduce partitions size in parquet scan

Commit:beb0110
Author:zhangli20
Committer:zhangli20

implements native GenerateExec

Commit:bc312b3
Author:lihao29
Committer:zhangli20

1. update datafusion version from 15.0.0 to 20.0.0
2. update arrow version from 28.0.0 to 34.0.0
3. update parquet version from 28.0.0 to 34.0.0
4. use new method to convert like expr
5. concat_batch method fix (skip name check)
6. decimal compute in add/minus/multiply/divide cast to double and then pass to datafusion

Commit:08f71d8
Author:zhangli20
Committer:zhangli20

remove SortMergeJoin.null_equals_null (always false in spark)

Commit:a8e3df7
Author:zhangli20
Committer:zhangli20

supports native SortAggregate

Commit:32c8264
Author:lihao29
Committer:zhangli20

support rss shuffle for blaze

Commit:3ee8c37
Author:zhangli20
Committer:zhangli20

supports spillable aggregation

Commit:ed5f1e4
Author:zhangli20
Committer:zhangli20

update datafusion to 15.0.0

Commit:1475c19
Author:zhangli20
Committer:zhangli20

supports native expand exec

Commit:8dcf25a
Author:zhangli20
Committer:zhangli20

implement short-circuiting logical operators fix aggregation transforming with inconvertible result expressions

Commit:721a355
Author:zhangli20
Committer:zhangli20

add native if expression

Commit:fe7e1e2
Author:zhangli20
Committer:zhangli20

coalesce batches before shuffle write and ipc write

Commit:ded3178
Author:zhangli20
Committer:zhangli20

refactor: split out datafusion-ext-common, datafusion-ext-plans

Commit:2992bf2
Author:zhangli20
Committer:zhangli20

refactor: rename package plan-serde to blaze-serde

Commit:7314bcd
Author:zhangli20
Committer:zhangli20

update arrow/datafusion to latest revision

Commit:7d24bd8
Author:lihao29
Committer:zhangli20

add string expr

Commit:42fb286
Author:zhangli20
Committer:zhangli20

refactor SparkFallbackToJvmExpr to SparkExpressionWrapperExpr

Commit:5893a1a
Author:zhangli20
Committer:zhangli20

implement native limits exec

Commit:24f4066
Author:lihao29
Committer:zhangli20

fix error for partition column data type does not match