These are the commits in which the Protocol Buffers files changed (only the last 100 relevant commits are shown):
Commit: | 94dafda | |
---|---|---|
Author: | luyang |
refine
The documentation is generated from this commit.
Commit: | 1809053 | |
---|---|---|
Author: | Luyang | |
Committer: | GitHub |
Merge branch 'master' into dev_refactor_xccl_primitive
Commit: | 5c62322 | |
---|---|---|
Author: | Jianhua Zheng | |
Committer: | GitHub |
add support for xpu/KunLunXin device (#10540)
Commit: | c9b7811 | |
---|---|---|
Author: | luyang |
raw impl
Commit: | cce5d8c | |
---|---|---|
Author: | Jianhua Zheng |
add support for xpu/KunLunXin device
Commit: | b11f102 | |
---|---|---|
Author: | Luyang | |
Committer: | GitHub |
Support Huawei Ascend910b chip (#10386) adaptation of huawei ascend910b chip on oneflow: - https://github.com/Oneflow-Inc/OneTeam/issues/2181 --------- Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>
Commit: | ae52678 | |
---|---|---|
Author: | Xiaoyu Xu | |
Committer: | GitHub |
Plan separation compile (#9920) ## Background When the number of ranks is large, the master compiles the task nodes of every rank: compilation is sequential and slow, and the plan is large (possibly over 2 GB), so the total data to send can reach hundreds of GB and transfer is too slow. Each rank must therefore compile its own execution plan independently. ## Benchmarks - Simulated n-GPU data parallelism: https://github.com/Oneflow-Inc/OneTeam/issues/1679#issuecomment-1282195951 - Measurements: https://github.com/Oneflow-Inc/OneTeam/issues/1944 ## Implementation summary https://github.com/Oneflow-Inc/OneTeam/issues/1791 --------- Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: lixinqi <lixinqi0703106@163.com> Co-authored-by: ZZK <359521840@qq.com> Co-authored-by: guo-ran <360112263@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Ping Zhu <58718936+reygu@users.noreply.github.com> Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> Co-authored-by: Juncheng <liujuncheng1022@gmail.com> Co-authored-by: Yao Chi <later@usopp.net> Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> Co-authored-by: Luyang <flowingsun007@163.com> Co-authored-by: binbinHan <han_binbin@163.com> Co-authored-by: Shiyuan Shangguan <shiyuan@oneflow.org> Co-authored-by: yuhao <72971170+howin98@users.noreply.github.com> Co-authored-by: jackalcooper <jackalcooper@gmail.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: Zhimin Yang <76760002+small1945@users.noreply.github.com> Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: Dongche Zhang <zhang2000dc@gmail.com> Co-authored-by: leaves-zwx <kunta0932@gmail.com> Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: Peihong Liu <mosout@qq.com> Co-authored-by: Yipeng Li <jamesonli1313@gmail.com> Co-authored-by: wyg1997 <wangyinggang@foxmail.com> Co-authored-by: Liang Depeng <liangdepeng@gmail.com> Co-authored-by: Yu OuYang <xuanjiuye@gmail.com> Co-authored-by: WangYi <buaawangyi03@gmail.com> Co-authored-by: rejoicesyc 
<47683675+rejoicesyc@users.noreply.github.com> Co-authored-by: songyicheng <int.rejoice@gmail.com> Co-authored-by: QI JUN <qijun1994@hotmail.com> Co-authored-by: zhaoyongke <zhaoyongke@yeah.net> Co-authored-by: JiaKui Hu <hjk1938927583@163.com> Co-authored-by: cheng cheng <472491134@qq.com>
Commit: | a9a339b | |
---|---|---|
Author: | Xiaoyu Xu | |
Committer: | GitHub |
Plan rank compiler (#10141) Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Commit: | 4114f1a | |
---|---|---|
Author: | Houjiang Chen | |
Committer: | GitHub |
force load mlir static libs (#10275)
Commit: | b51f131 | |
---|---|---|
Author: | WangYi |
use PyMemoryFormat_New in tensor.cpp, recover stride ctor, resort and rename memory_format proto
Commit: | f2e0f90 | |
---|---|---|
Author: | Wang Yi | |
Committer: | GitHub |
Dev memory format (#10181) - This PR exports the MemoryFormat class and adds a MemoryFormat parameter to the corresponding Tensor functions. - It implements Tensor.to(memory_format), which is currently equivalent to a permute; this does not yet match torch's behavior (torch does not permute, it only rewrites the strides), and the next PR will align them. - PR https://github.com/Oneflow-Inc/oneflow/pull/9959 previously exported MemoryFormat, but only as an empty interface and via pybind11; this PR removes the pybind11 code and exports it with CPython instead. --------- Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
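For context, the stride-only vs. data-moving distinction described above can be illustrated with NumPy (a sketch only; it does not use oneflow's actual MemoryFormat API):

```python
import numpy as np

# An NCHW-shaped array. torch's to(memory_format=channels_last) only rewrites
# strides, while the behavior described in this PR actually moves the data.
x = np.arange(60).reshape(1, 3, 4, 5)

view = np.transpose(x, (0, 2, 3, 1))   # stride-only: a view, no data movement
assert view.shape == (1, 4, 5, 3)
assert not view.flags['C_CONTIGUOUS']
assert np.shares_memory(view, x)       # still backed by x's buffer

moved = np.ascontiguousarray(view)     # actually relayouts the data (channels-last)
assert moved.flags['C_CONTIGUOUS']
assert not np.shares_memory(moved, x)  # data was copied
assert (moved == view).all()           # same values, different physical layout
```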
Commit: | e929c63 | |
---|---|---|
Author: | WangYi |
Merge branch 'dev_memory_format' into dev_to_memoryformat
Commit: | fef5c81 | |
---|---|---|
Author: | Houjiang Chen | |
Committer: | GitHub |
Merge branch 'master' into dev_memory_format
Commit: | 951f631 | |
---|---|---|
Author: | hjchen2 |
merge upstream master
Commit: | 8b0e9e5 | |
---|---|---|
Author: | binbinHan | |
Committer: | GitHub |
refactor_collective_boxing_executor_backend (#10082) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Commit: | 129a4c2 | |
---|---|---|
Author: | chengtbv |
nccl_use_compute_stream conf failed
Commit: | 1826e96 | |
---|---|---|
Author: | Houjiang Chen | |
Committer: | GitHub |
develop opencl backend and support pointwise binary ops (#150) * fix_eager_comm_mgr_use_error * auto format by CI * develop opencl * update * update * update * fix * implement event record * add opencl primitive memcpy and memset * disable using bin allocator since opencl memory could not be split * [opencl] support pointwise binary operation * add opencl binary ops test * comes with opencl cpp header * fix building with low version opencl --------- Co-authored-by: clackhan <han_binbin@163.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Commit: | 15b36eb | |
---|---|---|
Author: | yuhao |
try
Commit: | d03d4ca | |
---|---|---|
Author: | WangYi |
modify order in proto, add support non-contiguous in ops.td
Commit: | f3c3cf3 | |
---|---|---|
Author: | Wang Yi | |
Committer: | GitHub |
Merge branch 'master' into dev_memory_format
Commit: | 4dd7144 | |
---|---|---|
Author: | WangYi |
Merge branch 'master' into dev_to_memoryformat
Commit: | ff34a03 | |
---|---|---|
Author: | WangYi |
kNCHW -> kContiguous, kNHWC -> kChannelsLast, remove TensorLayout
Commit: | 709c98d | |
---|---|---|
Author: | daquexian |
update Signed-off-by: daquexian <daquexian566@gmail.com>
Commit: | c5a4062 | |
---|---|---|
Author: | hjchen2 | |
Committer: | hjchen2 |
fix
Commit: | 6d0c0ad | |
---|---|---|
Author: | Houjiang Chen | |
Committer: | GitHub |
add mlu device type (#10164)
Commit: | abc9c0f | |
---|---|---|
Author: | hjchen2 |
implement memory format
Commit: | cc0b8f1 | |
---|---|---|
Author: | yuhao | |
Committer: | GitHub |
speed up developing ofmempool in mlir codegen (#10168) Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: WangXingyu <1908865287@qq.com> Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Peihong Liu <mosout@qq.com> Co-authored-by: zhao di <54237695+Alokia@users.noreply.github.com> Co-authored-by: Juncheng <liujuncheng1022@gmail.com> Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>
Commit: | ffe7ec5 | |
---|---|---|
Author: | Xiaoyu Xu | |
Committer: | GitHub |
Task to/from proto (#10119) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Commit: | 8904522 | |
---|---|---|
Author: | WangYi |
support memory_format by cherry-pick https://github.com/Oneflow-Inc/oneflow-cambricon/pull/122
Commit: | 140ef45 | |
---|---|---|
Author: | WangXingyu | |
Committer: | GitHub |
Add meta device and skip init function (#10008) Introduces the meta device and the skip-init feature; see https://github.com/Oneflow-Inc/OneTeam/issues/1951 for details. --------- Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Peihong Liu <mosout@qq.com> Co-authored-by: zhao di <54237695+Alokia@users.noreply.github.com> Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
Commit: | 298aea4 | |
---|---|---|
Author: | strint |
rm useless
Commit: | b0c7ad0 | |
---|---|---|
Author: | strint |
rm useless change
Commit: | d7b7594 | |
---|---|---|
Author: | strint |
add rank compiler
Commit: | 5a7f554 | |
---|---|---|
Author: | Xiaoyu Xu | |
Committer: | GitHub |
Plan sep compile merge rm collective boxing (#10114)
Commit: | 108891e | |
---|---|---|
Author: | strint |
merge upstream
Commit: | 49c8d18 | |
---|---|---|
Author: | Xiaoyu Xu | |
Committer: | GitHub |
Update task_edge.proto
Commit: | db84632 | |
---|---|---|
Author: | Xiaoyu Xu | |
Committer: | GitHub |
Update boxing_task_graph.proto
Commit: | 62dee17 | |
---|---|---|
Author: | strint |
add task to/from proto
Commit: | 8731999 | |
---|---|---|
Author: | Xiaoyu Xu | |
Committer: | GitHub |
Revert "rm collective boxing in seperation compile" (#10113) Reverts Oneflow-Inc/oneflow#10112
Commit: | e9d9c5b | |
---|---|---|
Author: | Xiaoyu Xu | |
Committer: | GitHub |
rm collective boxing in seperation compile (#10112)
Commit: | b1e86f6 | |
---|---|---|
Author: | cheng cheng | |
Committer: | GitHub |
TaskNode::order_in_chain (#10102) Splits https://github.com/Oneflow-Inc/oneflow/pull/9909 (part of separation compilation) out for merging into master. Depends on https://github.com/Oneflow-Inc/oneflow/pull/10097, which merges first. Removes order_in_graph in favor of order_in_chain: with LogicalChainPass enabled (separation compilation forces LogicalChain on), the logical chain writes order_in_logical_chain into each op, so the order is read from the logical graph and the physical graph's topology information is skipped. Also refines LightPlan's output. --------- Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Commit: | 95dd077 | |
---|---|---|
Author: | strint |
merge upstream
Commit: | 677ece4 | |
---|---|---|
Author: | chengtbv |
refine task proto id
Commit: | 36e93d0 | |
---|---|---|
Author: | cheng cheng | |
Committer: | GitHub |
remove mem_chain merge (#10097) Removes the MemChain merge logic, which has been off by default for a long time. It has several problems: 1. chain merging at the mem-chain level can hurt performance, e.g. under pipeline parallelism, merging the fw and bw chains adds control edges that stall the pipeline; 2. it previously had a bug with collective boxing + embedding on multiple machines; 3. with multiple time shapes, inserting a different time shape in the middle breaks merge correctness; 4. mem chain relied on the order in the graph, but we now assume there is no global order, only order within a chain (especially under separation compilation). Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Commit: | 9f6d21e | |
---|---|---|
Author: | chengtbv |
TaskNode::order_in_chain
Commit: | 6aea136 | |
---|---|---|
Author: | Houjiang Chen | |
Committer: | GitHub |
support memory format (#122) * add memory format * refine avg pool functor * rename MemoryFormat kDefault by kUnused * fix
Commit: | f1c2078 | |
---|---|---|
Author: | chengtbv |
remove mem_chain merge
Commit: | f135048 | |
---|---|---|
Author: | daquexian | |
Committer: | nono-Sang |
Tensor Rematerialization (a.k.a. DTR/Coop) (#9861) The core logic: 1. different devices distinguish tensors that do and do not support recomputation; 2. remat::Allocator implements choosing the lowest-cost tensor to evict (the memory-layout and eviction-strategy optimizations live here); 3. OpCallInstructionUtil::Compute implements recomputing a tensor that is needed again after being evicted. Everything else is peripheral. Usage:

```python
x1 = flow.ones(3).to('cuda+remat')  # move to a device that supports recomputation
x2 = flow.ones(3).to('cuda')        # move to a device that does not
x3 = x1 + x2  # error: the devices differ
# -----
model = ResNet50()
model.to('cuda+remat')
data, label = dataloader()
data, label = data.to('cuda+remat'), label.to('cuda+remat')
loss = model(data)  # if GPU memory fills up along the way, some tensors are dropped automatically
loss.backward()     # if dropped tensors are needed again, they are recomputed automatically
```

Some of the generic changes were already merged in earlier PRs: * https://github.com/Oneflow-Inc/oneflow/pull/9698 * https://github.com/Oneflow-Inc/oneflow/pull/9791 * https://github.com/Oneflow-Inc/oneflow/pull/9850 * https://github.com/Oneflow-Inc/oneflow/pull/9851 --------- Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Peihong Liu <mosout@qq.com>
Commit: | e24c9d8 | |
---|---|---|
Author: | strint |
merge master
Commit: | f39fda8 | |
---|---|---|
Author: | mergify[bot] | |
Committer: | GitHub |
Merge branch 'master' into wangxingyu
Commit: | 86c82db | |
---|---|---|
Author: | daquexian | |
Committer: | GitHub |
Tensor Rematerialization (a.k.a. DTR/Coop) (#9861) The core logic: 1. different devices distinguish tensors that do and do not support recomputation; 2. remat::Allocator implements choosing the lowest-cost tensor to evict (the memory-layout and eviction-strategy optimizations live here); 3. OpCallInstructionUtil::Compute implements recomputing a tensor that is needed again after being evicted. Everything else is peripheral. Usage:

```python
x1 = flow.ones(3).to('cuda+remat')  # move to a device that supports recomputation
x2 = flow.ones(3).to('cuda')        # move to a device that does not
x3 = x1 + x2  # error: the devices differ
# -----
model = ResNet50()
model.to('cuda+remat')
data, label = dataloader()
data, label = data.to('cuda+remat'), label.to('cuda+remat')
loss = model(data)  # if GPU memory fills up along the way, some tensors are dropped automatically
loss.backward()     # if dropped tensors are needed again, they are recomputed automatically
```

Some of the generic changes were already merged in earlier PRs: * https://github.com/Oneflow-Inc/oneflow/pull/9698 * https://github.com/Oneflow-Inc/oneflow/pull/9791 * https://github.com/Oneflow-Inc/oneflow/pull/9850 * https://github.com/Oneflow-Inc/oneflow/pull/9851 --------- Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Peihong Liu <mosout@qq.com>
Commit: | 9d29cad | |
---|---|---|
Author: | binbinHan | |
Committer: | GitHub |
Support boxing based on cncl (#5) * support eager boxing * suport lazy boxing * refine * make of_format * register ccl mgr * set default val of CNCL_LOG_LEVEL * register task stream for lazy cncl * set 6 as default val of CNCL_MEM_POOL_MULTI_CLIQUE_ENABLE * move all code to cambricon/collective_communication * refine * use DeviceType replace Backend * refine global_cast * refine local to global * add REGISTER_CREATE_SUB_TASK_GRAPH_BUILDER_FN * fix typo * add ccl_sub_task_graph_builders * fix typo * refine collective_boxing_sub_task with cl_sub_task_graph_builder * remove WITH_MLU * remove IsDeviceTypeCPUOrCUDA and IsDeviceTypeCPUOrMLU * extract hierarchical_sub_task_graph_builder_util * add mlu_hierarchical_sub_task_graph_builder * add test file * fix merge master error * rename GetUniqueDeviceType * refine * refine * refine * add more info in GetCnclDataType * fix local to global error * refine * implement bangc gather kernel * refine * Update bang_kernels.h * refine * refine * add new line at end * refine * refine error info * add tanh_grad kernel * refine --------- Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com> Co-authored-by: BBuf <1182563586@qq.com>
Commit: | 2871fa6 | |
---|---|---|
Author: | strint |
merge master
Commit: | 2f0968f | |
---|---|---|
Author: | nono-Sang |
Change the order of device types and return in skip_init function
Commit: | e5fef10 | |
---|---|---|
Author: | nono-Sang |
add meta device and skip init function
Commit: | e292e6e | |
---|---|---|
Author: | daquexian |
Revert "fix more clang errors" This reverts commit 4b66c14e26b4ff9065e005d43496207cdb8612b9.
Commit: | 4b66c14 | |
---|---|---|
Author: | daquexian |
fix more clang errors Signed-off-by: daquexian <daquexian566@gmail.com>
Commit: | c26365a | |
---|---|---|
Author: | daquexian |
Merge branch 'master' into sym_shape Signed-off-by: daquexian <daquexian566@gmail.com>
Commit: | 361fcd4 | |
---|---|---|
Author: | levi131 | |
Committer: | GitHub |
Support complex64 and complex128 datatype, construct ComplexFloatTensor and ComplexDoubleTensor, ComplexDoubleAttr (#9987) ### Original requirements Some AI-for-Science models (FNO, AFNO, PINO) use complex data types and related operations. ### Main design: #### for complex64: `flow.cfloat` == `flow.complex64` as flow.dtype `ComplexFloatTensor` as tensor type `complex` as Python object type `NPY_COMPLEX64` as numpy array type `Py_complex` as Python C type `std::complex<float>` as C++ type #### for complex128: `flow.cdouble` == `flow.complex128` as flow.dtype `ComplexDoubleTensor` as tensor type `complex` as Python object type `NPY_COMPLEX128` as numpy array type `Py_complex` as Python C type `std::complex<double>` as C++ type ### Works in this PR: - Support the complex64 and complex128 datatypes and the corresponding tensor types - Add `std::complex<float>` and `std::complex<double>` into the corresponding TYPE_DATA_SEQ - Add `ComplexDoubleAttr` in the protobuf and tablegen files - Extend `class Scalar` to support representing complex numbers - Modify the FillPrimitive factory to support the `kComplex64` and `kComplex128` datatypes on CPU - Extend some APIs, add some datatype-checking functions, make some changes for compatibility, and add some tests ### Future works - Support the `flow.fft.*` APIs - Support more ops related to complex datatypes - Support complex32
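The NumPy side of the dtype mapping listed above can be sanity-checked without oneflow (a sketch for context; the oneflow-side names are as stated in the commit message):

```python
import numpy as np

# NPY_COMPLEX64 corresponds to np.complex64: two 32-bit floats,
# i.e. the same layout as C++ std::complex<float>.
a = np.array([1 + 2j, 3 - 1j], dtype=np.complex64)
assert a.dtype == np.complex64 and a.itemsize == 8

# NPY_COMPLEX128 corresponds to np.complex128, matching std::complex<double>.
b = a.astype(np.complex128)
assert b.dtype == np.complex128 and b.itemsize == 16

# Python's built-in complex is the object type both map to.
assert isinstance(a[0].item(), complex)
```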
Commit: | ae14534 | |
---|---|---|
Author: | daquexian |
fix mlir tests Signed-off-by: daquexian <daquexian566@gmail.com>
Commit: | 0273adc | |
---|---|---|
Author: | hjchen2 |
rename cambricon to MLU
Commit: | 8496052 | |
---|---|---|
Author: | luyang |
adaptation of cambricon mlu ep device
Commit: | d51447f | |
---|---|---|
Author: | hjchen2 | |
Committer: | hjchen2 |
refine code to reduce coupling
Commit: | 48ed476 | |
---|---|---|
Author: | strint |
index search O(log(n)) to O(1)
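The one-line message above gives no implementation detail; purely as a generic illustration (hypothetical data, not the actual change), replacing a binary search over a sorted key list with a precomputed hash map turns each lookup from O(log n) into O(1):

```python
import bisect

keys = list(range(0, 1000, 3))  # sorted ids (hypothetical data)

# O(log n) per query: binary search on the sorted list
def find_logn(k):
    i = bisect.bisect_left(keys, k)
    return i if i < len(keys) and keys[i] == k else -1

# O(1) per query: the index is precomputed once into a dict
index = {k: i for i, k in enumerate(keys)}
def find_o1(k):
    return index.get(k, -1)

assert find_logn(42) == find_o1(42) == 14
assert find_logn(43) == find_o1(43) == -1
```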
Commit: | d617f0e | |
---|---|---|
Author: | strint |
config straighten alg with env var
Commit: | 4f11a23 | |
---|---|---|
Author: | strint |
merge master
Commit: | 132a8a7 | |
---|---|---|
Author: | Xiaoyu Xu | |
Committer: | GitHub |
Rank task graph fix (#9749) Distributed test still to be fixed. --------- Co-authored-by: lixinqi <lixinqi0703106@163.com> Co-authored-by: cheng cheng <472491134@qq.com>
Commit: | 92f0f5d | |
---|---|---|
Author: | daquexian | |
Committer: | daquexian |
update ShapeProto and other methods, support mlir Signed-off-by: daquexian <daquexian566@gmail.com>
Commit: | b3a8b26 | |
---|---|---|
Author: | Yipeng Li | |
Committer: | GitHub |
Fix straighten memory diffuser bug (#9580) Besides memory, the straighten algorithm also considers an op's minimum lifetime; it further reduces the memory of https://github.com/Oneflow-Inc/OneTeam/issues/1806, shaving roughly 200M off about 5000M. It also ports part of the memory-aware auto-parallel changes, laying the groundwork for straightening logical chains. The original goal has shifted: this PR now focuses on squeezing memory reuse to the limit. Test case: ViT on a single machine with a single GPU.

```
Lower bound: 2925178880
Algorithm id: 3, use compact insert? 0, memory size: 2959723520
Algorithm id: 1, use compact insert? 0, memory size: 2939256320
Algorithm id: 2, use compact insert? 0, memory size: 2949406208
Algorithm id: 0, use compact insert? 0, memory size: 2931854336 (best of the original algorithms)
Algorithm id: 3, use compact insert? 1, memory size: 2935819264
Algorithm id: 2, use compact insert? 1, memory size: 3021840384
Algorithm id: 1, use compact insert? 1, memory size: 2945655296
Algorithm id: 0, use compact insert? 1, memory size: 2929897472 (best of this PR's algorithms)
```

compact insert is the second optimization trick of sub-problem two from the earlier memory-compression tech share, now implemented in this PR. The improvement is small on simple graphs (for many of them the existing algorithms already reach the lower bound); in theory, the more complex the graph, the larger the gain. --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Commit: | f4be05d | |
---|---|---|
Author: | PYNing | |
Committer: | GitHub |
Add oneflow.nn.functional.depend api (#9807) Adds a Python op that (1) prevents a given op from being eliminated or reordered during static-graph optimization, and (2) gives users an interface for adding control edges to the static graph, to constrain or modify the execution order. Similar ops exist in other frameworks with static-graph features, e.g. Mindspore (https://www.mindspore.cn/docs/zh-CN/r1.9/api_python/ops/mindspore.ops.Depend.html) and Tensorflow (https://www.tensorflow.org/api_docs/python/tf/control_dependencies). ### Features: (1) To avoid a performance penalty in eager mode, the Python interface checks whether it is running in eager or graph mode and returns the input directly in eager mode; (2) to avoid a penalty in graph mode, a configurable pass eliminates the extra op and adds the underlying control edge instead; (3) deadlocks caused by self-loops are handled; (4) the pass handles chains of multiple depend ops and possible duplicate control edges; (5) the kernel reuses existing code; (6) unit tests (covering various likely usages) and documentation are included. ### Effect: Taking the first unit test (test_depend_graph_case0) as an example: **Network definition**

```python
class TestModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(128, 128)

    def forward(self, x):
        # to ensure "x * 2" be executed before "self.linear(x)" in graph mode
        # base use case
        x1 = x * 2
        x = nn.functional.depend(x, x1)
        x2 = self.linear(x)
        return x2
```

**Screenshot of job_TestGraph_0_plan.dot without nn.functional.depend** <img width="859" alt="before" src="https://user-images.githubusercontent.com/11667149/215424410-9e23e4a7-dc3a-4074-a0af-0f46aed82599.png"> The graph shows no execution-order constraint between the ops "model-scalar_mul-0" and "model.linear-matmul-1", and judging from their IDs the latter will likely execute before the former, **inconsistent with the op order the user defined**. **Screenshot of job_TestGraph_0_plan.dot with nn.functional.depend** <img width="909" alt="after" src="https://user-images.githubusercontent.com/11667149/215425282-1a1726b1-ca39-47f0-ab42-df2c8e0188ad.png"> A control edge is added between "model-scalar_mul-0" and "model.linear-matmul-1", and judging from the IDs the former will execute first, **achieving the user's goal of controlling op execution order**. The control edge also prevents "model-scalar_mul-0" from being eliminated by other passes. --------- Co-authored-by: PYNing <ningpeiyang@gmail.com>
Commit: | 4bfef84 | |
---|---|---|
Author: | daquexian | |
Committer: | GitHub |
refine device-related code (#9791) The DTR design uses different devices to distinguish tensors with and without recomputation enabled (the same approach as torch/xla); while implementing it, some device-related code turned out to be improvable: 1. A device cannot be an op attr; master sets two attrs, device_type and device_id, instead, which produces a lot of needless code:

```c++
Device a;
op.SetAttr(a.device_type(), a.device_id());
Device b = Device::New(op.attr("device_type"), op.attr("device_id"));
// Device b = a; would clearly be simpler
```

```c++
inline Maybe<bool> device_equal(const std::string& device_name, const int device_id,
                                Symbol<Device> device) {
  return (device_name == device->type() && device_id == device->device_id());
}
Device a;
op.SetAttr(a.device_type(), a.device_id());
if (device_equal(op.attr("device_type"), op.attr("device_id"), b))
// if (a == b) would clearly be simpler
```

This redundant code also causes extra churn whenever the Device class gains a new parameter. 2. Some places misuse Optional::value_or, e.g.

```c++
auto device = device_.has_value() ? device_.value_or(Symbol<Device>()) : JUST(input->device());
```

3. Some naming issues: `ParsingDeviceTag` is not a verb (renamed to `ParseDeviceTag`); `Device::ThreadLocalGetOrNew` and `Device::New` do the same thing, so the meanings of "New" conflict (`Device::ThreadLocalGetOrNew` is removed). 4. operator== and operator!= duplicate logic. --------- Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Commit: | 7e36694 | |
---|---|---|
Author: | Yipeng Li | |
Committer: | GitHub |
Feat straighten delay short gpu (#9748) Delays the execution of ops whose GPU time is short while giving priority to urgent ops, ensuring as far as possible that short-GPU-time ops run at the last moment. Doing this via straightening has the advantage of adding no extra overhead and not changing the computation graph. Although this works at the physical-task level, op-graph-level straightening (logical chain) will reuse/share some of the functions and classes here, so it also serves as a warm-up for op-graph-level straightening. Effect: without this straighten option, the weights are crowded into the first half of the execution sequence  With the DelayShortGpu option selected, the weights are spread across the whole execution sequence  After this change, a variable op executes only right before its downstream consumer op. expand is handled as well: expand's downstream is immediately followed by a broadcast add, which greatly shortens the gaps during the validation phase. Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Commit: | 2e52599 | |
---|---|---|
Author: | Yipeng Li | |
Committer: | GitHub |
Auto Parallel consider memory (#9258) Let auto parallel give the fastest strategy under the limitation of memory. Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Commit: | 111c40d | |
---|---|---|
Author: | Panlichen |
Merge remote-tracking branch 'upstream/master' into sync_upstream
Commit: | 03ece9b | |
---|---|---|
Author: | Shiyuan Shangguan | |
Committer: | GitHub |
rm log dir default (#9552) When a Python script is run directly, no log directory is created by default; glog logs go to stderr, with a default level of WARN. With the environment variable ONEFLOW_DEBUG_MODE=1, the log directory is created; file logs default to level INFO and console output to WARN. When launching multiple processes via distributed.launch, the log directory is created; file logs default to level INFO and console output to WARN. Co-authored-by: jackalcooper <jackalcooper@gmail.com>
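The behavior described above is driven by an environment variable; a minimal sketch of enabling debug logging from the shell (`python3 -c ...` stands in here for a real training script):

```shell
# Default: no log directory is created; glog output goes to stderr at WARN level.
# With ONEFLOW_DEBUG_MODE=1, the log directory is created and file logs default to INFO.
ONEFLOW_DEBUG_MODE=1 python3 -c "import os; print(os.environ['ONEFLOW_DEBUG_MODE'])"
```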
Commit: | 586706b | |
---|---|---|
Author: | Yipeng Li | |
Committer: | GitHub |
Add a switch for memory share strategy (#9509) Add a switch for the memory share strategy. It would be off by default. It solves this issue https://github.com/Oneflow-Inc/oneflow/issues/9508
Commit: | 5ecad89 | |
---|---|---|
Author: | luyang |
merge master
Commit: | a4e67b0 | |
---|---|---|
Author: | Xiaoyu Xu | |
Committer: | GitHub |
Rank task graph merge master (#9440) * Use Primitive in Scalar Pow Grad (#8620) * scalar math use primitive * fix * support pow grad * dev scalar pow grad * remove useless code * use std * auto format by CI * Refine Co-authored-by: guo-ran <360112263@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Add higher order derivative for loss function (#9070) * add higher order derivative for smooth_l1/nll loss * add higher order derivative for bce/kl_div loss * fix bug and refine testcase * fix wrong sbp signature of bce loss * optimize code and align precision with pytorch * add some index check * disable calc derivative for target in bce loss * remove unnecessary header include * fix sbp setting in testcase, and restore out_grads size check * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add higher order derivative for softmax and activation (#9032) * add higher order derivative for softmax/logsoftmax * add higher order derivative for mish/gelu activation * auto format by CI * add comment for constexpr parameter Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add higher order derivative for pool (#9096) * add higher order derivative for pool * refine * optimize * fix ndim check error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Cross Entropy supports probability targets (#9064) * support prob for crossentropy, still has bug for dims > 2 * fix bug of for ndim > 2 inputs, refine code * refine code, use template HasLabelSmoothing * fix grad bug of for ndim > 2 inputs, use pre-calculated factor in kernel * format code, remove redundant including header files * refine op * restore wrong modification * remove op, implement at functor layer * set bind_python to false, 
remove redundant header files * add docs * fix missing default param in unittest, fix typo in docstr example * auto format by CI * Update loss.py * remove useless file Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix nvjpegDecodeParamsSetROI (#9101) * Fix nvjpegGetImageInfo * fix set ROI * add series op : adaptive_max_pool1d/2d/3d (#9023) * startup: cpu adaptive max pool 2d finished (a draft) * add 1d/2d/3d forward * add return_indices * refine files hieararchy * add adaptive_max_pool2d_grad for test * draft backward op for maxpool 2d * cpu op/kernel finished * reformat * gpu draft kernel * gpu forward finished * draft gpu backward version * refine gpu backward * add nn.AdaptiveMaxPoolnd Module * add docstring * rename avg pool gpu file * refine .td file * refine * refine test case * refine * refine by comments of zzk * refine according to clang_tidy errors * refine * refine by comments of zhuping * one_embedding physical_block_size change to 4096 (#9017) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * OneEmbedding add ONEFLOW_ONE_EMBEDDING_DISABLE_PIPELINE (#9098) * one_embedding eager forward * deterministic forward gen random * merge master * merge master * grad op add attrs * Revert "grad op add attrs" This reverts commit 33b67c75d1e5d0e6529a108f7e7a17bc458dc661. 
* auto format by CI * format * refine * prefetch consume id_shuffle out and exec in advance * add new task_node * sort and add ctrl edge * rm id_shuffle_task_node * add register same output blob regst num * rm tasktype * refine * address review * rename * refine * refine * refine Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * develop eager AMP (#9088) * implement eager AMP * skip autocast for inplace and implement make autocast meta * fix * rm unused code * autocast python api * fix * fix * refine * skip autocast if any input is float32 for gray or clear list * refine * fix dead loop * add autocast unittest * refine worker seed (#9102) * refine worker seed * refine * reifne * use default_generator.seed * Dev GroupNorm (#7784) * add groupnorm infer * Add groupnorm forward * refine other forawrd situation * groupnorm backward still has bug * fix forward * support backward * add slow groupnorm param grad kernel * use blockreduce * update blocknum * add gradient func * simplify code * refine and add global test * remove annotation * not limit split dim * fix compile error * Add spatialsize pack logic and fix launch blocknum bug * add two stage reduced backward kernel * refine * simplify logic * refine pack logic * use THREAD_CACHED_MUTABLE_ATTR_MAP * fix comment * refine * refine comment * Refine more check * fix affine=False bug * fix bug * tmp use gemm reduce * use ComputeType buf * fix nvbfloat16 compute type * add amp gray list * Revert back * fix clang analysis * refine userops.td * fix userops * remove result_segment_sizes * add dispatch logic for groupnorm grad uncached block impl Co-authored-by: luyang <flowingsun007@163.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Introduce bfloat16 type (#9067) * introduce_bfloat16_type * storage * fix compile error * support bfloat16 ep operator * support create cpu bfloat tensor * refine code * minor 
fix * fix static check error * reslove comment * add more test case * fix bfloat16 numeric_limits * fix error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refine check in ibverbs (#8974) * refine check in ibverbs * format * fix typo and test * refine error message when there is no errno Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support padding_idx in OneEmbedding (#8998) * init * Add attribute val in Userops.td * simply add paddingidx logic in EncodeLookupKernel * add simple padding_idx EmbeddingGrad * when index is -1 let gather add 0 * skip atomicadd when row index equals to padding_idx * change padding_idx type to int64 * fix compile error * set padding_idx in Pass * 1n1d eval success * refine * remove print * fix compile error * revert * refine * fix compile * refine * Refine * refine * refine store options * remove embedding grad shuffle redundant padding_idx * move gather in datashuffle kernel * remove redundant code * Refine * refine * remove redundant header file * Set padding idx as optional and remove attr has_padding_idx * Add padding_idx unittest * use array equal instead of allclose * remove a test * enlarge timeout * launch oneflow kernels in code generated with MLIR (#8980) * init * registry * add KernelLaunchFunctionPass * pass ninja and relu test * mlir test script & lowering * relu py * fi * kernel launch * fix * fix op and pass interfaces * add comment * add readme docs * fix typo * kenerl launch function pass is done * use template and rename func.func * declare * pass string through mlir.llvm dialect to c interface: llvm.mlir.global internal constant @"relu-0_var"("relu-0") %0 = "llvm.mlir.addressof"() {global_name = @"relu-0_var"} : () -> !llvm.ptr<array<6 x i8>> %1 = "llvm.mlir.constant"() {value = 0 : index} : () -> i64 %2 = "llvm.getelementptr"(%0, %1, %1) {structIndices = dense<-2147483648> : tensor<2xi32>} : (!llvm.ptr<array<6 x i8>>, i64, i64) -> !llvm.ptr<i8> * use 
symbol table * use oneflow variable op * fix symboltable * fix * ninja c1 check * split into kernel-launch-function pass and kernel-launch-with-llvm pass * restore pass 1 * Gen kernel example (#9042) * add example * add todo * add basic assertion * add file check * create pass in translation * sanitizeIdentifier * enable print * fix * update test file * kernel llvm pass is ok * pass ctx ptr to func and this ptr will be an operand to call c interface function * restore llvm ptr type to llvm.ptr<i8> * Kernel lookup in launch op (#9059) * add * move function to another unit * create map * add iter * impl TensorDesc4ArgNameAndIndex * set dev tag * load lib when ONEFLOW_MLIR_FUSE_KERNEL_LAUNCH is set * sharedlibs enables and pass enables in commpute * enable c interface callee * impl todo * naming * rm * add invalid * fix invoke arg * typed * rm log * rename pass * Update user_op_kernel_registry.h * Update user_op_kernel_registry.h * Update OneFlowOps.td * Update Passes.cpp * add comp ctx * add todo * refine todo * refactor op infer * minor fix * add check * refine error * refine msg * fix typo * fix typo * remove string in llvm * impl Tensor4ArgNameAndIndex * fix ninja c1 bug * realize gpu and add cuda test * auto format by CI * fix merge * fix ninja with cpu version * auto format by CI * rename * merge def * deduplicate code * fix * refactor * fix license * cache * add back TODO() * add jit arg type check * rm comment * fix typo * fix ci * todo ci * fix code style * rm misadded * rm misadded * Update Passes.cpp * pass ninja without debug about hungry mode of knerel init * fix null parsed module problem * fix dynamic cast of state problem * fix gpu error * fix * fix * auto format by CI * fix * Update kernel_launch_op.cpp * move * fix * auto format by CI * done * fix * fix * auto format by CI * fix * fix * auto format by CI * Update kernel_launch_op.cpp * rename * auto format by CI * fix * done * Update kernel_launch_op.cpp * fix * fix * fix * fix * fix * auto format by 
CI * Update oneflow/ir/oneflow-extension/kernel_launch_op.cpp Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * fix * fix * fix * fix * fix * Update oneflow/ir/lib/OneFlow/Passes.cpp Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> * fix * fix * fix Co-authored-by: jackalcooper <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> * interpolate api align (#9118) * Fix masked select op bug (#9120) * fix masked_select bug * refine * fix ci error * align with pytorch RANK env (#9111) * align with pytorch RANK env * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add oneflow hub (#9116) * add OneflowHub feature, consistent with PyTorchHub * add oneflow hub docs * refine docs and add test * refine * refine * refine * fix comment * auto format by CI * skip unittest Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix where op data_type infer bug (#9121) * fix where op data_type infer bug * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix like op infer dtype (#9127) * elementwise.cuh remove template parameter tail (#9128) * fix_global_tensor_detach_bug (#9134) * fix_global_tensor_detach_bug * fix test case * Add deform_conv2d op (#9095) * add new op * add kernel * add deform_conv * add some test * modify test * modify format * modify test * fix the bug and add test * Add error message * modify kernel and add test * adjust the format * add global test * Update python/oneflow/test/modules/test_deform_conv2d.py * add doc and modify global test * adjust OneFlowUserOps.td * remove headfile and modify doc * modify doc * add docs at rst * modify global test * remove unnecessary code * remove unnecessary code * remove debug code * initialize fields * modify global test * modify test * modify test * 
modify test * auto format by CI Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix inplace mul 0size check bug (#9132) * fix inplace mul 0-size tensor check bug * code format * revert * Align round op to support round half to even (#9135) * align round op * add test * modify doc ,test and kernel * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * rm dict in module apply (#9137) * rm dict in module apply * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * one_embedding support broadcast table_ids (#9109) * support broadcast table_ids * address review * fix like op infer dtype * address review * address review * refine * refine error message for framework (#9104) * refine error msg for framework * more error messages * fix size_t comparison with zero * check for incomplete error messages * err msg for inconsistent placement * modify acc. 
to review * convert enum to string in error msg * fix redundant error info; clean up * refine error msg for consistency check * auto format by CI Co-authored-by: Yao Chi <later@usopp.net> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix loss scale precision (#9126) * fix loss scale cast * amp_white_identity * revert debug log * move constant like back Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * one embedding eager (#8984) * forward * one_embedding eager * fix one_embedding grad * fix * fix * fix * fix amp * fix of_tidy * ONEFLOW_ONE_EMBEDDING_FUSE_UPDATE_PUT default true * merge master * save shadow var * get all ptr from embedding_state * reuse update and put op/kernel * mv id_shuffle to cuh * refine * refine * refine * refine * refine * refine * one_embedding eager forward * deterministic forward gen random * merge master * merge master * merge master * add table_ids in grad op * test pass * refine * create lazy state in lazy mode * optional learning_rate * add attr in update * refine * refine * refine * refine * fix adam and add adagrad attr * refine * refine * refine * refine * refine * address review * refine name * address review Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * module.to aligned with pytorch (#9083) * module.to aligned with pytorch Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * fix to str Signed-off-by: daquexian <daquexian566@gmail.com> * fix kwargs device bug Signed-off-by: daquexian <daquexian566@gmail.com> Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: binbinHan <han_binbin@163.com> * eager global zero_grad update sbp from b to p (#8853) * zero_grad b to p Signed-off-by: daquexian <daquexian566@gmail.com> * zero_grad b to p Signed-off-by: daquexian <daquexian566@gmail.com> * skip in lazy 
Signed-off-by: daquexian <daquexian566@gmail.com> * implement zero_grad in c++ Signed-off-by: daquexian <daquexian566@gmail.com> * _zero_grad to _zero_grad_, skip boxing of lazy tensor Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * auto format by CI * skip test in cpu only mode Signed-off-by: daquexian <daquexian566@gmail.com> Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support inplace scatter (#9016) * refine scatter * fix * refine * refine * add atomicMul & refine * refine * Dev linalg cross (#8979) * add linalg_cross in yaml * add linalg cross * fix * refine broadcast * add global test * reformat * refine and fix * fix tidy * add nansum (#9113) * add nansum, can work on cpu, fail on cuda * implement nansum on cuda * restore modification in preprocessor_internal.h * register only for floating types * remove kernel register for int types, and it works * add whole reduce functor * add backward func * add export in __init__ and refine code * refine code * refine code, and register kernel * add sbp * just for debugging, cannot compile * just for debugging, cannot compile * use primitive to implement assign nan * refine code * add docs, remove useless op and functor * remove useless kernel * add docs, fix bug of primitive * fix typo in global test * refine code * refine code * refine code * refine code * auto format by CI * Update binary_func.h * Update binary_func.h Co-authored-by: MARD1NO <359521840@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Feat eager global tensor indexing (#9138) * test(TensorIndexing): add global basic indexing test * format code * feat(TensorIndexing): support eager global advance indexing * test(TensorIndex): add global tensor indexing error 
message test * format code * feat(TensorIndexing): support global tensor combined indexing * format code * feat(TensorIndexing): eager global combined basic with advance indexing * fix(TensorIndexing): fix global tensor write back bug * remove useless code * refine test and comment * fix(TensorIndexing): remove an unnecessary slice_update * add comment * fix with static analysis Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add lr_scale for optimizers (#9008) * add lr_scale for opt * revert import * set lr scale in pass * add test * lr_scale default value * improve readability * fix_ctc_loss_error_with_float_target_input (#9143) * fix_ctc_loss_error_with_float_target_input * minor fix Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Inplace masked fill (#9133) * add inplace masked_fill * reformat * refine * auto format by CI * refine according to comments of hbb * export via cpp directly * export oneflow.masked_fill_ * rename arg * refine test case Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix numpy>=1.23.0 advance indexing code (#9139) * test(TensorIndexing): fix numpy>=1.23.0 * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * add_tensor_new_full_func (#9149) * add_tensor_new_full_func * auto format by CI * add global test case * fix error Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * As strided regist more dtype (#9150) * as_strided register more kernel * add test * fix comment * fix ci error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Auto Parallel (#8891) * add auto_parallel code add auto_parallel pass * Feat ap remove hierarchy cast (#7919) * feat(AutoParallel): support remove parallel_cast ops * feat(AutoParallel): export 
enable_auto_parallel_prune_parallel_cast_ops * format code * Fix add conv grad cost (#7972) * feat(Conv): add grad computation cost * fix ConvDataGrad computation cost * update conv grad cost * refine * Auto parallel/fast collector (#7958) * Try to speed up sbp collector. However, throughput drop * Shrink the parallel candidates for the proxy node * Print out some information and then refine * Store the sbp set for each consumer * Update binary set intersection * Remove impossible parallel candidates from sbp proxy * Refine binary set * Add a Clear() in binary set * Filter out those proxy candidates containing two sbps from the same unique group * refine * Check spells * Clip useless edges * AutoParallel mainstem algorithm add mutable_op_ctrl_edge (#8033) * feat(AutoParallel): mainstem algorithm add mutable_op_ctrl_edge * use if instead std::max * fix(AutoParallel): fix pooling computation cost function bug (#8147) * [WIP] Fix auto parallel dump uniform sbp bug (#8330) * fix(AutoParallel): fix auto parallel dump uniform sbp bug * refine source op judgement * update auto_parallel config (#8356) * Refactor dump nd sbp for auto parallel (#8353) * fix(AutoParallel): fix auto parallel dump uniform sbp bug * feat(AutoParallel): add interface for op to dump nd_sbp to op_conf * refactor(AutoParallel): refactor DumpNdSbpSignatureForOpConfFn * rename Global to Singleton * Refactor SbpEdge (#8684) * refactor(AP): refactor SbpEdge * Rename variables * Add const for some functions Co-authored-by: Yipeng Li <jamesonli1313@gmail.com> * Refactor auto parallel sbp node (#8712) * Rename * Code clean up * Code clean up * Code clean up and package up * Rename * Add const for some functions * Refactor auto parallel sbp graph (#8722) * Code clean up * Package up * Code clean up and package up in SbpNode and SbpEdge * Rename * Rename * Rename mainstem to trunk * Typo, small bugs and rename * Rename and of format * Refactor auto parallel rest (#8731) * Package up SbpCollector * Add const 
for SbpGraph * Add const for SbpNode * Add const for SbpEdge * Add const for SbpCollector * Add const, rename, and package up for BinarySet * Rename for BinarySet * Rename for SbpCollector * Rename for SbpCollector * Rename for algorithm utils * Fix a bug for an unused function AddEntries() * Rename for BinarySet * Rename for SbpConstructor * Rename for BoxingCollector * Add const for sbp utils * fix merge conflict * Remove template for sbp signature (#8787) * Remove template for sbp signature * Remove _H_ from cpp files * Remove namespace specifier oneflow:: * Remove namespace specifier oneflow:: * Of format * Move the inline functions to cpp files * Can not add inline specifier? * Update oneflow/core/auto_parallel/sbp_graph.h Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Of format Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Refactor auto parallel class object stuff (#8835) * Delete copy/move constructor/operator * Move the destructor of SbpEdge to the cpp file * Equal by address for Sbp data structure * Replace sbp_sig_list_ with sbp_sig_obj_list_ * Fix auto parallel copy cost infer2 (#8788) * Check the output shape for operator in auto parallel * Return infinity for different sbps while is_mutable * Update oneflow/core/auto_parallel/sbp_constructor.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Update oneflow/core/operator/operator.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * with output -> check output Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Refactor prune identity as much as possible (#8849) * Prune a line of parallel cast ops * Avoid repeated pruning * Code clean up * Remove identity op * Update oneflow/core/job_rewriter/auto_parallel.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Fix auto parallel low throughput (#8876) * Speed up after pruning identity * Slight changes * Refactor auto parallel final check (#8887) * Of format 
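The auto-parallel commits above (SbpGraph/SbpNode/SbpEdge, copy-cost inference, candidate pruning) all serve one search problem: each op picks one SBP signature, and every edge between a producer and a consumer pays a transfer (boxing) cost that depends on the two choices. A minimal illustrative sketch of that search for a simple chain of ops, with invented signature names and costs (OneFlow's SbpGraph handles arbitrary graphs, node elimination, and real cost models):

```python
def min_cost_chain(candidates, comp_cost, edge_cost):
    """Pick one signature per op in a chain, minimizing total cost.

    candidates: per-op list of allowed signatures
    comp_cost(i, sig): computation cost of op i under sig
    edge_cost(a, b): transfer cost when producer chose a and consumer chose b
    """
    # best[s] = cheapest total cost of the prefix, given the last op chose s
    best = {s: comp_cost(0, s) for s in candidates[0]}
    for i in range(1, len(candidates)):
        best = {
            s: comp_cost(i, s) + min(best[p] + edge_cost(p, s) for p in best)
            for s in candidates[i]
        }
    return min(best.values())

# Toy model: "S" (split) is free to compute, "B" (broadcast) costs 0.5,
# and changing signature across an edge costs a transfer of 1.0.
comp = lambda i, s: 0.5 if s == "B" else 0.0
edge = lambda a, b: 0.0 if a == b else 1.0
# A middle op that only supports "B" forces its neighbors to weigh a
# transfer against computing under "B" themselves.
assert min_cost_chain([["S", "B"], ["B"], ["S", "B"]], comp, edge) == 1.5
```

For a chain this dynamic program is exact; on general graphs the problem is NP-hard, which is why the commits above spend so much effort on pruning candidates and eliminating nodes before searching.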
* Use const auto & * Of format and rename * Re-compute cost if steals sbp signatures * Docs auto parallel doc (#8896) * doc(AutoParallel): add auto parallel document framework * docs(AutoParallel): add document * fix typo * refine document * refine documentation * Test alexnet for auto_parallel (#8917) * test(AutoParallel): test alexnet for auto_parallel * test(AutoParallel): test model add auto_parallel config * Fix get sbp bug (#8939) * Fix the bug of missing sbp for uniform op * Speed up * Add the missing sbp for optional input UserSourceOpTickInput * Remove the repeated all-B sbp signature * Add sbp for undefined UserSourceOpTickInput * Resolve conflicts while merging master * Recompute cost with time shape (#9009) * Address comments * fix merge conflict * Address comments * Disabled ZeRO when enabled AutoParallel (#9087) fix(AutoParallel): disabled ZeRO when enabled AutoParallel * Update oneflow/core/job_rewriter/optimizer_placement_optimization_pass.cpp * Address comments * Address comment. 
GetComputationCostFn -> GetComputationCost * Update oneflow/core/job_rewriter/auto_parallel.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * New interface for pr#9018 * Static analysis * Fix ones like sbp bug and fix test import error in CI (#9123) fix(AutoParallel): skip 1n1d sbp agreement check * auto format by CI * test(AutoParallel): skip acc check * Address comments * rename source op set nd_sbp function and add check * fix typo * Feat full auto parallel (#9140) * Use B for inplace op and remove the check for sbp while turning the auto parallelism on * Slight change * Not using B as the constraint * Address comments * add debug log for non-deleted cast ops * update prune parallel cast op log * rename auto_parallel_prune_parallel_cast_ops to enable_auto_parallel_ignore_user_sbp_config Co-authored-by: wyg1997 <wangyinggang@foxmail.com> Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * refine oneflow op infer dtype error message (#9155) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix to_global PyArg_ParseTupleAndKeywords (#9158) * Fix tensor local_to_global parse keywords * use PyObject Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Implement exponential_ and multinomial (#9073) * add exponential distribution cpu kernel * add exponential distribution cuda kernel and local tests * refine test * fix bug * auto format by CI * auto format by CI * implement multinomial functor and cpu kernel * auto format by CI * add multinomial cuda kernel * auto format by CI * refine * add multinomial tests * auto format by CI * add categorical distribution module and docs * refine * refine * refine doc * refine * refine * revert Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Disable IB 
when there are no active IB devices (#9115) * fix lru_cache offset (#9162) fix lru_cache offset for larger than uint32 Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Rename cast to global and cast from global (#9151) * rename_cast_to_global_and_cast_from_global * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refine datatype error message part2 (#9168) * refine more ops dtype infer error message * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * support tensor.triu_ (#9159) * support tensor.triu_ * Update tensor_functions.cpp * tensor.copy_ support stride (#9142) * tensor.copy_ support stride * add test case * PersistentTable add read_only flag (#9145) * read only * fix * avg_pool_nd support half (#9170) * avg_pool_nd support half * refine * refine * fix new_ones size parameter (#9161) * fix new_ones size parameter * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * hot-fix (#9191) * hot-fix * refine * skip env var check and calculate local rank if not given (#9183) * skip env var check Signed-off-by: daquexian <daquexian566@gmail.com> * calc local rank if need * No warning for absent LOCAL_RANK Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: Yu OuYang <xuanjiuye@gmail.com> Co-authored-by: clackhan <han_binbin@163.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * set to_contiguous to amp clear list (#9171) * add tensor.nansum (#9182) * Add slight cost for different sbp in 1 device (#9172) * Add slight cost for different sbp in 1 device * Print to INFO Co-authored-by: 
mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * refine_to_contiguous_dtype_register (#9196) * refine_to_contiguous_dtype_register * add test case * pool_nd_ops register gray list * skip autocast for non-user op (#9199) * `copy_` support numpy fp16 (#9189) * copy_ support numpy fp16 Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * fix matmul 0 size input error (#9147) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat functional scalar tensor parameter (#9190) * add ScalarTensor check and unpack, but has link error * refine scalar tensor item function * feat(functional): functional support ScalarTensor transfer to Scalar automatically * feat(functional): support ScalarTensor transfer to Scalar * change auto transfer rule * test(Functional): add functional scalar tensor param test * format code * refine GetItemInScalarTensor function * Fix broadcast fmod grad (#8865) * impl trunc divide * fix broadcast fmod grad * trunc_div grad, scalar_trunc_div, and primitive * format * gradient_func * add test * rename * compatible with older versions of torch * resolve warning * test global Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat straighten compress memory (#9094) * An initial implementation of linear programming primal matrix * Coding for the revised simplex method * Finish coding for the phase 1 * Fix bug. Now we can get a correct x for the initial basic feasible solution * Drive the artificial variables out in phase 1 * Bland's rule and bug fix * Adjust the mapping between the basic variables and compact columns * No columns removed while driving artificial variables out. Terminates the code if positive optimal cost found in auxiliary problem. * Implement the phase 2 of the revised simplex method. 
Remove columns of the inverse base matrix. * Update is_solved status and original problem recovery. * Rows and artificial columns activation * An initial implementation of mixed-integer programming * Try to assemble the original problem but fail due to the massive exclusion * Steal initial position from current setting * Compute the optimal cost from the compact relationship * Move to a neighbor status and compute the cost * Find the smallest cost and actually move to that status * Check conflict after the adjustment. Adaptively cost reduce * Generate a compact position from nothing * Straighten for memory * Update the offset * Add a demo for using the revised simplex method * Remove the linear programming part * Recompute the compact relationship after moving to a new status * Rename * Code clean up * Set the tag for the straighten algorithm * Code clean up * An attempt to explore the dependency between consumer nodes of a register * Revert "An attempt to explore the dependency between consumer nodes of a register" This reverts commit f219851fb85943d07d28b84c45e5c4bae80872a0. 
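The "straighten for memory" work here packs registers with known lifetimes into a shared buffer so that registers whose lifetimes overlap never share addresses. The commits use simplex/mixed-integer machinery to optimize the packing; as a much simpler stand-in, the core constraint can be illustrated with a greedy first-fit assignment (this is NOT the algorithm in the commits, just the problem it solves; the data format is invented):

```python
def assign_offsets(regs):
    """First-fit offset assignment for registers with lifetimes.

    regs: list of (size, birth, death) tuples, lifetime is [birth, death).
    Returns (offsets, total_bytes). Two registers may share memory only
    if their lifetimes do not overlap.
    """
    placed = []   # (offset, size, birth, death) of already-assigned regs
    offsets = []
    for size, birth, death in regs:
        # only registers alive at the same time constrain this one
        live = sorted((o, s) for o, s, b, d in placed if b < death and birth < d)
        off = 0
        for o, s in live:
            if off + size <= o:
                break             # fits in the gap before this block
            off = max(off, o + s)  # otherwise slide past it
        placed.append((off, size, birth, death))
        offsets.append(off)
    total = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, total

# Three 4-byte registers; the third is born exactly when the first dies,
# so it can reuse offset 0 and the peak is 8 bytes instead of 12.
offs, total = assign_offsets([(4, 0, 2), (4, 1, 3), (4, 2, 4)])
assert (offs, total) == ([0, 4, 0], 8)
```

The gap between such a greedy result and the lifetime-derived lower bound is exactly what the later "Compute the lower bound and only execute the adjustment…" items exploit to skip hopeless optimization passes.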
* Compute the lower bound and only execute the adjustment 2 for those cases with possible reduction in memory * Pre-compute and store the memory size for registers * Use pre-stored total register num * Limit the maximum iteration step * Use VLOG(3) instead of std::cout * Change interface * Package up memory share strategy interfaces * Address comments * Address comments * Of format * Fix bug lower bound = 0 Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add contains magic method (#9185) * refine more ops dtype infer error message * refine * add tensor.__contains__ magic method Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Build cuda 11.8 (#9204) * export unsorted segment sum (#9206) export unsorted_segment_sum python Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Optimize OneEmbedding Save Snapshot (#9112) * init * fix compile error * refine * Refine put logic * todo lrucache logic * refine dump logic * finish * add flag check * Add env var * fix * fix a silly bug * fix template args * fix comment * add template * Refine comment * remove * fix bug * fix compile error * refine initial Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add Tensor.scatter_add & refine scatter (#9201) add Tensor.scatter_add & refine scatter * optimize layernorm need padding cols perf (#9195) * optimize layernorm need padding cols perf * auto format by CI * reduce binary size Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support Inplace behavior in Type Promotion (#9200) * support inplace * refine * add const * refine Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * Fix Broadcast Matmul check (#9213) fix check * Export MultiTensor Update and FuseUpdateCast to GraphConfig (#9209) * export to graph config * refine or Co-authored-by: mergify[bot] 
<37929162+mergify[bot]@users.noreply.github.com> * fix bug of matmul dim check in `oneflow.bmm` (#9215) * fix bug of matmul dim check * refine code * Update nn_functor.cpp * Regist arange fp16 (#9202) * arange op support cuda half * add test * format * fix comment * fix comment * refine * ci test error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix graph out argstree type judge (#9211) * reproduce bug * fix custom class type deal * fix typo * support ordereddict * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix ConcatFunctor error message (#9225) * Check async errors after kernel launched (#9226) Check errors after kernel launched * Skip unnecessary passes (#9219) * Skip unnecessary passes * refine * one_embedding fix typo (#9230) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [GetAsyncError] Add op name to error message (#9228) GetAsyncError refine error message * [JobBuildAndInferCtx]Remove an inefficient check (#9229) Remove an inefficient check * Fix linalg cross 0-size input error (#9232) * Add silu to amp list (#9233) * Disable CUDA virtual arch compilation (#9236) * Support set/get_default_dtype interface (#9227) * feat(DType): support set/get_default_dtype interface * doc(*): fix set/get_default_dtype document * doc(DType): refine document * feat(oneflow.tensor): support infer dtype as get_default_dtype * test(DType): add default dtype test * refine throw error * modify doctest because it will affect default dtype for other test * fix(DType): make DefaultDType is global * use default type in TensorWithDataCtorFunctor * fix(DType): flow.Tensor support DefaultDType * refine function name Co-authored-by: jackalcooper <jackalcooper@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Enhance 
doctest error message (#9237) * test(doctest): enhance doctest error message * Update python/oneflow/test/modules/test_functional_docstr.py Co-authored-by: Yao Chi <later@usopp.net> * Update python/oneflow/test/modules/test_functional_docstr.py Co-authored-by: Yao Chi <later@usopp.net> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Yao Chi <later@usopp.net> * Feat: script to import oneflow as torch globally (#9160) * feat: global `import torch as oneflow` * use `console_scripts` to install oneflow-mock-torch to PATH * close quote * use os.makedirs to create temp torch directory * rename to `oneflow-mock-torch` * don't create temp files * use positional argument with 2 choices * add `mock torch test` in CI * uncomment env setup * default argument is enable * fix docker exec * refactor test script * check successful recover * don't run setup.py * support submodule importing & display error message * fix import * and import-from * move mock_torch to oneflow dir; update test command * fix error message * update mock test (less strict) * add more tests for torch imports * modify export path * mock_torch is a package Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * add time and mem log tools (#9164) * add time and mem log tools * refine format * auto format by CI * address review * auto format by CI * log with json format * rm useless * refine log format Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * support bool for `oneflow.nn.functional.pad` (#9234) * support bool in functor and kernel, add unittest for int and bool * refine unittest * check value for bool tensor * Feat: rand/randn support float16 kernel (#9238) * feat(Op): rand/randn support float16 kernel * add error message and refine code Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * reduc auto tick generate time (#9235) * reduc time * rm useless 
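The mock-torch feature above ("script to import oneflow as torch globally", later refined into a custom meta path finder) works by intercepting module lookup so that `import torch` resolves to `oneflow`. The redirection trick can be sketched with Python's importlib; the `AliasFinder` class and the `torch_stub` → `json` mapping below are invented for the demo, and `oneflow.mock_torch` is the real, more complete implementation:

```python
import importlib
import importlib.abc
import importlib.machinery
import sys

class AliasFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Serve imports of `alias` (and its submodules) from `real` instead."""

    def __init__(self, alias, real):
        self.alias, self.real = alias, real

    def find_spec(self, fullname, path=None, target=None):
        if fullname == self.alias or fullname.startswith(self.alias + "."):
            return importlib.machinery.ModuleSpec(fullname, self)
        return None  # let other finders handle everything else

    def create_module(self, spec):
        # Return the already-initialized real module; the import system
        # registers it in sys.modules under the alias name.
        return importlib.import_module(self.real + spec.name[len(self.alias):])

    def exec_module(self, module):
        pass  # the real module's code has already run

# Demo: make `import torch_stub` actually load the stdlib `json` module.
sys.meta_path.insert(0, AliasFinder("torch_stub", "json"))
import torch_stub
assert torch_stub.dumps({"a": 1}) == '{"a": 1}'
```

Because the finder sits at the front of `sys.meta_path`, the redirection also covers `import torch_stub.submodule` and `from torch_stub import name`, which matches the "support submodule importing" and "and import-from" items in the commit list.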
* address review, refactor structure * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * TensorIndexing support float16 (#9247) * feat(TensorIndexing): support float16 * feat(TensorIndexing): support bfloat16 * skip bfloat16 test when cuda version less than 11000 Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * Add cudnn handle pool (#9243) * add_cudnn_handle_queue * deal normalization_kernel * refine * refine * resolve comment * minor fix * refine * auto format by CI * fix static check Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Added error message for CUDA device incompatibility (#9250) * Added error message for CUDA device incompatibility * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix autograd.Function memory leak (#9249) * fix(AutogradFunction): fix memory leak * add ptr check for AutogradState data * test(AutogradFunction): ensure PyAutogradFunctionState released * test(AutogradFunction): decrease memory * register __dict__ function * refine code * fix state release test bug * refine error message * Feat speed up mem reuse (#9210) * Use HashSet instead of vector * O(n^3) -> O(n^2) * Compute offset for memory-first algorithm only * Remove explicit exclusion relationship * Revert print out information * Speed up exclusion judgement * Switch HashMap to vector * Code clean up * life time -> lifetime * mem_reused_regst: HashSet -> std::vector regst_desc_id2regst_desc -> mem_chain2regst_desc_id2reuse_regst_desc * Re-implement MemReusedAlgorithm_TimeLineAlgo and comment out useless code * Make allocate and free timeline local and HashSet -> std::vector * Eliminate a lot of Hash stuffs * Revert "Eliminate a lot of Hash stuffs" This reverts commit abfb86df57b13074cb50ca9dc080a1333cd46802. 
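The "Add cudnn handle pool" change above applies a standard pattern: a `cudnnHandle_t` is expensive to create, so instead of creating and destroying one per kernel invocation, finished handles are parked in a free list and handed back out on the next request. A minimal language-agnostic sketch of that pattern (class and attribute names invented; the real pool manages CUDA handles per device/stream, not Python objects):

```python
import contextlib

class HandlePool:
    """Reuse expensive-to-create handles instead of recreating them."""

    def __init__(self, create):
        self._create = create  # factory that makes one new handle
        self._free = []        # parked handles ready for reuse
        self.created = 0       # how many real handles were ever made

    @contextlib.contextmanager
    def handle(self):
        h = self._free.pop() if self._free else self._new()
        try:
            yield h
        finally:
            self._free.append(h)  # park for reuse instead of destroying

    def _new(self):
        self.created += 1
        return self._create()

pool = HandlePool(create=object)  # stand-in for cudnnCreate
with pool.handle() as h1:
    pass
with pool.handle() as h2:
    pass
assert h1 is h2          # the parked handle was reused
assert pool.created == 1  # only one real handle was ever created
```

In the real change the pool must also respect thread/stream affinity (a cuDNN handle is bound to a stream via `cudnnSetStream`), which is why the commit touches `normalization_kernel` rather than being a pure utility addition.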
* Important comment * Address comments * auto format by CI * Remove magic number -1 * Address comment and rename Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix bug: segfult when argmax has 0 size tensor as input (#9242) * fix_half_check_of_reduce_mean (#9014) * fix_half_check_of_reduce_mean * refine * Support float16 for initializer operators (#9253) * feat(*): support float16 for initializer operators * refine test * Add half clamp (#9241) * Register half * register fp16 in clamp kernel, add check for fp16 in functor, update unittest for more dtype * format code * add macro WITH_CUDA Co-authored-by: WangYi <buaawangyi03@gmail.com> Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [CUDA]CheckVersionCompatibility (#9257) * [CUDA]CheckVersionCompatibility * Add CUDA 10.2 Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat: monkeypatching pytorch (#9256) * update custom meta path finder * update test commands * print warning if `torch` is already imported * rename to `mock` * update tests * private attribute cannot be imported with import * * split testcase Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * support destory_rdma (#9246) * support destory_rdma * refine * auto format by CI * refine * refine Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * add bincount (#9156) * add bincount * add docs, use atomic add in cuda kernel, add unittest * add minlength param, fix bug of memset in kernel * refine code * refine code * convert to local when input is global, add global test * auto format by CI * refine code * refine docstr, reduce doc length in one line * register fp16, add tensor function and unittest * add 
docs for tensor.bincount * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * ONEFLOW_STREAM_ENABLE_H2D_STREAM (#9205) * Modify generator.manual_seed to return generator rather than None (#9262) generator.manual_seed return generator rather than None Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Dev add tensor bernoulli (#9261) * add tensor.bernoulli * add docs * Update tensor.py * Update tensor.py * Update tensor.py * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Multi tensor update (#9252) * fix multi_tensor_sgd segfault * enable learning_rate_val to replace learning_rate Tensor * support adam and adamw * support epsilon for adam and adamw Co-authored-by: songyicheng <int.rejoice@gmail.com> * fix a typo in readme (#9268) * support nested asyncs.thread (#9270) * OneEmbedding add smart decay sparse adam (#9176) * add sparse adam * smart decay sparse adam * address review * fix * mv smart_decay to one_embedding namespace * upgrade clang-tidy used in ninja of_tidy (#9263) upgrade clang-tidy in ninja of_tidy Signed-off-by: daquexian <daquexian566@gmail.com> Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat/compile time count (#9245) * add graph compile time count * refine compile log * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix random_normal (#9274) Co-authored-by: Juncheng <liujuncheng1022@gmail.com> * Flip and upsample bilinear support fp16 (#9284) * slice update cpu kernel multi_thread loop * refine * upsample bilinear and flip register fp16 cuda kernel * fix comment * revert Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix PruneAmpWhiteIdentityOpPass (#9276) * fix * fix dup del * 
ref algorithm * fix dup mut * simple impl * rm useless code * fix * fix typo * fix typo Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * support api flow.randn_like (#9283) * support api flow.randn_like * refine * remove dry run, add sanitizers to ci (#8670) * fix some data races in c++ api and SteadyVector Signed-off-by: daquexian <daquexian566@gmail.com> * skip self copy in MutShapeView::ToShape Signed-off-by: daquexian <daquexian566@gmail.com> * remove dry run, add sanitizers to ci Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * update gh action * skip lit Signed-off-by: daquexian <daquexian566@gmail.com> * suppress ubsan error in llvm Signed-off-by: daquexian <daquexian566@gmail.com> * disable ubsan for now Signed-off-by: daquexian <daquexian566@gmail.com> * fix ci path Signed-off-by: daquexian <daquexian566@gmail.com> * update test manylinux docker Signed-off-by: daquexian <daquexian566@gmail.com> * restore dry run rpc manager Signed-off-by: daquexian <daquexian566@gmail.com> * run tsan for 3 times Signed-off-by: daquexian <daquexian566@gmail.com> * do not find initializer order bug Signed-off-by: daquexian <daquexian566@gmail.com> * fix merge conflict Signed-off-by: daquexian <daquexian566@gmail.com> * skip sanitizer test in cuda misc Signed-off-by: daquexian <daquexian566@gmail.com> * sleep Signed-off-by: daquexian <daquexian566@gmail.com> * suppress by __attribute__((no_sanitize_address)) Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * revert suppression * fix heap-use-after-free found by asan * auto format by CI * bash -c Signed-off-by: daquexian <daquexian566@gmail.com> Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: tsai <jackalcooper@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add build config for RTX 40xx GPUs (#9290) * Bool support for triu (#9291) * Refix 
PruneAmpWhiteIdentityOpPass (#9294) fix Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix concat #8833 (#9275) * fix concat #8833 * support multi-none-input * test and global test * auto format by CI * format license Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * support half for masked_fill (#9292) * Fix BatchNorm performance (#9298) * slice update cpu kernel multi_thread loop (#9264) * slice update cpu kernel multi_thread loop * refine * try to fix bug * auto format by CI * delete useless headfile Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * fix inplace bug in `tensor.masked_fill_` (#9295) * fix: bind tensor.masked_fill_ to inplace version, fix bug in unittest * refine unittest * fix_inplace_copy_bug (#9301) * FusedMultiHeadAttentionInference (#9287) * FusedMultiHeadAttentionInference * auto format by CI * cmake * fix graph * auto format by CI * fix cmake for mlir * rm duplicated install * fix align * support float * support causal * support causal * test global property * fix * disable clang * skip cpu test * skip all test Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: jackalcooper <jackalcooper@gmail.com> * Fix compile warnings (#9302) * Fix compile warnings * fix * Set the default value of CUDA_STATIC to OFF when CUDA version is greater than or equal to 11.8 (#9306) * Reduce pass time cost (#9281) * batch del in PrunePinnedIdentityOpPass * add log * fix and refine fuse add_n * add new line * avoid op graph create * add op graph cost cnt and fix boxing log * fix ndsbp csv str * fix multi add same add_n * auto format by CI * rm debug log * auto format by CI * to cont ref * rm useless * refine auto modifier * rm useless * hack to debug * hack to debug * hack to debug * hack to debug * hack to debug ci * hack to debug ci * fix test case env var
* fix env var set * revert to const ref * auto format by CI * sync to make sure tensor are created Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refactor get sbp signature (#9304) * Add a GetSbpSignature with use parallel num instead of parallel description * Get sbp_sig_list for each dimension of hierarchy * Add test script and print out information * Remove parallel description in GetSbpSignature() * Fix small bug * Disable InferNdSbp for reshape op * Revert "Add test script and print out information" This reverts commit fdc7ee8558cab68aa9fa152cf1ba2a6dc2b4554e. * Add hierarchy value * Address comments * parallel num j-> hierarchy value for reshape op * Static analysis * refine * Update user_op.cpp * Update operator.cpp * auto format by CI * Revert Update operator.cpp This commit revert 64832e43196067d67f70094a8d35664a805a5891 Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix type error for entering a single tensor using concat op (#9316) * modify tensorprocessor * remove blank line * remove blank line * modify CheckHasDifferentInputDType func * Update oneflow/core/functional/tensor_processor.cpp * auto format by CI Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add more sbp signature print functions for log and debug (#9293) * debug code * ReshapeOp::GetSBP use hierarchy dim instead of parallel_num * comment debug log * revert debug code * auto format by CI * rm NdSbpSignatureListAsString * rm 1d sbp signature print functions Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Release/nightly cu118 (#9308) * update action * 116->118 * preserve 116 Co-authored-by: Juncheng <liujuncheng1022@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix different dtype in slice_update (#9331) * 
fix(SliceUpdate): fix different dtype in slice_update close #9330 * test(SliceUpdate): enhance test case Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix FlattenOp GetSbp (#9322) * fix flatten GetSbp * rm flatten op * update group stat * rm mlir test * fix * more strictly check * add reshape conversion Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refactor ONEFLOW_MLIR_PREFER_NHWC to support more ops (#9335) * use bn as gn * hack gn as relu * refine * support concat * ScalarDivOp * fix * move files * refine * fix bn * try fix * fix concat * fix * DRY * refactor * refactor * fix * workaround * add baseclass * rm hack * auto format by CI * minor refine * refine * add more Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * distributions.Categorical support logits not None (#9332) * avoid extra gpu memory usage in flow.save (#9328) * boxing to cpu first in flow.save Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Use primitive to replace Ndarray::BroadcastBinary (#9311) * Use primitive to replace Ndarray::BroadcastBinary * refine * fix * negative * refine * refine * Block forward support modification (#9336) * block forward support modification * add test * fix format * auto format by CI * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add log sum exp api (#9333) * add_log_sum_exp_api * refine * add logsumexp to tensor * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat: isclose and allclose (#9280) * add allclose op in tablegen * add isclose & allclose op in functional layer * use existing
framework to implement `isclose` * import isclose & allclose * compose isclose and other op to form allclose in python * typo * add doc & test files * add default arg * curly braces between one stmt * generate one random data, the other is perturbation * update test * comment for ndarray bin func * add ref from torch * Refactor random op with consistent data (#9299) * refactor(RanddomOp): refactor random op with consistent data * test(RandomOp): add data consistent test * fix(RandomSeed): fix parallel_num==1 * move normal functor to random_functor.cpp * test(RandomOp): refine test * add comment for random_seed getter function * remove special judgement for 1n1d * fix random_seed parallel_num==1 * fix cuda generator index bug * fix test function name bug Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * bool tensor slice_update use masked_fill when possible (#9324) * bool tensor slice_update use masked_fill when possible * refine * auto format by CI * fix comment * auto format by CI * Update oneflow/api/python/framework/tensor_functions.cpp Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> * refine * auto format by CI * except partial sum test * add todo Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> * Move tensor apis to cpython (#9303) * move tensor.is_floating_point to c++ * refine * move tensor.split to c++ * move tensor.flip to c++ * auto format by CI * Update oneflow/api/python/framework/tensor.cpp Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> * refactor flip * refine * auto format by CI * fix free(): invalid pointer Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> * Add gelu_tanh op and kernel (#9343) * gelu_tanh * rename GeluTanh -> FastGelu * regulate constant and increase precision * instantiate and reg backward * reg grad fn * address 
review * address review * format * update test * refine_test_maxpool2d_channel_last (#9344) * refine * auto format by CI * add skip * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Refactor normal initializer (#9307) * refactor(RanddomOp): refactor random op with consistent data * test(RandomOp): add data consistent test * refactor(Initializer): refactor normal with oneflow kernel * fix(RandomSeed): fix parallel_num==1 * test(initializer): add initializer data test * format code * move normal functor to random_functor.cpp * test(RandomOp): refine test * add trunc_normal and relax mean/std precision * fix conflict * fix merge conflict Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support fp16 in constant folding (#9337) * support fp16 * format * clean * refine * auto format by CI * refine test * clean * refine * refine Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * fix exp overflow with minus max trick (#9353) * Fix occasional bug in random_op data test (#9354) fix(RandomOp): fix occasional bug in random_op data test Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Dev add gumbel softmax (#9208) * regis gumbel_softmax * add: gumel_noise, attr-hard, next: log, one-hot, grad * add(fail): exp_dist * add: gumbel, grad on cpu, next: cuda * add: cuda & test bug: Synchronize() * add: docs, test_hrad, test_grad * add: format code * fix: TmpSize * fix: review * format, try to add * add: functor * format & half of rand * remove ops & kernels * support half of argmax & dim_scatter * fix review * add gumbel softmax docs * fix review * remove gumbel_softmax_grad_functor * remove grad in yaml * fix: raise half no util error * auto format by CI * auto format by CI * fix: make * fix: static Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix the 
inconsistent behavior of slice update (#9321) * modify tensor_index.cpp * modify * support scalar tensor indexing * support scalar * modify tensor_util * modify tensor_index * add macro definition * add support type * refine getitemscalartensor * Update oneflow/core/framework/tensor_util.cpp * modify macro * modify macro and test * modify test * modify function parameter * modify tensor_index ("uint8" is regarded as "bool") Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * enable autocast for that op which has nocast arguments (#9362) * fix autocast * fix * Add NHWC format for group norm (#9368) * group * nhwc * test_case * ir * fix * refine * Enable ZeRO with auto parallel (#9288) * Enable ZeRO with auto parallel in the first setting and speed up * Remove compute_cost parameter from Initialization of copy cost * Move the addition of wait time into sbp_node * Remove transfer cost since it is merged into the GetTransferCost() * Rename mainstem to trunk * Update warning Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat unbalanced split nd sbp (#9310) * Add a GetSbpSignature with use parallel num instead of parallel description * Get sbp_sig_list for each dimension of hierarchy * Add test script and print out information * Remove parallel description in GetSbpSignature() * Fix small bug * Disable InferNdSbp for reshape op * Revert "Add test script and print out information" This reverts commit fdc7ee8558cab68aa9fa152cf1ba2a6dc2b4554e. * Use the same physical shape as eager did * Remove the difference between eager and lazy for physical shape * Update the filter * Revert "Use the same physical shape as eager did" This reverts commit f20e222327e21166d5b5325e37c3cbe9ca4f4ac6. 
* Compute range for each rank * Compute position for range * Remove the difference between eager and lazy * Allow unbalanced split for variables * Add test script and print out information * Pass 2d test cases * Resolve conflict * Can not merge some split * Reduce in and out sbp simultaneously * Speed up for 1d sbp Package up the function for replacing hierarchy * Reduced simultaneously with the same hierarchy * Deal with 1to2d and 2to1d in InOutParallelDimReduce() * Pass 1to2d and 2to1d test cases * Remove the old code * Revert "Add test script and print out information" This reverts commit 58cdfb40b6536eb74c02174d3a69409676da374f. * Add the check for split questionary back * Feat speed up cost computation (#9355) * Compilation speed up * Speed up compilation for cost between 1d sbp * fix comment typo * Address comment Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add upsample_nearest_2d to amp clear list (#9366) * fix cuda integral type closeness computation (#9346) * fix cuda integral type computation * remove include Co-authored-by: Juncheng <liujuncheng1022@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add fused linear (#9369) * Support fp16 on some cpu operators (#9374) support fp16 cpu triu * Scalar math kernels support inplace (#9372) * Scalar math kernels support inplace * type * fix * Optimize GroupNorm NHWC with FastDivmod (#9373) * GradAcc Mem V5: Part 0-4 (#8961) * default nccl use compute stream in grad acc * rm sharable mem block graph * half implement of LogicalChains * part-0 : Logical Chain * fix compile * logical chain runnable * fix bug of logical chain dp * Part 1 : AfterGradAccChain * fix bug of crash in acc chain infer * AccCtrlTick Op/Task/Actor/Pass * tmp * AccCtrlTick runnable * rename group boxing identity and model diff scale op name * strict order by acc tick * merge mem block by logical chain id group * fix user op register * fix GLOG error when no grad acc
* Inplace repeat variable * Inplace repeat support consumed/produced ctrl regst * Part-4: merge acc op into chain for reuse memory acc input (#9071) LogicalChain can merge acc op into chain for reuse memory acc input. Measured: GPT memory usage matches part-3; BERT and T5 memory usage is mostly slightly lower than part-3. https://github.com/Oneflow-Inc/OneTeam/issues/1670#issuecomment-1240468576 * find first source/sink op in acc chain which can be insert ctrl * TryMergeAfterAccLogicalChainToFirstLogicalChain * remove debug log * rm old version repeat kernel * fix format * MergeChainByLogicalChainId/PhysicalTaskGraph * IsValidChainId * rm useless file * remove note * fix clang-tidy * more IsValidChainId * rm debug log * rm note * fix bug of cpu repeat inplace var bug * fix bug of memory reuse for 0-size regst in time line algo * fix bug of acc chain merge mem guard * reuse cast to tick op * fix bug of acc different stream hint cause sync backward compute * actor name log * fix for review * remove log * fix note * fix bug of connect to cast to tick op * refine code for review * fix for review Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix the bug of fill_tensor_ of support fp16 & autocast (#9375) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Allocate in instruction computation (#9282) * allocate memory in InstructionPolicy::Compute * remove unused methods of VirtualMachineEngine. * backup code * UnimplementedAllocator * prepare allocators for each cpu stream.
* allocator for ccl stream * init AllocateTensorInstructionPolicy::output_dependences_ * only sync current rank in oneflow._oneflow_internal.eager.Sync * Update oneflow/core/vm/allocate_tensor_instruction_policy.cpp Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * Disable conv algorithm search in eager mode (#9376) * Disable conv algorithm search in eager mode * refine * Add FusedGroupNormSilu (#9387) * Update fmt (#9392) update fmt * FusedConvBias (#9395) * FusedConvBias * fix * fix batchnorm infer dtype failed in half inference (#9388) * fix batchnorm infer dtype failed in half inference * refine * inplace update moving_mean and moving_variance * auto format by CI * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * fix_logsumexp_overflow_error (#9385) * fix_logsumexp_overflow_error * fix static check error * skip big val test case Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refactor uniform initializer (#9384) * refactor uniform * add test * fix * add test and modify uniform * remove register_initalizer * support uniform_int * modify * fix * remove * auto format by CI * fix * fix global bug Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat module to local (#9400) * add test * module to_local * Revert "add test" This reverts commit f1913b90c905c3465bff90463ad77a3ac7d5267f.
* add test of module to local * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Update tensor constructor to fix issue #9403 (#9404) * fix issue #9403 Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Optimize fast_gelu half specialization (#9408) * Optimize fast_gelu half specialization * use CUDA_VERSION * Impl of fused_bias_add_scale_mask_softmax_dropout (#9401) * function op * check drop_mask shape and dtype * cuda kernel * fix typo * add to amp list * rm unexpected line * mv op source * mv grad function * fix typo * rm template instants * fix grad function * update test * fix header micro-lock * update test * comment * move to fused_softmax namespace * auto format by CI * static assert msg Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix bug when autograd.grad meet tensor.grad is not None (#9402) * fix(Autograd): fix bug when autograd.grad meet tensor.grad is not None * fix(Autograd): fix autograd.backward bug * fix(Autograd): fix autograd.grad need_execute calculate bug * add comment * fix(Autograd): fix need_execute bug * test(Autograd): add test for backward/grad in same tensor * refactor(Tensor): move retain_grad judgement to Tensor * fix(Tensor): fix retain_grad judgement * fix(Autograd): fix capture index bug * fix(Autograd): autograd.grad same output's grad_fn executed bug Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Optimize UpsampleNearest2D 2X (#9415) * Add MaxUnpool op (#9309) * add: op definition * dev: add cpu max_unpool1d kernel, and return right output * refine: delete useless max_unpool_1d code * dev: add grad op, delete useless
code, delete data_format * dev: fix bug in backward kernel, finish grad_func * add: unittest * refine code * refine code * add: docs * add skip in doctest * use template for functor, fix typo in op definition, support int64 and cpu float16, fix overflow bug in cuda kernel, refine unittest * add global unittest * refine template name * refine & clean * remove useless header file * add bfloat16, fix bug in kernel, remove template NDIMS in kernel * add profile in unittest * auto format by CI * add if macro for nv_bfloat16 Co-authored-by: mosout <mosout@qq.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * bypass StopIteration error in dataloader delete_shm (#9393) bypass StopIteration error in dataloader rebuild_shm Signed-off-by: daquexian <daquexian566@gmail.com> Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Impl of fused_fast_gelu_mul (#9397) * fix arg name * function api * grad function * amp list * fix function * instantiate * revert auto complete * fix fast_gelu amp list * op kernel * forward kernel * revert binary op * fix * update test * update test * fix grad function * constexpr * format * fix test * fuse x_diff and multiplier_diff * performant FusedFastGeluMulGradCudaKernel * fix computation * update test * reduce template param * device prefix Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add autograd engine debug graph (#9412) * fix(Autograd): fix bug when autograd.grad meet tensor.grad is not None * fix(Autograd): fix autograd.backward bug * fix(Autograd): fix autogra.grad need_execute calculate bug * add comment * fix(Autograd): fix need_execute bug * test(Autograd): add test for backward/grad in same tensor * refactor(Tensor): move retain_grad judgement to Tensor * fix(Tensor): fix retain_grad judgement * feat(Autograd): add autograd engine debug graph 
* style(*): use fmt refine string concat * fix(*): fix autograd debug graph * fix(Autograd): fix capture index bug * refine debug string * refine code Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Optimize transpose identity (#9416) * Optimize fmha transpose (#9417) * Fix the usage of argument end_factor in LinearLR (#9421) * fix end_factor * fix indent Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix lazy scalar tensor indexing (#9420) fix(Indexing): fix lazy scalar tensor indexing Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * GroupedMatmulBias (#9413) * grouped matmul bias * fix * grouped_matmul * Amp list * fix cu102 * fix * Optim upsample backward (#9424) * optimize upsample nearest2d backward * refine * revert * pack dy * fix comment * fix comment * fix comment * fix comment * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Speed up the training (#9278) * Move the "-expand" and "-cast" ops backward * Hard-coding for stable diffusion, maximize overlaps * Use op_type_name instead of visual string * Change transfer nodes to tributary nodes * Rename tributary to overlap * Prepare to test different decide parameters * Prepare to print and test * {7, 5} seems to be one of the best as before * Find the best straighten mode 973 for stable diffusion * Put cpu nodes into overlap node list * Disable overlap between cpu and gpu if no cpu nodes * Update API * Remove magical number * Update comment * Remove std log message * Remove debug code * Static analysis * Variable op still have activation time in cpu * Rename (address comment) * Profiling item (#9394) * profiling tensor.item * SyncAccessInstructionPolicy * FastCopy supports 128-bit data_types * address static analyzer complaints * revert changes about tensor.numpy() * Stream::CheckSizeAndGetTmpSmallPinnedMemPtr *
disable busy wait in SyncAccessSmallMem * auto format by CI Co-authored-by: Luyang <flowingsun007@163.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * KernelPriority (#9427) * KernelPriority * concat * fix empty tensor * address review * Graph rename v2 (#9351) * rename block to GraphBlock * rename attr of graph block to avoid name colision with origin * rename * add test * revert rename * revert old and test pass * rename BlockConfig to GraphModuleConfig * add test of module property * rename to_graph to trace * refactor block with GraphModule SubGraph gt * refact and test graph pass * all test passed * fix typo of auto_parallel_mainstream_algo * refine ModuleBlock repr * revert auto_parallel_mainstream_algo * auto format by CI * support mixin * auto format by CI * add test of mixin property * auto format by CI * fix auto test error * fix doctest of graph.py * refine doc * fix outdated api and typo * address review * auto format by CI * fix * auto format by CI * rename block to proxy * auto format by CI * avoid use GraphBlock in to * auto format by CI * fix style * add import * fix doc * use new style * support graph tensor set_stage * auto format by CI * update libai commit * Revert "update libai commit" This reverts commit d000c1ad1e2d2b9cad3257b1028cc84cc419c547. 
* update libai commit * fix comm barrier * format * auto format by CI * echo oneface commit id * add log * Update test.yml * Update test.yml * Update test.yml * Update test.yml * Update test.yml * add pytest * Update test.yml * Update test.yml * use pytest Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Cherry-pick IR changes (#9430) * pull in files * port changes on td * refine file check * refine file check * allow opt to run properly * auto format by CI * refine case * refine * gn+silu * add flag * Update oneflow/ir/lib/OneFlow/Transform/CSEWithAttributesIgnored.cpp Co-authored-by: Peihong Liu <mosout@qq.com> * Update oneflow/ir/lib/OneFlow/Transform/CSEWithAttributesIgnored.cpp Co-authored-by: Peihong Liu <mosout@qq.com> * Update oneflow/ir/oneflow-opt/oneflow-opt.cpp Co-authored-by: Peihong Liu <mosout@qq.com> * auto format by CI * update description Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Peihong Liu <mosout@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [hotfix] remove cuda half unittest in maxunpool (#9436) remove cuda half unittest in maxunpool * Fix checkpoint v2 (#9437) fix checkpointv2 * fix to pass compile * rm check to pass test * fix TaskGraph init Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: ZZK <359521840@qq.com> Co-authored-by: guo-ran <360112263@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Ping Zhu <58718936+reygu@users.noreply.github.com> Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> Co-authored-by: Juncheng <liujuncheng1022@gmail.com> Co-authored-by: Yao Chi <later@usopp.net> Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> Co-authored-by: Luyang <flowingsun007@163.com> 
Co-authored-by: binbinHan <han_binbin@163.com> Co-authored-by: Shiyuan Shangguan <shiyuan@oneflow.org> Co-authored-by: yuhao <72971170+howin98@users.noreply.github.com> Co-authored-by: jackalcooper <jackalcooper@gmail.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: Zhimin Yang <76760002+small1945@users.noreply.github.com> Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: Dongche Zhang <zhang2000dc@gmail.com> Co-authored-by: leaves-zwx <kunta0932@gmail.com> Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: Peihong Liu <mosout@qq.com> Co-authored-by: Yipeng Li <jamesonli1313@gmail.com> Co-authored-by: wyg1997 <wangyinggang@foxmail.com> Co-authored-by: Liang Depeng <liangdepeng@gmail.com> Co-authored-by: Yu OuYang <xuanjiuye@gmail.com> Co-authored-by: WangYi <buaawangyi03@gmail.com> Co-authored-by: rejoicesyc <47683675+rejoicesyc@users.noreply.github.com> Co-authored-by: songyicheng <int.rejoice@gmail.com> Co-authored-by: QI JUN <qijun1994@hotmail.com> Co-authored-by: zhaoyongke <zhaoyongke@yeah.net> Co-authored-by: JiaKui Hu <hjk1938927583@163.com> Co-authored-by: cheng cheng <472491134@qq.com>
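The changelog above returns several times to the numerical stability of `logsumexp` (`Add log sum exp api`, `fix exp overflow with minus max trick`, `fix_logsumexp_overflow_error`). The trick those entries refer to is the standard max-subtraction identity; here is a minimal NumPy sketch of the idea, not OneFlow's actual kernel:

```python
import numpy as np

def naive_logsumexp(x):
    # exp(1000) overflows float64 to inf, so the whole result becomes inf.
    return np.log(np.sum(np.exp(x)))

def stable_logsumexp(x):
    # Subtract the max so every exponent is <= 0; exp() can no longer
    # overflow, and the subtracted max is added back outside the log.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.5, 999.0])
with np.errstate(over="ignore"):
    print(naive_logsumexp(x))   # inf
print(stable_logsumexp(x))      # ~1001.104
```

This works because log Σ exp(xᵢ) = m + log Σ exp(xᵢ − m) for any m; choosing m = max(x) makes the largest exponent exactly 0.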
Commit: | ab9d76c
---|---
Author: | Yipeng Li
Committer: | GitHub
Speed up the training (#9278) * Move the "-expand" and "-cast" ops backward * Hard-coding for stable diffusion, maximize overlaps * Use op_type_name instead of visual string * Change transfer nodes to tributary nodes * Rename tributary to overlap * Prepare to test different decide parameters * Prepare to print and test * {7, 5} seems to be one of the best as before * Find the best straighten mode 973 for stable diffusion * Put cpu nodes into overlap node list * Disable overlap between cpu and gpu if no cpu nodes * Update API * Remove magical number * Update comment * Remove std log message * Remove debug code * Static analysis * Variable op still have activation time in cpu * Rename (address comment)
Commit: | 900497d
---|---
Author: | Yipeng Li
Merge remote-tracking branch 'origin/release/debug-sd-conv-gn-geglu-silu' into release/feat-speed_up-straighten
Commit: | 1667cdf
---|---
Author: | Zhimin Yang
Committer: | GitHub
Refactor uniform initializer (#9384) * refactor uniform * add test * fix * add test and modify uniform * remove register_initalizer * support uniform_int * modify * fix * remove * auto format by CI * fix * fix global bug Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Commit: | daf76ca
---|---
Author: | Yipeng Li
Find the best straighten mode 973 for stable diffusion
Commit: | 32ad8c1
---|---
Author: | small1945
modify
Commit: | 90d39a7
---|---
Author: | cheng cheng
Committer: | GitHub
GradAcc Mem V5: Part 0-4 (#8961) * default nccl use compute stream in grad acc * rm sharable mem block graph * half implement of LogicalChains * part-0 : Logical Chain * fix compile * logical chain runnable * fix bug of logical chain dp * Part 1 : AfterGradAccChain * fix bug of crash in acc chain infer * AccCtrlTick Op/Task/Actor/Pass * tmp * AccCtrlTick runnable * rename group boxing identity and model diff scale op name * strict order by acc tick * merge mem block by logical chain id group * fix user op register * fix GLOG error when no grad acc * Inplace repeat variable * Inplace repeat support consumed/produced ctrl regst * Part-4: merge acc op into chain for reuse memory acc input (#9071) LogicalChain can merge acc op into chain for reuse memory acc input. Measured: GPT memory usage matches part-3; BERT and T5 memory usage is mostly slightly lower than part-3. https://github.com/Oneflow-Inc/OneTeam/issues/1670#issuecomment-1240468576 * find first source/sink op in acc chain which can be insert ctrl * TryMergeAfterAccLogicalChainToFirstLogicalChain * remove debug log * rm old version repeat kernel * fix format * MergeChainByLogicalChainId/PhysicalTaskGraph * IsValidChainId * rm useless file * remove note * fix clang-tidy * more IsValidChainId * rm debug log * rm note * fix bug of cpu repeat inplace var bug * fix bug of memory reuse for 0-size regst in time line algo * fix bug of acc chain merge mem guard * reuse cast to tick op * fix bug of acc different stream hint cause sync backward compute * actor name log * fix for review * remove log * fix note * fix bug of connect to cast to tick op * refine code for review * fix for review Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
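The GradAcc parts above are all infrastructure for gradient accumulation. The arithmetic they serve can be recalled with a toy, framework-agnostic NumPy loop (the `grad_fn` callback is hypothetical; none of this reflects OneFlow's actor or logical-chain machinery):

```python
import numpy as np

def sgd_with_grad_acc(w, batches, grad_fn, lr=0.1, acc_steps=4):
    # Accumulate micro-batch gradients in one buffer and take a single
    # optimizer step every `acc_steps` batches - numerically equivalent to
    # one large batch, but only one micro-batch's activations are alive
    # at a time, which is what the memory work above exploits.
    acc = np.zeros_like(w)
    for i, batch in enumerate(batches):
        acc += grad_fn(w, batch)
        if (i + 1) % acc_steps == 0:
            w = w - lr * acc / acc_steps  # average, then step
            acc[:] = 0.0
    return w

# With gradients 1..4 averaged over 4 micro-batches: 0 - 0.1 * 10/4 = -0.25
w = sgd_with_grad_acc(np.zeros(1),
                      [np.array([g]) for g in (1.0, 2.0, 3.0, 4.0)],
                      grad_fn=lambda w, b: b)
print(w)  # [-0.25]
```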
Commit: | 65fd4d9 | |
---|---|---|
Author: | Yipeng Li | |
Committer: | GitHub |
Enable ZeRO with auto parallel (#9288) * Enable ZeRO with auto parallel in the first setting and speed up * Remove compute_cost parameter from Initialization of copy cost * Move the addition of wait time into sbp_node * Remove transfer cost since it is merged into the GetTransferCost() * Rename mainstem to trunk * Update warning Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Commit: | a3841f5 | |
---|---|---|
Author: | Yinggang Wang | |
Committer: | GitHub |
Refactor normal initializer (#9307) * refactor(RandomOp): refactor random op with consistent data * test(RandomOp): add data consistent test * refactor(Initializer): refactor normal with oneflow kernel * fix(RandomSeed): fix parallel_num==1 * test(initializer): add initializer data test * format code * move normal functor to random_functor.cpp * test(RandomOp): refine test * add trunc_normal and relax mean/std precision * fix conflict * fix merge conflict Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
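The `trunc_normal` initializer added above can be approximated by rejection resampling: draw from a normal distribution and redraw any value outside the truncation bounds. A minimal sketch (the function name, default bounds, and resampling strategy are assumptions for illustration, not OneFlow's kernel):

```python
import numpy as np

def trunc_normal(shape, mean=0.0, std=1.0, a=-2.0, b=2.0, rng=None):
    """Sample a normal(mean, std) tensor, resampling values outside [a, b]."""
    rng = rng or np.random.default_rng(0)
    x = rng.normal(mean, std, size=shape)
    bad = (x < a) | (x > b)
    while bad.any():  # rejection-resample only the out-of-range entries
        x[bad] = rng.normal(mean, std, size=int(bad.sum()))
        bad = (x < a) | (x > b)
    return x
```

Note that truncation shrinks the effective standard deviation, which is one reason the commit relaxes the mean/std precision checks in the tests.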
Commit: | 45080d4 | |
---|---|---|
Author: | chengtbf |
merge master
Commit: | f97f09f | |
---|---|---|
Author: | guo ran | |
Committer: | GitHub |
OneEmbedding add smart decay sparse adam (#9176) * add sparse adam * smart decay sparse adam * address review * fix * mv smart_decay to one_embedding namespace
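For sparse embedding tables, a "smart decay" Adam avoids touching untouched rows every step: when a row finally receives a gradient, the moment decay for the skipped steps is applied in one shot. A toy sketch of that idea (class name, per-row bookkeeping, and the exact catch-up rule are assumptions, not OneEmbedding's kernel):

```python
import numpy as np

class SmartDecaySparseAdam:
    """Per-row Adam that lazily applies moment decay for skipped steps."""
    def __init__(self, table, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.w, self.lr = table, lr
        self.b1, self.b2, self.eps = beta1, beta2, eps
        self.m = np.zeros_like(table)
        self.v = np.zeros_like(table)
        self.last_step = np.zeros(len(table), dtype=np.int64)
        self.step = 0

    def update(self, rows, grads):
        self.step += 1
        for r, g in zip(rows, grads):
            skipped = self.step - self.last_step[r]  # steps since last touch
            # Untouched steps had zero gradient, so m and v would have
            # decayed by beta^k over k steps; apply that factor at once.
            self.m[r] *= self.b1 ** skipped
            self.v[r] *= self.b2 ** skipped
            self.m[r] += (1 - self.b1) * g
            self.v[r] += (1 - self.b2) * g * g
            m_hat = self.m[r] / (1 - self.b1 ** self.step)
            v_hat = self.v[r] / (1 - self.b2 ** self.step)
            self.w[r] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
            self.last_step[r] = self.step
```

When a row is touched every step, `skipped == 1` and this reduces to ordinary Adam, so the dense and sparse behaviors agree.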
Commit: | c8ea5f9 | |
---|---|---|
Author: | chengtbf |
merge master
Commit: | 13876f4 | |
---|---|---|
Author: | Yipeng Li | |
Committer: | GitHub |
Feat speed up mem reuse (#9210) * Use HashSet instead of vector * O(n^3) -> O(n^2) * Compute offset for memory-first algorithm only * Remove explicit exclusion relationship * Revert print out information * Speed up exclusion judgement * Switch HashMap to vector * Code clean up * life time -> lifetime * mem_reused_regst: HashSet -> std::vector regst_desc_id2regst_desc -> mem_chain2regst_desc_id2reuse_regst_desc * Re-implement MemReusedAlgorithm_TimeLineAlgo and comment out useless code * Make allocate and free timeline local and HashSet -> std::vector * Eliminate a lot of Hash stuffs * Revert "Eliminate a lot of Hash stuffs" This reverts commit abfb86df57b13074cb50ca9dc080a1333cd46802. * Important comment * Address comments * auto format by CI * Remove magic number -1 * Address comment and rename Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
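The memory-reuse planner above assigns an offset to each register so that registers with overlapping lifetimes never overlap in memory. A toy first-fit sketch of that lifetime-overlap idea (the real TimeLineAlgo and its vector-based data structures differ; this only illustrates the invariant):

```python
def assign_offsets(regsts):
    """regsts: list of (size, birth, death) with half-open lifetimes.
    Returns per-register offsets; lifetime-overlapping registers get
    disjoint address ranges, others may share memory."""
    order = sorted(range(len(regsts)), key=lambda i: -regsts[i][0])
    offsets = [0] * len(regsts)
    placed = []
    for i in order:
        size, birth, death = regsts[i]
        # Address ranges already taken by lifetime-overlapping registers.
        busy = sorted((offsets[j], offsets[j] + regsts[j][0])
                      for j in placed
                      if regsts[j][1] < death and birth < regsts[j][2])
        off = 0
        for lo, hi in busy:  # first fit: slide past conflicting ranges
            if off + size <= lo:
                break
            off = max(off, hi)
        offsets[i] = off
        placed.append(i)
    return offsets
```

Speeding this up is exactly about making the overlap queries cheap, e.g. replacing per-pair hash lookups with sorted vectors, which is what the O(n^3) to O(n^2) bullets refer to.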
Commit: | 635453f | |
---|---|---|
Author: | chengtbv |
Merge branch 'master' of https://github.com/Oneflow-Inc/oneflow into dev_cc_acc_mem_v5
Commit: | be986e2 | |
---|---|---|
Author: | ZZK | |
Committer: | GitHub |
Export MultiTensor Update and FuseUpdateCast to GraphConfig (#9209) * export to graph config * refine or Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Commit: | 4aae0e7 | |
---|---|---|
Author: | chengtbv |
fix conflicts and merge master
Commit: | a6e826e | |
---|---|---|
Author: | strint |
refine world size get
Commit: | a6349bd | |
---|---|---|
Author: | strint |
rm global of world size
Commit: | 1b2879c | |
---|---|---|
Author: | Yipeng Li | |
Committer: | GitHub |
Feat straighten compress memory (#9094) * An initial implementation of linear programming primal matrix * Coding for the revised simplex method * Finish coding for the phase 1 * Fix bug. Now we can get a correct x for the initial basic feasible solution * Drive the artificial variables out in phase 1 * Bland's rule and bug fix * Adjust the mapping between the basic variables and compact columns * No columns removed while driving artificial variables out. Terminates the code if positive optimal cost found in auxiliary problem. * Implement the phase 2 of the revised simplex method. Remove columns of the inverse base matrix. * Update is_solved status and original problem recovery. * Rows and artificial columns activation * An initial implementation of mixed integer programming * Try to assemble the original problem but fail due to the massive exclusion * Steal initial position from current setting * Compute the optimal cost from the compact relationship * Move to a neighbor status and compute the cost * Find the smallest cost and actually move to that status * Check conflict after the adjustment. Adaptively reduce cost * Generate a compact position from nothing * Straighten for memory * Update the offset * Add a demo for using the revised simplex method * Remove the linear programming part * Recompute the compact relationship after moving to a new status * Rename * Code clean up * Set the tag for the straighten algorithm * Code clean up * An attempt to explore the dependency between consumer nodes of a register * Revert "An attempt to explore the dependency between consumer nodes of a register" This reverts commit f219851fb85943d07d28b84c45e5c4bae80872a0. 
* Compute the lower bound and only execute the adjustment 2 for those cases with possible reduction in memory * Pre-compute and store the memory size for registers * Use pre-stored total register num * Limit the maximum iteration step * Use VLOG(3) instead of std::cout * Change interface * Package up memory share strategy interfaces * Address comments * Address comments * Of format * Fix bug lower bound = 0 Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
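One concrete piece above is the lower bound used to skip hopeless adjustments: no offset assignment can use less memory than the peak sum of register sizes that are live at the same moment. A sketch of that bound (toy representation; half-open lifetimes assumed, so a death and a birth at the same time do not overlap):

```python
def memory_lower_bound(regsts):
    """regsts: list of (size, birth, death) half-open lifetimes.
    Returns the peak total size of simultaneously live registers."""
    events = []
    for size, birth, death in regsts:
        events.append((birth, size))    # register comes alive
        events.append((death, -size))   # register dies
    live = peak = 0
    # Sorting (time, delta) processes deaths before births at equal
    # times, matching the half-open [birth, death) convention.
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak
```

Comparing this bound to the current placement's peak tells the algorithm whether adjustment 2 can possibly reduce memory, which is why it is only executed for those cases.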
Commit: | 66bb990 | |
---|---|---|
Author: | duck7216 |
resnet50
Commit: | 93a7947 | |
---|---|---|
Author: | lixinqi |
cut boxing_task_graph by rank
Commit: | ede3cd2 | |
---|---|---|
Author: | lixinqi |
remove Plan::fake_consumed_regst_desc_id
Commit: | f67ff82 | |
---|---|---|
Author: | Yipeng Li | |
Committer: | GitHub |
Auto Parallel (#8891) * add auto_parallel code add auto_parallel pass * Feat ap remove hierarchy cast (#7919) * feat(AutoParallel): support remove parallel_cast ops * feat(AutoParallel): export enable_auto_parallel_prune_parallel_cast_ops * format code * Fix add conv grad cost (#7972) * feat(Conv): add grad computation cost * fix ConvDataGrad computation cost * update conv grad cost * refine * Auto parallel/fast collector (#7958) * Try to speed up sbp collector. However, throughput drop * Shrink the parallel candidates for the proxy node * Print out some information and then refine * Store the sbp set for each consumer * Update binary set intersection * Remove impossible parallel candidates from sbp proxy * Refine binary set * Add a Clear() in binary set * Filter out those proxy candidates containing two sbps from the same unique group * refine * Check spells * Clip useless edges * AutoParallel mainstem algorithm add mutable_op_ctrl_edge (#8033) * feat(AutoParallel): mainstem algorithm add mutable_op_ctrl_edge * use if instead std::max * fix(AutoParallel): fix pooling computation cost function bug (#8147) * [WIP] Fix auto parallel dump uniform sbp bug (#8330) * fix(AutoParallel): fix auto parallel dump uniform sbp bug * refine source op judgement * update auto_parallel config (#8356) * Refactor dump nd sbp for auto parallel (#8353) * fix(AutoParallel): fix auto parallel dump uniform sbp bug * feat(AutoParallel): add interface for op to dump nd_sbp to op_conf * refactor(AutoParallel): refactor DumpNdSbpSignatureForOpConfFn * rename Global to Singleton * Refactor SbpEdge (#8684) * refactor(AP): refactor SbpEdge * Rename variables * Add const for some functions Co-authored-by: Yipeng Li <jamesonli1313@gmail.com> * Refactor auto parallel sbp node (#8712) * Rename * Code clean up * Code clean up * Code clean up and package up * Rename * Add const for some functions * Refactor auto parallel sbp graph (#8722) * Code clean up * Package up * Code clean up and package up in 
SbpNode and SbpEdge * Rename * Rename * Rename mainstem to trunk * Typo, small bugs and rename * Rename and of format * Refactor auto parallel rest (#8731) * Package up SbpCollector * Add const for SbpGraph * Add const for SbpNode * Add const for SbpEdge * Add const for SbpCollector * Add const, rename, and package up for BinarySet * Rename for BinarySet * Rename for SbpCollector * Rename for SbpCollector * Rename for algorithm utils * Fix a bug for an unused function AddEntries() * Rename for BinarySet * Rename for SbpConstructor * Rename for BoxingCollector * Add const for sbp utils * fix merge conflict * Remove template for sbp signature (#8787) * Remove template for sbp signature * Remove _H_ from cpp files * Remove namespace specifier oneflow:: * Remove namespace specifier oneflow:: * Of format * Move the inline functions to cpp files * Can not add inline specifier? * Update oneflow/core/auto_parallel/sbp_graph.h Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Of format Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Refactor auto parallel class object stuff (#8835) * Delete copy/move constructor/operator * Move the destructor of SbpEdge to the cpp file * Equal by address for Sbp data structure * Replace sbp_sig_list_ with sbp_sig_obj_list_ * Fix auto parallel copy cost infer2 (#8788) * Check the output shape for operator in auto parallel * Return infinity for different sbps while is_mutable * Update oneflow/core/auto_parallel/sbp_constructor.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Update oneflow/core/operator/operator.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * with output -> check output Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Refactor prune identity as much as possible (#8849) * Prune a line of parallel cast ops * Avoid repeated pruning * Code clean up * Remove identity op * Update oneflow/core/job_rewriter/auto_parallel.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> 
Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Fix auto parallel low throughput (#8876) * Speed up after pruning identity * Slight changes * Refactor auto parallel final check (#8887) * Of format * Use const auto & * Of format and rename * Re-compute cost if steals sbp signatures * Docs auto parallel doc (#8896) * doc(AutoParallel): add auto parallel document framework * docs(AutoParallel): add document * fix typo * refine document * refine documentation * Test alexnet for auto_parallel (#8917) * test(AutoParallel): test alexnet for auto_parallel * test(AutoParallel): test model add auto_parallel config * Fix get sbp bug (#8939) * Fix the bug of missing sbp for uniform op * Speed up * Add the missing sbp for optional input UserSourceOpTickInput * Remove the repeated all-B sbp signature * Add sbp for undefined UserSourceOpTickInput * Resolve conflicts while merging master * Recompute cost with time shape (#9009) * Address comments * fix merge conflict * Address comments * Disabled ZeRO when enabled AutoParallel (#9087) fix(AutoParallel): disabled ZeRO when enabled AutoParallel * Update oneflow/core/job_rewriter/optimizer_placement_optimization_pass.cpp * Address comments * Address comment. 
GetComputationCostFn -> GetComputationCost * Update oneflow/core/job_rewriter/auto_parallel.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * New interface for pr#9018 * Static analysis * Fix ones like sbp bug and fix test import error in CI (#9123) fix(AutoParallel): skip 1n1d sbp agreement check * auto format by CI * test(AutoParallel): skip acc check * Address comments * rename source op set nd_sbp function and add check * fix typo * Feat full auto parallel (#9140) * Use B for inplace op and remove the check for sbp while turning the auto parallelism on * Slight change * Not using B as the constraint * Address comments * add debug log for non-deleted cast ops * update prune parallel cast op log * rename auto_parallel_prune_parallel_cast_ops to enable_auto_parallel_ignore_user_sbp_config Co-authored-by: wyg1997 <wangyinggang@foxmail.com> Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
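At its core, the auto-parallel pass above picks one SBP signature per op so that the total computation cost plus inter-op copy (transfer) cost is minimal. A brute-force toy of that objective on a tiny graph (for intuition only; the real SbpGraph uses node/edge elimination, the trunk algorithm, and a greedy search instead of enumeration):

```python
from itertools import product

def min_cost_assignment(node_costs, edge_costs):
    """node_costs[i][s]: computation cost of node i under its s-th SBP
    candidate. edge_costs[(i, j)][si][sj]: copy cost on edge (i, j) for
    the chosen candidates. Returns (best total cost, best assignment)."""
    n = len(node_costs)
    best, best_assign = float("inf"), None
    for assign in product(*[range(len(c)) for c in node_costs]):
        cost = sum(node_costs[i][assign[i]] for i in range(n))
        cost += sum(tbl[assign[i]][assign[j]]
                    for (i, j), tbl in edge_costs.items())
        if cost < best:
            best, best_assign = cost, assign
    return best, best_assign
```

Enumeration is exponential in the number of ops, which is why the production algorithm collapses the graph by eliminating nodes and edges before searching.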