These are the commits in which the Protocol Buffers files changed (only the last 100 relevant commits are shown).
Commit: bb39d37
Author: Nathan Luehr
Committer: Nathan Luehr
Merge branch 'master-tf1-cuda_malloc_async_fixes' into 'master-tf1': [TF1] Fix cudaMallocAsync. See merge request dl/tensorflow/tensorflow!605 (cherry picked from commit 68d8ef4ba479fe9ebf531a373a1003cf187c9a7c).
456352bd [Crash fix] Other parts of TF need the stats for the GPU allocator.
5988bca4 Add a cudaMallocAsync test
624d6d1f [Crash fix] Handle the passed stream correctly.
bcdfad70 Use a static cast
df76191f Make the new option experimental and add it to the golden list of public API.
56cdf5d7 Use the new option
b7097a7d Variable/function renaming. Also make a counter atomic to be more future...
0afa4788 Fix cherry-pick
The documentation is generated from this commit.
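For context, a minimal sketch of opting into the cudaMallocAsync-based allocator from Python. Upstream TensorFlow selects this allocator through the TF_GPU_ALLOCATOR environment variable; whether this TF1 fork uses the same name, rather than only the experimental option mentioned in the commit above, is an assumption.

```python
import os

# Assumption: the allocator is selected the same way as in upstream
# TensorFlow, via TF_GPU_ALLOCATOR=cuda_malloc_async. Must be set before
# TensorFlow initializes the GPU devices.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

import tensorflow as tf  # GPU allocations now go through cudaMallocAsync
print(tf.test.is_gpu_available())
```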
Commit: 434922f
Author: Ayan Moitra
Committer: Nathan Luehr
Enable cudnn batchnorm based on levels rather than a bool. Also squashes: build failure fix; address Bas's comments; remaining changes; test changes.
Commit: 187f048
Author: Frederic Bastien
Committer: Nathan Luehr
Updated hlo_to_llvm_ir to also emit PTX and re-enable tests using it. See merge request dl/tensorflow/tensorflow!555
Commit: 1decc05
Author: Bastiaan Aarts
Committer: Nathan Luehr
XLA persistent compilation cache: store results of llvm->ptx and ptx->cubin compilations for subsequent executions. Also squashes: include ptxas options in the hash when generating the key for the persistent cache; fix botched TF2->TF1 integration.
Commit: 051457f
Author: Kaixi Hou
Committer: Nathan Luehr
Integrate CUDNN frontend API for convolution. Also squashes: add cudnn_frontend patch and change macros; apply filter over fallback list and add winograd filter; remove redundant headers; add flags; add env var for cudnn frontend APIs; fix empty fallback lists.
Commit: 9055fdc
Author: Ayan Moitra
Committer: Nathan Luehr
[skip ci] XLA enables BN+Act and BN+Add+Act fusion when implementing kBatchNormTraining HLOInstruction as cudnn custom call
Commit: b594840
Author: Bastiaan Aarts
Committer: Nathan Luehr
XLA cudnn Softmax
Commit: 917de25
Author: Trent Lo
Committer: Nathan Luehr
[XLA/GPU] Size-constrained buffer allocation
Commit: e5feec9
Author: Frederic Bastien
Committer: Nathan Luehr
Phase3 backport upstream XLA. See MR 289. Squashes:
Remove deprecated variants of DynamicSlice and DynamicUpdateSlice builders
Remove the old tensorflow/compiler/{xla,mlir} directory
Add the new XLA/MLIR directory
Replace the LLVM/MLIR config directory
Update LLVM version structure
BUMP LLVM version
Automatic fixup
Disable new XLA call to not-present new MIOpen interface
Change `LocalClient::Compile` to support returning multiple executables (one per partition).
[TF2XLA] Preserve the dynamic dimension (-1) when building a reshape.
[XLA] Respect set-dimension-size in dynamic dimension inference.
Support fill op with (bounded) dynamic shape input.
Add missing header file for the TF1/TF2 bridge.
Better ptxutil TF1/TF2 bridge
Add missing include
Disable the same tests as what is disabled in gitlab/master
Add another TF1/TF2 bridge header
Update tests and HLO printing to be valid JSON.
Disable tests that shouldn't be run in oss.
Fix disabled tests handling
Comment broken license include
Fix missing call in TF1.
Plumb exponential_average_factor through stream executor to cudnnBatchNormalizationForwardTraining and the equivalent ROCM function.
Remove vestigial code from cuDNN 4 in stream executor.
Manual fixup of FusedBatchNorm conflict TF1/TF2 Bridge. Add dummy function to respect the new API.
Undo some change as the new TF1/TF2 bridge covers this.
DISABLE non OSS tests.
Disable a test at the right place.
[TF:XLA] Move XLA tests that depend on contrib in preparation for TF 2.0
[XLA] Split testParameterizedTruncatedNormalIsInRange to avoid timeout.
Implement F64 scalar addition for XLA TPU backends.
Implement F64 scalar multiplication for XLA TPU backends.
[XLA] Add some F64 tests for MatMul.
[XLA] Slightly change some test inputs/tolerances
[XLA] Brings F64 error thresholds in rough accordance with F32
Add test for unstack op.
Add test for Expm1 for small parameter regime of complex numbers.
Annotate a test with tf_cuda_tests_tags().
Manual fixup.
Run the test multiple times. This should trigger the compilation.
Try more runs.
Add Debug print to help debug in case of problem.
Try to fix the test.
Add a comment
Fix checkpoint reading. The compression was disabled.
Fix the copy bug. Add a test
NFC: rename a variable
[XLA] Extend the Algebraic Simplifier to convert Pow(x, 3) -> x*x*x. This is faster.
Commit: b4c8f06
Author: TensorFlower Gardener
Committer: Nathan Luehr
[XLA] Backport a few missing PRs. See MR 315. Cherry-picks upstream commits for:
* vectorize row reduction for even row size
* Extra VLOG for PTXAS calls
Squashes:
Merge pull request #38136 from nouiz:fbastien_xla_reduce_sm_v5_push_vec_rb2_manual_cherry-pick
Better error message. Now we print part of the module name that caused the error, so it is easier to find the right file.
Use EmitWriteArrayElement that adds annotations. This could help LLVM to vectorize.
add -DDEBUG_BUILD to dbg profile
Backport PR: 39734, add an XLA_FLAGS parameter
Add back a fix that was lost in the cherry-pick due to a change of order.
Commit: dc9e499
Author: A. Unique TensorFlower
Committer: Nathan Luehr
Create a TF1/TF2 internal bridge. Phase2 MR for XLA backport. See MR 284. Squashes:
Introduce new RngBitGenerator HLO
Update blocking_counter setup to the upstream version.
Add rules dynamic_annotations and mutex that exist upstream.
Add define to the new TF name: TF_GUARDED_BY, PT_GUARDED_BY, TF_EXCLUSIVE_LOCKS_REQUIRED, TF_LOCKS_EXCLUDED
Add rules for bfloat16 and numeric_types
Backport some bazel rules: tf_exec_properties, tf_grpc_dependency and tf_grpc_cc_dependency
Backport tensorflow/core/protobuf/tpu/compile_metadata.proto
Add conversion file for asm_compiler
Add build/include to convert between the new/old interface.
Add new dependency platform_port, status and redzone_allocator.
redzone
Continue TF API bridge
CompileGpuAsm
Move all new bazel rules at the end of the file. Add a rules numbers at the new place.
Add dlpack
Bridge TF1/TF2 add a header.
Fix some tests build
Include the right dependency. Otherwise, it causes a protobuf init error.
Commit: f4bf2be
Author: A. Unique TensorFlower
Committer: Nathan Luehr
Adds new protobuf message HloExecutionProfileData, which describes HloExecutionProfile. PiperOrigin-RevId: 269313040
Commit: 54af36a
Author: Frederic Bastien
Committer: Nathan Luehr
Backport PR: 39734, add an XLA_FLAGS parameter
Commit: 95db533
Author: Frederic Bastien
Committer: Nathan Luehr
Rename proto key to not conflict with upstream. Start our own range of IDs.
Commit: 5e0d58b
Author: Trent Lo
Committer: Nathan Luehr
Add AsyncOut support for XLA. This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters fed to the HorovodAllreduce nodes. The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1". Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit): 13% perf gain on BERT large pretrain squad fp32, BatchSize=2; 7% perf gain on Unet medical trainbench fp32. Design Doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit Authors: Trent Lo and Ayan Moitra
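The flags quoted in this commit message can be set from Python before TensorFlow is imported; a small sketch using exactly the flag string given above (nothing else is assumed):

```python
import os

# Enable XLA auto-clustering and the AsyncOut feature described above.
# The flags must be in the environment before TensorFlow starts up.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=1 --tf_xla_async_io_level=1"

import tensorflow as tf  # XLA clusters now emit outputs asynchronously
```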
Commit: 33cc429
Author: TensorFlower Gardener
Committer: Nathan Luehr
Merge pull request #35621 from bas-aarts:bas_auto_tune_level PiperOrigin-RevId: 291051983 Change-Id: I9518cca1eb07e44be7857b0c3d87c94de420fdce
Commit: 76a5325
Author: Frederic Bastien
Committer: Nathan Luehr
[XLA] Add the Scatter option to not do atomic operations. This should keep backward compatibility with the behavior from before this option existed. Also squashes: add a test for Scatter without atomics; update comment; add documentation and tests; change the new parameter name; fix doc typo and add a test.
Commit: f79719a
Author: Bastiaan Aarts
Committer: Nathan Luehr
XLA persistent compilation cache: store results of llvm->ptx and ptx->cubin compilations for subsequent executions.
Commit: 4672906
Author: Kaixi Hou
Committer: Nathan Luehr
[no cache] Integrate CUDNN frontend API for convolution
Commit: 4f1caa1
Author: Ayan Moitra
Committer: Nathan Luehr
[skip ci] XLA enables BN+Act and BN+Add+Act fusion when implementing kBatchNormTraining HLOInstruction as cudnn custom call
Commit: c4fde80
Author: Bastiaan Aarts
Committer: Nathan Luehr
XLA cudnn Softmax
Commit: ca1a91a
Author: Trent Lo
Committer: Nathan Luehr
[XLA/GPU] Size-constrained buffer allocation
Commit: eee6256
Author: Frederic Bastien
Committer: Nathan Luehr
Backport PR: 39734, add an XLA_FLAGS parameter
Commit: 31841ec
Author: Frederic Bastien
Committer: Nathan Luehr
Add the new XLA/MLIR directory
Commit: 565f769
Author: Frederic Bastien
Committer: Nathan Luehr
Remove the old tensorflow/compiler/{xla,mlir} directory
Commit: 5d2b142
Author: Frederic Bastien
Committer: Nathan Luehr
Backport tensorflow/core/protobuf/tpu/compile_metadata.proto
Commit: e66a611
Author: A. Unique TensorFlower
Committer: Nathan Luehr
Introduce new RngBitGenerator HLO. The new instruction has the same signature as xla::BitGeneratorTy, which takes a key and a state and returns uniformly distributed random bits and a new value for the state. Its aim is to enable backend-specific lowering for the various random bit generator algorithms, which should unlock optimization opportunities. PiperOrigin-RevId: 293569472 Change-Id: I4f69d4f9858378fb1241435032ef75657933c056
Commit: 7351c09
Author: A. Unique TensorFlower
Committer: Nathan Luehr
Adds new protobuf message HloExecutionProfileData, which describes HloExecutionProfile. PiperOrigin-RevId: 269313040
Commit: 71ab214
Author: Frederic Bastien
Committer: Nathan Luehr
Backport PR: 39734, add an XLA_FLAGS parameter
Commit: 58e262c
Author: Frederic Bastien
Committer: Nathan Luehr
Rename proto key to not conflict with upstream.
Commit: db351d0
Author: Frederic Bastien
Committer: Nathan Luehr
Start our own range of IDs.
Commit: a24cac7
Author: Trent Lo
Committer: Nathan Luehr
Add AsyncOut support for XLA. This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters fed to the HorovodAllreduce nodes. The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1". Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit): 13% perf gain on BERT large pretrain squad fp32, BatchSize=2; 7% perf gain on Unet medical trainbench fp32. Design Doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit Authors: Trent Lo and Ayan Moitra
Commit: 6fd48b7
Author: TensorFlower Gardener
Committer: Nathan Luehr
Merge pull request #35621 from bas-aarts:bas_auto_tune_level PiperOrigin-RevId: 291051983 Change-Id: I9518cca1eb07e44be7857b0c3d87c94de420fdce
Commit: 90c4ada
Author: Frederic Bastien
Committer: Nathan Luehr
change the new parameter name.
Commit: 1a07380
Author: Frederic Bastien
Committer: Nathan Luehr
[XLA] Add the Scatter option to not do atomic operations. This should keep backward compatibility with the behavior from before this option existed.
Commit: d89aee0
Author: Bastiaan Aarts
Committer: Nathan Luehr
XLA cudnn Softmax
Commit: fb03c27
Author: Trent Lo
Committer: Nathan Luehr
[XLA/GPU] Size-constrained buffer allocation
Commit: b6f54ea
Author: A. Unique TensorFlower
Committer: Nathan Luehr
Adds new protobuf message HloExecutionProfileData, which describes HloExecutionProfile. PiperOrigin-RevId: 269313040
Commit: 7302e94
Author: Frederic Bastien
Committer: Nathan Luehr
Backport PR: 39734, add an XLA_FLAGS parameter
Commit: fbb6ca9
Author: Frederic Bastien
Committer: Nathan Luehr
Start our own range of IDs.
Commit: 38482b0
Author: Frederic Bastien
Committer: Nathan Luehr
Rename proto key to not conflict with upstream.
Commit: 479554a
Author: Trent Lo
Committer: Nathan Luehr
Add AsyncOut support for XLA. This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters fed to the HorovodAllreduce nodes. The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1". Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit): 13% perf gain on BERT large pretrain squad fp32, BatchSize=2; 7% perf gain on Unet medical trainbench fp32. Design Doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit Authors: Trent Lo and Ayan Moitra
Commit: 061c0ea
Author: TensorFlower Gardener
Committer: Nathan Luehr
Merge pull request #35621 from bas-aarts:bas_auto_tune_level PiperOrigin-RevId: 291051983 Change-Id: I9518cca1eb07e44be7857b0c3d87c94de420fdce
Commit: 5c2083d
Author: Frederic Bastien
Committer: Nathan Luehr
change the new parameter name.
Commit: abd036e
Author: Frederic Bastien
Committer: Nathan Luehr
[XLA] Add the Scatter option to not do atomic operations. This should keep backward compatibility with the behavior from before this option existed.
Commit: 8f8e8bb
Author: Frederic Bastien
Committer: Nathan Luehr
Backport PR: 39734, add an XLA_FLAGS parameter
Commit: 06192ff
Author: Frederic Bastien
Committer: Nathan Luehr
Add the new XLA/MLIR directory
Commit: bcc6cee
Author: Frederic Bastien
Committer: Nathan Luehr
Remove the old tensorflow/compiler/{xla,mlir} directory
Commit: 7cd4f2d
Author: Frederic Bastien
Committer: Nathan Luehr
Backport tensorflow/core/protobuf/tpu/compile_metadata.proto
Commit: 6c782c0
Author: A. Unique TensorFlower
Committer: Nathan Luehr
Introduce new RngBitGenerator HLO. The new instruction has the same signature as xla::BitGeneratorTy, which takes a key and a state and returns uniformly distributed random bits and a new value for the state. Its aim is to enable backend-specific lowering for the various random bit generator algorithms, which should unlock optimization opportunities. PiperOrigin-RevId: 293569472 Change-Id: I4f69d4f9858378fb1241435032ef75657933c056
Commit: 91fd4cc
Author: A. Unique TensorFlower
Committer: Nathan Luehr
Adds new protobuf message HloExecutionProfileData, which describes HloExecutionProfile. PiperOrigin-RevId: 269313040
Commit: 8040706
Author: Frederic Bastien
Committer: Nathan Luehr
Backport PR: 39734, add an XLA_FLAGS parameter
Commit: eac3c47
Author: Frederic Bastien
Committer: Nathan Luehr
Start our own range of IDs.
Commit: 38bd2fe
Author: Frederic Bastien
Committer: Nathan Luehr
Rename proto key to not conflict with upstream.
Commit: b6062d9
Author: Trent Lo
Committer: Nathan Luehr
Add AsyncOut support for XLA. This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters fed to the HorovodAllreduce nodes. The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1". Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit): 13% perf gain on BERT large pretrain squad fp32, BatchSize=2; 7% perf gain on Unet medical trainbench fp32. Design Doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit Authors: Trent Lo and Ayan Moitra
Commit: e06baf8
Author: TensorFlower Gardener
Committer: Nathan Luehr
Merge pull request #35621 from bas-aarts:bas_auto_tune_level PiperOrigin-RevId: 291051983 Change-Id: I9518cca1eb07e44be7857b0c3d87c94de420fdce
Commit: f40b355
Author: Frederic Bastien
Committer: Nathan Luehr
change the new parameter name.
Commit: 9b75e86
Author: Frederic Bastien
Committer: Nathan Luehr
[XLA] Add the Scatter option to not do atomic operations. This should keep backward compatibility with the behavior from before this option existed.
Commit: aaf6efc
Author: Frederic Bastien
Committer: Nathan Luehr
Backport PR: 39734, add an XLA_FLAGS parameter
Commit: bfc6055
Author: Frederic Bastien
Committer: Nathan Luehr
Remove the old tensorflow/compiler/{xla,mlir} directory
Commit: 21c06ea
Author: Frederic Bastien
Committer: Nathan Luehr
Add the new XLA/MLIR directory
Commit: c79646a
Author: Frederic Bastien
Committer: Nathan Luehr
Backport tensorflow/core/protobuf/tpu/compile_metadata.proto
Commit: cc36579
Author: A. Unique TensorFlower
Committer: Nathan Luehr
Introduce new RngBitGenerator HLO. The new instruction has the same signature as xla::BitGeneratorTy, which takes a key and a state and returns uniformly distributed random bits and a new value for the state. Its aim is to enable backend-specific lowering for the various random bit generator algorithms, which should unlock optimization opportunities. PiperOrigin-RevId: 293569472 Change-Id: I4f69d4f9858378fb1241435032ef75657933c056
Commit: 80dd74e
Author: A. Unique TensorFlower
Committer: Nathan Luehr
Adds new protobuf message HloExecutionProfileData, which describes HloExecutionProfile. PiperOrigin-RevId: 269313040
Commit: 1ffca23
Author: Frederic Bastien
Committer: Nathan Luehr
Backport PR: 39734, add an XLA_FLAGS parameter
Commit: e624a1d
Author: Frederic Bastien
Committer: Nathan Luehr
Start our own range of IDs.
Commit: 0044399
Author: Frederic Bastien
Committer: Nathan Luehr
Rename proto key to not conflict with upstream.
Commit: 5abecd2
Author: Trent Lo
Committer: Nathan Luehr
Add AsyncOut support for XLA. This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters fed to the HorovodAllreduce nodes. The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1". Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit): 13% perf gain on BERT large pretrain squad fp32, BatchSize=2; 7% perf gain on Unet medical trainbench fp32. Design Doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit Authors: Trent Lo and Ayan Moitra
Commit: 18218f4
Author: TensorFlower Gardener
Committer: Nathan Luehr
Merge pull request #35621 from bas-aarts:bas_auto_tune_level PiperOrigin-RevId: 291051983 Change-Id: I9518cca1eb07e44be7857b0c3d87c94de420fdce
Commit: fccacaf
Author: Frederic Bastien
Committer: Nathan Luehr
change the new parameter name.
Commit: e1c567e
Author: Frederic Bastien
Committer: Nathan Luehr
[XLA] Add the Scatter option to not do atomic operations. This should keep backward compatibility with the behavior from before this option existed.
Commit: 3ff89d7
Author: Trent Lo
Committer: Nathan Luehr
Add AsyncOut support for XLA. This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters fed to the HorovodAllreduce nodes. The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1". Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit): 13% perf gain on BERT large pretrain squad fp32, BatchSize=2; 7% perf gain on Unet medical trainbench fp32. Design Doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit Authors: Trent Lo and Ayan Moitra
Commit: 438850d
Author: Trent Lo
Committer: Nathan Luehr
Add AsyncOut support for XLA. This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters fed to the HorovodAllreduce nodes. The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1". Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit): 13% perf gain on BERT large pretrain squad fp32, BatchSize=2; 7% perf gain on Unet medical trainbench fp32. Design Doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit Authors: Trent Lo and Ayan Moitra
Commit: 1703e6e
Author: TensorFlower Gardener
Committer: Nathan Luehr
Merge pull request #35621 from bas-aarts:bas_auto_tune_level PiperOrigin-RevId: 291051983 Change-Id: I9518cca1eb07e44be7857b0c3d87c94de420fdce
Commit: 9e20406
Author: Frederic Bastien
Committer: Nathan Luehr
change the new parameter name.
Commit: a1e9f14
Author: Frederic Bastien
Committer: Nathan Luehr
[XLA] Add the Scatter option to not do atomic operations. This should keep backward compatibility with the behavior from before this option existed.
Commit: 439c0df
Author: A. Unique TensorFlower
Committer: TensorFlower Gardener
Introduce `indices_are_sorted` attribute for gather/scatter HLO. If the attribute is set to `true`, the backend can assume that the gather/scatter indices supplied by the user are sorted, which should enable the generation of more efficient code. If the attribute is set to `true` but the indices are not sorted, the behavior is implementation defined. PiperOrigin-RevId: 264574093
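For illustration only, here is how a frontend can surface this attribute; the sketch uses JAX's lax.gather (which exposes an indices_are_sorted argument), not an API added by this commit.

```python
import jax.numpy as jnp
from jax import lax

operand = jnp.arange(10.0)
start_indices = jnp.array([[2], [5], [7]])  # already sorted ascending
dnums = lax.GatherDimensionNumbers(
    offset_dims=(), collapsed_slice_dims=(0,), start_index_map=(0,))

# Promising indices_are_sorted=True lets the backend generate more efficient
# code; if the indices are actually unsorted, the result is
# implementation-defined, as the commit message notes.
out = lax.gather(operand, start_indices, dnums, slice_sizes=(1,),
                 indices_are_sorted=True)
print(out)  # [2. 5. 7.]
```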
Commit: 3bda036
Author: Martin Wicke
Committer: TensorFlower Gardener
Implementing RFC#126: Allow Op names of the form RepoName>OpName. PiperOrigin-RevId: 264491560
Commit: 93c2f4c
Author: Tim Shen
Committer: TensorFlower Gardener
Blacklist convolutions also based on cuBLAS version. PiperOrigin-RevId: 264480644
Commit: 1d825cf
Author: Zhenyu Tan
Committer: TensorFlower Gardener
Enable equality split for UpdateEnsembleV2. PiperOrigin-RevId: 263835411
Commit: 44a0f07
Author: Frank Chen
Committer: TensorFlower Gardener
Add go headers for config.proto PiperOrigin-RevId: 263674612
Commit: eb30271
Author: Tim Shen
Committer: TensorFlower Gardener
Rename the convolution blacklist to a generic name that's not only for convolutions. PiperOrigin-RevId: 263672491
Commit: 4365817
Author: Xiao Yu
Committer: TensorFlower Gardener
Avoid blocking SendTensor rpc request. This change includes: 1. Move RemoteSendTensor logic from execute.cc to remote_copy_node.cc and allow calling it in async mode. 2. Use EagerService::Enqueue to handle SendTensor request. This can allow us to use streaming enqueue in the future. PiperOrigin-RevId: 263606880
Commit: 77078f1
Author: Alexander Belyaev
Committer: TensorFlower Gardener
[XLA:CPU] Add a flag to disable `afn`. PiperOrigin-RevId: 263148244
Commit: b78d23c
Author: Edward Loper
Committer: TensorFlower Gardener
Update the TensorInfo protobuf message with an encoding for composite tensors, and update SavedModel to use this new encoding. PiperOrigin-RevId: 262639435
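As a rough illustration of what such an encoding looks like when built by hand, a hedged Python sketch follows; the composite_tensor field and its components sub-field are assumptions modeled on the upstream meta_graph.proto, not details taken from this commit.

```python
from tensorflow.core.protobuf import meta_graph_pb2

# Hypothetical: describe a SparseTensor via its component dense tensors.
# Field names here are assumed, not quoted from the commit.
info = meta_graph_pb2.TensorInfo()
info.composite_tensor.components.add(name="sp/indices:0")
info.composite_tensor.components.add(name="sp/values:0")
info.composite_tensor.components.add(name="sp/dense_shape:0")
print(info)
```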
Commit: ca86b3e
Author: Tim Shen
Committer: TensorFlower Gardener
Log cuBLAS version. PiperOrigin-RevId: 262453329
Commit: 53ae951
Author: A. Unique TensorFlower
Committer: TensorFlower Gardener
Added new internal optimization algorithm for embeddings. PiperOrigin-RevId: 262414963
Commit: b160851
Author: A. Unique TensorFlower
Committer: TensorFlower Gardener
[TF:XLA] Logging of known cases of performance deficiencies using XLA with Tensorflow. PiperOrigin-RevId: 262104815
Commit: 3ecfbf5
Author: Tim Shen
Committer: TensorFlower Gardener
Add a blacklist mechanism for avoiding listed cudnn convolutions. PiperOrigin-RevId: 261392669
Commit: 941b892
Author: A. Unique TensorFlower
Committer: TensorFlower Gardener
Two changes preparing for lazy input tensor copy for remote multi-device functions (make target devices pull remote tensors to avoid redundant copies):
- Make master eager service be able to handle requests from remote workers.
- Support serializing local TensorHandles and deserializing non-existent RemoteTensorHandles in RemoteMgr.
PiperOrigin-RevId: 261382629
Commit: a036d54
Author: A. Unique TensorFlower
Committer: TensorFlower Gardener
Allow hardcoding initial splits for each of the boosted trees. PiperOrigin-RevId: 261332340
Commit: d6411a4
Author: A. Unique TensorFlower
Committer: TensorFlower Gardener
Tensor tracer: Adding summary writing capability to tensor tracer. PiperOrigin-RevId: 261213113
Commit: 4966204
Author: Edward Loper
Committer: TensorFlower Gardener
Improve error reporting when loading unknown TypeSpec types. PiperOrigin-RevId: 261201200
Commit: fb7da35
Author: Shanqing Cai
Committer: TensorFlower Gardener
[tfdbg] Improve compatibility with Grappler:
- Make tensors from Grappler-created nodes visible to tfdbg.
- To this end, add a wildcard node name to the DebugTensorWatch proto.
- Add unit test based on tfdbg's filesystem dump mode.
- A few unit tests are updated to account for the fact that additional tensors get watched (mainly under GPU tests) with all runtime graph nodes now being watched.
PiperOrigin-RevId: 260997920
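A hedged sketch of a DebugTensorWatch that relies on the new wildcard node name; the field names come from the upstream debug.proto, while the wildcard spelling ("*") is an assumption about what this commit accepts.

```python
from tensorflow.core.protobuf import debug_pb2

# Hypothetical wildcard watch: instead of enumerating client-graph nodes,
# ask tfdbg to watch every runtime node, including Grappler-created ones.
watch = debug_pb2.DebugTensorWatch(
    node_name="*",                      # assumed wildcard spelling
    output_slot=0,                      # first output of each node
    debug_ops=["DebugIdentity"],
    debug_urls=["file:///tmp/tfdbg_dump"],
)
print(watch)
```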
Commit: 9ad01d4
Author: TensorFlower Gardener
Committer: TensorFlower Gardener
Merge pull request #30759 from nouiz:give_ptx_code PiperOrigin-RevId: 260917915
Commit: 7ab1171
Author: Davide Libenzi
Committer: TensorFlower Gardener
Allow new RPC channels to be assigned to new TCP streams instead of sharing a single one. PiperOrigin-RevId: 260726097
Commit: 8f498f0
Author: A. Unique TensorFlower
Committer: TensorFlower Gardener
Added a flag to ExecutionProfile messages to indicate whether the profile was drawn from a cache. PiperOrigin-RevId: 260623284
Commit: 9f0d801
Author: Edward Loper
Committer: TensorFlower Gardener
Implement CompositeTensor support in nested_structure_coder.py PiperOrigin-RevId: 259787059
Commit: a9fee43
Author: Frederic Bastien
Committer: Frederic Bastien
Add the XLA_FLAGS option xla_gpu_ptx_code to allow specifying the PTX code to use.
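A minimal sketch of passing this flag through XLA_FLAGS; the flag name is taken from the commit message, but the value format (here, a path to a PTX file) is an assumption.

```python
import os

# Assumption: xla_gpu_ptx_code takes a path to the PTX to substitute for the
# compiler-generated PTX; only the flag name itself comes from the commit.
os.environ["XLA_FLAGS"] = "--xla_gpu_ptx_code=/tmp/replacement.ptx"

import tensorflow as tf  # the XLA GPU backend picks up the flag at startup
```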
Commit: c705cba
Author: Frederic Bastien
Committer: Frederic Bastien
Fix many of the comments.