Proto commits in NVIDIA/tensorflow

These are the commits in which Protocol Buffers files changed (only the last 100 relevant commits are shown):

Commit:bb39d37
Author:Nathan Luehr
Committer:Nathan Luehr

Merge branch 'master-tf1-cuda_malloc_async_fixes' into 'master-tf1'. [TF1] Fix cudaMallocAsync. See merge request dl/tensorflow/tensorflow!605 (cherry picked from commit 68d8ef4ba479fe9ebf531a373a1003cf187c9a7c). Squashed commits:
- 456352bd [Crash fix] Other parts of TF need the stats for the GPU allocator.
- 5988bca4 Add a cudaMallocAsync test.
- 624d6d1f [Crash fix] Correctly handle the passed stream.
- bcdfad70 Use a static cast.
- df76191f Make the new option experimental and add it to the golden list of the public API.
- 56cdf5d7 Use the new option.
- b7097a7d Variable/function renaming. Also make a counter atomic to be more future...
- 0afa4788 Fix cherry-pick.

The documentation is generated from this commit.
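As a rough illustration of how the cudaMallocAsync allocator from the commit above might be selected, here is a minimal Python sketch. It assumes this fork follows the upstream TF_GPU_ALLOCATOR environment-variable convention; the exact option name exposed by this TF1 branch is not shown in the commit message.

```python
import os

# Assumption: the allocator is selected the same way as in upstream
# TensorFlow, via TF_GPU_ALLOCATOR. Set it before TensorFlow is imported
# so the GPU device setup sees it.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

import tensorflow as tf  # imported after the env var so it takes effect
```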

Commit:434922f
Author:Ayan Moitra
Committer:Nathan Luehr

Enable cuDNN batchnorm based on levels rather than a bool. Also squashes:
- Build failure fix
- Address Bas's comments
- Remaining changes
- Test changes

Commit:187f048
Author:Frederic Bastien
Committer:Nathan Luehr

Updated hlo_to_llvm_ir to also emit PTX and re-enabled the tests that use it. See merge request dl/tensorflow/tensorflow!555

Commit:1decc05
Author:Bastiaan Aarts
Committer:Nathan Luehr

XLA persistent compilation cache: store the results of llvm->ptx and ptx->cubin compilations for subsequent executions. Also squashes:
- Include the ptxas options in the hash when generating the key for the persistent cache
- Fix botched TF2->TF1 integration
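A minimal sketch of the caching idea described above: compilation results are stored on disk, keyed by a hash that includes the ptxas options (per the squashed fix, since those options change the output). All names here are hypothetical; this is not the fork's actual implementation.

```python
import hashlib
import os

CACHE_DIR = "xla_compile_cache"  # hypothetical cache location

def cache_key(llvm_ir, ptxas_options):
    # The ptxas options must be part of the key, because they affect the cubin.
    h = hashlib.sha256()
    h.update(llvm_ir.encode())
    h.update("\0".join(ptxas_options).encode())
    return h.hexdigest()

def compile_with_cache(llvm_ir, ptxas_options, compile_fn):
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, cache_key(llvm_ir, ptxas_options))
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()  # reuse the result from an earlier execution
    cubin = compile_fn(llvm_ir, ptxas_options)
    with open(path, "wb") as f:
        f.write(cubin)
    return cubin
```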

Commit:051457f
Author:Kaixi Hou
Committer:Nathan Luehr

Integrate the cuDNN frontend API for convolution. Also squashes:
- Add cudnn_frontend patch and change macros
- Apply a filter over the fallback list and add a Winograd filter
- Remove redundant headers
- Add flags
- Add an env var for the cuDNN frontend APIs
- Fix empty fallback lists
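The commit mentions an environment variable gating the cuDNN frontend API but does not name it; upstream TensorFlow used TF_CUDNN_USE_FRONTEND for the same purpose, so a hedged usage sketch might look like this (the variable name is an assumption for this fork):

```python
import os

# Assumption: this fork follows the upstream TF_CUDNN_USE_FRONTEND convention
# for opting into the cuDNN frontend API code path.
os.environ["TF_CUDNN_USE_FRONTEND"] = "1"

import tensorflow as tf  # imported after setting the variable so it is honored
```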

Commit:9055fdc
Author:Ayan Moitra
Committer:Nathan Luehr

[skip ci] XLA enables BN+Act and BN+Add+Act fusion when implementing the kBatchNormTraining HLOInstruction as a cuDNN custom call

Commit:b594840
Author:Bastiaan Aarts
Committer:Nathan Luehr

XLA cudnn Softmax

Commit:917de25
Author:Trent Lo
Committer:Nathan Luehr

[XLA/GPU] Size-constrained buffer allocation

Commit:e5feec9
Author:Frederic Bastien
Committer:Nathan Luehr

Phase 3 backport of upstream XLA. See MR 289. Squashes:
- Remove deprecated variants of DynamicSlice and DynamicUpdateSlice builders
- Remove the old tensorflow/compiler/{xla,mlir} directory
- Add the new XLA/MLIR directory
- Replace the LLVM/MLIR config directory
- Update LLVM version structure
- Bump LLVM version
- Automatic fixup
- Disable new XLA call to the not-present new MIOpen interface
- Change `LocalClient::Compile` to support returning multiple executables (one per partition)
- [TF2XLA] Preserve the dynamic dimension (-1) when building a reshape
- [XLA] Respect set-dimension-size in dynamic dimension inference
- Support fill op with (bounded) dynamic shape input
- Add missing header file for the TF1/TF2 bridge
- Better ptxutil TF1/TF2 bridge
- Add missing include
- Disable the same tests as what is disabled in gitlab/master
- Add another TF1/TF2 bridge header
- Update tests and HLO printing to be valid JSON
- Disable tests that shouldn't be run in OSS
- Fix disabled-tests handling
- Comment out broken license include
- Fix missing call in TF1
- Plumb exponential_average_factor through stream executor to cudnnBatchNormalizationForwardTraining and the equivalent ROCm function
- Remove vestigial cuDNN 4 code from stream executor
- Manual fixup of FusedBatchNorm conflict
- TF1/TF2 bridge: add a dummy function to respect the new API
- Undo some changes, as the new TF1/TF2 bridge covers this
- Disable non-OSS tests
- Disable a test at the right place
- [TF:XLA] Move XLA tests that depend on contrib in preparation for TF 2.0
- [XLA] Split testParameterizedTruncatedNormalIsInRange to avoid timeout
- Implement F64 scalar addition for XLA TPU backends
- Implement F64 scalar multiplication for XLA TPU backends
- [XLA] Add some F64 tests for MatMul
- [XLA] Slightly change some test inputs/tolerances
- [XLA] Bring F64 error thresholds into rough accordance with F32
- Add test for unstack op
- Add test for Expm1 in the small-parameter regime of complex numbers
- Annotate a test with tf_cuda_tests_tags()
- Manual fixup
- Run the test multiple times; this should trigger the compilation
- Try more runs
- Add debug print to help in case of problems
- Try to fix the test
- Add a comment
- Fix checkpoint reading; the compression was disabled
- Fix the copy bug
- Add a test
- NFC: rename a variable
- [XLA] Extend the Algebraic Simplifier to convert Pow(x, 3) -> x*x*x, which is faster
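The last squashed change above, the Pow(x, 3) -> x*x*x rewrite, is easy to illustrate outside XLA. This toy Python sketch shows the strength reduction the Algebraic Simplifier performs: two multiplications replace a generic power routine.

```python
def pow3_generic(x):
    return x ** 3      # generic power: dispatches through a pow routine

def pow3_reduced(x):
    return x * x * x   # the simplified form: two multiplications

assert pow3_generic(3) == pow3_reduced(3) == 27
```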

Commit:b4c8f06
Author:TensorFlower Gardener
Committer:Nathan Luehr

[XLA] Backport a few missing PRs. See MR 315. Cherry-picks upstream commits for:
- Vectorize row reduction for even row sizes
- Extra VLOG for PTXAS calls
Squashes:
- Merge pull request #38136 from nouiz:fbastien_xla_reduce_sm_v5_push_vec_rb2_manual_cherry-pick
- Better error message: we now print part of the module name that caused the error, so it is easier to find the right file
- Use EmitWriteArrayElement, which adds annotations that could help LLVM vectorize
- Add -DDEBUG_BUILD to the dbg profile
- Backport PR: 39734, add an XLA_FLAGS parameter
- Add back a fix that was lost in the cherry-pick due to a change of order

Commit:dc9e499
Author:A. Unique TensorFlower
Committer:Nathan Luehr

Create a TF1/TF2 internal bridge. Phase 2 MR for the XLA backport. See MR 284. Squashes:
- Introduce new RngBitGenerator HLO
- Update blocking_counter setup to the upstream version
- Add rules dynamic_annotations and mutex that exist upstream
- Add defines for the new TF names: TF_GUARDED_BY, PT_GUARDED_BY, TF_EXCLUSIVE_LOCKS_REQUIRED, TF_LOCKS_EXCLUDED
- Add rules for bfloat16 and numeric_types
- Backport some bazel rules: tf_exec_properties, tf_grpc_dependency and tf_grpc_cc_dependency
- Backport tensorflow/core/protobuf/tpu/compile_metadata.proto
- Add conversion file for asm_compiler
- Add build/include to convert between the new/old interface
- Add new dependencies platform_port, status and redzone_allocator
- redzone
- Continue TF API bridge: CompileGpuAsm
- Move all new bazel rules to the end of the file
- Add rule numbers at the new place
- Add dlpack
- TF1/TF2 bridge: add a header
- Fix some test builds
- Include the right dependency; otherwise it causes a protobuf init error

Commit:f4bf2be
Author:A. Unique TensorFlower
Committer:Nathan Luehr

Adds new protobuf message HloExecutionProfileData, which describes HloExecutionProfile. PiperOrigin-RevId: 269313040

Commit:54af36a
Author:Frederic Bastien
Committer:Nathan Luehr

Backport PR: 39734, add an XLA_FLAGS parameter

Commit:95db533
Author:Frederic Bastien
Committer:Nathan Luehr

Rename proto key to not conflict with upstream. Start our own range of IDs.

Commit:5e0d58b
Author:Trent Lo
Committer:Nathan Luehr

Add AsyncOut support for XLA. This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters fed to the HorovodAllreduce nodes. The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1". Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit): 13% perf gain on BERT large pretrain squad fp32, BatchSize=2; 7% perf gain on Unet medical trainbench fp32. Design doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit Authors: Trent Lo and Ayan Moitra
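The flag string above comes verbatim from the commit message. A minimal way to apply it is to export TF_XLA_FLAGS before TensorFlow is loaded, which is the standard convention for XLA flags and is assumed to hold in this fork as well:

```python
import os

# Flags copied verbatim from the commit message: enable auto-clustering and
# the new asynchronous I/O path for XLA cluster outputs.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=1 --tf_xla_async_io_level=1"

import tensorflow as tf  # imported after the env var so the flags are picked up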

Commit:33cc429
Author:TensorFlower Gardener
Committer:Nathan Luehr

Merge pull request #35621 from bas-aarts:bas_auto_tune_level PiperOrigin-RevId: 291051983 Change-Id: I9518cca1eb07e44be7857b0c3d87c94de420fdce

Commit:76a5325
Author:Frederic Bastien
Committer:Nathan Luehr

[XLA] Add a Scatter option to not use atomic operations. This preserves backward compatibility with the behavior from before the option existed. Also squashes:
- Add a test for Scatter without atomics
- Update comment
- Add documentation and tests
- Change the new parameter name
- Fix doc typo and add a test
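Why atomics matter for Scatter is easy to see with duplicate indices. The NumPy sketch below (illustrative only, not the fork's code) contrasts an atomic-style scatter-add with a non-atomic one: without atomics, colliding updates can be lost, which is why skipping them is exposed as an explicit opt-in.

```python
import numpy as np

out = np.zeros(4)
indices = np.array([1, 1, 2])         # index 1 appears twice: a collision
updates = np.array([10.0, 20.0, 5.0])

atomic = out.copy()
np.add.at(atomic, indices, updates)   # atomic-style: both updates to index 1 land
print(atomic)                         # [ 0. 30.  5.  0.]

racy = out.copy()
racy[indices] += updates              # non-atomic-style: one update is lost
print(racy)                           # [ 0. 20.  5.  0.]
```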

Commit:f79719a
Author:Bastiaan Aarts
Committer:Nathan Luehr

XLA persistent compilation cache: store the results of llvm->ptx and ptx->cubin compilations for subsequent executions.

Commit:4672906
Author:Kaixi Hou
Committer:Nathan Luehr

[no cache] Integrate CUDNN frontend API for convolution

Commit:4f1caa1
Author:Ayan Moitra
Committer:Nathan Luehr

[skip ci] XLA enables BN+Act and BN+Add+Act fusion when implementing the kBatchNormTraining HLOInstruction as a cuDNN custom call

Commit:c4fde80
Author:Bastiaan Aarts
Committer:Nathan Luehr

XLA cudnn Softmax

Commit:ca1a91a
Author:Trent Lo
Committer:Nathan Luehr

[XLA/GPU] Size-constrained buffer allocation

Commit:eee6256
Author:Frederic Bastien
Committer:Nathan Luehr

Backport PR: 39734, add an XLA_FLAGS parameter

Commit:31841ec
Author:Frederic Bastien
Committer:Nathan Luehr

Add the new XLA/MLIR directory

Commit:565f769
Author:Frederic Bastien
Committer:Nathan Luehr

Remove the old tensorflow/compiler/{xla,mlir} directory

Commit:5d2b142
Author:Frederic Bastien
Committer:Nathan Luehr

Backport tensorflow/core/protobuf/tpu/compile_metadata.proto

Commit:e66a611
Author:A. Unique TensorFlower
Committer:Nathan Luehr

Introduce new RngBitGenerator HLO. The new instruction has the same signature as xla::BitGeneratorTy, which takes a key and a state and returns uniformly distributed random bits and a new value for the state. Its aim is to enable backend-specific lowering for the various random bit generator algorithms, which should unlock optimization opportunities. PiperOrigin-RevId: 293569472 Change-Id: I4f69d4f9858378fb1241435032ef75657933c056
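The signature described above, (key, state) -> (bits, new state), can be sketched with a toy generator in Python. The mixing function here is made up purely for illustration; real backends lower the HLO to algorithms such as Philox or ThreeFry.

```python
MASK64 = (1 << 64) - 1

def rng_bit_generator(key, state, n):
    """Toy (key, state) -> (bits, new_state) generator; illustration only."""
    bits = []
    for _ in range(n):
        state = (state * 6364136223846793005 + key) & MASK64  # toy LCG-style mix
        bits.append(state >> 32)  # take the high bits
    return bits, state

bits, new_state = rng_bit_generator(key=42, state=0, n=4)
```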

Commit:7351c09
Author:A. Unique TensorFlower
Committer:Nathan Luehr

Adds new protobuf message HloExecutionProfileData, which describes HloExecutionProfile. PiperOrigin-RevId: 269313040

Commit:71ab214
Author:Frederic Bastien
Committer:Nathan Luehr

Backport PR: 39734, add an XLA_FLAGS parameter

Commit:58e262c
Author:Frederic Bastien
Committer:Nathan Luehr

Rename proto key to not conflict with upstream.

Commit:db351d0
Author:Frederic Bastien
Committer:Nathan Luehr

Start our own range of IDs.

Commit:a24cac7
Author:Trent Lo
Committer:Nathan Luehr

Add AsyncOut support for XLA. This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters fed to the HorovodAllreduce nodes. The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1". Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit): 13% perf gain on BERT large pretrain squad fp32, BatchSize=2; 7% perf gain on Unet medical trainbench fp32. Design doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit Authors: Trent Lo and Ayan Moitra

Commit:6fd48b7
Author:TensorFlower Gardener
Committer:Nathan Luehr

Merge pull request #35621 from bas-aarts:bas_auto_tune_level PiperOrigin-RevId: 291051983 Change-Id: I9518cca1eb07e44be7857b0c3d87c94de420fdce

Commit:90c4ada
Author:Frederic Bastien
Committer:Nathan Luehr

Change the new parameter name.

Commit:1a07380
Author:Frederic Bastien
Committer:Nathan Luehr

[XLA] Add a Scatter option to not use atomic operations. This preserves backward compatibility with the behavior from before the option existed.

Commit:d89aee0
Author:Bastiaan Aarts
Committer:Nathan Luehr

XLA cudnn Softmax

Commit:fb03c27
Author:Trent Lo
Committer:Nathan Luehr

[XLA/GPU] Size-constrained buffer allocation

Commit:b6f54ea
Author:A. Unique TensorFlower
Committer:Nathan Luehr

Adds new protobuf message HloExecutionProfileData, which describes HloExecutionProfile. PiperOrigin-RevId: 269313040

Commit:7302e94
Author:Frederic Bastien
Committer:Nathan Luehr

Backport PR: 39734, add an XLA_FLAGS parameter

Commit:fbb6ca9
Author:Frederic Bastien
Committer:Nathan Luehr

Start our own range of IDs.

Commit:38482b0
Author:Frederic Bastien
Committer:Nathan Luehr

Rename proto key to not conflict with upstream.

Commit:479554a
Author:Trent Lo
Committer:Nathan Luehr

Add AsyncOut support for XLA. This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters fed to the HorovodAllreduce nodes. The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1". Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit): 13% perf gain on BERT large pretrain squad fp32, BatchSize=2; 7% perf gain on Unet medical trainbench fp32. Design doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit Authors: Trent Lo and Ayan Moitra

Commit:061c0ea
Author:TensorFlower Gardener
Committer:Nathan Luehr

Merge pull request #35621 from bas-aarts:bas_auto_tune_level PiperOrigin-RevId: 291051983 Change-Id: I9518cca1eb07e44be7857b0c3d87c94de420fdce

Commit:5c2083d
Author:Frederic Bastien
Committer:Nathan Luehr

Change the new parameter name.

Commit:abd036e
Author:Frederic Bastien
Committer:Nathan Luehr

[XLA] Add a Scatter option to not use atomic operations. This preserves backward compatibility with the behavior from before the option existed.

Commit:8f8e8bb
Author:Frederic Bastien
Committer:Nathan Luehr

Backport PR: 39734, add an XLA_FLAGS parameter

Commit:06192ff
Author:Frederic Bastien
Committer:Nathan Luehr

Add the new XLA/MLIR directory

Commit:bcc6cee
Author:Frederic Bastien
Committer:Nathan Luehr

Remove the old tensorflow/compiler/{xla,mlir} directory

Commit:7cd4f2d
Author:Frederic Bastien
Committer:Nathan Luehr

Backport tensorflow/core/protobuf/tpu/compile_metadata.proto

Commit:6c782c0
Author:A. Unique TensorFlower
Committer:Nathan Luehr

Introduce new RngBitGenerator HLO. The new instruction has the same signature as xla::BitGeneratorTy, which takes a key and a state and returns uniformly distributed random bits and a new value for the state. Its aim is to enable backend-specific lowering for the various random bit generator algorithms, which should unlock optimization opportunities. PiperOrigin-RevId: 293569472 Change-Id: I4f69d4f9858378fb1241435032ef75657933c056

Commit:91fd4cc
Author:A. Unique TensorFlower
Committer:Nathan Luehr

Adds new protobuf message HloExecutionProfileData, which describes HloExecutionProfile. PiperOrigin-RevId: 269313040

Commit:8040706
Author:Frederic Bastien
Committer:Nathan Luehr

Backport PR: 39734, add an XLA_FLAGS parameter

Commit:eac3c47
Author:Frederic Bastien
Committer:Nathan Luehr

Start our own range of IDs.

Commit:38bd2fe
Author:Frederic Bastien
Committer:Nathan Luehr

Rename proto key to not conflict with upstream.

Commit:b6062d9
Author:Trent Lo
Committer:Nathan Luehr

Add AsyncOut support for XLA. This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters fed to the HorovodAllreduce nodes. The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1". Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit): 13% perf gain on BERT large pretrain squad fp32, BatchSize=2; 7% perf gain on Unet medical trainbench fp32. Design doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit Authors: Trent Lo and Ayan Moitra

Commit:e06baf8
Author:TensorFlower Gardener
Committer:Nathan Luehr

Merge pull request #35621 from bas-aarts:bas_auto_tune_level PiperOrigin-RevId: 291051983 Change-Id: I9518cca1eb07e44be7857b0c3d87c94de420fdce

Commit:f40b355
Author:Frederic Bastien
Committer:Nathan Luehr

Change the new parameter name.

Commit:9b75e86
Author:Frederic Bastien
Committer:Nathan Luehr

[XLA] Add a Scatter option to not use atomic operations. This preserves backward compatibility with the behavior from before the option existed.

Commit:aaf6efc
Author:Frederic Bastien
Committer:Nathan Luehr

Backport PR: 39734, add an XLA_FLAGS parameter

Commit:bfc6055
Author:Frederic Bastien
Committer:Nathan Luehr

Remove the old tensorflow/compiler/{xla,mlir} directory

Commit:21c06ea
Author:Frederic Bastien
Committer:Nathan Luehr

Add the new XLA/MLIR directory

Commit:c79646a
Author:Frederic Bastien
Committer:Nathan Luehr

Backport tensorflow/core/protobuf/tpu/compile_metadata.proto

Commit:cc36579
Author:A. Unique TensorFlower
Committer:Nathan Luehr

Introduce new RngBitGenerator HLO. The new instruction has the same signature as xla::BitGeneratorTy, which takes a key and a state and returns uniformly distributed random bits and a new value for the state. Its aim is to enable backend-specific lowering for the various random bit generator algorithms, which should unlock optimization opportunities. PiperOrigin-RevId: 293569472 Change-Id: I4f69d4f9858378fb1241435032ef75657933c056

Commit:80dd74e
Author:A. Unique TensorFlower
Committer:Nathan Luehr

Adds new protobuf message HloExecutionProfileData, which describes HloExecutionProfile. PiperOrigin-RevId: 269313040

Commit:1ffca23
Author:Frederic Bastien
Committer:Nathan Luehr

Backport PR: 39734, add an XLA_FLAGS parameter

Commit:e624a1d
Author:Frederic Bastien
Committer:Nathan Luehr

Start our own range of IDs.

Commit:0044399
Author:Frederic Bastien
Committer:Nathan Luehr

Rename proto key to not conflict with upstream.

Commit:5abecd2
Author:Trent Lo
Committer:Nathan Luehr

Add AsyncOut support for XLA. This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters fed to the HorovodAllreduce nodes. The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1". Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit): 13% perf gain on BERT large pretrain squad fp32, BatchSize=2; 7% perf gain on Unet medical trainbench fp32. Design doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit Authors: Trent Lo and Ayan Moitra

Commit:18218f4
Author:TensorFlower Gardener
Committer:Nathan Luehr

Merge pull request #35621 from bas-aarts:bas_auto_tune_level PiperOrigin-RevId: 291051983 Change-Id: I9518cca1eb07e44be7857b0c3d87c94de420fdce

Commit:fccacaf
Author:Frederic Bastien
Committer:Nathan Luehr

Change the new parameter name.

Commit:e1c567e
Author:Frederic Bastien
Committer:Nathan Luehr

[XLA] Add a Scatter option to not use atomic operations. This preserves backward compatibility with the behavior from before the option existed.

Commit:3ff89d7
Author:Trent Lo
Committer:Nathan Luehr

Add AsyncOut support for XLA. This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters fed to the HorovodAllreduce nodes. The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1". Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit): 13% perf gain on BERT large pretrain squad fp32, BatchSize=2; 7% perf gain on Unet medical trainbench fp32. Design doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit Authors: Trent Lo and Ayan Moitra

Commit:438850d
Author:Trent Lo
Committer:Nathan Luehr

Add AsyncOut support for XLA. This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters fed to the HorovodAllreduce nodes. The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1". Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit): 13% perf gain on BERT large pretrain squad fp32, BatchSize=2; 7% perf gain on Unet medical trainbench fp32. Design doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit Authors: Trent Lo and Ayan Moitra

Commit:1703e6e
Author:TensorFlower Gardener
Committer:Nathan Luehr

Merge pull request #35621 from bas-aarts:bas_auto_tune_level PiperOrigin-RevId: 291051983 Change-Id: I9518cca1eb07e44be7857b0c3d87c94de420fdce

Commit:9e20406
Author:Frederic Bastien
Committer:Nathan Luehr

Change the new parameter name.

Commit:a1e9f14
Author:Frederic Bastien
Committer:Nathan Luehr

[XLA] Add a Scatter option to not use atomic operations. This preserves backward compatibility with the behavior from before the option existed.

Commit:439c0df
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Introduce `indices_are_sorted` attribute for gather/scatter HLO. If the attribute is set to `true`, the backend can assume that the gather/scatter indices supplied by the user are sorted, which should enable the generation of more efficient code. If the attribute is set to `true` but the indices are not sorted, the behavior is implementation-defined. PiperOrigin-RevId: 264574093
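The same attribute later surfaced in user-facing APIs; for instance, JAX's indexed get exposes it directly. A small sketch (JAX usage shown for illustration; it is not part of this commit):

```python
import jax.numpy as jnp

x = jnp.arange(10.0)
idx = jnp.array([2, 5, 7])  # genuinely sorted, so the promise below is honest

# Promising sortedness lets the backend emit cheaper gather code; promising
# it falsely is implementation-defined, exactly as the commit message warns.
y = x.at[idx].get(indices_are_sorted=True)
```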

Commit:3bda036
Author:Martin Wicke
Committer:TensorFlower Gardener

Implementing RFC#126: Allow Op names of the form RepoName>OpName. PiperOrigin-RevId: 264491560

Commit:93c2f4c
Author:Tim Shen
Committer:TensorFlower Gardener

Blacklist convolutions also based on cuBLAS version. PiperOrigin-RevId: 264480644

Commit:1d825cf
Author:Zhenyu Tan
Committer:TensorFlower Gardener

Enable equality split for UpdateEnsembleV2. PiperOrigin-RevId: 263835411

Commit:44a0f07
Author:Frank Chen
Committer:TensorFlower Gardener

Add Go headers for config.proto PiperOrigin-RevId: 263674612

Commit:eb30271
Author:Tim Shen
Committer:TensorFlower Gardener

Rename the convolution blacklist to a generic name that's not only for convolutions. PiperOrigin-RevId: 263672491

Commit:4365817
Author:Xiao Yu
Committer:TensorFlower Gardener

Avoid blocking the SendTensor RPC request. This change includes:
1. Move RemoteSendTensor logic from execute.cc to remote_copy_node.cc and allow calling it in async mode.
2. Use EagerService::Enqueue to handle the SendTensor request, which will allow us to use streaming enqueue in the future.
PiperOrigin-RevId: 263606880

Commit:77078f1
Author:Alexander Belyaev
Committer:TensorFlower Gardener

[XLA:CPU] Add a flag to disable `afn`. PiperOrigin-RevId: 263148244

Commit:b78d23c
Author:Edward Loper
Committer:TensorFlower Gardener

Update the TensorInfo protobuf message with an encoding for composite tensors; and update SavedModel to use this new encoding. PiperOrigin-RevId: 262639435

Commit:ca86b3e
Author:Tim Shen
Committer:TensorFlower Gardener

Log cuBLAS version. PiperOrigin-RevId: 262453329

Commit:53ae951
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Added new internal optimization algorithm for embeddings. PiperOrigin-RevId: 262414963

Commit:b160851
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

[TF:XLA] Logging of known cases of performance deficiencies using XLA with Tensorflow. PiperOrigin-RevId: 262104815

Commit:3ecfbf5
Author:Tim Shen
Committer:TensorFlower Gardener

Add a blacklist mechanism for avoiding listed cudnn convolutions. PiperOrigin-RevId: 261392669

Commit:941b892
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Two changes preparing for lazy input tensor copy for remote multi-device functions (make target devices pull remote tensors to avoid redundant copies):
- Make the master eager service able to handle requests from remote workers.
- Support serializing local TensorHandles and deserializing non-existent RemoteTensorHandles in RemoteMgr.
PiperOrigin-RevId: 261382629

Commit:a036d54
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Allow hardcoding initial splits for each of the boosted trees. PiperOrigin-RevId: 261332340

Commit:d6411a4
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Tensor tracer: Adding summary writing capability to tensor tracer. PiperOrigin-RevId: 261213113

Commit:4966204
Author:Edward Loper
Committer:TensorFlower Gardener

Improve error reporting when loading unknown TypeSpec types. PiperOrigin-RevId: 261201200

Commit:fb7da35
Author:Shanqing Cai
Committer:TensorFlower Gardener

[tfdbg] Improve compatibility with Grappler.
- Make tensors from Grappler-created nodes visible to tfdbg.
- To this end, add a wildcard node name to the DebugTensorWatch proto.
- Add a unit test based on tfdbg's filesystem dump mode.
- Update a few unit tests to account for the fact that additional tensors get watched (mainly under GPU tests) now that all runtime graph nodes are watched.
PiperOrigin-RevId: 260997920
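Since DebugTensorWatch is one of the proto messages this page tracks, a hedged construction sketch may help. The field names below follow the upstream debug.proto as I understand it; the "*" wildcard value is an assumption based on the commit's description.

```python
from tensorflow.core.protobuf import debug_pb2

# Wildcard node name so that nodes created later by Grappler are also
# watched; the exact wildcard syntax is an assumption.
watch = debug_pb2.DebugTensorWatch(
    node_name="*",
    output_slot=0,
    debug_ops=["DebugIdentity"],
    debug_urls=["file:///tmp/tfdbg_dump"],
)
```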

Commit:9ad01d4
Author:TensorFlower Gardener
Committer:TensorFlower Gardener

Merge pull request #30759 from nouiz:give_ptx_code PiperOrigin-RevId: 260917915

Commit:7ab1171
Author:Davide Libenzi
Committer:TensorFlower Gardener

Allow new RPC channels to be assigned to new TCP streams instead of sharing a single one. PiperOrigin-RevId: 260726097

Commit:8f498f0
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Added a flag to ExecutionProfile messages to indicate whether the profile was drawn from a cache. PiperOrigin-RevId: 260623284

Commit:9f0d801
Author:Edward Loper
Committer:TensorFlower Gardener

Implement CompositeTensor support in nested_structure_coder.py PiperOrigin-RevId: 259787059

Commit:a9fee43
Author:Frederic Bastien
Committer:Frederic Bastien

Add the XLA_FLAGS option xla_gpu_ptx_code to allow specifying the PTX code to use.
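The flag name xla_gpu_ptx_code comes verbatim from the commit message; how its value is supplied is an assumption here (a file path is used purely for illustration). As with the other XLA flags, it would be passed through XLA_FLAGS before TensorFlow starts:

```python
import os

# xla_gpu_ptx_code is named in the commit; the value format (a path to a
# PTX file) is a hypothetical illustration.
os.environ["XLA_FLAGS"] = "--xla_gpu_ptx_code=/path/to/kernels.ptx"

import tensorflow as tf  # imported after setting XLA_FLAGS
```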

Commit:c705cba
Author:Frederic Bastien
Committer:Frederic Bastien

Fix many of the comments.