Proto commits in alpa-projects/tensorflow-alpa

These are the commits in which Protocol Buffers files changed (only the last 100 relevant commits are shown):

Commit:e0059bb
Author:Xin Zhou
Committer:TensorFlower Gardener

[XLA] Add support for int4 types in literal. PiperOrigin-RevId: 515099842

Commit:5e9700e
Author:Adrian Kuegel
Committer:TensorFlower Gardener

Remove xla_gpu_softmax_fusion flag. The softmax fusion feature has been deleted. PiperOrigin-RevId: 514986409

Commit:248bb25
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

[XLA] Update partition assignment with an overrideable partitioning algorithm selector method. PiperOrigin-RevId: 514939043

Commit:8ad80b6
Author:Songyi Han
Committer:TensorFlower Gardener

Support gather for all quantization schemes. This enables optimized weight-only for all schemes in XLA opset. PiperOrigin-RevId: 513824366

Commit:c9d1771
Author:Dan Suh
Committer:TensorFlower Gardener

Populate SaverDef from TF Quantizer's cpp layer. PiperOrigin-RevId: 513729502

Commit:29ca3c1
Author:Reed Wanderman-Milne
Committer:TensorFlower Gardener

Rename BFloat16Normalization to FloatNormalization and support other types. BFloat16Normalization has been renamed to FloatNormalization, and FP8 is now also supported in addition to BF16. A subsequent change will use FloatNormalization to support FP8 inputs/outputs in all instructions that currently support FP16 inputs/outputs. Similarly, BFloat16Support is renamed to FloatSupport. This is a very mechanical change. The vast majority of changes are renaming classes, methods, local variables, BUILD target names, and filenames to not refer to the type BF16. The only changes in logic are (1) adding a low_precision_type_ field to FloatSupport, (2) adding methods LowPrecisionType() and HighPrecisionType() to FloatSupport, and (3) changing FloatNormalization to use such methods instead of hardcoding BF16 and F32. PiperOrigin-RevId: 513657280

Commit:0f1066e
Author:Jie Sun
Committer:TensorFlower Gardener

add a tag to indicate which source this InputPipelineAnalysisResult is from. PiperOrigin-RevId: 513621717

Commit:38a4470
Author:He Jiang
Committer:TensorFlower Gardener

Let minibenchmark support in-memory model. The in-memory model will be passed to subprocess through pipe. PiperOrigin-RevId: 513573577

Commit:d9da53c
Author:Alan Kelly
Committer:TensorFlower Gardener

assertProtoEqual calls checkFloatEqAndReplace, which recurses through the proto and compares all floats/doubles using a relative tolerance. The existing comparison is not valid for all floats: for example, we probably want 50000 and 49999.9 to compare equal, but they do not in the original version, because the float format strings cut digits rather than rounding, which could result in 50000 and 49999 depending on the formatting provided. With this change, tests will be less brittle, as a single different bit will not cause a failure if rtol is used. PiperOrigin-RevId: 512999626
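
The difference between truncating format strings and a relative tolerance can be sketched in plain Python. This is only an illustration of the idea in the commit message, not the actual `assertProtoEqual` or `checkFloatEqAndReplace` code; the helper names `format_truncated` and `close_with_rtol` and the tolerance of 1e-5 are assumptions.

```python
import math


def format_truncated(value: float) -> str:
    """Hypothetical formatter that cuts the fractional digits instead of rounding."""
    return str(int(value))  # int() truncates toward zero


def close_with_rtol(a: float, b: float, rtol: float = 1e-5) -> bool:
    """Compare two floats using a relative tolerance (the rtol value is an assumption)."""
    return math.isclose(a, b, rel_tol=rtol)


a, b = 50000.0, 49999.9
# String comparison after truncation sees "50000" and "49999".
string_equal = format_truncated(a) == format_truncated(b)
# Relative comparison allows a difference of up to rtol * max(|a|, |b|).
rtol_equal = close_with_rtol(a, b)
print(string_equal, rtol_equal)
```

With these values the truncated strings differ, while the relative comparison treats the two numbers as equal.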

Commit:98c639b
Author:Dan Suh
Committer:TensorFlower Gardener

Find the name of the "file_prefix" tensor from the cpp layer of TF Quantizer. Instead of finding "file_prefix" tensor from the python layer, find the name from the cpp layer so that the information relayed by the `ExportedModel` is more complete and the utilities for saving the model aren't fragmented. PiperOrigin-RevId: 512921226

Commit:231de16
Author:Xin Zhou
Committer:TensorFlower Gardener

[XLA] Add int4 types: U4/S4. PiperOrigin-RevId: 512150267

Commit:dea1165
Author:Matt Kreileder
Committer:TensorFlower Gardener

This CL introduces a new message, `CompilationCachingSettings`, in the acceleration config proto file and adds a field of this message type into the `TFLiteSettings` message. `CompilationCachingSettings` defines standardised compilation caching fields. Specifically we give (binary) stable delegates a set of fields that can be used for compilation caching. See also tensorflow/lite/core/experimental/acceleration/configuration/c/stable_delegate.h and an example stable delegate tensorflow/lite/delegates/utils/experimental/sample_stable_delegate. PiperOrigin-RevId: 512038131

Commit:3f43b4e
Author:Dan Suh
Committer:TensorFlower Gardener

Support quantizing models that use asset files to initialize resources. This change allows TF Quantizer to quantize and export models that use asset files to initialize resources like hash tables. A canonical example is a model that uses [`tf.lookup.TextFileInitializer`](https://www.tensorflow.org/api_docs/python/tf/lookup/TextFileInitializer). `AssetFileDef`s should be added to the collection in order to identify the string tensor to which the file names should be passed. Corresponding `AssetFileDef`s are extracted from the export passes of the TF Quantizer. Asset files are copied to the output Saved Model directories. PiperOrigin-RevId: 511932899

Commit:5eaf871
Author:A. Unique TensorFlower
Committer:Michael Hudgins

Merged commit includes the following changes:
511837617 by A. Unique TensorFlower<gardener@tensorflow.org>: Automated rollback of changelist 509256232.
511836298 by A. Unique TensorFlower<gardener@tensorflow.org>: [XLA:CPU] Outline fusion regions in the presence of implicit constant-like operands
511832706 by A. Unique TensorFlower<gardener@tensorflow.org>: Remove decorator usage for parallel devices as a first step in removing `control_flow_ops.py`'s dependency on `def_function.py` to eliminate circular dependencies.
511832154 by A. Unique TensorFlower<gardener@tensorflow.org>: [StableHLO to MHLO] Handle bounds of Gather op. Based on: https://github.com/openxla/stablehlo/pull/908
511832050 by A. Unique TensorFlower<gardener@tensorflow.org>: Specify `save_type='checkpoint'` when calling `trackable_children`.
511830259 by A. Unique TensorFlower<gardener@tensorflow.org>: Uses the same clustering algorithm in the TF2XLA bridge for TAC passes. They have better results than the current clustering in the number of clusters.
511830057 by A. Unique TensorFlower<gardener@tensorflow.org>: Add an option to strip location information from ops. These end up becoming huge during large fusions (like TpuRewritePass) and in practice don't help much with readability, as locations for TF are typically indicated by the op name.
511826780 by A. Unique TensorFlower<gardener@tensorflow.org>: [XLA] Add all-gather-start/done multi-operand shape inference tests
511815596 by A. Unique TensorFlower<gardener@tensorflow.org>: Describe potential problem in case the LLVM toolchain is used.
511810373 by A. Unique TensorFlower<gardener@tensorflow.org>: Update TFRT dependency to use revision http://github.com/tensorflow/runtime/commit/1ed8df8df17f431936090acde5122456c5eed394.
511801672 by A. Unique TensorFlower<gardener@tensorflow.org>: Plumb CollectiveAllToAllV2 to MLIR. Also fix an inconsistency regarding CollectiveGatherV2. Follow up of https://github.com/tensorflow/tensorflow/pull/59598
511799604 by A. Unique TensorFlower<gardener@tensorflow.org>: Integrate LLVM at llvm/llvm-project@219ba2fb7b0a Updates LLVM usage to match [219ba2fb7b0a](https://github.com/llvm/llvm-project/commit/219ba2fb7b0a)
511797844 by A. Unique TensorFlower<gardener@tensorflow.org>: Update TFRT dependency to use revision http://github.com/tensorflow/runtime/commit/9b50571c5b76f32103ad5099b0993581af3d0592.
511793598 by A. Unique TensorFlower<gardener@tensorflow.org>: Update type-checks of `DatasetV1` and `DatasetV2` to use abstract types. Changes usages of the internal `DatasetV1` and `DatasetV2` types to use the `tensorflow.types.data` versions instead of the concrete implementations. This helps reduce the tendency for cyclic dependencies involving the `dataset_ops.py` module. Usages of the concrete type (e.g. instantiation, member access) are not affected by this change.
511765379 by A. Unique TensorFlower<gardener@tensorflow.org>: [GmlSt] Remove 'distribute' attribute from ParallelOp tiling params.
511760944 by A. Unique TensorFlower<gardener@tensorflow.org>: Automated rollback of changelist 510948939.
511756866 by A. Unique TensorFlower<gardener@tensorflow.org>: Automated rollback of changelist 482514900.
511749935 by A. Unique TensorFlower<gardener@tensorflow.org>: [XLA:CPU Next] Disable tiling of linalg.generic.
511746972 by A. Unique TensorFlower<gardener@tensorflow.org>: [GmlSt] Do not tile linalg.generic if it was tiled already. The "tiled labels" are not populated correctly. Theoretically, we shouldn't check for the parent.
511745314 by A. Unique TensorFlower<gardener@tensorflow.org>: [XLA:CPU Next] Fix scalarization of scf.for. If the type of the output is not specified, then FromElementsOp creates a 1D tensor:
    void FromElementsOp::build(OpBuilder &builder, OperationState &result, ValueRange elements) {
      assert(!elements.empty() && "expected at least one element");
      Type resultType = RankedTensorType::get(
          {static_cast<int64_t>(elements.size())}, elements.front().getType());
      build(builder, result, resultType, elements);
    }
  The test that I had before in scalarization.mlir was 1D by coincidence and therefore worked.
511741158 by A. Unique TensorFlower<gardener@tensorflow.org>: Fill calculates the output tensor if all the information required is available during Prepare. This means that the output will be available for subsequent operator's Prepare and Eval will be free
511737032 by A. Unique TensorFlower<gardener@tensorflow.org>: [GmlSt] Remove gml_st.materialize.
511736776 by A. Unique TensorFlower<gardener@tensorflow.org>: [XLA:CPU Next] Disable scf.if vectorization.
511734615 by A. Unique TensorFlower<gardener@tensorflow.org>: [DelegatePerformance] Removed a strengthened precondition and changed the metric value type from float to double. The change removes the strengthened precondition from the inherited method computeModelReport() in the derived class ModelBenchmarkReport. It also changes the metric value type from float to double to avoid potential calculation errors.
511730756 by A. Unique TensorFlower<gardener@tensorflow.org>: [XLA:CPU Next] Add a pattern to tile linalg.generic to 1 and fuse greedily.
511726589 by A. Unique TensorFlower<gardener@tensorflow.org>: [GmlSt] Remove useless peeling label.
511719057 by A. Unique TensorFlower<gardener@tensorflow.org>: Transforming grouped convolution to depth wise when possible.
511716904 by A. Unique TensorFlower<gardener@tensorflow.org>: PR #59775: Fix tensor_or_memref build error without math.h Imported from GitHub PR https://github.com/tensorflow/tensorflow/pull/59775 https://github.com/tensorflow/tensorflow/pull/59536#issuecomment-1439039311 Since the previous PR has been rolled back, this one adopted reviewer's feedback to fix Mac build error. Thanks @reedwm for the feedback and suggestion! Copybara import of the project:
  -- 4f9f856e0159f5dd7c4e8d0e3b5232e55795f700 by Chao Chen <cchen104@amd.com>: fix tensor_or_memref build error without math.h
  Merging this change closes #59775
511715187 by A. Unique TensorFlower<gardener@tensorflow.org>: compat: Update forward compatibility horizon to 2023-02-23
511715185 by A. Unique TensorFlower<gardener@tensorflow.org>: Update GraphDef version to 1416.
511712677 by A. Unique TensorFlower<gardener@tensorflow.org>: [XLA] Further speedup HloModule::Print by using integer key for CanonicalNameMap.
511712612 by A. Unique TensorFlower<gardener@tensorflow.org>: Go: Update generated wrapper functions for TensorFlow ops.
511708539 by A. Unique TensorFlower<gardener@tensorflow.org>: Update ops-related pbtxt files.
511706378 by A. Unique TensorFlower<gardener@tensorflow.org>: Add SegmentMaxV2 op with num_segments as additional input. The only difference with SegmentMax is the additional input `num_segments`. This helps in evaluating the output shape at compile time.
511703897 by A. Unique TensorFlower<gardener@tensorflow.org>: Go: Update generated wrapper functions for TensorFlow ops.
511703121 by A. Unique TensorFlower<gardener@tensorflow.org>: PR #59616: ReLU Epilogue Fusion for FP8 GEMMs in XLA Imported from GitHub PR https://github.com/tensorflow/tensorflow/pull/59616 Enables the epilogue fusion of ReLU activations for FP8 GEMMs. Copybara import of the project:
  -- 8813bb2940dd4e5d53bb4092c1a1ee2a2f00a13b by Philipp Hack <phack@nvidia.com>: Epilogue fusion of ReLU activations for FP8 GEMMs.
  -- c9ee1d5869eaafa7e89842e9a460b96c9b912481 by Philipp Hack <phack@nvidia.com>: Epilogue fusion of ReLU activations for FP8 GEMMs.
  Merging this change closes #59616
511693497 by A. Unique TensorFlower<gardener@tensorflow.org>: Add SegmentMinV2 op with num_segments as additional input. The only difference with SegmentMin is the additional input `num_segments`. This helps in evaluating the output shape at compile time.
511692851 by A. Unique TensorFlower<gardener@tensorflow.org>: Go: Update generated wrapper functions for TensorFlow ops.
511690064 by A. Unique TensorFlower<gardener@tensorflow.org>: [jax2tf] Use CUDA and ROCM instead of GPU for XlaCallModuleOp platforms. JAX is moving to using ROCM and CUDA instead of the generic GPU platform type and it is already supporting separate lowerings for ROCM and CUDA. To keep up with this functionality, we move the XlaCallModuleOp to supporting ROCM and CUDA platforms.
511687577 by A. Unique TensorFlower<gardener@tensorflow.org>: VLOG(1) upon fallback in phase 2. Whether the new or old bridge ran is usually the first thing to find out when debugging. So when fallback from new bridge to old bridge happens, it should be reported with log level 1.
511684159 by A. Unique TensorFlower<gardener@tensorflow.org>: Cast BF16 Depthwise conv2D ops to f32 ops. This change is to cast BF16 Depthwise Conv2d ops to f32 to make it ready for quantization. But, as the Depthwise conv2D quantization is disabled due to performance improvement issue for now, this change does not guarantee the BF16 Depthwise Conv2D quantization.
511677183 by A. Unique TensorFlower<gardener@tensorflow.org>: Added a util function to process einsum
511676949 by A. Unique TensorFlower<gardener@tensorflow.org>: #tf-data Retry empty repetitions when repeating data service dataset. The ForeverRepeat op assumes if the first repetition produces no data, all future repetitions will produce no data. That is not always true. For example, when using tf.data service, different repetitions may produce different numbers of elements, and empty repetitions should be retried.
511657402 by A. Unique TensorFlower<gardener@tensorflow.org>: [XLA] More speedups to HloModule::Print.
511647352 by A. Unique TensorFlower<gardener@tensorflow.org>: Automated rollback of changelist 511611390.
511644926 by A. Unique TensorFlower<gardener@tensorflow.org>: Refactor TfrtPipelineOptions to a separate file so that it can be reused.
511642132 by A. Unique TensorFlower<gardener@tensorflow.org>: Update TFRT dependency to use revision http://github.com/tensorflow/runtime/commit/410c7f3b07f1d1170b13242a2cf9af4ec7edd33f.
511640480 by A. Unique TensorFlower<gardener@tensorflow.org>: Update `DimsAre` matcher to be polymorphic over both `TfLiteTensor` and `TfLiteIntArrays`.
511639990 by A. Unique TensorFlower<gardener@tensorflow.org>: Fix Relayout handling under XLA SPMD. According to the plans by samuelslee@, now that chensunx@ contributed the pass to lower Relayout to Identity.
511639775 by A. Unique TensorFlower<gardener@tensorflow.org>: Integrate LLVM at llvm/llvm-project@a7b6978285c1 Updates LLVM usage to match [a7b6978285c1](https://github.com/llvm/llvm-project/commit/a7b6978285c1)
511637257 by A. Unique TensorFlower<gardener@tensorflow.org>: Include full node_def info in activity watcher.
511637139 by A. Unique TensorFlower<gardener@tensorflow.org>: Update TFRT dependency to use revision http://github.com/tensorflow/runtime/commit/bd588322b3d6660903ee0df9b55d2f589aa5dc8a.
511632436 by A. Unique TensorFlower<gardener@tensorflow.org>: Automated rollback of changelist 511597567.
511611390 by A. Unique TensorFlower<gardener@tensorflow.org>: PR #57956: Performance Enhancements for Sparse Embedding Lookups Imported from GitHub PR https://github.com/tensorflow/tensorflow/pull/57956 Introduces performance options for sparse embedding lookups that can appreciably speed up the training of recommendation systems. Sparse lookups alternatively accept inputs described by RaggedTensors which are more memory efficient. Performance is further increased by the optional use of a simplified and typically faster embedding lookup. In the sparse embedding micro benchmarks in tensorflow/python/eager/benchmarks_test.py, the number of examples per second on a DGX A100 system increases from approx. 1,300 with SparseTensor and without simplified lookup to approx. 11,200 with RaggedTensor inputs and simplified lookup (+760%). The combination of SparseTensor inputs and simplified lookup yields approx. 3,000 examples per second (+130%). Copybara import of the project:
  -- 00ee1a9deea66c511b4e6b32f961c7598a299db3 by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- fe6eb184ab75dfb7399919e617014d4e61388221 by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- 47157c3e57ef7700044e6627563c813eafc7ba9c by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- 8826a9b387e32223f59397e7047c9f29f3fea674 by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- 4e770724eddc7450b82c58f89a56968c1ec0aa99 by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- 3d1096aafcecbdf83d700272e088c2225ec058d6 by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- c115b80151098e3c574c52625cee97c4c41af40f by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- 842591cef9cffa40b32e176ded14453bfbc96678 by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- 2a307f3f09a88b26586ad81685685a78ccfebcd1 by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- ee90704c034d0c04557e7870670f10c17a999290 by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- 9e216e3d3f541d07951b00b457e7f764300b5da8 by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- c393302d3804b35dacedba125448da080f68af8b by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- 5165f071ebf6d0d7568789e6f5e9034707149b61 by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- f3e88c4c6d75910c3cecbe310d779048cc84c5a6 by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- 83847914532dbab5fb2ac052983fdc85cde81c49 by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- 497f9c39efe136ba9539f46f1625996a1a81b8f4 by Philipp Hack <phack@nvidia.com>: Adds performance enhancements for sparse embedding lookups.
  -- 2166e5defae2f93b66bee8544f544c033e0ec66b by Philipp Hack <phack@nvidia.com>: Performance enhancements for sparse embedding lookups.
  -- 13dd2c877d75f043be1ce014ceae1027a9876d41 by Philipp Hack <phack@nvidia.com>: Performance enhancements for sparse embedding lookups.
  -- a83f4d4890ca7f0ad3982a8d224a15be1ee21ee9 by Philipp Hack <phack@nvidia.com>: Performance enhancements for sparse embedding lookups.
  Merging this change closes #57956
511609283 by A. Unique TensorFlower<gardener@tensorflow.org>: #tf-data Ramp down `autotune_buffer_optimization` experiment.
511606842 by A. Unique TensorFlower<gardener@tensorflow.org>: Avoid double counting graph building time for nested functions.
511606056 by A. Unique TensorFlower<gardener@tensorflow.org>: PR #59619: [NVIDIA TF] Throw error only for non-empty function definitions. Imported from GitHub PR https://github.com/tensorflow/tensorflow/pull/59619 During Function Serialization and Deserialization with Saved Models there can be Node ops with type `func` with default values that have no associated name. In such cases, we shouldn't look for undefined empty strings in Function Library. This is a trivial change and helps with how Saved Models are loaded. Copybara import of the project:
  -- e0fe60f9a9db8886cf423949bbd2e5da796397ae by Pavani Majety <pmajety@nvidia.com>: [BugFix] Throw error only for non-empty function definitions. Add reason for the required change. Add comment to both locations.
  Merging this change closes #59619
511605660 by A. Unique TensorFlower<gardener@tensorflow.org>: Fix typo in comment
511604096 by A. Unique TensorFlower<gardener@tensorflow.org>: #tf-data-service Put distributed_save_test and snapshot_ft_test together.
511603520 by A. Unique TensorFlower<gardener@tensorflow.org>: update tracking bug now that the step id is propagated correctly.
511602286 by A. Unique TensorFlower<gardener@tensorflow.org>: add a size() member function to HloProtoMap.
511599929 by A. Unique TensorFlower<gardener@tensorflow.org>: gpu_delegate: Update to support Mali G715
511597567 by A. Unique TensorFlower<gardener@tensorflow.org>: [tf-lite] Enable parallel transpose.
511596662 by A. Unique TensorFlower<gardener@tensorflow.org>: Add aggregated stats for the per-step result. The goal is to remove all_reduce_db_per_core. We will lose the capability of separating compute time and synchronization time, for TPU only. But I think it is fine: most collectives on TPU are now async ops, so these algorithms are no longer very informative. We will count all async collectives as synchronization time for TPU. For GPU, this doesn't apply.
511590747 by A. Unique TensorFlower<gardener@tensorflow.org>: [PJRT C API] Add a README file to provide communication channel and resources.
511585380 by A. Unique TensorFlower<gardener@tensorflow.org>: Automated rollback of changelist 511565925.
511578143 by A. Unique TensorFlower<gardener@tensorflow.org>: [XLA] Minor fixes in ShardingPropagation. - Moves the misplaced comment block of replicate_on_last_tile_dim_. - Uses existing variable root_instr to eliminate redundant accesses. - Replaces bitwise and with logical and. - Uses reverse iterator instead of reversing the list.
511569312 by A. Unique TensorFlower<gardener@tensorflow.org>: Remove `type_spec` direct dependency on `framework/ops.py`. Changes `type_spec` references to `tf.Tensor` to use an abstract base type for `isinstance` checks. Uses the conversion function in `tensor_conversion_registry` instead of its wrapper in `ops`. While `tensor_conversion_registry` has an indirect dependency on `ops.py`, there is future work planned to remove that dependency.
511568795 by A. Unique TensorFlower<gardener@tensorflow.org>: Check the shape of indices matches the shape of dense_shape in sparse_fill_empty_rows op
511567163 by A. Unique TensorFlower<gardener@tensorflow.org>: [xla-next][mlir][sparse] add mhlo sparsity rewriting to pipeline
511565925 by A. Unique TensorFlower<gardener@tensorflow.org>: Update visibility to fix OSS build
PiperOrigin-RevId: 511837617

Commit:8d163fb
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Introduce a new flag to enable TPU model quantization support. As the TPU model quantization support is unstable, we introduce a flag to trigger the feature for other users to avoid unexpected errors. PiperOrigin-RevId: 511338951

Commit:ec1f4e4
Author:Matt Callanan
Committer:TensorFlower Gardener

Removed unused import. PiperOrigin-RevId: 511321244

Commit:188e9f9
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Add extension field in GoogleEdgeTpuSettings. PiperOrigin-RevId: 511237828

Commit:3c2afe0
Author:David Rim
Committer:TensorFlower Gardener

Add experimental flag to disable layer norm and other mul->fc fusing. PiperOrigin-RevId: 510440994

Commit:5a10d92
Author:Wilsin Gosti
Committer:TensorFlower Gardener

#tf-data Save gap times used by stage based autotune in model proto. PiperOrigin-RevId: 510257921

Commit:e027bb7
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Fix collective shape mismatch error on GPU and CPU. The key for InstanceRec is extended to also include the step_id, such that collective instances from different steps become independent from each other, as they conceptually are. This aligns the treatment of Collective Ops and Send/Recv Ops (via the Rendezvous) under DTensor. A (hacky but effective) special case filtering is added to only enable this under DTensor without changing the V2 Collective Op signatures. MWMS does not (always) guarantee steps from different workers have the same step_id, and cannot be enrolled into this change. The test case demonstrates the typical failure pattern. Note the test case fails on TPU for a different reason (fixing on TPU needs dynamic shape support). PiperOrigin-RevId: 510237456
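
Why adding step_id to the key makes instances from different steps independent can be sketched with a small lookup table. This is a simplified Python illustration, not TensorFlow's C++ `InstanceRec` handling; `InstanceKey`, its fields other than step_id, and `register` are hypothetical names chosen for the example.

```python
from typing import Dict, NamedTuple, Tuple


class InstanceKey(NamedTuple):
    """Hypothetical lookup key; the real key lives in TensorFlow's C++ code."""
    group_key: int
    instance_key: int
    step_id: int


Shape = Tuple[int, ...]


def register(table: Dict[InstanceKey, Shape], key: InstanceKey, shape: Shape) -> None:
    """Record the shape for a collective instance, rejecting a conflicting shape."""
    existing = table.setdefault(key, shape)
    if existing != shape:
        raise ValueError(f"collective shape mismatch for {key}: {existing} vs {shape}")


table: Dict[InstanceKey, Shape] = {}
register(table, InstanceKey(group_key=1, instance_key=7, step_id=100), (4, 8))
# A later step reuses the same group and instance keys with a different shape.
# Because step_id is part of the key, this is a separate entry, not a conflict.
register(table, InstanceKey(group_key=1, instance_key=7, step_id=101), (2, 8))
```

If step_id were left out of the key, the second call would hit the first entry and raise the mismatch error.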

Commit:688b63d
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Define GoogleEdgeTpuSettings in configuration.proto. PiperOrigin-RevId: 510178940

Commit:342af17
Author:Johannes Reifferscheid
Committer:TensorFlower Gardener

Add a flag for experimental deallocation. PiperOrigin-RevId: 510105479

Commit:9b4ed39
Author:Matt Callanan
Committer:TensorFlower Gardener

Source data transfer addresses from the worker server. This doesn't change any existing functionality but will support users being able to opt out of a forthcoming data transfer experiment. Overview of changes: - The selection of a default data transfer protocol now happens in `DataServiceClient` rather than `DataServiceDatasetOp`. - Servers now convey to clients a list of (protocol, address) pairs rather than a single address, and clients choose from this list using the user-specified protocol. PiperOrigin-RevId: 509993158
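
The client-side selection described above can be sketched as follows. This is a minimal Python illustration, not the `DataServiceClient` implementation; the function name `choose_transfer_address`, the protocol names, the addresses, and the `"grpc"` default are all assumptions.

```python
from typing import Optional, Sequence, Tuple

DEFAULT_PROTOCOL = "grpc"  # assumed default, for illustration only


def choose_transfer_address(
    servers: Sequence[Tuple[str, str]],
    requested_protocol: Optional[str] = None,
) -> str:
    """Pick the address whose protocol matches the request (or the default)."""
    wanted = requested_protocol or DEFAULT_PROTOCOL
    for protocol, address in servers:
        if protocol == wanted:
            return address
    raise ValueError(f"worker does not offer protocol {wanted!r}")


# Hypothetical list of (protocol, address) pairs advertised by one worker.
advertised = [("grpc", "worker0:5050"), ("experimental", "worker0:6060")]
default_address = choose_transfer_address(advertised)
experimental_address = choose_transfer_address(advertised, "experimental")
```

Keeping the default selection on the client side means a user who names a protocol explicitly is never silently moved to a different one.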

Commit:eee393b
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Update flatbuffer generated configuration_generated.h.oss. PiperOrigin-RevId: 509876768

Commit:3a39e69
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Extends `MetricsHookInterface` with the ability to export more granular compilation metrics. 1. Extends `CompilationLogEntry` with `pass_metrics` defined by a new `PassMetrics` proto. 2. Renames `RecordStageLatency` into more generic `RecordCompilationMetrics`. PiperOrigin-RevId: 509867105

Commit:e57e25d
Author:Ilia Sergachev
Committer:TensorFlower Gardener

[XLA:GPU] Add a flag to use Triton-based GEMM emitter for any GEMM that it supports. Also move Triton-GEMM-related code closer together. PiperOrigin-RevId: 509781425

Commit:d3509a4
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

[tf.data] Implement `warm_start` feature for all the asynchronous operations. PiperOrigin-RevId: 509533594

Commit:b49cc3f
Author:Swachhand Lokhande
Committer:TensorFlower Gardener

Add a compiled_using_pjrt field to persistent cache key. This is to differentiate the executables that would be built by XLA and PJRT clients when they are being serialized to disk for persistence. There would be no changes to the filenames of XLA serialized executables. PJRT serialized executables will have '__pjrt' appended to their filenames. This also adds the SerializedExecutable function back to PjRtDeviceCompilerClient. Removes XLA dependencies for `//tensorflow/lite/delegates/flex:delegate_data`. A unit test indirectly depends on `//tensorflow/core/ops:string_ops_op_lib` under `//tensorflow/compiler/jit:xla_kernel_creator`. This adds the dependency to the test target directly instead. PiperOrigin-RevId: 509277222

Commit:90f4292
Author:Jun Xu
Committer:TensorFlower Gardener

Asynchronously remove functions in remote workers. Before this CL, context.remove_function only removed function_defs from the local context. After this CL, function_defs in remote contexts should also be removed. This should prevent duplicate function_defs errors when running transformed functions in a distributed environment. This should also fix a memory leak in remote devices, since now when an EagerDefinedFunction gets deleted, it will also be deleted from the remote eager context. PiperOrigin-RevId: 509259962

Commit:f687b71
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

[DelegatePerformance] Replace initialization_max_regression_percentage_allowed and average_warm_up_max_regression_percentage_allowed with startup_overhead_max_regression_percentage_allowed. Startup overhead time = initialization time + average warmup time - average inference time. PiperOrigin-RevId: 509218901
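
A worked example of the startup overhead formula, as a small Python sketch. Only the formula itself comes from the commit message; the function names, the sample timings, and the way the regression percentage is computed (relative change against a reference run) are assumptions made for illustration.

```python
def startup_overhead(initialization: float, average_warmup: float,
                     average_inference: float) -> float:
    """startup overhead = initialization + average warmup - average inference."""
    return initialization + average_warmup - average_inference


def regression_percentage(reference: float, candidate: float) -> float:
    """Assumed definition: relative increase of candidate over reference, in percent."""
    return (candidate - reference) * 100.0 / reference


# Hypothetical timings in microseconds.
reference = startup_overhead(1000.0, 300.0, 200.0)   # 1000 + 300 - 200
candidate = startup_overhead(1200.0, 320.0, 200.0)   # 1200 + 320 - 200
regression = regression_percentage(reference, candidate)
print(reference, candidate, regression)
```

A single threshold on this combined value replaces the two separate thresholds on initialization time and average warmup time.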

Commit:f775935
Author:Dan Suh
Committer:TensorFlower Gardener

Remove unused variable shared_names from TF Quantizer's `ExportedModel`. Since TF quantizer now uses `SaverDef` to restore / save variables, `variable_shared_names` of `ExportedModel` is no longer used from the Python side. This change cleans up the production & use of `variable_shared_names` in the TF Quantizer's c++ API. PiperOrigin-RevId: 509160886

Commit:e692db8
Author:Dan Suh
Committer:TensorFlower Gardener

Use Saver to save or restore variables for TF Quantizer. This change fixes a workaround for restoring variables by using `Saver`. In the previous workaround, empty variables with the same `shared_name` were created to restore from checkpoint, but this caused problems, especially when the variable's node name and shared name differed. Using `Saver` follows the saved model semantics more closely and is the method used from within the saved model library. Additionally, to make the saving and restoring behavior more correct, `InsertRestoreOpPass` now sets the location of each `VarHandleOp` to its shared_name. The name of the VarHandleOp should be set to its shared_name. This is required when the exported model is re-imported as MLIR. The MLIR import process looks at the variable's location name (== node name) before looking at the shared_name to restore variables. When the two names are different, the session cannot find the node in the graph or get the variables' tensor values. PiperOrigin-RevId: 509112645

Commit:3e24055
Author:Sergey Kozub
Committer:TensorFlower Gardener

Add support for emitting the int8x32 cuDNN convolution reordering custom calls. 1) Add a debug flag `xla_gpu_enable_cudnn_int8x32_convolution_reordering` (disabled by default). 2) If the flag is set, add reordering custom call during the vectorization pass. 3) Update convolution layout normalization callback to make sure both the input and the output to the reordering custom call have NCHW_VECT_C physical layout. 4) Correctly set convolution and filter handle attributes in the DNN implementation. PiperOrigin-RevId: 508678175

Commit:c9f8e0f
Author:Sergey Kozub
Committer:TensorFlower Gardener

Add `reordered_int8_nchw_vect` flag to convolution backend proto. This is necessary to disambiguate layouts that could not be otherwise detected by XlaConvShapesToStreamExecutorLayouts, in this case int8x32 reordered filter and bias. PiperOrigin-RevId: 508586274

Commit:a5dedd2
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Internal change. PiperOrigin-RevId: 508522408

Commit:7ee0a6b
Author:Swachhand Lokhande
Committer:TensorFlower Gardener

Add a compiled_using_pjrt field to persistent cache key. This is to differentiate the executables that would be built by XLA and PJRT clients when they are being serialized to disk for persistence. There would be no changes to the filenames of XLA serialized executables. PJRT serialized executables will have '__pjrt' appended to their filenames. This also adds the SerializedExecutable function back to PjRtDeviceCompilerClient. Removes XLA dependencies for `//tensorflow/lite/delegates/flex:delegate_data`. PiperOrigin-RevId: 508512797

Commit:f8ffe16
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Added hardware_cluster_ids to the EdgeTPU settings. PiperOrigin-RevId: 508445790

Commit:bfd2853
Author:Justin Lebar
Committer:TensorFlower Gardener

Remove ConfigProto.experimental.friendly_name in favor of experimental.session_metadata.name. In cl/507687578 I added ConfigProto.experimental.friendly_name and used it to set the name on XLA clusters created by autoclustering. I didn't realize that we already had experimental.session_metadata.name, which is basically the same thing. So this CL removes friendly_name and converts autoclustering to use session_metadata.name. Very sorry for the churn. PiperOrigin-RevId: 508434449

Commit:19f20fd
Author:Justin Lebar
Committer:TensorFlower Gardener

Add experimental `friendly_name` config option. Today when you use XLA autoclustering, the XLA module names have the form e.g.
cluster_15111732669523428041_0__XlaCompiledKernel_true__XlaHasReferenceVars_false__XlaNumConstantArgs_0__XlaNumResourceArgs_0_.2393
If you have a program with many TensorFlow nets, it is hard to tell which cluster_1234 belongs to which net. With this change, you can set a name in the TF session config and the name gets propagated down to the XLA cluster name:
my_friendly_name_15111732669523428041_0__<snip>
For now this only works for nets that enable XLA via autoclustering. I'd like to do something similar for nets that use tf.function and inference-converter. PiperOrigin-RevId: 507687578
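
The renaming described in this commit message can be illustrated with a minimal sketch. It assumes only what the message shows: the default prefix `cluster` is swapped for the configured name, and the rest of the module name is unchanged. The function name `cluster_module_prefix` and its parameters are hypothetical and are not part of TensorFlow or XLA.

```python
def cluster_module_prefix(cluster_id: int, index: int, friendly_name: str = "") -> str:
    """Build the leading part of an autoclustered XLA module name.

    Hypothetical helper: falls back to the default "cluster" prefix
    when no friendly name is configured.
    """
    prefix = friendly_name if friendly_name else "cluster"
    return f"{prefix}_{cluster_id}_{index}"


default_name = cluster_module_prefix(15111732669523428041, 0)
named = cluster_module_prefix(15111732669523428041, 0, "my_friendly_name")
```

With many nets in one program, the second form makes it possible to tell from the module name alone which net a cluster belongs to.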

Commit:4941c8c
Author:Roman Dzhabarov
Committer:TensorFlower Gardener

[xla/metrics] Clarify comment on the task_index in the metrics.proto. PiperOrigin-RevId: 507599269

Commit:f158f88
Author:Ilia Sergachev
Committer:TensorFlower Gardener

[XLA:GPU] Add triton-based matmul emitter. PiperOrigin-RevId: 507498160

Commit:cf48610
Author:Matt Callanan
Committer:TensorFlower Gardener

Handle snapshot and stream completion in tf.data service dispatcher. PiperOrigin-RevId: 507023442

Commit:03ea9a8
Author:Marcello Maggioni
Committer:TensorFlower Gardener

[XLA] Add way to allow propagation to output only to a subset of root instruction tuple shardings. PiperOrigin-RevId: 506935285

Commit:5a86ef6
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

[XLA] Create skeleton for a partition assignment pass, which annotates the given module with (good) shardings, by adding:
- an HLO pass: PartitionAssignment,
- a base class: PartitioningAlgorithm,
- a no-op derived class extending PartitioningAlgorithm: NoopPartitioning, and
- a flag to determine the algorithm (kind/type): xla_partitioning_algorithm.
PiperOrigin-RevId: 506423268

Commit:846ebbb
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Remove time (AKA time_fraction) field, since it's no longer used. We now compute this in the frontend to avoid storing this redundant field in the protobuf. PiperOrigin-RevId: 506372540

Commit:8c77288
Author:Matt Callanan
Committer:TensorFlower Gardener

Manage snapshot streams assignments in tf.data service dispatcher. Related changes:
- Added `DispatcherService::GetSnapshotStreams`, a new readonly API for seeing the state of snapshot stream assignments from the dispatcher's perspective.
- Made `DispatcherConfig.worker_timeout_ms` configurable.
PiperOrigin-RevId: 506287683

Commit:4ffdd6f
Author:Ian Hua
Committer:TensorFlower Gardener

Fix build breakage for DPB. PiperOrigin-RevId: 506261904

Commit:c31e086
Author:Yang Chen
Committer:TensorFlower Gardener

#tf-data-service SnapshotSplitProvider verifies local split indexes. It compares the expected next split index with the one received from the dispatcher. If not equal, it reads the split from files. If equal, it will return the result from the dispatcher. PiperOrigin-RevId: 505797858
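
The index check described here can be sketched as follows. This is only an illustration of the stated rule (compare the expected index with the received one, fall back to files on a mismatch); the class `SplitProviderSketch` and the two callables it takes are hypothetical stand-ins, not the real `SnapshotSplitProvider` interface.

```python
from typing import Callable, Tuple


class SplitProviderSketch:
    """Returns splits in order, reading from files when the dispatcher skips ahead."""

    def __init__(
        self,
        request_split: Callable[[], Tuple[int, str]],  # hypothetical: returns (index, split)
        read_split_from_files: Callable[[int], str],   # hypothetical: index -> split
    ) -> None:
        self._request_split = request_split
        self._read_split_from_files = read_split_from_files
        self._next_index = 0  # the split index the worker expects next

    def get_next(self) -> str:
        received_index, split = self._request_split()
        if received_index != self._next_index:
            # The dispatcher's answer is not the split we expect,
            # so recover the expected one from the file system.
            split = self._read_split_from_files(self._next_index)
        self._next_index += 1
        return split


# Simulated dispatcher that skips index 1, and files that hold every split.
responses = iter([(0, "a"), (2, "c"), (3, "d")])
files = {0: "a", 1: "b", 2: "c", 3: "d"}
provider = SplitProviderSketch(lambda: next(responses), files.__getitem__)
splits = [provider.get_next() for _ in range(3)]
```

In this simulation the worker still sees the splits in order even though the dispatcher never returned index 1 directly.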

Commit:7456b69
Author:Yishuang Pang
Committer:TensorFlower Gardener

Add a compiled_using_pjrt field to persistent cache key. This is to differentiate the executables that would be built by XLA and PJRT clients when they are being serialized to disk for persistence. There would be no changes to the filenames of XLA serialized executables. PJRT serialized executables will have '__pjrt' appended to their filenames. PiperOrigin-RevId: 505721399
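
A minimal sketch of the naming rule stated in this message, assuming the suffix is appended to the end of an otherwise unchanged name. The function `serialized_executable_name` and the example base name are hypothetical; the real cache key layout is not described in the commit message.

```python
def serialized_executable_name(base_name: str, compiled_using_pjrt: bool) -> str:
    """Hypothetical helper: append '__pjrt' only for PJRT-built executables."""
    return base_name + "__pjrt" if compiled_using_pjrt else base_name


xla_name = serialized_executable_name("my_cache_entry", compiled_using_pjrt=False)
pjrt_name = serialized_executable_name("my_cache_entry", compiled_using_pjrt=True)
```

Because XLA names are left untouched, entries serialized before the change would keep resolving to the same files, while PJRT entries cannot collide with them.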

Commit:7d5c1fe
Author:Yang Chen
Committer:TensorFlower Gardener

#tf-data-service GetSnapshotSplit returns the split indexes. The worker needs them to validate that the splits it receives and the splits it reads from files do not have gaps. For example, in the following sequence:
worker requests the kth split -> dispatcher receives request -> worker dies -> worker restarts -> worker reads from files and processes (k-1) splits -> dispatcher writes the kth split -> worker requests the next split -> dispatcher returns the (k+1)th split
In the above sequence, the worker misses the kth split. If the worker knows the split index, it will wait for the kth split to appear in the file system when it receives the (k+1)th split. PiperOrigin-RevId: 505144776

Commit:54bb01d
Author:Matt Callanan
Committer:TensorFlower Gardener

Input tf.data service worker address into snapshot split providers. PiperOrigin-RevId: 505136715

Commit:e3cbf21
Author:Ilia Sergachev
Committer:TensorFlower Gardener

[XLA][NFC] Fix mistypes in comments and strings. PiperOrigin-RevId: 505097297

Commit:dfe994a
Author:Matt Callanan
Committer:TensorFlower Gardener

Add tf.data service worker readonly API for getting the progress of snapshot tasks. PiperOrigin-RevId: 505091618

Commit:8421f7a
Author:Catherine Payne
Committer:TensorFlower Gardener

Removing unused enum values for tfxla bridge flag from TF Public API PiperOrigin-RevId: 504986360

Commit:34d5ec1
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Create Quantization Options for StableHLO. PiperOrigin-RevId: 504519209

Commit:e780748
Author:Matt Callanan
Committer:TensorFlower Gardener

Add functionality to detect lost workers in tf.data service dispatcher. PiperOrigin-RevId: 504407156

Commit:50cdf7a
Author:Dan Suh
Committer:TensorFlower Gardener

Preserve function aliases through the TF quantizer. Function aliases associate user-defined names to TF functions. This information is in the `MetaGraphDef` and is used to identify specific functions of interest (the actual function names change every time the model is saved as SavedModel). This change preserves the function aliases through the TF functions, so that the same functions (quantized, after the quantization passes) are still associated with the same aliases. This is done by preventing the aliased functions from being inlined by `MarkFunctionsNoinlinePass`. The pass receives a list of names that should not be inlined, so the names should be passed along from the quantizer's python API. PiperOrigin-RevId: 504396824

Commit:952a5ed
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Remove aggregated memory BW utilization - server. This is the server-side change; the parent CL already removed this information from the frontend. Previously, we were reporting sum(bws)/sum(peak_bws). But now we report bw/peak_bw for each different memory separately. The aggregated stat doesn't mean much, since it's combining apples and oranges. Removing redundant and misleading information. PiperOrigin-RevId: 504375462
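
The difference between the two statistics can be seen in a small sketch. The formulas `sum(bws)/sum(peak_bws)` and `bw/peak_bw` come from the message; the memory names and bandwidth numbers below are made up for illustration.

```python
from typing import Dict


def aggregated_utilization(bw: Dict[str, float], peak_bw: Dict[str, float]) -> float:
    """The removed statistic: sum(bws) / sum(peak_bws)."""
    return sum(bw.values()) / sum(peak_bw.values())


def per_memory_utilization(bw: Dict[str, float], peak_bw: Dict[str, float]) -> Dict[str, float]:
    """The statistic now reported: bw / peak_bw for each memory separately."""
    return {name: bw[name] / peak_bw[name] for name in bw}


# Illustrative numbers only (arbitrary units).
bw = {"off_chip": 400.0, "on_chip": 100.0}
peak_bw = {"off_chip": 800.0, "on_chip": 10000.0}

aggregate = aggregated_utilization(bw, peak_bw)
per_memory = per_memory_utilization(bw, peak_bw)
```

Here the off-chip memory runs at half of its peak, yet the aggregate figure is under 5% because the much larger on-chip peak dominates the denominator, which is the "apples and oranges" problem the message refers to.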

Commit:5760ad4
Author:Swachhand Lokhande
Committer:TensorFlower Gardener

Add a compiled_using_pjrt field to persistent cache key. This is to differentiate the executables that would be built by XLA and PJRT clients when they are being serialized to disk for persistence. There would be no changes to the filenames of XLA serialized executables. PJRT serialized executables will have '__pjrt' appended to their filenames. PiperOrigin-RevId: 504338570

Commit:232354c
Author:Johannes Reifferscheid
Committer:TensorFlower Gardener

Tool for executing IR at different compilation stages. PiperOrigin-RevId: 504216243

Commit:a450771
Author:Jie Sun
Committer:TensorFlower Gardener

Deprecate unused computation_name. PiperOrigin-RevId: 504129492

Commit:aee9c7f
Author:Matt Callanan
Committer:TensorFlower Gardener

In tf.data service heartbeat proto, change snapshot task type from repeated to map. This enables a cleaner interaction between the dispatcher and snapshot manager and helps document that a worker can have only one active task per snapshot. PiperOrigin-RevId: 504071976
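
Why a map documents the "one active task per snapshot" rule better than a repeated field can be shown with plain Python containers. This is an analogy only: the dictionary keys and the `base_path` and `stream_index` names are hypothetical and are not taken from the heartbeat proto.

```python
from typing import Dict, List


def index_tasks_by_snapshot(tasks: List[dict]) -> Dict[str, dict]:
    """Convert a list of tasks into a map keyed by snapshot path.

    Raises ValueError if a snapshot appears twice, which a list
    (repeated field) would silently allow.
    """
    by_snapshot: Dict[str, dict] = {}
    for task in tasks:
        path = task["base_path"]  # hypothetical key name
        if path in by_snapshot:
            raise ValueError(f"more than one active task for snapshot {path}")
        by_snapshot[path] = task
    return by_snapshot


tasks = [
    {"base_path": "/snapshots/a", "stream_index": 0},
    {"base_path": "/snapshots/b", "stream_index": 3},
]
tasks_by_snapshot = index_tasks_by_snapshot(tasks)
```

With the map form, looking up the task for a given snapshot is a single keyed access rather than a scan.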

Commit:c5da75a
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Fix typos in comments describing FP8 dtypes. Number of mantissa bits was incorrect. PiperOrigin-RevId: 503512825

Commit:9b5d7c6
Author:Matt Callanan
Committer:TensorFlower Gardener

Move distributed snapshot task protos from dispatcher.proto to common.proto. These make more sense to be accessed from common.proto in a forthcoming CL testing worker<->dispatcher interactions. Also `TaskDef` is already in common.proto so maybe `SnapshotTaskDef` should be there, too. PiperOrigin-RevId: 503484754

Commit:48f9d3d
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Adds BACKEND_PASSES stage to the `CompilationLogEntry` proto. PiperOrigin-RevId: 503476779

Commit:b08ccd4
Author:Parker Schuh
Committer:TensorFlower Gardener

Move xplane_to_trace_events and trace_events_to_json to tsl. PiperOrigin-RevId: 503320621

Commit:e70ef6c
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Propagate `tensorflow.SessionMetadata` in `EagerContext` PiperOrigin-RevId: 503214970

Commit:d872675
Author:Yang Chen
Committer:TensorFlower Gardener

#tf-data-service Distributed snapshot task def supports compression. PiperOrigin-RevId: 502978855

Commit:eb08bba
Author:Matt Callanan
Committer:TensorFlower Gardener

Create class to manage tf.data service dispatcher snapshots. This is mostly a 1:1 restructuring with the following changes:
1) Added simple snapshot recovery from on-disk state.
2) Removed all members tracking snapshot, stream, and source completion. I think these may have been structured incorrectly, and either way they weren't tested or used. I'll reevaluate when stream completion is implemented.
3) Removed some validations that weren't tested and/or were related to #1. Will add back after addressing #1.
4) Renamed directory -> path.
PiperOrigin-RevId: 502934739

Commit:50f28e7
Author:Peter Buchlovsky
Committer:TensorFlower Gardener

[LatencyHidingScheduler] Save core frequency in schedule proto. PiperOrigin-RevId: 502719264

Commit:a6c5ea0
Author:Yang Chen
Committer:TensorFlower Gardener

#tf-data-service Use dynamic sharding for distributed snapshots. PiperOrigin-RevId: 502694864

Commit:e97a908
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Use proto array to track memory bytes accessed. Also report BWs for each BW type, not just HBM. We track on-chip read, on-chip write, off-chip read+write, and all bytes. Also remove overall BW utilization, since it combines different BWs and peak BWs together, so it doesn't give a good sense of where the memory bottlenecks are. PiperOrigin-RevId: 502640430

Commit:2f9d51d
Author:Yang Chen
Committer:TensorFlower Gardener

#tf-data-service Run `SnapshotStreamWriter` in `WorkerImpl`. The workers will run SnapshotStreamWriter when they receive requests in the dispatcher heartbeat responses. PiperOrigin-RevId: 501966318

Commit:c377e93
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

deprecate computation name which is redundant. PiperOrigin-RevId: 501941984

Commit:74c86fa
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Clean up server-side bw fields that the frontend no longer uses. The frontend no longer uses these fields, so remove them from the server side. PiperOrigin-RevId: 501928865

Commit:0e5a06d
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Adds CODE_GENERATION stage to the `CompilationLogEntry` proto. PiperOrigin-RevId: 501923237

Commit:dcb1ee6
Author:Jie Sun
Committer:TensorFlower Gardener

deprecate computation name which is redundant. PiperOrigin-RevId: 501919730

Commit:a81e0c0
Author:Rahul Joshi
Committer:TensorFlower Gardener

[XLA/GPU] Introduce latency hiding scheduler in XLA/GPU pipeline
- Plug in a default 'NOP' model where all instructions have the same latency.
- Add a flag to enable the latency hiding scheduler; it will be disabled by default.
PiperOrigin-RevId: 501916365

Commit:203a572
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Frontend: Add on-chip memory BW reports to the op profile xprof view. This is the follow-on CL from the previous server-side changes. Previously we only reported overall and hbm memory utilization. However, each memory has its own bandwidth and utilization, so it's better to report them separately instead of in aggregate. Eventually, we should probably remove the aggregate utilization, which is the only utilization we originally had. PiperOrigin-RevId: 501913756

Commit:17f90f2
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

deprecate computation name which is redundant. PiperOrigin-RevId: 501909535

Commit:7a6509d
Author:Jie Sun
Committer:TensorFlower Gardener

deprecate computation name which is redundant. PiperOrigin-RevId: 501888732

Commit:1541e58
Author:TensorFlower Gardener

Merge pull request #59098 from Tixxx:tixxx/disable_mlir_pretty_print PiperOrigin-RevId: 501769378

Commit:b8bb4f4
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Handle cross-program prefetch for multiple buffers PiperOrigin-RevId: 501712511

Commit:bae6b22
Author:Anlun Xu
Committer:TensorFlower Gardener

[XLA:GPU] Enable AOT autotuning for GEMMs PiperOrigin-RevId: 501709191

Commit:9bdb20f
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Add `CompilationEnvironments` to `ExecutableBuildOptions` PiperOrigin-RevId: 501676104

Commit:1ddac25
Author:TJ

rebase and fix compile error

Commit:36d1a91
Author:TJ
Committer:TJ

Change disable_pretty_form to enable_pretty_form

Commit:dd499a2
Author:Mangpo Phothilimthana
Committer:TensorFlower Gardener

Remove flag_config_ from HloModuleConfig. PiperOrigin-RevId: 501472408

Commit:20487ed
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Add comment about HBM memory space. PiperOrigin-RevId: 501427230

Commit:fce0f2c
Author:Daniel Chen
Committer:TensorFlower Gardener

Add GeneratedCodeInfo annotator to tf ops python wrapper generator. PiperOrigin-RevId: 501375027

Commit:6e0da4f
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Server-side: Add on-chip memory BW reports to the op profile xprof view. This is the server-side change; the child CL will take care of the front-end changes. Previously we only reported overall and HBM memory utilization. However, each memory has its own bandwidth and utilization, so it's better to report them separately instead of in aggregate. Eventually, we should probably remove the aggregate utilization, which is the only utilization we originally had. PiperOrigin-RevId: 501330504

Commit:678188d
Author:Alexander Belyaev
Committer:TensorFlower Gardener

[XLA:CPU Next] Add tiling and fusion transformations to XLA:CPU Next pipeline. The pipeline is disabled by default. If `enable_tiling_and_fusion` flag is true, then tiling/fusion/peeling/vectorization transformations will be used instead of linalg-elementwise fusion. PiperOrigin-RevId: 501206180

Commit:e02a9a6
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Add proto serialization/deserialization support for `xla::CompilationEnvironments` PiperOrigin-RevId: 501188397

Commit:85cc017
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Adds HLO_PASSES stage to the `CompilationLogEntry` proto. PiperOrigin-RevId: 501126130

Commit:474a2ca
Author:Tres Popp
Committer:TensorFlower Gardener

Remove xla_cpu_enable_mlir_lowering. This is now replaced with --xla_cpu_use_xla_runtime. PiperOrigin-RevId: 500955199

Commit:2d0b4f2
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Add offset counter for tf ops python wrapper generator. The offset counter will calculate the byte offsets of where the REGISTER_OP is called and output the result in a text file. PiperOrigin-RevId: 500811592

Commit:2039ea2
Author:Jie Sun
Committer:TensorFlower Gardener

Add a timeline string to the memory view preprocess proto to:
1. Hold the neato graph processed/generated by the tool. It will be populated in a future CL.
2. In internal code, we use graphviz to render the neato graph and save the graphviz URL into the same field (replacing it).
This allows a timeline graph URL to be shown in the frontend of the memory viewer tools. PiperOrigin-RevId: 500776195

Commit:e34f775
Author:A. Unique TensorFlower
Committer:TensorFlower Gardener

Text format and parser for dynamic_shape_metadata_prefix_bytes PiperOrigin-RevId: 500764229

Commit:922f519
Author:Peter Buchlovsky
Committer:TensorFlower Gardener

[LatencyHidingScheduler] Add HloModule to latency-hiding schedule proto. PiperOrigin-RevId: 500687813

Commit:a99f386
Author:He Jiang
Committer:TensorFlower Gardener

Mark is_precission_loss_allowed as obsolete based on https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/delegates/gpu/delegate_options.h PiperOrigin-RevId: 500669676