This method requests the cached proto that the TPU execute op has been instructed to execute.
Response for GetTpuProgram RPC.
Whether the program is empty, which could be true for sharding/unsharding entries.
https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adadelta https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L933
Used in:
https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adagrad https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L1634
Used in:
(message has no fields)
The Adam optimizer does not implement hyper-parameter update due to hardware limitations; use the dynamic learning rate feature instead, setting the learning rate to:
    user learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
Here, t is the current timestep.
https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam https://github.com/tensorflow/tensorflow/blob/ab51450c817674c8ff08a7ae4f8ac50cdc4bed8b/tensorflow/python/training/adam.py#L32
Note that the code by default implements the lazy version of Adam (https://www.tensorflow.org/api_docs/python/tf/contrib/opt/LazyAdamOptimizer) unless the use_non_lazy_adam parameter is set, in which case it implements the normal version of Adam that updates all parameters in the embedding table, even for entries that are not used in the current minibatch (https://www.tensorflow.org/api_docs/python/tf/contrib/opt/AdamOptimizer). If use_non_lazy_adam is enabled, gradient accumulation is also required to be enabled in order to get correct results; a warning will be printed otherwise (which may change to an error in the future).
If use_sum_inside_sqrt is set, the Adam variable update formula will be changed from m / (sqrt(v) + epsilon) to m / sqrt(v + epsilon**2); this option improves the performance of TPU training and is not expected to harm model quality.
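As a rough illustration of the formulas quoted above, the following sketch computes the bias-corrected learning rate that would be fed through the dynamic learning rate feature, and contrasts the two update denominators. The function names are hypothetical and only mirror the text; t is assumed to start at 1, and none of this is the TPU implementation.

```python
import math


def adam_dynamic_learning_rate(user_learning_rate, beta1, beta2, t):
    """Return user_learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t).

    Assumption: t is the 1-based current timestep.
    """
    if t < 1:
        raise ValueError("t must be >= 1")
    return user_learning_rate * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)


def adam_update_direction(m, v, epsilon, use_sum_inside_sqrt=False):
    """Return the per-element update direction for moments m and v."""
    if use_sum_inside_sqrt:
        # Variant selected by use_sum_inside_sqrt: m / sqrt(v + epsilon**2)
        return m / math.sqrt(v + epsilon ** 2)
    # Default formula: m / (sqrt(v) + epsilon)
    return m / (math.sqrt(v) + epsilon)
```

For large t the correction factor approaches 1, so the value fed to the TPU converges to the user learning rate.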
Used in:
Algorithm in http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.
Used in:
Whether to use the updated or the old value of the accumulator when computing the effective learning rate. When update_accumulator_first is set to True, the updated value of the accumulator is used.
The max_var_update value to use. Set value to 0 (default) to disable using max_var_update to clip the gradient.
The maximum value of the accumulator. Set max_accumulator to 0 (default) to disable using max_accumulator to clip the accumulator.
https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/RMSprop https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L4358
Used in:
Used in:
-inf if not set
+inf if not set
Target type for compilation cache fetch operation.
Used in:
Describes the result of a TPU compilation.
The error message, if any, returned during compilation.
HLO proto.
Dynamic learning rate specification in the TPUEmbeddingConfiguration. The actual learning rates are provided as a scalar input list to the SendTPUEmbeddingGradients Op indexed by their tag specified through the following proto.
Used in:
For tables where learning rates are dynamically computed and communicated to the TPU embedding program, a tag must be specified for the learning rate. The tag must be a non-negative integer. The total number of unique tags must be less than or equal to the number of tables in the TPU embedding configuration (a table does not specify any tag if it uses a constant learning rate, and specifies exactly one tag if it uses dynamic learning rates). All tags in the range [0, number_of_unique_tags) must be present in the TPU embedding configuration, i.e. a tag cannot be skipped if a different tag numerically greater than it is used in the configuration. If multiple tables specify the same tag, they *MUST* have the same dynamic learning rate, for example, their dynamic learning rate could be computed by the same TensorFlow sub-graph. The partitioning of the embedding layer would be more optimal if the number_of_unique_tags is as *LOW* as possible, i.e., if many tables share the same tag. The learning_rate input of the SendTPUEmbeddingGradients op is used to communicate dynamic learning rates to the TPU embedding program. The learning_rate input is a list of scalars where the size of the list is equal to the number of unique tags. The learning rate associated with a particular tag is specified by populating its corresponding index in the list of learning_rate scalars.
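The tag rules above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical dict from table name to tag (None for a table with a constant learning rate); validate_learning_rate_tags and build_learning_rate_inputs are illustrative helpers, not TensorFlow APIs.

```python
def validate_learning_rate_tags(table_tags):
    """Check the tag rules and return the number of unique tags.

    table_tags maps a table name to its tag, or to None if the table
    uses a constant learning rate (hypothetical input format).
    """
    tags = {tag for tag in table_tags.values() if tag is not None}
    for tag in tags:
        if not isinstance(tag, int) or tag < 0:
            raise ValueError(f"tag must be a non-negative integer, got {tag!r}")
    num_unique_tags = len(tags)
    # Every tag in [0, num_unique_tags) must be present: no tag may be skipped.
    if tags != set(range(num_unique_tags)):
        raise ValueError(f"tags must cover [0, {num_unique_tags}) without gaps")
    return num_unique_tags


def build_learning_rate_inputs(rate_by_tag, num_unique_tags):
    """Return the list of scalars, where index == tag."""
    return [rate_by_tag[tag] for tag in range(num_unique_tags)]
```

Tables that share a tag share one entry in the resulting list, which is why fewer unique tags means fewer scalars to communicate.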
Estimator for the frequency of updates to a lookup table. It maintains an array (tf.Variable) D, where each element records the average number of global steps between two consecutive batches that hit the corresponding bucket. Once an item with bucket id i is sampled, D[i] is updated by:
    D[i] <- D[i] * (1 - tau) + delta[i] * tau,
where tau is a learning rate between 0 and 1 (exclusive), and delta[i] = current global step - last step at which i was sampled. The estimated frequency (sampling rate in a batch) is thus 1 / D[i]. Elements in D are initialized with a large value max_delta. delta[i] will also be capped by this value.
The exact sequence of operations used in the optimizer is shown below. last_hit_step[i] is a tf.Variable that holds the last global step at which i was sampled.
    delta = global_step - last_hit_step[i]
    clipped_delta = min(delta, params.max_delta)
    is_outlier = (delta >= params.outlier_threshold * D[i])
    D[i] <- is_outlier ? clipped_delta : D[i] * (1 - params.tau) + clipped_delta * params.tau
    last_hit_step[i] <- global_step
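The sequence of operations above translates almost line for line into ordinary Python. This sketch uses plain lists in place of the tf.Variable arrays; update_frequency_estimate is a hypothetical name and the function is illustrative only.

```python
def update_frequency_estimate(D, last_hit_step, i, global_step,
                              tau, max_delta, outlier_threshold):
    """Update D[i] and last_hit_step[i] in place; return the new D[i]."""
    delta = global_step - last_hit_step[i]
    clipped_delta = min(delta, max_delta)
    is_outlier = delta >= outlier_threshold * D[i]
    if is_outlier:
        # An outlier resets the estimate instead of being averaged in.
        D[i] = clipped_delta
    else:
        # Exponential moving average with rate tau.
        D[i] = D[i] * (1 - tau) + clipped_delta * tau
    last_hit_step[i] = global_step
    return D[i]
```

For example, with D = [1000.0], tau = 0.5, and a hit every 10 steps, the estimate moves from 1000.0 to 505.0 and then to 257.5, so the estimated frequency 1 / D[i] rises toward 1 / 10.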
Used in:
Learning rate between (0, 1) that is used to update the array D.
Maximum value of delta: difference between the current global step and the last global step at which the row was sampled.
Threshold used to determine whether the current update is an outlier.
The weight exponent used to transform the estimated delta into weights. The transformation function is: (delta / max_delta) ^ (weight_exponent)
https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Ftrl https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41159.pdf https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L2646 The hyperparameters for FTRL are the same as for the Keras implementation, with some additions. The "beta" parameter matches the behavior described in the second link above; "beta" / (2 * learning rate) should be added to "l2" to get equivalent behavior in the other TensorFlow implementations of this optimizer. When the multiply_linear_by_lr field is set to true, a modified formula is used for FTRL that treats the "linear" accumulator as being pre-multiplied by the learning rate (i.e., the accumulator named "linear" actually stores "linear * learning_rate"). Other than checkpoint compatibility, this is mathematically equivalent for a static learning rate; for a dynamic learning rate, it is nearly the same as long as the learning rate does not change quickly. The benefit of setting multiply_linear_by_lr to true is that the modified formula handles zero and near-zero learning rates without producing NaNs, improving flexibility for learning rate ramp-up. The allow_zero_accumulator parameter changes some internal formulas to allow zero and near-zero accumulator values at the cost of some performance; this only needs to be set if you are using an initial accumulator value of zero, which is uncommon.
Used in:
Used in:
Status of using gradient accumulation (doing two passes over the input gradients: one to accumulate them into a temporary array and another to apply them using the actual optimization algorithm). The extra message is to wrap the enum for scoping.
(message has no fields)
if UNSPECIFIED (default), gradient accumulation is ENABLED.
Used in:
Configuration proto for hot ID optimization. This is an experimental feature that is currently disabled (by default).
Used in:
Whether to enable or disable hot ID optimization. If UNSPECIFIED (default), hot ID optimization is DISABLED.
Used in:
Source of learning rate to use.
Used in:
Variant of algorithm in http://proceedings.mlr.press/v44/shamir15.pdf
Used in:
https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L3068
Used in:
The online Yogi optimizer does not implement hyper-parameter update; use the dynamic learning rate feature instead, setting the learning rate to: user learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t) Here, t is the current timestep. https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization.pdf plus some extensions based on FTRL. Note that the code by default implements the lazy version of online Yogi.
Used in:
The L1 regularization parameter (used analogously to the one in FTRL).
The L2 regularization parameter (used analogously to the one in FTRL).
\beta_2 from Algorithm 2 in the paper.
Used in:
Learning rate used for updating the embedding layer parameters.
Limits to which to clip the weight values after the backward pass; not present means no limits are applied.
Limits to which to clip the backward pass gradient before using it for updates; not present means no limits are applied.
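The "not present means no limits" convention for both clipping fields can be sketched as follows. The helper name is hypothetical, and None stands in for an unset bound, which behaves like -inf (lower) or +inf (upper).

```python
import math


def clip_to_limits(value, lower=None, upper=None):
    """Clip value to [lower, upper]; an unset bound applies no limit."""
    lo = -math.inf if lower is None else lower
    hi = math.inf if upper is None else upper
    return max(lo, min(hi, value))
```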
Amount of weight decay to apply; see weight_decay_optimizers.py for details. Almost all optimizers are supported with this option (MDL Adagrad Light does not work, and SGD does not behave as expected if it is enabled). Although there is no check, users who want weight decay will probably also want to enable gradient accumulation so that the decay happens once per minibatch.
If true, the weight decay factor is multiplied by the current learning rate before use; this is to match the note in DecoupledWeightDecayExtension in weight_decay_optimizers.py.
Status of using gradient accumulation (doing two passes over the input gradients: one to accumulate them into a temporary array and another to apply them using the actual optimization algorithm).
Configuration proto for hot ID replication. This is an experimental feature that is currently disabled (by default).
Optimization algorithm parameters; which field is selected determines which algorithm to use.
A mapping between the dynamic shape dimension of an input and the arg that represents the real shape.
Used in:
Input arg index with dynamic shapes.
The dynamic shape dimension index.
The arg index that dynamic dimension maps to, which represents the value of the real shape.
https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/ProximalAdagradOptimizer https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L1961
Used in:
The proximal Yogi optimizer does not implement hyper-parameter update; use the dynamic learning rate feature instead, setting the learning rate to: user learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t) Here, t is the current timestep. https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization.pdf plus some extensions based on FTRL. Note that the code by default implements the lazy version of proximal Yogi.
Used in:
The L1 regularization parameter.
The L2 regularization parameter.
The exponential decay rate for the 1st moment estimates.
The exponential decay rate for the 2nd moment estimates.
A constant trading off adaptivity and noise.
https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/RMSprop https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L4229
Used in:
Specification of an optimization algorithm's state variables (both the main value vector and any extra accumulators, etc.). This proto is only used internally by the TPU software and is not exposed directly to the TF model.
Parameter name for the state variable.
Usage type of this state variable.
A state variable that should be filled with a constant and normally hidden from users (used for intermediate gradients being accumulated, for example).
Used in:
A normal state variable that should be saved and restored in checkpoints and used as an input or output to non-debug TensorFlow ops.
Used in:
For padding embedding rows, this field specifies the initial value to be used. Separate initial values need to be specified for the embeddings and any extra accumulators. The initial values should be specified so as to maintain two invariants during model training: (1) The embedding vector multiplied by zero returns a vector containing all zeros. To maintain this invariant, the embedding values should never be NaNs or +-infinity. (2) Repeatedly applying the optimizer using a gradient vector of all zeros does not cause the embeddings or slot variables to become NaNs or +-infinity. The padding row is looked up when no embedding IDs are present for a feature. The semantics of embedding lookup dictate that the output must be zero under this scenario.
https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L629
Used in:
(message has no fields)
This is an experimental proto used in the TF/XLA bridge to store metadata to a compile op (e.g. _TPUCompileMlir). TODO(lyandy): Deprecate proto once generic metadata proto is created.
Used in:
Number of replicas of the computation and number of cores in each replica. TODO(b/140721404): it may not be necessary to state the number of cores per replica here. Reconsider when replicated model-parallelism is implemented in XLA.
A fingerprint of the function library. Ensures that any functions called by the computation have matching definitions.
Unique session identifier. Can be empty.
Fingerprint of guaranteed_const value. The fingerprint computation inside tpu_compile_op may be slow. The computation can be avoided by setting the fingerprint value here.
The location of step markers that XLA compile will instrument.
Minimum number of batches run through the XLA graph before XLA fusion autotuner is enabled. Default value of zero disables the autotuner. The XLA fusion autotuner can improve performance by executing a heuristic search on the compiler parameters.
Enables TPU compiler to add partitioning policies for inputs/outputs to the XLA computation for model parallelism.
Whether to use XLA's SPMD or MPMD partitioner when compiler partitioning is requested.
Description of the types and shapes of the arguments to a computation.
Used in:
The cross-core sharding of this input within each replica, e.g., assigning to one core, or replicating across all cores.
Whether this argument will receive the same data across all replicas.
Whether to allow XLA to produce separate programs to shard/unshard this argument. Requires this arg to be an on-device Kind::VARIABLE, or a Kind::PARAMETER. For Kind::PARAMETER, it represents the initial value of a variable, and retval_index_for_sharding must be specified for the corresponding updated value.
If XLA sharding is allowed on a Kind::PARAMETER, this field is used to specify the corresponding updated value in the return values. Use -1 for variables that are not updated.
Whether this argument is placed on fast memory or not.
Whether to let XLA decide the layout during compilation, as opposed to using a fixed layout determined by the shape.
Name of the node that the arg comes from.
Used in:
Sharding is allowed if host training loop exists.
Used in:
These are args which have been guaranteed to be constants during the session lifetime by the use of the GuaranteeConstOp (or ConstantOp).
Description of the return values from a computation.
Used in:
The cross-core sharding of this return value within each replica, e.g., assigning to one core, or replicating across all cores.
Number of samples in each batch of embedding layer activations sent to the TensorCore.
Number of TPU hosts used for inference/training.
Number of TensorCores used for inference/training.
This parameter determines if the execution of the sparse core will be pipelined with that of the TensorCore. This parameter only affects results when mode=TRAINING. If mode=INFERENCE or BACKWARD_PASS_ONLY, this parameter does not affect execution and hence is a don't-care value.
false: The execution of the sparse core is not pipelined with that of the TensorCore. The forward pass of every step on the sparse core is executed only after the backward pass of the previous step is complete. And the backward pass on the sparse core is executed only after the embedding gradients have been computed on the TensorCore on every step. This ensures that the activations on every step observe the gradient updates from the previous step on both the sparse core and the TensorCore.
true: The execution of the sparse core is pipelined with that of the TensorCore. The forward pass of every step on the sparse core can be executed after the forward pass of the previous step is complete without waiting for the backward pass. This improves the utilization of the sparse core, allowing it to process step N+1 while the embedding gradients for step N are computed on the TensorCore. The backward pass of every step on the sparse core is executed directly after the forward pass for the next step is complete. The drawback is that embedding activations for step N+1 do not observe the embedding gradient updates from step N. This could affect model quality if step N and N+1 involve the same set of embedding IDs. However, since the embedding updates are sparse, this is generally not considered a problem.
Extended output layout information; deprecated and now ignored.
Mode: whether the embedding layer program should be run for inference (forward pass only), training (both forward and backward passes), or the backward pass only.
Used in:
Sharding strategy of the embedding tables among the hosts. If the sharding_strategy is "mod", each id is assigned to host "id % num_hosts". For instance, 13 ids are split across 5 hosts as: [[0, 5, 10], [1, 6, 11], [2, 7, 12], [3, 8], [4, 9]]. If the sharding_strategy is "div", ids are assigned to hosts in a contiguous manner. In this case, 13 ids are split across 5 hosts as: [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10], [11, 12]]. In both strategies, if the number of ids is not evenly divisible by the number of hosts, each of the first "table_descriptor.vocabulary_size % num_hosts" hosts will be assigned one more id. This partitioning strategy exactly follows that in the embedding_lookup TensorFlow function at tensorflow/python/ops/embedding_ops.py.
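The two strategies can be reproduced with a few lines of Python. This is a minimal sketch of the assignment rule as described, not the code in embedding_ops.py; shard_ids is a hypothetical helper name.

```python
def shard_ids(vocabulary_size, num_hosts, strategy):
    """Return a list with one list of ids per host."""
    if strategy == "mod":
        # Host h receives every id with id % num_hosts == h.
        return [list(range(host, vocabulary_size, num_hosts))
                for host in range(num_hosts)]
    if strategy == "div":
        # Contiguous ranges; the first (vocabulary_size % num_hosts)
        # hosts each receive one extra id.
        base, extra = divmod(vocabulary_size, num_hosts)
        shards = []
        start = 0
        for host in range(num_hosts):
            size = base + (1 if host < extra else 0)
            shards.append(list(range(start, start + size)))
            start += size
        return shards
    raise ValueError(f"unknown sharding strategy: {strategy!r}")
```

Calling shard_ids(13, 5, "mod") and shard_ids(13, 5, "div") yields the two splits quoted in the description.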
Used in:
Description of the various embedding tables.
Used in:
Name of the table.
Size of the vocabulary (i.e., number of rows) in the table.
The embedding dimension (i.e., the width of the embedding table).
Number of features mapped to this table.
Details of the learning algorithm used to update the embedding parameters.
Used in:
Output locations for each feature of each table.
Shape and layout information for each tensor.
Format information for a single output tensor.
Used in:
Description of the output placement for one feature.
Used in:
Typically, only one copy of each feature is used, but multiple are allowed and the same data will be copied to all of them (with the gradients summed in the backward pass).
Location of one copy of the feature's data.
Used in:
Which output tensor this copy of the feature will go into. Must be between 0 and layout.output_size().
Offset in dimension 0 for this feature copy. Must be between 0 and layout.output(tensor_index).dim0_size_per_sample().
Offset in dimension 1 for this feature copy. Must be between 0 and layout.output(tensor_index).dim1_size() - table width; repeated or partially/fully overlapping values are allowed and results in the same range will be summed (with the gradients replicated in the backward pass).
Description of the output placement for features of one table.
Used in:
Output locations for each feature loaded from this table.
Size and layout information for 2-D tensors.
Used in:
Multiplier for output dimension 0 size; used to match legacy format that stacks features within a sample in dimension 0.
The size (in dimension 1) of this output tensor.
Describes the geometry of a TPU mesh.
The dimensions of the TPU topology, in cores. Typically, this is a 3D topology [x, y, core], where the major dimensions correspond to TPU chips, and the minor dimension describes the number of cores on a multicore chip.
Number of TensorFlow tasks in the cluster.
Number of TPU devices per task.
A flattened rank 3 int32 array with shape [num_tasks, num_tpu_devices_per_task, len(mesh_shape)]. `tasks` is the number of tasks in the TPU cluster, `devices` is the number of TPU devices per task, and the minor dimension corresponds to a position in the TPU mesh topology. Each entry [task, device, axis] gives the `axis`-th coordinate in the topology of a task/device pair.
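Reading a coordinate back out of the flattened array is an index computation. The sketch below assumes row-major (C-order) flattening of the [num_tasks, num_tpu_devices_per_task, len(mesh_shape)] array, which the description does not state explicitly; device_coordinate is a hypothetical helper.

```python
def device_coordinate(flat_coordinates, num_tpu_devices_per_task, mesh_rank,
                      task, device, axis):
    """Return entry [task, device, axis] of the flattened coordinate array.

    Assumption: the array is flattened in row-major order.
    """
    index = (task * num_tpu_devices_per_task + device) * mesh_rank + axis
    return flat_coordinates[index]
```

For a cluster with 2 tasks, 2 devices per task, and a rank-3 mesh, the flat array holds 12 entries, and the device at (task=1, device=0) starts at flat index 6.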
TPU compilation request for compiling computations into XLA HLO IR and building TPU programs.
A flag reserved for using an experimental version of the compilation. By default the value should be false.
Use MLIR to lower computation(s) to HLO.
If true, returns HLO metadata.
If true, unloads cache on session close.
Compilation metadata.
Computation argument shapes.
Input tensor that gives const guarantee to the TF runtime.
MLIR module definition.
A set of named functions used as the input to lowering to HLO when MLIR is not used.
The version of the graph definition used to lower the TF function to HLO.
Function containing the computation to compile.
Used in:
A user-defined optimizer. The contained HLO program must take the following arguments in the following order:
1. gradients
2. table weights
3. slot variables
4. an optional scalar input that is passed in via the dynamic learning rate mechanism.
It must return/end in a tuple op that contains the following values in the following order:
1. new table values
2. new slot variable value
The program must have shape (1,1) with dtype float32 throughout and only use HLO ops that operate elementwise (e.g., no reduce, no variables, no control flow and no broadcasting outside of the single scalar input). The HLO program should be written as if it were a dense update. It will be called on each row that needs an update and will be applied elementwise.
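The argument and return ordering can be illustrated with a plain-Python stand-in for such a program. This is not HLO and not a TensorFlow API; it only models the contract on a single element, using an Adagrad-style rule chosen here as an example, with one slot variable and the optional scalar interpreted as a learning rate.

```python
import math


def adagrad_elementwise(gradient, weight, accumulator, learning_rate):
    """Elementwise update with the argument order from the description:
    gradients, table weights, slot variables, optional scalar input.

    Returns (new table value, new slot variable value).
    Assumption: the new accumulator is strictly positive.
    """
    new_accumulator = accumulator + gradient * gradient
    new_weight = weight - learning_rate * gradient / math.sqrt(new_accumulator)
    return new_weight, new_accumulator
```

The function touches only one element at a time and uses no reductions or control flow, which matches the restriction to elementwise operations.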
Used in:
Padding values for the parameter and the slots, see StateVariableSpecification.padding_initial_value below for more details on how this should be set. One value is needed for the weights and one for each slot.