package tensorflow.tpu


rpc GetTpuProgram (GetTpuProgramRequest, GetTpuProgramResponseExternal)
tpu_compilation_cache.proto:40
This method requests the cached proto that the TPU execute op has been instructed to execute.
message GetTpuProgramRequest
tpu_compilation_cache_common.proto:32
- oneof key_oneof
  - string key = 1
  - TpuCompilationUidAndIndex uid_and_index = 2
- CompilationCacheFetchTarget fetch_target = 3
message GetTpuProgramResponseExternal
tpu_compilation_cache.proto:23
Response for GetTpuProgram RPC.
- optional GetTpuProgramResponseExternal.Blob proto = 1
- optional tf2xla.HostComputeMetadata host_compute_metadata = 2
- bool may_modify_variables = 3
- optional GetTpuProgramResponseExternal.Blob compiler_metadata = 4
- bool is_empty = 5
  Whether the program is empty, which could be true for sharding/unsharding entries.

https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adadelta https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L933

Used in: OptimizationParameters

float rho = 1
float epsilon = 2

This optimizer combines the Adagrad and Momentum update rules.
  accum(new) = beta2 == 1.0 ? accum(old) + grad^2 : beta2 * accum(old) + (1 - beta2) * grad^2
  accum_with_exponent = (accum(new) + epsilon)^(-1.0 / exponent)
  mom_accum(new) = momentum * mom_accum(old) + accum_with_exponent
  update = use_nesterov ? momentum * mom_accum(new) + accum_with_exponent : mom_accum(new)
  var(new) = var(old) - lr * grad * update
Algorithm described in https://arxiv.org/abs/2002.11803.

Used in: OptimizationParameters

float momentum = 1
Moving average parameter for the momentum accumulator.
bool use_nesterov = 2
Whether to use the Nesterov variant of momentum.
float exponent = 3
Exponent for the gradient^2 accumulator.
float beta2 = 4
Moving average parameter for the gradient^2 accumulator.
float epsilon = 5
Offset added to the Adagrad accumulator.
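
The update rule in the message description above can be sketched for a single scalar parameter as follows. This is an illustrative transcription of those formulas, not the TPU implementation; the function name and argument order are hypothetical.

```python
def adagrad_momentum_step(var, accum, mom_accum, grad, lr,
                          momentum, use_nesterov, exponent, beta2, epsilon):
    """Apply one Adagrad-Momentum update to a scalar, per the documented formulas."""
    # Gradient^2 accumulator: plain sum when beta2 == 1.0, moving average otherwise.
    if beta2 == 1.0:
        accum = accum + grad ** 2
    else:
        accum = beta2 * accum + (1.0 - beta2) * grad ** 2
    accum_with_exponent = (accum + epsilon) ** (-1.0 / exponent)
    # Momentum accumulator.
    mom_accum = momentum * mom_accum + accum_with_exponent
    if use_nesterov:
        update = momentum * mom_accum + accum_with_exponent
    else:
        update = mom_accum
    var = var - lr * grad * update
    return var, accum, mom_accum


new_var, new_accum, new_mom = adagrad_momentum_step(
    var=1.0, accum=0.0, mom_accum=0.0, grad=2.0, lr=0.1,
    momentum=0.0, use_nesterov=False, exponent=2.0, beta2=1.0, epsilon=0.0)
```

With momentum = 0, exponent = 2, and beta2 = 1.0, the step reduces to the familiar Adagrad form var - lr * grad / sqrt(accum).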

https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adagrad https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L1634

Used in: OptimizationParameters

(message has no fields)

The Adam optimizer does not implement hyper-parameter update due to hardware limitations; use the dynamic learning rate feature instead, setting the learning rate to:
  user learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
Here, t is the current timestep.
https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam
https://github.com/tensorflow/tensorflow/blob/ab51450c817674c8ff08a7ae4f8ac50cdc4bed8b/tensorflow/python/training/adam.py#L32
Note that the code by default implements the lazy version of Adam (https://www.tensorflow.org/api_docs/python/tf/contrib/opt/LazyAdamOptimizer) unless the use_non_lazy_adam parameter is set, in which case it implements the normal version of Adam that updates all parameters in the embedding table, even for entries that are not used in the current minibatch (https://www.tensorflow.org/api_docs/python/tf/contrib/opt/AdamOptimizer).
If use_non_lazy_adam is enabled, gradient accumulation is also required to be enabled in order to get correct results; a warning will be printed otherwise (which may change to an error in the future).
If use_sum_inside_sqrt is set, the Adam variable update formula will be changed from m / (sqrt(v) + epsilon) to m / sqrt(v + epsilon**2); this option improves the performance of TPU training and is not expected to harm model quality.

Used in: OptimizationParameters

float beta1 = 3
float beta2 = 4
float epsilon = 5
bool use_non_lazy_adam = 8
bool use_sum_inside_sqrt = 10
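
As a rough illustration of the two formulas in the description above, the sketch below computes the bias-corrected learning rate that would be fed through the dynamic learning rate feature, and the two variants of the update term. The function names are hypothetical, and t is assumed to start at 1 (at t = 0 the denominator 1 - beta1^t is zero).

```python
import math


def adam_dynamic_learning_rate(user_learning_rate, beta1, beta2, t):
    """Learning rate to supply dynamically: lr * sqrt(1 - beta2^t) / (1 - beta1^t)."""
    return user_learning_rate * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)


def adam_update_term(m, v, epsilon, use_sum_inside_sqrt):
    """Update term with epsilon either outside or inside the square root."""
    if use_sum_inside_sqrt:
        return m / math.sqrt(v + epsilon ** 2)
    return m / (math.sqrt(v) + epsilon)


lr_step1 = adam_dynamic_learning_rate(1.0, beta1=0.5, beta2=0.75, t=1)
outside = adam_update_term(3.0, 16.0, epsilon=0.0, use_sum_inside_sqrt=False)
inside = adam_update_term(3.0, 16.0, epsilon=3.0, use_sum_inside_sqrt=True)
```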

Optimizer that just sets the variable to the value of the gradient. To be correct, this requires either gradient accumulation (to sum the values of a computed expression across the samples) or to deduplicate IDs within a single host (to assign the value from an arbitrary sample).

Used in: OptimizationParameters

(message has no fields)

Algorithm in http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.

Used in: OptimizationParameters

bool update_accumulator_first = 1
Whether to use the updated or the old value of the accumulator when computing the effective learning rate. When update_accumulator_first is set to True, the updated value of the accumulator is used.
float max_var_update = 2
The max_var_update value to use. Set value to 0 (default) to disable using max_var_update to clip the gradient.
float max_accumulator = 3
The maximum value of the accumulator. Set max_accumulator to 0 (default) to disable using max_accumulator to clip the accumulator.

https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/RMSprop https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L4358

Used in: OptimizationParameters

float rho = 1
float momentum = 2
float epsilon = 3

Used in: OptimizationParameters, SimulatedQuantization

optional google.protobuf.FloatValue lower = 1
-inf if not set
optional google.protobuf.FloatValue upper = 2
+inf if not set
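
A minimal sketch of how unset limits could be interpreted, assuming an unset bound is represented as None (standing in for an absent FloatValue). The helper name is hypothetical.

```python
import math


def clip_with_limits(value, lower=None, upper=None):
    """Clip value to [lower, upper]; an unset bound defaults to -inf or +inf."""
    lo = -math.inf if lower is None else lower
    hi = math.inf if upper is None else upper
    return min(max(value, lo), hi)


unclipped = clip_with_limits(5.0)
clipped_high = clip_with_limits(5.0, upper=1.0)
clipped_low = clip_with_limits(-5.0, lower=-1.0, upper=1.0)
```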

Target type for compilation cache fetch operation.

Used in: GetTpuProgramRequest

INVALID = 0
MAIN = 1
SHARDING = 2
UNSHARDING = 3

Describes the result of a TPU compilation. This is also used as TPU compilation result status payload. URI: "type.googleapis.com/tensorflow.tpu.CompilationResultProto"

error.Code status_code = 1
The error message, if any, returned during compilation.
string status_error_message = 2
repeated xla.HloProto hlo_protos = 3
HLO proto.
CompilationResultProto.ErrorCode error_code = 4

Used in: CompilationResultProto

UNKNOWN = 0
OUT_OF_MEMORY = 1

Dynamic learning rate specification in the TPUEmbeddingConfiguration. The actual learning rates are provided as a scalar input list to the SendTPUEmbeddingGradients Op indexed by their tag specified through the following proto.

Used in: LearningRate

int32 tag = 1
For tables where learning rates are dynamically computed and communicated to the TPU embedding program, a tag must be specified for the learning rate. The tag must be a non-negative integer. The total number of unique tags must be less than or equal to the number of tables in the TPU embedding configuration (a table does not specify any tag if it uses a constant learning rate, and specifies exactly one tag if it uses dynamic learning rates). All tags in the range [0, number_of_unique_tags) must be present in the TPU embedding configuration, i.e. a tag cannot be skipped if a different tag numerically greater than it is used in the configuration. If multiple tables specify the same tag, they *MUST* have the same dynamic learning rate, for example, their dynamic learning rate could be computed by the same TensorFlow sub-graph. The partitioning of the embedding layer would be more optimal if the number_of_unique_tags is as *LOW* as possible, i.e., if many tables share the same tag. The learning_rate input of the SendTPUEmbeddingGradients op is used to communicate dynamic learning rates to the TPU embedding program. The learning_rate input is a list of scalars where the size of the list is equal to the number of unique tags. The learning rate associated with a particular tag is specified by populating its corresponding index in the list of learning_rate scalars.
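
The tag constraints above can be checked with a short sketch like the following. It assumes each table is represented by its tag, or by None when the table uses a constant learning rate; the function name is hypothetical and this is not part of the TPU embedding API.

```python
def validate_dynamic_learning_rate_tags(table_tags):
    """Return the number of unique tags, or raise ValueError if a rule is violated.

    table_tags holds one entry per table: an int tag for a dynamic learning
    rate, or None for a constant learning rate.
    """
    tags = [tag for tag in table_tags if tag is not None]
    if any(tag < 0 for tag in tags):
        raise ValueError("tags must be non-negative integers")
    unique_tags = set(tags)
    # All tags in [0, number_of_unique_tags) must be present: no gaps allowed.
    missing = sorted(set(range(len(unique_tags))) - unique_tags)
    if missing:
        raise ValueError("skipped tags: %s" % missing)
    return len(unique_tags)


# Four tables: two share tag 0, one uses tag 1, one has a constant learning rate.
num_tags = validate_dynamic_learning_rate_tags([0, None, 1, 0])
```

The returned count is also the expected length of the learning_rate scalar list passed to the SendTPUEmbeddingGradients op.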

Estimator for the frequency of updates to a lookup table. It maintains an array (tf.Variable) D, where each element records the average number of global steps between two consecutive batches that hit the corresponding bucket. Once an item with bucket id i is sampled, D[i] is updated by:
  D[i] <- D[i] * (1 - tau) + delta[i] * tau,
where tau is a learning rate between 0 and 1 (exclusive), and
  delta[i] = current global step - last step i is sampled.
The estimated frequency (sampling rate in a batch) is thus 1 / D[i]. Elements in D are initialized with a large value max_delta. delta[i] will also be capped by this value.
The exact sequence of operations used in the optimizer is shown below. last_hit_step[i] is a tf.Variable that holds the last global step at which i was sampled.
  delta = global_step - last_hit_step[i]
  clipped_delta = min(delta, params.max_delta)
  is_outlier = (delta >= params.outlier_threshold * D[i])
  D[i] <- is_outlier ? clipped_delta : D[i] * (1 - params.tau) + clipped_delta * params.tau
  last_hit_step[i] <- global_step

Used in: OptimizationParameters

float tau = 1
Learning rate between (0, 1) that is used to update the array D.
float max_delta = 2
Maximum value of delta: difference between the current global step and the last global step at which the row was sampled.
float outlier_threshold = 3
Threshold used to determine whether the current update is an outlier.
float weight_exponent = 4
The weight exponent used to transform the estimated delta into weights. The transformation function is: (delta / max_delta) ^ (weight_exponent)
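
The sequence of operations in the message description, together with the weight transformation for weight_exponent, can be sketched for a single bucket as below. This is a plain-Python transcription under the assumption that D[i] and last_hit_step[i] are ordinary floats and ints rather than tf.Variables; the function names are hypothetical.

```python
def frequency_estimator_step(d_i, last_hit_step_i, global_step,
                             tau, max_delta, outlier_threshold):
    """Update D[i] and last_hit_step[i] when bucket i is sampled at global_step."""
    delta = global_step - last_hit_step_i
    clipped_delta = min(delta, max_delta)
    # The outlier test uses the unclipped delta, as in the documented sequence.
    is_outlier = delta >= outlier_threshold * d_i
    if is_outlier:
        d_i = clipped_delta
    else:
        d_i = d_i * (1.0 - tau) + clipped_delta * tau
    return d_i, global_step


def delta_to_weight(delta, max_delta, weight_exponent):
    """Transform an estimated delta into a weight: (delta / max_delta) ^ weight_exponent."""
    return (delta / max_delta) ** weight_exponent


# Regular update: delta = 10 is well below outlier_threshold * D[i] = 100.
d_regular, last_regular = frequency_estimator_step(
    20.0, 90, 100, tau=0.5, max_delta=1000.0, outlier_threshold=5.0)
# Outlier update: delta = 100 exceeds 5 * 1.0, so D[i] is reset to the clipped delta.
d_outlier, _ = frequency_estimator_step(
    1.0, 0, 100, tau=0.5, max_delta=50.0, outlier_threshold=5.0)
weight = delta_to_weight(25.0, 100.0, 0.5)
```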

https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Ftrl https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41159.pdf https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L2646 The hyperparameters for FTRL are the same as for the Keras implementation, with some additions. The "beta" parameter matches the behavior described in the second link above; "beta" / (2 * learning rate) should be added to "l2" to get equivalent behavior in the other TensorFlow implementations of this optimizer. When the multiply_linear_by_lr field is set to true, a modified formula is used for FTRL that treats the "linear" accumulator as being pre-multiplied by the learning rate (i.e., the accumulator named "linear" actually stores "linear * learning_rate"). Other than checkpoint compatibility, this is mathematically equivalent for a static learning rate; for a dynamic learning rate, it is nearly the same as long as the learning rate does not change quickly. The benefit of setting multiply_linear_by_lr to true is that the modified formula handles zero and near-zero learning rates without producing NaNs, improving flexibility for learning rate ramp-up.

Used in: OptimizationParameters

float l1 = 1
float l2 = 2
float lr_power = 3
float beta = 7
bool multiply_linear_by_lr = 6
bool allow_zero_accumulator = 8
Previously, allow_zero_accumulator parameter changed some internal formulas to allow zero and near-zero accumulator values at the cost of some performance. The current implementation ignores this parameter; zero or near-zero accumulator values are now always supported.
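
The relationship between "beta" and "l2" stated in the message description can be written as a one-line helper. The function name is hypothetical; it only restates the arithmetic, and assumes a non-zero learning rate.

```python
def equivalent_l2_for_other_implementations(l2, beta, learning_rate):
    """l2 value to use elsewhere to mimic this proto's beta: l2 + beta / (2 * lr)."""
    return l2 + beta / (2.0 * learning_rate)


adjusted_l2 = equivalent_l2_for_other_implementations(l2=0.1, beta=0.2, learning_rate=0.5)
```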

Used in: GetTpuProgramResponseExternal

bytes data = 1

Status of using gradient accumulation (doing two passes over the input gradients: one to accumulate them into a temporary array and another to apply them using the actual optimization algorithm). The extra message is to wrap the enum for scoping.

(message has no fields)

If UNSPECIFIED (default), gradient accumulation is ENABLED.

Used in: OptimizationParameters

UNSPECIFIED = 0
ENABLED = 1
DISABLED = 2

Configuration proto for hot ID optimization. This is an experimental feature that is currently disabled by default.

Used in: OptimizationParameters

HotIdReplicationConfiguration.Status status = 1

Whether to enable or disable hot ID optimization. If UNSPECIFIED (default), hot ID optimization is DISABLED.

Used in: HotIdReplicationConfiguration

UNSPECIFIED = 0
ENABLED = 1
DISABLED = 2

Source of learning rate to use.

Used in: OptimizationParameters

oneof learning_rate
- float constant = 1
- DynamicLearningRate dynamic = 2

There is one important limitation for this HBM packing though. When only a subset of rows in an 8-float chunk are accessed on a particular step, the adjoining rows in the same chunk are updated with zero gradients on the backward pass even if they are not touched. This is an artifact of the packing implementation. This operation is NOT functionally correct for optimizers where zero gradients change the embeddings/slot-variable values, e.g., momentum-based optimizers. Hence, this HBM packing cannot be enabled for embedding tables with such optimizers. The TPU software automatically recognizes that a zero gradient can modify state and turns off the low dimensional embedding packing in that scenario.
However, for optimizers where a zero gradient is a NoOp, such as SGD, Adagrad, and FTRL, this packing optimization can be used. However, there are some important considerations:
* Clipping limits: The initial values for such embeddings should fall within the clipping limits specified in the optimization parameters. Otherwise, a zero gradient will cause the embeddings to be clipped. This changes state and hence, is not a NoOp.
* FTRL: The embedding vector is computed directly from the values of the accumulator and linear slot variables. Hence, the initial embedding values should match that computed from the initial values of the accumulator and linear slot variables. Note that in nearly all cases, the linear value is initialized to zero; this corresponds to an embedding value of zero.
Performance: The TPU has to perform additional work when low dimensional packing is enabled. In certain situations when the vocabulary size is small, it may not make sense to turn on this packing since the total memory usage due to padding is extremely low. Hence, the TPU software automatically turns off the packing optimization in such scenarios.

(message has no fields)
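
To illustrate why a zero gradient is a NoOp for some optimizers but not others, the sketch below applies a zero gradient under plain SGD and under momentum. It uses textbook scalar update rules as an assumption, not the TPU kernels, and the function names are hypothetical.

```python
def sgd_step(var, grad, lr):
    """Plain SGD: a zero gradient leaves the variable unchanged."""
    return var - lr * grad


def momentum_step(var, accum, grad, lr, momentum):
    """Momentum: the accumulator keeps moving the variable even when grad is zero."""
    accum = momentum * accum + grad
    return var - lr * accum, accum


sgd_var = sgd_step(1.0, grad=0.0, lr=0.1)
mom_var, mom_accum = momentum_step(1.0, accum=0.5, grad=0.0, lr=0.1, momentum=0.9)
```

Here sgd_var stays at 1.0, while mom_var and mom_accum both change, which is the state modification that makes packing unsafe for momentum-based optimizers.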

If UNSPECIFIED (default), the low dimension packing status is DISABLED. This can change in future.
If ENABLED, the low dimension packing is enabled only if the following three additional conditions are true:
* The optimizer treats the zero gradient as a NoOp.
* The embedding dimension is 1, 2, or 4.
* The vocabulary size is large enough to avoid performance issues.
If DISABLED, the low dimension packing is always disabled.

Used in: OptimizationParameters

UNSPECIFIED = 0
ENABLED = 1
DISABLED = 2

Variant of algorithm in http://proceedings.mlr.press/v44/shamir15.pdf

Used in: OptimizationParameters

float l2 = 1
float lr_power = 2
float min_servable_mdl_benefit = 3
float mdl_mix_in_margin = 4
float mdl_benefit_rampup_coeff = 5
float mdl_min_weight = 6
float benefit_revisit_scale = 7
float max_event_benefit = 8
float max_total_benefit = 9
float mdl_hard_limit = 10
bool hard_limit_min_benefit = 11
bool mdl_regularize = 12

https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L3068

Used in: OptimizationParameters

float momentum = 1
bool use_nesterov = 2

The online Yogi optimizer does not implement hyper-parameter update; use the dynamic learning rate feature instead, setting the learning rate to:
  user learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
Here, t is the current timestep.
https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization.pdf plus some extensions based on FTRL.
Note that the code by default implements the lazy version of online Yogi.

Used in: OptimizationParameters

float l1 = 1
The L1 regularization parameter (used analogously to the one in FTRL).
float l2 = 2
The L2 regularization parameter (used analogously to the one in FTRL).
float beta2 = 3
\beta_2 from Algorithm 2 in the paper.

Used in: TPUEmbeddingConfiguration.TableDescriptor

optional LearningRate learning_rate = 13
Learning rate used for updating the embedding layer parameters.
optional ClippingLimits clipping_limits = 2
Limits to which to clip the weight values after the backward pass; not present means no limits are applied.
optional ClippingLimits gradient_clipping_limits = 7
Limits to which to clip the backward pass gradient before using it for updates; not present means no limits are applied.
float weight_decay_factor = 16
Amount of weight decay to apply; see weight_decay_optimizers.py for details. All optimizers except MDL Adagrad Light are supported with this option. Although there is no check, users who want weight decay will also want to ensure that gradient accumulation is enabled so that the decay will happen once per global batch.
bool multiply_weight_decay_factor_by_learning_rate = 22
If true, the weight decay factor is multiplied by the current learning rate before use; this is to match the note in DecoupledWeightDecayExtension in weight_decay_optimizers.py.
optional SimulatedQuantization simulated_quantization = 27
Configuration for simulated quantization which is used to reduce training/serving skew when the serving variables are quantized. The same quantization operations are executed during training to minimize differences with serving.
GradientAccumulationStatus.Status gradient_accumulation_status = 17
Status of using gradient accumulation (doing two passes over the input gradients: one to accumulate them into a temporary array and another to apply them using the actual optimization algorithm).
LowDimensionalPackingStatus.Status low_dimensional_packing_status = 28
Status of the low-dimensional embedding packing optimization. This controls whether to optimize the packing of 1-dimensional, 2-dimensional, and 4-dimensional embedding tables in memory.
optional HotIdReplicationConfiguration hot_id_replication_configuration = 18
Configuration proto for hot ID replication. This is an experimental feature that is currently disabled by default.
oneof parameters
Optimization algorithm parameters; which field is selected determines which algorithm to use.
- AdagradParameters adagrad = 3
- AdagradMomentumParameters adagrad_momentum = 26
- BoundedAdagradParameters bounded_adagrad = 19
- StochasticGradientDescentParameters stochastic_gradient_descent = 4
- FtrlParameters ftrl = 5
- AdamParameters adam = 6
- MomentumParameters momentum = 8
- RmsPropParameters rms_prop = 9
- CenteredRmsPropParameters centered_rms_prop = 10
- MdlAdagradLightParameters mdl_adagrad_light = 11
- AdadeltaParameters adadelta = 12
- ProximalAdagradParameters proximal_adagrad = 14
- OnlineYogiParameters online_yogi = 20
- ProximalYogiParameters proximal_yogi = 21
- FrequencyEstimatorParameters frequency_estimator = 23
- UserDefinedProgramParameters user_defined_program = 24
- AssignParameters assign = 25

A mapping between the dynamic shape dimension of an input and the arg that represents the real shape.

Used in: TPUCompileMetadataProto

int32 arg_index = 1
Input arg index with dynamic shapes.
int32 shape_index = 2
The dynamic shape dimension index.
int32 padding_arg_index = 3
The arg index that dynamic dimension maps to, which represents the value of the real shape.

https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/ProximalAdagradOptimizer https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L1961

Used in: OptimizationParameters

float l1 = 1
float l2 = 2

Used in: OptimizationParameters

float l1 = 1
The L1 regularization parameter.
float l2 = 2
The L2 regularization parameter.
float beta1 = 3
The exponential decay rate for the 1st moment estimates.
float beta2 = 4
The exponential decay rate for the 2nd moment estimates.
float epsilon = 5
A constant trading off adaptivity and noise.

Used in: OptimizationParameters

float rho = 1
float momentum = 2
float epsilon = 3

Configuration for simulated quantization; simulated quantization is used to reduce training/serving skew when the serving variables are quantized. The same quantization operations are executed during training to minimize differences with serving.
Simulated quantization inserts the following operations on the forward pass after gathering the embedding vector from HBM. The backward pass operations are unchanged.
  clipped_val = clip(input, clipping_limits)
  quantum = clipping_limits.range() / (num_buckets - 1)
  quantized_val = floor((clipped_val - clipping_limits.lower()) / quantum + .5)
  return quantized_val * quantum + clipping_limits.lower()

Used in: OptimizationParameters

bool enabled = 1
Whether simulated quantization is enabled.
optional ClippingLimits clipping_limits = 2
Minimum and maximum values of the range used for quantization.
int32 num_buckets = 3
Number of possible quantized values.
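
The forward-pass operations in the message description can be transcribed for a single float as follows. This is a sketch that assumes both clipping limits are set and num_buckets is at least 2; the function name is hypothetical.

```python
import math


def simulated_quantize(value, lower, upper, num_buckets):
    """Clip value to [lower, upper], then round to the nearest of num_buckets levels."""
    clipped_val = min(max(value, lower), upper)
    quantum = (upper - lower) / (num_buckets - 1)
    quantized_val = math.floor((clipped_val - lower) / quantum + 0.5)
    return quantized_val * quantum + lower


# Five buckets over [0, 1] give the levels 0, 0.25, 0.5, 0.75, 1.
inside_range = simulated_quantize(0.3, 0.0, 1.0, 5)
above_range = simulated_quantize(2.0, 0.0, 1.0, 5)
below_range = simulated_quantize(-1.0, 0.0, 1.0, 5)
```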

Specification of an optimization algorithm's state variables (both the main value vector and any extra accumulators, etc.). This proto is only used internally by the TPU software and is not exposed directly to the TF model.

string name = 1
Parameter name for the state variable.
oneof usage
Usage type of this state variable.
- StateVariableSpecification.UserDefined user_defined = 2
- StateVariableSpecification.FillWithConstant fill_with_constant = 3

A state variable that should be filled with a constant and normally hidden from users (used for intermediate gradients being accumulated, for example).

Used in: StateVariableSpecification

double initial_value = 1

A normal state variable that should be saved and restored in checkpoints and used as an input or output to non-debug TensorFlow ops.

Used in: StateVariableSpecification

(message has no fields)

https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD https://github.com/tensorflow/tensorflow/blob/6b6471f3ffb7f1fefe42d814aa5fb9ab7a535b58/tensorflow/core/kernels/training_ops.cc#L629

Used in: OptimizationParameters

(message has no fields)

This is an experimental proto used in the TF/XLA bridge to store metadata to a compile op (e.g. _TPUCompileMlir). TODO(lyandy): Deprecate proto once generic metadata proto is created.

Used in: TpuCompilationRequestProto

repeated TPUCompileMetadataProto.Arg args = 1
repeated TPUCompileMetadataProto.Retval retvals = 2
int32 num_replicas = 3
Number of replicas of the computation and number of cores in each replica. TODO(b/140721404): it may not be necessary to state the number of cores per replica here. Reconsider when replicated model-parallelism is implemented in XLA.
int32 num_cores_per_replica = 4
optional xla.DeviceAssignmentProto device_assignment = 8
uint64 function_library_fingerprint = 6
A fingerprint of the function library. Ensures that any functions called by the computation have matching definitions.
string session_handle = 9
Unique session identifier. Can be empty.
string guaranteed_const_fingerprint = 10
Fingerprint of guaranteed_const value. The fingerprint computation inside tpu_compile_op may be slow. The computation can be avoided by setting the fingerprint value here.
repeated PaddingMap padding_maps = 11
xla.DebugOptions.StepMarkerLocation step_marker_location = 12
The location of step markers that XLA compile will instrument.
int64 xla_fusion_autotuner_thresh = 13
Minimum number of batches run through the XLA graph before XLA fusion autotuner is enabled. Default value of zero disables the autotuner. The XLA fusion autotuner can improve performance by executing a heuristic search on the compiler parameters.
bool enable_automatic_model_parallelism = 14
Enables TPU compiler to add partitioning policies for inputs/outputs to the XLA computation for model parallelism.
bool use_spmd_for_xla_partitioning = 15
Whether to use XLA's SPMD or MPMD partitioner when compiler partitioning is requested.
bool use_auto_spmd_for_xla_partitioning = 18
Whether to automatically generate XLA shardings for SPMD partitioner.
repeated int64 auto_spmd_mesh_shape = 19
Device mesh shape used to create the sharding search space when use_auto_spmd_for_xla_partitioning=true.
repeated int64 auto_spmd_mesh_ids = 20
Device mesh ids compatible with the above auto_spmd_mesh_shape, used when use_auto_spmd_for_xla_partitioning=true.
uint64 mlir_fingerprint = 17
A fingerprint generated by hashing the MLIR module content.
optional TPUCompileOptions compile_options = 21

Description of the types and shapes of the arguments to a computation.

Used in: TPUCompileMetadataProto

DataType dtype = 1
optional TensorShapeProto shape = 2
Arg.Kind kind = 3
optional xla.OpSharding sharding = 4
The cross-core sharding of this input within each replica, e.g., assigning to one core, or replicate across all cores.
bool is_same_data_across_replicas = 5
Whether this argument will receive the same data across all replicas.
Arg.EnableXlaSharding enable_xla_sharding = 6
Whether to allow XLA to produce separate programs to shard/unshard this argument. Requires this arg to be an on-device Kind::VARIABLE, or a Kind::PARAMETER. For Kind::PARAMETER, it represents the initial value of a variable, and retval_index_for_sharding must be specified for the corresponding updated value.
int32 retval_index_for_sharding = 8
If XLA sharding is allowed on a Kind::PARAMETER, this field is used to specify the corresponding updated value in the return values. Use -1 for variables that are not updated.
bool fast_mem = 7
Whether this argument is placed on fast memory or not.
bool unrestricted_layout = 9
Whether to let XLA decide the layout during compilation, as opposed to using a fixed layout determined by the shape.
string name = 10
Name of the node that the arg comes from.
bool requires_xla_broadcast = 11
Whether to use XLA collectives to broadcast this parameter to all replicas, instead of using TensorFlow Send/Recv among the tasks.

Used in: Arg

DISALLOWED = 0
TENTATIVE = 1
Sharding is allowed if host training loop exists.
ALLOWED = 2

Used in: Arg

INVALID = 0
PARAMETER = 1
VARIABLE = 2
GUARANTEED_CONSTANT = 3
These are args which have been guaranteed to be constants during the session lifetime by the use of the GuaranteeConstOp (or ConstantOp).

Description of the return values from a computation.

Used in: TPUCompileMetadataProto

optional xla.OpSharding sharding = 1
The cross-core sharding of this return value within each replica, e.g., assigning to one core, or replicate across all cores.

Stable protobuf for TPU compilation options, suitable for persistent storage. This proto needs to be backward compatible under maintenance. TODO(timshen): investigate and migrate other options from TPUCompileMetadataProto.

Used in: TPUCompileMetadataProto

TPUCompileOptions.Precision matrix_unit_operand_precision = 1

Used in: TPUCompileOptions

DEFAULT = 0
BFLOAT16 = 1
FLOAT32 = 2
TENSOR_FLOAT32 = 3

repeated TPUEmbeddingConfiguration.TableDescriptor table_descriptor = 1
TPUEmbeddingConfiguration.Mode mode = 2
int32 batch_size_per_tensor_core = 3
Number of samples in each batch of embedding layer activations sent to the TensorCore.
int32 num_hosts = 4
Number of TPU hosts used for inference/training.
int32 num_tensor_cores = 5
Number of TensorCores used for inference/training.
TPUEmbeddingConfiguration.ShardingStrategy sharding_strategy = 6
bool pipeline_execution_with_tensor_core = 7
This parameter determines if the execution of the sparse core will be pipelined with that of the TensorCore. This parameter only affects results when mode=TRAINING. If mode=INFERENCE or BACKWARD_PASS_ONLY, this parameter does not affect execution and hence, is a don't care value.
false: The execution of the sparse core is not pipelined with that of the TensorCore. The forward pass of every step on the sparse core is executed only after the backward pass of the previous step is complete. And the backward pass on the sparse core is executed only after the embedding gradients have been computed on the TensorCore on every step. This ensures that the activations on every step observe the gradient updates from the previous step on both the sparse core and the TensorCore.
true: The execution of the sparse core is pipelined with that of the TensorCore. The forward pass of every step on the sparse core can be executed after the forward pass of the previous step is complete without waiting for the backward pass. This improves the utilization of the sparse core allowing it to process step N+1 while the embedding gradients for step N are computed on the TensorCore. The backward pass of every step on the sparse core is executed directly after the forward pass for the next step is complete. The drawback is that embedding activations for step N+1 do not observe the embedding gradient updates from step N. This could affect model quality if step N and N+1 involve the same set of embedding IDs. However, since the embedding updates are sparse, this is generally not considered a problem.
string profile_data_directory = 9
Directory where embedding lookup statistics are stored. These statistics summarize information about the inputs to the embedding lookup operation, in particular, the average number of embedding IDs per example and how well the embedding IDs are load balanced across the system. The lookup statistics are used during TPU initialization for embedding table partitioning. Collection of lookup statistics is done at runtime by profiling the embedding inputs: only 3% of input samples are profiled to minimize host CPU overhead. Once a suitable number of samples are profiled, the lookup statistics are saved to table-specific files in the profile data directory generally at the end of a TPU training loop. The filename corresponding to each table is obtained by hashing table specific parameters (e.g., table name and number of features) and global configuration parameters (e.g., sharding strategy and TPU worker task count). The same profile data directory can be shared amongst several models to reuse embedding lookup statistics.
repeated TPUEmbeddingConfiguration.FeatureDescriptor feature_descriptor = 10
If the feature_descriptor field is populated, the model should NOT populate TableDescriptor.num_features and batch_size_per_tensor_core. These two fields will be auto-populated by the TPUEmbedding rewrite passes.
optional TPUEmbeddingConfiguration.SpmdSharding spmd_sharding = 11

Description of different input features.

Used in: TPUEmbeddingConfiguration

string name = 1
Name of the input feature.
int32 table_id = 2
Index of the corresponding table in the TableDescriptor list.
repeated int32 input_shape = 3
Static shape of the inputs (excluding the reduction axis). Note that the shape of the actual inputs provided using the infeed op must be strictly smaller than input_shape. The outputs received at the TensorCore will have rank = input_shape.size() + 1. The innermost axis corresponds to the embedding dimension. If the input has shape [m, n, k] (excluding the reduction axis) and the embedding dimension is d, the output received at the TensorCore will have shape [m, n, k, d].
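
The output-shape rule above amounts to appending the embedding dimension to input_shape. A minimal sketch, with a hypothetical helper name:

```python
def embedding_output_shape(input_shape, embedding_dimension):
    """Shape received at the TensorCore: input_shape plus the embedding dimension."""
    return list(input_shape) + [embedding_dimension]


# Input of shape [m, n, k] = [2, 3, 4] with embedding dimension d = 8.
output_shape = embedding_output_shape([2, 3, 4], 8)
```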

Mode: whether the embedding layer program should be run for inference (forward pass only), training (both forward and backward passes), or only the backward pass.

Used in: TPUEmbeddingConfiguration

UNSPECIFIED = 0
INFERENCE = 1
TRAINING = 2
BACKWARD_PASS_ONLY = 3

Sharding strategy of the embedding tables among the hosts.
If the sharding_strategy is "mod", each id is assigned to host "id % num_hosts". For instance, 13 ids are split across 5 hosts as:
  [[0, 5, 10], [1, 6, 11], [2, 7, 12], [3, 8], [4, 9]].
If the sharding_strategy is "div", ids are assigned to hosts in a contiguous manner. In this case, 13 ids are split across 5 hosts as:
  [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10], [11, 12]].
In both the strategies, if the id space does not evenly divide the number of hosts, each of the first "table_descriptor.vocabulary_size % num_hosts" hosts will be assigned one more id. This partitioning strategy exactly follows that in the embedding_lookup TensorFlow function at tensorflow/python/ops/embedding_ops.py.

Used in: TPUEmbeddingConfiguration

DIV_DEFAULT = 0
MOD = 1
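
The two strategies can be reproduced with a short sketch that returns the ids assigned to each host. The function name and the "mod"/"div" string arguments are illustrative assumptions; the example values are the 13-id, 5-host case from the description.

```python
def shard_ids(vocabulary_size, num_hosts, strategy):
    """Return a list with one list of ids per host for the given strategy."""
    if strategy == "mod":
        # Host h receives every id with id % num_hosts == h.
        return [list(range(host, vocabulary_size, num_hosts))
                for host in range(num_hosts)]
    if strategy == "div":
        # Contiguous ranges; the first (vocabulary_size % num_hosts) hosts get one extra id.
        base, extra = divmod(vocabulary_size, num_hosts)
        shards, start = [], 0
        for host in range(num_hosts):
            size = base + (1 if host < extra else 0)
            shards.append(list(range(start, start + size)))
            start += size
        return shards
    raise ValueError("unknown strategy: %r" % strategy)


mod_shards = shard_ids(13, 5, "mod")
div_shards = shard_ids(13, 5, "div")
```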

SPMD (Single Program Multiple Data) sharding configuration for TPUEmbedding. When model parallelism is used on the TensorCore, the number of cores per replica must be passed to TPUEmbedding so that the right shapes can be computed in the TF/XLA bridge.

Used in: TPUEmbeddingConfiguration

bool enabled = 1
Whether SPMD sharding is enabled.
int32 num_cores_per_replica = 2
Number of cores per replica.

Description of the various embedding tables.

Used in: TPUEmbeddingConfiguration

string name = 1
Name of the table.
int64 vocabulary_size = 2
Size of the vocabulary (i.e., number of rows) in the table.
int32 dimension = 3
The embedding dimension (i.e., the width of the embedding table).
int32 num_features = 4
Number of features mapped to this table.
optional OptimizationParameters optimization_parameters = 5
Details of the learning algorithm used to update the embedding parameters.

A placeholder message that is used to define a unique Status payload URL for TPU embedding errors.

(message has no fields)

Describes features of a TPU.

Used in: TopologyProto

TPUHardwareFeature.EmbeddingFeature embedding_feature = 1

Embedding feature of a TPU.

Used in: TPUHardwareFeature

UNSUPPORTED = 0
No embedding lookup accelerator available on the TPU.
V1 = 1
Embedding lookup accelerator V1. The embedding lookup operation can only be placed at the beginning of the computation. Only one instance of the embedding lookup layer is allowed.
V2 = 2
Embedding lookup accelerator V2. The embedding lookup operation can be placed anywhere in the computation. Multiple instances of the embedding lookup layer are allowed.

Describes the geometry of a TPU mesh.

repeated int32 mesh_shape = 1
The dimensions of the TPU topology, in cores. Typically, this is a 4D topology [x, y, z, core], where the major dimensions correspond to TPU chips, and the minor dimension describes the number of cores on a multicore chip.
int32 num_tasks = 2
Number of TensorFlow tasks in the cluster.
int32 num_tpu_devices_per_task = 3
Number of TPU devices per task.
repeated int32 device_coordinates = 4
A flattened rank 3 int32 array with shape [num_tasks, num_tpu_devices_per_task, len(mesh_shape)]. `tasks` is the number of tasks in the TPU cluster, `devices` is the number of TPU devices per task, and the minor dimension corresponds to a position in the TPU mesh topology. Each entry [task, device, axis] gives the `axis`-th coordinate in the topology of a task/device pair.
optional TPUHardwareFeature tpu_hardware_feature = 5
TPU supported features.
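As a sketch of how the flattened device_coordinates array might be indexed, the helper below assumes row-major (C-order) flattening of the [num_tasks, num_tpu_devices_per_task, len(mesh_shape)] array. That ordering is an assumption not stated in the field description, and the function name is hypothetical.

```python
from typing import Sequence


def device_coordinate(
    device_coordinates: Sequence[int],
    num_tpu_devices_per_task: int,
    mesh_rank: int,
    task: int,
    device: int,
    axis: int,
) -> int:
    """Look up entry [task, device, axis] in the flattened array.

    Assumes row-major order, so the axis index varies fastest.
    """
    flat_index = (task * num_tpu_devices_per_task + device) * mesh_rank + axis
    return device_coordinates[flat_index]


# Made-up topology: 2 tasks, 2 devices per task, 4D mesh [x, y, z, core].
coords = [
    0, 0, 0, 0,  # task 0, device 0
    0, 0, 0, 1,  # task 0, device 1
    1, 0, 0, 0,  # task 1, device 0
    1, 0, 0, 1,  # task 1, device 1
]
print(device_coordinate(coords, 2, 4, task=1, device=1, axis=3))
```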

TPU compilation request for compiling computations into XLA HLO IR and building TPU programs.

bool use_experimental = 1
A flag reserved for using an experimental version of the compilation. By default, the value should be false.
bool use_mlir = 2
Use mlir to lower computation(s) to Hlo.
bool return_hlo_protos = 3
If true, returns HLO metadata.
bool unload_cache_on_session_close = 4
If true, unloads cache on session close.
optional TPUCompileMetadataProto metadata = 5
Compilation metadata.
repeated TensorShapeProto arg_shapes = 6
Computation argument shapes.
repeated TensorProto guaranteed_constants = 7
Input tensor that gives const guarantee to the TF runtime.
string mlir_module = 8
MLIR module definition.
optional FunctionDefLibrary fdef_lib = 9
A set of named functions used as the input to lowering to Hlo when mlir is not used.
int32 graph_def_version = 10
The version of the graph definition used to lower TF function to Hlo.
optional NameAttrList function = 11
Function containing the computation to compile.

Used in: GetTpuProgramRequest

int64 uid = 1
int32 proto_index = 2

A user-defined optimizer. The contained HLO program must take the following arguments in the following order: (1) gradients, (2) table weights, (3) slot variables, and (4) an optional scalar input that is passed in via the dynamic learning rate mechanism. It must return/end in a tuple op that contains the following values in the following order: (1) new table values and (2) new slot variable value. The program must have shape (1,1) with dtype float32 throughout and only use HLO ops that operate elementwise (e.g., no reduce, no variables, no control flow, and no broadcasting outside of the single scalar input). The HLO program should be written as if it were a dense update. It will be called on each row that needs an update and will be applied elementwise.

Used in: OptimizationParameters

optional xla.HloModuleProto program = 1
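To make the argument and return ordering concrete, here is a plain-Python sketch of the kind of elementwise computation such a program could express. It is not HLO and not a TensorFlow API: the Adagrad-style rule, the function names, and the single-slot layout are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple


def adagrad_like_update(
    grad: float, weight: float, slot: float, learning_rate: float
) -> Tuple[float, float]:
    """Elementwise update in the documented order of arguments and results.

    Arguments: gradient, table weight, slot variable, scalar input.
    Returns: (new table value, new slot variable value).
    Assumes the accumulated slot value ends up positive.
    """
    new_slot = slot + grad * grad
    new_weight = weight - learning_rate * grad / math.sqrt(new_slot)
    return new_weight, new_slot


def update_row(
    grads: List[float], weights: List[float], slots: List[float], learning_rate: float
) -> Tuple[List[float], List[float]]:
    """Apply the scalar update independently to every element of one row."""
    results = [
        adagrad_like_update(g, w, s, learning_rate)
        for g, w, s in zip(grads, weights, slots)
    ]
    return [r[0] for r in results], [r[1] for r in results]


print(adagrad_like_update(2.0, 1.0, 0.0, 0.5))
```

The scalar function uses no reductions, control flow, or cross-element access, which mirrors the elementwise-only restriction described above.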

package tensorflow.tpu

service TpuCompilationCacheServiceExternal

rpc GetTpuProgram (GetTpuProgramRequest, GetTpuProgramResponseExternal)

message GetTpuProgramRequest

oneof key_oneof

string key = 1

TpuCompilationUidAndIndex uid_and_index = 2

CompilationCacheFetchTarget fetch_target = 3

message GetTpuProgramResponseExternal

optional GetTpuProgramResponseExternal.Blob proto = 1

optional tf2xla.HostComputeMetadata host_compute_metadata = 2

bool may_modify_variables = 3

optional GetTpuProgramResponseExternal.Blob compiler_metadata = 4

bool is_empty = 5

message AdadeltaParameters

float rho = 1

float epsilon = 2

message AdagradMomentumParameters

float momentum = 1

bool use_nesterov = 2

float exponent = 3

float beta2 = 4

float epsilon = 5

message AdagradParameters

message AdamParameters

float beta1 = 3

float beta2 = 4

float epsilon = 5

bool use_non_lazy_adam = 8

bool use_sum_inside_sqrt = 10

message AssignParameters

message BoundedAdagradParameters

bool update_accumulator_first = 1

float max_var_update = 2

float max_accumulator = 3

message CenteredRmsPropParameters

float rho = 1

float momentum = 2

float epsilon = 3

message ClippingLimits

optional google.protobuf.FloatValue lower = 1

optional google.protobuf.FloatValue upper = 2

enum CompilationCacheFetchTarget

INVALID = 0

MAIN = 1

SHARDING = 2

UNSHARDING = 3

message CompilationResultProto

error.Code status_code = 1

string status_error_message = 2

repeated xla.HloProto hlo_protos = 3

CompilationResultProto.ErrorCode error_code = 4

enum CompilationResultProto.ErrorCode

UNKNOWN = 0

OUT_OF_MEMORY = 1

message DynamicLearningRate

int32 tag = 1

message FrequencyEstimatorParameters

float tau = 1

float max_delta = 2

float outlier_threshold = 3

float weight_exponent = 4

message FtrlParameters

float l1 = 1

float l2 = 2

float lr_power = 3

float beta = 7

bool multiply_linear_by_lr = 6

bool allow_zero_accumulator = 8

message GetTpuProgramResponseExternal.Blob

bytes data = 1

message GradientAccumulationStatus

enum GradientAccumulationStatus.Status

UNSPECIFIED = 0

ENABLED = 1

DISABLED = 2

message HotIdReplicationConfiguration

HotIdReplicationConfiguration.Status status = 1

enum HotIdReplicationConfiguration.Status

UNSPECIFIED = 0