package diplomacy.tensorflow.tpu

https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer https://github.com/tensorflow/tensorflow/blob/c19e29306ce1777456b2dbb3a14f511edf7883a8/tensorflow/core/kernels/training_ops.cc#L68

Used in: OptimizationParameters

float rho = 1
float epsilon = 2
float initial_accumulator = 3
float initial_update = 4

https://www.tensorflow.org/api_docs/python/tf/train/AdagradOptimizer https://github.com/tensorflow/tensorflow/blob/c19e29306ce1777456b2dbb3a14f511edf7883a8/tensorflow/core/kernels/training_ops.cc#L151

Used in: OptimizationParameters

float initial_accumulator = 1

The Adam optimizer does not implement hyper-parameter update; use the dynamic learning rate feature instead, setting the learning rate to: user learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t). Here, t is the current timestep. https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer https://github.com/tensorflow/tensorflow/blob/ab51450c817674c8ff08a7ae4f8ac50cdc4bed8b/tensorflow/python/training/adam.py#L54 Note that the code by default implements the lazy version of Adam (https://www.tensorflow.org/api_docs/python/tf/contrib/opt/LazyAdamOptimizer) unless the use_non_lazy_adam parameter is set, in which case it implements the normal version of Adam that updates all parameters in the embedding table, even for entries that are not used in the current minibatch (https://www.tensorflow.org/api_docs/python/tf/contrib/opt/AdamOptimizer). If use_non_lazy_adam is enabled, use_gradient_accumulation is also required in order to get correct results; a warning will be printed otherwise (which may change to an error in the future). If use_sum_inside_sqrt is set, the Adam variable update formula will be changed from m / (sqrt(v) + epsilon) to m / sqrt(v + epsilon**2); this option improves the performance of TPU training and is not expected to harm model quality.

Used in: OptimizationParameters

float beta1 = 3
float beta2 = 4
float epsilon = 5
float initial_m = 6
float initial_v = 7
bool use_non_lazy_adam = 8
bool use_sum_inside_sqrt = 10
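
The two formulas in the description above can be sketched in plain Python. This is only an illustration of the arithmetic, not the TPU implementation; the function names adam_dynamic_learning_rate and adam_step_direction are hypothetical and do not appear in the proto.

```python
import math


def adam_dynamic_learning_rate(user_learning_rate, beta1, beta2, t):
    """Rate to feed in via the dynamic learning rate feature at timestep t.

    Computes user_learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t).
    Assumes t starts at 1 (t = 0 would divide by zero).
    """
    if t < 1:
        raise ValueError("t must be >= 1")
    return user_learning_rate * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)


def adam_step_direction(m, v, epsilon, use_sum_inside_sqrt=False):
    """Per-element update direction for first moment m and second moment v."""
    if use_sum_inside_sqrt:
        # Variant selected by use_sum_inside_sqrt: m / sqrt(v + epsilon**2)
        return m / math.sqrt(v + epsilon ** 2)
    # Default form: m / (sqrt(v) + epsilon)
    return m / (math.sqrt(v) + epsilon)
```

With epsilon = 0 the two forms coincide; they differ only in how epsilon enters the denominator.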

https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer https://github.com/tensorflow/tensorflow/blob/c19e29306ce1777456b2dbb3a14f511edf7883a8/tensorflow/core/kernels/training_ops.cc#L372

Used in: OptimizationParameters

float rho = 1
float momentum = 2
float epsilon = 3
float initial_ms = 4
float initial_mom = 5
float initial_mg = 6

Used in: OptimizationParameters

optional google.protobuf.FloatValue lower = 1
-inf if not set
optional google.protobuf.FloatValue upper = 2
+inf if not set
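
As a rough illustration of the defaults above, an unset bound behaves like an infinite limit. The sketch below models an unset FloatValue as None; the helper name clip_value is hypothetical.

```python
import math


def clip_value(value, lower=None, upper=None):
    """Clip value to [lower, upper]; None stands for an unset bound."""
    lo = -math.inf if lower is None else lower  # -inf if not set
    hi = math.inf if upper is None else upper   # +inf if not set
    return min(max(value, lo), hi)
```

Calling clip_value(5.0) with no bounds returns the value unchanged, which mirrors "no limits are applied" when the message is absent.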

Describes the result of a TPU compilation.

error.Code status_code = 1
The error message, if any, returned during compilation.
string status_error_message = 2
repeated xla.HloProto hlo_protos = 3
HLO proto.

A 'device' is a physical entity in the system and is comprised of several resources.

Used in: Trace

string name = 1
The name of the device.
uint32 device_id = 2
The id of this device, unique in a single trace.
map<uint32, Resource> resources = 3
The resources on this device, keyed by resource_id.

Get the learning rate from the parameters of the SendTPUEmbeddingGradients op.

Used in: LearningRate

(message has no fields)

https://www.tensorflow.org/api_docs/python/tf/train/FtrlOptimizer https://github.com/tensorflow/tensorflow/blob/c19e29306ce1777456b2dbb3a14f511edf7883a8/tensorflow/core/kernels/training_ops.cc#L192

Used in: OptimizationParameters

float l1 = 1
float l2 = 2
float lr_power = 3
float initial_accum = 4
float initial_linear = 5

Source of learning rate to use.

Used in: OptimizationParameters

oneof learning_rate
- float constant = 1
- DynamicLearningRate dynamic = 2

Variant of algorithm in http://proceedings.mlr.press/v44/shamir15.pdf

Used in: OptimizationParameters

float l2 = 1
float lr_power = 2
float min_servable_mdl_benefit = 3
float mdl_mix_in_margin = 4
float mdl_benefit_rampup_coeff = 5
float mdl_min_weight = 6
float benefit_revisit_scale = 7
float max_event_benefit = 8
float max_total_benefit = 9
float mdl_hard_limit = 10
bool hard_limit_min_benefit = 11
bool mdl_regularize = 12
float initial_accumulator = 13
float initial_weight = 14
float initial_benefit = 15

https://www.tensorflow.org/api_docs/python/tf/train/MomentumOptimizer https://github.com/tensorflow/tensorflow/blob/c19e29306ce1777456b2dbb3a14f511edf7883a8/tensorflow/core/kernels/training_ops.cc#L271

Used in: OptimizationParameters

float momentum = 1
bool use_nesterov = 2
float initial_accum = 3

Used in: TPUEmbeddingConfiguration.TableDescriptor

optional LearningRate learning_rate = 13
Learning rate used for updating the embedding layer parameters.
optional ClippingLimits clipping_limits = 2
Limits to which to clip the weight values after the backward pass; not present means no limits are applied.
optional ClippingLimits gradient_clipping_limits = 7
Limits to which to clip the backward pass gradient before using it for updates; not present means no limits are applied.
float weight_decay_factor = 16
Amount of weight decay to apply; see weight_decay_optimizers.py for details. Almost all optimizers are supported with this option (MDL Adagrad Light does not work, and SGD does not behave as expected if it is enabled). Although there is no check, users who want weight decay will probably also want to enable gradient accumulation so that the decay happens once per minibatch.
bool use_gradient_accumulation = 15
Whether to use gradient accumulation (do two passes over the input gradients: one to accumulate them into a temporary array and another to apply them using the actual optimization algorithm). This feature is experimental -- it has not been fully verified and may cause training crashes and/or failures.
oneof parameters
Optimization algorithm parameters; which field is selected determines which algorithm to use.
- AdagradParameters adagrad = 3
- StochasticGradientDescentParameters stochastic_gradient_descent = 4
- FtrlParameters ftrl = 5
- AdamParameters adam = 6
- MomentumParameters momentum = 8
- RmsPropParameters rms_prop = 9
- CenteredRmsPropParameters centered_rms_prop = 10
- MdlAdagradLightParameters mdl_adagrad_light = 11
- AdadeltaParameters adadelta = 12
- ProximalAdagradParameters proximal_adagrad = 14
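
The two-pass idea behind use_gradient_accumulation can be sketched as follows. This is a simplified model under stated assumptions: rows are scalars, the table is a dict, and update_fn stands in for whichever optimization algorithm is selected. None of these names come from the proto.

```python
from collections import defaultdict


def apply_with_accumulation(table, ids, grads, update_fn):
    """Apply gradients so each row is updated once per minibatch.

    table: dict mapping row id -> weight (scalar, for simplicity).
    ids, grads: parallel sequences; an id may appear more than once.
    update_fn(weight, grad) -> new weight (hypothetical optimizer step).
    """
    # Pass 1: accumulate gradients per row into a temporary buffer.
    accumulated = defaultdict(float)
    for row_id, grad in zip(ids, grads):
        accumulated[row_id] += grad
    # Pass 2: run the optimizer once per touched row on the summed gradient.
    for row_id, total in accumulated.items():
        table[row_id] = update_fn(table[row_id], total)
    return table
```

In this model a row that appears several times in a minibatch receives a single optimizer step, which is why options such as weight decay would then apply once per minibatch.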

Used in: OptimizationParameters

float l1 = 1
float l2 = 2
float initial_accumulator = 3

A 'resource' generally is a specific computation component on a device. These can range from threads on CPUs to specific arithmetic units on hardware devices.

Used in: Device

string name = 1
The name of the resource.
uint32 resource_id = 2
The id of the resource. Unique within a device.

Used in: OptimizationParameters

float rho = 1
float momentum = 2
float epsilon = 3
float initial_ms = 4
float initial_mom = 5

Specification of an optimization algorithm's state variables (both the main value vector and any extra accumulators, etc.).

string name = 1
Parameter name for the state variable.
oneof usage
Usage type of this state variable.
- StateVariableSpecification.UserDefined user_defined = 2
- StateVariableSpecification.FillWithConstant fill_with_constant = 3

A state variable that should be filled with a constant and normally hidden from users (used for intermediate gradients being accumulated, for example).

Used in: StateVariableSpecification

double initial_value = 1

A normal state variable that should be saved and restored in checkpoints and used as an input or output to non-debug TensorFlow ops.

Used in: StateVariableSpecification

(message has no fields)

https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer https://github.com/tensorflow/tensorflow/blob/c19e29306ce1777456b2dbb3a14f511edf7883a8/tensorflow/core/kernels/training_ops.cc#L423

Used in: OptimizationParameters

(message has no fields)

repeated TPUEmbeddingConfiguration.TableDescriptor table_descriptor = 1
TPUEmbeddingConfiguration.Mode mode = 2
int32 batch_size_per_tensor_core = 3
Number of samples in each batch of embedding layer activations sent to the TensorCore.
int32 num_hosts = 4
Number of TPU hosts used for inference/training.
int32 num_tensor_cores = 5
Number of TensorCores used for inference/training.
TPUEmbeddingConfiguration.ShardingStrategy sharding_strategy = 6
bool pipeline_execution_with_tensor_core = 7
Determines whether execution of the sparse core is pipelined with that of the TensorCore. This parameter only affects results when mode=TRAINING; if mode=INFERENCE or BACKWARD_PASS_ONLY, it does not affect execution and its value is ignored.
false: The execution of the sparse core is not pipelined with that of the TensorCore. The forward pass of every step on the sparse core is executed only after the backward pass of the previous step is complete, and the backward pass on the sparse core is executed only after the embedding gradients have been computed on the TensorCore on every step. This ensures that the activations on every step observe the gradient updates from the previous step on both the sparse core and the TensorCore.
true: The execution of the sparse core is pipelined with that of the TensorCore. The forward pass of every step on the sparse core can be executed after the forward pass of the previous step is complete, without waiting for the backward pass. This improves the utilization of the sparse core, allowing it to process step N+1 while the embedding gradients for step N are computed on the TensorCore. The backward pass of every step on the sparse core is executed directly after the forward pass for the next step is complete. The drawback is that embedding activations for step N+1 do not observe the embedding gradient updates from step N. This could affect model quality if steps N and N+1 involve the same set of embedding IDs. However, since the embedding updates are sparse, this is generally not considered a problem.
optional TPUEmbeddingOutputLayout output_layout = 8
Extended output layout information; if not provided, a compatibility mode will use defaults that match the old layout. Providing a value for this field is EXPERIMENTAL and most ways of filling it will probably break. Do not set it unless you know what you are doing.

Mode: whether the embedding layer program should be run for inference (forward pass only), training (both forward and backward passes), or the backward pass only.

Used in: TPUEmbeddingConfiguration

UNSPECIFIED = 0
INFERENCE = 1
TRAINING = 2
BACKWARD_PASS_ONLY = 3

Sharding strategy of the embedding tables among the hosts. If the sharding_strategy is "mod", each id is assigned to host "id % num_hosts". For instance, 13 ids are split across 5 hosts as: [[0, 5, 10], [1, 6, 11], [2, 7, 12], [3, 8], [4, 9]]. If the sharding_strategy is "div", ids are assigned to hosts in a contiguous manner. In this case, 13 ids are split across 5 hosts as: [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10], [11, 12]]. In both strategies, if the number of ids is not evenly divisible by the number of hosts, each of the first "table_descriptor.num_ids % num_hosts" hosts will be assigned one more id. This partitioning strategy exactly follows that in the embedding_lookup TensorFlow function at tensorflow/python/ops/embedding_ops.py.

Used in: TPUEmbeddingConfiguration

DIV_DEFAULT = 0
MOD = 1
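
The "mod" and "div" assignments described above can be reproduced with a short Python sketch. The function name shard_ids and the string strategy argument are illustrative assumptions, not part of the proto or of TensorFlow.

```python
def shard_ids(num_ids, num_hosts, strategy):
    """Return a list with one list of ids per host."""
    if strategy == "mod":
        # Host h receives every id with id % num_hosts == h.
        return [list(range(host, num_ids, num_hosts)) for host in range(num_hosts)]
    if strategy == "div":
        # Contiguous ranges; the first (num_ids % num_hosts) hosts get one extra id.
        base, extra = divmod(num_ids, num_hosts)
        shards, start = [], 0
        for host in range(num_hosts):
            size = base + (1 if host < extra else 0)
            shards.append(list(range(start, start + size)))
            start += size
        return shards
    raise ValueError("strategy must be 'mod' or 'div'")
```

For 13 ids and 5 hosts, shard_ids(13, 5, "mod") and shard_ids(13, 5, "div") give the two splits listed in the description.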

Description of the various embedding tables.

Used in: TPUEmbeddingConfiguration

string name = 1
Name of the table.
int32 vocabulary_size = 2
Size of the vocabulary (i.e., number of rows) in the table.
int32 dimension = 3
The embedding dimension (i.e., the width of the embedding table).
int32 num_features = 4
Number of features mapped to this table.
optional OptimizationParameters optimization_parameters = 5
Details of the learning algorithm used to update the embedding parameters.

Used in: TPUEmbeddingConfiguration

repeated TPUEmbeddingOutputLayout.TableDescriptor table = 1
Output locations for each feature of each table.
repeated TPUEmbeddingOutputLayout.EmbeddingOutputTensor output = 2
Shape and layout information for each tensor.

Format information for a single output tensor.

Used in: TPUEmbeddingOutputLayout

oneof output_format
- TwoDOutputTensor two_d = 4

Description of the output placement for one feature.

Used in: TableDescriptor

repeated OutputLocation output_location = 1
Typically, only one copy of each feature is used, but multiple are allowed and the same data will be copied to all of them (with the gradients summed in the backward pass).

Location of one copy of the feature's data.

Used in: FeatureDescriptor

int32 tensor_index = 1
Which output tensor this copy of the feature will go into. Must be between 0 and layout.output_size().
int32 dim0_offset = 2
Offset in dimension 0 for this feature copy. Must be between 0 and layout.output(tensor_index).dim0_size_per_sample().
int32 dim1_offset = 3
Offset in dimension 1 for this feature copy. Must be between 0 and layout.output(tensor_index).dim1_size() - table width; repeated or partially/fully overlapping values are allowed and results in the same range will be summed (with the gradients replicated in the backward pass).

Description of the output placement for features of one table.

Used in: TPUEmbeddingOutputLayout

repeated FeatureDescriptor feature = 1
Output locations for each feature loaded from this table.

Size and layout information for 2-D tensors.

Used in: EmbeddingOutputTensor

int32 dim0_size_per_sample = 2
Multiplier for output dimension 0 size; used to match legacy format that stacks features within a sample in dimension 0.
int32 dim1_size = 1
The size (in dimension 1) of this output tensor.

Describes the geometry of a TPU mesh.

repeated int32 mesh_shape = 1
The dimensions of the TPU topology, in cores. Typically, this is a 3D topology [x, y, core], where the major dimensions correspond to TPU chips, and the minor dimension describes the number of cores on a multicore chip.
int32 num_tasks = 2
Number of TensorFlow tasks in the cluster.
int32 num_tpu_devices_per_task = 3
Number of TPU devices per task.
repeated int32 device_coordinates = 4
A flattened rank 3 int32 array with shape [num_tasks, num_tpu_devices_per_task, len(mesh_shape)], where num_tasks is the number of tasks in the TPU cluster, num_tpu_devices_per_task is the number of TPU devices per task, and the minor dimension corresponds to a position in the TPU mesh topology. Each entry [task, device, axis] gives the `axis`-th coordinate in the topology of a task/device pair.
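
Reading one entry out of the flattened array can be sketched as below. The sketch assumes row-major (C-order) flattening, which the description does not state explicitly; the helper name device_coordinate is hypothetical.

```python
def device_coordinate(device_coordinates, num_tpu_devices_per_task,
                      mesh_rank, task, device, axis):
    """Look up entry [task, device, axis] in the flattened array.

    mesh_rank is len(mesh_shape). Row-major layout is an assumption.
    """
    flat_index = (task * num_tpu_devices_per_task + device) * mesh_rank + axis
    return device_coordinates[flat_index]


# Example: 2 tasks, 2 devices per task, mesh_shape = [2, 2, 1].
coords = [0, 0, 0,   1, 0, 0,    # task 0: devices 0 and 1
          0, 1, 0,   1, 1, 0]    # task 1: devices 0 and 1
y_of_task1_dev0 = device_coordinate(coords, 2, 3, task=1, device=0, axis=1)
```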

A 'Trace' contains metadata for the individual traces of a system.

map<uint32, Device> devices = 1
The devices that this trace has information about. Maps from device_id to more data about the specific device.
repeated TraceEvent trace_events = 4
All trace events captured in the profiling period.

Used in: Trace

uint32 device_id = 1
The id of the device that this event occurred on. The full dataset should have this device present in the Trace object.
uint32 resource_id = 2
The id of the resource that this event occurred on. The full dataset should have this resource present in the Device object of the Trace object. A resource_id is unique on a specific device, but not necessarily within the trace.
string name = 3
The name of this trace event.
uint64 timestamp_ps = 9
The timestamp that this event occurred at (in picoseconds since tracing started).
uint64 duration_ps = 10
The duration of the event in picoseconds if applicable. Events without duration are called instant events.
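
A small sketch of working with the timing fields above. The dataclass only mirrors the listed fields for illustration and is not the generated protobuf class; treating duration_ps == 0 as "no duration" is an assumption based on proto3 defaults.

```python
from dataclasses import dataclass

PS_PER_US = 1_000_000  # picoseconds per microsecond


@dataclass
class TraceEvent:
    device_id: int
    resource_id: int
    name: str
    timestamp_ps: int
    duration_ps: int = 0


def is_instant(event):
    """Assumption: an unset (zero) duration marks an instant event."""
    return event.duration_ps == 0


def end_timestamp_ps(event):
    """End of the event, in picoseconds since tracing started."""
    return event.timestamp_ps + event.duration_ps


def ps_to_us(picoseconds):
    """Convert picoseconds to microseconds."""
    return picoseconds / PS_PER_US
```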

message AdadeltaParameters

float rho = 1

float epsilon = 2

float initial_accumulator = 3

float initial_update = 4

message AdagradParameters

float initial_accumulator = 1

message AdamParameters

float beta1 = 3

float beta2 = 4

float epsilon = 5

float initial_m = 6

float initial_v = 7

bool use_non_lazy_adam = 8

bool use_sum_inside_sqrt = 10

message CenteredRmsPropParameters

float rho = 1

float momentum = 2

float epsilon = 3

float initial_ms = 4

float initial_mom = 5

float initial_mg = 6

message ClippingLimits

optional google.protobuf.FloatValue lower = 1

optional google.protobuf.FloatValue upper = 2

message CompilationResultProto

error.Code status_code = 1

string status_error_message = 2

repeated xla.HloProto hlo_protos = 3

message Device

string name = 1

uint32 device_id = 2

map<uint32, Resource> resources = 3

message DynamicLearningRate

message FtrlParameters

float l1 = 1

float l2 = 2

float lr_power = 3

float initial_accum = 4

float initial_linear = 5

message LearningRate

oneof learning_rate

float constant = 1

DynamicLearningRate dynamic = 2

message MdlAdagradLightParameters

float l2 = 1

float lr_power = 2

float min_servable_mdl_benefit = 3

float mdl_mix_in_margin = 4

float mdl_benefit_rampup_coeff = 5

float mdl_min_weight = 6

float benefit_revisit_scale = 7

float max_event_benefit = 8

float max_total_benefit = 9

float mdl_hard_limit = 10

bool hard_limit_min_benefit = 11

bool mdl_regularize = 12

float initial_accumulator = 13

float initial_weight = 14

float initial_benefit = 15

message MomentumParameters

float momentum = 1

bool use_nesterov = 2

float initial_accum = 3

message OptimizationParameters

optional LearningRate learning_rate = 13

optional ClippingLimits clipping_limits = 2

optional ClippingLimits gradient_clipping_limits = 7

float weight_decay_factor = 16

bool use_gradient_accumulation = 15

oneof parameters

AdagradParameters adagrad = 3

StochasticGradientDescentParameters stochastic_gradient_descent = 4

FtrlParameters ftrl = 5

AdamParameters adam = 6

MomentumParameters momentum = 8

RmsPropParameters rms_prop = 9

CenteredRmsPropParameters centered_rms_prop = 10

MdlAdagradLightParameters mdl_adagrad_light = 11