The active memory allocations at the peak memory usage.
Used in:
The index of a snapshot in the time-sorted list, used to fetch the MemoryActivityMetadata at the front end from the memory_profile_snapshots list.
The index of MemoryActivityMetadata in the special_allocations list.
Number of occurrences for identical memory allocations.
Result database for all-reduce ops.
Used in:
Result proto for all-reduce ops.
Used in:
Unique id for all-reduce ops.
The name of the hlo op. This field is no longer set by the profiler.
For all-reduce nodes from different modules, if they have the same all_reduce_id, they will be 'AllReduce'd. If empty, AllReduce will not be applied across modules.
The start time in picoseconds of the op event.
The end time in picoseconds of the op event.
The size of the op in bytes.
Used in:
Name of this OP.
Number of times this OP occurred.
The time in microseconds spent in this OP (averaged across all of its occurrences).
Byte size of data transferred.
Replica groups.
Description (e.g. XLA expression).
Detail of a batch. Next ID: 13
Used in:
Batch id.
Start time of the batch in picosecs.
End time of the batch in picosecs.
The latency between the start time of the first request in this batch and the time this batch is processed.
Request ids related to this batch.
Size of padding.
Size of a batch after padding.
Model ID of this batch. This is the same model_id as any of the requests in this batch. All the requests from the same batch must share the same model_id.
Tensor event details.
Host index for this batch.
Percentile of this batch in all batches in the profile duration.
Total time in picosecs in this batch spent on device.
Batching parameters collected from TFstreamz.
Used in:
Number of batch threads.
How long a request can wait before being processed by a batch.
Maximum size of a batch.
Maximum number of enqueued batches.
Sizes that are allowed to form a batch. A list of integers separated by ",".
Generic hardware bottleneck.
Percentage of step time that is spent on input.
Percentage of step time that is spent on output.
Percentage of step time that is idle for non-I/O-related reasons.
Percentage of step time that is spent on compute.
Indicates if input is a bottleneck. Possible values: "host", "device", "both", or "unknown"
A human-readable description of the input bottleneck.
Indicates if kernel launching is a bottleneck. Possible values: "no", "moderate", "high".
A human-readable description of the kernel launching overhead.
Indicates if all other is a bottleneck. Possible values: "no", "moderate", "high".
A human-readable description of the all other overhead.
Indicates if device collective communication is a bottleneck. Possible values: "no", "moderate", "high".
A human-readable description of the device collective communication overhead.
Used in:
Describes the start / exclusive limit HLO program points for a given buffer lifetime, used for rendering a box on the plot.
Used in:
Next ID: 14 Information about a send and recv channel.
Used in:
Id of the channel.
Core ids of send ops.
Core ids of recv ops.
Byte size of the data transferred.
Duration from the beginning of send to the end of recv-done in microseconds.
Number of occurrences of a channel.
Percentage of the link BW utilized over the peak link BW.
A list of hlo names associated with this channel id.
Duration from the beginning of the recv-done to the beginning of send in microseconds. If the recv-done op starts after the beginning of the send op, the delay is zero.
Description (e.g. XLA expression).
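As a rough illustration of how the delay and link-utilization fields above relate to the raw op timestamps, here is a minimal Python sketch; all names (send_start_us, recv_done_start_us, peak_link_bw_gbps, ...) are illustrative assumptions, not fields of the proto.

    def channel_delay_us(send_start_us, recv_done_start_us):
        # Delay from the beginning of recv-done to the beginning of send;
        # zero if recv-done starts after send has already begun.
        return max(0.0, send_start_us - recv_done_start_us)

    def link_bw_utilization(bytes_transferred, duration_us, peak_link_bw_gbps):
        # Achieved bandwidth in Gbit/s: bits per microsecond divided by 1e3.
        achieved_gbps = bytes_transferred * 8 / (duration_us * 1e3)
        return achieved_gbps / peak_link_bw_gbps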
TfDataStats of all hosts.
Whether it is input bound.
Summary of the analysis.
Bottleneck analysis result.
TfDataStats per host.
Next ID: 8
Used in:
unique within host, TPU core only
unique within chip per core type
unique within host
unique within mesh
unique within mesh, TPU core only
This proto is based on MegaScaleInfoProto and should be consistent with it.
The type of DCN transfer.
Groups of endpoints (in the form of slice id and device id) involved in `ALL_TO_ALL`, `REDUCE_SCATTER`, `ALL_REDUCE` and `ALL_GATHER` transfer.
Groups of endpoints (in the form of slice id and device id) involved in `ONE_TO_ONE` transfer.
Used in:
Used in:
Used in:
Used in:
XLA AllToAll transfer. Needs `endpoint_groups`.
Peer-To-Peer DCN transfer from source to one destination. Needs `one_to_one_groups`.
XLA reduce-scatter transfer. Needs `endpoint_groups`.
XLA AllGather transfer. Needs `endpoint_groups`.
XLA all-reduce transfer. Needs `endpoint_groups`.
XLA ragged all-to-all transfer. Needs `endpoint_groups`.
Used in:
XProf observed send start time.
XProf observed recv_done end time.
Slack is defined as the time the collective has to send and recv data without stalling the TPU. The effects of the network and of other overlapping collectives are removed from the collective of interest.

HOST 1: |--------|SEND1|-------|SEND1.DONE|-------|RECV1|------|RECV1.DONE|-------
HOST 2: |------|SEND2|-------|SEND2.DONE|-------|RECV2|------|RECV2.DONE|-----

Slack is computed as RECV2.DONE.StartTime - SEND2.StartTime - (overlapping communication). In this case, the overlapping communication is the duration of SEND2, SEND2.DONE and RECV2. In cases where other collectives are interspersed between this collective's ops, the overlapping duration includes their durations as well. Host 1 is ignored while computing the slack, as we assume that similar ops are executing on each core. This also prevents clock drift from affecting the analysis.
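In code form, the slack definition above reduces to a one-liner; this hedged Python sketch assumes all three quantities are in the same time unit and that overlapping_comm already sums SEND2, SEND2.DONE, RECV2 and any interspersed collectives.

    def collective_slack(recv_done_start, send_start, overlapping_comm):
        # Slack = RECV2.DONE.StartTime - SEND2.StartTime - overlapping communication.
        return recv_done_start - send_start - overlapping_comm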
Duration the collective stalled the TPU.
Recv op name.
Send op name.
Timestamp for the send/send-done/recv/recv-done ops.
Used in:
Rendezvous name for the collective.
Slack time in microseconds.
Number of occurrences in the sampled duration.
Bytes transmitted over the network.
Duration the collective stalled the TPU.
Observed duration.
Recv op name.
Send op name.
Stall duration based on the op.
A 'device' is a physical entity in the system and comprises several resources.
Used in:
The name of the device.
The id of this device, unique in a single trace.
The resources on this device, keyed by resource_id.
Bytes/s.
Information about memory transfer to/from device memory.
Used in:
Used in:
DmaActivity can be used to add DMA details to a trace event.
Used in:
Temporary field, not saved to .sstable.
Used in:
Indicates if kernel launch is a performance bottleneck. Possible values: "no", "moderate", "high".
A statement that recommends if we need to further investigate kernel-launch performance.
Indicates if all other is a performance bottleneck. Possible values: "no", "moderate", "high".
A statement that recommends if we need to further investigate all-other performance.
A statement that recommends if the user should try using lower precision. Shows this statement to users only if it is not empty.
Indicates if device collectives are a performance bottleneck. Possible values: "no", "moderate", "high".
A statement that recommends if we need to further investigate device-collectives performance.
Breakdown of step-time on generic hardware. Note that these components are mutually exclusive so that adding them together is equal to the step time. If an execution time interval has multiple types of event happening, we need to pick one of the event type to attribute the time interval to.
Map event type to the accumulated duration in picoseconds of that type.
Map of string category to accumulated duration in picoseconds for that category.
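One way to picture the mutual-exclusivity rule above: attribute every instant of the step to exactly one event type, so the per-type durations sum back to the step time. A hypothetical Python sketch (the priority rule is an assumption, not the profiler's actual tie-breaking logic):

    def attribute_step_time(active_types_per_ps, priority):
        # active_types_per_ps: for each picosecond of the step, the set of
        # event types active at that instant.
        durations = {}
        for active in active_types_per_ps:
            # Pick a single winner per instant so types never overlap.
            winner = min(active, key=priority.index) if active else "IDLE"
            durations[winner] = durations.get(winner, 0) + 1
        # Mutually exclusive durations sum exactly to the step time.
        assert sum(durations.values()) == len(active_types_per_ps)
        return durations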
Summary of all unknown time as a part of step in ms.
Summary of all host-wait-input time as a part of step in ms.
Summary of all host-to-device time as a part of step in ms.
Summary of all input time as a part of step in ms.
Summary of all output time as a part of step in ms.
Summary of all device-compute time as a part of step in ms.
Summary of all device-to-device time as a part of step in ms.
Summary of all device-collectives time as a part of step in ms.
Summary of all host-compute time as a part of step in ms.
Summary of all host-prepare time as a part of step in ms.
Summary of all compilation time as a part of step in ms.
Types of hardware profiled.
Used in:
Unknown hardware.
CPU only without any hardware accelerator.
GPU.
TPU.
Describes a heap object that is displayed in a plot in the memory visualization HTML.
Used in:
Result proto for host-dependent job information.
Used in:
The ID of the host that the job was run on.
The command line used to run the job.
The start time of this run (nanoseconds since the Unix epoch).
BNS address specified by client at time of profiling request.
Profiling start walltime (in ns).
Result proto for host-independent job information.
Used in:
The change-list number of this build.
The time of this build (nanoseconds since the Unix epoch).
The target of this build.
Profiling duration (in ms).
Proto consumed by inference analysis.
Map from host-id to the InferenceStats for that host.
Map from model-id to the InferenceStats for that model.
A database of model ids.
A database of tensor patterns.
Used in:
The Op's name.
The number of occurrences.
Time (accumulated over all occurrences) in milliseconds.
Time (accumulated over all occurrences) in percentage of the total input processing time.
Self time (accumulated over all occurrences) in milliseconds.
Self time (accumulated over all occurrences) in percentage of the total input processing time.
Possible categories: "Enqueue", "Advanced file read", "Demanded file read", "Preprocessing", "Unknown".
Used in:
A list of detailed recommendations.
An analysis of different types of bottlenecks. Can be unpacked into a BottleneckAnalysis.
A suggested step to take next.
Used in:
Tag indicating the format of step_details and step_time_breakdown; true for TPU-specific data models.
Hardware type.
Summary of all step duration across all cores.
Summary of all input-related stall as percentage of step duration.
Percentage of step time that is waiting for input.
Percentage of step time that is doing output.
Percentage of step time that is idle for non-I/O-related reasons.
Percentage of step time that is doing compute.
Details of each step. Can be unpacked into a PerGenericStepDetails.
The breakdown of the input processing time.
Details of each input Op executed.
Recommendation for next steps to users.
Breakdown of the step time. Can be unpacked into a GenericStepTimeBreakdown.
Error and warning messages for diagnosing profiling issues.
Metadata for input pipeline.
Used in:
Id of the input pipeline which is set to the id of its root iterator.
The distribution strategy creates one "host" input pipeline which actually runs tf.data user code. Also, it creates a "device" input pipeline per device (e.g., TensorCore) which takes an element from the host input pipeline and transfers it to the device.
Used in:
Stat and metadata for input pipeline.
Used in:
Id of the blocking iterator with the longest self time.
Latency of the bottleneck iterator.
Stats per iterator.
Collection of metadata and stats of input pipeline.
Used in:
Metadata of the input pipeline.
Average latency (i.e., the root iterator's latency) of the input pipeline.
Minimum latency of the input pipeline.
Maximum latency of the input pipeline.
The number of times this input pipeline was slower than 50 us.
Stats per call sorted by the root iterator's duration.
Used in:
Time spent on demanded file read in microseconds.
Time spent on advanced file read in microseconds.
Time spent on data preprocessing in microseconds.
The infeed enqueue time in microseconds.
This entry is for the situation where we can't further break down the non-enqueue input time (because the input pipeline is not instrumented).
Metadata for iterator.
Used in:
Id of the iterator.
Id of the parent iterator.
Name of the iterator.
Long name of the iterator.
Whether it is an async iterator.
Parameters of the iterator (e.g., num_parallel_calls).
Stat for iterator.
Used in:
Id of the iterator.
Start time of the iterator's GetNext in ps.
Duration of the iterator's GetNext in ps.
Self time of the iterator's GetNext in ps. It takes async iterators into account: it is calculated by subtracting the time overlapped with its child iterator's duration from the iterator's duration.
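A simplified Python sketch of that self-time calculation, assuming the child GetNext intervals have already been merged and clipped to the parent's window (names are illustrative):

    def self_time_ps(duration_ps, child_intervals):
        # Subtract time overlapped with child iterators from the duration.
        # child_intervals: non-overlapping (start, end) pairs in ps, clipped
        # to the parent's own GetNext window.
        overlapped = sum(end - start for start, end in child_intervals)
        return duration_ps - overlapped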
Whether it is blocking the root iterator. An async iterator's child iterator may not block its parent iterator if it is executed in advance and does not overlap with the parent iterator.
The number of times this iterator is called. For example, a batch iterator's child iterator may be called multiple times.
Next ID: 15
Used in:
Name of the kernel.
Registers per thread.
Static shared memory in bytes.
Dynamic shared memory in bytes.
Block dimensions.
Grid dimensions.
Total duration of this kernel.
Min duration of kernel in nanoseconds.
Max duration of kernel in nanoseconds.
Kernel utilizes TensorCore instructions.
Operation is eligible to use TensorCores.
TF operation name.
Number of occurrences.
Occupancy pct.
Used in:
A list of kernels aggregated by name.
Data layout of an op.
Used in:
The physical data layout, from most-minor to most-major dimensions.
Physical data layout in each tensor dimension.
Used in:
Size of the data in this dimension.
Data must be padded to a multiple of alignment.
What the dimension represents.
What the dimension represents, e.g. spatial, feature or batch.
Used in:
Used in:
The logical topology of the job.
The slices that are part of the job.
The network address of a specific host.
Used in:
Logical metadata about a specific device.
Used in:
The id that uniquely identifies the device globally.
The id that uniquely identifies the device within its slice.
The id that uniquely identifies the device within its host.
Logical metadata about a specific host.
Used in:
The id that uniquely identifies the host within its slice.
The network addresses of the host.
The devices that are connected to this host.
Logical metadata about a specific slice.
Used in:
The id that uniquely identifies the slice globally.
The hosts that are part of this slice.
Types of memory bandwidth we track in the system.
We use FIRST and LAST enum values to be able to iterate over this enum in TypeScript, since the _MIN and _MAX values are not automatically available as in C++.
Aggregated BW across on-chip and off-chip memory. For GPU, 1/2 is shared memory bandwidth.
On-chip memory read bw.
On-chip memory write bw.
Leave last. Leave this MAX unchanged now to avoid op profile changes. TODO(b/359279074) Revisit the memory breakdown in op profile since we have more memory types now.
A container to serialize this repeated field in "symbolized xplane."
The memory activity that causes change of memory state.
Used in:
Memory allocation in heap.
Memory deallocation in heap.
Memory reservation for stack.
Expansion of existing memory allocation.
The metadata associated with each memory allocation/deallocation. It can also be interpreted as the metadata for the delta of memory state. Next ID: 10
Used in:
The activity associated with the MemoryProfileSnapshot.
The requested memory size in bytes from the caller of memory allocation. Should be a positive number.
The allocated (block/chunk) size for the memory allocation. Should be a positive number.
Starting address of the allocated memory chunk/block.
TensorFlow Op name for the memory activity.
Step Id at which the memory activity occurred.
Tensor memory region type including "output", "temp", "persist", and "dynamic".
From enum DataType defined in tensorflow/core/framework/types.proto.
Tensor shape printed in string, e.g. "[3, 3, 512, 512]".
The aggregated memory stats including heap, stack, free memory and fragmentation at a specific time.
Used in:
Memory usage by stack reservation, in bytes.
Memory usage by heap allocation, in bytes.
Free memory available for allocation or reservation, in bytes.
Fragmentation value within [0, 1].
The peak memory usage over the entire program (lifetime of memory allocator). It increases monotonically, with the memory capacity as its upper limit.
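The proto doesn't define the fragmentation formula here; a common definition (e.g., the one TensorFlow's BFC allocator statistics use) compares the largest free block to the total free memory. A hedged Python sketch under that assumption:

    def fragmentation(largest_free_block_bytes, free_memory_bytes):
        # 0.0 when all free memory is one contiguous block; approaches 1.0
        # as free memory splinters into many small chunks.
        if free_memory_bytes == 0:
            return 0.0
        return 1.0 - largest_free_block_bytes / free_memory_bytes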
Data for memory usage analysis in one host.
A map from memory allocator's id to PerAllocatorMemoryProfile for memory usage analysis on this host.
Number of hosts profiled, used to populate the host selection list at the front end.
Ids for profiled memory allocators, used to populate the memory selection list at the front end.
Version number of MemoryProfile proto.
Profile snapshot of the TensorFlow memory at runtime, including MemoryAggregationStats (memory usage breakdown etc.), and MemoryActivityMetadata (allocation or deallocation, TF Op name etc.).
Used in:
Memory activity timestamp.
The memory aggregation stats at the snapshot time.
The metadata for the memory activity at the snapshot time.
The summary of memory profile within the profiling window duration.
Used in:
The peak memory usage over the entire program (lifetime of memory allocator).
The peak memory usage stats within the profiling window.
The timestamp for peak memory usage within the profiling window.
The memory capacity of the allocator.
TensorFlow generic memory space names. These space names are used in analysis code to get memory bandwidth per core.
Off-chip memory. Assume all backends use 1 for HBM/off-chip memory.
On-chip memory.
Any memory.
Model ID database. An unknown model id will be "" and won't be stored here, so if the model id is not found in the TF-session metadata, ModelIdDatabase will be empty.
Used in:
Array of model ids.
Map from id to index.
Map from id to batching parameters.
Used in:
Metrics for an operation (accumulated over all occurrences). Next ID: 27
Used in:
HLO module id. 0 for Framework ops.
Name of this op.
Long name of this op (e.g., HLO expression).
Category of this op (e.g., HLO op category, framework op type). Could be parsed from provenance if it is a framework op.
Provenance of this op if it is an HLO Op. (e.g. TF Op name, JAX Op name) TODO(b/310434797) Extends this for JAX as now only TF Op is populated.
Whether it is executed eagerly.
Number of executions.
Total time (self + children) in picoseconds.
Minimum time (self + children) among all occurrences.
Total self time in picoseconds.
Total FLOPs, normalized to the device's peak bandwidth.
Total FLOPs for the model. Can be 0, in which case assume it is the same as flops.
Fingerprint of the symbol (cs/xla::HloPrintOptions::Fingerprint); if 0, the fingerprint is not set.
Total bytes accessed.
Total dma stall time in picoseconds.
The data layout for this op. Only set for convolution ops for now.
Deduplicated HLO name for this op. Not set for TF ops.
Children of the op. e.g. fused ops if this op is fusion.
Number of cores on which this op occurs.
Computation primitive size in BITS. This is the size of the type of the hardware computation. In the future this may be extended to include info such as signed/unsigned, int/fp, etc. Currently only the size is needed.
Whether the op is autotuned.
Breakdown of memory accessed by operation type and memory space.
Used in:
Device-specific id of memory space.
Used in:
A database for OpMetrics. Next ID: 16
Used in:
A bunch of OpMetrics.
The total host infeed-enqueue duration in picoseconds.
The total of the difference between the start times of two consecutive infeed-enqueues (per host) in picoseconds.
The total time in picoseconds.
The total time incurred by OPs in picoseconds.
Precision-related stats.
The two stats below will be different from the total time ps and total op time ps because they are unioned over all cores (and not summed). For duty cycle, a device is idle if all the cores are idle.
For duty cycle, a device is busy if any of the cores is busy.
Next ID: 14 Operator Statistics.
The database for the op metrics collected from the host over the entire profiling session including incomplete steps.
The database for the op metrics collected from the device over the entire profiling session including incomplete steps.
The result for the HLO-metric database over the complete steps only.
Performance environment of the op metrics collected.
The database of step sequences.
The run environment of this profiling session.
Kernel stats results from all GPUs.
Statistics for all tf-functions.
A map from core ID to details.
Error and warning messages for diagnosing profiling issues.
A map from program ID to program name.
Performance counters.
Overview result for the inference query latency stats.
Used in:
The percentile numbers that the inference query latency distribution should follow. E.g., 50.0 means 50%ile. Default is [50.0, 75.0, 90.0, 99.0, 99.9].
Total and breakdown of a certain percentile latency. Each element corresponds to element with the same index in percentile_numbers.
Max latency in microseconds.
Min latency in microseconds.
Inference sessions per second aggregated over all hosts. There can be multiple queries batched in one session.
Total and breakdown latency for inference query(s). Breakdown into host/device/communication.
Used in:
The run environment of the profiled session.
The step-time result.
The other analysis result.
The recommendation made to the user.
Error and warning messages for diagnosing profiling issues.
The inference query latency stats.
Overview result for general analysis.
Used in:
MXU utilization in percentage.
Percentage of the device time that is idle.
Percentage of the host time that is idle.
Top TF Ops executed on the device.
Remark text in the performance summary section.
Color of the remark text.
FLOP rate utilization relative to the roofline in percentage.
Memory bandwidth utilization relative to the hw limit in percentage.
Percentage of device computation that is 16-bit.
Percentage of device computation that is 32-bit.
Percentage of TF ops executed on the host.
Percentage of TF ops executed on the device.
Host trace level.
Percentage of TF-op execution time on the host (excluding the idle time) that are in eager mode.
Percentage of TF-op execution time on the device (excluding the idle time) that are in eager mode.
Percentage of TF-op execution time on the device (excluding the idle time) that are for outside compilation.
Percentage of the device time that is in use.
Program Goodput metric in percentage.
Average SparseCore step time in ms.
Average SparseCore infeed time in ms.
Average SparseCore outfeed time in ms.
Average SparseCore idle time in ms.
Max FW VDD Core PL1 power metrics in watts.
Max FW VDD Core PL2 power metrics in watts.
Max FW VDD Core PL3 power metrics in watts.
Max FW VDD Core PL4 power metrics in watts.
Max FW HBM PL1 power metrics in watts.
Max FW HBM PL2 power metrics in watts.
Max FW HBM PL3 power metrics in watts.
Max FW HBM PL4 power metrics in watts.
Result proto for host-dependent job information.
Used in:
The ID of the host that the job was run on.
The command line used to run the job.
The start time of this run (nanoseconds since the Unix epoch).
BNS address specified by client at time of profiling request.
Profiling start walltime (in ns).
Result proto for host-independent job information.
Used in:
The change-list number of this build.
The time of this build (nanoseconds since the Unix epoch).
The target of this build.
Profiling duration (in ms).
Overview result for the recommendation section.
Used in:
Possible performance bottleneck: "host", "device", "both".
A statement for input that recommends the next steps for investigating the bottleneck.
A list of tips for tackling input bottleneck.
A statement for output that recommends the next steps for investigating the bottleneck.
A statement that recommends the next steps for investigating eager-mode related bottlenecks (it is HTML so that it can link to other tools/docs).
A statement that recommends the next steps for investigating outside-compilation related bottlenecks (it is HTML so that it can link to other tools/docs).
A statement that recommends the next steps for investigating tf-function related bottlenecks (it is HTML so that it can link to other tools/docs).
A list of tips for improving host performance.
A list of tips for improving device performance.
A list of links to related useful documents.
The recommendation made to the user. Can be unpacked into a GenericRecommendation.
A list of tips for FAQ.
A list of tips for inference run.
The run environment of a profiling session.
Used in:
Number of hosts used.
Number of tasks used.
Distinct hostnames seen.
The type of device used.
The number of device cores used. In the TPU case, this corresponds to the number of TPU cores. In the GPU case, this corresponds to the number of GPUs (not the number of SMs).
Host-independent information about this job.
Host-dependent information about this job.
The number of replicas; corresponds to input parallelism. If there is no model parallelism, replica_count = device_core_count.
The number of cores used for a single replica, e.g. model parallelism. If there is no model parallelism, then num_cores_per_replica = 1.
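The two fields above are tied to device_core_count by a simple identity; a small worked Python example (the numbers are made up):

    # device_core_count = replica_count * num_cores_per_replica
    device_core_count = 8
    num_cores_per_replica = 2          # e.g. a model sharded across 2 cores
    replica_count = device_core_count // num_cores_per_replica
    assert replica_count == 4          # input parallelism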
Whether it is a training analysis or inference analysis.
Power Metrics for TPU.
Overview result for a performance tip to users.
Used in:
Link to the tip.
Overview result for a TensorFlow Op.
Used in:
Name of the Op.
Category of the Op.
The amount of time that this Op takes by itself as fraction of the total execution time on the device or host.
The cumulative time up to this Op as fraction of the total execution time.
How many GFlops/sec that this Op achieves.
Whether the Op is eligible to use TensorCores.
Whether at least one of the kernels launched in this op is using TensorCore.
Memory profile snapshots per memory allocator.
Used in:
A list of MemoryProfileSnapshots referenced by <active_allocations>.
The summary of memory profile (e.g. the peak memory usage).
The rows in the table of active allocations at peak memory usage within profiling window.
The special allocations (e.g. pre-allocated heap memory, stack reservation) that are not captured in the MemoryActivityMetadata of memory_profile_snapshots. Need to handle separately.
A list of MemoryProfileSnapshots sampled from all the snapshots during the profiling window. It is used to display the memory timeline graph in the frontend. The snapshots are sorted by timestamp.
Aggregated result per batch size.
Used in:
Result proto for information in a step across all cores.
Used in:
The step number.
A map from core_id to StepInfo.
The result for the per-step HLO-metric database.
A map from core ID to program replica id. Replica id map could change during a profile session, but should stay stable within a step.
A map from core_id to all-reduce ops.
Information about device memory transfers, categorized by source and destination. Ordered by the following categories: 1. HostToDevice, 2. DeviceToHost, 3. DeviceToDevice. Cores normally share host interfaces (i.e., PCIe).
Per-step details on generic hardware.
The step number of a step.
The step name.
The step time (in ms).
Breakdown of the step time into different event categories. The unknown time (in ms).
The time (in ms) in which the host is waiting for input data to be ready.
The time (in ms) in which the host is sending input data to the device. Total input time = host_wait_input_ms + host_to_device_ms.
The output time (in ms).
The device-compute time (in ms).
The device-to-device communication time (in ms).
The device time spent on collective communications (in ms).
The host-compute time (in ms).
The host-prepare time (in ms).
The time spent on compiling (in ms).
Per-host data for inference analysis.
Used in:
A list of requests selected for inference analysis on this host. This list is in ascending order of the request duration.
A list of batches selected for inference analysis on this host. This list is in ascending order of the batch duration.
Per-model data for inference analysis.
Used in:
A list of requests selected for inference analysis on this model. This list is in ascending order of the request duration.
Aggregated result from all the <request_details>.
Inference requests per second for this model.
Average latency in microseconds of the requests in this model.
A list of batches selected for inference analysis on this model. This list is in ascending order of the batch duration.
Aggregated result from all the <batch_details>.
Batches per second for this model.
Average latency in microseconds of the batches in this model.
The aggregated result of tensor transfer in this model.
Aggregated result per batch size.
Per-step details on TPU. Next ID: 26
The step number of a step.
The TensorCore compute time in this step.
The maximum TensorCore idle time that is due to host overhead (but not input-related).
The part of a step (in ms) TC spends sending data to the host via outfeed.
The part of a step (in ms) on TC that is waiting for input data from the host.
Average infeed-dequeue time across cores (as percentage of step time).
Minimum infeed-dequeue time across cores (as percentage of step time).
Maximum infeed-dequeue time across cores (as percentage of step time).
The core with the maximum infeed time in this step.
The name of the core with the maximum infeed time in this step.
The part of a step (in ms) that is spent on the all-reduce compute.
The part of a step (in ms) that is spent on the all-reduce synchronization.
The part of a step (in ms) that is spent on SparseCoreV0 compute.
The part of a step (in ms) that is spent on infeed from host to SparseCoreV0.
The part of the step (in ms) that is spent waiting for device-to-host or host-to-device transfers.
The SparseCore compute time in this step.
The maximum SparseCore idle time that is due to host overhead (but not input-related).
The part of a step (in ms) SC spends sending data to the host via outfeed.
The part of a step (in ms) on SC that is waiting for input data from the host.
SparseCore step time in ms.
Performance environment, e.g., the peak performance capabilities of the device.
Used in:
Peak performance of a TPU core or a GPU in TFLOP/s.
Peak memory bandwidth of a TPU core or a GPU in GiBs/s.
Peak off-chip memory bandwidth of a TPU core or a GPU in GiBs/s.
Peak memory bandwidths of a TPU core or a GPU in GiBs/s. Index into array using MemBwType enum. TODO: remove the 2 above fields and bump up the proto version to maintain backwards compatibility.
The ridge point of roofline model in FLOP/Byte. (i.e., minimum operational intensity required to achieve maximum performance).
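The ridge point follows directly from the two peaks above: peak FLOP rate divided by peak memory bandwidth. A Python sketch with made-up numbers:

    def ridge_point(peak_tflops, peak_bw_gibps):
        # Minimum operational intensity (FLOP/Byte) at which the device
        # can reach its peak FLOP rate.
        return (peak_tflops * 1e12) / (peak_bw_gibps * 2**30)

    # e.g. 100 TFLOP/s and 900 GiB/s -> roughly 103 FLOP/Byte.
    print(ridge_point(100.0, 900.0))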
Whether the device has CMEM.
Whether the device has merged VMEM (with CMEM).
Whether megacore is used.
Metrics based on hardware performance counters.
Used in:
Overall matrix unit utilization in percentage.
Predicted computational cost of the instruction associated with the symbol. Estimated by traversing the HLO graph.
The number of floating-point operations computed.
The sum of bytes read and bytes written.
Breakdown of memory accessed by read/write and memory space.
Used in:
Used in:
A database of PodStats records.
All PodStats records, one for each row in the PodStats tool.
Error and warning messages for diagnosing profiling issues.
A map from event type number to event name string for step breakdown.
Result proto for information in a step across all cores.
Used in:
The (micro) step number.
A map from core_id to PodStatsRecord.
A database of channel info.
A map from core ID to program replica id. Replica id map could change during a profile session, but should stay stable within a step.
A database of all reduce ops.
Next ID: 20 There is one PodStatsRecord for each step traced on each compute node.
Used in:
The host name where the trace was collected.
The TPU global chip id where the trace was collected.
The TPU node id where the trace was collected.
The step number.
The step duration in micro-seconds.
Breakdown the durations for each event type in micro-seconds.
Indicates the bottleneck out of the above-mentioned metrics.
A sequence of PodStatsMap for each step.
Used in:
Next ID: 12 A database of pod viewer records.
The type of device used.
Pod level stats for each step.
Top level summary of pod viewer.
Error and warning messages for diagnosing profiling issues.
A map from event type number to event name string for step breakdown.
Info to draw the topology graph.
Used in:
Next ID: 9 Topology graph draws all the cores in the system in a 2-D rectangle or 3-D cube. It is hierarchically grouped by host, chip and core.
Used in:
Number of chips in the x dimension of the rectangle/cube.
Number of chips in the y dimension of the rectangle/cube.
Number of chips in the z dimension of the cube.
Number of chips in the x dimension of each host.
Number of chips in the y dimension of each host.
Number of chips in the z dimension of each host.
Number of cores per chip.
Core locations.
Used in:
Power rail or component name, e.g. HBM, Core.
Maximum watts monitored.
Average watts monitored.
(SPI sampler only) Maximum watts of moving average power over a time window of 100us.
(SPI sampler only) Maximum watts of moving average power over a time window of 1ms.
(SPI sampler only) Maximum watts of moving average power over a time window of 10ms.
(FW only) The timescale in us to compute moving averages.
The number of samples.
(SPI sampler only) Maximum watts of moving average power over a time window of 1s.
Used in:
Statistics about the various precisions used in computation.
Used in:
Amount of time spent on 16-bit computation (in ps).
Amount of time spent on 32-bit computation (in ps).
Groups together all results from the preprocessing C++ step.
Heap sizes at each HLO program point (the HLO sequential order).
Unpadded heap sizes (calculated as the minimal sizes based on the data type and dimensionality) at each HLO program point (the HLO sequential order).
The HloInstruction that was being processed at this HLO program point.
Heap objects at the peak memory usage point ordered by HLO program "birth" time.
Heap objects at the peak memory usage point ordered by size, descending.
Mapping from logical buffer ID to the HLO sequential order span in which it is alive.
Indexes to get back and forth from the by-size and by-program-order sequences.
Peak heap size for the HLO program.
Peak unpadded heap size for the HLO program.
HLO program point number at which the peak heap size occurs.
Size of the entry computation parameters in MiB. This does not reflect whether those MiB are reusable during the computation or not, it is simply a size value.
Total size of indefinite/global and temporary buffer allocations.
Total size of indefinite/global buffer allocations.
RawData contains raw data that can be used to attach further details to a TraceEvent. TraceEvents store this raw data in serialized form so it can be decoded on demand. This can improve performance as TraceEvents are often subject to filtering and only a small subset actually needs to be decoded. NEXT ID: 4
Never used. For the ease of template code.
Describes the replica groups in a cross replica op (e.g., all-reduce and all-to-all).
Used in:
The ids of the replicas that belong to the same group. The ordering of the ids matters in some ops (e.g., all-to-all).
Detail of a user-facing request. Next ID: 22
Used in:
Request id.
An index to the model_id inside InferenceStats below. Storing index instead of string to save space. It will be -1 if the model id is not given.
Start-time of the request in picosecs.
End-time of the request in picosecs.
Total time in picosecs in this request spent on device.
Total time in picosecs in this request spent on writes to device.
Total time in picosecs in this request spent on reads from device.
If this inference request is running in batching mode, record the latency between when the request is scheduled and when it is processed in a batch. Otherwise, it will always be 0.
Batch ids related to this request.
If this inference request is running in batching mode, record the size of the request. Otherwise, it will always be 0.
Detailed breakdown for host-side activities of a request. Total time in picosecs spent on host preprocessing.
Total time in picosecs spent on host batch formation.
Total time in picosecs spent on host runtime.
Total time in picosecs spent on host postprocessing.
Tensor event details. One request can have multiple TensorEventDetails because it might be split into multiple batches for execution.
Host index for this request.
Percentile of this request in all requests in the profile duration.
The time with no event associated. The machine could have been idle or executing events that were not traced.
A 'resource' generally is a specific computation component on a device. These can range from threads on CPUs to specific arithmetic units on hardware devices.
Used in:
The name of the resource.
The id of the resource. Unique within a device.
Number of events added to this resource.
The run environment of a profiling session.
Used in:
Number of hosts used.
Number of tasks used.
Distinct hostnames seen.
The type of device used.
The number of device cores used. In the TPU case, this corresponds to the number of TPU cores. In the GPU case, this corresponds to the number of GPUs (not the number of SMs).
Host-independent information about this job.
Host-dependent information about this job.
The number of replicas; corresponds to input parallelism. If there is no model parallelism, replica_count = device_core_count.
The number of cores used for a single replica, e.g. model parallelism. If there is no model parallelism, then num_cores_per_replica = 1.
Host trace level.
The chip and host interconnection topology.
Whether it is a training analysis or inference analysis.
Power Metrics for TPU.
Hardware type.
Used in:
Map from model index to the Sampled Stats.
Used in:
Used in:
Could be `-1`.
One stack frame per line.
Breakdown of step-time on SparseCore.
SparseCore step time in picoseconds (equal to SparseCore time - sc_idle - sc_wait_time).
Host to sparse core time in picoseconds.
SparseCore to host time in picoseconds.
Idle time but not waiting for input in picoseconds.
SparseCore busy time in picoseconds.
Similar to TpuStepTimeBreakdown, but for SparseCore step time info.
Used in:
Summary of all SparseCore compute op duration as a part of step in ms.
Summary of all SparseCore infeed op duration as a part of step in ms.
Summary of all SparseCore outfeed op duration as a part of step in ms.
Summary of all SparseCore idle (but not input-related) duration as a part of step in ms.
Summary of all SparseCore step time in ms.
Used in:
Result proto for a StepDatabase.
Used in:
A sequence of PerCoreStepInfo.
Whether the step db uses incomplete step information. This flag is set to true when: 1) no step marker or annotation is present, or 2) the profiling duration is too short to cover a full step. If this flag is false, we will group and break down the profile by complete steps only and ignore incomplete steps. If this flag is true, we will simply aggregate and break down over the total profile as a single step.
Number of steps dropped during post processing.
If the step_sequence is empty because there is no step profiled on any host, then empty_intersect is false. If there are steps profiled on some hosts but the intersection of steps over all hosts is empty, then empty_intersect is true.
Next ID: 7 Result proto for StepInfo.
Used in:
The step number.
The step name.
The step duration in picoseconds.
The start time of this step in picoseconds.
Breakdown of the step-time. Can be unpacked into a GenericStepBreakdown.
Total time/bytes/occurrences for collectives (All-Reduce, All-to-All, etc.).
Used for both step duration and Op duration.
Used in:
System topology, which describes the number of chips in a pod and the connectivity style.
The X, Y, and Z dimensions of this topology. 0 means that dimension does not exist.
The number of expected bad chips in this system.
'Task' contains information about a task that the profiler traced.
Used in:
The most recent changelist number from the client that built the binary.
True if the client that built the binary was mint (no local changes).
Build time (in ns relative to the Unix epoch).
Build target for the binary.
The full command line used to invoke the task.
Start time of the task (in ns relative to the Unix epoch).
Task address specified by client at time of profiling request.
Profiling start walltime (in ns).
Profiling duration (in ms).
Host trace level.
Hardware core frequency.
Used in:
The index of the tensor pattern in TensorPatternDatabase.
If batching is enabled, the TensorEventDetails in BatchDetail will have owner = BATCH, and they are counted when calculating statistics such as the number of occurrences of each tensor pattern. The TensorEventDetails in RequestDetail will also have owner = BATCH, which means the tensor events actually happen in the batch, and they are not counted when calculating those statistics. If batching is not enabled, the TensorEventDetail will only appear in RequestDetail and the owner will only be REQUEST.
Total time in picosecs spent on linearize and delinearize tensors.
The owner of this TensorEventDetail.
Used in:
Unknown. This should not happen in production code.
Owned by the request.
Owned by the batch.
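To avoid double counting when batching is enabled, statistics should draw tensor events from BatchDetail only; a hypothetical Python sketch of that filtering (the attribute names are assumptions, not proto fields):

    def count_tensor_patterns(batch_events, request_events, batching_enabled):
        # With batching, the same tensor events appear in both BatchDetail
        # and RequestDetail; count only the batch-side copies.
        events = batch_events if batching_enabled else request_events
        counts = {}
        for e in events:
            counts[e.tensor_pattern_index] = counts.get(e.tensor_pattern_index, 0) + 1
        return counts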
Tensor pattern database for all the tensor patterns that occurred during the profiling window.
Used in:
A tensor pattern is the string concatenation of all the linearize and delinearize events in an inference request. Each event records the tensor shape, data type and the layout on device.
Per-model aggregated result of tensor transfer.
Used in:
Used in:
The index of the tensor pattern in TensorPatternDatabase.
The number of occurrence of this tensor pattern in this model.
The percentiles of the linearize and delinearize time of this tensor pattern in this model.
Used in:
Used in:
Host name.
Input pipeline name.
Maximum latency of the input pipeline.
Name of the bottleneck iterator.
Long name of the bottleneck iterator.
Latency of the bottleneck iterator.
Suggestion to resolve the bottleneck.
Collection of stats of tf.data input pipelines within a host.
Used in:
Metadata per iterator.
Stats per input pipeline.
Statistics for a tf-function.
Used in:
A map from each execution mode to its corresponding metrics.
Total tracing count of this tf-function from the program's beginning (i.e., beyond the profiling period).
Compiler used to compile this function.
Percentage of time spent in the expensive calls to this function in the profiled period.
All possible compilers that can be used to compile a tf-function in the graph mode.
Used in:
Yet to be set.
Any other compiler.
If some instance of the function is compiled with XLA and some is compiled with Non-XLA, use "MIXED_COMPILER".
XLA compiler.
MLIR compiler.
Statistics for all tf-functions.
Used in:
A map from function name to the statistics of that function.
All possible execution modes of a tf-function.
Yet to be set.
Eager execution.
Graph execution with tracing.
Graph execution without tracing.
Concrete function.
Metrics associated with a particular execution mode of a tf-function.
Used in:
Number of invocations to the function in that execution mode.
The sum of "self-execution" time of this function over those invocations.
A database of TfStatsTables.
The table that includes IDLE time.
The table that excludes IDLE time.
The type of device used.
There is one TfStatsRecord for each TF operation profiled.
Used in:
Rank of this TF-op among all TF-ops.
Whether this TF-op is on "Host" or "Device".
TF-op type.
TF-op name.
Number of occurrences of the operation.
Total "accumulated" time in micro-seconds that the operation took. If this operation has any children operations, the "accumulated" time includes the time spent inside children.
Average "accumulated" time in micro-seconds that each occurrence of the operation took.
Total "self" time in micro-seconds that the operation took. If this operation has any children operations, the "self" time doesn't include the time spent inside children.
Average "self" time in micro-seconds that the operation took.
Total "self" time as fraction of the sum of the total self-time of operations run on the device. It is 0 if this op runs on the host.
Cumulative value of device_total_self_time_as_fraction.
Total "self" time as fraction of the sum of the total self-time of operations run on the host. It is 0 if this op runs on the device.
Cumulative value of host_total_self_time_as_fraction.
Total floating-point operations (FLOPs) performed per second normalized to the bf16 peak capacity.
Total floating-point operations for the op per second.
Number of bytes (including both read and write) accessed per second.
Operational intensity, which is defined as FLOPs/bytes-accessed.
Whether this operation is "Compute" or "Memory" bound, according to the Roofline Model.
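Given the operational intensity above and the device's ridge point, the roofline classification is a single comparison; a minimal Python sketch (attributing ops exactly at the ridge point to "Compute" is an assumption):

    def roofline_bound(operational_intensity, ridge_point_flop_per_byte):
        # Below the ridge point, the op cannot reach peak FLOPs and is
        # limited by memory bandwidth instead.
        if operational_intensity >= ridge_point_flop_per_byte:
            return "Compute"
        return "Memory"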
Whether this TF-op is eagerly executed.
Fraction of kernel time that utilizes GPU TensorCore. It is 0.0 if this op does not run on a GPU device.
Number of bytes accessed from HBM (including both read and write) per second.
Number of bytes read from CMEM per second.
Number of bytes written to CMEM per second.
Number of bytes read from VMEM per second.
Number of bytes written to VMEM per second.
Operational intensity based on HBM in FLOP/Byte.
Operational intensity based on CMEM read in FLOP/Byte.
Operational intensity based on CMEM write in FLOP/Byte.
Operational intensity based on VMEM read in FLOP/Byte.
Operational intensity based on VMEM write in FLOP/Byte.
Operational intensity based on the bottleneck resource in FLOP/Byte.
Flops for the record.
Bytes accessed for the record.
A table of TFStatsRecords plus the corresponding pprof keys.
Used in:
All TfStats records, one for each TF operation.
Key to the pprof profile for host TF operations.
Key to the pprof profile for device TF operations.
Topology of the system. Describes the number of chips and hosts and their connectivity.
Used in:
Topology of chips per host.
Topology of hosts.
Chip position within the mesh.
Used in:
Used in:
Percentage of step time that is spent on input.
Indicates if input is a bottleneck. Possible values: "host", "device", "both", or "unknown"
A human-readable description of the input bottleneck.
Indicates if output is a bottleneck. Possible values: "host", "device", "both", or "unknown"
Percentage of step time that is spent on output.
A human-readable description of the output bottleneck.
Percentage of step time where the TC is idle (other than I/O).
Indicates if TensorCore being idle (other than input) is a bottleneck. Possible values: "no", "yes".
A human-readable description of the TC-idle bottleneck.
Indicates if SparseCoreV0 is a bottleneck. Possible values: "no", "moderate", "high".
A human-readable description of the SparseCoreV0 bottleneck.
Indicates if all-reduce is a bottleneck. Possible values: "no", "yes".
A human-readable description of the all-reduce bottleneck.
Percentage of step time that is spent on compute.
Breakdown of step-time on TPU. Next ID: 20
The infeed duration (host to TensorCore) in picoseconds.
The outfeed duration (TensorCore to host) in picoseconds.
The TensorCore time that is waiting for SparseCoreV0 in picoseconds.
The TensorCore time spent transforming activations in SparseCoreV0 layout into XLA layout.
The outfeed duration (TensorCore to SparseCoreV0) in picoseconds.
The time spent on all-reduce (used to be cross-replica-sum) in picoseconds.
The percentage of the SparseCoreV0 time that is spent on infeed from the host (including both data and instructions).
The time spent on send operation.
The time spent on recv operation.
The time spent on host send operation.
The time spent on host recv operation.
Megacore fusion runs different operations on each core, e.g., a convolution on one core and an all-reduce on the other core. This is the time that the core executing the faster operation waits for the core executing the slower operation to reach the synchronization point.
The time waiting for overlay DMAs in picoseconds.
The time spent running high flops ops, such as convolution and output fusion.
The time that the TensorCore is idle but not waiting for input or SparseCoreV0.
The TensorCore time that is busy in picoseconds.
The SparseCoreV0 time that is busy in picoseconds (equal to SparseCoreV0 time - HOST_INSTRUCTION_STALL - HOST_DATA_STALL - TENSOR_CORE_STALL).
SparseCoreV0 step time in picoseconds (equal to SparseCoreV0 time - TENSOR_CORE_STALL).
Next ID: 9
Summary of all TensorCore compute op duration as a part of step in ms.
Summary of all SparseCoreV0 compute op duration as a part of step in ms.
Summary of all TensorCore infeed op duration as a part of step in ms.
Summary of all TensorCore outfeed op duration as a part of step in ms.
Summary of all SparseCoreV0 infeed op duration as a part of step in ms.
Summary of all TensorCore idle (but not input-related) duration as a part of step in ms.
Summary of all Host to Device and Device to Host transfer part of the step in ms.
Summary of all SparseCore step info.
Used in:
A 'Trace' contains metadata for the individual traces of a system.
The devices that this trace has information about. Maps from device_id to more data about the specific device.
The tasks that were traced, keyed by a unique ID for the server on which the task ran.
The time range that this trace covers. Timestamps are picoseconds since tracing started.
Start of first event.
End of last event.
String intern table for event's name or TraceMe argument.
The id of the device that this event occurred on. The full dataset should have this device present in the Trace object.
The id of the resource that this event occurred on. The full dataset should have this resource present in the Device object of the Trace object. A resource_id is unique on a specific device, but not necessarily within the trace. NOTE: counter events do not have this field set as they are per device.
The name of this trace event.
Reference of the name in Trace's name_table (e.g. in SStable format).
The group id which this event belongs to. This allows the trace viewer to show only a particular group of trace events.
The timestamp when this event occurred (picos since tracing started). This timestamp is in the range [min_timestamp, max_timestamp].
The duration of the event in picoseconds, if applicable. Events without duration are called instant events.
Storage for additional details, e.g. the raw data that led to this TraceEvent. These are stored as raw data so that we don't pay the deserialization cost (memory and runtime) if the data isn't used. See RawData in trace_events_raw.proto.
Used to correlate the multiple events of a flow.
For streaming trace viewer frontend deduplication, we need a unique id for each event; at the same time, we want to reduce the entropy overhead this introduces. Therefore we use the tuple <device_id, timestamp_ps, serial> as the unique id; serial is optional and only required when the timestamp is not unique.
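A sketch of that dedup key as a Python tuple; the serial argument is only needed when two events on the same device share a timestamp:

    def trace_event_key(device_id, timestamp_ps, serial=0):
        # tuple<device_id, timestamp_ps, serial> uniquely identifies an
        # event while adding minimal entropy.
        return (device_id, timestamp_ps, serial)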
Used in:
Indicates the order of the event within a flow. Events with the same flow_id will appear in trace_viewer linked by arrows. For an arrow to be shown, at least the FLOW_START and FLOW_END must be present. There can be zero or more FLOW_MID events in the flow. Arrows are drawn from FLOW_START to FLOW_END and through each FLOW_MID event in timestamp order.
Used in:
Generic trace event arguments.
Used in:
Used in:
String type, but stored in metadata.