The active memory allocations at the peak memory usage.
Used in:
The index of a snapshot in the time-sorted list, used to fetch the MemoryActivityMetadata at front end from the memory_profile_snapshots list.
The index of MemoryActivityMetadata in the special_allocations list.
Number of occurrences for identical memory allocations.
Result database for all-reduce ops.
Used in:
Result proto for all-reduce ops.
Used in:
Unique id for all-reduce ops.
The name of the hlo op.
All-reduce nodes from different modules that share the same all_reduce_id are all-reduced together. If empty, AllReduce is not applied across modules.
The start time in picoseconds of the op event.
The end time in picoseconds of the op event.
The size of the op in bytes.
Generic hardware bottleneck.
Indicates if input is a bottleneck. Possible values: "host", "device", "both", or "unknown"
A human-readable description of the input bottleneck.
Indicates if kernel launching is a bottleneck. Possible values: "no", "moderate", "high".
A human-readable description of the kernel launching overhead.
Indicates if all other is a bottleneck. Possible values: "no", "moderate", "high".
A human-readable description of the all other overhead.
Used in:
A 'device' is a physical entity in the system and is comprised of several resources.
Used in:
The name of the device.
The id of this device, unique in a single trace.
The resources on this device, keyed by resource_id.
Bytes/s.
Result database for core to core flow events.
Used in:
Result proto for metrics on flow events.
Used in:
Unique id for each send and recv pair.
Channel id generated by the XLA compiler, it is statically unique within an HloModule.
The name of the hlo op.
Category of the hlo op.
The start time in picoseconds of the op event.
The end time in picoseconds of the op event.
The size of the op in bytes.
The replica id of the program running the flow event.
Indicates if kernel launch is a performance bottleneck. Possible values: "no", "moderate", "high".
A statement that recommends if we need to further investigate kernel-launch performance.
Indicates if all other is a performance bottleneck. Possible values: "no", "moderate", "high".
A statement that recommends if we need to further investigate all-other performance.
A statement that recommends if the user should try using lower precision. Shows this statement to users only if it is not empty.
Breakdown of step-time on generic hardware. Note that these components are mutually exclusive, so adding them together gives the step time. If an execution time interval has multiple types of events happening, we need to pick one of the event types to attribute the time interval to.
Map event type to the accumulated duration in picoseconds of that type.
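The mutually exclusive attribution described above can be sketched as follows. This is a minimal illustration, not the profiler's actual algorithm: the event-type names, the priority order in PRIORITY, and the function breakdown_step are all hypothetical.

```python
from collections import defaultdict

# Hypothetical priority order: when several event types overlap,
# the interval is attributed to the type that appears first here.
PRIORITY = ["device_compute", "host_to_device", "host_compute"]


def breakdown_step(step_start_ps, step_end_ps, events):
    """Attribute every picosecond of a step to exactly one event type.

    events: list of (event_type, start_ps, end_ps) tuples.
    Returns a dict mapping event type to accumulated duration in ps.
    Time covered by no listed event type is attributed to "unknown".
    """
    # Collect all boundaries so that the set of active events is
    # constant inside each elementary interval.
    bounds = {step_start_ps, step_end_ps}
    for _, start, end in events:
        bounds.add(max(start, step_start_ps))
        bounds.add(min(end, step_end_ps))
    bounds = sorted(b for b in bounds if step_start_ps <= b <= step_end_ps)

    durations = defaultdict(int)
    for lo, hi in zip(bounds, bounds[1:]):
        active = {t for t, s, e in events if s <= lo and hi <= e}
        chosen = next((t for t in PRIORITY if t in active), "unknown")
        durations[chosen] += hi - lo
    return dict(durations)


result = breakdown_step(
    0, 100,
    [("host_compute", 0, 60), ("device_compute", 40, 90)],
)
# Each interval is counted once, so the components add up to the step time.
assert sum(result.values()) == 100
```

Because every elementary interval is assigned to a single type, the per-type durations always sum to the step duration, which is the property the comment above requires.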
Summary of all unknown time as a part of step in ms.
Summary of all host-wait-input time as a part of step in ms.
Summary of all host-to-device time as a part of step in ms.
Summary of all input time as a part of step in ms.
Summary of all output time as a part of step in ms.
Summary of all device-compute time as a part of step in ms.
Summary of all device-to-device time as a part of step in ms.
Summary of all host-compute time as a part of step in ms.
Summary of all host-prepare time as a part of step in ms.
Summary of all compilation time as a part of step in ms.
Types of hardware profiled.
Unknown hardware.
CPU only without any hardware accelerator.
GPU.
TPU.
Result proto for host-dependent job information.
Used in:
The ID of the host on which the job was run.
The command line used to run the job.
The start time of this run (nanoseconds since the Unix epoch).
BNS address specified by client at time of profiling request.
Profiling start walltime (in ns).
Result proto for host-independent job information.
Used in:
The change-list number of this build.
The time of this build (nanoseconds since the Unix epoch).
The target of this build.
Profiling duration (in ms).
Used in:
The Op's name.
The number of occurrences.
Time (accumulated over all occurrences) in milliseconds.
Time (accumulated over all occurrences) in percentage of the total input processing time.
Self time (accumulated over all occurrences) in milliseconds.
Self time (accumulated over all occurrences) in percentage of the total input processing time.
Possible categories: "Enqueue", "Advanced file read", "Demanded file read", "Preprocessing", "Unknown".
Used in:
A list of detailed recommendations.
An analysis of different types of bottlenecks. Can be unpacked into a BottleneckAnalysis.
A suggested step to take next.
Used in:
Hardware type.
Summary of all step duration across all cores.
Summary of all input-related stall as percentage of step duration.
Details of each step. Can be unpacked into a PerGenericStepDetails.
The breakdown of the input processing time.
Details of each input Op executed.
Recommendation for next steps to users.
Breakdown of the step time. Can be unpacked into a GenericStepTimeBreakdown.
Error messages.
Used in:
Time spent on demanded file read in microseconds.
Time spent on advanced file read in microseconds.
Time spent on data preprocessing in microseconds.
The infeed enqueue time in microseconds.
This entry is for the situation where we can't further break down the non-enqueue input time (because the input pipeline is not instrumented).
Used in:
Name of the kernel.
Registers per thread.
Static shared memory in bytes.
Dynamic shared memory in bytes.
Block dimensions.
Grid dimensions.
Total duration of this kernel.
Min duration of kernel in nanoseconds.
Max duration of kernel in nanoseconds.
Kernel utilizes TensorCore instructions.
Operation is eligible to use TensorCores.
TF operation name.
Number of occurrences.
Used in:
A list of kernels aggregated by name.
Data layout of an op.
Used in:
The physical data layout, from most-minor to most-major dimensions.
Physical data layout in each tensor dimension.
Used in:
Size of the data in this dimension.
Data must be padded to a multiple of alignment.
What the dimension represents.
What the dimension represents, e.g. spatial, feature or batch.
Used in:
The memory activity that causes change of memory state.
Used in:
Memory allocation in heap.
Memory deallocation in heap.
Memory reservation for stack.
Expansion of existing memory allocation.
The metadata associated with each memory allocation/deallocation. It can also be interpreted as the metadata for the delta of memory state. Next ID: 10
Used in:
The activity associated with the MemoryProfileSnapshot.
The requested memory size in bytes from the caller of memory allocation. Should be a positive number.
The allocated (block/chunk) size for the memory allocation. Should be a positive number.
Starting address of the allocated memory chunk/block.
TensorFlow Op name for the memory activity.
Step Id at which the memory activity occurred.
Tensor memory region type including "output", "temp", "persist", and "dynamic".
From enum DataType defined in tensorflow/core/framework/types.proto.
Tensor shape printed in string, e.g. "[3, 3, 512, 512]".
The aggregated memory stats including heap, stack, free memory and fragmentation at a specific time.
Used in:
Memory usage by stack reservation, in bytes.
Memory usage by heap allocation, in bytes.
Free memory available for allocation or reservation, in bytes.
Fragmentation value within [0, 1].
The peak memory usage over the entire program (lifetime of memory allocator). It increases monotonically and is bounded above by the memory capacity.
Data for memory usage analysis in one host.
A map from memory allocator's id to PerAllocatorMemoryProfile for memory usage analysis on this host.
Number of hosts profiled, used to populate host selection list at front end.
Ids for profiled memory allocators, used to populate memory selection list at front end.
Map of original random int64 step id to the count of memory activity events assigned with it.
Profile snapshot of the TensorFlow memory at runtime, including MemoryAggregationStats (memory usage breakdown etc.), and MemoryActivityMetadata (allocation or deallocation, TF Op name etc.).
Used in:
Memory activity timestamp.
The memory aggregation stats at the snapshot time.
The metadata for the memory activity at the snapshot time.
The summary of memory profile within the profiling window duration.
Used in:
The peak memory usage over the entire program (lifetime of memory allocator).
The peak memory usage stats within the profiling window.
The timestamp for peak memory usage within the profiling window.
The memory capacity of the allocator.
Metrics for an operation (accumulated over all occurrences). Next ID: 19
Used in:
HLO module id. 0 for TF ops.
Name of this op.
Category of this op.
Provenance of this op (e.g., if HLO op, original TF op).
Whether it is executed eagerly.
Number of executions.
Total time (self + children) in picoseconds.
Minimum time (self + children) among all occurrences.
Total self time in picoseconds.
Total FLOPs.
Total bytes accessed.
Total dma stall time in picoseconds.
The data layout for this op. Only set for convolution ops for now.
Deduplicated HLO name for this op. Not set for TF ops.
Children of the op. e.g. fused ops if this op is fusion.
A database for OpMetrics. Next ID: 14
Used in:
A bunch of OpMetrics.
The total host infeed-enqueue duration in picoseconds.
The total of the difference between the start times of two consecutive infeed-enqueues (per host) in picoseconds.
The total time in picoseconds.
The total time incurred by OPs in picoseconds.
Precision-related stats.
Operator Statistics.
The database for the op metrics collected from the host over the entire profiling session including incomplete steps.
The database for the op metrics collected from the device over the entire profiling session including incomplete steps.
Performance environment of the op metrics collected.
The database of step sequences.
The run environment of this profiling session.
Kernel stats results from all GPUs.
Statistics for all tf-functions.
Errors seen.
The run environment of the profiled session.
The step-time result.
The other analysis result.
The recommendation made to the user.
Errors.
Overview result for general analysis.
Used in:
MXU utilization in percentage.
Percentage of the device time that is idle.
Percentage of the host time that is idle.
Top TF Ops executed on the device.
Remark text in the performance summary section.
Color of the remark text.
FLOP rate utilization relative to the roofline in percentage.
Memory bandwidth utilization relative to the hw limit in percentage.
Percentage of device computation that is 16-bit.
Percentage of device computation that is 32-bit.
Percentage of TF ops executed on the host.
Percentage of TF ops executed on the device.
Host trace level.
Result proto for host-dependent job information.
Used in:
The ID of the host on which the job was run.
The command line used to run the job.
The start time of this run (nanoseconds since the Unix epoch).
BNS address specified by client at time of profiling request.
Profiling start walltime (in ns).
Result proto for host-independent job information.
Used in:
The change-list number of this build.
The time of this build (nanoseconds since the Unix epoch).
The target of this build.
Profiling duration (in ms).
Overview result for the recommendation section.
Used in:
Possible performance bottleneck: "host", "device", "both".
A statement for input that recommends the next steps for investigating the bottleneck.
A statement for output that recommends the next steps for investigating the bottleneck.
A list of tips for improving host performance.
A list of tips for improving device performance.
A list of links to related useful documents.
The recommendation made to the user. Can be unpacked into a GenericRecommendation.
A list of tips for FAQ.
A list of tips for inference run.
The run environment of a profiling session.
Used in:
Number of hosts used.
Number of tasks used.
Distinct hostnames seen.
The type of device used.
The number of device cores used. In the TPU case, this corresponds to the number of TPU cores. In the GPU case, this corresponds to the number of GPUs (not the number of SMs).
The per-device-core batch size.
Host-independent information about this job.
Host-dependent information about this job.
The number of replicas, corresponds to input parallelism. If there is no model parallelism, replica_count = device_core_count
The number of cores used for a single replica, e.g. model parallelism. If there is no model parallelism, then num_cores_per_replica = 1
Overview result for a performance tip to users.
Used in:
Link to the tip.
Overview result for a TensorFlow Op.
Used in:
Name of the Op.
Category of the Op.
The amount of time that this Op takes by itself as fraction of the total execution time on the device or host.
The cumulative time up to this Op as fraction of the total execution time.
How many GFlops/sec that this Op achieves.
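The relationship between the per-Op fraction and the cumulative fraction above can be sketched as below. This is only an illustration under assumptions: the function name add_cumulative_fractions is hypothetical, and ranking Ops by descending self time before accumulating is an assumed ordering that the comments above do not specify.

```python
from itertools import accumulate


def add_cumulative_fractions(ops):
    """ops: list of (op_name, self_time_fraction) tuples.

    Returns (op_name, self_time_fraction, cumulative_fraction) tuples,
    ranked by descending self-time fraction (assumed ordering).
    """
    ranked = sorted(ops, key=lambda op: op[1], reverse=True)
    running = accumulate(frac for _, frac in ranked)
    return [(name, frac, cum) for (name, frac), cum in zip(ranked, running)]


rows = add_cumulative_fractions(
    [("MatMul", 0.5), ("Relu", 0.125), ("Conv2D", 0.25)]
)
for name, frac, cum in rows:
    print(f"{name}: self={frac:.3f} cumulative={cum:.3f}")
```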
Memory profile snapshots per memory allocator.
Used in:
A list of MemoryProfileSnapshots sorted by time_offset_ps.
The summary of memory profile (e.g. the peak memory usage).
The rows in the table of active allocations at peak memory usage within profiling window.
The special allocations (e.g. pre-allocated heap memory, stack reservation) that are not captured in the MemoryActivityMetadata of memory_profile_snapshots. Need to handle separately.
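How a front end might resolve a row of the active-allocations table to its metadata can be sketched as follows. The list names memory_profile_snapshots and special_allocations come from the comments above; the keys snapshot_index, special_index and activity_metadata, and the convention that a negative index means "not applicable", are assumptions made for this example.

```python
def resolve_allocation(profile, active):
    """Return the metadata dict for one active allocation.

    profile: dict with "memory_profile_snapshots" (time-sorted list) and
             "special_allocations" (list of metadata dicts).
    active:  dict with hypothetical "snapshot_index" and "special_index" keys.
    Assumption: a negative index means that source does not apply.
    """
    snapshot_index = active.get("snapshot_index", -1)
    if snapshot_index >= 0:
        snapshot = profile["memory_profile_snapshots"][snapshot_index]
        return snapshot["activity_metadata"]
    return profile["special_allocations"][active["special_index"]]


profile = {
    "memory_profile_snapshots": [
        {"time_offset_ps": 10, "activity_metadata": {"tf_op_name": "MatMul"}},
        {"time_offset_ps": 25, "activity_metadata": {"tf_op_name": "Conv2D"}},
    ],
    "special_allocations": [{"tf_op_name": "preallocated/unknown"}],
}

regular = resolve_allocation(profile, {"snapshot_index": 1, "special_index": -1})
special = resolve_allocation(profile, {"snapshot_index": -1, "special_index": 0})
```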
Result proto for information in a step across all cores.
Used in:
The step number.
A map from core_id to StepInfo.
The result for the per-step HLO-metric database.
The result for send and recv flows.
A map from core ID to program replica id. Replica id map could change during a profile session, but should stay stable within a step.
The result for all-reduce ops.
Per-step details on generic hardware.
The step number of a step.
The step time (in ms).
Breakdown of the step time in different event categories. The unknown time (in ms).
The time (in ms) in which the host is waiting for input data to be ready.
The time (in ms) in which the host is sending input data to the device. Total input time = host_wait_input_ms + host_to_device_ms.
The output time (in ms).
The device-compute time (in ms).
The device-to-device communication time (in ms).
The host-compute time (in ms).
The host-prepare time (in ms).
The time spent on compiling (in ms).
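The input-time relation stated above (total input time = host_wait_input_ms + host_to_device_ms) can be written out as a small sketch. The class StepDetails and the helper names are hypothetical, and only the fields needed for the calculation are modeled.

```python
from dataclasses import dataclass


@dataclass
class StepDetails:
    """Hypothetical stand-in holding a subset of the per-step fields."""
    step_time_ms: float
    host_wait_input_ms: float
    host_to_device_ms: float


def total_input_ms(details):
    # Total input time = host_wait_input_ms + host_to_device_ms.
    return details.host_wait_input_ms + details.host_to_device_ms


def input_percent_of_step(details):
    # Share of the step spent on input; 0 if the step time is not positive.
    if details.step_time_ms <= 0:
        return 0.0
    return 100.0 * total_input_ms(details) / details.step_time_ms


step = StepDetails(step_time_ms=200.0, host_wait_input_ms=30.0, host_to_device_ms=20.0)
print(total_input_ms(step), input_percent_of_step(step))
```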
Performance environment, e.g., the peak performance capabilities of the device.
Used in:
Peak performance of a TPU core or a GPU in TFLOP/s.
Peak memory bandwidth of a TPU core or a GPU in GiB/s.
The ridge point of roofline model in FLOP/Byte. (i.e., minimum operational intensity required to achieve maximum performance).
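The ridge point follows from the two peak figures above: it is the peak compute rate divided by the peak memory bandwidth. The sketch below assumes the bandwidth is in binary gibibytes (2**30 bytes), as the unit label suggests; the profiler itself may use a different constant, and the function name is hypothetical.

```python
def ridge_point_flop_per_byte(peak_teraflops_per_second, peak_bw_gibibytes_per_second):
    """Minimum operational intensity (FLOP/Byte) needed to reach peak compute."""
    flops_per_second = peak_teraflops_per_second * 1e12
    # Assumption: 1 GiB = 2**30 bytes.
    bytes_per_second = peak_bw_gibibytes_per_second * 2**30
    return flops_per_second / bytes_per_second


# Example with made-up device numbers: 100 TFLOP/s and 500 GiB/s.
print(ridge_point_flop_per_byte(100.0, 500.0))
```

An op whose operational intensity is below this value cannot reach peak compute because it is limited by memory bandwidth first.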
Statistics about the various precision used in computation.
Used in:
Amount of time spent on 16-bit computation (in ps).
Amount of time spent on 32-bit computation (in ps).
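One way to turn these two durations into the 16-bit and 32-bit percentages reported in the overview is sketched below. It assumes the percentages are taken relative to the sum of the two durations, which the comments do not state explicitly; the function name is hypothetical.

```python
def precision_percentages(compute_16bit_ps, compute_32bit_ps):
    """Return (percent_16bit, percent_32bit) of the combined compute time."""
    total_ps = compute_16bit_ps + compute_32bit_ps
    if total_ps == 0:
        return (0.0, 0.0)
    return (
        100.0 * compute_16bit_ps / total_ps,
        100.0 * compute_32bit_ps / total_ps,
    )


print(precision_percentages(300, 100))
```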
A 'resource' generally is a specific computation component on a device. These can range from threads on CPUs to specific arithmetic units on hardware devices.
Used in:
The name of the resource.
The id of the resource. Unique within a device.
The run environment of a profiling session.
Used in:
Number of hosts used.
Number of tasks used.
Distinct hostnames seen.
The type of device used.
The number of device cores used. In the TPU case, this corresponds to the number of TPU cores. In the GPU case, this corresponds to the number of GPUs (not the number of SMs).
The per-device-core batch size.
Host-independent information about this job.
Host-dependent information about this job.
The number of replicas, corresponds to input parallelism. If there is no model parallelism, replica_count = device_core_count
The number of cores used for a single replica, e.g. model parallelism. If there is no model parallelism, then num_cores_per_replica = 1
The chip interconnection topology.
Host trace level.
Result proto for a StepDatabase.
Used in:
Whether the step db uses incomplete step information. This flag is set to true when: 1) no step marker or annotation present. 2) profiling duration is too short to cover a full step. If this flag is false, we will group and breakdown the profile by complete steps only and ignore incomplete steps. If this flag is true, we will simply aggregate and breakdown over the total profile as a single step.
A sequence of PerCoreStepInfo.
Result proto for StepInfo. Next ID: 5
Used in:
The step number.
The step duration in picoseconds.
The start time of this step in picoseconds.
Breakdown of the step-time. Can be unpacked into a GenericStepBreakdown.
Used for both step duration and Op duration.
Used in:
System topology, which describes the number of chips in a pod and the connectivity style.
Used in:
The X, Y, and Z dimensions of this topology. 0 means that dimension does not exist.
The number of expected bad chips in this system.
Statistics for a tf-function.
Used in:
A map from each execution mode to its corresponding metrics.
Total tracing count from the program's beginning (i.e. beyond the profiling period) of this tf-function.
Compiler used to compile this function.
All possible compilers that can be used to compile a tf-function in the graph mode.
Used in:
Yet to be set.
Any other compiler.
If some instance of the function is compiled with XLA and some is compiled with Non-XLA, use "MIXED_COMPILER".
XLA compiler.
MLIR compiler.
Statistics for all tf-functions.
Used in:
A map from function name to the statistics of that function.
All possible execution modes of a tf-function.
Yet to be set.
Eager execution.
Graph execution with tracing.
Graph execution without tracing.
Concrete function.
Metrics associated with a particular execution mode of a tf-function.
Used in:
Number of invocations to the function in that execution mode.
The sum of "self-execution" time of this function over those invocations.
A database of TfStatsTables.
The table that includes IDLE time.
The table that excludes IDLE time.
There is one TfStatsRecord for each TF operation profiled.
Used in:
Rank of this TF-op among all TF-ops.
Whether this TF-op is on "Host" or "Device".
TF-op type.
TF-op name.
Number of occurrences of the operation.
Total "accumulated" time in microseconds that the operation took. If this operation has any children operations, the "accumulated" time includes the time spent inside children.
Average "accumulated" time in microseconds that each occurrence of the operation took.
Total "self" time in microseconds that the operation took. If this operation has any children operations, the "self" time doesn't include the time spent inside children.
Average "self" time in microseconds that the operation took.
Total "self" time as fraction of the sum of the total self-time of operations run on the device. It is 0 if this op runs on the host.
Cumulative value of device_total_self_time_as_fraction.
Total "self" time as fraction of the sum of the total self-time of operations run on the host. It is 0 if this op runs on the device.
Cumulative value of host_total_self_time_as_fraction.
Number of floating-point operations (FLOPs) performed per second.
Number of bytes (including both read and write) accessed per second.
Operational intensity, which is defined as FLOPs/bytes-accessed.
Whether this operation is "Compute" or "Memory" bound, according to the Roofline Model.
Whether this TF-op is eagerly executed.
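The operational intensity and the "Compute" versus "Memory" classification above can be illustrated with the standard roofline rule. This is a sketch under assumptions: the comparison against a ridge point (in FLOP/Byte) is the textbook roofline criterion rather than something the comments spell out, treating an op exactly at the ridge point as compute bound is an arbitrary choice, and the function names are hypothetical.

```python
def operational_intensity(flops, bytes_accessed):
    """FLOPs per byte accessed; infinite if no bytes are accessed."""
    if bytes_accessed == 0:
        return float("inf")
    return flops / bytes_accessed


def bound_by(flops, bytes_accessed, ridge_point_flop_per_byte):
    """Classify an op as "Compute" or "Memory" bound (roofline rule)."""
    intensity = operational_intensity(flops, bytes_accessed)
    # Assumption: at or above the ridge point the op is compute bound.
    return "Compute" if intensity >= ridge_point_flop_per_byte else "Memory"


print(bound_by(flops=2000, bytes_accessed=10, ridge_point_flop_per_byte=100.0))
print(bound_by(flops=500, bytes_accessed=10, ridge_point_flop_per_byte=100.0))
```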
A table of TFStatsRecords plus the corresponding pprof keys.
Used in:
All TfStats records, one for each TF operation.
Key to the pprof profile for host TF operations.
Key to the pprof profile for device TF operations.
A 'Trace' contains metadata for the individual traces of a system.
The devices that this trace has information about. Maps from device_id to more data about the specific device.
All trace events captured during the profiling period.
Used in:
The id of the device that this event occurred on. The full dataset should have this device present in the Trace object.
The id of the resource that this event occurred on. The full dataset should have this resource present in the Device object of the Trace object. A resource_id is unique on a specific device, but not necessarily within the trace.
The name of this trace event.
The timestamp that this event occurred at (in picoseconds since tracing started).
The duration of the event in picoseconds if applicable. Events without duration are called instant events.
Extra arguments that will be displayed in trace view.
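Because a resource_id is only unique within a device, an event has to be resolved through its device first. The sketch below models the Trace as plain dictionaries; the key names devices, resources, device_id, resource_id and name mirror the comments above, but the dictionary layout and the function resource_of are assumptions for illustration.

```python
def resource_of(trace, event):
    """Return (device_name, resource_name) for a trace event.

    The lookup goes device_id -> Device -> resource_id, since the same
    resource_id may appear on several devices within one trace.
    """
    device = trace["devices"][event["device_id"]]
    resource = device["resources"][event["resource_id"]]
    return device["name"], resource["name"]


trace = {
    "devices": {
        1: {"name": "/device:GPU:0", "resources": {0: {"name": "Stream #0"}}},
        2: {"name": "/host:CPU", "resources": {0: {"name": "Thread 0"}}},
    }
}

# The same resource_id (0) resolves to different resources on different devices.
print(resource_of(trace, {"device_id": 1, "resource_id": 0}))
print(resource_of(trace, {"device_id": 2, "resource_id": 0}))
```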
An XEvent is a trace event, optionally annotated with XStats. Next ID: 6
Used in:
XEventMetadata.id of corresponding metadata.
Start time of the event in picoseconds, as offset from XLine.timestamp_ns().
Number of occurrences of the event, if aggregated.
Duration of the event in picoseconds. Can be zero for an instant event.
XStats associated with the event.
Metadata for an XEvent, shared by all instances of the same event. Next ID: 5
Used in:
XPlane.event_metadata map key.
Name of the event.
Name of the event shown in trace viewer.
Additional metadata in serialized format.
An XLine is a timeline of trace events (XEvents). Next ID: 12
Used in:
Id of this line, can be repeated within an XPlane. All XLines with the same id are effectively the same timeline.
Display id of this line. Multiple lines with the same display_id are grouped together in the same trace viewer row.
Name of this XLine.
Name of this XLine to display in trace viewer.
Start time of this line in nanoseconds since the UNIX epoch. XEvent.offset_ps is relative to this timestamp.
Profiling duration for this line in picoseconds.
XEvents within the same XLine should not overlap in time, but they can be nested.
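Two points above lend themselves to a short sketch: converting an event offset into an absolute time using the line's timestamp, and checking that events on one line are either disjoint or nested. The function names are hypothetical, and events are modeled as (offset_ps, duration_ps) tuples rather than real XEvent messages.

```python
PS_PER_NS = 1000  # 1 nanosecond = 1000 picoseconds


def event_start_ps(line_timestamp_ns, offset_ps):
    """Absolute start time in picoseconds since the UNIX epoch."""
    return line_timestamp_ns * PS_PER_NS + offset_ps


def properly_nested(events):
    """True if every pair of events is disjoint or one contains the other.

    events: list of (offset_ps, duration_ps) tuples from a single line.
    """
    open_ends = []  # end times of the enclosing events, outermost first
    # Sort by start; for equal starts the longer event comes first so
    # that it becomes the parent of the shorter one.
    for start, duration in sorted(events, key=lambda e: (e[0], -e[1])):
        end = start + duration
        while open_ends and open_ends[-1] <= start:
            open_ends.pop()  # that enclosing event has already finished
        if open_ends and end > open_ends[-1]:
            return False  # partial overlap with an enclosing event
        open_ends.append(end)
    return True


print(event_start_ps(2, 500))
print(properly_nested([(0, 100), (10, 20), (40, 50)]))
print(properly_nested([(0, 50), (40, 20)]))
```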
An XPlane is a container of parallel timelines (XLines), generated by a profiling source or by post-processing one or more XPlanes. Next ID: 7
Used in:
Name of this plane.
Parallel timelines grouped in this plane. XLines with the same id are effectively the same timeline.
XEventMetadata map, each entry uses the XEventMetadata.id as key. This map should be used for events that share the same ID over the whole XPlane.
XStatMetadata map, each entry uses the XStatMetadata.id as key. This map should be used for stats that share the same ID over the whole XPlane.
XStats associated with this plane, e.g. device capabilities.
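The metadata maps exist so that each event stores only an ID while the shared name lives once per plane. A minimal sketch of that lookup is shown below, with the plane modeled as plain dictionaries. The map name event_metadata comes from the comments above; the keys metadata_id and offset_ps on the event and the function event_name are assumptions for this example.

```python
def event_name(plane, event):
    """Resolve an event's name through the plane's event_metadata map.

    Returns None if the plane has no metadata entry for the event's ID.
    """
    metadata = plane["event_metadata"].get(event["metadata_id"])
    return metadata["name"] if metadata is not None else None


plane = {
    "name": "/device:GPU:0",
    "event_metadata": {
        1: {"id": 1, "name": "MatMul"},
        2: {"id": 2, "name": "Relu"},
    },
    "lines": [
        {"id": 0, "events": [
            {"metadata_id": 1, "offset_ps": 0},
            {"metadata_id": 2, "offset_ps": 500},
            {"metadata_id": 1, "offset_ps": 900},
        ]},
    ],
}

names = [event_name(plane, e) for line in plane["lines"] for e in line["events"]]
print(names)
```

The same pattern applies to stats, which would be resolved through the stat_metadata map instead.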
A container of parallel XPlanes, generated by one or more profiling sources. Next ID: 3
Errors (if any) in the generation of planes.
An XStat is a named value associated with an XEvent, e.g., a performance counter value, a metric computed by a formula applied over nested XEvents and XStats. Next ID: 8
Used in:
XStatMetadata.id of corresponding metadata.
Value of this stat.
A string value that is stored in XStatMetadata::name.
Metadata for an XStat, shared by all instances of the same stat. Next ID: 4
Used in:
XPlane.stat_metadata map key.
Name of the stat (should be short).
Description of the stat (might be long).