The record type, which describes the scope that this record captures.
Captures the entire profiling duration including incomplete steps.
Captures the average of all complete steps.
Captures a single step.
Same as ALL, but the performance metrics (FLOPS and memory bandwidth) are derived from the hardware performance counters.
A database of RooflineModel records.
The device type.
Whether megacore is used.
Whether the device has shared CMEM.
Whether the device has merged VMEM.
Peak flop rate in GFLOP/s.
Peak HBM bandwidth in GiB/s.
Peak CMEM read bandwidth in GiB/s.
Peak CMEM write bandwidth in GiB/s.
Peak VMEM read bandwidth in GiB/s.
Peak VMEM write bandwidth in GiB/s.
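The peak fields above define the roofline for a device. As a worked example, the ridge point (the operational intensity at which an op stops being bandwidth-bound and becomes compute-bound) is the peak flop rate divided by the peak bandwidth. Because GFLOP/s is a decimal unit while GiB/s is binary, the units must be converted before dividing. The following is a minimal Python sketch; `ridge_point` is a hypothetical helper, not part of the profiler.

```python
GIB = 2**30   # bytes per GiB (binary unit used by the bandwidth fields)
GFLOP = 1e9   # FLOPs per GFLOP (decimal unit used by the flop-rate field)


def ridge_point(peak_gflops_per_s: float, peak_gib_per_s: float) -> float:
    """Operational intensity (FLOP/Byte) where the roofline's two ceilings meet.

    Converts GFLOP/s to FLOP/s and GiB/s to bytes/s before dividing.
    """
    if peak_gib_per_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    return (peak_gflops_per_s * GFLOP) / (peak_gib_per_s * GIB)
```

For a hypothetical device with 1000 GFLOP/s and 1000 GiB/s, the ridge point is about 0.93 FLOP/Byte rather than 1.0, purely because of the GiB conversion.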
All RooflineModel records, one for each HLO operation.
Error and warning messages for diagnosing profiling issues.
There is one RooflineModelRecord for each HLO operation profiled. Next ID: 44
The record type.
Step number when the record type is PER_STEP; invalid otherwise.
The operation's rank by self time.
The HLO module ID of the operation.
The HLO category name.
The HLO operation name.
Number of occurrences of the operation.
Total "accumulated" time in microseconds that the operation took. If this operation has any child operations, the "accumulated" time includes the time spent inside them.
Total time per core in microseconds.
Total time as a fraction of the total program time.
Average "accumulated" time in microseconds that each occurrence of the operation took.
Total "self" time in microseconds that the operation took. If this operation has any child operations, the "self" time excludes the time spent inside them.
Average "self" time in microseconds that each occurrence of the operation took.
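The per-occurrence averages, the accumulated/self split, and the time fraction follow directly from the totals above. A minimal Python sketch, assuming a hypothetical `OpTimes` container and a hypothetical `time_fraction` helper; the attribute names are illustrative, not the profiler's actual field names.

```python
from dataclasses import dataclass


@dataclass
class OpTimes:
    """Hypothetical container for the per-operation timing fields."""
    occurrences: int
    total_time_us: float       # "accumulated" time: includes child operations
    total_self_time_us: float  # "self" time: excludes child operations

    def avg_time_us(self) -> float:
        # Average accumulated time per occurrence; 0 if the op never ran.
        return self.total_time_us / self.occurrences if self.occurrences else 0.0

    def avg_self_time_us(self) -> float:
        # Average self time per occurrence; 0 if the op never ran.
        return self.total_self_time_us / self.occurrences if self.occurrences else 0.0

    def children_time_us(self) -> float:
        # Accumulated minus self time is the time spent inside child operations.
        return self.total_time_us - self.total_self_time_us


def time_fraction(op_time_us: float, program_time_us: float) -> float:
    """Fraction of the total program time taken by an operation."""
    return op_time_us / program_time_us if program_time_us > 0 else 0.0
```

For instance, an op that ran 4 times with 100 us accumulated and 60 us self time averages 25 us accumulated and 15 us self per occurrence, with 40 us spent in children.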
Percentage of the total "accumulated" time that was caused by DMA stall.
Floating-point operations (FLOPs) performed per second, normalized to the bf16 peak performance.
Floating-point operations (FLOPs) performed per second by the operation.
Number of total bytes (including both read and write) accessed per second.
HBM bandwidth in GiB/s (including both read and write).
CMEM read bandwidth in GiB/s.
CMEM write bandwidth in GiB/s.
VMEM read bandwidth in GiB/s.
VMEM write bandwidth in GiB/s.
Overall operational intensity in FLOP/Byte.
Operational intensity based on HBM in FLOP/Byte.
Operational intensity based on CMEM read in FLOP/Byte.
Operational intensity based on CMEM write in FLOP/Byte.
Operational intensity based on VMEM read in FLOP/Byte.
Operational intensity based on VMEM write in FLOP/Byte.
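Each operational intensity above is a ratio of FLOPs to bytes moved, taken either over all memory traffic or over the traffic through a single resource. A minimal sketch, assuming the caller already has per-resource byte counts; the helper names and dictionary keys are hypothetical labels.

```python
def operational_intensity(flops: float, bytes_accessed: float) -> float:
    """FLOPs performed per byte moved (FLOP/Byte).

    With no bytes moved, a positive FLOP count means unbounded intensity;
    an op with neither FLOPs nor bytes is reported as 0.
    """
    if bytes_accessed <= 0:
        return float("inf") if flops > 0 else 0.0
    return flops / bytes_accessed


def per_resource_intensity(flops: float, bytes_by_resource: dict) -> dict:
    """Intensity of the same FLOP count against each resource's byte count."""
    return {name: operational_intensity(flops, nbytes)
            for name, nbytes in bytes_by_resource.items()}
```

The overall intensity uses the total bytes accessed, while each per-resource value divides the same FLOP count by only that resource's traffic, so a resource that moves fewer bytes shows a higher intensity.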
Operational intensity based on the bottleneck resource in FLOP/Byte.
Whether this operation is "Compute", "HBM", "CMEM Read", or "CMEM Write" bound, according to the Roofline Model.
The optimal flop rate, calculated as (operational intensity) * (peak memory bandwidth).
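With several memory resources, each one caps the achievable flop rate at its operational intensity times its peak bandwidth, and the compute peak caps it as well; the lowest cap names the bound. The description above gives the optimal rate as intensity times peak memory bandwidth, which is the memory-bound case. This minimal sketch assumes the standard roofline convention of also capping at the compute peak, so the profiler's exact logic may differ. `classify_roofline` and its argument layout are hypothetical.

```python
GIB = 2**30   # bytes per GiB
GFLOP = 1e9   # FLOPs per GFLOP


def classify_roofline(flops: float, peak_gflops_per_s: float,
                      mem: dict) -> tuple:
    """Return (bound resource, attainable GFLOP/s) under a multi-level roofline.

    mem maps a resource name to (bytes moved through it, peak bandwidth in GiB/s).
    Ties resolve to "Compute" because it is inserted first.
    """
    caps = {"Compute": peak_gflops_per_s}
    for name, (nbytes, peak_gib_per_s) in mem.items():
        if nbytes <= 0:
            continue  # no traffic through this resource, so it imposes no cap
        intensity = flops / nbytes  # FLOP/Byte
        # intensity * bandwidth, converted from FLOP/Byte * GiB/s to GFLOP/s
        caps[name] = intensity * peak_gib_per_s * GIB / GFLOP
    bound = min(caps, key=caps.get)
    return bound, caps[bound]
```

For example, 1e9 FLOPs over 1 GiB of HBM traffic at a hypothetical 100 GiB/s peak caps the op at about 100 GFLOP/s, so it is "HBM" bound on a device whose compute peak is 1000 GFLOP/s.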
Roofline efficiency.
Percentage of measured flop rate relative to the hardware limit.
Percentage of measured memory bandwidth relative to the hardware limit.
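Both percentages above compare a measured rate with the matching hardware limit (the peak flop rate, or the peak bandwidth of the relevant resource). A minimal sketch; `percent_of_peak` is a hypothetical helper, and both arguments are assumed to be in the same units (GFLOP/s with GFLOP/s, GiB/s with GiB/s).

```python
def percent_of_peak(measured: float, peak: float) -> float:
    """Measured rate as a percentage of the hardware limit (same units for both)."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return 100.0 * measured / peak
```

A hypothetical op achieving 250 GFLOP/s on a 1000 GFLOP/s device reaches 25% of peak.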
Whether the record is calculated including infeed and outfeed ops.
FLOPs for the record.
Bytes accessed for the record.
Information about the corresponding source code.