A segment of a lane with a given adjacent boundary.
Used in:
The index into the lane's polyline where this lane boundary starts.
The index into the lane's polyline where this lane boundary ends.
The adjacent boundary feature ID of the MapFeature for the boundary. This can either be a RoadLine feature or a RoadEdge feature.
The adjacent boundary type. If the boundary is a road edge instead of a road line, this will be set to TYPE_UNKNOWN.
Box coordinates in image frame.
Dimensions of the box. length: dim x. width: dim y.
The heading of the bounding box (in radians). The heading is the angle required to rotate +x to the surface normal of the box front face. It is normalized to [-pi, pi).
Used in:
Box coordinates in image frame.
Dimensions of the box. length: dim x. width: dim y.
The heading of the bounding box (in radians). The heading is the angle required to rotate +x to the surface normal of the box front face. It is normalized to [-pi, pi).
A breakdown generator defines a way to shard a set of objects such that users can compute metrics for different subsets of objects. Each breakdown generator comes with a unique breakdown generator ID.
Used in:
The breakdown generator ID.
The breakdown generator shard.
The difficulty level.
Used in:
Everything is in one shard.
Shard by object types.
Shard by box center distance.
Shard by the time of day at which the scene occurs.
Shard by location of the scene.
Shard by the weather of the scene.
Shard by object velocity.
All types except SIGN. This is NOT the same as ALL_NS in the leaderboard!! ALL_NS in the leaderboard is the mean of VEHICLE, PED, CYCLIST metrics.
Shard by the object size (the max of length, width, height).
Shard by the corresponding camera.
Used in:
1d array of [f_u, f_v, c_u, c_v, k1, k2, p1, p2, k3]. Note that this intrinsic corresponds to the images after scaling. Camera model: pinhole camera. Lens distortion: radial distortion coefficients k1, k2, k3; tangential distortion coefficients p1, p2. k1, k2, k3, p1, p2 follow the same definitions as OpenCV. https://en.wikipedia.org/wiki/Distortion_(optics) https://docs.opencv.org/2.4/doc/tutorials/calib3d/camera_calibration/camera_calibration.html
Camera frame to vehicle frame.
Camera image size.
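A minimal sketch (assuming numpy; not part of the dataset tooling) of unpacking the intrinsic vector described above into a 3x3 pinhole camera matrix and OpenCV-ordered distortion coefficients:

import numpy as np

def unpack_intrinsics(intrinsic):
    """intrinsic: [f_u, f_v, c_u, c_v, k1, k2, p1, p2, k3] as described above."""
    f_u, f_v, c_u, c_v, k1, k2, p1, p2, k3 = intrinsic
    camera_matrix = np.array([[f_u, 0.0, c_u],
                              [0.0, f_v, c_v],
                              [0.0, 0.0, 1.0]])
    # OpenCV orders distortion coefficients as (k1, k2, p1, p2, k3).
    distortion = np.array([k1, k2, p1, p2, k3])
    return camera_matrix, distortion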
Used in:
All timestamps in this proto are represented as seconds since Unix epoch.
Used in:
JPEG image.
SDC pose.
SDC velocity at 'pose_timestamp' below. The velocity value is represented in the *global* frame. With this velocity, the pose can be extrapolated (see the sketch below):
r(t+dt) = r(t) + dr/dt * dt, where dr/dt = v_{x,y,z}.
dR(t)/dt = W*R(t), where W = SkewSymmetric(w_{x,y,z}).
This differential equation solves to R(t) = exp(Wt)*R(0) if W is constant. When dt is small: R(t+dt) = (I + W*dt)R(t).
Here r(t) = (x(t), y(t), z(t)) is the vehicle location at time t in the global frame, R(t) is the 3x3 rotation matrix from the body frame to the global frame at time t, and SkewSymmetric(x,y,z) is the cross-product matrix defined in: https://en.wikipedia.org/wiki/Cross_product#Conversion_to_matrix_multiplication
Timestamp of the `pose` above.
Rolling shutter params. The following explanation assumes left->right rolling shutter. Rolling shutter cameras expose and read the image column by column, offset by the read out time for each column. The desired timestamp for each column is the middle of the exposure of that column as outlined below for an image with 3 columns:
------time------>
|---- exposure col 1----| read |
-------|---- exposure col 2----| read |
--------------|---- exposure col 3----| read |
^trigger time                                    ^readout end time
^time for row 1 (= middle of exposure of row 1)
^time image center (= middle of exposure of middle row)
Shutter duration in seconds. Exposure time per column.
Time when the sensor was triggered and when last readout finished. The difference between trigger time and readout done time includes the exposure time and the actual sensor readout time.
Panoptic segmentation labels for this camera image. NOTE: Not every image has panoptic segmentation labels.
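A minimal sketch (assuming numpy) of the small-dt pose extrapolation given above for the velocity field: r(t+dt) = r(t) + v*dt and R(t+dt) = (I + W*dt)R(t), with W the skew-symmetric cross-product matrix of the angular velocity.

import numpy as np

def skew_symmetric(w):
    """Cross-product matrix of w = (wx, wy, wz)."""
    wx, wy, wz = w
    return np.array([[0.0, -wz,  wy],
                     [ wz, 0.0, -wx],
                     [-wy,  wx, 0.0]])

def extrapolate_pose(r, R, v, w, dt):
    """r: position (3,), R: body-to-global rotation (3, 3), v/w: linear/angular velocity."""
    r_next = np.asarray(r) + np.asarray(v) * dt
    R_next = (np.eye(3) + skew_symmetric(w) * dt) @ R
    return r_next, R_next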
The camera labels associated with a given camera image. This message indicates the ground truth information for the camera image recorded by the given camera. If there are no labeled objects in the image, then the labels field is empty.
Used in:
(message has no fields)
Used in:
Semantic classes for the camera segmentation labels.
(message has no fields)
Anything that does not fit the other classes or is too ambiguous to label.
The Waymo vehicle.
Small vehicle such as a sedan, SUV, pickup truck, minivan or golf cart.
Large vehicle that carries cargo.
Large vehicle that carries more than 8 passengers.
Large vehicle that is not a truck or a bus.
Bicycle with no rider.
Motorcycle with no rider.
Trailer attached to another vehicle or horse.
Pedestrian. Does not include objects associated with the pedestrian, such as suitcases, strollers or cars.
Bicycle with rider.
Motorcycle with rider.
Birds, including ones on the ground.
Animal on the ground such as a dog, cat, cow, etc.
Cone or short pole related to construction.
Permanent horizontal and vertical lamp pole, traffic sign pole, etc.
Large object carried/pushed/dragged by a pedestrian.
Sign related to traffic, including front and back facing signs.
The box that contains traffic lights regardless of front or back facing.
Permanent building and walls, including solid fences.
Drivable road with proper markings, including parking lots and gas stations.
Marking on the road that is parallel to the ego vehicle and defines lanes.
All markings on the road other than lane markers.
Paved walkable surface for pedestrians, including curbs.
Vegetation including tree trunks, tree branches, bushes, tall grasses, flowers and so on.
The sky, including clouds.
Other horizontal surfaces that are drivable or walkable.
Object that is not permanent in its current position and does not belong to any of the above classes.
Object that is permanent in its current position and does not belong to any of the above classes.
Used in:
Segmentation label for a camera.
These must be set when evaluating on the leaderboard. This should be set to Context.name defined in dataset.proto::Context.
This should be set to Frame.timestamp_micros defined in dataset.proto::Frame.
The camera associated with this label.
Used in:
Panoptic (instance + semantic) segmentation labels for a given camera image. Associations can also be provided between each instance ID and a globally unique ID across all frames.
Used in:
The value used to separate instance_ids from different semantic classes. See the panoptic_label field for how this is used. Must be set to be greater than the maximum instance_id.
A uint16 png encoded image, with the same resolution as the corresponding camera image. Each pixel contains a panoptic segmentation label, which is computed as: semantic_class_id * panoptic_label_divisor + instance_id. We set instance_id = 0 for pixels for which there is no instance_id. NOTE: Instance IDs in this label are only consistent within this camera image. Use instance_id_to_global_id_mapping to get cross-camera consistent instance IDs.
The sequence id for this label. The above instance_id_to_global_id_mapping is only valid with other labels with the same sequence id.
A uint8 png encoded image, with the same resolution as the corresponding camera image. The value on each pixel indicates the number of cameras that overlap with this pixel. Used for the weighted Segmentation and Tracking Quality (wSTQ) metric.
A mapping between each panoptic label with an instance_id and a globally unique ID across all frames within the same sequence. This can be used to match instances across cameras and over time. i.e. instances belonging to the same object will map to the same global ID across all frames in the same sequence. NOTE: These unique IDs are not consistent with other IDs in the dataset, e.g. the bounding box IDs.
Used in:
If false, the corresponding instance will not have consistent global ids between frames.
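A minimal sketch (assuming numpy) of decoding the panoptic_label encoding described above, semantic_class_id * panoptic_label_divisor + instance_id, once the uint16 PNG has been decoded into an integer array:

import numpy as np

def decode_panoptic_label(panoptic_label, panoptic_label_divisor):
    """panoptic_label: integer array decoded from the uint16 PNG."""
    panoptic_label = np.asarray(panoptic_label)
    semantic_class = panoptic_label // panoptic_label_divisor
    instance_id = panoptic_label % panoptic_label_divisor  # 0 means no instance
    return semantic_class, instance_id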
Panoptic segmentation metrics. Weighted Segmentation and Tracking Quality (wSTQ).
weighted Association Quality.
mean Intersection over Union.
User reported number of frames between inference runs.
Runtime for the method in milliseconds.
Next ID: 10.
This must be set as the full email used to register at waymo.com/open.
This name needs to be short, descriptive and unique. Only the latest result of the method from a user will show up on the leaderboard.
Link to paper or other link that describes the method.
The number of frames skipped between each prediction during inference. Usually 0 (inference on every frame) or 1 (inference on every other frame). e.g. the validation and test groundtruth is provided with frame_dt = 1.
(Optional) The time for the method to run in ms.
Inference results.
Camera tokens for a single camera sensor.
Used in:
Camera sensor name.
Camera tokens are a sequence of integers corresponding to codebook indices.
A set of predictions for a single scenario.
Used in:
The unique ID of the scenario being predicted. This ID must match the scenario_id field in the test or validation set tf.Example or scenario proto corresponding to this set of predictions.
The predictions for the scenario. For the motion prediction challenge, populate the predictions field. For the interaction prediction challenge, populate the joint_predictions_field.
Single object predictions. This must be populated for the motion prediction challenge.
Joint predictions for the interacting objects. This must be populated for the interaction prediction challenge.
Lidar data of a frame.
Used in:
The Lidar data for each timestamp.
Laser calibration data has the same length as that of lasers.
Poses of the SDC corresponding to the track states for each step in the scenario, similar to the one in the Frame proto.
Compressed Laser data.
Used in:
Range image is a 2d tensor. The first dimension (rows) represents pitch. The second dimension (columns) represents yaw. Zlib compressed range images include the raw range image: a range image with a non-empty 'range_image_pose_delta_compressed', which gives the vehicle pose of each range image cell. NOTE: 'range_image_pose_delta_compressed' is only populated for the first range image return. The second return has exactly the same range image pose as the first one.
Used in:
Zlib compressed [H, W, 4] serialized DeltaEncodedData message version which stores MatrixFloat.
  MatrixFloat range_image;
  range_image.ParseFromString(val);
Inner dimensions are:
  * channel 0: range
  * channel 1: intensity
  * channel 2: elongation
  * channel 3: is in any no label zone.
Zlib compressed [H, W, 6] serialized DeltaEncodedData message version which stores MatrixFloat. To decompress (please see the documentation for lidar delta encoding):
  string val = delta_encoder.decompress(range_image_pose_compressed);
  MatrixFloat range_image_pose;
  range_image_pose.ParseFromString(val);
Inner dimensions are [roll, pitch, yaw, x, y, z], representing a transform from vehicle frame to global frame for every range image pixel. This is ONLY populated for the first return. The second return is assumed to have exactly the same range_image_pose_compressed. The roll, pitch and yaw are specified as 3-2-1 Euler angle rotations, meaning that rotating from the navigation to vehicle frame consists of a yaw, then pitch and finally roll rotation about the z, y and x axes respectively. All rotations use the right hand rule and are positive in the counter clockwise direction.
Configuration to compute detection/tracking metrics.
Score cutoffs used to remove predictions with lower Object::score during matching in order to compute precision-recall pairs at different operating points.
If `score_cutoffs` above is not set, the cutoffs are generated based on the score distributions in the predictions and produce `num_desired_score_cutoffs`. NOTE: this field is to be deprecated. Manually set score_cutoffs above to [0:0.01:1]. TODO: clean this up.
Breakdown generator IDs. Note that users only need to specify the IDs but NOT other information about this generator such as number of shards.
This has the same size as breakdown_generator_ids. Each entry indicates the set of difficulty levels to be considered for each breakdown generator.
Indexed by label type. Size = Label::TYPE_MAX+1. The thresholds must be within [0.0, 1.0].
Desired recall delta when sampling the P/R curve to compute mean average precision.
Users do not need to modify the following features. If set, all precisions below this value are considered 0.
Any matching with a heading accuracy lower than this is considered a false match.
When enabled, the details in the matching such as index of the false positives, false negatives or true positives will be included.
Longitudinal error tolerant (LET) metrics config for Camera-Only (Mono) 3D Detection. By enabling this metric, the prediction-groundtruth matching becomes more tolerant to longitudinal noise rather than relying only on IoU. The tolerance is larger at long range, but only along the line of sight from the sensor origin.
Used in:
When enabled, calculate the longitudinal error tolerant 3D AP (LET-3D-AP).
Location of the sensor used to infer the predictions (e.g., camera). The location is relative to the vehicle origin. It is used to translate the centers of prediction and ground truth boxes to the sensor coordinate system so that the range to the sensor origin can be calculated correctly.
The percentage of allowed longitudinal error for a given ground truth object. The final longitudinal tolerance tol_lon in meters, given a ground truth object with range r_gt, is computed as: tol_lon = max(longitudinal_tolerance_percentage * r_gt, min_longitudinal_tolerance_meter), where min_longitudinal_tolerance_meter is introduced to handle near-range ground truth objects so that they have a minimum longitudinal error tolerance in meters. A prediction bounding box can be matched with a ground truth bounding box only if the range error between them is less than the tolerance.
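A minimal sketch of the tolerance formula above (the numeric values in the example calls are illustrative, not the challenge defaults):

def longitudinal_tolerance(r_gt,
                           longitudinal_tolerance_percentage,
                           min_longitudinal_tolerance_meter):
    """tol_lon = max(longitudinal_tolerance_percentage * r_gt, minimum tolerance)."""
    return max(longitudinal_tolerance_percentage * r_gt,
               min_longitudinal_tolerance_meter)

# Example with a 10% tolerance and a 0.5 m floor (assumed values):
print(longitudinal_tolerance(40.0, 0.1, 0.5))  # 4.0 m at 40 m range
print(longitudinal_tolerance(3.0, 0.1, 0.5))   # 0.5 m floor at 3 m range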
Describes how a prediction box aligns with a ground truth box to minimize the longitudinal error.
Used in:
No alignment is performed.
The center of the prediction box moves along the line of sight such that it has the closest distance to the center of the ground truth box.
The center of the prediction box moves to the center of the ground truth box, which means no localization error after alignment.
The center of the prediction box moves along the line of sight such that it has the closest distance to the center of the ground truth box. Same as `TYPE_RANGE_ALIGNED` except this only applies if the prediction is beyond the ground truth. Example: given O is sensor origin, G ground truth center, and P prediction center (O -> G [P]) P will only be moved if it is beyond G in reference to O.
The center of the prediction box moves along the line of sight such that it has the closest distance to the center of the ground truth box. Same as `TYPE_RANGE_ALIGNED` except this only applies if the prediction is before the ground truth in reference to the sensor origin. Example: given O is the sensor origin, G the ground truth center, and P(1/2) the prediction center ([P1] O -> [P2] -> G), P will only be moved if it is before G in reference to O.
The center of the prediction box moves along the line of sight such that it has the closest distance to the center of the ground truth box. Same as `TYPE_RANGE_ALIGNED` except this only applies if the prediction is between the sensor origin and ground truth. Example: given O is sensor origin, G ground truth center, and P prediction center (O -> [P] -> G ) P will only be moved if it is between G and O.
Location in 3D space described in a Cartesian coordinate system.
Used in:
Used in:
A unique name that identifies the frame sequence.
Some stats for the run segment used.
Used in:
Day, Dawn/Dusk, or Night, determined from sun elevation.
Human readable location (e.g. CHD, SF) of the run segment.
Currently either Sunny or Rain.
Used in:
The number of unique objects with the type in the segment.
Used in:
The polygon defining the outline of the crosswalk. The polygon is assumed to be closed (i.e. a segment exists between the last point and the first point).
Delta encoded data structure. The metadata, data, mask, and residuals are concatenated and compressed via zlib: compressed_bytes = zlib.compress(metadata + data_bytes + mask_bytes + residuals_bytes). The range_image_delta_compressed and range_image_pose_delta_compressed fields in CompressedRangeImage are both encoded using this method.
Used in:
Number of false positives.
Number of true positives.
Number of false negatives.
If set, will include the ids of the fp/tp/fn objects. Each element corresponds to one frame of matching.
Sum of heading accuracy (ha) for all TPs.
Sum of longitudinal affinity for all TPs.
The score cutoff used to compute this measurement. Optional.
Detailed information regarding the results.
Used in:
False positive prediction ids.
False negative ground truth ids.
True positive ground truth ids. Should have the same length as tp_pr_ids and tp_ious. Each pair of ids at the same index corresponds to a matched ground truth object and prediction object.
True positive prediction ids.
IoU values of the true positive pairs.
Heading accuracies of the true positive pairs.
Longitudinal affinities of the true positive pairs.
Used in:
The breakdown the detection measurements are computed for.
Heading accuracy weighted mean average precision.
Longitudinal affinity weighted mean average precision.
The breakdown the detection metrics are computed for.
Raw measurements.
A set of difficulty levels.
Used in:
If no levels are set, the highest difficulty level is assumed.
Used in:
The polygon defining the outline of the driveway region. The polygon is assumed to be closed (i.e. a segment exists between the last point and the first point).
The dynamic map information at a single time step.
Used in:
The traffic signal states for all observed signals at this time step.
Used in:
The timestamp associated with the dynamic feature data.
The set of traffic signal states for the associated time step.
Message packaging a full submission to the challenge.
The set of trajectories to evaluate. One entry should exist for every frame in the test set.
Identifier of the submission type. Has to be set for the submission to be valid.
This must be set as the full email used to register at waymo.com/open.
This name needs to be short, descriptive and unique. Only the latest result of the method from a user will show up on the leaderboard.
Author information.
A brief description of the method.
Link to paper or other link that describes the method.
Set this to true if your model used publicly available open-source LLM/VLM(s) for pre-training. This field is now REQUIRED for a valid submission.
If any open-source model was used, specify their names and configuration.
Specify an estimate of the number of parameters of the model used to generate this submission. The number must be specified as an integer number followed by a multiplier suffix (from the set [K, M, B, T, ...], e.g. "200K"). This field is now REQUIRED for a valid submission.
The challenge submission type.
Used in:
A submission for the Waymo open dataset end-to-end driving challenge.
This proto contains the Waymo Open Dataset End-to-End Driving (E2ED) data format.
WOD frame object populated with camera image, calibration, and metadata. Populated fields:
frame.context
  .name = unique identifier for this frame.
  .camera_calibrations = calibration metadata for all cameras.
  All other fields in `frame.context` are unused.
frame.timestamp_micros = current frame timestamp.
frame.images = camera images.
All other fields in `frame` are unused.
For details about frame.context.camera_calibrations and frame.images, see the CameraCalibration and CameraImage protos.
t = (0, 5s] future log states at 4Hz. Only position fields are populated. Future position x,y coords are used as prediction targets. z coords are included for visualization, but are not used as prediction targets.
t = (-4s, 0] past history states at 4Hz.
Driving intent of the ego-vehicle at this timestep.
Future trajectories with human-labeled rater scores. Only x,y position fields are populated, along with the rated score. This field is valid for only a subset of frames. For these frames, there are up to 3 rated trajectories. In all other frames, this field is marked as invalid with assigned rater scores of -1 or left empty. Valid scores range from [0, 10].
The final score averaged over all scenario clusters.
Rater feedback scores for each scenario cluster.
First, we compute per frame ADE using the ground truth trajectory with the highest rater score. Then, we average the ADE scores over all frames in the test set.
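A hedged sketch (assuming numpy) of the per-frame ADE described above: the mean Euclidean distance between predicted and ground-truth future (x, y) positions, computed against the rated trajectory with the highest score and then averaged over all frames.

import numpy as np

def average_displacement_error(pred_xy, gt_xy):
    """pred_xy, gt_xy: [num_future_steps, 2] arrays of (x, y) positions."""
    errors = np.linalg.norm(np.asarray(pred_xy) - np.asarray(gt_xy), axis=-1)
    return float(np.mean(errors))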
(message has no fields)
Driving intent of the ego-vehicle at a given timestep.
Used in:
Used in:
Position in meters. Right-handed coordinate system. +x = forward, +y = left, +z = up. The origin (0, 0, 0) is at the middle of the ego vehicle's rear axle.
Velocity in m/s.
Acceleration in m/s^2.
Only populated for trajectories with human-labeled scores. Valid scores range from [0, 10], inclusive.
Used in:
This context is the same for all frames belonging to the same driving run segment. Use context.name to identify frames belonging to the same driving segment. We do not store all frames from one driving segment in one proto to avoid huge protos.
Frame start time, which is the timestamp of the first top LiDAR scan within this frame. Note that this timestamp does not correspond to the provided vehicle pose (pose).
Frame vehicle pose. Note that unlike in CameraImage, the Frame pose does not correspond to the provided timestamp (timestamp_micros). Instead, it roughly (but not exactly) corresponds to the vehicle pose in the middle of the given frame. The frame vehicle pose defines the coordinate system which the 3D laser labels are defined in.
The camera images.
The LiDAR sensor data.
Native 3D labels that correspond to the LiDAR sensor data. The 3D labels are defined w.r.t. the frame vehicle pose coordinate system (pose).
The native 3D LiDAR labels (laser_labels) projected to camera images. A projected label is the smallest image axis aligned rectangle that can cover all projected points from the 3d LiDAR label. The projected label is ignored if the projection is fully outside a camera image. The projected label is clamped to the camera image if it is partially outside.
Native 2D camera labels. Note that if a camera identified by CameraLabels.name has an entry in this field, then it has been labeled, even though it is possible that there are no labeled objects in the corresponding image, which is identified by a zero sized CameraLabels.labels.
No label zones in the *global* frame.
Map features. Only the first frame in a segment will contain map data. This field will be empty for other frames as the map is identical for all frames.
Map pose offset. This offset must be added to lidar points from this frame to compensate for pose drift and align with the map features.
Camera tokens for all sensors of a frame.
Used in:
Camera tokens for all sensors in a frame.
Used in:
The unique identifier for this frame. This should match the name field in the Context proto (E2EDFrame.frame.context.name).
The ego-vehicle future trajectory prediction for this frame.
Used in:
A set of up to 6 predictions with varying confidences - all for the same pair of objects. All prediction entries must contain trajectories for the same set of objects or an error will be returned. Any joint predictions past the first six will be discarded.
Used in:
Collection of simulated object trajectories defining a full simulated scene. This needs to be the product of a joint simulation of all the included objects. An object is to be included if it is valid in the last history step of the original scenario (11th step).
A message containing a prediction for either a single object or a joint prediction for a set of objects.
Used in:
The trajectories for each object in the set being predicted. This may contain a single trajectory for a single object or a set of trajectories representing a joint prediction of a set of objects.
An optional confidence measure for this prediction. These should not be normalized across the set of trajectories.
Used in:
Object ID.
Difficulty level for detection problem.
Difficulty level for tracking problem.
The total number of lidar points in this box.
The total number of top lidar points in this box.
Used if the Label is a part of `Frame.laser_labels`.
Used if the Label is a part of `Frame.camera_labels`.
Used by Lidar labels to store in which camera it is mostly visible.
Used by Lidar labels to store a camera-synchronized box corresponding to the camera indicated by `most_visible_camera_name`. Currently, the boxes are shifted to the time when the most visible camera captures the center of the box, taking into account the rolling shutter of that camera.
Specifically, given the object box living at the start of the Open Dataset frame (t_frame) with center position (c) and velocity (v), we aim to find the camera capture time (t_capture) when the camera indicated by `most_visible_camera_name` captures the center of the object. To this end, we solve the rolling shutter optimization considering both ego and object motion:
  t_capture = image_column_to_time(
      camera_projection(c + v * (t_capture - t_frame),
                        transform_vehicle(t_capture - t_ref),
                        cam_params)),
where transform_vehicle(t_capture - t_frame) is the vehicle transform from a pose reference time t_ref to t_capture considering the ego motion, and cam_params is the camera extrinsic and intrinsic parameters.
We then move the label box to t_capture by updating the center of the box as follows:
  c_camera_synced = c + v * (t_capture - t_frame),
while keeping the box dimensions and heading direction. We use the camera_synced_box as the ground truth box for the 3D Camera-Only Detection Challenge. This makes the assumption that the users provide the detection at the same time as the most visible camera captures the object center.
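A minimal sketch (assuming numpy) of only the final center update above; solving the rolling shutter optimization for t_capture is not shown.

import numpy as np

def camera_synced_center(c, v, t_capture, t_frame):
    """c: box center at t_frame (3,), v: box velocity (3,), times in seconds."""
    # Box dimensions and heading are kept unchanged.
    return np.asarray(c) + np.asarray(v) * (t_capture - t_frame)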
Information to cross reference between labels for different modalities.
Used in:
Currently only CameraLabels with class `TYPE_PEDESTRIAN` store information about associated lidar objects.
Upright box, zero pitch and roll.
Used in:
Box coordinates in vehicle frame.
Dimensions of the box. length: dim x. width: dim y. height: dim z.
The heading of the bounding box (in radians). The heading is the angle required to rotate +x to the surface normal of the box front face. It is normalized to [-pi, pi).
Used in:
7-DOF 3D (a.k.a upright 3D box).
5-DOF 2D. Mostly used for laser top down representation.
Axis aligned 2D. Mostly used for image.
The difficulty level of this label. The higher the level, the harder it is.
Used in:
Used in:
Used in:
Used in:
The speed limit for this lane.
True if the lane interpolates between two other lanes.
The polyline data for the lane. A polyline is a list of points with segments defined between consecutive points.
A list of IDs for lanes that this lane may be entered from.
A list of IDs for lanes that this lane may exit to.
The boundaries to the left of this lane. There may be different boundary types along this lane. Each BoundarySegment defines a section of the lane with a given boundary feature to the left. Note that some lanes do not have any boundaries (i.e. lane centers in intersections).
The boundaries to the right of this lane. See left_boundaries for details.
A list of neighbors to the left of this lane. Neighbor lanes include only adjacent lanes going the same direction.
A list of neighbors to the right of this lane. Neighbor lanes include only adjacent lanes going the same direction.
Type of this lane.
Used in:
Used in:
The feature ID of the neighbor lane.
The self adjacency segment. The other lane may be a neighbor for only part of this lane. These indices define the points within this lane's polyline for which feature_id is a neighbor. If the lanes are neighbors at disjoint places (e.g., a median between them appears and then goes away), multiple neighbors will be listed. A lane change can only happen from this segment of this lane into the segment of the neighbor lane defined by neighbor_start_index and neighbor_end_index.
The neighbor adjacency segment. These indices define the valid portion of the neighbor lane's polyline where that lane is a neighbor to this lane. A lane change can only happen into this segment of the neighbor lane from the segment of this lane defined by self_start_index and self_end_index.
A list of segments within the self adjacency segment that have different boundaries between this lane and the neighbor lane. Each entry in this field contains the boundary type between this lane and the neighbor lane along with the indices into this lane's polyline where the boundary type begins and ends.
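A hedged sketch of using the adjacency segments described above: a lane change from a given point of this lane's polyline is only possible if that point index falls inside the neighbor's self adjacency segment (field names as in the descriptions above).

def can_change_into_neighbor(neighbor, polyline_index):
    """neighbor: a LaneNeighbor-like object with self_start_index/self_end_index."""
    return neighbor.self_start_index <= polyline_index <= neighbor.self_end_index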
Used in:
Used in:
If non-empty, the beam pitch (in radians) is non-uniform. When constructing a range image, this mapping is used to map from beam pitch to range image row. If this is empty, we assume a uniform distribution.
beam_inclination_{min,max} (in radians) are used to determine the mapping.
Lidar frame to vehicle frame.
'Laser' is used interchangeably with 'Lidar' in this file.
(message has no fields)
Used in:
The full set of map features.
A set of dynamic states per time step. These are ordered in consecutive time steps.
Used in:
A unique ID to identify this feature.
Type specific data.
Used in:
Position in meters. The origin is an arbitrary location.
Different types of matchers can be supported. Each matcher has a unique ID.
(message has no fields)
Used in:
The Hungarian algorithm based matching that maximizes the sum of IoUs of all matched pairs. Detection scores have no effect on this matcher. https://en.wikipedia.org/wiki/Hungarian_algorithm
A COCO-style matcher: matches detections (ordered by score) one by one to the ground truth with the largest IoU.
TEST ONLY.
Row-major matrix. Requires: data.size() = product(shape.dims()).
Used in:
Row-major matrix. Requires: data.size() = product(shape.dims()).
Used in:
Dimensions for the Matrix messages defined below. Must not be empty. The order of entries in 'dims' matters, as it indicates the layout of the values in the tensor in-memory representation. The first entry in 'dims' is the outermost dimension used to lay out the values; the last entry is the innermost dimension. This matches the in-memory layout of row-major matrices.
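A minimal sketch (assuming numpy) of the row-major layout described above: the flat data buffer is reshaped with the outermost dimension listed first in dims.

import numpy as np

dims = [2, 3]              # example MatrixShape.dims
data = [0, 1, 2, 3, 4, 5]  # len(data) == product(dims)
matrix = np.asarray(data).reshape(dims)
# matrix[0] == [0, 1, 2] and matrix[1] == [3, 4, 5] under row-major layout.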
Metadata used for delta encoder.
Used in:
Range image's shape information in the compressed data.
Range image quantization precision for each range image channel.
A set of ScenarioPredictions protos. A ScenarioPredictions proto for each example in the test or validation set must be included for a valid submission.
This must be set as the full email used to register at waymo.com/open.
This name needs to be short, descriptive and unique. Only the latest result of the method from a user will show up on the leaderboard.
Author information.
A brief description of the method.
Link to paper or other link that describes the method.
The challenge submission type.
Set this to true if your model uses the lidar data provided in the motion dataset. This field is now REQUIRED for a valid submission.
Set this to true if your model uses the camera data provided in the motion dataset. This field is now REQUIRED for a valid submission.
Set this to true if your model used publicly available open-source LLM/VLM(s) for pre-training. This field is now REQUIRED for a valid submission.
If any open-source model was used, specify their names and configuration.
Specify an estimate of the number of parameters of the model used to generate this submission. The number must be specified as an integer number followed by a multiplier suffix (from the set [K, M, B, T, ...], e.g. "200K"). This field is now REQUIRED for a valid submission.
The set of scenario predictions to evaluate. One entry should exist for every record in the test set.
Used in:
A submission for the Waymo open dataset motion prediction challenge.
A submission for the Waymo open dataset interaction prediction challenge.
A configuration for converting Scenario protos to tf.Example protos.
The maximum number of agents to populate in the tf.Example.
The maximum number of modeled agents to populate in the tf.Example. This field should not be changed from 8 for open dataset motion challenge uses.
The number of past steps (including the current step) in each trajectory. The defaults correspond to the open motion dataset data: 11 past steps (including the current step) and 80 future steps. Changing these values will make the data incompatible with the open dataset motion challenges.
The number of future steps in each trajectory.
The maximum number of map points to store in each example. This defines the sizes of the roadgraph_samples/* tensors. Any additional samples in the source Scenario protos will be truncated. Lane centers and lane boundaries are prioritized over other types. This parameter along with the polyline_sample_spacing and polygon_sample_spacing fields will determine if points are truncated. For reference, in the current waymo open motion dataset the vast majority of Scenarios have less than 60,000 samples at 0.5m spacing (not including polygon samples). Only a few outliers exceed this where the largest has approximately 75,000 samples at 0.5m.
The input source polyline sample spacing. Do not change this from the default when using open dataset input data.
The roadgraph points will be re-sampled with this spacing. Note that decreasing this parameter may require an increase in the max_roadgraph_samples parameter to avoid truncating roadgraph data. If this is set to <= 0, the value in source_polyline_spacing will be used.
Features like speed bumps and crosswalks are defined only by polygon corner points. If this value is > 0, samples along the sides of the polygons will be added, spaced apart by this value. If this value is <= 0, only the polygon vertices will be added as sample points. Note that decreasing this parameter may require an increase in the max_roadgraph_samples parameter to avoid truncating roadgraph data.
The maximum number of traffic light points per time step.
A set of metrics broken down by measurement time step and object type.
Used in:
The object type these metrics were filtered by. All metrics below are only for this type of object. If not set, the metrics are aggregated for all types.
The prediction time step used to compute the metrics. The metrics are computed as if this was the last time step in the trajectory.
For each object, the average difference from the ground truth in meters up to the measurement time step is computed for all trajectory predictions for that object. The value with the minimum error is kept (minADE). The resulting values are accumulated over all predicted objects in all scenarios.
For each object, the error for a given trajectory at the measurement time step is computed for all trajectory predictions for that object. The value with the minimum error is kept (minFDE). The mean of all measurements in the accumulator is the average minFDE.
The miss rate is calculated by computing the displacement from ground truth at the measurement time step. If the displacement is greater than the miss rate threshold it is considered a miss. The number of misses for all objects divided by the total number of objects is equal to the miss rate.
Overlaps are detected as any intersection of the bounding boxes of the highest confidence predicted object trajectory with those of any other valid object at the same time step for time steps up to the measurement time step. Only objects that were valid at the prediction time step are considered. If one or more overlaps occur up to the measurement step it is considered a single overlap measurement. The total number of overlaps divided by the total number of objects is equal to the overall overlap rate.
The mAP metric is computed by accumulating true and false positive measurements based on thresholding the FDE at the measurement time step over all object predictions. The measurements are separated into buckets based on the trajectory shape. The mean average precision of each bucket is computed as described in "The PASCAL Visual Object Classes (VOC) Challenge" (Everingham, 2009, p. 11), using the newer method that includes all samples in the computation, consistent with the current PASCAL challenge metrics. The mean of the AP value across all trajectory shape buckets is equal to this mAP value.
Same as mean_average_precision but duplicate true positives per ground truth trajectory are ignored rather than counted as false positives.
Custom metrics (those not already included above) can be stored in the following map, identified by name.
Configuration to compute motion metrics.
The sampling rates for the scenario track data and the prediction data. The track sampling must be an integer multiple of the prediction sampling.
The number of samples for both the history and the future track data. Tracks must be of length track_history_samples + track_future_samples + 1 (one extra for the current time step). Predictions must be length (track_history_samples + track_future_samples) * prediction_steps_per_second / track_steps_per_second (current time is not included in the predictions). IMPORTANT: Note that the first element of the prediction corresponds to time (1.0 / prediction_steps_per_second) NOT time 0.
Parameters for miss rate and mAP threshold scaling as a function of the object initial speed. If the object speed is below speed_lower_bound, the scale factor for the thresholds will equal speed_scale_lower. Above speed_upper_bound, the scale factor will equal speed_scale_upper. In between the two bounds, the scale factor will be interpolated linearly between the lower and upper scale factors. Both the lateral and longitudinal displacement thresholds for miss rate and mAP will be scaled by this factor before the thresholds are applied.
The prediction samples and parameters used to compute metrics at a specific time step. Time in seconds can be computed as (measurement_step + 1) / prediction_steps_per_second. Metrics are computed for each step in the list as if the given measurement_step were the last step in the predicted trajectory.
The maximum number of predictions to use as K in all min over K metrics computations.
Used in:
The prediction step to use to measure all metrics. The metrics are computed as if this were the last step in the predicted trajectory. Time in seconds can be computed as (measurement_step + 1) / prediction_steps_per_second.
The threshold for lateral distance error in meters for miss rate and mAP computations.
The threshold for longitudinal distance error in meters for miss rate and mAP computations.
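A minimal sketch of the measurement-step timing described above, using an illustrative (not necessarily official) prediction rate:

prediction_steps_per_second = 2  # assumed example rate

for measurement_step in (5, 9, 15):
    # Metrics at this step treat it as the last step of the predicted trajectory.
    time_seconds = (measurement_step + 1) / prediction_steps_per_second
    print(f"measurement_step={measurement_step} -> t={time_seconds:.1f}s")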
Used in:
A set of predictions (or joint predictions) with varying confidences - all for the same object or group of objects. All prediction entries must contain trajectories for the same set of objects or an error will be returned. Any predictions past the max number of predictions set in the metrics config will be discarded.
Used in:
This is a wrapper on waymo.open_dataset.Label. We have another proto to add more information such as class confidence for metrics computation.
Used in:
The confidence within [0, 1] of the prediction. Defaults to 1.0 for ground truths.
Whether this object overlaps with any NLZ (no label zone). Users do not need to set this field when evaluating on the eval leaderboard as the leaderboard does this computation.
These must be set when evaluating on the leaderboard. This should be set to Context.name defined in dataset.proto::Context.
This should be set to Frame.timestamp_micros defined in dataset.proto::Frame.
Optionally, if this object is used for camera image labels or predictions, this needs to be populated to uniquely identify which image this object is for.
Used in:
Coordinates of the center of the object bounding box.
The dimensions of the bounding box in meters.
The yaw angle in radians of the forward direction of the bounding box (the vector from the center of the box to the middle of the front box segment) counter clockwise from the X-axis (right hand system about the Z axis). This angle is normalized to [-pi, pi).
The velocity vector in m/s. This vector direction may be slightly different from the heading of the bounding box.
False if the state data is invalid or missing.
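A minimal sketch of normalizing a heading/yaw angle to [-pi, pi), as stated for the heading field above:

import math

def normalize_heading(heading):
    """Wraps an angle in radians to the interval [-pi, pi)."""
    return (heading + math.pi) % (2.0 * math.pi) - math.pi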
Used in:
The ID of the object being predicted. This must match the object_id field in the test or validation set tf.Example or scenario proto corresponding to this prediction. Note this must be the same as the object_id in the scenario track or the state/id field in the tf.Example, not the track index.
The trajectory for the object.
Used in:
Users do not need to set this when evaluating on the leaderboard.
Occupancy and flow metrics averaged over all prediction waypoints. Please refer to occupancy_flow_metrics.py for an implementation of these metrics.
The metrics stored in this proto are averages over all waypoints. However, blank waypoints, which contain no occupancy or flow ground-truth, are excluded when computing the metrics. The following fields record the number of waypoints which are used for computing each of the 3 categories of metrics.
Treating occupancy in each grid cell as an independent binary prediction, this metric measures the area under the precision-recall curve of all grid cells in the future occupancy of currently-observed vehicles.
Measures the soft intersection-over-union between ground-truth bounding boxes and predicted future occupancy grids of currently-observed vehicles.
Same as above, but for currently-occluded vehicles. NOTE: All agents in future timesteps are divided into the two categories (currently-observed and currently-occluded) depending on whether the agent is present (valid) at the current timestep. Agents which are not valid at the current time, but become valid later are considered currently- occluded. The model is expected to predict the two categories separately, and the occupancy metrics are also computed separately for the two categories.
End-point-error between ground-truth and predicted flow fields, averaged over all cells in the grid. Flow end-point-error measures the Euclidean distance between the predicted and ground-truth flow vectors.
Configuration for all parameters defining the occupancy flow task.
The following default values reflect the size of sequences in the Waymo Open Motion Dataset.
Number of predicted waypoints (snapshots over time) for each scene. The waypoints uniformly divide the future timesteps (num_future_steps) into num_waypoints equal intervals.
When cumulative_waypoints is false, ground-truth waypoints are created by sampling individual timesteps from the future timesteps. For example, for num_future_steps = 80 and num_waypoints = 8, ground-truth occupancy is taken from timesteps {10, 20, 30, ..., 80}, and ground-truth flow fields are constructed from the displacements between timesteps {0 -> 10, 10 -> 20, ..., 70 -> 80} where 0 is the current time and 1-80 are the future timesteps. When cumulative_waypoints is true, ground-truth waypoints are created by aggregating occupancy and flow over all the timesteps that fall inside each waypoint. For example, the last waypoint's occupancy is constructed by accumulating occupancy over timesteps [71, 72, ..., 80] and the last waypoint's flow field is constructed by averaging all 10 flow fields between timesteps [61 -> 71, 62 -> 72, ..., 70 -> 80]. The code provided in occupancy_flow_data.py implements the above logic to construct the ground truth (see also the sketch below).
Whether to rotate the scene such that the SDC is heading up in ground-truth grids.
Occupancy grids are organized [grid_height_cells, grid_width_cells, 1]. Flow fields are organized as [grid_height_cells, grid_width_cells, 2].
The ground-truth occupancy and flow for all future waypoints are rendered with reference to the location of the autonomous vehicle at the current time. The autonomous vehicle's current location is mapped to the following coordinates.
Prediction scale. With a value of 3.2, the 256x256 grid covers an 80m x 80m area of the world.
Ground-truth occupancy grids are constructed by sampling the specified number of points along the length and width from the interior of agent boxes and scattering those points on the grid. Similarly, ground-truth flow fields are constructed from the (dx, dy) displacements of such points over time.
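A minimal sketch of the non-cumulative waypoint sampling described above for cumulative_waypoints = false, using the example values from the description (num_future_steps = 80, num_waypoints = 8):

num_future_steps = 80
num_waypoints = 8

step_size = num_future_steps // num_waypoints
# Ground-truth occupancy is taken from future timesteps {10, 20, ..., 80}.
occupancy_steps = [step_size * (i + 1) for i in range(num_waypoints)]
# Flow fields are displacements over {0 -> 10, 10 -> 20, ..., 70 -> 80}.
flow_pairs = [(step - step_size, step) for step in occupancy_steps]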
Non-self-intersecting 2d polygons. This polygon is not necessarily convex.
Used in:
A globally unique ID.
Used in:
A list of predictions for the required objects in the scene. These must exactly match the objects in the tracks_to_predict field of the test scenario or tf.Example.
Range image is a 2d tensor. The first dim (row) represents pitch. The second dim represents yaw. There are two types of range images:
1. Raw range image: a range image with a non-empty 'range_image_pose_compressed' which tells the vehicle pose of each range image cell.
2. Virtual range image: a range image with an empty 'range_image_pose_compressed'. This range image is constructed by transforming all lidar points into a fixed vehicle frame (usually the vehicle frame of the middle scan).
NOTE: 'range_image_pose_compressed' is only populated for the first range image return. The second return has exactly the same range image pose as the first one.
Used in:
Zlib compressed [H, W, 4] serialized version of MatrixFloat. To decompress:
  string val = ZlibDecompress(range_image_compressed);
  MatrixFloat range_image;
  range_image.ParseFromString(val);
Inner dimensions are:
  * channel 0: range
  * channel 1: intensity
  * channel 2: elongation
  * channel 3: is in any no label zone.
Lidar point to camera image projections. A point can be projected to multiple camera images. We pick the first two in the following order: [FRONT, FRONT_LEFT, FRONT_RIGHT, SIDE_LEFT, SIDE_RIGHT]. Zlib compressed [H, W, 6] serialized version of MatrixInt32. To decompress:
  string val = ZlibDecompress(camera_projection_compressed);
  MatrixInt32 camera_projection;
  camera_projection.ParseFromString(val);
Inner dimensions are:
  * channel 0: CameraName.Name of 1st projection. Set to UNKNOWN if no projection.
  * channel 1: x (axis along image width)
  * channel 2: y (axis along image height)
  * channel 3: CameraName.Name of 2nd projection. Set to UNKNOWN if no projection.
  * channel 4: x (axis along image width)
  * channel 5: y (axis along image height)
Note: pixel 0 corresponds to the left edge of the first pixel in the image.
Zlib compressed [H, W, 6] serialized version of MatrixFloat. To decompress:
  string val = ZlibDecompress(range_image_pose_compressed);
  MatrixFloat range_image_pose;
  range_image_pose.ParseFromString(val);
Inner dimensions are [roll, pitch, yaw, x, y, z], representing a transform from vehicle frame to global frame for every range image pixel. This is ONLY populated for the first return. The second return is assumed to have exactly the same range_image_pose_compressed. The roll, pitch and yaw are specified as 3-2-1 Euler angle rotations, meaning that rotating from the navigation to vehicle frame consists of a yaw, then pitch and finally roll rotation about the z, y and x axes respectively. All rotations use the right hand rule and are positive in the counter clockwise direction.
Zlib compressed [H, W, 4] serialized version of MatrixFloat. To decompress:
  string val = ZlibDecompress(range_image_flow_compressed);
  MatrixFloat range_image_flow;
  range_image_flow.ParseFromString(val);
Inner dimensions are [vx, vy, vz, pointwise class]. If the point is not annotated with scene flow information, class is set to -1. A point is not annotated if it is in a no-label zone or if its label bounding box does not have a corresponding match in the previous frame, making it infeasible to estimate the motion of the point. Otherwise, (vx, vy, vz) are the velocity along the (x, y, z)-axes for this point and class is set to one of the following values:
  -1: no-flow-label, the point has no flow information.
   0: unlabeled or "background", i.e., the point is not contained in a bounding box.
   1: vehicle, i.e., the point corresponds to a vehicle label box.
   2: pedestrian, i.e., the point corresponds to a pedestrian label box.
   3: sign, i.e., the point corresponds to a sign label box.
   4: cyclist, i.e., the point corresponds to a cyclist label box.
Zlib compressed [H, W, 2] serialized version of MatrixInt32. To decompress:
  string val = ZlibDecompress(segmentation_label_compressed);
  MatrixInt32 segmentation_label;
  segmentation_label.ParseFromString(val);
Inner dimensions are [instance_id, semantic_class].
NOTE:
1. Only the TOP LiDAR has segmentation labels.
2. Not every frame has segmentation labels. This field is not set if a frame is not labeled.
3. There can be points missing segmentation labels within a labeled frame. Their labels are set to TYPE_NOT_LABELED when that happens.
Deprecated, do not use.
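A hedged sketch (assuming the waymo_open_dataset Python package) of the ZlibDecompress + ParseFromString pattern described above, applied to range_image_compressed:

import zlib

import numpy as np
from waymo_open_dataset import dataset_pb2

def decode_range_image(range_image_compressed):
    """Returns an [H, W, 4] array: range, intensity, elongation, NLZ flag."""
    range_image = dataset_pb2.MatrixFloat()
    range_image.ParseFromString(zlib.decompress(range_image_compressed))
    return np.asarray(range_image.data, dtype=np.float32).reshape(
        list(range_image.shape.dims))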
An object that must be predicted for the scenario.
Used in:
An index into the Scenario `tracks` field for the object to be predicted.
The difficulty level for this object.
A difficulty level for predicting a given track.
Used in:
Used in:
The type of road edge.
The polyline defining the road edge. A polyline is a list of points with segments defined between consecutive points.
Type of this road edge.
Used in:
Physical road boundary that doesn't have traffic on the other side (e.g., a curb or the k-rail on the right side of a freeway).
Physical road boundary that separates the car from other traffic (e.g. a k-rail or an island).
Used in:
The type of the lane boundary.
The polyline defining the road line. A polyline is a list of points with segments defined between consecutive points.
Type of this road line.
Used in:
The unique ID for this scenario.
Timestamps corresponding to the track states for each step in the scenario. The length of this field is equal to tracks[i].states_size() for all tracks i and equal to the length of the dynamic_map_states field.
The index into timestamps_seconds for the current time. All time steps after this index are future data to be predicted. All steps before this index are history data.
Tracks for all objects in the scenario. All object tracks in all scenarios in the dataset have the same number of object states. In this way, the tracks field forms a 2 dimensional grid with objects on one axis and time on the other. Each state can be associated with a timestamp in the 'timestamps_seconds' field by its index. E.g., tracks[i].states[j] indexes the i^th agent's state at time timestamps_seconds[j].
The dynamic map states in the scenario (e.g. traffic signal states). This field has the same length as timestamps_seconds. Each entry in this field can be associated with a timestamp in the 'timestamps_seconds' field by its index. E.g., dynamic_map_states[i] indexes the dynamic map state at time timestamps_seconds[i].
The set of static map features for the scenario.
The index into the tracks field of the autonomous vehicle object.
A list of object IDs in the scene detected to have interactive behavior. The objects in this list form an interactive group. These IDs correspond to IDs in the tracks field above.
A list of tracks to generate predictions for. For the challenges, exactly these objects must be predicted in each scenario for test and validation submissions. This field is populated in the training set only as a suggestion of objects to train on.
Per time step Lidar data. This contains lidar up to the current time step such that compressed_frame_laser_data[i] corresponds to the states at timestamps_seconds[i] where i <= current_time_index. This field is not populated in all versions of the dataset.
Per time step camera tokens. This contains camera tokens up to the current time step such that frame_camera_tokens[i] corresponds to the states at timestamps_seconds[i] where i <= current_time_index. This field is not populated in all versions of the dataset.
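A hedged sketch (assuming the waymo_open_dataset Python package; the file path and the center_x/center_y state field names are illustrative) of indexing tracks against timestamps_seconds and tracks_to_predict as described above:

import tensorflow as tf
from waymo_open_dataset.protos import scenario_pb2

dataset = tf.data.TFRecordDataset('/path/to/scenarios.tfrecord')  # hypothetical path
for raw_record in dataset:
    scenario = scenario_pb2.Scenario.FromString(raw_record.numpy())
    t = scenario.current_time_index  # steps after t are the future to be predicted
    for required in scenario.tracks_to_predict:
        track = scenario.tracks[required.track_index]
        state = track.states[t]  # the state at time timestamps_seconds[t]
        if state.valid:
            print(scenario.scenario_id, track.id, state.center_x, state.center_y)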
A set of predictions used for metrics evaluation.
The unique ID of the scenario being predicted. This ID must match the scenario_id field in the test or validation set tf.Example or scenario proto corresponding to this set of predictions.
The predictions for the scenario. These represent either single object predictions or joint predictions for a group of objects.
Used in:
String ID of the original scenario proto used as initial conditions.
Collection of multiple `JointScene`s simulated from the same initial conditions (corresponding to the original Scenario proto). This needs to include exactly 32 parallel simulations.
A message containing a prediction for either a single object or a joint prediction for a set of objects.
Used in:
The trajectories for the objects in the scenario being predicted. For the interactive challenge, this must contain exactly 2 trajectories for the pair of objects listed in the tracks_to_predict field of the Scenario or tf.Example proto.
An optional confidence measure for this joint prediction. These confidence scores should reflect confidence in the existence of the trajectory across scenes, not normalized within a scene or per-agent.
Used in:
The object predicted trajectory.
An optional confidence measure for this joint prediction. These confidence scores should reflect confidence in the existence of the trajectory across scenes, not normalized within a scene or per-agent.
(message has no fields)
Used in:
Other small vehicles (e.g. pedicab) and large vehicles (e.g. construction vehicles, RV, limo, tram).
Lamp post, traffic sign pole etc.
Construction cone/pole.
Bushes, tree branches, tall grasses, flowers etc.
Curb on the edge of roads. This does not include road boundaries if there’s no curb.
Surface a vehicle could drive on. This includes the driveway connecting a parking lot and the road over a section of sidewalk.
Marking on the road that’s specifically for defining lanes such as single/double white/yellow lines.
Marking on the road other than lane markers, bumps, cateyes, railtracks etc.
Most horizontal surfaces that are not drivable, e.g. grassy hill, pedestrian walkway stairs, etc.
Nicely paved walkable surface where pedestrians are most likely to walk.
Used in:
Segmentation labels by lasers.
These must be set when evaluating on the leaderboard. This should be set to Context.name defined in dataset.proto::Context.
This should be set to Frame.timestamp_micros defined in dataset.proto::Frame.
Used in:
Used in:
Stats for each class. The length of each field should be equal to the number of classes. The number of points with matching prediction and ground truth for this class.
The total number of points for this class in both prediction and groundtruth.
Per class IOU (Intersection Over Union). Keyed by class index.
The list of segmentation_types to eval.
If your inference results are too large to fit in one proto, you can shard them to multiple files by sharding the inference_results field. Next ID: 11.
This must be set as the full email used to register at waymo.com/open.
This name needs to be short, descriptive and unique. Only the latest result of the method from a user will show up on the leaderboard.
Link to paper or other link that describes the method.
Number of frames used.
Inference results.
Used in:
Aggregation (at the dataset-level or scenario-level) of the lower-level features into proper metrics.
If these metrics are at the scenario level, specify the ID of the scenario they relate to. If not specified, this represents the aggregation of the per-scenario metrics at the dataset level.
The meta-metric, i.e. the weighted aggregation of all the lower-level features. This score is used to rank the submissions for the Sim Agents challenge.
Average displacement error (average or minimum over simulations).
Dynamic features, i.e. speeds and accelerations.
Interactive features.
Map-based features: distance to road edge, offroad indication.
Fraction of simulated objects that collide for at least one step with any other simulated object.
Fraction of simulated objects that drive offroad for at least one step.
Fraction of simulated objects that violate a traffic light for at least one step.
Configuration for the Sim Agents metrics.
Dynamics features.
Interactive features.
Map-based features.
The Bernoulli estimator is used for boolean features, e.g. collision.
Used in:
Additive smoothing to apply to the underlying 2-bins histogram, to avoid infinite values for empty bins.
Each of the features used to evaluate sim agents has one of the following configs.
Used in:
To estimate the likelihood of the logged features under the simulated distribution of features, an approximator of such a distribution is needed. For continuous values we support histogram-based and kernel-density-based estimators.
Based on this flag, the distribution of simulated features will be aggregated over time to approximate one single (per-scenario, per-object) distribution instead of `N_STEP` per-step distributions. Example: When using `independent_timesteps=False` for speed, each logged step will be evaluated under the speed distribution of the 32 parallel simulations at that specific step. When `independent_timesteps=True`, each logged step will be evaluated against the same distribution over all the steps (32 * 80 total samples).
For each of the features, we extract a likelihood score in the range [0,1]. The meta-metric (i.e. how all the submissions are finally scored and ranked) is just a weighted average of these scores.
Based on this flag, the distribution of simulated features will be aggregated over all the objects at every single time step. Examples: (1) The SIM_AGENTS challenge uses `aggregate_objects=False` for all histogram-based features; e.g. for speed, each logged step of each single agent is evaluated under the speed distribution of the 32 parallel simulations at that specific step. (2) The SCENARIO_GEN challenge uses `aggregate_objects=True` for all histogram-based features; here, for each logged step, all objects are evaluated against the same distribution over all the objects of the 32 parallel simulated scenarios (parallel_simulations * num_valid_objects samples).
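To make the two aggregation flags concrete, here is a shapes-only numpy sketch (not the official implementation) of how simulated feature samples could be pooled under each setting, assuming 32 parallel rollouts, 4 objects and 80 simulated steps.

```python
import numpy as np

# Simulated feature values (e.g. linear speed): [n_rollouts=32, n_objects=4, n_steps=80].
sim = np.random.rand(32, 4, 80)

# independent_timesteps=False, aggregate_objects=False:
# one distribution per (object, step), built from the 32 rollouts.
per_object_step = sim.transpose(1, 2, 0)             # [4, 80, 32]

# independent_timesteps=True: pool all steps together,
# one distribution per object with 32 * 80 samples.
per_object = sim.transpose(1, 0, 2).reshape(4, -1)   # [4, 2560]

# aggregate_objects=True: pool all objects together at each step,
# one distribution per step with 32 * 4 samples.
per_step = sim.transpose(2, 0, 1).reshape(80, -1)    # [80, 128]
```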
Configuration for the histogram-based likelihood estimation.
Used in:
Extremes on which the histogram is defined. The default configuration provided for the challenge has these values carefully set based on ground truth data. Any user submission exceeding these thresholds will be clipped, resulting in a lower score for the submission.
Number of bins for the histogram to be discretized into.
Additive smoothing to apply to the histogram, to avoid infinite values when one or more bins are empty.
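A minimal sketch of a histogram-based likelihood estimate using the three fields above (clipping to [min, max], discretization into num_bins, additive smoothing); the official metric code may differ in details such as bin-width normalization.

```python
import numpy as np

def histogram_log_likelihood(sim_samples, logged_value, val_min, val_max,
                             num_bins, additive_smoothing):
    # Clip simulated samples to the configured support, as described above.
    edges = np.linspace(val_min, val_max, num_bins + 1)
    counts, _ = np.histogram(np.clip(sim_samples, val_min, val_max), bins=edges)
    smoothed = counts + additive_smoothing        # avoids log(0) for empty bins
    probs = smoothed / smoothed.sum()
    # Locate the (clipped) logged value and return its log-probability.
    idx = np.digitize(np.clip(logged_value, val_min, val_max), edges) - 1
    idx = np.clip(idx, 0, num_bins - 1)
    return np.log(probs[idx])
```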
Used in:
Bandwidth for the Kernel Density estimation. For more details, check the sklearn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html This field must be set to a strictly positive value; otherwise an error is raised at runtime.
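For reference, a kernel-density estimate with a fixed bandwidth can be computed with scikit-learn as below; this only illustrates the estimator family the config refers to, not the challenge's exact evaluation code.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

sim_samples = np.random.rand(32 * 80, 1)   # pooled simulated feature values, shape [N, 1]
logged_value = np.array([[0.4]])           # feature observed in the logged scenario

# Bandwidth comes from the config field described above; it must be > 0.
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(sim_samples)
log_likelihood = kde.score_samples(logged_value)  # natural-log density at the logged value
```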
Bucketed version of the sim agent metrics. This aggregated message is used in the challenge leaderboard to provide an easy-to-read but still informative metric output format. All the bucketed metrics are rescaled to the range [0, 1], while still respecting the meta-metric weights defined in the metrics config.
Realism meta-metric.
Kinematic metrics: a linear combination of the kinematic-related likelihoods, namely `linear_speed`, `linear_acceleration`, `angular_speed` and `angular_acceleration`.
Interactive metrics: a linear combination of the object-interaction likelihoods, namely `distance_to_nearest_object`, `collision_indication` and `time_to_collision`.
Map-based metrics: a linear combination of the map-related likelihoods, namely `distance_to_road_edge` and `offroad_indication`.
MinADE.
Fraction of simulated objects that collide for at least one step with any other simulated object.
Fraction of simulated objects that drive offroad for at least one step.
Fraction of simulated objects that violate a traffic light for at least one step.
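As a sketch of the weighted aggregation behind the meta-metric and the bucketed metrics above (the feature names come from the comments above; the weights here are made up, the real ones live in the metrics config):

```python
# Hypothetical per-feature likelihood scores in [0, 1] and made-up weights.
scores = {"linear_speed": 0.42, "collision_indication": 0.91, "distance_to_road_edge": 0.77}
weights = {"linear_speed": 1.0, "collision_indication": 2.0, "distance_to_road_edge": 1.0}

# Weighted average of the per-feature scores, as described for the meta-metric.
metametric = sum(weights[k] * scores[k] for k in scores) / sum(weights.values())
```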
Message packaging a full submission to the challenge.
The set of scenario rollouts to evaluate. One entry should exist for every record in the test set.
Identifier of the submission type. Has to be set for the submission to be valid.
This must be set as the full email used to register at waymo.com/open.
This name needs to be short, descriptive and unique. Only the latest result of the method from a user will show up on the leaderboard.
Author information.
A brief description of the method.
Link to paper or other link that describes the method.
Set this to true if your model uses the lidar data provided in the motion dataset. This field is now REQUIRED for a valid submission.
Set this to true if your model uses the camera data provided in the motion dataset. This field is now REQUIRED for a valid submission.
Set this to true if your model used publicly available open-source LLM/VLM(s) for pre-training. This field is now REQUIRED for a valid submission.
If any open-source model was used, specify their names and configuration.
Specify an estimate of the number of parameters of the model used to generate this submission. The number must be specified as an integer number followed by a multiplier suffix (from the set [K, M, B, T, ...], e.g. "200K"). This field is now REQUIRED for a valid submission.
Several submissions for the 2023 challenge did not comply with the closed-loop, 10 Hz requirement specified both on the website https://waymo.com/open/challenges/2024/sim-agents/ and in the NeurIPS paper https://arxiv.org/abs/2305.12032, Section 3, "Task constraints". Please make sure your method complies with these rules before submitting, to ensure the leaderboard stays fair.
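The parameter-count string described two fields above can be produced with a tiny helper like the following; the suffix handling is an assumption based on that description, not the server-side validation code.

```python
_SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}

def format_num_params(n: int) -> str:
    # 213_000_000 -> "213M"; values below 1000 are returned as plain integers.
    for suffix, scale in sorted(_SUFFIXES.items(), key=lambda kv: kv[1], reverse=True):
        if n >= scale:
            return f"{int(round(n / scale))}{suffix}"
    return str(n)
```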
The challenge submission type.
Used in:
A submission for the Waymo open dataset Sim Agents challenge.
Used in:
The simulated trajectory for a single object, including position and heading. The (x, y, z) coordinates identify the centroid of the modeled object, defined in the same coordinate frame as the original input scenario. Heading is defined in radians, counterclockwise from East. See https://waymo.com/open/data/motion/ for more info. The length of these fields must be exactly 80, encoding the 8 seconds of future simulation at the same frequency of the Scenario proto (10Hz). These objects will only be considered if they are valid at the `current_time_index` step (which is hardcoded to 10, with 0-indexing). These objects will be assumed to be valid for the whole duration of the simulation after `current_time_index`, maintaining the latest box sizes (width, length and height) observed in the original scenario at the `current_time_index`.
Optional fields. These fields represent the dimensions of the object's bounding box over time (one value per step), with the same conventions as above. If these are not required by the challenge, fixed box dimensions are assumed. Please refer to the challenge specification to check whether these fields are used.
Specifies an object field, when required by the challenge, otherwise ignored. Please refer to challenge specification to check if these fields are used.
ID of the object.
Optional field, representing the type of the object. If this is not required by the challenge, this field is ignored. Please refer to challenge specification to check if these fields are used.
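A minimal sketch of filling one simulated trajectory as described above (80 steps at 10 Hz, heading in radians counterclockwise from East). The module path and field names are assumptions inferred from these comments; verify them against the sim agents submission proto.

```python
# Sketch only -- module and field names are assumptions, not verified API.
import numpy as np
from waymo_open_dataset.protos import sim_agents_submission_pb2

xs = np.zeros(80)        # 8 s of simulated centroids at 10 Hz
ys = np.zeros(80)
zs = np.zeros(80)
headings = np.zeros(80)  # radians, counterclockwise from East

trajectory = sim_agents_submission_pb2.SimulatedTrajectory(
    center_x=xs, center_y=ys, center_z=zs, heading=headings, object_id=123)
assert len(trajectory.center_x) == 80
```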
Used in:
The ID of the object being predicted. This must match the object_id field in the test or validation set tf.Example or scenario proto corresponding to this prediction. Note this must be the same as the object_id in the scenario track or the state/id field in the tf.Example, not the track index.
A set of up to 6 trajectory predictions for this object with varying confidences. Any predictions past the first six will be discarded.
Used in:
The ID of the object being predicted. This must match the object_id field in the test or validation set tf.Example or scenario proto corresponding to this prediction.
The predicted trajectory positions.
Used in:
The polygon defining the outline of the speed bump. The polygon is assumed to be closed (i.e. a segment exists between the last point and the first point).
Used in:
The IDs of lane features controlled by this stop sign.
The position of the stop sign.
If your inference results are too large to fit in one proto, you can shard them into multiple files by sharding the inference_results field. Next ID: 17.
This specifies which task this submission is for.
This must be set as the full email used to register at waymo.com/open.
This name needs to be short, descriptive and unique. Only the latest result of the method from a user will show up on the leaderboard.
Link to paper or other link that describes the method.
Link to the latency submission Docker image stored in Google Storage bucket or pushed to Google Container/Artifact Registry. Google Storage bucket example: gs://example_bucket_name/example_folder/example_docker_image.tar.gz Google Container/Artifact Registry example: us-west1-docker.pkg.dev/example-registry-name/example-folder/example-image@sha256:example-sha256-hash Follow latency/README.md to create a docker file.
Number of frames used.
Inference results.
Object types this submission contains. By default, we assume all types.
Self-reported end-to-end inference latency in seconds. This is NOT shown on the leaderboard for now, but it is still recommended to set it. Do not confuse this with the `docker_image_source` field above, which is needed to evaluate your model's latency on our server.
Used in:
These values correspond to the tasks on the waymo.com/open site.
Used in:
The object states for a single object through the scenario.
Used in:
The unique ID of the object being tracked. The IDs start from zero and are non-negative.
The type of object being tracked.
The object states through the track. States include the 3D bounding boxes and velocities.
Used in:
This is an invalid state that indicates an error.
Used in:
The number of misses (false negatives).
The number of false positives.
The number of mismatches.
The sum of matching costs for all matched objects.
Total number of matched objects.
Total number of ground truth objects (i.e. labeled objects).
The score cutoff used to compute this measurement.
If set, will include the ids of the fp/tp/fn objects. Each element corresponds to one frame of matching.
Used in:
False positive prediction ids.
False negative ground truth ids.
True positive ground truth ids. Should be the same length as tp_pr_ids and tp_ious. Each pair of ids at the same index corresponds to a matched ground truth object and prediction object.
True positive prediction ids.
Used in:
The breakdown these measurements are computed for.
Multiple object tracking accuracy (sum of the miss, mismatch and fp ratios).
Multiple object tracking precision (matching_cost / num_matches).
Miss ratio (num_misses / num_objects_gt).
Mismatch ratio (num_mismatches / num_objects_gt).
False positive ratio (num_fps / num_objects_gt).
Total number of ground truth objects (i.e. labeled objects).
The breakdown these metrics are computed for.
Raw measurements.
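A minimal sketch of how the ratios above derive from one raw measurement, using the field names quoted in these comments (num_misses, num_fps, num_mismatches, matching_cost, num_matches, num_objects_gt) and reading the MOTA definition as the sum of the three ratios:

```python
def tracking_metrics(m):
    # m is a single measurement with the count fields described above.
    gt = max(m.num_objects_gt, 1)
    miss_ratio = m.num_misses / gt
    mismatch_ratio = m.num_mismatches / gt
    fp_ratio = m.num_fps / gt
    motp = m.matching_cost / max(m.num_matches, 1)
    mota = miss_ratio + mismatch_ratio + fp_ratio
    return {"mota": mota, "motp": motp, "miss": miss_ratio,
            "mismatch": mismatch_ratio, "fp": fp_ratio}
```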
Used in:
The ID for the MapFeature corresponding to the lane controlled by this traffic signal state.
The state of the traffic signal.
The stopping point along the lane controlled by the traffic signal. This is the point where dynamic objects must stop when the signal is in a stop state.
Used in:
States for traffic signals with arrows.
Standard round traffic signals.
Flashing light signals.
Used in:
The predicted trajectory positions. For the Waymo prediction challenges, these fields must have exactly 16 entries: 8 seconds of future sampled at 2 steps per second, starting at timestamp 1.5 seconds (step 15) in the scenario. IMPORTANT: For the challenges, the first entry in each of these fields must correspond to time step 15 in the scenario, NOT step 10 or 11 (i.e. the entries in these fields must correspond to steps 15, 20, 25, ..., 85, 90 in the scenario).
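The required sampling can be written down directly: at the 10 Hz scenario resolution, the 2 Hz prediction steps are 15, 20, ..., 90, i.e. 16 entries over 8 seconds.

```python
# Scenario steps (10 Hz) that the 2 Hz challenge trajectories must correspond to.
prediction_steps = list(range(15, 91, 5))
assert len(prediction_steps) == 16
assert prediction_steps[0] == 15 and prediction_steps[-1] == 90
```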
Used in:
Position in meters. Right-handed coordinate system. +x = forward, +y = left, +z = up. The ego-vehicle is located at (0, 0, 0) at t=0. The prediction length should be 5s at 4Hz, containing 20 waypoints. The first waypoint should be at t+0.25s and the last waypoint should be at t+5s. Only x,y coordinates are included. The z coordinate is not used.
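The expected waypoint timestamps follow directly from the 4 Hz / 5 s specification:

```python
# 20 (x, y) waypoints at 4 Hz: the first at t + 0.25 s, the last at t + 5.0 s.
waypoint_times = [0.25 * (i + 1) for i in range(20)]
assert len(waypoint_times) == 20
assert waypoint_times[0] == 0.25 and waypoint_times[-1] == 5.0
```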
4x4 row-major transform matrix that transforms 3D points from one frame to another.
Used in:
Used in:
Used in:
Used in:
Velocity in m/s.
Angular velocity in rad/s.