package determined.trial.v1


CheckpointWorkload is an artifact created by a trial during training.

Used in: api.v1.WorkloadContainer, run.v1.Run, Trial

string uuid = 1
UUID of the checkpoint.
optional google.protobuf.Timestamp end_time = 3
The time the workload finished or was stopped.
checkpoint.v1.State state = 4
The state of the checkpoint.
map<string, int64> resources = 5
Dictionary of file paths to file sizes in bytes of all files in the checkpoint.
int32 total_batches = 8
Total number of batches as of this workload's completion.
optional google.protobuf.Struct metadata = 9
User defined metadata associated with the checkpoint.
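As an illustration of the `resources` map, the sketch below sums the per-file sizes to get the total size of one checkpoint in bytes. The helper name `checkpoint_size_bytes` and the example file paths are hypothetical, and the map is modelled as a plain Python dict rather than a generated protobuf class.

```python
from typing import Dict


def checkpoint_size_bytes(resources: Dict[str, int]) -> int:
    """Sum the file sizes (in bytes) of a resources map: file path -> size."""
    return sum(resources.values())


# Hypothetical resources map for a single checkpoint.
example_resources = {
    "state_dict.pth": 1_048_576,
    "metadata.json": 512,
}
total = checkpoint_size_bytes(example_resources)
print(total)
```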

MetricsReport is a report of metrics for a trial.

Used in: api.v1.GetMetricsResponse, api.v1.GetTrainingMetricsResponse, api.v1.GetTrialMetricsByCheckpointResponse, api.v1.GetTrialMetricsByModelVersionResponse, api.v1.GetValidationMetricsResponse

int32 trial_id = 1
ID of the trial.
optional google.protobuf.Timestamp end_time = 2
The time at which the metric was reported.
optional google.protobuf.Struct metrics = 3
Struct of the reported metrics.
int32 total_batches = 4
Number of batches completed as of the report.
bool archived = 5
Whether the metric is archived.
int32 id = 6
ID of metric in table.
int32 trial_run_id = 7
Run ID of trial when metric was reported.
string group = 8
Name of the metric group ("training", "validation", or any other name).

MetricsWorkload is a workload generating metrics.

Used in: api.v1.WorkloadContainer, run.v1.Run, Trial

optional google.protobuf.Timestamp end_time = 2
The time the workload finished or was stopped.
optional common.v1.Metrics metrics = 40
Metrics.
int32 num_inputs = 5
Number of inputs processed.
int32 total_batches = 8
Total number of batches as of this workload's completion.

RendezvousInfo is the information a trial needs to rendezvous with its sibling containers.

Used in: api.v1.AllocationRendezvousInfoResponse

repeated string addresses = 1
The rendezvous addresses of the other containers.
int32 rank = 2
The container rank.
repeated int32 slots = 3
The slots for each address, respectively.
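Because `slots` lists the slot count for each entry of `addresses` "respectively", the two repeated fields can be read as parallel lists. A minimal sketch, assuming plain Python lists in place of the generated message; the `Peer` type, the helper name, and the example addresses are hypothetical:

```python
from typing import List, NamedTuple


class Peer(NamedTuple):
    address: str
    slots: int


def pair_addresses_with_slots(addresses: List[str], slots: List[int]) -> List[Peer]:
    """Pair each rendezvous address with the slot count at the same index."""
    if len(addresses) != len(slots):
        raise ValueError("addresses and slots must have the same length")
    return [Peer(a, s) for a, s in zip(addresses, slots)]


# Hypothetical rendezvous info for two containers.
peers = pair_addresses_with_slots(["10.0.0.1:1734", "10.0.0.2:1734"], [4, 8])
print(peers)
```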

State is the current state of the trial. See `\dT+ trial_state` in the database.

Used in: api.v1.PatchTrialRequest, run.v1.FlatRun, run.v1.Run, Trial

STATE_UNSPECIFIED = 0
The trial is in an unspecified state.
STATE_ACTIVE = 1
The trial is in an active state.
STATE_PAUSED = 2
The trial is in a paused state.
STATE_STOPPING_CANCELED = 3
The trial is canceled and is shutting down.
STATE_STOPPING_KILLED = 4
The trial is killed and is shutting down.
STATE_STOPPING_COMPLETED = 5
The trial is completed and is shutting down.
STATE_STOPPING_ERROR = 6
The trial is errored and is shutting down.
STATE_CANCELED = 7
The trial is canceled and is shut down.
STATE_COMPLETED = 8
The trial is completed and is shut down.
STATE_ERROR = 9
The trial is errored and is shut down.
STATE_QUEUED = 10
The trial is queued (waiting to be run, or job state is still queued). Queued is a substate of the Active state.
STATE_PULLING = 11
The trial is pulling the image. Pulling is a substate of the Active state.
STATE_STARTING = 12
The trial is preparing the environment after finishing pulling the image. Starting is a substate of the Active state.
STATE_RUNNING = 13
The trial's allocation is actively running. Running is a substate of the Active state.
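The groupings described above (substates of Active, "shutting down", and "shut down") can be sketched with a plain Python enum. The numeric values mirror the listing; the set names and helper functions are hypothetical and not part of the API.

```python
from enum import IntEnum


class State(IntEnum):
    UNSPECIFIED = 0
    ACTIVE = 1
    PAUSED = 2
    STOPPING_CANCELED = 3
    STOPPING_KILLED = 4
    STOPPING_COMPLETED = 5
    STOPPING_ERROR = 6
    CANCELED = 7
    COMPLETED = 8
    ERROR = 9
    QUEUED = 10
    PULLING = 11
    STARTING = 12
    RUNNING = 13


# States documented as substates of the Active state.
ACTIVE_SUBSTATES = frozenset(
    {State.QUEUED, State.PULLING, State.STARTING, State.RUNNING}
)
# States documented as "shut down" (as opposed to "shutting down").
SHUT_DOWN_STATES = frozenset({State.CANCELED, State.COMPLETED, State.ERROR})


def is_active(state: State) -> bool:
    """True for STATE_ACTIVE itself or any of its documented substates."""
    return state == State.ACTIVE or state in ACTIVE_SUBSTATES


def is_shut_down(state: State) -> bool:
    """True once the trial is shut down, not merely stopping."""
    return state in SHUT_DOWN_STATES
```

For example, `is_active(State.PULLING)` is true, while `is_shut_down(State.STOPPING_COMPLETED)` is false because the trial is still shutting down.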

Trial is a set of workloads exploring a determined set of hyperparameters.

Used in: api.v1.ComparableTrial, api.v1.CreateTrialResponse, api.v1.GetExperimentTrialsResponse, api.v1.GetTrialByExternalIDResponse, api.v1.GetTrialResponse, api.v1.PatchTrialResponse, api.v1.PutTrialResponse, api.v1.SearchExperimentExperiment

int32 id = 1
The id of the trial.
int32 experiment_id = 2
The id of the parent experiment.
optional google.protobuf.Timestamp start_time = 3
The time the trial was started.
optional google.protobuf.Timestamp end_time = 4
The time the trial ended if the trial is stopped.
State state = 5
The current state of the trial.
int32 restarts = 17
Number of times the trial restarted.
optional google.protobuf.Struct hparams = 6
Trial hyperparameters.
int32 total_batches_processed = 7
The current processed batches.
optional MetricsWorkload best_validation = 8
Best validation.
optional MetricsWorkload latest_validation = 9
Latest validation.
optional CheckpointWorkload best_checkpoint = 10
Best checkpoint.
string runner_state = 11
The last reported state of the trial runner (harness code).
double wall_clock_time = 12
The wall clock time is all active time of the cluster for the trial, inclusive of everything (restarts, initialization, etc.), in seconds.
string warm_start_checkpoint_uuid = 13
UUID of checkpoint that this trial started from.
string task_id = 14
Id of the first task associated with this trial. This field is deprecated since trials can have multiple tasks.
uint64 total_checkpoint_size = 15
The sum of sizes of all resources in all checkpoints for the trial.
int32 checkpoint_count = 18
The count of checkpoints.
optional google.protobuf.Struct summary_metrics = 19
Summary metrics.
repeated string task_ids = 20
Task IDs of tasks associated with this trial. When task IDs are sent, task_ids always contains at least one element. Some responses, for example CompareTrial, send a reduced Trial object without task_id or task_ids filled in. The first element of task_ids is the same as task_id. task_ids is sorted ascending by task_run_id.
double searcher_metric_value = 21
Signed searcher metrics value.
optional int32 log_retention_days = 22
Number of days to retain logs for.
optional google.protobuf.Struct metadata = 23
Metadata associated with the trial (based on the metadata stored in the run).
optional string log_policy_matched = 24
Log Policy Matched.
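The documented relationship between the deprecated `task_id` and `task_ids` can be checked on the client side. A minimal sketch, assuming both fields are read as plain Python values and that an unset `task_id` arrives as the empty string (the proto3 default); the helper name and the example task IDs are hypothetical:

```python
from typing import List


def task_ids_consistent(task_id: str, task_ids: List[str]) -> bool:
    """Check that task_id and task_ids agree with the documented relationship."""
    if not task_ids:
        # Reduced Trial objects may leave both fields unset.
        return task_id == ""
    # When task IDs are sent, the first element matches task_id.
    return task_ids[0] == task_id


print(task_ids_consistent("trial-1.task-a", ["trial-1.task-a", "trial-1.task-b"]))
print(task_ids_consistent("", []))
```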

TrialEarlyExit signals to the experiment that the trial exited early.

Used in: api.v1.ReportTrialSearcherEarlyExitRequest

TrialEarlyExit.ExitedReason reason = 1
The reason for the exit.

The reason for an early exit.

Used in: TrialEarlyExit

EXITED_REASON_UNSPECIFIED = 0
Zero-value (not allowed).
EXITED_REASON_INVALID_HP = 1
Indicates the trial exited due to an invalid hyperparameter.
EXITED_REASON_INIT_INVALID_HP = 3
Indicates the trial exited due to an invalid hyperparameter in the trial init.
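Since the zero value is documented as not allowed and the value 2 is not listed, a client may want to validate a raw integer before using it. A minimal sketch with a plain Python enum; `parse_exited_reason` is a hypothetical helper, not part of the API:

```python
from enum import IntEnum


class ExitedReason(IntEnum):
    UNSPECIFIED = 0
    INVALID_HP = 1
    INIT_INVALID_HP = 3


def parse_exited_reason(value: int) -> ExitedReason:
    """Convert a raw integer to ExitedReason, rejecting 0 and unlisted values."""
    try:
        reason = ExitedReason(value)
    except ValueError:
        raise ValueError(f"unknown exit reason: {value}") from None
    if reason is ExitedReason.UNSPECIFIED:
        raise ValueError("EXITED_REASON_UNSPECIFIED is not allowed")
    return reason


print(parse_exited_reason(3).name)
```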

TrialMetrics are metrics from the trial over some duration of training.

Used in: api.v1.ReportTrialMetricsRequest, api.v1.ReportTrialTrainingMetricsRequest, api.v1.ReportTrialValidationMetricsRequest

int32 trial_id = 1
The trial associated with these metrics.
int32 trial_run_id = 2
The trial run associated with these metrics.
optional int32 steps_completed = 3
The number of batches trained on when these metrics were reported.
optional google.protobuf.Timestamp report_time = 4
The client-reported time associated with these metrics.
optional common.v1.Metrics metrics = 9
The metrics for this bit of training, including:
- avg_metrics: metrics reduced over the reporting period.
- batch_metrics: (optional) per-batch metrics.
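To make the `avg_metrics` / `batch_metrics` split concrete, the sketch below reduces per-batch metrics into one value per metric name. The arithmetic mean is used purely as an assumption, since the reference does not say which reduction is applied; the helper name and the example metric values are hypothetical.

```python
from statistics import fmean
from typing import Dict, List


def reduce_batch_metrics(batch_metrics: List[Dict[str, float]]) -> Dict[str, float]:
    """Reduce per-batch metrics to one value per name (assumed: the mean)."""
    if not batch_metrics:
        return {}
    names = batch_metrics[0].keys()
    return {name: fmean(batch[name] for batch in batch_metrics) for name in names}


# Hypothetical per-batch metrics for one reporting period.
batch_metrics = [{"loss": 1.0}, {"loss": 0.5}]
metrics = {
    "avg_metrics": reduce_batch_metrics(batch_metrics),
    "batch_metrics": batch_metrics,  # optional in the message
}
print(metrics["avg_metrics"])
```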

TrialProfilerMetricLabels are the labels for a single series, where a series is defined as all metrics sharing a distinct set of labels.

Used in: api.v1.GetTrialProfilerAvailableSeriesResponse, api.v1.GetTrialProfilerMetricsRequest, TrialProfilerMetricsBatch

int32 trial_id = 1
The ID of the trial.
string name = 2
The name of the metric.
string agent_id = 3
The agent ID associated with the metric.
string gpu_uuid = 4
The GPU UUID associated with the metric.
TrialProfilerMetricLabels.ProfilerMetricType metric_type = 5
The type of the metric.

ProfilerMetricType distinguishes the different categories of profiler metrics.

Used in: TrialProfilerMetricLabels

PROFILER_METRIC_TYPE_UNSPECIFIED = 0
Zero-value (not allowed).
PROFILER_METRIC_TYPE_SYSTEM = 1
For system metrics, like GPU utilization or memory.
PROFILER_METRIC_TYPE_TIMING = 2
For timing metrics, like how long a backwards pass or getting a batch from the dataloader took.
PROFILER_METRIC_TYPE_MISC = 3
For other miscellaneous metrics.

TrialProfilerMetricsBatch is a batch of trial profiler metrics. A batch will contain metrics pertaining to a single series. The fields values, batches and timestamps will be equal length arrays with each index corresponding to a reading.

Used in: api.v1.GetTrialProfilerMetricsResponse, api.v1.PostTrialProfilerMetricsBatchRequest

repeated float values = 1
The measurement for a reading, repeated for the batch of metrics.
repeated int32 batches = 2
The batch at which a reading occurred, repeated for the batch of metrics.
repeated google.protobuf.Timestamp timestamps = 3
The timestamp at which a reading occurred, repeated for the batch of metrics.
optional TrialProfilerMetricLabels labels = 4
The labels for this series.
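Because `values`, `batches` and `timestamps` are equal-length arrays indexed by reading, a consumer can zip them into per-reading records. A minimal sketch using plain Python lists and `datetime` objects in place of the generated types; the `Reading` type, the helper name, and the example data are hypothetical:

```python
from datetime import datetime, timezone
from typing import List, NamedTuple


class Reading(NamedTuple):
    value: float
    batch: int
    timestamp: datetime


def to_readings(
    values: List[float], batches: List[int], timestamps: List[datetime]
) -> List[Reading]:
    """Combine the three parallel arrays into one record per reading."""
    if not (len(values) == len(batches) == len(timestamps)):
        raise ValueError("values, batches and timestamps must have equal length")
    return [Reading(v, b, t) for v, b, t in zip(values, batches, timestamps)]


# Hypothetical batch with two readings from one series.
t0 = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 1, 0, 0, 1, tzinfo=timezone.utc)
readings = to_readings([0.25, 0.5], [100, 200], [t0, t1])
print(readings[1].batch)
```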

The metadata pertaining to the current running task for a trial.

Used in: api.v1.PostTrialRunnerMetadataRequest

string state = 1
The state of the trial runner.

TrialSourceInfo denotes a connection between a given trial and a checkpoint or model version.

Used in: api.v1.ReportTrialSourceInfoRequest

int32 trial_id = 1
ID of the trial.
string checkpoint_uuid = 2
UUID of the checkpoint.
optional int32 model_id = 3
Source `id` of the model which generated the checkpoint (if applicable).
optional int32 model_version = 4
Source `version` (the version field of the model_version) which generated the checkpoint (if applicable).
TrialSourceInfoType trial_source_info_type = 5
Type for this trial_source_info.

TrialSourceInfoType is the type of the TrialSourceInfo, which serves as a link between a trial and a checkpoint or model version

Used in: api.v1.GetTrialMetricsByCheckpointRequest, api.v1.GetTrialMetricsByModelVersionRequest, TrialSourceInfo

TRIAL_SOURCE_INFO_TYPE_UNSPECIFIED = 0
The type is unspecified.
TRIAL_SOURCE_INFO_TYPE_INFERENCE = 1
"Inference" Trial Source Info Type, used for batch inference.
TRIAL_SOURCE_INFO_TYPE_FINE_TUNING = 2
"Fine Tuning" Trial Source Info Type, used in model hub.

package determined.trial.v1

message CheckpointWorkload

string uuid = 1

optional google.protobuf.Timestamp end_time = 3

checkpoint.v1.State state = 4

map<string, int64> resources = 5

int32 total_batches = 8

optional google.protobuf.Struct metadata = 9

message MetricsReport

int32 trial_id = 1

optional google.protobuf.Timestamp end_time = 2

optional google.protobuf.Struct metrics = 3

int32 total_batches = 4

bool archived = 5

int32 id = 6

int32 trial_run_id = 7

string group = 8

message MetricsWorkload

optional google.protobuf.Timestamp end_time = 2

optional common.v1.Metrics metrics = 40

int32 num_inputs = 5

int32 total_batches = 8

message RendezvousInfo

repeated string addresses = 1

int32 rank = 2

repeated int32 slots = 3

enum State

STATE_UNSPECIFIED = 0

STATE_ACTIVE = 1

STATE_PAUSED = 2

STATE_STOPPING_CANCELED = 3

STATE_STOPPING_KILLED = 4

STATE_STOPPING_COMPLETED = 5

STATE_STOPPING_ERROR = 6

STATE_CANCELED = 7

STATE_COMPLETED = 8

STATE_ERROR = 9

STATE_QUEUED = 10

STATE_PULLING = 11

STATE_STARTING = 12

STATE_RUNNING = 13

message Trial

int32 id = 1

int32 experiment_id = 2

optional google.protobuf.Timestamp start_time = 3

optional google.protobuf.Timestamp end_time = 4

State state = 5

int32 restarts = 17

optional google.protobuf.Struct hparams = 6

int32 total_batches_processed = 7

optional MetricsWorkload best_validation = 8

optional MetricsWorkload latest_validation = 9

optional CheckpointWorkload best_checkpoint = 10

string runner_state = 11

double wall_clock_time = 12

string warm_start_checkpoint_uuid = 13

string task_id = 14

uint64 total_checkpoint_size = 15

int32 checkpoint_count = 18

optional google.protobuf.Struct summary_metrics = 19

repeated string task_ids = 20

double searcher_metric_value = 21

optional int32 log_retention_days = 22

optional google.protobuf.Struct metadata = 23

optional string log_policy_matched = 24

message TrialEarlyExit

TrialEarlyExit.ExitedReason reason = 1

enum TrialEarlyExit.ExitedReason

EXITED_REASON_UNSPECIFIED = 0

EXITED_REASON_INVALID_HP = 1

EXITED_REASON_INIT_INVALID_HP = 3

message TrialMetrics

int32 trial_id = 1

int32 trial_run_id = 2

optional int32 steps_completed = 3

optional google.protobuf.Timestamp report_time = 4

optional common.v1.Metrics metrics = 9

message TrialProfilerMetricLabels

int32 trial_id = 1

string name = 2