Ensembling spec. Details in: go/phx-ensemble-search
Used in:
When ensembling, all previously trained architectures that are now being tried in the ensemble are frozen if this field is set to true. New architectures (those not trained in previous trials) are never frozen.
This number is added to the train steps when there are no training ops, for example in an average ensemble.
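As a rough illustration only (the helper and parameter names here are hypothetical, not part of the library), the interaction of the two settings above could look like:

```python
def plan_trial(base_train_steps, has_training_ops, extra_steps, freeze_previous):
    """Illustrative only: combine the freeze flag and the extra train steps.

    An average ensemble of frozen towers has no training ops, so the extra
    steps are added; otherwise the base number of train steps is used.
    """
    train_steps = base_train_steps if has_training_ops else base_train_steps + extra_steps
    return {"train_steps": train_steps, "freeze_previous_towers": freeze_previous}


print(plan_trial(base_train_steps=0, has_training_ops=False,
                 extra_steps=10, freeze_previous=True))
```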
This message is used for both ADAPTIVE_ENSEMBLE_SEARCH and RESIDUAL_ENSEMBLE_SEARCH, as their spec is the same.
Used in:
Increase the width of the ensemble every `increase_width_every` trials.
When distillation and ensemble search are specified for the same run, `minimal_pool_size` determines which phase of the run is used for distillation and which for ensemble search. For example, if the distillation pool size is 100 and the adaptive pool size is 200, then distillation happens on trials [100, 200), and adaptive ensemble search happens on trials [200, end).
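A minimal sketch of the phase split described above (assumed logic, not the library's implementation), using the pool sizes from the example:

```python
def phase_for_trial(trial_id, distillation_pool_size, ensemble_pool_size):
    """Map a trial id to a phase when distillation and ensemble search share a run.

    With distillation_pool_size=100 and ensemble_pool_size=200:
      trials [0, 100)   -> regular architecture search
      trials [100, 200) -> distillation
      trials [200, ...) -> adaptive ensemble search
    """
    if trial_id < distillation_pool_size:
        return "architecture_search"
    if trial_id < ensemble_pool_size:
        return "distillation"
    return "ensemble_search"


assert phase_for_trial(150, 100, 200) == "distillation"
assert phase_for_trial(250, 100, 200) == "ensemble_search"
```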
The type of ensemble.
Used in:
Weighted ensemble: uses a mixture weight on top of each learner's logits.
Average ensemble: averages the logits of the various learners.
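To make the two combining types concrete, here is a small numpy sketch (illustrative only; in the weighted case the library learns the mixture weights, whereas they are fixed here for brevity):

```python
import numpy as np

def average_ensemble(logits_per_learner):
    """AVERAGE ensemble: plain mean of the learners' logits."""
    return np.mean(logits_per_learner, axis=0)

def weighted_ensemble(logits_per_learner, mixture_weights):
    """WEIGHTED ensemble: a mixture weight per learner applied to its logits."""
    weights = np.asarray(mixture_weights).reshape(-1, 1, 1)
    return np.sum(weights * np.asarray(logits_per_learner), axis=0)

# Three learners, a batch of 2 examples, 4 classes.
logits = np.random.randn(3, 2, 4)
averaged = average_ensemble(logits)                     # shape (2, 4)
weighted = weighted_ensemble(logits, [0.5, 0.3, 0.2])   # shape (2, 4)
```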
The type of search for the ensemble.
Used in:
Adaptive: After T trials, we select the best architecture, A. During the next T trials (i.e., from Trial T+1 to Trial 2T), we search for an architecture A' that works well with A. In the next set of trials (i.e., from Trial 2T+1 to Trial 3T), we search for an architecture A'' that works well with both A and A', and so on. In this option, the new architectures (A', A'', ...) are trained directly on the labels (as opposed to RESIDUAL_ENSEMBLE_SEARCH). A rough sketch of this scheme follows the descriptions below.
NonAdaptive: First, we create a pool of good architectures. From this pool, we select the top performing candidates. Then, we form random groups of these candidates to see which architectures complement each other.
Residual: This case is identical to Adaptive, except the learners are trained from the ensemble loss (and not directly from their own loss against the labels).
Intermixed: This case is similar to NonAdaptive ensembling. After a pool of candidates is formed, we combine candidates randomly into groups of size w, where w is specified by the user. However, unlike NonAdaptive ensembling, this algorithm works cyclically: on every kth trial, it will try to form an ensemble. (Past ensembling trials are excluded from the pool of potential candidates.)
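As referenced above, a rough sketch of the Adaptive scheme (assumed logic, not the library's code): every finished block of T trials contributes its best learner to the frozen ensemble, and the next block searches for a learner that complements it.

```python
def frozen_ensemble_for_trial(trial_id, completed_trials, block_size):
    """Illustrative only: pick the best prior trial from each finished block.

    `completed_trials` maps trial_id -> validation loss (lower is better), and
    `block_size` is T from the description above. Trials [0, T) search for A,
    trials [T, 2T) search for A' on top of A, and so on. The Residual variant
    has the same structure, but new learners train on the ensemble loss
    instead of directly on the labels.
    """
    frozen = []
    for block in range(trial_id // block_size):
        block_trials = {
            t: loss for t, loss in completed_trials.items()
            if block * block_size <= t < (block + 1) * block_size
        }
        if block_trials:
            frozen.append(min(block_trials, key=block_trials.get))
    return frozen


losses = {t: 1.0 / (1 + t % 7) for t in range(30)}   # fake validation losses
print(frozen_ensemble_for_trial(trial_id=25, completed_trials=losses, block_size=10))
```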
Used in:
How many towers to ensemble together? Should be bigger than 1.
Every 5 trials (the default value), an architecture that is an ensemble of candidates will be tried.
When ensembling, `width` towers are chosen from the best performing `num_trials_to_consider` trials. Must be greater than 1.
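Combining the three fields above, a hedged sketch of how an intermixed ensembling trial could be scheduled and filled (illustrative only; not the library's implementation):

```python
import random

def intermixed_ensemble(trial_id, completed_trials, past_ensembling_trials,
                        width, try_every=5, num_trials_to_consider=10):
    """Return trial ids to ensemble on this trial, or None to search as usual.

    Every `try_every`-th trial (default 5), choose `width` towers at random
    from the best `num_trials_to_consider` prior trials, excluding trials that
    were themselves ensembles. `completed_trials` maps trial_id -> loss.
    """
    if trial_id % try_every != 0:
        return None
    candidates = {t: loss for t, loss in completed_trials.items()
                  if t not in past_ensembling_trials}
    best = sorted(candidates, key=candidates.get)[:num_trials_to_consider]
    if len(best) < width:
        return None
    return random.sample(best, width)
```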
Used in:
How many towers to ensemble together? Should be bigger than 1.
Start ensembling at trial id `minimal_pool_size`.
When ensembling, `width` towers are chosen from the best performing `num_trials_to_consider` trials after `minimal_pool_size` trials have finished. Must be greater than 1.
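For the non-adaptive spec, a short hedged sketch of the `minimal_pool_size` gate followed by the same top-`num_trials_to_consider` selection (illustrative only):

```python
import random

def nonadaptive_ensemble(completed_trials, width, num_trials_to_consider,
                         minimal_pool_size):
    """Illustrative only: once the pool is large enough, group top trials randomly.

    `completed_trials` maps trial_id -> validation loss (lower is better).
    """
    if len(completed_trials) < minimal_pool_size:
        return None  # keep growing the pool of candidates first
    best = sorted(completed_trials, key=completed_trials.get)[:num_trials_to_consider]
    return random.sample(best, min(width, len(best)))
```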