package yggdrasil_decision_forests.model.gradient_boosted_trees.proto


optional Header header = 1

Training configuration for the Gradient Boosted Trees algorithm.

Next ID: 39

Used in: distributed_gradient_boosted_trees.proto.DistributedGradientBoostedTreesTrainingConfig

optional int32 num_trees = 1
Maximum number of iterations during training. In the case of single output models (e.g. binary classification, regression, ranking), this value is equal to the number of trees.
optional decision_tree.proto.DecisionTreeTrainingConfig decision_tree = 2
Decision tree specific parameters. The default maximum depth of the trees is: 6.
optional float shrinkage = 3
Shrinkage parameter. Default values: 0.1 if forest_extraction is MART, 1.0 if forest_extraction is DART.
optional float subsample = 4
Fraction of examples used to train each tree. If =1, all the examples are used to train each tree. If <1, a random subset of examples is sampled for each tree. Deprecated. Use "stochastic_gradient_boosting" instead. Note: This parameter is ignored if another sampling strategy ("sampling_methods") is set.
oneof sampling_implementation
How the sampling ("subsample" or "sampling_methods") is implemented.
- GradientBoostedTreesTrainingConfig.SampleInMemory sample_in_memory = 32
  The entire dataset is loaded in memory, and the subsampling ("subsample" parameter) and extraction of the validation dataset ("validation_set_ratio" parameter) are done at the example level. This method is fast and accurate, but consumes a lot of memory if the dataset is large (e.g. >10B values).
- GradientBoostedTreesTrainingConfig.SampleWithShards sample_with_shards = 31
  The entire dataset is NOT loaded in memory, and "subsample" and "validation_set_ratio" are applied at the shard level. This method consumes less memory (great if your dataset does not fit in memory) but requires more IO and CPU operations, and the final model might (or might not) require more trees to reach the same quality as the first method. For the sampling to be of high quality (and the model to train well), the number of shards should be large enough (>=1000 shards is a good situation).
  In the logs:
  - "loader-blocking" indicates how much of the time the training is stopped to wait for IO. A good value is 0%. This value can be optimized by locating the dataset "near" the job, using an appropriate amount of sharding (i.e. not too much; 100 shards per sample works well) and compression (if uncompressing the dataset is too expensive), increasing the sample size (i.e. making the training more complex) or recycling the samples (with "num_recycling").
  - "preprocessing-load" indicates how much of the preparation time (IO + preprocessing) is spent preprocessing the data. High values are not an issue as long as "loader-blocking" is small.
  Constraints:
  - The code raises an error if the number of shards is <10.
  - No support (yet) for ranking or DART.
  In detail, each tree is trained on a random subset of shards (controlled by "subsample"). Once the tree is trained, the random subset of shards is discarded and the training continues. The same shard can be used multiple times for different trees. The loading of the next shard and the training of the current tree are done in parallel. Ideally, both should run at the same speed. The amount of time spent waiting for the shard loading and preparation instead of training is displayed in the logs as "loader-blocking".
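The shard-level sampling described for "sample_with_shards" can be pictured with the following minimal sketch. It is not the library's implementation: the function name `sample_shards`, the shard file names, and the rounding rule are assumptions made for illustration; only the "<10 shards is an error" rule and the fact that a shard may be reused for different trees come from the description.

```python
import random

def sample_shards(shard_paths, subsample, rng):
    """Pick a random subset of shards for one tree (shard-level subsampling)."""
    if len(shard_paths) < 10:
        # Mirrors the documented constraint: fewer than 10 shards is an error.
        raise ValueError("at least 10 shards are required")
    # Rounding to the nearest shard count is an assumption of this sketch.
    num_selected = max(1, round(subsample * len(shard_paths)))
    return rng.sample(shard_paths, num_selected)

# Hypothetical shard names, used only for the demonstration.
shards = [f"train-{i:05d}-of-01000" for i in range(1000)]
rng = random.Random(0)
for tree_idx in range(3):
    # Each tree gets a fresh random subset; a shard may be picked again later.
    selected = sample_shards(shards, subsample=0.1, rng=rng)
    print(tree_idx, len(selected))
```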
optional Loss loss = 5
Loss minimized by the model. The value "DEFAULT" selects the loss most likely to suit the nature of the task and the statistics of the label.
optional float validation_set_ratio = 6
Ratio of the training dataset used to monitor the training. If >0, the validation set is used to select the actual number of trees (<=num_trees).
optional string validation_set_group_feature = 11
If set, defines the name of the feature used to split the validation dataset. In other words, if set, the validation dataset and the sub-training dataset cannot share examples with the same "group feature" value.
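As an illustration of the "group feature" constraint, the sketch below assigns whole groups to either the validation set or the sub-training set. Hashing the group value is an assumption of this example, not a statement about how the library performs the split; `is_validation` and `split_by_group` are hypothetical helper names.

```python
import hashlib

def is_validation(group_value, validation_set_ratio):
    """Deterministically assign a whole group to the validation set."""
    digest = hashlib.sha256(str(group_value).encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a roughly uniform number in [0, 1].
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < validation_set_ratio

def split_by_group(examples, group_key, validation_set_ratio):
    """Split examples so that no group value appears on both sides."""
    train, valid = [], []
    for example in examples:
        in_valid = is_validation(example[group_key], validation_set_ratio)
        (valid if in_valid else train).append(example)
    return train, valid

# Hypothetical data: 200 examples spread over 20 users.
examples = [{"user": f"u{i % 20}", "x": i} for i in range(200)]
train, valid = split_by_group(examples, "user", validation_set_ratio=0.3)
print(len(train), len(valid))
```

Because the assignment depends only on the group value, the realized validation fraction is approximate when the number of groups is small.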
optional int32 validation_interval_in_trees = 7
Evaluate the model on the validation set every "validation_interval_in_trees" trees. Impacts the early stopping policy.
optional int32 export_logs_during_training_in_trees = 33
If set and >0, export the training logs every "export_logs_during_training_in_trees" trees.
optional GradientBoostedTreesTrainingConfig.EarlyStopping early_stopping = 8
optional int32 early_stopping_num_trees_look_ahead = 9
optional int32 early_stopping_initial_iteration = 37
0-based index of the first iteration considered for early stopping computation. During the first iterations of a learner, the validation loss can be noisy, since the learner has yet to learn useful information. In particular, the validation loss during early iterations can be unusually small. This leads to early stopping while the model still has poor quality. This parameter specifies the index of the first iteration during which a validation loss is computed, i.e. the first iteration considered for early stopping.
oneof loss_options
- LossConfiguration.LambdaMartNdcg lambda_mart_ndcg = 12
- LossConfiguration.XeNdcg xe_ndcg = 26
- LossConfiguration.BinaryFocalLossOptions binary_focal_loss_options = 36
optional float l2_regularization = 13
L2 regularization on the tree predictions i.e. on the value of the leaf. See equation 2 of the XGBoost paper for the definition (https://arxiv.org/pdf/1603.02754.pdf). This term is not integrated in the reported loss of the model i.e. the loss of models trained with and without L2 regularization can be compared. Used for the following losses: BINOMIAL_LOG_LIKELIHOOD, SQUARED_ERROR, MULTINOMIAL_LOG_LIKELIHOOD, LAMBDA_MART_NDCG, or if use_hessian_gain is true. Note: In the case of RMSE loss for regression, the L2 regularization plays the same role as the "shrinkage" parameter.
optional float l2_regularization_categorical = 22
L2 regularization for the categorical features with the hessian loss.
optional float l1_regularization = 19
L1 regularization on the tree predictions i.e. on the value of the leaf. Used for the following losses: LAMBDA_MART_NDCG, or if use_hessian_gain is true.
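To make the effect of the two regularization terms on a leaf value concrete, here is a minimal sketch using the XGBoost-style closed form (soft-thresholding of the gradient sum by L1, then division by the hessian sum plus L2). That this library uses exactly this formula is an assumption; `leaf_value` is a hypothetical helper.

```python
def leaf_value(sum_gradient, sum_hessian, l1=0.0, l2=0.0):
    """Regularized leaf value: -soft_threshold(G, l1) / (H + l2)."""
    if sum_gradient > l1:
        numerator = sum_gradient - l1
    elif sum_gradient < -l1:
        numerator = sum_gradient + l1
    else:
        numerator = 0.0  # L1 shrinks small gradient sums to exactly zero.
    return -numerator / (sum_hessian + l2)

# Squared error with 4 examples: each hessian is 1, so H = 4 and
# G is minus the sum of the residuals (here the residuals sum to 8).
print(leaf_value(-8.0, 4.0))          # unregularized: mean residual
print(leaf_value(-8.0, 4.0, l2=4.0))  # L2 scales the leaf toward zero
print(leaf_value(-8.0, 4.0, l1=2.0))  # L1 subtracts a constant from |G|
```

With squared error, H is the number of examples n in the leaf, so L2 multiplies the unregularized value by n / (n + l2). This is the sense in which it acts like a per-leaf shrinkage.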
optional float clamp_leaf_logit = 30
Maximum absolute value of the leaf representing a logit (for binary and multi-class classification). This parameter generally has no impact on the quality of the model. It prevents the appearance of large values, and then infinity and then NaN, during training in the computation of logistic and soft-max. The value is selected such that log(clamp_leaf_logit) can be comfortably represented as a float.
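The numerical issue that leaf clamping guards against can be reproduced in a few lines. This is only a sketch: `clamp_leaf` and `sigmoid` are hypothetical helpers, and the limit 5.0 is an arbitrary illustrative value, not a documented default.

```python
import math

def clamp_leaf(value, clamp_leaf_logit):
    """Clamp a leaf value to [-clamp_leaf_logit, clamp_leaf_logit]."""
    return max(-clamp_leaf_logit, min(clamp_leaf_logit, value))

def sigmoid(logit):
    """Naive logistic function, deliberately written without overflow guards."""
    return 1.0 / (1.0 + math.exp(-logit))

# Without clamping, an extreme logit overflows the exponential.
try:
    sigmoid(-1000.0)
except OverflowError as error:
    print("overflow:", error)

# With clamping, the logit stays in a range where exp() is well behaved.
print(sigmoid(clamp_leaf(-1000.0, 5.0)))
```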
optional float lambda_loss = 14
The "lambda" constant available in some loss formulations. Does not impact the optimal solution, but provides a smoothing of the loss that can be beneficial. Currently only used for the losses: - LAMBDA_MART_NDCG
oneof forest_extraction
How the forest of trees is built. Defaults to "mart".
- GradientBoostedTreesTrainingConfig.MART mart = 15
  MART (Multiple Additive Regression Trees): The "classical" way to build a GBDT i.e. each tree tries to "correct" the mistakes of the previous trees.
- GradientBoostedTreesTrainingConfig.DART dart = 16
  DART (Dropout Additive Regression Trees), a modification of MART proposed in http://proceedings.mlr.press/v38/korlakaivinayak15.pdf. Here, each tree tries to "correct" the mistakes of a random subset of the previous trees.
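The idea that each MART tree "corrects" the mistakes of the previous trees can be sketched with a toy squared-error booster on one numerical feature. Everything below (`fit_stump`, `train_mart`, the use of depth-1 trees, the data) is an illustrative assumption and not the library's training code.

```python
def fit_stump(xs, residuals):
    """Fit a 1-D regression stump (one threshold, two leaf means) by least squares.

    Requires at least two distinct feature values.
    """
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def train_mart(xs, ys, num_trees, shrinkage):
    """Each new tree is fitted to the residuals left by the previous trees."""
    initial = sum(ys) / len(ys)  # Initial prediction: the mean label.
    trees = []
    predictions = [initial] * len(xs)
    for _ in range(num_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Only a "shrinkage" fraction of each correction is applied.
        predictions = [p + shrinkage * tree(x) for x, p in zip(xs, predictions)]
    return lambda x: initial + shrinkage * sum(tree(x) for tree in trees)

model = train_mart([0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 1.0, 1.0],
                   num_trees=50, shrinkage=0.1)
print(model(0.0), model(3.0))
```

A small shrinkage means more trees are needed to fit the data, which is why "shrinkage" and "num_trees" are usually tuned together.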
optional bool adapt_subsample_for_maximum_training_duration = 17
If true, the "subsample" parameter will be adapted dynamically such that the model trains in the "maximum_training_duration" time. "subsample" can only be reduced i.e. enabling this feature can only reduce the training time likely at the expense of quality.
optional float min_adapted_subsample = 18
Maximum impact of the "adapt_subsample_for_maximum_training_duration" parameter.
optional bool use_hessian_gain = 20
If true, uses a formulation of split gain with the hessian i.e. optimizes the splits to minimize the variance of "gradient / hessian". Hessian gain is available for the losses: BINOMIAL_LOG_LIKELIHOOD, SQUARED_ERROR, MULTINOMIAL_LOG_LIKELIHOOD, LAMBDA_MART_NDCG.
optional float min_sum_hessian_in_leaf = 21
Minimum value of the sum of the hessians in the leaves. Splits that would violate this constraint are ignored. Only used when "use_hessian_gain" is true.
optional bool use_goss = 23
Deprecated: Use GradientOneSideSampling in the "sampling_methods" below.
optional float goss_alpha = 24
optional float goss_beta = 25
oneof sampling_methods
- GradientBoostedTreesTrainingConfig.SelectiveGradientBoosting selective_gradient_boosting = 27
- GradientBoostedTreesTrainingConfig.GradientOneSideSampling gradient_one_side_sampling = 28
  aka GOSS
- GradientBoostedTreesTrainingConfig.StochasticGradientBoosting stochastic_gradient_boosting = 29
  StochasticGradientBoosting is the default and classical approach.
optional bool apply_link_function = 34
If true, applies the link function (a.k.a. activation function), if any, before returning the model prediction. If false, returns the pre-link function model output. For example, in the case of binary classification, the pre-link function output is a logit while the post-link function output is a probability.
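For binary classification, the difference between the pre-link and post-link output can be sketched as follows. The function `model_output` is hypothetical, and the sigmoid is assumed to be the link function for this task.

```python
import math

def model_output(sum_of_leaf_values, apply_link_function):
    """Return a probability if the link is applied, else the raw logit."""
    if apply_link_function:
        return 1.0 / (1.0 + math.exp(-sum_of_leaf_values))
    return sum_of_leaf_values

logit = 2.0  # Hypothetical accumulated output of the trees for one example.
print(model_output(logit, apply_link_function=False))
print(model_output(logit, apply_link_function=True))
```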
optional bool compute_permutation_variable_importance = 35
If true, compute the permutation variable importance of the model at the end of the training using the validation dataset. Enabling this feature can increase the training time significantly.
optional int64 total_max_num_nodes = 40
Limit the total number of nodes in the model over all trees. This limit is an upper bound that may not be reached exactly. If the value is smaller than the number of nodes of a single tree according to other hyperparameter, the learner may return an empty model. This hyperparameter is useful for hyperparameter tuning models with very few nodes. For training individual models, prefer adapting max_num_nodes / max_depth and num_trees.
optional GradientBoostedTreesTrainingConfig.Internal internal = 39
Internal knobs of the algorithm that don't impact the final model.

Used in: GradientBoostedTreesTrainingConfig

optional float dropout_rate = 1
Fraction of trees that are randomly masked. "dropout_rate=1" indicates that all trees will be masked i.e. the algorithm is somewhat equivalent to Random Forest. "dropout_rate=0" means that one tree will be masked i.e. the algorithm is almost equivalent to MART.
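The two extreme values of "dropout_rate" can be understood from the tree-selection step of DART, sketched below. The helper `select_dropped_trees` is hypothetical; the fallback to one random tree follows the DART paper and matches the statement that "dropout_rate=0" masks one tree, but the library's exact procedure is not shown here.

```python
import random

def select_dropped_trees(num_trees, dropout_rate, rng):
    """Return the indices of the existing trees masked for this iteration."""
    if num_trees == 0:
        return []
    dropped = [i for i in range(num_trees) if rng.random() < dropout_rate]
    if not dropped:
        # At least one tree is always masked, so dropout_rate=0 masks one tree.
        dropped = [rng.randrange(num_trees)]
    return dropped

rng = random.Random(0)
print(len(select_dropped_trees(10, 1.0, rng)))  # every tree is masked
print(len(select_dropped_trees(10, 0.0, rng)))  # exactly one tree is masked
```

The new tree is then trained to correct the mistakes of the trees that were not masked.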

Decision Trees are trained sequentially. Training too many trees leads to training dataset overfitting. The "early stopping" policy controls the detection of training overfitting and halts the training (before "num_trees" trees have been trained). The overfitting is estimated using the validation dataset. Therefore, "validation_set_ratio" should be >0 if the early stopping is enabled. The early stopping policy runs every "validation_interval_in_trees" trees: The number of trees of the final model will be a multiple of "validation_interval_in_trees".

Used in: GradientBoostedTreesTrainingConfig

NONE = 0
No early stopping. Train all the "num_trees" trees.
MIN_VALIDATION_LOSS_ON_FULL_MODEL = 1
Trains all the "num_trees" trees, and then selects the subset {1,.., k} of trees that minimizes the validation loss.
VALIDATION_LOSS_INCREASE = 2
Stops the training when the validation loss stops decreasing. More precisely, stops the training when the set of trees with the best validation loss has at least "early_stopping_num_trees_look_ahead" fewer trees than the current model. "VALIDATION_LOSS_INCREASE" is more efficient than "MIN_VALIDATION_LOSS_ON_FULL_MODEL" but can lead to worse models.
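The two non-trivial policies can be compared on a recorded sequence of validation losses. The sketch below assumes one validation per tree ("validation_interval_in_trees" = 1) and interprets the look-ahead rule as "stop once the best model is at least `look_ahead` trees behind the current one". Both function names and the loss values are made up for illustration.

```python
def min_validation_loss_on_full_model(validation_losses):
    """Train everything, then return the number of trees k with the lowest loss."""
    best_idx = min(range(len(validation_losses)),
                   key=validation_losses.__getitem__)
    return best_idx + 1

def validation_loss_increase(validation_losses, look_ahead, initial_iteration=0):
    """Return (trees_trained, trees_kept) under the look-ahead stopping rule."""
    best_loss, best_num_trees = float("inf"), 0
    for idx, loss in enumerate(validation_losses):
        num_trees = idx + 1
        if idx < initial_iteration:
            continue  # Early, noisy iterations are not considered.
        if loss < best_loss:
            best_loss, best_num_trees = loss, num_trees
        elif num_trees - best_num_trees >= look_ahead:
            return num_trees, best_num_trees
    return len(validation_losses), best_num_trees

# Hypothetical losses: a temporary plateau hides a later improvement.
losses = [1.0, 0.8, 0.7, 0.75, 0.74, 0.72, 0.6, 0.9]
print(min_validation_loss_on_full_model(losses))
print(validation_loss_increase(losses, look_ahead=3))
```

On this sequence the look-ahead policy stops during the plateau and never sees the later, lower loss. This is the "more efficient but can lead to worse models" trade-off.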

"Gradient-based One-Side Sampling" (GOSS) is a sampling algorithm proposed in the following paper: "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". The paper claims that GOSS speeds up training without hurting quality by way of a clever sub-sampling methodology. Briefly, at the start of every iteration, the algorithm selects a subset of examples for training. It does so by sorting examples in decreasing order of absolute gradients, placing the top \alpha percent into the subset, and finally sampling \beta percent of the remaining examples.

Used in: GradientBoostedTreesTrainingConfig

optional float alpha = 1
Fraction of examples with the largest absolute gradient to keep in the sampled training set. Its value is expected to be in [0, 1]. As an example, setting alpha to .2 means that 20% of the examples with the largest absolute gradient will be placed into the sampled set.
optional float beta = 2
Sampling ratio in [0, 1] used to select remaining examples that did not make the cut for goss_alpha above. For example, if goss_alpha is 0.2 and goss_beta is 0.1, then first 20% of examples with the largest gradient will be placed into the set. Then, of the remaining examples, 10% are selected randomly and placed into the set.
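The selection step controlled by "alpha" and "beta" can be sketched as below. Only the example selection is shown; any re-weighting of the sampled examples is left out, and `goss_sample` together with its rounding rule is an assumption of this example.

```python
import random

def goss_sample(gradients, alpha, beta, rng):
    """Return the indices of the examples kept for the next tree."""
    # Sort example indices by decreasing absolute gradient.
    order = sorted(range(len(gradients)),
                   key=lambda i: abs(gradients[i]), reverse=True)
    num_top = round(alpha * len(order))
    top, rest = order[:num_top], order[num_top:]
    # Sample a fraction "beta" of the remaining (small-gradient) examples.
    num_rest = round(beta * len(rest))
    return top + rng.sample(rest, num_rest)

# Hypothetical gradients for 100 examples.
gradients = [float(i - 50) for i in range(100)]
selected = goss_sample(gradients, alpha=0.2, beta=0.1, rng=random.Random(0))
print(len(selected))
```

With 100 examples, alpha = 0.2 keeps the 20 examples with the largest absolute gradient, and beta = 0.1 adds 8 of the 80 remaining ones.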

Used in: GradientBoostedTreesTrainingConfig

optional bool enable_ndcg_indicator_labels_optimization = 1
If true, the optimization for binary labels with a single 1 per query for NDCG gradient computation is used. This will not impact the model but increase training time. Exposed for testing only.

Used in: GradientBoostedTreesTrainingConfig

(message has no fields)

Used in: GradientBoostedTreesTrainingConfig

(message has no fields)

Used in: GradientBoostedTreesTrainingConfig

optional int32 num_recycling = 1
Number of times a sample is re-used before being discarded. Increasing this value will speed-up the training speed if IO is the bottle-neck (

Selective Gradient Boosting (SelGB) is a method proposed in the SIGIR 2018 paper entitled "Selective Gradient Boosting for Effective Learning to Rank" by Lucchese et al. The algorithm always selects all positive examples, but selects only those negative training examples that are more difficult (i.e., those with larger scores). Note: Selective Gradient Boosting is only available for ranking tasks. This method is disabled for all other tasks.

Used in: GradientBoostedTreesTrainingConfig

optional float ratio = 1
The ratio of negative examples to keep. Negative examples are sorted by their score and the top examples are added to the selected set.
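A simplified view of the selection controlled by "ratio" is sketched below. The SelGB paper applies the selection within each query; this example ignores the query structure for brevity, and `selective_sample` is a hypothetical helper.

```python
def selective_sample(labels, scores, ratio):
    """Keep all positives and the top `ratio` fraction of negatives by score."""
    positives = [i for i, y in enumerate(labels) if y > 0]
    negatives = [i for i, y in enumerate(labels) if y <= 0]
    # Highest-scored negatives first: these are the "difficult" ones.
    negatives.sort(key=lambda i: scores[i], reverse=True)
    num_kept = round(ratio * len(negatives))
    return positives + negatives[:num_kept]

labels = [1, 0, 0, 0, 0]
scores = [0.1, 0.9, 0.2, 0.8, 0.3]  # Hypothetical current model scores.
print(selective_sample(labels, scores, ratio=0.5))
```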

Stochastic Gradient Boosting samples examples uniformly randomly.

Used in: GradientBoostedTreesTrainingConfig

optional float ratio = 1
Relative size of the dataset sampled for each tree. A value of 1 indicates that the sample has the same size as the original dataset.

Header for the gradient boosted trees model.

Next ID: 10

Used in: GradientBoostedTreesSerializedModel

optional int32 num_node_shards = 1
Number of shards used to store the nodes.
optional int64 num_trees = 2
Number of trees.
optional Loss loss = 3
Loss used to train the model.
repeated float initial_predictions = 4
Initial predictions of the model (before any tree is applied). The semantic of the prediction depends on the "loss".
optional int32 num_trees_per_iter = 5
Number of trees extracted at each gradient boosting operation.
optional float validation_loss = 6
Loss evaluated on the validation dataset. Only available if a validation dataset was provided during training.
optional string node_format = 7
Container used to store the trees' nodes.
optional TrainingLogs training_logs = 8
Evaluation metrics and other meta-data computed during training.
optional bool output_logits = 9
If true, calls to predict methods return logits (e.g. instead of probabilities in the case of classification).
optional LossConfiguration loss_configuration = 10
Configuration options for losses.

Used in: GradientBoostedTreesTrainingConfig, Header

DEFAULT = 0
Selects the most adapted loss according to the nature of the task and the statistics of the label. - Binary classification -> BINOMIAL_LOG_LIKELIHOOD.
BINOMIAL_LOG_LIKELIHOOD = 1
Binomial log likelihood. Only valid for binary classification.
SQUARED_ERROR = 2
Least square loss. Only valid for regression.
MULTINOMIAL_LOG_LIKELIHOOD = 3
Multinomial log likelihood i.e. cross-entropy.
LAMBDA_MART_NDCG5 = 4
DEPRECATED: Use LAMBDA_MART_NDCG. LambdaMART with NDCG5
XE_NDCG_MART = 5
XE_NDCG_MART [arxiv.org/abs/1911.09798]
BINARY_FOCAL_LOSS = 6
EXPERIMENTAL. Focal loss. Only valid for binary classification. [https://arxiv.org/pdf/1708.02002.pdf]
POISSON = 7
Poisson log likelihood. Only valid for regression.
MEAN_AVERAGE_ERROR = 8
Mean average error, i.e. mean absolute error (MAE).
LAMBDA_MART_NDCG = 9
LambdaMART with NDCG loss. Truncation defaults to 5, configurable.

Used in: Header

oneof loss_options
- LossConfiguration.LambdaMartNdcg lambda_mart_ndcg = 1
- LossConfiguration.XeNdcg xe_ndcg = 2
- LossConfiguration.BinaryFocalLossOptions binary_focal_loss_options = 3

Used in: GradientBoostedTreesTrainingConfig, LossConfiguration

optional float misprediction_exponent = 1
Exponent of the misprediction multiplier in focal loss. Corresponds to the gamma parameter in https://arxiv.org/pdf/1708.02002.pdf
optional float positive_sample_coefficient = 2
A hypertuning coefficient to multiply the loss and its gradient(s) in case of a positive sample. Loss and gradient on positive samples will be multiplied by positive_sample_coefficient, on negative samples will be multiplied by (1 - positive_sample_coefficient) Corresponds to the 'alpha' parameter in https://arxiv.org/pdf/1708.02002.pdf
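The roles of the two focal loss options can be seen in a direct transcription of the loss from the referenced paper, with "misprediction_exponent" as gamma and "positive_sample_coefficient" as alpha. The function name and the parameter values used in the demonstration are illustrative assumptions, not documented defaults.

```python
import math

def binary_focal_loss(label, probability,
                      misprediction_exponent, positive_sample_coefficient):
    """Focal loss of one example given the predicted P(label == 1)."""
    gamma, alpha = misprediction_exponent, positive_sample_coefficient
    if label == 1:
        # Positive example: weighted by alpha, damped when p is already high.
        return -alpha * (1.0 - probability) ** gamma * math.log(probability)
    # Negative example: weighted by (1 - alpha), damped when p is already low.
    return -(1.0 - alpha) * probability ** gamma * math.log(1.0 - probability)

# A well-classified positive contributes much less once gamma > 0.
print(binary_focal_loss(1, 0.9, 0.0, 0.5))
print(binary_focal_loss(1, 0.9, 2.0, 0.5))
```

With gamma = 0 and alpha = 0.5 the expression reduces to half of the usual binary cross-entropy.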

Used in: GradientBoostedTreesTrainingConfig, LossConfiguration

optional bool gradient_use_non_normalized_dcg = 1
If false, the gradient is computed using NDCG i.e. normalized DCG. If true, the gradient is computed using non-normalized DCG.
optional int32 ndcg_truncation = 2
Number of candidates considered when computing the NDCG loss. NDCG losses are usually truncated at a particular rank level (generally between 4 and 10), i.e. only the highly ranked documents are considered when computing the rank. A smaller value results in a model with increased emphasis on the first results of the ranking. Note that the NDCG truncation of the cross-entropy NDCG loss must be configured separately.
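The effect of the truncation level can be illustrated with a plain NDCG computation. The gain `2^relevance - 1` and the `log2` rank discount are the common textbook convention and are assumed here; the helper names `dcg` and `ndcg` are hypothetical.

```python
import math

def dcg(relevances, truncation):
    """Discounted cumulative gain over the first `truncation` positions."""
    return sum((2.0 ** rel - 1.0) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:truncation]))

def ndcg(relevances_in_predicted_order, truncation):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True), truncation)
    if ideal == 0.0:
        return 0.0
    return dcg(relevances_in_predicted_order, truncation) / ideal

# The only relevant document is ranked third.
ranking = [0, 0, 1]
print(ndcg(ranking, truncation=5))  # the document is inside the truncation
print(ndcg(ranking, truncation=2))  # the document is cut off
```

A document ranked below the truncation level contributes nothing, which is why a smaller truncation puts more emphasis on the first results.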

Used in: GradientBoostedTreesTrainingConfig, LossConfiguration

optional XeNdcg.Gamma gamma = 1
optional int32 ndcg_truncation = 2
Number of candidates considered when computing the NDCG loss. NDCG losses are usually truncated at a particular rank level (generally between 4 and 10), i.e. only the highly ranked documents are considered when computing the rank. A smaller value results in a model with increased emphasis on the first results of the ranking. Note that the NDCG truncation of the classic NDCG loss must be configured separately.

Used in: XeNdcg

AUTO = 0
For the time being, defaults to UNIFORM.
UNIFORM = 1
Gammas are sampled from a uniform distribution on [0, 1].
ONE = 2
Gammas are set to 1 across the board. This is more appropriate for click datasets with a large number of documents per query.

Log of the training. This proto is generated during the training of the model and optionally exported (as a plot) in the training logs directory.

Used in: Header

repeated TrainingLogs.Entry entries = 1
Measurements of the model size and performance during the training.
repeated string secondary_metric_names = 2
Names of the metrics stored in the "secondary_metrics" fields. The secondary metrics depend on the task (e.g. classification) and are accessible with "SecondaryMetricNames()". The i-th metric name of "secondary_metric_names" corresponds to the i-th metric value in "training_secondary_metrics" and "validation_secondary_metrics".
optional int32 number_of_trees_in_final_model = 3
Number of trees in the final model. Without early stopping, "number_of_trees_in_final_model" is equal to the "number_of_trees" of the last "entries".

Used in: TrainingLogs

optional int32 number_of_trees = 1
Number of trees. In the case of multi-dimensional gradients, "number_of_trees" is the number of training steps.
optional float training_loss = 2
Performance of the model on the training dataset.
repeated float training_secondary_metrics = 3
optional float validation_loss = 4
Performance of the model on the validation dataset.
repeated float validation_secondary_metrics = 5
optional double mean_abs_prediction = 6
Average of the absolute value of the new tree predictions estimated on the training dataset. See Dart paper (http://proceedings.mlr.press/v38/korlakaivinayak15.pdf) for details on how to interpret it.
optional float subsample_factor = 7
Sub-sampling factor applied during training on top of the "sampling" hyper-parameter. Currently, the "subsample_factor" is only controlled by the "adapt_subsample_for_maximum_training_duration" field i.e. the "subsample_factor" (defaults to 1) is reduced so the training finishes in "maximum_training_duration".
optional utils.proto.IntegersConfusionMatrixDouble validation_confusion_matrix = 8
Confusion between the label and the predictions.

package yggdrasil_decision_forests.model.gradient_boosted_trees.proto

message GradientBoostedTreesSerializedModel

optional Header header = 1

message GradientBoostedTreesTrainingConfig

optional int32 num_trees = 1

optional decision_tree.proto.DecisionTreeTrainingConfig decision_tree = 2

optional float shrinkage = 3

optional float subsample = 4

oneof sampling_implementation

GradientBoostedTreesTrainingConfig.SampleInMemory sample_in_memory = 32

GradientBoostedTreesTrainingConfig.SampleWithShards sample_with_shards = 31

optional Loss loss = 5

optional float validation_set_ratio = 6

optional string validation_set_group_feature = 11

optional int32 validation_interval_in_trees = 7

optional int32 export_logs_during_training_in_trees = 33

optional GradientBoostedTreesTrainingConfig.EarlyStopping early_stopping = 8

optional int32 early_stopping_num_trees_look_ahead = 9

optional int32 early_stopping_initial_iteration = 37

oneof loss_options

LossConfiguration.LambdaMartNdcg lambda_mart_ndcg = 12

LossConfiguration.XeNdcg xe_ndcg = 26

LossConfiguration.BinaryFocalLossOptions binary_focal_loss_options = 36

optional float l2_regularization = 13

optional float l2_regularization_categorical = 22

optional float l1_regularization = 19

optional float clamp_leaf_logit = 30

optional float lambda_loss = 14

oneof forest_extraction

GradientBoostedTreesTrainingConfig.MART mart = 15

GradientBoostedTreesTrainingConfig.DART dart = 16

optional bool adapt_subsample_for_maximum_training_duration = 17

optional float min_adapted_subsample = 18

optional bool use_hessian_gain = 20

optional float min_sum_hessian_in_leaf = 21

optional bool use_goss = 23

optional float goss_alpha = 24

optional float goss_beta = 25

oneof sampling_methods

GradientBoostedTreesTrainingConfig.SelectiveGradientBoosting selective_gradient_boosting = 27

GradientBoostedTreesTrainingConfig.GradientOneSideSampling gradient_one_side_sampling = 28

GradientBoostedTreesTrainingConfig.StochasticGradientBoosting stochastic_gradient_boosting = 29

optional bool apply_link_function = 34

optional bool compute_permutation_variable_importance = 35

optional int64 total_max_num_nodes = 40

optional GradientBoostedTreesTrainingConfig.Internal internal = 39

message GradientBoostedTreesTrainingConfig.DART

optional float dropout_rate = 1

enum GradientBoostedTreesTrainingConfig.EarlyStopping

NONE = 0

MIN_VALIDATION_LOSS_ON_FULL_MODEL = 1

VALIDATION_LOSS_INCREASE = 2

message GradientBoostedTreesTrainingConfig.GradientOneSideSampling

optional float alpha = 1

optional float beta = 2

message GradientBoostedTreesTrainingConfig.Internal

optional bool enable_ndcg_indicator_labels_optimization = 1

message GradientBoostedTreesTrainingConfig.MART

message GradientBoostedTreesTrainingConfig.SampleInMemory

message GradientBoostedTreesTrainingConfig.SampleWithShards

optional int32 num_recycling = 1

message GradientBoostedTreesTrainingConfig.SelectiveGradientBoosting

optional float ratio = 1

message GradientBoostedTreesTrainingConfig.StochasticGradientBoosting

optional float ratio = 1

message Header

optional int32 num_node_shards = 1

optional int64 num_trees = 2

optional Loss loss = 3

repeated float initial_predictions = 4

optional int32 num_trees_per_iter = 5

optional float validation_loss = 6

optional string node_format = 7

optional TrainingLogs training_logs = 8

optional bool output_logits = 9

optional LossConfiguration loss_configuration = 10

enum Loss

DEFAULT = 0

BINOMIAL_LOG_LIKELIHOOD = 1

SQUARED_ERROR = 2