Training configuration for the Gradient Boosted Trees algorithm.
Next ID: 39
Used in:
Maximum number of iterations during training. In the case of single output models (e.g. binary classification, regression, ranking), this value is equal to the number of trees.
Decision tree specific parameters. The default maximum depth of the trees is: 6.
Shrinkage parameters. Default values: forest_extraction: MART => 0.1; forest_extraction: DART => 1.0.
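As a conceptual illustration of what shrinkage does, the sketch below scales each tree's output before adding it to the running prediction. The function name and the way tree outputs are passed in are assumptions made for this example; an actual learner may fold the shrinkage into the leaf values during training.

```python
def boosted_prediction(initial_prediction, tree_outputs, shrinkage=0.1):
    """Adds the shrunken output of each tree to the initial prediction."""
    prediction = initial_prediction
    for output in tree_outputs:
        # A smaller shrinkage makes each tree contribute less.
        prediction += shrinkage * output
    return prediction


# Three hypothetical tree outputs for a single example.
print(boosted_prediction(0.5, [1.0, -2.0, 4.0], shrinkage=0.1))
```

With a shrinkage of 1.0 (the DART default listed above), tree outputs are added unscaled.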
Fraction of examples used to train each tree. If =1, all the examples are used to train each tree. If <1, a random subset of examples is sampled for each tree. Deprecated. Use "stochastic_gradient_boosting" instead. Note: This parameter is ignored if another sampling strategy ("sampling_methods") is set.
How the sampling ("subsample" or "sampling_methods") is implemented.
The entire dataset is loaded in memory, and the subsampling ("subsample" parameter) and extraction of the validation dataset ("validation_set_ratio" parameter) are done at the example level. This method is fast and accurate, but consumes a lot of memory if the dataset is large (e.g. >10B values).
The entire dataset is NOT loaded in memory, and "subsample" and "validation_set_ratio" are applied at the shard level. This method consumes less memory (great if your dataset does not fit in memory) but requires more IO and CPU operations, and the final model might (or might not) require more trees to reach the same quality as the first method. For the sampling to be of high quality (and the model to train well), the number of shards should be large enough (>=1000 shards is a good situation).
In the logs:
- "loader-blocking" indicates how much of the time the training is stopped to wait for IO. A good value is 0%. This value can be optimized by locating the dataset "near" the job, using an appropriate amount of sharding (i.e. not too much; 100 shards per sample works well) and compression (if uncompressing the dataset is too expensive), increasing the sample size (i.e. making the training more complex) or recycling the samples (with "num_recycling").
- "preprocessing-load" indicates how much of the preparation time (IO + preprocessing) is spent preprocessing the data. High values are not an issue as long as "loader-blocking" is small.
Constraints:
- The code raises an error if the number of shards is <10.
- No support (yet) for ranking or DART.
In detail, each tree is trained on a random subset of shards (controlled by "subsample"). Once the tree is trained, the random subset of shards is discarded and the training continues. The same shard can be used multiple times for different trees. The loading of the next shard and the training of the current tree are done in parallel. Ideally, both should run at the same speed. The amount of time spent waiting for shard loading and preparation instead of training is displayed in the logs as "loader-blocking".
Loss minimized by the model. The value "DEFAULT" selects the most suitable loss according to the nature of the task and the statistics of the label.
Ratio of the training dataset used to monitor the training. If >0, the validation set is used to select the actual number of trees (<=num_trees).
If set, defines the name of the feature to use in the splitting of the validation dataset. In other words, if set, the validation dataset and the sub-training dataset cannot share examples with the same "group feature" value.
Evaluate the model on the validation set every "validation_interval_in_trees" trees. Impacts the early stopping policy.
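A minimal sketch of a group-aware split, assuming examples are stored as dictionaries and that whole groups are assigned to the validation set at random. The function and argument names are hypothetical and the real splitting logic may differ; the point is only that no group value appears on both sides.

```python
import random


def split_by_group(examples, group_key, validation_ratio, seed=0):
    """Splits examples so that no group value is shared between the two sets."""
    rng = random.Random(seed)
    # Decide once per group (not per example) where the group goes.
    groups = sorted({example[group_key] for example in examples})
    validation_groups = {g for g in groups if rng.random() < validation_ratio}
    train = [e for e in examples if e[group_key] not in validation_groups]
    valid = [e for e in examples if e[group_key] in validation_groups]
    return train, valid


examples = [{"user": u, "x": i} for i, u in enumerate("aabbccddee")]
train, valid = split_by_group(examples, "user", validation_ratio=0.3)
print(len(train), len(valid))
```

Because the decision is made per group, the realized validation fraction only approximates the requested ratio.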
If set and >0, export the training logs every "export_logs_during_training_in_trees" trees.
0-based index of the first iteration considered for early stopping computation. During the first iterations of a learner, the validation loss can be noisy, since the learner has yet to learn useful information. In particular, the validation loss during early iterations can be unusually small. This leads to early stopping while the model still has poor quality. This parameter specifies the index of the first iteration during which a validation loss is computed, i.e. the first iteration considered for early stopping.
L2 regularization on the tree predictions i.e. on the value of the leaf. See equation 2 of the XGBoost paper for the definition (https://arxiv.org/pdf/1603.02754.pdf). This term is not integrated in the reported loss of the model i.e. the loss of models trained with and without L2 regularization can be compared. Used for the following losses: BINOMIAL_LOG_LIKELIHOOD, SQUARED_ERROR, MULTINOMIAL_LOG_LIKELIHOOD, LAMBDA_MART_NDCG, or if use_hessian_gain is true. Note: In the case of RMSE loss for regression, the L2 regularization plays the same role as the "shrinkage" parameter.
L2 regularization for the categorical features with the hessian loss.
L1 regularization on the tree predictions i.e. on the value of the leaf. Used for the following losses: LAMBDA_MART_NDCG, or if use_hessian_gain is true.
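To illustrate how L1 and L2 terms act on a leaf value, the sketch below uses the closed-form leaf weight from the XGBoost formulation: the sum of gradients is soft-thresholded by the L1 term and divided by the sum of hessians plus the L2 term. The function name, argument names and sign convention (gradient = derivative of the loss) are assumptions for this example, not a description of the library's internals.

```python
def regularized_leaf_value(sum_gradient, sum_hessian, l1=0.0, l2=0.0):
    """Leaf value minimizing a second-order loss approximation with L1/L2 terms."""
    # Soft-thresholding: the L1 term pulls small gradient sums to exactly zero.
    if sum_gradient > l1:
        numerator = sum_gradient - l1
    elif sum_gradient < -l1:
        numerator = sum_gradient + l1
    else:
        numerator = 0.0
    # The L2 term inflates the denominator and shrinks the leaf toward zero.
    # Assumes sum_hessian + l2 > 0.
    return -numerator / (sum_hessian + l2)


print(regularized_leaf_value(4.0, 2.0))          # no regularization
print(regularized_leaf_value(4.0, 2.0, l2=2.0))  # L2 halves the leaf value here
print(regularized_leaf_value(4.0, 2.0, l1=1.0))  # L1 reduces the gradient sum
```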
Maximum absolute value of the leaf representing a logit (for binary and multi-class classification). This parameter generally has no impact on the quality of the model. It prevents the appearance of large values, and then infinity and then NaN, during training in the computation of logistic and soft-max. The value is selected such that log(clamp_leaf_logit) can be comfortably represented as a float.
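The clamping itself can be pictured as follows. This is a sketch only: the helper name is hypothetical and the bound of 5.0 used in the usage line is an arbitrary value chosen for the example.

```python
def clamp_leaf(value, clamp_leaf_logit):
    """Restricts a leaf value to the range [-clamp_leaf_logit, clamp_leaf_logit]."""
    return max(-clamp_leaf_logit, min(clamp_leaf_logit, value))


# A very large raw leaf value is cut back to the bound; small values pass through.
print(clamp_leaf(123.4, clamp_leaf_logit=5.0), clamp_leaf(-0.3, clamp_leaf_logit=5.0))
```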
The "lambda" constant available in some loss formulations. Does not impact the optimal solution, but provides a smoothing of the loss that can be beneficial. Currently only used for the losses: - LAMBDA_MART_NDCG
How the forest of trees is built. Defaults to "mart".
MART (Multiple Additive Regression Trees): The "classical" way to build a GBDT i.e. each tree tries to "correct" the mistakes of the previous trees.
DART (Dropout Additive Regression Trees), a modification of MART proposed in http://proceedings.mlr.press/v38/korlakaivinayak15.pdf. Here, each tree tries to "correct" the mistakes of a random subset of the previous trees.
If true, the "subsample" parameter will be adapted dynamically such that the model trains in the "maximum_training_duration" time. "subsample" can only be reduced, i.e. enabling this feature can only reduce the training time, likely at the expense of quality.
Maximum impact of the "adapt_subsample_for_maximum_training_duration" parameter.
If true, uses a formulation of split gain with the hessian i.e. optimizes the splits to minimize the variance of "gradient / hessian". Hessian gain is available for the losses: BINOMIAL_LOG_LIKELIHOOD, SQUARED_ERROR, MULTINOMIAL_LOG_LIKELIHOOD, LAMBDA_MART_NDCG.
Minimum value of the sum of the hessians in the leaves. Splits that would violate this constraint are ignored. Only used when "use_hessian_gain" is true.
Deprecated: Use GradientOneSideSampling in the "sampling_methods" below.
aka GOSS
StochasticGradientBoosting is the default and classical approach.
If true, applies the link function (a.k.a. activation function), if any, before returning the model prediction. If false, returns the pre-link function model output. For example, in the case of binary classification, the pre-link function output is a logit while the post-link function output is a probability.
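For binary classification, the relation between the two outputs can be sketched with the logistic (sigmoid) function. The function names below are illustrative and not part of any API.

```python
import math


def sigmoid(logit):
    """Logistic link function: maps a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def model_output(logit, apply_link_function=True):
    """Returns the probability if the link function is applied, else the raw logit."""
    return sigmoid(logit) if apply_link_function else logit


print(model_output(0.0), model_output(0.0, apply_link_function=False))
```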
If true, compute the permutation variable importance of the model at the end of the training using the validation dataset. Enabling this feature can increase the training time significantly.
Limit the total number of nodes in the model over all trees. This limit is an upper bound that may not be reached exactly. If the value is smaller than the number of nodes of a single tree according to the other hyperparameters, the learner may return an empty model. This hyperparameter is useful for hyperparameter tuning of models with very few nodes. For training individual models, prefer adapting max_num_nodes / max_depth and num_trees.
Internal knobs of the algorithm that don't impact the final model.
Used in:
Rate of trees that are randomly masked. "dropout_rate=1" indicates that all trees will be masked i.e. the algorithm is somewhat equivalent to Random Forest. "dropout_rate=0" means that one tree will be masked i.e. the algorithm is almost equivalent to MART.
Decision Trees are trained sequentially. Training too many trees leads to training dataset overfitting. The "early stopping" policy controls the detection of training overfitting and halts the training (before "num_trees" trees have been trained). The overfitting is estimated using the validation dataset. Therefore, "validation_set_ratio" should be >0 if the early stopping is enabled. The early stopping policy runs every "validation_interval_in_trees" trees: the number of trees of the final model will be a multiple of "validation_interval_in_trees".
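One way to picture the masking step is the sketch below: each existing tree is masked independently with probability "dropout_rate", and if none is selected, a single tree is masked at random, which matches the "dropout_rate=0" behavior described above. The function name and the per-tree independent draw are assumptions for this example; the actual selection procedure may differ.

```python
import random


def sample_masked_trees(num_trees, dropout_rate, rng):
    """Returns the indices of the trees masked for the next boosting iteration."""
    masked = [i for i in range(num_trees) if rng.random() < dropout_rate]
    if not masked and num_trees > 0:
        # Always mask at least one tree.
        masked = [rng.randrange(num_trees)]
    return masked


rng = random.Random(0)
print(len(sample_masked_trees(10, 0.0, rng)), len(sample_masked_trees(10, 1.0, rng)))
```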
Used in:
No early stopping. Train all the "num_trees" trees.
Trains all the "num_trees" trees, and then selects the subset {1,.., k} of trees that minimizes the validation loss.
Stops the training when the validation loss stops decreasing. More precisely, stops the training when the set of trees with the best validation loss has "early_stopping_num_trees_look_ahead" fewer trees than the current model. "VALIDATION_LOSS_INCREASE" is more efficient than "MIN_VALIDATION_LOSS_ON_FULL_MODEL" but can lead to worse models.
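The difference between the two policies can be sketched on a recorded sequence of validation losses. This is a simplified illustration under stated assumptions: the loss is evaluated after every tree (i.e. a validation interval of 1), the function and argument names are hypothetical, and the exact stopping condition of the learner may differ.

```python
def stop_on_loss_increase(validation_losses, look_ahead, initial_iteration=0):
    """Returns (trees_trained, trees_kept) for a look-ahead stopping policy.

    validation_losses[i] is the validation loss of the model with i + 1 trees.
    """
    best_loss = float("inf")
    best_num_trees = 0
    for i, loss in enumerate(validation_losses):
        num_trees = i + 1
        if i < initial_iteration:
            continue  # Early, noisy iterations are not considered.
        if loss < best_loss:
            best_loss, best_num_trees = loss, num_trees
        if num_trees - best_num_trees >= look_ahead:
            return num_trees, best_num_trees  # Stop training here.
    return len(validation_losses), best_num_trees


def select_on_full_model(validation_losses):
    """Returns the number of trees with the lowest loss after full training."""
    best = min(range(len(validation_losses)), key=validation_losses.__getitem__)
    return best + 1


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.6]
print(stop_on_loss_increase(losses, look_ahead=3))  # stops before seeing 0.6
print(select_on_full_model(losses))                 # trains everything, keeps 7
```

In this made-up sequence the look-ahead policy stops after 6 trees and keeps 3, while full training finds the lower loss at 7 trees. This is the efficiency versus quality trade-off mentioned above.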
"Gradient-based One-Side Sampling" (GOSS) is a sampling algorithm proposed in the following paper: "LightGBM: A Highly Efficient Gradient Boosting Decision Tree.' The paper claims that GOSS speeds up training without hurting quality by way of a clever sub-sampling methodology. Briefly, at the start of every iteration, the algorithm selects a subset of examples for training. It does so by sorting examples in decreasing order of absolute gradients, placing the top \alpha percent into the subset, and finally sampling \beta percent of the remaining examples.
Used in:
Fraction of examples with the largest absolute gradient to keep in the sampled training set. Its value is expected to be in [0, 1]. As an example, setting alpha to .2 means that 20% of the examples with the largest absolute gradient will be placed into the sampled set.
Sampling ratio in [0, 1] used to select remaining examples that did not make the cut for goss_alpha above. For example, if goss_alpha is 0.2 and goss_beta is 0.1, then first 20% of examples with the largest gradient will be placed into the set. Then, of the remaining examples, 10% are selected randomly and placed into the set.
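The selection described above can be sketched as follows. The function name and the rounding of subset sizes are assumptions for this example, and any re-weighting of the randomly sampled examples (as discussed in the LightGBM paper) is left out.

```python
import random


def goss_sample(gradients, alpha, beta, rng):
    """Returns (top, sampled) example indices selected by a GOSS-like rule."""
    n = len(gradients)
    # Sort example indices by decreasing absolute gradient.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    num_top = round(alpha * n)
    top, rest = order[:num_top], order[num_top:]
    # Uniformly sample a fraction beta of the remaining examples.
    sampled = rng.sample(rest, round(beta * len(rest)))
    return top, sampled


rng = random.Random(0)
gradients = [rng.gauss(0.0, 1.0) for _ in range(100)]
top, sampled = goss_sample(gradients, alpha=0.2, beta=0.1, rng=rng)
print(len(top), len(sampled))
```

With 100 examples, alpha = 0.2 and beta = 0.1, the 20 examples with the largest absolute gradient are kept, plus 8 of the remaining 80.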
Used in:
If true, the optimization for binary labels with a single 1 per query for NDCG gradient computation used. This will not impact the model but increase training time. Exposed for testing only.
Used in:
(message has no fields)
Used in:
(message has no fields)
Used in:
Number of times a sample is re-used before being discarded. Increasing this value will speed up the training if IO is the bottleneck (
Selective Gradient Boosting (SelGB) is a method proposed in the SIGIR 2018 paper entitled "Selective Gradient Boosting for Effective Learning to Rank" by Lucchese et al. The algorithm always selects all positive examples, but selects only those negative training examples that are more difficult (i.e., those with larger scores). Note: Selective Gradient Boosting is only available for ranking tasks. This method is disabled for all other tasks.
Used in:
The ratio of negative examples to keep. Negative examples are sorted by their score and the top examples are added to the selected set.
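A minimal sketch of this selection for a single query, assuming that a label greater than 0 marks a positive example. The function name and the rounding are illustrative; in a ranking setting the selection would be applied within each query.

```python
def selgb_select(labels, scores, ratio):
    """Keeps all positives and the top-scored fraction of negatives."""
    positives = [i for i, label in enumerate(labels) if label > 0]
    negatives = [i for i, label in enumerate(labels) if label <= 0]
    # Negatives with a high score are the "difficult" ones.
    negatives.sort(key=lambda i: scores[i], reverse=True)
    kept_negatives = negatives[: round(ratio * len(negatives))]
    return sorted(positives + kept_negatives)


labels = [1, 0, 0, 0, 0]
scores = [0.1, 0.9, 0.2, 0.8, 0.3]
print(selgb_select(labels, scores, ratio=0.5))
```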
Stochastic Gradient Boosting samples examples uniformly randomly.
Used in:
Relative size of the dataset sampled for each tree. A value of 1 indicates that the sample has the same size as the original dataset.
Header for the gradient boosted trees model.
Next ID: 10
Used in:
Number of shards used to store the nodes.
Number of trees.
Loss used to train the model.
Initial predictions of the model (before any tree is applied). The semantics of the prediction depend on the "loss".
Number of trees extracted at each gradient boosting operation.
Loss evaluated on the validation dataset. Only available if a validation dataset was provided during training.
Container used to store the trees' nodes.
Evaluation metrics and other meta-data computed during training.
If true, calls to predict methods return logits (e.g. instead of probabilities in the case of classification).
Configuration options for losses.
Used in:
Selects the most suitable loss according to the nature of the task and the statistics of the label. - Binary classification -> BINOMIAL_LOG_LIKELIHOOD.
Binomial log likelihood. Only valid for binary classification.
Least square loss. Only valid for regression.
Multinomial log likelihood i.e. cross-entropy.
DEPRECATED: Use LAMBDA_MART_NDCG. LambdaMART with NDCG5
XE_NDCG_MART [arxiv.org/abs/1911.09798]
EXPERIMENTAL. Focal loss. Only valid for binary classification. [https://arxiv.org/pdf/1708.02002.pdf]
Poisson log likelihood. Only valid for regression.
Mean absolute error (MAE).
LambdaMART with NDCG loss. Truncation defaults to 5, configurable.
Used in:
Used in:
Exponent of the misprediction multiplier in focal loss. Corresponds to the gamma parameter in https://arxiv.org/pdf/1708.02002.pdf
A hypertuning coefficient to multiply the loss and its gradient(s) in case of a positive sample. Loss and gradient on positive samples will be multiplied by positive_sample_coefficient; on negative samples they will be multiplied by (1 - positive_sample_coefficient). Corresponds to the 'alpha' parameter in https://arxiv.org/pdf/1708.02002.pdf
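The roles of the two parameters can be sketched with the per-example focal loss from the referenced paper. The function name and the argument name "gamma" are chosen for this example; "positive_sample_coefficient" follows the description above.

```python
import math


def focal_loss(label, probability, gamma, positive_sample_coefficient):
    """Focal loss of one binary example given the predicted P(label == 1)."""
    if label == 1:
        # (1 - p)^gamma down-weights positives that are already well predicted.
        weight = positive_sample_coefficient * (1.0 - probability) ** gamma
        return -weight * math.log(probability)
    weight = (1.0 - positive_sample_coefficient) * probability ** gamma
    return -weight * math.log(1.0 - probability)


# A well-classified positive contributes much less when gamma > 0.
print(focal_loss(1, 0.9, gamma=0.0, positive_sample_coefficient=0.5))
print(focal_loss(1, 0.9, gamma=2.0, positive_sample_coefficient=0.5))
```

With gamma = 0 and a coefficient of 0.5, the expression reduces to half of the usual binary cross-entropy.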
Used in:
If false, the gradient is computed using NDCG i.e. normalized-DCG. If true, the gradient is computed using DCG.
Number of candidates considered when computing the NDCG loss. NDCG losses are usually truncated at a particular rank level (generally between 4 and 10), i.e. only the highly ranked documents are considered when computing the rank. A smaller value results in a model with increased emphasis on the first results of the ranking. Note that the NDCG truncation of the cross-entropy NDCG loss must be configured separately.
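To make the effect of the truncation concrete, the sketch below computes NDCG at a given rank. It assumes the common gain 2^relevance - 1 and a log2 rank discount; the function names are illustrative and the loss used during training may use a different formulation.

```python
import math


def dcg(relevances, truncation):
    """Discounted cumulative gain over the first `truncation` results."""
    return sum(
        (2.0 ** relevance - 1.0) / math.log2(rank + 2)
        for rank, relevance in enumerate(relevances[:truncation])
    )


def ndcg(relevances_in_predicted_order, truncation=5):
    """DCG of the predicted ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True), truncation)
    if ideal == 0.0:
        return 0.0
    return dcg(relevances_in_predicted_order, truncation) / ideal


# The only relevant document is ranked 6th: invisible at truncation 5.
ranking = [0, 0, 0, 0, 0, 3]
print(ndcg(ranking, truncation=5), ndcg(ranking, truncation=6))
```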
Used in:
Number of candidates considered when computing the NDCG loss. NDCG losses are usually truncated at a particular rank level (generally between 4 and 10), i.e. only the highly ranked documents are considered when computing the rank. A smaller value results in a model with increased emphasis on the first results of the ranking. Note that the NDCG truncation of the classic NDCG loss must be configured separately.
Used in:
For the time being, defaults to UNIFORM.
Gammas are sampled from a uniform distribution on [0, 1].
Gammas are set to 1 across the board. This is more appropriate for click datasets with a large number of documents per query.
Log of the training. This proto is generated during the training of the model and optionally exported (as a plot) in the training logs directory.
Used in:
Measurements of the model size and performance during the training.
Names of the metrics stored in the "secondary_metrics" field. The secondary metrics depend on the task (e.g. classification) and are accessible with "SecondaryMetricNames()". The i-th metric name of "secondary_metric_names" corresponds to the i-th metric value in "training_secondary_metrics" and "validation_secondary_metrics".
Number of trees in the final model. Without early stopping, "number_of_trees_in_final_model" is equal to the "number_of_trees" of the last "entries".
Used in:
Number of trees. In the case of multi-dimensional gradients, "number_of_trees" is the number of training steps.
Performance of the model on the training dataset.
Performance of the model on the validation dataset.
Average of the absolute value of the new tree predictions estimated on the training dataset. See the DART paper (http://proceedings.mlr.press/v38/korlakaivinayak15.pdf) for details on how to interpret it.
Sub-sampling factor applied during training on top of the "sampling" hyper-parameter. Currently, the "subsample_factor" is only controlled by the "adapt_subsample_for_maximum_training_duration" field i.e. the "subsample_factor" factor (defaults to 1) is reduced so the training finishes in "maximum_training_duration".
Confusion between the label and the predictions.