How to handle categorical input features.
Used in:
CART algorithm (default). Finds the best categorical split of the form "value \in mask". The solution is exact for binary classification, regression and ranking. It is approximated for multi-class classification. This is a good first algorithm to use. In case of overfitting (very small dataset, large dictionary), the "random" algorithm is a good alternative.
One-hot encoding. Finds the optimal categorical split of the form "attribute == param". This method is similar to (but more efficient than) converting each possible categorical value into a boolean feature. This method is available for comparison purposes and generally performs worse than the other alternatives.
Best split among a set of random candidates. Finds a categorical split of the form "value \in mask" using a random search. This solution can be seen as an approximation of the CART algorithm. This method is a strong alternative to CART. This algorithm is inspired by section "5.1 Categorical Variables" of "Random Forests", 2001 (https://www.stat.berkeley.edu/users/breiman/randomforest2001.pdf).
If the dictionary size of the attribute is greater than or equal to "arity_limit_for_random", the "random" algorithm is used (instead of the algorithm specified in "algorithm").
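As a rough illustration of this fallback rule, here is a minimal Python sketch; the helper name and argument names are illustrative only, not part of YDF:

```python
def effective_categorical_algorithm(configured_algorithm: str,
                                    vocab_size: int,
                                    arity_limit_for_random: int) -> str:
  """Returns the algorithm effectively used for a categorical attribute."""
  if vocab_size >= arity_limit_for_random:
    return "RANDOM"  # Large dictionaries fall back to the random search.
  return configured_algorithm  # e.g. "CART" or "ONE_HOT".
```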
Used in:
(message has no fields)
Used in:
Sampling ratio of the items tested. A value of 1 means that all the items will be tested.
Used in:
Controls the number of random splits to evaluate. The effective number of splits is "min(max_num_trials, num_trial_offset + {vocab size}^num_trial_exponent)", with "vocab size" the number of unique categorical values in the node.
Maximum number of candidates.
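For illustration, a small Python sketch of the formula above; the argument names mirror the fields described here, and this is not the YDF implementation:

```python
def effective_num_random_trials(vocab_size: int,
                                num_trial_exponent: float,
                                num_trial_offset: float,
                                max_num_trials: int) -> int:
  """Effective number of random categorical splits evaluated in a node."""
  return int(min(max_num_trials,
                 num_trial_offset + vocab_size ** num_trial_exponent))

# Example: 50 unique values, offset 100, exponent 2, capped at 1000 trials.
print(effective_num_random_trials(50, 2.0, 100.0, 1000))  # -> 1000
```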
The sub-messages of "ConditionParams" are the different types of condition that can be attached to a node.
Used in:
Type of condition.
Make sure to update "kNumConditionTypes" in "decision_tree.h" accordingly.
Condition of the type: (value \intersect elements) != empty_set where elements is stored as a bitmap over the possible values.
Used in:
Next ID: 2 [Required]
Condition of the type: (value \intersect elements) != empty_set.
Used in:
Next ID: 2
Condition of the type: indexed_value >= indexed_threshold.
Used in:
[Required]
Condition of the type: value >= threshold.
Used in:
[Required]
Next ID: 6 Condition of the type: value == NA (i.e. missing).
Used in:
(message has no fields)
Used in:
Used in:
Static / direct value of the anchor.
Used in:
Condition of the type \exists a \in Obs; |a - anchor|^2 <= threshold2, where |.|^2 is the squared Euclidean distance. Note: The condition is not on the Euclidean distance so as to avoid the rounding errors of a square root. Note: Notice the direction of the inequality (<=) compared to other ydf conditions (>=).
Threshold to apply on the squared distance.
Used in:
Condition of the type \exists a \in Obs; <a|anchor> >= threshold.
Used in:
True iff \sum_i examples[attribute_i] * weight_i >= threshold. The "attribute" field in "NodeCondition" should be one of the "attributes" in this message. If any of the attributes is missing, the condition evaluates to missing and returns "na_value".
If set, "na_replacements" defines the replacement values for missing attributes. For example, if attributes = [3, 5], attribute 3 is available and attribute 5 is missing, the condition is evaluated with attribute 3's value and the replacement value na_replacements[1]. If not set, and any of the attributes is missing, the condition evaluates to "na_value".
Condition of the type: value == True.
Used in:
(message has no fields)
Training configuration for the Random Forest algorithm.
Next ID: 26
Used in:
Maximum depth of the tree. max_depth=1 means that all trees will be roots. If max_depth=-1, the depth of the tree is not limited.
Minimum number of examples in a node.
Number of unique valid attributes tested for each node. An attribute is valid if it has at least one valid split. If "num_candidate_attributes=0", the value is set to the classical default value for Random Forest: "sqrt(number of input attributes)" in case of classification and "number of input attributes / 3" in case of regression. If "num_candidate_attributes=-1", all the attributes are tested.
If set, replaces "num_candidate_attributes" with "number_of_input_features x num_candidate_attributes_ratio". The possible values are in ]0, 1], as well as -1. If not set or equal to -1, "num_candidate_attributes" is used.
Whether to check the "min_examples" constraint in the split search (i.e. splits leading to one child having less than "min_examples" examples are considered invalid) or before the split search (i.e. a node can be derived only if it contains more than "min_examples" examples). If false, there can be nodes with less than "min_examples" training examples.
Whether to store the full distribution (e.g. the distribution of all the possible label values in case of classification) or only the top label (e.g. the most representative class). This information is used for model interpretation as well as in case of "winner_take_all_inference=false". In the worst case, this information can account for a significant part of the model size.
INFO: Use "pure_serving_model" instead of "keep_non_leaf_label_distribution". "pure_serving_model" is more general, works in more situations, and removes more unused data from the model. Whether to keep the node value (i.e. the distribution of the labels of the training examples) of non-leaf nodes. This information is not used during serving, however it can be used for model interpretation as well as hyper parameter tuning. In the worst case, this can account for half of the model size. keep_non_leaf_label_distribution=false is not compatible with monotonic constraints.
If true, the tree training evaluates conditions of the type "X is NA" i.e. "X is missing".
How to grow the tree.
[Default strategy] Each node is split independently of the other nodes. In other words, as long as a node satisfies the split constraints (e.g. maximum depth, minimum number of observations), the node will be split. This is the "classical" way to grow decision trees.
The node with the best loss reduction among all the nodes of the tree is selected for splitting. This method is also called "best first" or "leaf-wise growth". See "Best-first decision tree learning", Shi (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.2862&rep=rep1&type=pdf) and "Additive logistic regression : A statistical view of boosting", Friedman et al. (https://projecteuclid.org/euclid.aos/1016218223) for more details.
Options related to the learning of categorical splits.
Generate an error (if true) or a warning (if false) when the statistics exported by splitters don't match the observed statistics. This field is used in unit tests.
What structure of split to consider.
Axis aligned splits (i.e. one condition at a time). This is the "classical" way to train a tree. Default value.
Sparse oblique splits (i.e. splits on a small number of features) from "Sparse Projection Oblique Random Forests", Tomita et al., 2020. These splits are tested iff "sparse_oblique_split" is set.
Oblique splits from "Classification Based on Multivariate Contrast Patterns" by Canete-Sifuentes et al.
Uplift specific hyper-parameters.
Options to learn numerical vector sequence conditions.
Internal knobs of the algorithm that don't impact the final model.
See "split_axis".
Used in:
(message has no fields)
Used in:
Ratio of examples used to set the leaf values.
If true, a new random separation is generated for each tree. If false, the same separation is used for all the trees (for example, in a Gradient Boosted Trees model containing multiple trees).
Used in:
If set, ensures that the effective strategy is "ensure_effective_strategy". "ensure_effective_strategy" is only used in unit tests when the sorting strategy is not manually set, i.e. sorting_strategy = AUTO.
If false, the score of a hessian split is: score \approx \sum_{children} sum_grad^2 / sum_hessian. If true, the score of a hessian split is: score \approx (\sum_{children} sum_grad^2 / sum_hessian) - sum_grad_parent^2 / sum_hessian_parent. This flag has two effects: (1) The absolute value of the score is different (e.g. when looking at the variable importances). (2) When growing the tree with global optimization, the structure of the tree might differ (however, there is no impact on the structure when using the divide-and-conquer strategy). YDF implicitly used hessian_split_score_subtract_parent=false. XGBoost uses hessian_split_score_subtract_parent=true, but the paper is explicit that this is just one possible solution. Both versions make sense (and produce similar results). Another possible version would be to subtract the parent gradient before the square. An experiment was conducted on 68 datasets, with 10-fold CV repeated 3 times, to evaluate the effect of this flag. Both methods produce close models. However, in terms of average accuracy, average AUC and average rank, the "false" method is better than the "true" one by a small but visible margin.
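The two scoring variants above can be written as a small Python sketch; this is illustrative only, not the YDF splitter code:

```python
def hessian_split_score(children, parent, subtract_parent: bool) -> float:
  """children: list of (sum_grad, sum_hessian) pairs; parent: (sum_grad, sum_hessian)."""
  score = sum(g * g / h for g, h in children)
  if subtract_parent:  # hessian_split_score_subtract_parent=true (XGBoost-like)
    parent_grad, parent_hessian = parent
    score -= parent_grad * parent_grad / parent_hessian
  return score
```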
If true, partially checks the monotonic constraints of trees after training, that is, checks that the value of a positive node is greater than the value of a negative node (in the case of an increasing monotonic constraint). This option is used by unit tests. If false and a monotonic constraint is not satisfied, the monotonic constraint is manually enforced. The current checking implementation might flag as non-monotonic trees that are in fact monotonic (i.e. false positives). However, with the current algorithm used to create monotonic constraints, this checking algorithm cannot create false positives.
If true, the splitter returns an InvalidArgumentError. This field can be used to check the propagation of errors to the user.
How sorted values (non-discretized numerical values) are computed.
Used in:
Values are sorted within each node.
Values are pre-sorted into an index to speed up training. The index is automatically ignored when using it would be slower than in-node sorting or when the algorithm does not benefit from pre-sorting. This method can significantly increase the amount of memory required for training.
Always use the presorted index, even if the result would be slower. For testing only.
Automatically selects the best method (the quickest method that does not consume excessive RAM).
Used in:
Maximum number of attributes in the projection. Increasing this value increases the training time. Decreasing this value acts as a regularization. The value should be in [2, num_numerical_features]. If the value is above num_numerical_features, the value is capped to num_numerical_features. The value 1 is allowed but results in ordinary (non-oblique) splits.
If true, applies the attribute sampling in "num_candidate_attributes" and "num_candidate_attributes_ratio". If false, all attributes are tested.
Method used to handle missing attribute values.
Used in:
Missing attribute values are imputed with the mean (in the case of a numerical attribute) or the most-frequent item (in the case of a categorical attribute) computed on the entire dataset (i.e. the information contained in the data spec).
Missing attribute values are imputed with the mean (numerical attribute) or most-frequent-item (in the case of categorical attribute) evaluated on the training examples in the current node.
Missing attribute values are imputed from randomly sampled values from the training examples in the current node. This method was proposed by Ishwaran et al. in "Random Survival Forests" (https://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908043).
Used in:
Number of training examples to use when evaluating the score of an anchor in the anchor selection stage. Note that all the training examples are used when evaluating the score of an anchor-based split.
Number of anchors generated by sampling training example observations.
See "split_axis".
Used in:
Controls the number of random projections to test at each node. Increasing this value very likely improves the quality of the model, drastically increases the training time, and does not impact the inference time. Oblique splits try out min(p^num_projections_exponent, max_num_projections) random projections for choosing a split, where p is the number of numerical features. Therefore, increasing `num_projections_exponent` and possibly `max_num_projections` may improve model quality, but will also significantly increase training time. Note that the complexity of (classic) Random Forests roughly corresponds to `num_projections_exponent=0.5`, since they consider sqrt(num_features) features for a split. The complexity of (classic) GBDT roughly corresponds to `num_projections_exponent=1`, since it considers all features for a split. The paper "Sparse Projection Oblique Random Forests" (Tomita et al, 2020) recommends values in [1/4, 2].
Maximum number of projections (applied after the "num_projections_exponent"). Oblique splits try out min(p^num_projections_exponent, max_num_projections) random projections for choosing a split, where p is the number of numerical features. Increasing "max_num_projections" increases the training time but not the inference time. In late stage model development, if every bit of accuracy is important, increase this value. The paper "Sparse Projection Oblique Random Forests" (Tomita et al, 2020) does not define this hyperparameter.
Minimum number of projections. In a dataset with very few numerical features, increasing this parameter might improve model quality. The paper "Sparse Projection Oblique Random Forests" (Tomita et al, 2020) does not define this hyperparameter.
Density of the projections as an exponent of the number of features. Independently for each projection, each feature has a probability "projection_density_factor / num_features" to be considered in the projection. The paper "Sparse Projection Oblique Random Forests" (Tomita et al, 2020) calls this parameter `lambda` and recommends values in [1, 5]. Increasing this value increases training and inference time (on average). This value is best tuned for each dataset.
Deprecated, use `weights` instead. If true, the weight will be sampled in {-1,1} (default in "Sparse Projection Oblique Random Forests" (Tomita et al, 2020)). If false, the weight will be sampled in [-1,1].
Weights to apply to the projections. Continuous weights generally give better performance.
Normalization applied on the features, before applying the sparse oblique projections.
Maximum number of features in a projection. Set to -1 or not provided for no maximum. Use only if a hard maximum on the number of variables is needed, otherwise prefer `projection_density_factor` for controlling the number of features per projection.
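To make the projection hyper-parameters above concrete, here is a minimal Python sketch, assuming "max_num_projections" acts as an upper cap, "min_num_projections" as a lower floor, and continuous weights in [-1, 1] are used; the helper names are illustrative only, not the YDF implementation:

```python
import random

def num_projections_to_test(num_features: int,
                            num_projections_exponent: float,
                            max_num_projections: int,
                            min_num_projections: int) -> int:
  """Number of random projections tested at a node."""
  n = int(round(num_features ** num_projections_exponent))
  return max(min_num_projections, min(max_num_projections, n))

def sample_sparse_projection(num_features: int,
                             projection_density_factor: float) -> dict:
  """Returns a mapping from feature index to projection weight."""
  density = projection_density_factor / num_features
  projection = {}
  for feature in range(num_features):
    if random.random() < density:  # Each feature is selected independently.
      projection[feature] = random.uniform(-1.0, 1.0)  # Continuous weights.
  return projection
```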
Weights sampled in {-1, 1} (default in "Sparse Projection Oblique Random Forests" (Tomita et al, 2020)).
Used in:
(message has no fields)
Weights sampled in [-1, 1]. Consistently gives better quality models than binary weights.
Used in:
(message has no fields)
Weights sampled uniformly in the integer range [minimum, maximum].
Used in:
Used in:
No normalization. Logic used in the "Sparse Projection Oblique Random Forests" (Tomita et al, 2020).
Normalize the feature by the estimated standard deviation on the entire train dataset. Also known as Z-Score normalization.
Normalize the feature by the range (i.e. max-min) estimated on the entire train dataset.
Weights sampled uniformly in the exponent space, i.e. the weights are of the form $s * 2^i$ with the integer exponent $i$ sampled uniformly in [min_exponent, max_exponent] and the sign $s$ sampled uniformly in {-1, 1}.
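A minimal sketch of this sampling rule; the function name is illustrative only:

```python
import random

def sample_power_of_two_weight(min_exponent: int, max_exponent: int) -> float:
  """Returns s * 2^i with i uniform in [min_exponent, max_exponent], s in {-1, 1}."""
  i = random.randint(min_exponent, max_exponent)
  s = random.choice((-1, 1))
  return s * 2.0 ** i
```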
Used in:
Used in:
Minimum number of examples per treatment in a node. Only used for uplift models.
How to order buckets having no values for one of the treatments. This parameter is used exclusively for the bucket sorting during the generation of some of the candidate splits, for example, for categorical features with the CART splitter.
Used in:
Uses the treatment conditional mean outcome of the parent node.
Uses the mean outcome of the parent node.
Splitter score, i.e. the score optimized by the splitters. Changing the splitter score will impact the trained model. The following scores are introduced in "Decision trees for uplift modeling with single and multiple treatments", Rzepakowski et al. Notation: p: probability of the positive outcome (categorical outcome) or average value of the outcome (numerical outcome) in the treatment group. q: probability / average value in the control group.
Used in:
Score: - p log (p/q) Categorical outcome only.
Score: (p-q)^2 Categorical outcome only. TODO: Add numerical outcome.
Score: (p-q)^2/q Categorical outcome only.
Conservative estimate (lower bound) of the euclidean distance.
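For a binary categorical outcome, the score expressions listed above can be transcribed directly into Python; this sketch mirrors the expressions exactly as written here (the referenced paper gives the full definitions) and is illustrative only:

```python
import math

def kullback_leibler_score(p: float, q: float) -> float:
  """Score "- p log (p/q)" as listed above."""
  return -p * math.log(p / q)

def euclidean_distance_score(p: float, q: float) -> float:
  """Score "(p-q)^2" as listed above."""
  return (p - q) ** 2

def chi_squared_score(p: float, q: float) -> float:
  """Score "(p-q)^2/q" as listed above."""
  return (p - q) ** 2 / q
```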
Used in:
Probability for a categorical value to be a candidate for the positive set in the extraction of a categorical set split. The sampling is applied once per node (i.e. not at every step of the greedy optimization).
Maximum number of items (prior to the sampling). If more items are available, the least frequent items are ignored. Changing this value is similar to changing "max_vocab_count" before loading the dataset, with the following exception: with "max_vocab_count", all the remaining items are grouped into a special Out-of-vocabulary item; with "max_num_items", this is not the case.
Minimum number of occurrences of an item to be considered.
Maximum number of items selected in the condition. Note: max_selected_items=1 is equivalent to one-hot encoding.
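An illustrative sketch of the candidate-item pre-processing described by these fields; the parameter names approximate the fields above and this is not the YDF implementation:

```python
import random

def candidate_items(item_counts: dict,
                    max_num_items: int,
                    min_item_frequency: int,
                    sampling: float) -> list:
  """Returns the items that may enter the positive set of a categorical-set split."""
  # Keep the most frequent items first; drop the rest if above max_num_items.
  items = sorted(item_counts, key=item_counts.get, reverse=True)
  if max_num_items >= 0:
    items = items[:max_num_items]
  # Drop items occurring fewer than min_item_frequency times.
  items = [item for item in items if item_counts[item] >= min_item_frequency]
  # Sample the candidates once per node.
  return [item for item in items if random.random() < sampling]
```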
Specifies the global best growing strategy.
Used in:
Maximum number of nodes in the tree. Set to "-1" to disable this limit.
Specifies the local best growing strategy. No extra configuration needed.
Used in:
(message has no fields)
Statistics about the label values used to operate a splitter algorithm.
Used in:
Used in:
Used in:
Used in:
Node in a decision tree (without the information about the children).
Used in:
Next ID: 7 Label value. Might be unspecified for non-leaf nodes.
Branching condition to the children. If not specified, this node is a leaf.
Number of examples (non-weighted) that reached this node during training. Warning: Contrary to what the name suggests, this is not the count of examples branched to the positive child.
Output of a node in an anomaly detection tree.
Next ID: 2
Used in:
Number of examples that reached this node.
Output of a node in a classification tree.
Used in:
Next ID: 3 Label value.
Distribution of label values. The most frequent value is "top_value".
Binary condition attached to a non-leaf node.
Used in:
Next ID: 9 Evaluation value of this condition in case of a NA (i.e. missing) value.
Attribute on which the condition applies.
If the condition is not set, this node is a leaf node.
Number of examples (non-weighted) that reached this node during training.
Number of examples (weighted) that reached this node during training.
Score attached to the split.
Number of positive examples (non-weighted) that reached this node during training.
Number of positive examples (weighted) that reached this node during training.
Output of a node in a regression tree.
Used in:
Next ID: 6 Label value.
Distribution of label values. The mean is "top_value".
Statistics of hessian splits.
Output of a node in an uplift tree with either binary categorical or numerical outcome. The fields have the same definition as the fields in the message "UpliftCategoricalLabelDistribution".
Used in:
Weighted number of examples.
Currently, the code only supports binary categorical or regressive outcomes.
Number of examples for each outcome (major) and each treatment (minor). In the case of a categorical outcome, the zero outcome is excluded. For example, in the case of a binary treatment, "sum_weights_per_treatment_and_outcome" contains one value for each treatment. In the case of a numerical outcome, "sum_weights_per_treatment_and_outcome" is the weighted sum of the outcomes. Currently, the code only supports binary categorical or regressive outcomes.
treatment_effect[i] is the effect of the "i+1"-th treatment (categorical value i+2) compared to the control group (0-th treatment; categorical value = 1). The treatment out-of-vocabulary item (value = 0) is not taken into account.
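A tiny sketch of the index mapping described above; the helper is illustrative only:

```python
def treatment_effect_index(treatment_categorical_value: int) -> int:
  """Maps a treatment's categorical value to its index in "treatment_effect".

  Value 0 is the out-of-vocabulary item (ignored), value 1 is the control
  group, and value i+2 corresponds to treatment_effect[i].
  """
  assert treatment_categorical_value >= 2
  return treatment_categorical_value - 2
```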
Number of examples in each treatment. Not weighted.
How to find numerical splits.
Used in:
Number of candidate thresholds. Ignored for EXACT. Defaults: HISTOGRAM_RANDOM => 1, HISTOGRAM_EQUAL_WIDTH => 255.
Used in:
Original/CART splitting. Slow but gives good (small, high accuracy) models. Equivalent to XGBoost Exact.
Select candidate splits randomly between the min and max values. Similar to the ExtraTrees algorithm: https://link.springer.com/content/pdf/10.1007%2Fs10994-006-6226-1.pdf
Select the candidate splits uniformly (in the feature space) between the min and max value.
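For illustration, a minimal Python sketch of the two histogram strategies above, using the default numbers of candidates listed earlier (1 for HISTOGRAM_RANDOM, 255 for HISTOGRAM_EQUAL_WIDTH); the helper names are illustrative and this is not the YDF implementation:

```python
import random

def histogram_random_thresholds(min_value: float, max_value: float,
                                num_candidates: int = 1) -> list:
  """Candidate thresholds drawn uniformly at random in [min_value, max_value]."""
  return [random.uniform(min_value, max_value) for _ in range(num_candidates)]

def histogram_equal_width_thresholds(min_value: float, max_value: float,
                                     num_candidates: int = 255) -> list:
  """Candidate thresholds evenly spaced in [min_value, max_value]."""
  step = (max_value - min_value) / (num_candidates + 1)
  return [min_value + step * (i + 1) for i in range(num_candidates)]
```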