How to handle categorical input features.
Used in:
CART algorithm (default). Finds the best categorical split of the form "value \in mask". The solution is exact for binary classification, regression and ranking. It is approximated for multi-class classification. This is a good first algorithm to use. In case of overfitting (very small dataset, large dictionary), the "random" algorithm is a good alternative.
One-hot encoding. Finds the optimal categorical split of the form "attribute == param". This method is similar to (but more efficient than) converting each possible categorical value into a boolean feature. This method is available for comparison purposes and generally performs worse than the other alternatives.
Best split among a set of random candidates. Finds a categorical split of the form "value \in mask" using a random search. This solution can be seen as an approximation of the CART algorithm. This method is a strong alternative to CART. This algorithm is inspired by section "5.1 Categorical Variables" of "Random Forests", 2001 (https://www.stat.berkeley.edu/users/breiman/randomforest2001.pdf).
If the dictionary size of the attribute is greater than or equal to "arity_limit_for_random", the "random" algorithm is used (instead of the algorithm specified in "algorithm").
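As a rough illustration of this fallback rule, here is a minimal Python sketch; the helper name and argument names are illustrative only, not part of YDF:

```python
def effective_categorical_algorithm(configured_algorithm: str,
                                    vocab_size: int,
                                    arity_limit_for_random: int) -> str:
  """Returns the algorithm effectively used for a categorical attribute."""
  if vocab_size >= arity_limit_for_random:
    return "RANDOM"  # Large dictionaries fall back to the random search.
  return configured_algorithm  # e.g. "CART" or "ONE_HOT".
```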
Used in:
(message has no fields)
Used in:
Sampling ratio of the items tested. A value of 1 means that all the items will be tested.
Used in:
Controls the number of random splits to evaluate. The effective number of splits is "min(max_num_trials, num_trial_offset + {vocab size}^num_trial_exponent)", with "vocab size" the number of unique categorical values in the node.
Maximum number of candidates.
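For illustration, a small Python sketch of the formula above; the argument names mirror the fields described here, and this is not the YDF implementation:

```python
def effective_num_random_trials(vocab_size: int,
                                num_trial_exponent: float,
                                num_trial_offset: float,
                                max_num_trials: int) -> int:
  """Effective number of random categorical splits evaluated in a node."""
  return int(min(max_num_trials,
                 num_trial_offset + vocab_size ** num_trial_exponent))

# Example: 50 unique values, offset 100, exponent 2, capped at 1000 trials.
print(effective_num_random_trials(50, 2.0, 100.0, 1000))  # -> 1000
```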
The sub-messages of "ConditionParams" are the different types of condition that can be attached to a node.
Used in:
Type of condition.
Make sure to update "kNumConditionTypes" in "decision_tree.h" accordingly.
Condition of the type: (value \intersect elements) != empty_set where elements is stored as a bitmap over the possible values.
Used in:
Next ID: 2 [Required]
Condition of the type: (value \intersect elements) != empty_set.
Used in:
Next ID: 2
Condition of the type: indexed_value >= indexed_threshold.
Used in:
[Required]
Condition of the type: value >= threshold.
Used in:
[Required]
Next ID: 6 Condition of the type: value == NA (i.e. missing).
Used in:
(message has no fields)
Used in:
Used in:
Static / direct value of the anchor.
Used in:
Condition of the type \exists a \in Obs; |a - anchor|^2 <= threshold2, where |.|^2 is the squared Euclidean distance. Note: The condition is not on the Euclidean distance so as to avoid the rounding errors of a square root. Note: Notice the direction of the inequality (<=) compared to other ydf conditions (>=).
Threshold to apply on the squared distance.
Used in:
Condition of the type \exists a \in Obs; <a|anchor> >= threshold.
Used in:
True iff \sum_i examples[attribute_i] * weight_i >= threshold. The "attribute" field in "NodeCondition" should be one of the "attributes" in this message. If any of the attributes is missing, the condition evaluates to missing and returns "na_value".
If set, "na_replacements" defines the replacement values for missing attributes. For example, if attributes = [3, 5], attribute 3 is available and attribute 5 is missing, the condition is evaluated with attribute 3's value and the replacement value na_replacements[1]. If not set, and any of the attributes is missing, the condition evaluates to "na_value".
Condition of the type: value == True.
Used in:
(message has no fields)
Training configuration for the Random Forest algorithm.
Next ID: 26
Used in:
Maximum depth of the tree. max_depth=1 means that all trees will be roots. If max_depth=-1, the depth of the tree is not limited.
Minimum number of examples in a node.
Number of unique valid attributes tested for each node. An attribute is valid if it has at least one valid split. If "num_candidate_attributes=0", the value is set to the classical default value for Random Forest: "sqrt(number of input attributes)" in case of classification and "number of input attributes / 3" in case of regression. If "num_candidate_attributes=-1", all the attributes are tested.
If set, replaces "num_candidate_attributes" with "number_of_input_features x num_candidate_attributes_ratio". The possible values are in ]0, 1], as well as -1. If not set or equal to -1, "num_candidate_attributes" is used.
Whether to check the "min_examples" constraint in the split search (i.e. splits leading to one child having less than "min_examples" examples are considered invalid) or before the split search (i.e. a node can be derived only if it contains more than "min_examples" examples). If false, there can be nodes with less than "min_examples" training examples.
Whether to store the full distribution (e.g. the distribution of all the possible label values in case of classification) or only the top label (e.g. the most representative class). This information is used for model interpretation as well as in case of "winner_take_all_inference=false". In the worst case, this information can account for a significant part of the model size.
INFO: Use "pure_serving_model" instead of "keep_non_leaf_label_distribution". "pure_serving_model" is more general, works in more situations, and removes more unused data from the model. Whether to keep the node value (i.e. the distribution of the labels of the training examples) of non-leaf nodes. This information is not used during serving, however it can be used for model interpretation as well as hyper parameter tuning. In the worst case, this can account for half of the model size. keep_non_leaf_label_distribution=false is not compatible with monotonic constraints.
If true, the tree training evaluates conditions of the type "X is NA" i.e. "X is missing".
How to grow the tree.
[Default strategy] Each node is split independently of the other nodes. In other words, as long as a node satisfies the split constraints (e.g. maximum depth, minimum number of observations), the node will be split. This is the "classical" way to grow decision trees.
The node with the best loss reduction among all the nodes of the tree is selected for splitting. This method is also called "best first" or "leaf-wise growth". See "Best-first decision tree learning", Shi (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.2862&rep=rep1&type=pdf) and "Additive logistic regression : A statistical view of boosting", Friedman et al. (https://projecteuclid.org/euclid.aos/1016218223) for more details.
Options related to the learning of categorical splits.
Generate an error (if true) or a warning (if false) when the statistics exported by splitters don't match the observed statistics. This field is used in unit tests.
What structure of split to consider.
Axis aligned splits (i.e. one condition at a time). This is the "classical" way to train a tree. Default value.
Sparse oblique splits (i.e. splits on a small number of features) from "Sparse Projection Oblique Random Forests", Tomita et al., 2020. These splits are tested iff "sparse_oblique_split" is set.
Oblique splits from "Classification Based on Multivariate Contrast Patterns" by Canete-Sifuentes et al.
Uplift specific hyper-parameters.
Options to learn numerical vector sequence conditions.
Internal knobs of the algorithm that don't impact the final model.
See "split_axis".
Used in:
(message has no fields)
Used in:
Ratio of examples used to set the leaf values.
If true, a new random separation is generated for each tree. If false, the same separation is used for all the trees (for example, in a Gradient Boosted Trees model containing multiple trees).
Used in:
If set, ensures that the effective strategy is "ensure_effective_strategy". "ensure_effective_strategy" is only used in unit tests when the sorting strategy is not manually set, i.e. sorting_strategy = AUTO.
If false, the score of a hessian split is: score \approx \sum_{children} sum_grad^2 / sum_hessian. If true, the score of a hessian split is: score \approx (\sum_{children} sum_grad^2 / sum_hessian) - sum_grad_parent^2 / sum_hessian_parent. This flag has two effects: (1) The absolute value of the score is different (e.g. when looking at the variable importances). (2) When growing the tree with global optimization, the structure of the tree might differ (however, there is no impact on the structure when using the divide-and-conquer strategy). YDF implicitly used hessian_split_score_subtract_parent=false. XGBoost uses hessian_split_score_subtract_parent=true, but the paper is explicit that this is just one possible solution. Both versions make sense (and produce similar results). Another possible version would be to subtract the parent gradient before the square. An experiment was conducted on 68 datasets, with 10-fold CV repeated 3 times, to evaluate the effect of this flag. Both methods produce close models. However, in terms of average accuracy, average AUC and average rank, the "false" method is better than the "true" one by a small but visible margin.
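The two scoring variants above can be written as a small Python sketch; this is illustrative only, not the YDF splitter code:

```python
def hessian_split_score(children, parent, subtract_parent: bool) -> float:
  """children: list of (sum_grad, sum_hessian) pairs; parent: (sum_grad, sum_hessian)."""
  score = sum(g * g / h for g, h in children)
  if subtract_parent:  # hessian_split_score_subtract_parent=true (XGBoost-like)
    parent_grad, parent_hessian = parent
    score -= parent_grad * parent_grad / parent_hessian
  return score
```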
If true, partially checks the monotonic constraints of trees after training, that is, checks that the value of a positive node is greater than the value of a negative node (in the case of an increasing monotonic constraint). This option is used by unit tests. If false and a monotonic constraint is not satisfied, the monotonic constraint is manually enforced. The current checking implementation might flag as non-monotonic trees that are in fact monotonic (i.e. false positives). However, with the current algorithm used to create monotonic constraints, this checking algorithm cannot create false positives.
If true, the splitter returns an InvalidArgumentError. This field can be used to check the propagation of errors to the user.
How sorted values (non-discretized numerical values) are computed.
Used in:
Values are sorted within each node.
Values are pre-sorted into an index to speed up training. The index is automatically ignored when using it would be slower than in-node sorting or when the algorithm does not benefit from pre-sorting. This method can significantly increase the amount of memory required for training.
Always use the presorted index, even if the result would be slower. For testing only.
Automatically selects the best method (the quickest method that does not consume excessive RAM).
Used in:
Maximum number of attributes in the projection. Increasing this value increases the training time. Decreasing this value acts as a regularization. The value should be in [2, num_numerical_features]. If the value is above num_numerical_features, the value is capped to num_numerical_features. The value 1 is allowed but results in ordinary (non-oblique) splits.
If true, applies the attribute sampling in "num_candidate_attributes" and "num_candidate_attributes_ratio". If false, all attributes are tested.
Method used to handle missing attribute values.
Used in:
Missing attribute values are imputed with the mean (in the case of a numerical attribute) or the most-frequent item (in the case of a categorical attribute) computed on the entire dataset (i.e. the information contained in the data spec).
Missing attribute values are imputed with the mean (numerical attribute) or most-frequent-item (in the case of categorical attribute) evaluated on the training examples in the current node.
Missing attribute values are imputed from randomly sampled values from the training examples in the current node. This method was proposed by Ishwaran et al. in "Random Survival Forests" (https://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908043).
Used in:
Number of training examples to use when evaluating the score of an anchor in the anchor selection stage. Note that all the training examples are used when evaluating the score of an anchor-based split.
Number of anchors generated by sampling training example observations.
See "split_axis".
Used in:
Controls the number of random projections to test at each node. Increasing this value very likely improves the quality of the model, drastically increases the training time, and does not impact the inference time. Oblique splits try out min(p^num_projections_exponent, max_num_projections) random projections for choosing a split, where p is the number of numerical features. Therefore, increasing `num_projections_exponent` and possibly `max_num_projections` may improve model quality, but will also significantly increase training time. Note that the complexity of (classic) Random Forests roughly corresponds to `num_projections_exponent=0.5`, since they consider sqrt(num_features) features for a split. The complexity of (classic) GBDT roughly corresponds to `num_projections_exponent=1`, since it considers all features for a split. The paper "Sparse Projection Oblique Random Forests" (Tomita et al, 2020) recommends values in [1/4, 2].
Maximum number of projections (applied after the "num_projections_exponent"). Oblique splits try out min(p^num_projections_exponent, max_num_projections) random projections for choosing a split, where p is the number of numerical features. Increasing "max_num_projections" increases the training time but not the inference time. In late stage model development, if every bit of accuracy is important, increase this value. The paper "Sparse Projection Oblique Random Forests" (Tomita et al, 2020) does not define this hyperparameter.
Minimum number of projections. In a dataset with very few numerical features, increasing this parameter might improve model quality. The paper "Sparse Projection Oblique Random Forests" (Tomita et al, 2020) does not define this hyperparameter.
Density of the projections as an exponent of the number of features. Independently for each projection, each feature has a probability "projection_density_factor / num_features" to be considered in the projection. The paper "Sparse Projection Oblique Random Forests" (Tomita et al, 2020) calls this parameter `lambda` and recommends values in [1, 5]. Increasing this value increases training and inference time (on average). This value is best tuned for each dataset.
Deprecated, use `weights` instead. If true, the weight will be sampled in {-1,1} (default in "Sparse Projection Oblique Random Forests" (Tomita et al, 2020)). If false, the weight will be sampled in [-1,1].
Weights to apply to the projections. Continuous weights generally give better performance.
Normalization applied on the features, before applying the sparse oblique projections.
Maximum number of features in a projection. Set to -1 or not provided for no maximum. Use only if a hard maximum on the number of variables is needed, otherwise prefer `projection_density_factor` for controlling the number of features per projection.
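To make the projection hyper-parameters above concrete, here is a minimal Python sketch, assuming "max_num_projections" acts as an upper cap, "min_num_projections" as a lower floor, and continuous weights in [-1, 1] are used; the helper names are illustrative only, not the YDF implementation:

```python
import random

def num_projections_to_test(num_features: int,
                            num_projections_exponent: float,
                            max_num_projections: int,
                            min_num_projections: int) -> int:
  """Number of random projections tested at a node."""
  n = int(round(num_features ** num_projections_exponent))
  return max(min_num_projections, min(max_num_projections, n))

def sample_sparse_projection(num_features: int,
                             projection_density_factor: float) -> dict:
  """Returns a mapping from feature index to projection weight."""
  density = projection_density_factor / num_features
  projection = {}
  for feature in range(num_features):
    if random.random() < density:  # Each feature is selected independently.
      projection[feature] = random.uniform(-1.0, 1.0)  # Continuous weights.
  return projection
```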
Weights sampled in {-1, 1} (default in "Sparse Projection Oblique Random Forests" (Tomita et al, 2020)).
Used in:
(message has no fields)
Weights sampled in [-1, 1]. Consistently gives better quality models than binary weights.
Used in:
(message has no fields)
Weights sampled uniformly in the integer range [minimum, maximum].
Used in:
Used in:
No normalization. Logic used in the "Sparse Projection Oblique Random Forests" (Tomita et al, 2020).
Normalize the feature by the estimated standard deviation on the entire train dataset. Also known as Z-Score normalization.
Normalize the feature by the range (i.e. max-min) estimated on the entire train dataset.
Weights sampled uniformly in the exponent space, i.e. the weights are of the form $s * 2^i$ with the integer exponent $i$ sampled uniformly in [min_exponent, max_exponent] and the sign $s$ sampled uniformly in {-1, 1}.
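A minimal sketch of this sampling rule; the function name is illustrative only:

```python
import random

def sample_power_of_two_weight(min_exponent: int, max_exponent: int) -> float:
  """Returns s * 2^i with i uniform in [min_exponent, max_exponent], s in {-1, 1}."""
  i = random.randint(min_exponent, max_exponent)
  s = random.choice((-1, 1))
  return s * 2.0 ** i
```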
Used in:
Used in:
Minimum number of examples per treatment in a node. Only used for uplift models.
How to order buckets having no values for one of the treatments. This parameter is used exclusively for the bucket sorting during the generation of some of the candidate splits, for example, for categorical features with the CART splitter.
Used in:
Uses the treatment conditional mean outcome of the parent node.
Uses the mean outcome of the parent node.
Splitter score, i.e. the score optimized by the splitters. Changing the splitter score will impact the trained model. The following scores are introduced in "Decision trees for uplift modeling with single and multiple treatments", Rzepakowski et al. Notation: p: probability of the positive outcome (categorical outcome) or average value of the outcome (numerical outcome) in the treatment group. q: probability / average value in the control group.
Used in:
Score: - p log (p/q) Categorical outcome only.
Score: (p-q)^2 Categorical outcome only. TODO: Add numerical outcome.
Score: (p-q)^2/q Categorical outcome only.
Conservative estimate (lower bound) of the euclidean distance.
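For a binary categorical outcome, the score expressions listed above can be transcribed directly into Python; this sketch mirrors the expressions exactly as written here (the referenced paper gives the full definitions) and is illustrative only:

```python
import math

def kullback_leibler_score(p: float, q: float) -> float:
  """Score "- p log (p/q)" as listed above."""
  return -p * math.log(p / q)

def euclidean_distance_score(p: float, q: float) -> float:
  """Score "(p-q)^2" as listed above."""
  return (p - q) ** 2

def chi_squared_score(p: float, q: float) -> float:
  """Score "(p-q)^2/q" as listed above."""
  return (p - q) ** 2 / q
```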
Used in:
Probability for a categorical value to be a candidate for the positive set in the extraction of a categorical set split. The sampling is applied once per node (i.e. not at every step of the greedy optimization).
Maximum number of items (prior to the sampling). If more items are available, the least frequent items are ignored. Changing this value is similar to changing "max_vocab_count" before loading the dataset, with the following exception: with "max_vocab_count", all the remaining items are grouped into a special Out-of-vocabulary item; with "max_num_items", this is not the case.
Minimum number of occurrences of an item to be considered.
Maximum number of items selected in the condition. Note: max_selected_items=1 is equivalent to one-hot encoding.
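An illustrative sketch of the candidate-item pre-processing described by these fields; the parameter names approximate the fields above and this is not the YDF implementation:

```python
import random

def candidate_items(item_counts: dict,
                    max_num_items: int,
                    min_item_frequency: int,
                    sampling: float) -> list:
  """Returns the items that may enter the positive set of a categorical-set split."""
  # Keep the most frequent items first; drop the rest if above max_num_items.
  items = sorted(item_counts, key=item_counts.get, reverse=True)
  if max_num_items >= 0:
    items = items[:max_num_items]
  # Drop items occurring fewer than min_item_frequency times.
  items = [item for item in items if item_counts[item] >= min_item_frequency]
  # Sample the candidates once per node.
  return [item for item in items if random.random() < sampling]
```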
Specifies the global best growing strategy.
Used in:
Maximum number of nodes in the tree. Set to "-1" to disable this limit.
Specifies the local best growing strategy. No extra configuration needed.
Used in:
(message has no fields)
Statistics about the label values used to operate a splitter algorithm.
Used in:
Used in:
Used in:
Used in:
Node in a decision tree (without the information about the children).
Used in:
Next ID: 7 Label value. Might be unspecified for non-leaf nodes.
Branching condition to the children. If not specified, this node is a leaf.
Number of examples (non-weighted) that reached this node during training. Warning: Contrary to what the name suggests, this is not the count of examples branched to the positive child.
Output of a node in an anomaly detection tree.
Next ID: 2
Used in:
Number of examples that reached this node.
Output of a node in a classification tree.
Used in:
Next ID: 3 Label value.
Distribution of label values. The most frequent value is "top_value".
Binary condition attached to a non-leaf node.
Used in:
Next ID: 9 Evaluation value of this condition in case of a NA (i.e. missing) value.
Attribute on which the condition applies.
If the condition is not set, this node is a leaf node.
Number of examples (non-weighted) that reached this node during training.
Number of examples (weighted) that reached this node during training.
Score attached to the split.
Number of positive examples (non-weighted) that reached this node during training.
Number of positive examples (weighted) that reached this node during training.
Output of a node in a regression tree.
Used in:
Next ID: 6 Label value.
Distribution of label values. The mean is "top_value".
Statistics of hessian splits.
Output of a node in an uplift tree with either binary categorical or numerical outcome. The fields have the same definition as the fields in the message "UpliftCategoricalLabelDistribution".
Used in:
Weighted number of examples.
Currently, the code only supports binary categorical or regressive outcomes.
Number of examples for each outcome (major) and each treatment (minor). In the case of a categorical outcome, the zero outcome is excluded. For example, in the case of a binary treatment, "sum_weights_per_treatment_and_outcome" contains one value for each treatment. In the case of a numerical outcome, "sum_weights_per_treatment_and_outcome" is the weighted sum of the outcomes. Currently, the code only supports binary categorical or regressive outcomes.
treatment_effect[i] is the effect of the "i+1"-th treatment (categorical value i+2) compared to the control group (0-th treatment; categorical value = 1). The treatment out-of-vocabulary item (value = 0) is not taken into account.
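A tiny sketch of the index mapping described above; the helper is illustrative only:

```python
def treatment_effect_index(treatment_categorical_value: int) -> int:
  """Maps a treatment's categorical value to its index in "treatment_effect".

  Value 0 is the out-of-vocabulary item (ignored), value 1 is the control
  group, and value i+2 corresponds to treatment_effect[i].
  """
  assert treatment_categorical_value >= 2
  return treatment_categorical_value - 2
```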
Number of examples in each treatment. Not weighted.
How to find numerical splits.
Used in:
Number of candidate thresholds. Ignored for EXACT. Defaults: HISTOGRAM_RANDOM => 1, HISTOGRAM_EQUAL_WIDTH => 255.
Used in:
Original/CART splitting. Slow but gives good (small, high accuracy) models. Equivalent to XGBoost Exact.
Select candidate splits randomly between the min and max values. Similar to the ExtraTrees algorithm: https://link.springer.com/content/pdf/10.1007%2Fs10994-006-6226-1.pdf
Select the candidate splits uniformly (in the feature space) between the min and max value.
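For illustration, a minimal Python sketch of the two histogram strategies above, using the default numbers of candidates listed earlier (1 for HISTOGRAM_RANDOM, 255 for HISTOGRAM_EQUAL_WIDTH); the helper names are illustrative and this is not the YDF implementation:

```python
import random

def histogram_random_thresholds(min_value: float, max_value: float,
                                num_candidates: int = 1) -> list:
  """Candidate thresholds drawn uniformly at random in [min_value, max_value]."""
  return [random.uniform(min_value, max_value) for _ in range(num_candidates)]

def histogram_equal_width_thresholds(min_value: float, max_value: float,
                                     num_candidates: int = 255) -> list:
  """Candidate thresholds evenly spaced in [min_value, max_value]."""
  step = (max_value - min_value) / (num_candidates + 1)
  return [min_value + step * (i + 1) for i in range(num_candidates)]
```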