Contains the same information as a model::AbstractModel (without the data_spec field).
Used in:
Name of the model. Should match one of the registered models in the :model_library.
Task solved by the model e.g. classification, regression.
Index of the label column in the dataspec.
Training example weights.
List of indices (in the dataspec) of the model input features.
Index of the "grouping" attribute in the dataspec for ranking problems e.g. the query in a <query,document> ranking problem.
Pre-computed variable importances (VI). The VIs of the model are composed of the pre-computed VIs (this field) and the "model specific VIs" (i.e. variable importance computed on the fly based on the model's structure).
If true, the output of a task=CLASSIFICATION model is a probability and can be used accordingly (e.g. averaged, clamped to [0,1]). If false, the output of the task=CLASSIFICATION model might not be a probability.
Index of the "treatment" attribute in the dataspec for uplift problems.
Logs of the automated hyper-parameter tuning of the model.
Logs of the automated feature selection of the model.
Indicates if a model is pure for serving, i.e. the model was stripped of all information not required for serving.
Specification of the computing resources used to perform an action (e.g. train a model, run a cross-validation, generate predictions). The deployment configuration does not impact the results (e.g. the learned model). If not specified, most consumers will assume local computation with multiple threads.
Next ID: 9
Used in:
Path to a temporary directory available to the training algorithm. Currently, cache_path is only used (and required) by the distributed algorithms or if "try_resume_training=True" (for the snapshots). In case of distributed training, the "cache_path" should be accessible to the manager and the workers (unless specified otherwise) -- so a local machine/memory partition won't work.
Number of threads.
If true, try to resume an interrupted training using snapshots stored in the "cache_path". Not supported by all learning algorithms. Resuming training after changing the hyper-parameters might fail.
Indicative number of seconds in between snapshots when "try_resume_training=True". Might be ignored by some algorithms.
Number of threads to use for IO operations, e.g. reading a dataset from disk. Increasing this value can speed up IO operations when they are either latency- or CPU-bound.
Maximum number of snapshots to keep.
Use the GPU for algorithms that support it, if a GPU is available and if YDF is compiled with GPU support.
Computation distribution engine.
Local execution.
Distribution using the Distribute interface. Note that the selected distribution strategy implementation (selected in "distribute") needs to be linked with the binary if you are using the C++ API.
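The execution options above can be sketched as a single deployment configuration in protobuf text format. The field names (cache_path, num_threads, try_resume_training, local) follow the descriptions in this section but should be treated as illustrative:

```proto
# Sketch: local training with 16 threads and resumable snapshots.
cache_path: "/tmp/ydf_cache"  # required for snapshots and distributed training
num_threads: 16
try_resume_training: true     # resume from snapshots stored in cache_path
local {}                      # local execution engine (the default)
```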
Used in:
(message has no fields)
Used in:
Logs of a feature selection algorithm.
Used in:
Definition of the type, possible values and default values of the generic hyper parameters of a learner. Also contains some documentation (free text + links).
Individual fields / hyper-parameters. Also contains the per-fields documentation.
Documentation for the entire learner.
Conditional existence of a parameter. A parameter exists iff the parameter "control_field" satisfies "constraint".
Used in:
Name of the control parameter.
Constraint on the parent.
One of the following values.
Used in:
Documentation about the entire learner.
Used in:
Free text description of the learning algorithm.
Used in:
If set, this parameter exists conditionally on other parameter values.
If set, this parameter is mutually exclusive with other parameters.
Categorical hyper parameter, i.e. the hyper parameter takes a value from a set of possible values.
Used in:
List of categorical values.
Used in:
(message has no fields)
Links to the documentation of the hyper-parameter.
Used in:
Path to the proto relative to YDF root directory.
Name of the proto field. If not specified, "name" is used instead.
Free text description of the parameter.
When a field is deprecated.
Integer hyper parameter.
Used in:
Used in:
List of parameters this parameter is mutually exclusive with. Any parameter in this list must have this parameter in its `other_parameters` list.
True if this parameter is the default parameter of a list of mutually exclusive parameters.
Real hyper parameter.
Used in:
Generic hyper parameters of a learner. Learner hyper parameters are normally provided through the "TrainingConfig" proto extended by each learner. The "Generic hyper parameters" (the following message) are a parallel solution for specifying the hyper parameters of a learner using a list of key-values. The "Generic hyper parameters" are designed for interfacing with hyper-parameter optimization algorithms, while the "TrainingConfig" proto is designed for direct user input. For this reason, the generic hyper parameters are not guaranteed to be as expressive as the "TrainingConfig". However, the default values of the "Generic hyper parameters" are guaranteed to be equivalent to the default values of the training config.
Used in:
Unique id of the parameters. Might be missing if the parameters are generated by a user, or by an AbstractOptimizer that does not require ids.
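As a sketch, a generic hyper-parameter set is simply a list of named values, each in one of the supported types (categorical, integer, real). The field and message names below are assumed for illustration:

```proto
# Hypothetical key-value hyper-parameters in protobuf text format.
fields { name: "num_trees" value { integer: 500 } }
fields { name: "shrinkage" value { real: 0.05 } }
fields { name: "growing_strategy" value { categorical: "BEST_FIRST_GLOBAL" } }
```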
Used in:
Hyper parameter name. Should match the "name" of the hyper parameter specification.
Used in:
Hyper parameter value. Should match the type defined in the hyper parameter specification.
Used in:
Set of hyper-parameter-sets aka. hyper-parameter search space.
Used in:
Used in:
If set, "weights" has the same number of elements as "possible_values", and "weights[i]" is the weight of this specific value for the optimizer. Different optimizers can use this weight differently. Random optimizer: weight of the field during random sampling. If not specified, all the hyper-parameter combinations have the same probability of being sampled, which means that a possible value with conditional children is more likely to be sampled.
Used in:
Name of the hyper parameter. Should match one of the generic hyper parameter of the model (use "GetGenericHyperParameterSpecification" for the list of generic hyper parameters).
Definition of the candidate values.
If this field has a parent field, then it is only activated if its parent's value is one of these.
List of child fields.
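Combining candidate values, parent activation values and child fields, a search space could be sketched as follows (all field names and nesting are illustrative, not the authoritative schema):

```proto
# Hypothetical search space: "max_depth" only exists when
# "growing_strategy" takes the value "LOCAL".
fields {
  name: "growing_strategy"
  discrete_candidates {
    possible_values { categorical: "LOCAL" }
    possible_values { categorical: "BEST_FIRST_GLOBAL" }
  }
  children {
    name: "max_depth"
    parent_discrete_values { possible_values { categorical: "LOCAL" } }
    discrete_candidates {
      possible_values { integer: 4 }
      possible_values { integer: 8 }
    }
  }
}
```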
Used in:
Optimization steps ordered chronologically by evaluation_time.
Domain of search for the hyper-parameters.
Registered key for the hyperparameter optimizer.
The selected hyperparameters and its score. Note: It is possible that the best hyperparameters are not part of the "steps".
Used in:
Time, in seconds, relative to the start of the hyper-parameter tuning, at which the hyper-parameter evaluation was consumed.
Tested hyperparameters.
Score (the higher, the better) of the hyperparameters. A NaN value indicates that the hyperparameters are unfeasible.
"Capabilities" of a learner. Describes the capabilities/constraints/properties of a learner (all called "capabilities"). Capabilities are non-restrictive, i.e. enabling a capability cannot restrict the domain of use of a learner/model (i.e. use "support_tpu" instead of "require_tpu"). Using a learner with non-available capabilities raises an error.
Does the learner support the "maximum_training_duration_seconds" parameter in the TrainingConfig.
The learner can resume training of the model from the "cache_path" given in the deployment configuration.
If true, the algorithm uses a validation dataset for training (e.g. for early stopping) and supports a validation dataset being passed to the training method (with the "valid_dataset" or "typed_valid_path" argument). If the learning algorithm has the "use_validation_dataset" capability and no validation dataset is given to the training function, the learning algorithm will extract a validation dataset from the training dataset.
If true, the algorithm supports training datasets in the "partial cache dataset" format.
If true, the algorithm supports training with a maximum model size (maximum_model_size_in_memory_in_bytes).
If true, the algorithm supports monotonic constraints over numerical features.
If true, the learner requires a label. If false, the learner does not require a label.
If true, the learner supports custom losses.
Information about the model.
Used in:
Owner of the model. Defaults to the user who ran the training code if available.
Unix Timestamp of the model training. Expressed in seconds.
Unique identifier of the model.
Framework used to create the model.
Used in:
Monotonic constraints between model's output and numerical input features.
Used in:
Regular expressions over the input features.
Used in:
Ensure the model output is monotonic increasing (non-strict) with the feature.
Ensure the model output is monotonic decreasing (non-strict) with the feature.
Used in:
If set, the attribute has a monotonic constraint. Note: monotonic_constraint.feature might not be set.
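For example, a pair of monotonic constraints could be written in a training configuration as follows (a sketch; the feature patterns are regular expressions over the input features, and the direction enum names are assumed):

```proto
# Hypothetical constraints: the model output must increase with "age"
# and decrease with any feature matching "num_complaints.*".
monotonic_constraints { feature: "^age$" direction: INCREASING }
monotonic_constraints { feature: "^num_complaints.*$" direction: DECREASING }
```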
Returns a list of hyper-parameter sets that outperforms the default hyper-parameters (either generally or in specific scenarios). Like default hyper-parameters, existing pre-defined hyper-parameters cannot change.
Name of the template. Should be unique for a given learning algorithm.
Version of the template.
Free text describing how this template was created.
Effective hyper-parameters.
Generic prediction (prediction over a single example). Those are usually the output of an ML model. Optionally, it may contain the ground truth (e.g. the label value). When the ground truth is present, such a "Prediction" proto can be used for evaluation (see "metric.h").
Used in:
Identifier of the example.
Used in:
Anomaly score between 0 (normal) and 1 (anomaly).
Used in:
Predicted class as indexed in the dataspec.
Predicted distribution over the possible classes. If specified, the following relation holds: "value == argmax_i(distribution[i])".
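For example, a 3-class classification prediction satisfying the relation above could look like this sketch (the layout of the distribution message is assumed):

```proto
# Hypothetical prediction: class index 2 with probability 0.7;
# value == argmax_i(distribution[i]) holds.
classification {
  value: 2
  distribution { counts: 0.1 counts: 0.2 counts: 0.7 sum: 1.0 }
}
```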
Used in:
Predicted relevance (the higher, the more likely to be selected).
Group of the predictions. Predictions with the same group are competing.
Group of the predictions. Can be a categorical or a hash value.
Used in:
Used in:
Predicted treatment effect. treatment_effect[i] is the effect of the "i+1"-th treatment (categorical value i+2) compared to the control group (0-th treatment; categorical value = 1). The treatment out-of-vocabulary item (value = 0) is not taken into account.
Applied treatment. The control group is treatment = 1. Other treatments are >1.
Outcome (with or without treatment).
Proto used to serialize / deserialize the model to / from string. See "SerializeModel" and "DeserializeModel". This message does not contain the entire model data.
Modeling task.
Used in:
In case of ranking, the label is expected to be between 0 and 4, and to have the NDCG semantic: 0: Completely unrelated. 4: Perfect match.
Predicts the incremental impact of a treatment on a categorical outcome. See https://en.wikipedia.org/wiki/Uplift_modelling.
Predicts the incremental impact of a treatment on a numerical outcome. See https://en.wikipedia.org/wiki/Uplift_modelling.
Predicts if an instance is similar to the majority of the training data or anomalous (a.k.a. an outlier). An anomaly detection prediction is a value between 0 and 1, where 0 indicates the most likely normal instance and 1 indicates the most likely anomalous instance.
Training configuration. Contains all the configuration for the training of a model e.g. label, input features, hyper-parameters.
Next ID: 13
Used in:
Identifier of the learner, e.g. "RANDOM_FOREST". The learner should be registered, i.e. injected as a dependency of the binary. The list of available learners is available with "AllRegisteredModels()" in "model_library.h".
List of regular expressions over the dataset columns defining the input features of the model. If empty, all the columns (with the exception of the label and cv_group) will be added as input features.
Label column.
Name of the column used to split the dataset for in-training cross-validation, i.e. all the records with the same "cv_group" value are in the same cross-validation fold. If not specified, examples are randomly assigned to train and test. This field is ignored by learners that do not run in-training cross-validation.
Task / problem solved by the model.
Weighting of the training examples. If not specified, the weight is assumed uniform.
Random seed for the training of the model. Learners are expected to be deterministic with respect to the random seed.
Column identifying the groups in a ranking task. For example, in a document/query ranking problem, the "ranking_group" will be the query. The ranking column can be either a HASH or a CATEGORICAL. HASH is recommended. If CATEGORICAL, ensure the dictionary is not pruned (i.e. minimum number of observations = 0 and maximum number of items = -1 => infinity).
Maximum duration of the training expressed in seconds. If the learner does not support constraining the training time, training fails immediately. Each learning algorithm is free to use this parameter as it sees fit. Enabling a maximum training duration makes the model training non-deterministic.
Limits the trained model by memory usage. Different algorithms can enforce this limit differently. Serialized or compiled models are generally much smaller. This limit can be fuzzy: the final model can be slightly larger.
Categorical column identifying the treatment group in an uplift task. For example, whether a patient received a treatment in a study about the impact of a medication. Only binary treatments are currently supported.
Metadata of the model. Unspecified fields are automatically set. For example, if "metadata.date" is not set, it will be automatically set to the training date.
Clears the model of any information that is not required for model serving. This includes debugging, model interpretation and other meta-data. The size of the serialized model can be significantly reduced (a 50% model size reduction is common). This parameter has no impact on the quality, serving speed or RAM usage of model serving.
Set of monotonic constraints between the model's input features and output.
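Putting the fields above together, a minimal training configuration might read as follows (a text-format sketch; field names follow the descriptions in this section and should be treated as illustrative):

```proto
# Hypothetical configuration for a classifier.
learner: "RANDOM_FOREST"   # must be a registered learner
task: CLASSIFICATION
label: "income"
features: "^age$"          # regular expressions over the dataset columns
features: "^education.*$"
random_seed: 1234
maximum_training_duration_seconds: 3600
pure_serving_model: true   # strip information not required for serving
```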
Resolution of column string names into column indices. The column indices are defined in a given dataspec, e.g. if dataspec.columns[5].name = "toto", then the column index of "toto" is 5.
Used in:
Next ID: 10
Input features of the model.
Features of type NUMERICAL.
Label column.
Number of label categories (used for classification only).
Index of the column matching "cv_group" in the "TrainingConfig".
Index of the column matching "ranking_group" in the "TrainingConfig".
Index of the column matching "uplift_treatment" in the "TrainingConfig".
Data for specific dataset columns. This field is either empty, or contains exactly one value for each column in the dataset.
Description of the importance of a given attribute. The semantics of "importance" are variable.
Next ID: 3
Used in:
Next ID: 2
Used in: