Used in:
Configuration of the evaluation of a model. Describes how the evaluation should be done.
Used in:
Task of the model.
Evaluation configuration depending on the type of problem.
Fraction of predictions to sample. If no predictions need to be sampled (i.e. no part of the configuration requires them), this parameter is ignored and no predictions are sampled.
Number of bootstrapping samples used to evaluate metric confidence intervals and statistical tests (i.e. all the metrics ending with "[B]"). If <=0, bootstrapping estimation is disabled. Note: Bootstrapping is done on the sampled predictions (controlled by the "prediction_sampling" parameter). Note: Bootstrapping is an expensive computation. Therefore, for quick experimentation with modeling, bootstrapping can be temporarily reduced or disabled.
Weights of the examples. This field does not have to match the "weight_definition" used during model training. For example, weighting can be enabled for evaluation and disabled for training. Such a case is rare, however.
Force usage of the slow engine for predictions. This option is ignored by functions that are called with a fast engine.
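For illustration, a minimal C++ sketch of setting these options. The field names "prediction_sampling" and "num_bootstrapping_samples" follow the descriptions in this section, while the message path and the presence of generated setters with exactly these names are assumptions:

```cpp
// Minimal sketch (not the canonical API): configuring evaluation options.
// The message path "metric::proto::EvaluationOptions" is an assumption;
// the field names follow the descriptions above.
metric::proto::EvaluationOptions options;
options.set_prediction_sampling(0.5f);        // Sample ~50% of the predictions.
options.set_num_bootstrapping_samples(2000);  // Enable the "[B]" confidence intervals.
// options.set_num_bootstrapping_samples(0);  // <=0 disables bootstrapping for quick runs.
```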
Used in:
(message has no fields)
Used in:
Whether to compute the ROC metrics (and other metrics using the same type of computation, e.g. PR-AUC, P@R).
Maximum number of points in the ROC curve.
List of recall values (between 0 and 1) for the evaluation of precision at given recall.
List of precision values (between 0 and 1) for the evaluation of recall at given precision.
List of volume values (between 0 and 1) for the evaluation of precision at given volume.
List of false positive rates for the evaluation of recall at given false positive rates.
List of recall values for the evaluation of false positive rate at given recall.
Next ID: 8
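Continuing the sketch above, the operating-point lists could be requested as follows. Every field name in this snippet (the "classification" sub-message, "precisions_at_recall", "recalls_at_precision", "roc_enable") is a hypothetical name chosen to match the descriptions, not a confirmed identifier:

```cpp
// Hypothetical field names; only the semantics come from the descriptions above.
auto* classification = options.mutable_classification();
classification->set_roc_enable(true);             // Compute ROC / PR-AUC / P@R metrics.
classification->add_precisions_at_recall(0.90f);  // Report precision at 90% recall.
classification->add_recalls_at_precision(0.95f);  // Report recall at 95% precision.
```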
Used in:
Number of evaluated elements.
Rank cut-off at which Mean Reciprocal Rank is computed.
If false (default) and all the predictions (items) are in the same group (i.e. there is only one group), an error is raised.
Used in:
Whether to compute the regression plots (histograms of the ground truth, residuals and predictions, normality test of the residuals, conditional plots).
Used in:
(message has no fields)
Evaluation results of a model. This proto is generated by the "EvaluateLearner" or "model->Evaluate()" functions. For manual evaluation, this proto is best generated using the "InitializeEvaluation", "AddPrediction" and "FinalizeEvaluation" functions in "metric.h". This proto can be converted into human readable text with "AppendTextReport" or into html+plots with "SaveEvaluationInDirectory". The html version contains more information than the raw text. Individual metrics can be extracted using the utility methods defined in "metrics.h", e.g. "Accuracy()", "LogLoss()", "RMSE()".
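A minimal sketch of the manual workflow described above. The function names ("InitializeEvaluation", "AddPrediction", "FinalizeEvaluation", "AppendTextReport", "Accuracy", "RMSE") come from this description; their namespaces, exact signatures, and the extra arguments (label column spec, random generator) are assumptions and may differ from the real "metric.h":

```cpp
// Sketch only: manual evaluation as described above. "label_column" (the
// dataspec column of the label), "predictions", and "rnd" (a random engine
// used for prediction sampling) are placeholders assumed to be provided by
// the surrounding code.
metric::proto::EvaluationOptions options;
metric::proto::EvaluationResults evaluation;

metric::InitializeEvaluation(options, label_column, &evaluation);
for (const auto& prediction : predictions) {
  metric::AddPrediction(options, prediction, label_column, &rnd, &evaluation);
}
metric::FinalizeEvaluation(options, label_column, &evaluation);

// Human readable report and individual metric extraction.
std::string report;
metric::AppendTextReport(evaluation, &report);
const auto accuracy = metric::Accuracy(evaluation);  // Classification evaluations.
const auto rmse = metric::RMSE(evaluation);          // Regression evaluations.
```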
Used in:
Number of predictions (weighted by example weight).
Number of predictions (without weights).
Sampled predictions. Only sampled if necessary (e.g. if the ROC is computed).
Number of sampled predictions (weighted by example weight).
Task of the model.
Evaluation results depending on the type of problem.
The dataspec of the label column. This field can contain information such as the possible label values, the distribution of the label values, the string representation of the label values, etc.
Training time of the model. In case of cross-validation evaluation results, "training_duration_in_seconds" is the average training time of a single model.
Value of the loss function used to optimize the model. Not all machine learning algorithms optimize a loss function, and different loss functions can be compatible with a given task.
Number of folds used for the evaluation. The number of folds is 1 for train-and-test, and equal to the number of cross-validation folds in the case of cross-validation.
Users can use this field to store values for any custom metrics.
Next ID: 17
Used in:
(message has no fields)
Used in:
Confusion between the label and the predictions. Note that confusion tables are stored column major (which, admittedly, is confusing).
One-vs-other Receiver operating characteristic curve. Indexed by the categorical label value.
Sum of the log loss.
Accuracy of the model. If both "accuracy" and "confusion" are specified, they represent the same value.
Next ID: 6
Used in:
Fraction of examples where the example with the highest prediction is also the example with the highest relevance value.
Used in:
Sum of the squared errors. For regression only.
Sum of the labels.
Sum of the square of the labels.
Lower and upper bounds of the RMSE computed using non-parametric bootstrapping.
Sum of absolute value of the error.
Next ID: 7
Used in:
Note: In the case of multiple treatments, the "auuc" and "qini" are the example-weighted averages of the per-treatment AUUC and Qini. We use the implementation described in the work of Guelman ("Optimal personalized treatment learning models with insurance applications") or Betlei ("Treatment targeting by AUUC maximization with generalization guarantees").
Number of possible treatments. The treatment values (i.e. the value of the categorical column specifying the treatment) are in [1, num_treatments+1) with value "1" reserved for the control treatment. For example, in case of single-treatment vs control, "num_treatments=2" and the treatment value will be 1 (control) or 2 (treatment).
The Conditional Average Treatment Effect calibration metric (cate_calbration) computes the l2 expected calibration error of a binary treatment uplift model. Miscalibration is a phenomenon in which the magnitude of a treatment effect is overestimated due to overfitting on the CATE training data. Here we use the expected "l2 norm of the difference between 1) the predicted CATE and 2) an unbiased estimation of the observed CATE" over all uplift values. The metric value is greater than 0, with lower values being more desirable, i.e. "more calibrated". This metric is defined in equation (2.4) of the paper "Calibration Error for Heterogeneous Treatment Effects" by Xu et al. (https://arxiv.org/pdf/2203.13364.pdf).
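Schematically, the description above amounts to the following (a paraphrase of the prose, not a transcription of equation (2.4) of Xu et al.):

\mathrm{CateCalibrationError} = \mathbb{E}\left[\left(\hat{\tau}(x) - \tilde{\tau}(x)\right)^{2}\right]

where \hat{\tau}(x) is the predicted CATE (uplift) and \tilde{\tau}(x) an unbiased estimate of the observed CATE, with the expectation taken over the uplift values.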
Reference to a metric. "MetricAccessor" is used as a parameter of the function "GetMetric" to extract metric values from an evaluation results proto. Example: given a = EvaluationResults { classification { accuracy: 0.7 auc: 0.8 ap: 0.9 } } and b = MetricAccessor { classification {} }, GetMetric(a, b) -> 0.7.
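A hedged C++ sketch of the same example. "GetMetric", "MetricAccessor" and "EvaluationResults" are named above, while the namespaces, the use of text-format parsing, and the return type of "GetMetric" are assumptions:

```cpp
#include "google/protobuf/text_format.h"

// Sketch of the example above: build the accessor and extract the metric.
metric::proto::MetricAccessor accessor;
google::protobuf::TextFormat::ParseFromString("classification {}", &accessor);

// "evaluation" is an EvaluationResults proto with classification.accuracy == 0.7.
const auto value = metric::GetMetric(evaluation, accessor);  // -> 0.7
```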
Used in:
Used in:
(message has no fields)
Used in:
Used in:
(message has no fields)
Used in:
(message has no fields)
Used in:
Used in:
(message has no fields)
Used in:
(message has no fields)
Used in:
Used in:
(message has no fields)
Used in:
Used in:
Used in:
Used in:
Used in:
(message has no fields)
Used in:
Used in:
(message has no fields)
Used in:
(message has no fields)
Used in:
Used in:
(message has no fields)
Used in:
(message has no fields)
Used in:
Used in:
(message has no fields)
Used in:
(message has no fields)
Used in:
Estimated measure of a metric.
Used in:
Expected value.
Upper and lower 95% bounds estimated using bootstrapping.
A receiver operating characteristic curve.
Used in:
Points sorted with decreasing recall (i.e. increasing threshold).
Sum of the tp+fp+tn+fn of one element (this is the same for all elements). "sum" is equal to "count_predictions" if the ROC is computed without sampling (i.e. roc_prediction_sampling==1).
Area under the curve.
Precision/Recall AUC.
Average Precision.
Metric X evaluated under the constraint of a given metric Y value. These three fields have the same number of elements as the fields of the same name in "EvaluationOptions::Classification".
Lower and upper bounds of all metrics computed using non-parametric percentile bootstrapping. Only available if bootstrapping is enabled i.e. num_bootstrapping_samples>=1.
Used in:
True/False Positive/Negative.
Value of a metric X (e.g. recall) for a given other metric Y value (e.g. FPR).
Used in: