Options for aggregating multi-class / multi-label outputs. When used, the associated MetricsSpec metrics must be binary classification metrics (NOT multi-class classification metrics).
Used in:
Computes aggregate metrics by treating all examples as equal (i.e. flattens the prediction/label pairs across all classes and performs the computation as if they were separate examples in a binary classification problem). Micro averaging is typically used with multi-class outputs.
Computes aggregate metrics by treating all classes as equal (i.e. computes binary classification metrics separately for each class and then takes the average). This approach is good for cases where each class is equally important and/or the class label distribution is balanced. Macro averaging is typically used with multi-label outputs. If macro averaging is enabled without using top_k_list, class_weights must be configured in order to identify which classes the average will be computed over.
Computes aggregate metrics using macro averaging, but weights the classes during aggregation by the ratio of positive labels for each class. If weighted macro averaging is enabled without using top_k_list, class_weights must be configured in order to identify which classes the average will be computed over.
Weights to apply to classes during aggregation (only supported if top_k_list is not used). Each key corresponds to a class ID. For micro aggregation the weights are applied to each prediction/label pair; for macro aggregation the weights are applied to the overall metric computed for each class prior to aggregation. If class_weights are configured but some keys are not provided, their weights are assumed to be 0.0. This allows class_weights to be used to filter the classes used for aggregation. Note that for macro_average and weighted_macro_average, class_weights are required when top_k_list is not used. Also note that when used with weighted_macro_average, weights are applied in two forms (from the ratio of positive labels and from the values provided here), which may or may not be desired (i.e. setting all the weights to 1.0 is the most common configuration for weighted_macro_average).
Performs aggregation based on the classes with the top k predicted values for each value of top k provided. If not set then all classes are used. Note that unlike the top k used with binarization this truncates the list of classes to only the top k values (i.e. it does not set non-top k to -inf).
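As a hedged illustration, the two common configurations described above might look like the following in text format (field names are taken from the descriptions above; the inner values field of top_k_list follows RepeatedInt32Value and is an assumption):

  # Micro averaging: flatten all prediction/label pairs (optionally truncated to the top 2 classes).
  micro_average: true
  top_k_list { values: [2] }

  # Macro averaging over classes 0 and 1 only (class_weights doubles as a class filter).
  macro_average: true
  class_weights { key: 0 value: 1.0 }
  class_weights { key: 1 value: 1.0 }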
Aggregation types used with AggregationOptions.
Used in:
See AggregationOptions.micro_average.
See AggregationOptions.macro_average.
See AggregationOptions.weighted_macro_average.
For metrics which return an array of values.
Used in:
Exactly one of these fields, corresponding to the data type, should be set.
Used in:
The slice key for the metrics.
The cross slice key for the metrics.
Attribution keys and values.
Used in:
Attribution values keyed by feature key (e.g. 'age', etc).
Attribution keys uniquely identify aggregated attribution values.
Used in:
Attribution metric name (e.g. 'mean', 'total', etc)
Optional model name (if multi-model evaluation).
Optional output name (for multi-output models).
Optional sub key associated with attribution (class_id, etc).
If true, the metric is weighted by examples. If false, then the metric is not weighted by examples. If unset then it is unknown as to whether the metric was weighted by examples or not (i.e. the metrics were defined inside of a model). See MetricsSpecs.example_weighted for more information.
If true, this is a diff of attributions based on comparison with baseline.
Options for binarizing multi-class / multi-label outputs. When used, the associated MetricsSpec metrics must be binary classification metrics (NOT multi-class classification metrics).
Used in:
Creates binary classification metrics based on one-vs-rest for each value of class_id provided.
Creates binary classification metrics based on the kth predicted value for each value of k provided.
Creates binary classification metrics based on the top k predicted values for each value of top k provided. How this is computed is up to each metric implementation. However, the default implementation is such that for a given top k setting, the input prediction arrays will be updated to set the non-top k predictions to -inf before flattening the resulting array into a single binarized value. This makes top k well suited to calculations such as precision@k or recall@k, but may not be well suited for other binary classification metrics unless special handling is provided. Note that precision@k and recall@k can also be configured directly as multi-class classification metrics by setting top_k on the metric itself.
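As a hedged sketch, a binarization configuration that produces one-vs-rest metrics for three classes plus top-3 metrics might look like this (the field names class_ids and top_k_list, and the inner values field from RepeatedInt32Value, are assumptions based on the descriptions above):

  # One-vs-rest metrics for classes 0, 1, and 2, plus metrics over the top 3 predictions.
  class_ids { values: [0, 1, 2] }
  top_k_list { values: [3] }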
Represents a real value which could be a pointwise estimate, optionally with approximate bounds of some sort. For instance, for AUC, these bounds could be the upper and lower Riemann sum of the integral.
Used in:
The lower bound of the range.
The upper bound of the range.
Represents an exact value if lower_bound and upper_bound are unset; otherwise it is an approximate value. An approximate value should be within the range [lower_bound, upper_bound].
Optionally describe the methodology that was used to calculate the bounds.
Used in:
Used when calculating AUC: the bounds are the upper and lower Riemann sums of the integral.
Used to calculate confidence intervals using Poisson bootstrapping. For more details, please see: http://www.unofficialgoogledatascience.com/2015/08/an-introduction-to-poisson-bootstrap26.html
Used in:
Used in:
Used in:
Each MetricValue field within this message will be populated with the same value type as in MetricKeyAndValue.value. This has the effect of creating a set of parallel data structures which provide elementwise confidence intervals. For example, if the MetricKeyAndValue.value contains an ArrayValue, then each of these fields will also contain an ArrayValue in which the array element at a given index will represent the lower bound, upper bound, and standard error for the MetricKeyAndValue.value element at that same index.
Used in:
The confidence interval method to use for all metrics.
Used in:
Confusion matrix at thresholds.
Used in:
Matrices have different types of value representations: bounded, t-distribution, and double. 1. Bounded values will be provided if the matrices are calculated using bootstrapping (note: the confidence level is set to 95%). 2. T-distribution values will be provided if the matrices are calculated using bootstrapping and the confidence level isn't set; the user then configures the confidence level through the frontend to get the final confidence intervals. Both TDistributionValue and BoundedValue are supported for now, but BoundedValue will eventually be deprecated. 3. Double values are being deprecated.
Used in:
CrossSliceKey contains two slices which are compared with each other.
Used in:
Cross slice metric threshold.
Used in:
A list of cross slicing specs to apply the threshold to.
Used in:
Cross slicing specification.
Used in:
TensorFlow Model Analysis config settings.
Used in:
Model specifications for models used. Only one baseline is permitted.
A list of specs where each spec represents a way to slice the data. An empty config means slice on the overall data. Example usages:
- slicing_specs: {}
  Slice consisting of the overall data.
- slicing_specs: { feature_keys: ["country"] }
  Slices for all values in feature "country". For example, we might get slices "country:us", "country:jp", etc.
- slicing_specs: { feature_values: [{key: "country", value: "us"}] }
  Slice consisting of "country:us".
- slicing_specs: { feature_keys: ["country", "city"] }
  Slices for all values in feature "country" crossed with all values in feature "city" (note this may be expensive).
- slicing_specs: { feature_keys: ["country"] feature_values: [{key: "age", value: "20"}] }
  Slices for all values in feature "country" crossed with the value "age:20".
A list of cross slicing specs where each spec represents a pair of slices whose associated outputs should be compared. By default slices will be created for both slicing_spec and baseline_spec if they do not already exist in slicing_specs.
Metrics specifications.
Additional configuration options.
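Putting the pieces together, a minimal EvalConfig in text format might look like the following sketch (metric class names and feature names are illustrative only):

  model_specs { name: "candidate" label_key: "label" }
  model_specs { name: "baseline" label_key: "label" is_baseline: true }
  slicing_specs {}                              # overall slice
  slicing_specs { feature_keys: ["country"] }   # one slice per country value
  metrics_specs {
    metrics { class_name: "ExampleCount" }
    metrics { class_name: "AUC" }
  }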
Config and version.
Evaluation run containing config, version and input parameters. This should be structurally compatible with EvalConfigAndVersion such that a saved EvalRun can be read as an EvalConfigAndVersion.
Location of data used with evaluation run.
File format used with evaluation run.
Locations of model used with evaluation run.
Options for use of example weights in metric computations. These settings are only useful if an example weight key is being used.
Used in:
Set to true to enable weighted metrics. Setting weighted to false has no effect. If weighted is true but an example weight key was not provided, then a weight of 1.0 will be assumed (which is effectively the same as unweighted, but the metric keys will have weighted set to true).
Set to true to enable unweighted metrics. Setting unweighted to false has no effect.
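For example, to compute both weighted and unweighted variants of every metric in the associated spec, both flags can be enabled (a sketch; see MetricsSpec for where these options are attached):

  weighted: true
  unweighted: true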
Generic change threshold message.
Used in:
Let delta be determined as described in the comments for Direction below. If delta > absolute, fail the validation.
Let delta be determined as described in the comments for Direction below. If delta / X_old > relative, fail the validation.
Generic value threshold message. Fail the validation if the value does not lie in [lower_bound, upper_bound], both boundaries inclusive.
Used in:
Lower bound. Assumed to be -Infinity if not set.
Upper bound. Assumed to be +Infinity if not set.
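A hedged sketch combining both threshold types inside a metric threshold (the value_threshold / change_threshold field names and the wrapped { value: ... } form for the double fields are assumptions based on the TFMA config):

  value_threshold {
    lower_bound { value: 0.7 }   # fail if the metric falls below 0.7
  }
  change_threshold {
    direction: HIGHER_IS_BETTER
    absolute { value: -1e-10 }   # candidate must not be worse than the baseline
  }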
Metric configuration.
Used in:
Name of a class derived from either tf.keras.metrics.Metric or tfma.metrics.Metric.
Optional name of the module associated with class_name. If not set, the class will be searched for under tfma.metrics followed by tf.keras.metrics.
Optional JSON encoded config settings associated with the class. The config settings are used to initialize the metric based on its associated from_config method. Typically the values that are used will be the same as the **kwarg values passed to the __init__ method for the class. For ease of use the leading and trailing '{' and '}' brackets may be omitted. Example: '"name": "my_metric", "thresholds": [0.5]'
Optional threshold for model validation on all slices.
Optional thresholds for model validation using specific slices.
Optional thresholds for model validation across slices.
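As an illustrative sketch, a single metric configuration with a JSON config and a per-metric threshold might look like this (the 'AUC' class and its num_thresholds kwarg come from tf.keras.metrics and are examples only; the threshold field layout is assumed from the descriptions above):

  class_name: "AUC"
  config: '"num_thresholds": 1000'
  threshold {
    value_threshold { lower_bound { value: 0.8 } }
  }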
Used in:
A metric key uniquely identifies a metric.
Used in:
Name of the metric ('auc', etc).
Optional model name associated with metric (if multi-model evaluation).
Optional output name associated with metric (for multi-output models).
Optional sub key associated with metric.
Optional type of aggregation (if AggregationOptions used).
If true, the metric is weighted by examples. If false, then the metric is not weighted by examples. If unset then it is unknown as to whether the metric was weighted by examples or not (i.e. the metrics were defined inside of a model). See MetricsSpecs.example_weighted for more information.
If true, this metric is a diff metric based on a comparison with the baseline.
Used in:
Stores metric values in different types so that the frontend knows how to visualize the values based on their types.
Used in:
bounded_value is deprecated for use as a confidence interval container. Only use to encode non-CI bounds, such as approximation bounds.
This field will contain a generic message to be used to communicate any extra information, such as in a scenario when no data is aggregated for a small data slice due to privacy concerns.
The slice key for the metrics.
The cross slice key for the metrics.
Metric keys and values.
A map to store metrics. Currently we convert the post_export_metric provided by TFMA to its appropriate type for better visualization, and map all other metrics to DoubleValue type.
Used in:
When the `confidence_interval` field is populated, the `value` field will contain the point estimate.
Metrics specification.
Used in:
List of metric configurations.
Names of models (as defined by model_specs) the metrics should be calculated for. If this list is empty then all the names defined in the model_specs will be assumed else these metrics will only be computed for the model names provided.
Names of outputs the metrics should be calculated for (required for multi-output models). See comment under the ModelSpec.prediction_key on the difference between output_name and prediction_key.
Optional weights to use when aggregating across outputs. Output aggregation will only be performed when weights are configured and only between outputs that have a weight set. For example, assume metrics contains 'auc' and the following output information was configured:
  output_names = ['output_1', 'output_2', 'output_3']
  output_weights = {'output_1': 1.0, 'output_2': 1.0}
An 'auc' metric will be computed for each output along with an overall auc metric calculated as (1.0*(auc output_1) + 1.0*(auc output_2)) / (1.0+1.0).
Optional binarization options for converting multi-class / multi-label model outputs into outputs suitable for binary classification metrics.
Optional aggregation options for computing overall aggregate metrics for multi-class / multi-label model outputs. Aggregation options are computed separately from binarization options so both can be set safely at the same time.
Optional example weight options. If no options are provided then the metrics will be weighted by default provided at least one of the models and outputs associated with this spec has an example_weight_key configured, otherwise the metrics will be unweighted by default. If weighted is enabled for the metrics, but an example_weight_key is not associated with a given model or output, then those metrics will still be considered weighted just using a weight value of 1.0.
Optional query key for query/ranking based metrics.
Thresholds defined here are intended to be used for metrics that were saved with the model and computed by default without requiring a metric config. All other thresholds should be defined in the MetricConfig associated with the metric. Optional thresholds for model validation on all slices (keyed by the associated metric name - e.g. 'auc', etc).
Optional thresholds for model validation using specific slices (keyed by the associated metric name - e.g. 'auc', etc).
Optional thresholds for model validation across slices (keyed by the associated metric name - e.g. 'auc', etc).
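A hedged sketch of a multi-output MetricsSpec that uses output aggregation as described above (model, output, and metric names are illustrative):

  metrics { class_name: "AUC" }
  model_names: ["candidate"]
  output_names: ["output_1", "output_2"]
  output_weights { key: "output_1" value: 1.0 }
  output_weights { key: "output_2" value: 1.0 }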
Used in:
SliceKey of a given slice.
CrossSliceKey for cross slice validations.
All failures under a slice.
Model specification.
Used in:
Name used to distinguish different models when multiple instances are being evaluated. Note that this name is not necessarily the name of the model as seen by a trainer, etc. This name is more of an alias for both a model name and a particular version and/or format. For example, common names to use here might be "candidate" or "baseline" when referring to different versions of the same model that are being evaluated for the purpose of model validation. Note also that if only a single ModelSpec is used in the config, then no model_name will be set in any metrics keys that are output regardless of whether a name was provided here or not.
The type of the model that is being evaluated. Supported types include "tf_keras", "tf_estimator", "tf_lite", "tf_js", and "tf_generic". If unset, automatically detects whether the model_type is "tf_keras", "tf_estimator", or "tf_generic" based on whether the model loads as a keras model followed by whether or not the signature_name is set to "eval".
Optional name of signature to use for inference (e.g. "serving_default"). For estimator based EvalSavedModels, this must be set to "eval". If not set, then the default depends on the model_type. For "tf_keras" models the model itself will be used for inference. For models that support signatures ("tf_generic", etc) "predict" (if it exists) or "serving_default" will be assumed. For models that don't use signatures ("tf_lite", etc) this setting will be ignored.
Optional names of preprocessing functions to run in the order that they should be invoked. Preprocessing functions are used to transform the features into the form required for inference and metrics evaluation. The output from preprocessing can also be used for slicing. Preprocessing functions can be saved as signatures or as attributes on the saved model. If no names are provided, the names "transformed_features" and "transformed_labels" will be searched for. The output of a preprocessing function will override the feature with the same name for label or example weight extraction purposes. If a preprocessing function outputs a non-dict value, then it will be stored as a feature under the preprocessing function name itself. For example, if a function called "transformed_labels" outputs a single array value, then it will be associated with the feature name "transformed_labels". This name can be used when setting the "label_key" or in slicing configs.
Label key (single-output model). The key can identify either a transformed feature (see preprocessing_function_names) or a raw input feature. Use one of label_key or label_keys.
Label keys (multi-output model) keyed by output_name. If all the outputs for a multi-output model use the same key, then a single key may also be used. Use one of label_key or label_keys.
oneof not allowed with maps
Optional prediction key (single-output model). The prediction key is used to distinguish between different values when the output from the predict call is a dict instead of a single tensor. For estimator models this is always the case, and the prediction key is automatically inferred: the keys 'scores', 'logistic', 'predictions', and 'probabilities' are tried (in that order). For Keras models, outputs are typically not dicts, but if they are then the prediction key is not inferred and so MUST be specified. Note: for multi-class predictions, a prediction key needs to be specified so that the metrics can be computed correctly per class. The prediction key is also used in cases where the predictions are pre-calculated and stored alongside the features (i.e. a model is not used). In this case the prediction key referring to a key in the features dictionary must be provided. Use one of prediction_key or prediction_keys. Note that prediction_key is NOT the same as the output_name used in the MetricsSpec. The output_name refers to the name of an output for a multi-output model (for tf.Estimator this is called the "head", whereas for Keras the term output is used). Some outputs (typically tf.Estimator) are themselves made up of a dict of multiple tensors (e.g. 'classes', 'probabilities', etc). The prediction_key specifies which key in the output contains the prediction values (i.e. 'probabilities', etc). For example, a tf.Estimator model might output the following:
  {
    'head1': {
      'classes': classes_tensor,
      'class_ids': class_ids_tensor,
      'logits': logits_tensor,
      'probabilities': probabilities_tensor
    },
    'head2': {
      'classes': classes_tensor,
      'class_ids': class_ids_tensor,
      'logits': logits_tensor,
      'probabilities': probabilities_tensor
    }
  }
Here 'head1' or 'head2' would be the output_name, whereas 'probabilities' would be the prediction_key.
Optional prediction keys (multi-output model) keyed by output_name. Use one of prediction_key or prediction_keys. See comment under prediction_key on the difference between output_name and prediction_key.
oneof not allowed with maps
Optional example weight key (single-output model). The example_weight_key can identify either a transformed feature (see preprocessing_function_names) or raw input feature. Use one of example_weight_key or example_weight_keys.
Optional example weight keys (multi-output model) keyed by output_name. If all the outputs for a multi-output model use the same key, then a single key may also be used. Use one of example_weight_key or example_weight_keys.
oneof not allowed with maps
True if baseline model (otherwise candidate). Only one baseline is allowed per evaluation run.
Options for padding prediction and label arrays before feeding them to metrics. Predictions and labels may not have the same length (for example, the model may pad the predictions so that a batch of predictions is aligned, while labels are extracted from the input and are not padded), whereas metrics may require them to be of the same length. TFMA can pad the shorter one with the configured values.
Batch size used by the inference implementation. This batch size is only used for inference with this model. It does not affect the batch size of other models, and it does not affect the batch size used in the rest of the pipeline. This is implemented for the ServoBeamPredictionsExtractor and TfxBslPredictionsExtractor.
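Tying these fields together, a candidate/baseline pair of model specs might be written as follows (feature and signature names are illustrative):

  model_specs {
    name: "candidate"
    signature_name: "serving_default"
    label_key: "label"
    example_weight_key: "weight"
  }
  model_specs {
    name: "baseline"
    signature_name: "serving_default"
    label_key: "label"
    example_weight_key: "weight"
    is_baseline: true
  }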
Used in:
Entries are sorted in order of threshold.
Used in:
Only entries with non-zero num_weighted_examples are included. If the top prediction was less than the threshold, then the predicted_class_id will be set to -1. Entries are sorted in order of actual_class_id followed by predicted_class_id.
Used in:
Used in:
Entries are sorted in order of threshold.
Used in:
Only entries with non-zero values are included. Entries are sorted in order of actual_class_id followed by predicted_class_id.
Used in:
Additional configuration options.
Used in:
True to include metrics saved with the model(s) (where possible) when calculating metrics. Any metrics defined in metrics_specs will override the metrics defined in the model if there are overlapping names.
True to calculate confidence intervals.
Int value used to omit slices with example count < min_slice_size.
List of outputs that should not be written (e.g. 'metrics', 'plots', 'analysis', 'eval_config.json').
Options for padding prediction and label arrays before feeding them to metrics. Predictions and labels may not have the same length (for example, the model may pad the predictions so that a batch of predictions is aligned, while labels are extracted from the input and are not padded), whereas metrics may require them to be of the same length. TFMA can pad the shorter one with the configured values.
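A small sketch of these options in text format (the field names include_default_metrics, min_slice_size, and disabled_outputs, the wrapped { value: ... } form for the wrapper types, and the values field of RepeatedStringValue are assumptions based on the descriptions above):

  options {
    include_default_metrics { value: true }
    min_slice_size { value: 50 }
    disabled_outputs { values: "plots" }
  }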
Used in:
If neither field of the oneof is set, 0 will be used.
If neither field of the oneof is set, 0 will be used.
Used in:
A list of slicing specs to apply the threshold to. An empty SlicingSpec represents the overall slice. NOTE: these are only references to slice definitions, not new definitions. Slices must have been defined using EvalConfig.slicing_specs. See EvalConfig.slicing_specs for examples.
Used in:
Used in:
For calibration plot and prediction distribution.
For the AUC curve and AUPRC curve.
For multi-class confusion matrix.
For multi-label confusion matrix.
This field will contain a generic message to be used to communicate any extra information, such as in a scenario when no data is aggregated for a small data slice due to privacy concerns.
A plot key uniquely identifies a set of PlotData.
Used in:
Optional plot name associated with plot.
Optional model name associated with plot (if multi-model evaluation).
Optional output name associated with plot (for multi-output models).
Optional sub key associated with plot.
If true, the plot is weighted by examples. If false, then the plot is not weighted by examples. If unset then it is unknown as to whether the plot was weighted by examples or not. See MetricsSpecs.example_weighted for more information.
The slice key for the metrics.
The cross slice key for the metrics.
Plot keys and values.
The plot data. Deprecated; please use 'plots' instead.
Use this field instead of tfma_plots to support multiple plot evaluations in a single evaluator run. Note that each entry of TFMAPlotData should contain all plots for the same grouping, e.g. for the same head of a multi-head model or for the same class in the multi-class case. For example, the key can be of the form 'post_export_metrics/head_name' for a multi-head model.
Used in:
Repeated int32 value. Used to allow a default if no values are given.
Used in:
Repeated string value. Used to allow a default if no values are given.
Used in:
A single slice key.
Used in:
A slice key, which may consist of multiple single slice keys.
Used in:
Information about slices matched.
Used in:
Slicing specification.
Used in:
Feature keys to slice on. Note that the feature key can be either a transformed feature key (see ModelSpec.preprocessing_function_names) or a raw feature key parsed directly from the inputs. If a transformed feature key and raw feature key use the same name, the transformed feature will take precedence. Note also that while transformed features are associated with the models that processed them, when it comes to slicing all the unique values across all models will be used.
Feature values to slice on keyed by associated feature keys. The same caveats that apply to feature_keys with respect to feature transformations and raw features apply to feature_values as well (see feature_keys for more information). Note that strings representing ints and floats will be automatically converted to ints and floats respectively and will be compared against both the string versions and int or float versions of the associated features.
This config is an alternative to the config above. It must have the pattern:
  "SELECT STRUCT({feature_name} [AS {slice_key}]) [FROM example.feature_name [, example.feature_name, ... ] [WHERE ... ]]"
The "example.feature_name" inside the FROM statement is used to flatten the repeated fields. For non-repeated fields, you can directly write the config as follows: "SELECT STRUCT(non_repeated_feature_a, non_repeated_feature_b)". When executing, this SQL expression will be further wrapped as:
  "SELECT ARRAY({slice_keys_sql}) as slices FROM Examples as example"
The resulting output of the query will have the same number of rows as the input dataset. Each row will have only one column named "slices". Each row is a list, and each element in the list will be a list of ('key', 'value') tuples representing a slice. For example, a single row could be:
  [[('gender', 'male'), ('country', 'USA')], [('zip_code', '123456')]]
In the user's SQL statement, "example" is a keyword that binds to each input "row". The semantics of this variable will depend on the decoding of the input data to the Arrow representation (e.g., for tf.Example, each key is decoded to a separate column). Thus, structured data can be readily accessed by iterating/unnesting the fields of the "example" variable.
Example 1: slice_keys_sql = "SELECT STRUCT(gender) FROM example.gender"
- This is equivalent to the config: feature_keys=[gender]
- The slice key and value will be: (gender, {gender_value})
Example 2: slice_keys_sql = "SELECT STRUCT(gender, country) FROM example.gender, example.country WHERE country = 'USA'"
- This is equivalent to the config: feature_keys=[gender], feature_values={country: 'USA'}
- The slice key and value will be: (gender_x_country, {gender_value}_x_USA)
Example 3 (background positive subgroup negative): slice_keys_sql = "SELECT STRUCT('male' as bpsn) FROM example WHERE ('male' not in UNNEST(example.gender) and 1 in UNNEST(example.label)) or ('male' in UNNEST(example.gender) and 0 in UNNEST(example.label))"
- The slice key and value will be: (bpsn, male)
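For comparison, the first two SQL examples above can also be written directly with the feature_keys / feature_values fields, following the style of the examples under EvalConfig.slicing_specs (a sketch):

  slicing_specs { feature_keys: ["gender"] }
  slicing_specs {
    feature_keys: ["gender"]
    feature_values: [{key: "country", value: "USA"}]
  }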
A sub key identifies specialized sub-types of metrics and plots.
Used in:
Used with multi-class metrics to identify a specific class ID.
Used with multi-class metrics to identify the kth predicted value.
Used with multi-class and ranking metrics to identify top-k predicted values.
Represents a t-distribution, which includes the sample mean, sample standard deviation, and degrees of freedom of the samples. It is calculated when the evaluation runs on multiple samples, which by default are generated by the Poisson bootstrap method: http://www.unofficialgoogledatascience.com/2015/08/an-introduction-to-poisson-bootstrap26.html
Used in:
Sample Mean.
Sample Standard Deviation.
Number of degrees of freedom.
Represents the value of the data if calculated without bootstrapping. This field is deprecated: going forward, TDistributionValue will be removed from the oneof in MetricValue and the unsampled value will be populated in MetricValue.double_value.
The value will be converted into an error message if we do not know its type.
Used in:
Extra details about validation.
Used in:
Information about failure per metric.
Used in:
True if there are no metric validation failures or missing slices, else false.
True if failure due to missing thresholds.
Information about which threshold is blocking which metric.
Information about missing slices.
Information about missing cross slices.
Extra details about validation performed.
True if this run is rubberstamped. A rubberstamped validation is one in which there was no baseline model, so diff thresholds were ignored, but in which the non-diff thresholds are still checked.
Value at cutoffs, e.g. for precision@K or recall@K.
Used in:
Used in: