Area under curve for the ROC-curve. https://www.tensorflow.org/api_docs/python/tf/keras/metrics/AUC
Used in:
(message has no fields)
Area under curve for the precision-recall-curve. https://www.tensorflow.org/api_docs/python/tf/keras/metrics/AUC
Used in:
(message has no fields)
Used in:
If unset, placeholders will be dropped.
Additional information about the schema or about a feature.
Used in:
Tags can be used to mark features. For example, the tag on the user_age feature can be `user_feature`, and the tags on the user_country feature can be `location_feature` and `user_feature`.
Free-text comments. This can be used as a description of the feature, developer notes etc.
Application-specific metadata may be attached here.
Message to represent the anomalies, which describe the mismatches (if any) between the stats and the schema.
The baseline schema that is used.
The format of the keys in anomaly_info. If absent, default is DEFAULT.
Information about feature-level anomalies.
Information about dataset-level anomalies.
True if numExamples == 0.
If drift / skew detection was conducted, this field will hold the comparison results for all the features compared, regardless of whether a related anomaly was reported.
TODO(b/123519907): Remove this. The hook to attach any usage- and tool-specific metadata. Example:

  message SchemaStamp {
    // extension ID is any CL number that has not been used in an extension.
    extend proto2.bridge.MessageSet {
      optional StampedSchemaDiff message_set_extension = 123445554;
    }
    optional string schema_stamp = 1;
  }

Then the following proto msg encodes an Anomalies with an embedded SchemaStamp:

  Anomalies {
    metadata {
      [SchemaStamp]: { schema_stamp: "stamp" }
    }
  }

GOOGLE-LEGACY optional proto2.bridge.MessageSet metadata = 5;
Map from a column to the difference that it represents.
Used in:
At present, this indicates that the keys in anomaly_info refer to the raw field names in the Schema.
The serialized path to a struct.
Message to represent information about an individual anomaly.
Used in:
A path indicating where the anomaly occurred. Dataset-level anomalies do not have a path.
A description of the entire anomaly.
A shorter description, suitable for UI presentation. If there is a single reason for the anomaly, identical to reason[0].short_description. Otherwise, summarizes all the reasons.
The comparison between the existing schema and the fixed schema.
Reason for the anomaly. There may be more than one reason; e.g., the field might sometimes be missing AND a new value might be present.
Used in:
A short description of an anomaly, suitable for UI presentation.
A longer description of an anomaly.
Used in:
Next ID: 89
Used in:
Multiple reasons for anomaly.
Integer larger than 1
BYTES type when expected INT type
BYTES type when expected STRING type
FLOAT type when expected INT type
FLOAT type when expected STRING type
INT type when expected STRING type
Integer smaller than 0
STRING type when expected INT type
Expected a specific string, but a different string was seen.
Boolean had float values other than 0 and 1.
BoolDomain has invalid configuration.
BYTES type when expected STRING type
FLOAT type when expected STRING type
INT type when expected STRING type
Invalid UTF8 string observed
Unexpected string values
The number of values in a given example is too large
The fraction of examples containing a feature is too small
The number of examples containing a feature is too small
The number of values in a given example is too small
No examples contain the value
The feature is present as an empty list
The feature is repeated in an example, but was expected to be a singleton
The feature had too many unique values (string and categorical features only).
The feature had too few unique values (string and categorical features only).
The feature has a constraint on the number of unique values but is not of a type that has the number of unique values counted (i.e., is not string or categorical).
There is a float value that is too high
The type is not FLOAT
There is a float value that is too low
The feature is supposed to be floats encoded as strings, but there is a string that is not a float
The feature is supposed to be floats encoded as strings, but it was some other type (INT, BYTES, FLOAT)
The type is completely unknown
Float feature includes NaN values.
Float feature includes Inf or -Inf values.
There is an unexpectedly large integer
The type was supposed to be INT, but it was not.
The feature is supposed to be ints encoded as strings, but some string was not an int.
The type was supposed to be STRING, but it was not.
There is an unexpectedly small integer
The feature is supposed to be ints encoded as strings, but it was some other type (INT, BYTES, FLOAT)
Unknown type in stats proto
The fraction of examples containing TensorFlow supported images is lower than the threshold set in the Schema.
There are no stats for a column at all
There is a new column that is not in the schema.
Training serving skew issue
Expected STRING type, but it was FLOAT.
Expected STRING type, but it was INT.
Control data is missing (either scoring data or previous day).
Treatment data is missing (either treatment data or current day).
L infinity between treatment and control is high.
Approximate Jensen-Shannon divergence between treatment and control is high.
The normalized absolute difference between treatment and control is high.
No examples in the span.
The value feature of a sparse feature is missing and at least one feature defining the sparse feature is present.
An index feature of a sparse feature is missing and at least one feature defining the sparse feature is present.
The length of the features representing a sparse feature does not match.
Name collision between a sparse feature and raw feature.
Invalid custom semantic domain.
There are not enough examples in the current data as compared to a control dataset.
There are too many examples in the current data as compared to a control dataset.
There are not enough examples in the dataset.
There are too many examples in the dataset.
Name collision between a weighted feature and a raw feature.
The value feature of a weighted feature is missing on examples where the weight feature is present.
The weight feature of a weighted feature is missing on examples where the value feature is present.
The length of the features representing a weighted feature does not match.
The nesting level of the feature values does not match.
The domain specified is not compatible with the physical type.
Feature on schema has no name.
Feature on schema has no type.
Triggered for invalid schema specifications, e.g. min_fraction < 0.
Triggered for invalid domain specifications in schema.
The type of the data is inconsistent with the specified type.
A value did not show up the min number of times within a sequence.
A value showed up more than the max number of times within a sequence.
A value did not show up in at least the min fraction of sequences.
A value showed up in greater than the max fraction of sequences.
Too small a fraction of feature values matched vocab entries.
The average token length was too short.
A sequence violated the location constraint.
A feature was specified as an embedding but was not a fixed dimension.
A feature contains an image that has more bytes than the max byte size.
A feature is supposed to be of a fixed shape but its valency stats do not agree.
Constraints are specified within the schema but cannot be verified because the corresponding stats are not available.
A derived feature had a schema lifecycle other than VALIDATION_DERIVED or DISABLED.
The following are experimental and subject to change.
A derived feature is represented in the schema with an invalid or missing validation_derived_source.
The following type is experimental and subject to change. The statistics did not specify a custom validation condition.
Used in:
Audio data.
Used in:
(message has no fields)
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/binary_accuracy
Used in:
(message has no fields)
Configuration for a binary classification task. The output is one of two possible class labels, encoded as the same type as the label column. BinaryClassification is the same as MultiClassClassification with n_classes = 2.
Used in:
The label column.
The name of the label. Assumes the label is a flat, top-level field.
A path can be used instead of a flat string if the label is nested.
(optional) The weight column.
(optional) specification of the positive and/or negative class value.
Defines which label value is the positive and/or negative class.
Used in:
This value is the positive class.
This value is the negative class.
Specifies a label's value which can be used for positive/negative class specification.
Used in:
Binary cross entropy as a metric is equal to the negative log likelihood (see logistic regression). In addition, when used to solve a binary classification task, binary cross entropy implies that the binary label will maximize binary accuracy. binary_crossentropy(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/binary_crossentropy
Used in:
(message has no fields)
DEPRECATED
Used in:
Encodes information about the domain of a boolean attribute that encodes its TRUE/FALSE values as strings, or 0=false, 1=true. Note that FeatureType could be either INT or BYTES.
Used in:
Id of the domain. Required if the domain is defined at the schema level. If so, then the name must be unique within the schema.
String values for TRUE/FALSE.
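As a sketch of these semantics (plain Python, not the generated proto classes; decode_bool and its parameters are hypothetical), a BoolDomain with custom TRUE/FALSE strings could be interpreted like this:

```python
def decode_bool(raw, true_value="True", false_value="False"):
    """Interpret a raw BYTES/INT value under a BoolDomain.

    Accepts the domain's configured TRUE/FALSE strings, or the
    integer encoding 0=false, 1=true.
    """
    if isinstance(raw, int):
        if raw in (0, 1):
            return bool(raw)
        raise ValueError(f"integer {raw} is not a valid boolean encoding")
    if raw == true_value:
        return True
    if raw == false_value:
        return False
    raise ValueError(f"{raw!r} matches neither the TRUE nor FALSE string")
```

Note that because FeatureType may be INT or BYTES, both branches are needed.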
Statistics for a bytes feature in a dataset.
Used in:
The number of unique values
The average number of bytes in a value
The minimum number of bytes in a value
The maximum number of bytes in a value
The maximum number of bytes in a value, as an int. Float will start having a loss of precision for a large enough integer. This field preserves the precision.
categorical_accuracy(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/categorical_accuracy
Used in:
(message has no fields)
categorical_crossentropy(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/categorical_crossentropy
Used in:
(message has no fields)
Used in:
Describes a chunk that represents changes in both artifacts over the same number of lines.
Used in:
Changed region in the left artifact, in terms of starting line number and contents.
Ditto for the right artifact.
Common statistics for all feature types. Statistics counting number of values (i.e., min_num_values, max_num_values, avg_num_values, and tot_num_values) include NaNs. For nested features with N nested levels (N > 1), the statistics counting number of values will rely on the innermost level.
Used in:
The number of examples that include this feature. Note that this includes examples that contain this feature with an explicitly empty list of values, which may be permitted for variable length features.
The number of examples missing this feature.
The minimum number of values in a single example for this feature.
The maximum number of values in a single example for this feature.
The average number of values in a single example for this feature. avg_num_values = tot_num_values / num_non_missing.
The total number of values in this feature.
The quantiles histogram for the number of values in this feature.
The histogram for the number of features in the feature list (only set if this feature is a non-context feature from a tf.SequenceExample). This is different from num_values_histogram, as num_values_histogram tracks the count of all values for a feature in an example, whereas this tracks the length of the feature list for this feature in an example (where each feature list can contain multiple values).
Contains presence and valency stats for each nest level of the feature. The first item corresponds to the outermost level, and by definition, the stats it contains equal the corresponding stats defined above. May not be populated if the feature is of nest level 1.
If not empty, it's parallel to presence_and_valency_stats.
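The counting statistics above, including the relation avg_num_values = tot_num_values / num_non_missing, can be sketched in plain Python, with None standing in for a missing feature (the dict keys mirror the field comments; this is not the generated proto API):

```python
def common_stats(value_lists):
    """Compute the counting statistics for a feature from a list of
    per-example value lists (None means the feature is missing)."""
    present = [v for v in value_lists if v is not None]
    num_non_missing = len(present)
    tot_num_values = sum(len(v) for v in present)
    return {
        "num_non_missing": num_non_missing,
        "num_missing": len(value_lists) - num_non_missing,
        # An explicitly empty list counts as present, with zero values.
        "min_num_values": min((len(v) for v in present), default=0),
        "max_num_values": max((len(v) for v in present), default=0),
        "tot_num_values": tot_num_values,
        "avg_num_values": (tot_num_values / num_non_missing
                           if num_non_missing else 0.0),
    }
```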
ContentChunk data.
Used in:
(message has no fields)
cosine(...) cosine_proximity(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/cosine_proximity DEPRECATED
Used in:
(message has no fields)
NextID: 8
Used in:
The path of feature x.
The path of feature y.
Number of occurrences of this feature cross in the data. If any of the features in the cross is missing, the example is ignored.
A custom metric. Prefer using or adding an explicit metric message and only use this generic message as a last resort. NEXT_TAG: 4
Used in:
The display name of a metric computed by the model. The name should match ^[a-zA-Z0-9\s]{1,25}$ and must be unique across all performance metrics. Trailing and leading spaces will be truncated before matching.
True if the metric is maximized; false if it is minimized. Must be specified if the CustomMetric is used as an objective.
Specification of the metric in the binary’s metric registry.
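The display-name constraint can be checked with Python's re module; stripping before matching reflects the note that trailing and leading spaces are truncated (the helper name is hypothetical):

```python
import re

# The pattern stated in the name field's documentation.
_METRIC_NAME_RE = re.compile(r"^[a-zA-Z0-9\s]{1,25}$")

def is_valid_metric_name(name: str) -> bool:
    r"""Check a display name against ^[a-zA-Z0-9\s]{1,25}$,
    stripping leading/trailing spaces first."""
    return bool(_METRIC_NAME_RE.match(name.strip()))
```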
RegistrySpec is a full specification of the custom metric and its construction based on the binary’s metric registry. New custom metrics must be linked to the binary and registered in its metric registry to be identifiable via this specification.
Used in:
Identifier of the metric class in the metric registry of the binary.
Generic proto describing the configuration for the metric to be computed. It is up to the implementer of the metric to parse this configuration.
Stores the name and value of any custom statistic. The value can be a string, double, or histogram.
Used in:
Constraints on the entire dataset.
Used in:
Tests differences in number of examples between the current data and the previous span.
Tests comparisons in number of examples between the current data and the previous version of that data.
Minimum number of examples in the dataset.
Maximum number of examples in the dataset.
The feature statistics for a single dataset.
Used in:
The name of the dataset.
The number of examples in the dataset.
Only valid if the weight feature was specified. Treats a missing weighted feature as zero.
The feature statistics for the dataset.
Cross feature statistics for the dataset.
A list of feature statistics for different datasets. If you wish to compare different datasets using this list, then the DatasetFeatureStatistics entries should all contain the same list of features.
Stores configuration for a variety of canned feature derivers. TODO(b/227478330): Consider validating config in merge_util.cc.
Used in:
DerivedFeatureSource tracks information about the source of a derived feature. Derived features are computed from ordinary features for the purposes of statistics collection and validation, but do not exist in the dataset. Experimental and subject to change.
Used in:
The name of the deriver that generated this feature.
An optional description of the transformation.
The constituent features that went into generating this derived feature.
A DerivedFeatureSource that is declaratively configured represents an intent for downstream processing to generate a derived feature (in the schema), or tracks that a feature was generated from such a configuration (in statistics).
Optional configuration for canned derivers.
Describes a region in the comparison between two text artifacts. Note that a region also contains the contents of the two artifacts that correspond to the region.
Used in:
Details for the chunk.
An unchanged region of lines.
A region of lines removed from the left.
A region of lines added to the right.
A region of lines that are different in the two artifacts.
An unchanged region of lines whose contents are just hidden.
Models constraints on the distribution of a feature's values. TODO(martinz): replace min_domain_mass with max_off_domain (but slowly).
Used in:
The minimum fraction (in [0,1]) of values across all examples that should come from the feature's domain, e.g.:
  1.0 => All values must come from the domain.
  0.9 => At least 90% of the values must come from the domain.
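A sketch of how min_domain_mass could be checked (hypothetical helpers, not TFDV's implementation):

```python
def domain_mass(values, domain):
    """Fraction of values, across all examples, that fall in the domain."""
    values = list(values)
    if not values:
        return 1.0  # vacuously satisfied when there are no values
    domain = set(domain)
    return sum(v in domain for v in values) / len(values)

def satisfies_min_domain_mass(values, domain, min_domain_mass):
    """True if at least min_domain_mass of the values are in-domain."""
    return domain_mass(values, domain) >= min_domain_mass
```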
Message to contain the result of the drift/skew measurements for a feature.
Used in:
Identifies the feature;
The drift/skew may be measured in the same invocation of TFDV, in which case both of the following fields are populated. Also the drift/skew may be quantified by different measurements, thus repeated.
Used in:
Type of the measurement.
Value of the measurement.
Threshold used to determine whether the measurement results in an anomaly.
Used in:
Specifies a dynamic multiclass/multi-label problem where the number of label classes is inferred from the data.
Used in:
Optional. If specified, an Out-Of-Vocabulary (OOV) class is created and populated based on frequencies in the training set. If no OOV class is specified, the model's label vocabulary should consist of all labels that appear in the training set.
Note: it is up to a solution provider to implement support for OOV labels. Note: both a frequency_threshold and a top_k may be set. A class is grouped into the OOV class if it fails to meet either of the criteria below.
Used in:
If set, labels are grouped into the "OOV" class if they occur less than frequency_threshold times in the training dataset. If 0, labels that appear in test / validation splits but not in training would be still classified as the "OOV" class.
If set, only the top_k labels in the training set are used and all others are grouped into an "OOV" class.
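A sketch of the grouping rule above, where a label lands in the OOV class if it fails either criterion that is set (the function name is hypothetical):

```python
from collections import Counter

def group_oov(train_labels, frequency_threshold=None, top_k=None):
    """Map each training label to itself or to "OOV".

    A label is grouped into OOV if it fails either the
    frequency_threshold or the top_k criterion (when set).
    """
    counts = Counter(train_labels)
    kept = set(counts)
    if frequency_threshold is not None:
        kept = {lbl for lbl in kept if counts[lbl] >= frequency_threshold}
    if top_k is not None:
        top = {lbl for lbl, _ in counts.most_common(top_k)}
        kept &= top
    return {lbl: (lbl if lbl in kept else "OOV") for lbl in counts}
```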
Threshold to apply to a prediction to determine positive vs negative. Note: if the model is calibrated, the threshold can be thought of as a probability so the threshold has a stable, intuitive semantic. However, not all solutions may be calibrated, and not all computations of the metric may operate on a calibrated score. In AutoTFX, the final model metrics are computed on a calibrated score, but the metrics computed within the model selection process are uncalibrated. Be aware of this possible skew in the metrics between model selection and final model evaluation.
Threshold to apply to a prediction to determine positive vs negative. Note: if the model is calibrated, the threshold can be thought of as a probability so the threshold has a stable, intuitive semantic. However, not all solutions may be calibrated, and not all computations of the metric may operate on a calibrated score. In AutoTFX, the final model metrics are computed on a calibrated score, but the metrics computed within the model selection process are uncalibrated. Be aware of this possible skew in the metrics between model selection and final model evaluation.
Describes schema-level information about a specific feature. NextID: 39
Used in:
The name of the feature.
required
This field is no longer supported. Instead, use: lifecycle_stage: DEPRECATED TODO(b/111450258): remove this.
Constraints on the presence of this feature in the examples.
Only used in the context of a "group" context, e.g., inside a sequence.
The shape of the feature which governs the number of values that appear in each example.
The feature has a fixed shape corresponding to a multi-dimensional tensor.
The feature doesn't have a well-defined shape. All we know are limits on the minimum and maximum number of values.
Captures the same information as value_count but for features with nested values. A ValueCount is provided for each nest level.
Physical type of the feature's values. Note that you can have: type: BYTES int_domain: { min: 0 max: 3 } This would be a field that is syntactically BYTES (i.e. strings), but semantically an int, i.e. it would be "0", "1", "2", or "3".
Domain for the values of the feature.
Reference to a domain defined at the schema level. NOTE THAT TFDV ONLY SUPPORTS STRING DOMAINS AT THE TOP LEVEL. TODO(b/63664182): Support this.
Inline definitions of domains.
Supported semantic domains.
Constraints on the distribution of the feature values. Only supported for StringDomains.
Additional information about the feature for documentation purpose.
Tests comparing the distribution to the associated serving data.
Tests comparing the distribution between two consecutive spans (e.g. days).
List of environments this feature is present in. Should be disjoint from not_in_environment. This feature is in environment "foo" if: ("foo" is in in_environment or default_environment) AND "foo" is not in not_in_environment. See Schema::default_environment.
List of environments this feature is not present in. Should be disjoint from of in_environment. See Schema::default_environment and in_environment.
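A literal reading of the membership rule above, as a plain-Python sketch (the function and parameter names are hypothetical):

```python
def feature_in_environment(env, in_environment, not_in_environment,
                           default_environment):
    """A feature is in environment `env` iff
    (`env` is in in_environment or default_environment)
    AND `env` is not in not_in_environment."""
    return (env in set(in_environment) | set(default_environment)
            and env not in set(not_in_environment))
```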
The lifecycle stage of a feature. It can also apply to its descendants. i.e., if a struct is DEPRECATED, its children are implicitly deprecated.
Constraints on the number of unique values for a given feature. This is supported for string and categorical features only.
If set, indicates that this feature is derived, and stores metadata about its source. If this field is set, this feature should have a disabled stage (PLANNED, ALPHA, DEPRECATED, DISABLED, DEBUG_ONLY), or lifecycle_stage VALIDATION_DERIVED. Experimental and subject to change.
This field specifies if this feature could be treated as a sequence feature which has meaningful element order.
Used in:
Encodes vocabulary coverage constraints.
Used in:
Fraction of feature values that map to a vocab entry (i.e. are not oov).
Average length of tokens. Used for cases such as wordpiece that fallback to character-level tokenization.
String tokens to exclude when calculating min_coverage and min_avg_token_length. Useful for tokens such as [PAD].
Integer tokens to exclude when calculating min_coverage and min_avg_token_length.
String tokens to treat as oov tokens (e.g. [UNK]). These tokens are also excluded when calculating avg token length.
The complete set of statistics for a given feature name for a dataset. NextID: 11
Used in:
One can identify a field either by the name (for simple fields), or by a path (for structured fields). Note that: name: "foo" is equivalent to: path: {step:"foo"} Note: this oneof must be consistently either name or path across all FeatureNameStatistics in one DatasetFeatureStatistics.
The feature name
The path of the feature.
The data type of the feature
The statistics of the values of the feature.
Any custom statistics can be stored in this list.
If set, indicates that this feature is derived for validation, and stores metadata about its source. Experimental and subject to change.
The types supported by the feature statistics. When aggregating tf.Examples, if the bytelist contains a string, it is recommended to encode it here as STRING instead of BYTES in order to calculate string-specific statistical measures.
Used in:
Describes constraints on the presence of the feature in the data.
Used in:
Minimum fraction of examples that have this feature.
Minimum number of examples that have this feature.
Records constraints on the presence of a feature inside a "group" context (e.g., .presence inside a group of features that define a sequence).
Used in:
Describes the physical representation of a feature. It may be different than the logical representation, which is represented as a Domain.
Used in:
Specifies a fixed shape for the feature's values. The immediate implication is that each feature has a fixed number of values. Moreover, these values can be parsed in a multi-dimensional tensor using the specified axis sizes. The FixedShape defines a lexicographical ordering of the data. For instance, if there is a FixedShape { dim {size:3} dim {size:2} }, then:
  tensor[0][0] = field[0]
  tensor[0][1] = field[1]
  tensor[1][0] = field[2]
  tensor[1][1] = field[3]
  tensor[2][0] = field[4]
  tensor[2][1] = field[5]
The FixedShape message is identical to the tensorflow.TensorShape proto message for fully defined shapes. The FixedShape message cannot represent unknown dimensions or an unknown rank.
Used in:
The dimensions that define the shape. The total number of values in each example is the product of sizes of each dimension.
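The row-major (lexicographical) ordering above can be sketched in plain Python (to_tensor is a hypothetical helper, not part of the generated proto API):

```python
def to_tensor(field, dims):
    """Reshape a flat list of values into nested lists following the
    lexicographical ordering a FixedShape defines."""
    expected = 1
    for d in dims:
        expected *= d
    if len(field) != expected:
        raise ValueError(f"expected {expected} values, got {len(field)}")
    def build(values, dims):
        if not dims:
            return values[0]
        step = len(values) // dims[0]
        return [build(values[i * step:(i + 1) * step], dims[1:])
                for i in range(dims[0])]
    return build(list(field), list(dims))
```

For FixedShape { dim {size:3} dim {size:2} }, this reproduces the tensor[i][j] = field[k] mapping shown above.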
An axis in a multi-dimensional feature representation.
Used in:
Optional name of the tensor dimension.
Encodes information for domains of float values. Note that FeatureType could be either INT or BYTES.
Used in:
Id of the domain. Required if the domain is defined at the schema level. If so, then the name must be unique within the schema.
Min and max values of the domain.
If true, feature should not contain NaNs.
If true, feature should not contain Inf or -Inf.
If True, this indicates that the feature is semantically an embedding. This can be useful for distinguishing fixed dimensional numeric features that should be fed to a model unmodified.
If true then the domain encodes categorical values (i.e., ids) rather than continuous values.
This field specifies the embedding dimension and is only applicable if is_embedding is true. It is useful for use cases such as restoring shapes for flattened sequence of embeddings.
Specifies the semantic type of the embedding e.g. sbv4_semantic or pulsar.
A chunk that represents identical lines, whose contents are hidden.
Used in:
Starting lines in the two artifacts.
Size of the region in terms of lines.
Linear Hinge Loss hinge(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/hinge DEPRECATED
Used in:
(message has no fields)
The data used to create a histogram of a numeric feature for a dataset.
Used in:
The number of NaN values in the dataset.
The number of undefined values in the dataset.
A list of buckets in the histogram, sorted from lowest bucket to highest bucket.
The type of the histogram.
An optional descriptive name of the histogram, to be used for labeling.
Each bucket defines its low and high values along with its count. The low and high values must be a real number or positive or negative infinity. They cannot be NaN or undefined. Counts of those special values can be found in the numNaN and numUndefined fields.
Used in:
The low value of the bucket, exclusive except for the first bucket.
The high value of the bucket, inclusive.
The number of items in the bucket. Stored as a double to be able to handle weighted histograms.
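The boundary semantics above (low exclusive except for the first bucket, high inclusive) can be sketched as follows (hypothetical helper; buckets given as (low, high) pairs):

```python
def find_bucket(value, buckets):
    """Return the index of the bucket containing value.

    Each bucket is a (low, high) pair; low is exclusive except for the
    first bucket, and high is inclusive.
    """
    for i, (low, high) in enumerate(buckets):
        above_low = value > low or (i == 0 and value >= low)
        if above_low and value <= high:
            return i
    return None  # value falls outside every bucket
```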
The type of the histogram. A standard histogram has equal-width buckets. The quantiles type is used for when the histogram message is used to store quantile information (by using approximately equal-count buckets with variable widths).
Used in:
Used in:
Type controls the source of the histogram used for numeric drift and skew calculations. Currently the default is STANDARD. Calculations based on QUANTILES are more robust to outliers.
Used in:
Image data.
Used in:
If set, at least this fraction of values should be TensorFlow supported images.
If set, image should have less than this value of undecoded byte size.
Used in:
Checks that the L-infinity norm is below a certain threshold between the two discrete distributions. Since this is applied to a FeatureNameStatistics, it only considers the top k. L_infty(p,q) = max_i |p_i-q_i|
Used in:
The InfinityNorm is in the interval [0.0, 1.0] so sensible bounds should be in the interval [0.0, 1.0).
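The formula L_infty(p,q) = max_i |p_i - q_i| maps directly to code over dict-encoded discrete distributions (hypothetical helper; TFDV applies it to the top-k values only, as noted above):

```python
def l_infinity_distance(p, q):
    """max_i |p_i - q_i| over the union of the two distributions'
    support; a value missing from one distribution counts as 0."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```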
Encodes information for domains of integer values. Note that FeatureType could be either INT or BYTES.
Used in:
Id of the domain. Required if the domain is defined at the schema level. If so, then the name must be unique within the schema.
Min and max values for the domain.
If true then the domain encodes categorical values (i.e., ids) rather than ordinal values.
Checks that the approximate Jensen-Shannon Divergence is below a certain threshold between the two distributions.
Used in:
The JensenShannonDivergence will be in the interval [0.0, 1.0] so sensible bounds should be in the interval [0.0, 1.0).
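A sketch of the divergence for discrete distributions, using log base 2 so the result stays in [0.0, 1.0] as noted above (function name hypothetical; TFDV's actual computation is approximate and histogram-based):

```python
import math

def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5*KL(p||m) + 0.5*KL(q||m) with m = (p+q)/2,
    computed with log base 2. p and q map values to probabilities."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a, b):
        # Terms with a_k == 0 contribute 0; b_k > 0 whenever a_k > 0
        # because b is the midpoint distribution.
        return sum(a[k] * math.log2(a[k] / b[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```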
kld(...) kullback_leibler_divergence(...) KLD(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/kullback_leibler_divergence DEPRECATED
Used in:
(message has no fields)
LifecycleStage. Only UNKNOWN_STAGE, BETA, PRODUCTION, and VALIDATION_DERIVED features are actually validated. PLANNED, ALPHA, DISABLED, and DEBUG are treated as DEPRECATED.
Used in:
Unknown stage.
Planned feature, may not be created yet.
Prototype feature, not used in experiments yet.
Used in user-facing experiments.
Used in a significant fraction of user traffic.
No longer supported: do not use in new models.
Only exists for debugging purposes.
Generic indication that feature is disabled / excluded from models, regardless of specific reason.
Indicates that this feature was derived from ordinary features for the purposes of statistics generation or validation. Consumers should expect that this feature may be present in DatasetFeatureStatistics, but not in input data. Experimental and subject to change.
Container for lift information for a specific y-value.
Used in:
The particular value of path_y corresponding to this LiftSeries. Each element in lift_values corresponds to the lift of a different x_value and this specific y_value.
The number of examples in which y_value appears.
The lifts for each path_x value and this y_value.
A bucket for referring to binned numeric features.
Used in:
The low value of the bucket, inclusive.
The high value of the bucket, exclusive (unless the high_value is positive infinity).
A container for lift information about a specific value of path_x.
Used in:
P(path_y=y|path_x=x) / P(path_y=y) for x_value and the enclosing y_value. In terms of concrete fields, this number represents: (x_and_y_count / x_count) / (y_count / num_examples)
The number of examples in which x_value appears.
The number of examples in which x_value appears and y_value appears.
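The lift formula above maps directly to code (hypothetical helper, mirroring the field names in this message):

```python
def lift(x_and_y_count, x_count, y_count, num_examples):
    """P(path_y=y | path_x=x) / P(path_y=y)
    = (x_and_y_count / x_count) / (y_count / num_examples)."""
    return (x_and_y_count / x_count) / (y_count / num_examples)
```

A lift above 1 means y is more likely when x is present than in the dataset overall.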
Used in:
Lift information for each value of path_y. Lift is defined for each pair of values (x,y) as P(path_y=y|path_x=x)/P(path_y=y).
Weighted lift information for each value of path_y. Weighted lift is defined for each pair of values (x,y) as P(path_y=y|path_x=x)/P(path_y=y) where probabilities are computed over weighted example space.
AKA the negative log likelihood or log loss. Given a label y ∈ {0,1} and a predicted probability p ∈ [0,1]: -y·ln(p) - (1-y)·ln(1-p). TODO(martinz): if this is interpreted the same as binary_cross_entropy, we may need to revisit the semantics. DEPRECATED
Used in:
(message has no fields)
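The log loss formula above as a sketch (the clipping epsilon is a common numerical-stability convention, not something this message specifies):

```python
import math

def log_loss(y, p, eps=1e-15):
    """-y*ln(p) - (1-y)*ln(1-p), with p clipped away from 0 and 1
    so the logarithms stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)
```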
Knowledge graph ID, see: https://www.wikidata.org/wiki/Property:P646
Used in:
(message has no fields)
https://www.tensorflow.org/responsible_ai/model_remediation/api_docs/python/model_remediation/min_diff/losses/MMDLoss
Kernel to apply to the predictions. Currently supported values are 'gaussian' and 'laplace'. Defaults to 'gaussian'.
MAE(...) mae(...) mean_absolute_error(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/mean_absolute_error
Used in:
(message has no fields)
MAPE(...) mape(...) mean_absolute_percentage_error(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/mean_absolute_percentage_error
Used in:
(message has no fields)
Used in:
(message has no fields)
MSE(...) mse(...) mean_squared_error(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/mean_squared_error
Used in:
(message has no fields)
msle(...) MSLE(...) mean_squared_logarithmic_error(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/mean_squared_logarithmic_error
Used in:
(message has no fields)
The high-level objectives described by this problem statement. These objectives provide a basis for ranking models and can be optimized by a meta optimizer (e.g. a grid search over hyperparameters). A solution provider may also directly use the meta optimization targets to heuristically select losses, etc. without any meta-optimization process. If not specified, the high-level meta optimization target is inferred from the task. These objectives do not need to be differentiable, as the solution provider may use a proxy function to optimize model weights. Target definitions include tasks, metrics, and any weighted combination of them.
Used in:
The name of a task in this problem statement producing the prediction or classification for the metric.
The performance metric to be evaluated. The prediction or classification is based upon the task. The label is from the type of the task, or from the override_task.
Describes how to combine with other objectives.
If a model spec has multiple meta optimization targets, the weight of each can be specified. The final objective is then a weighted combination of the multiple objectives. If not specified, value is 1.
Secondary meta optimization targets can be thresholded, meaning that the optimization process prefers solutions above (or below) the threshold, but need not prefer solutions higher (or lower) on the metric if the threshold is met.
Configuration for thresholded meta-optimization targets.
Used in:
If specified, indicates a threshold that the user wishes the metric to stay under (for MINIMIZE type), or above (for MAXIMIZE type). The optimization process need not prefer models that are higher (or lower) on the thresholded metric so long as the threshold is respected. E.g., if `threshold` for a MAXIMIZE type metric X is .9, the optimization process will prefer a solution with X = .92 over a solution with X = .88, but may not prefer a solution with X = .95 over a solution with X = .92. Unless otherwise specified by the PerformanceMetric, threshold is best effort. It does not provide a hard guarantee about the properties of the final model, but rather serves as a "target" to guide the optimization process. The user is responsible for validating that final model metrics are in an acceptable range for the application. A problem statement may, however, be rejected if the specified target is impossible to achieve. Keep this in mind if running the optimization on a recurring basis, as shifts in the data could push a previously achievable target to being unachievable (and thus yield no solution). The units and range for the threshold will be the same as the valid output range of the associated performance_metric.
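The preference rule described above (strict preference below the threshold, no preference once both candidates meet it) can be sketched in Python. This is a hypothetical illustration of the semantics, not part of any TFMD/TFX API.

```python
def prefers(metric_a, metric_b, threshold, maximize=True):
    """Return True if solution A is preferred over solution B under a
    thresholded meta-optimization target.

    For a MAXIMIZE-type metric, higher is strictly better below the
    threshold; once both solutions meet the threshold, this target
    expresses no preference. MINIMIZE is handled by sign flipping.
    """
    if not maximize:
        metric_a, metric_b, threshold = -metric_a, -metric_b, -threshold
    if metric_a >= threshold and metric_b >= threshold:
        return False  # both meet the threshold: no preference from this target
    return metric_a > metric_b
```

With threshold 0.9 on a MAXIMIZE metric, 0.92 is preferred over 0.88, but 0.95 is not preferred over 0.92, matching the example in the description.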
Metric type indicates which direction of a real-valued metric is "better". For most message types, this is invariant. For custom message types, is_maximized == true is like MAXIMIZE, and otherwise MINIMIZE.
Maximize the metric (i.e. a utility).
Minimize the metric (i.e. a loss).
Look for a field is_maximized.
Area under the ROC curve, calculated globally for MultiClassClassification (model predicts a single label) or MultiLabelClassification (model predicts class probabilities). The area is calculated by treating the entire set of data as an aggregate result, and computing a single metric rather than k metrics (one for each target label) that get averaged together. For example, the FPR and TPR at a given point on the AUC curve for k target labels are: FPR = (FP1 + FP2 + ... + FPk) / ((FP1 + FP2 + ... + FPk) + (TN1 + TN2 + ... + TNk)) TPR = (TP1 + TP2 + ... + TPk) / ((TP1 + TP2 + ... + TPk) + (FN1 + FN2 + ... + FNk))
Used in:
(message has no fields)
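The micro-averaged aggregation above can be sketched as follows. This is an illustrative helper (the per-class confusion counts and their dict keys are assumptions for the example), not library code.

```python
def micro_fpr_tpr(per_class_counts):
    """Compute one (FPR, TPR) operating point by summing confusion
    counts (tp, fp, tn, fn) over all k target labels, then computing a
    single metric from the aggregate, as described above."""
    tp = sum(c["tp"] for c in per_class_counts)
    fp = sum(c["fp"] for c in per_class_counts)
    tn = sum(c["tn"] for c in per_class_counts)
    fn = sum(c["fn"] for c in per_class_counts)
    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn)
    return fpr, tpr
```

Note this differs from macro averaging, which would compute k per-class rates and average them.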
Configuration for a multi-class classification task. In this problem type, there are n_classes possible label values, and the model predicts a single label. The output is one of the class labels, out of n_classes possible classes. The output type will correspond to the label column type.
Used in:
The label column. There's only a single label per example. If the label column is a BoolDomain, use the BinaryClassification Type instead.
The name of the label. Assumes the label is a flat, top-level field.
A path can be used instead of a flat string if the label is nested.
The weight column.
The exact number of label classes.
The number of label classes that should be inferred dynamically from the data.
A multi-dimensional regression task. Similar to OneDimensionalRegression, MultiDimensionalRegression predicts continuous real numbers. However, instead of predicting a single scalar value per example, we predict a fixed-dimensional vector of values. By default the range is any float, -inf to inf, but specific sub-types (e.g. probability) define narrower ranges.
The label column.
oneof label_id is required.
The name of the label. Assumes the label is a flat, top-level field.
A path can be used instead of a flat string if the label is nested.
(optional) The weight column.
When set means the label is a probability in range [0..1].
Defines a regression problem where labels are in [0, 1] and represent a probability (e.g: probability of click).
Used in:
By default, MultiDimensionalRegression assumes that each value in the predicted vector is independent. If predictions_sum_to_1 is true, this indicates that the vector of values represent mutually exclusive rather than independent probabilities (for example, the probabilities of classes in a multi-class scenario). When this is set to true, we use softmax instead of sigmoid in the loss function.
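The sigmoid-vs-softmax distinction above can be made concrete with a small sketch. These helpers are illustrative only (names and the use of logits as inputs are assumptions), showing why independent probabilities get per-dimension sigmoid cross entropy while mutually exclusive ones get a single softmax cross entropy.

```python
import math

def independent_loss(labels, logits):
    """Sum of per-dimension sigmoid cross entropies: each predicted
    value is treated as an independent probability in [0, 1]."""
    total = 0.0
    for y, z in zip(labels, logits):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total

def mutually_exclusive_loss(labels, logits):
    """Softmax cross entropy: the predicted vector is one distribution
    that sums to 1 (the predictions_sum_to_1 = true case)."""
    m = max(logits)  # shift for numerical stability
    log_denom = math.log(sum(math.exp(z - m) for z in logits))
    return -sum(y * ((z - m) - log_denom) for y, z in zip(labels, logits))
```

With uniform logits over two classes, the softmax loss for a one-hot label is ln(2), while the independent loss pays ln(2) on every dimension.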
Configuration for a multi-label classification task. In this problem type there are n_classes unique possible label values overall. There can be from zero up to n_classes unique labels per example. The output is a vector of real numbers: the class probability associated with each class. It will have dimension n_classes for each example if n_classes is specified; otherwise, the dimension will be set to the number of unique class labels that are dynamically inferred from the data based on dynamic_class_spec.
Used in:
The label column. There can be one or more labels per example.
The name of the label. Assumes the label is a flat, top-level field.
A path can be used instead of a flat string if the label is nested.
The weight column.
The exact number of unique class labels.
The maximal number of label classes that should be inferred dynamically from the data.
Cross entropy for MultiLabelClassification where each target and prediction is the probability of belonging to that class, independent of other classes.
Used in:
(message has no fields)
Natural language text.
Used in:
Name of the vocabulary associated with the NaturalLanguageDomain. When computing and validating stats using TFDV, tfdv.StatsOptions.vocab_paths should map this name to a vocabulary file.
Statistics for a feature containing a NL domain.
Fraction of feature input tokens considered in-vocab.
Average token length of tokens used by the feature.
Histogram containing the distribution of token lengths.
Min / max sequence lengths.
Histogram containing the distribution of sequence lengths.
Number of sequences which do not match the location constraint.
Reported sequences that are sampled from the input and have small avg_token_length, low feature coverage, or do not match the location regex.
Statistics for specified tokens. TokenStatistics are only reported for tokens specified in SequenceValueConstraints in the schema.
The rank histogram for the tokens of the feature. The rank is a measure of how commonly the token is found in the dataset. The most common token would have a rank of 1, with the second-most common token having a rank of 2, and so on.
Used in:
Token for which the statistics are reported.
The number of times the value occurs. Stored as a double to be able to handle weighted features.
Fraction of sequences containing the token.
Min number of token occurrences within a sequence.
Average number of token occurrences within a sequence.
Maximum number of token occurrences within a sequence.
Token positions within a sequence, normalized by sequence length (e.g. a token at position 0.5 occurs in the middle of a sequence).
Checks that the absolute count difference relative to the total count of both datasets is small. This metric is appropriate for comparing datasets that are expected to have similar absolute counts, and not necessarily just similar distributions. Computed as max_i | x_i - y_i | / sum_i(x_i + y_i) for aligned datasets x and y. Results will be in the interval [0.0, 1.0] so sensible bounds should be in the interval [0.0, 1.0).
Used in:
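The comparison formula above is simple enough to sketch directly. This is an illustrative helper, not TFDV code; it assumes the two count vectors are already aligned index-by-index.

```python
def normalized_abs_count_diff(x, y):
    """max_i |x_i - y_i| / sum_i (x_i + y_i) for aligned count vectors
    x and y. The result lies in [0.0, 1.0]."""
    denom = sum(xi + yi for xi, yi in zip(x, y))
    return max(abs(xi - yi) for xi, yi in zip(x, y)) / denom
```

Because the denominator is the total count of both datasets, two datasets with similar distributions but very different absolute sizes will still score high, which is the point of this metric.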
Used in:
Pearson product-moment correlation coefficient.
Standard covariance. E[(X-E[X])*(Y-E[Y])]
Statistics for a numeric feature in a dataset.
Used in:
The mean of the values
The standard deviation of the values
The number of values that equal 0
The minimum value
The median value
The maximum value
The histogram(s) of the feature values.
Weighted statistics for the feature, if the values have weights.
Checks that the ratio of the current value to the previous value is not below the min_fraction_threshold or above the max_fraction_threshold. That is, previous value * min_fraction_threshold <= current value <= previous value * max_fraction_threshold. To specify that the value cannot change, set both min_fraction_threshold and max_fraction_threshold to 1.0.
Used in:
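The ratio check above amounts to one chained comparison. A minimal sketch (hypothetical helper name, not a TFDV API):

```python
def num_examples_ratio_ok(previous, current, min_fraction, max_fraction):
    """True iff previous * min_fraction <= current <= previous * max_fraction,
    i.e. the current value's ratio to the previous value stays in bounds."""
    return previous * min_fraction <= current <= previous * max_fraction
```

Setting both fractions to 1.0 enforces that the value cannot change at all, as noted above.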
A one-dimensional regression task. The output is a single real number, whose range is dependent upon the objective.
Used in:
The label column.
oneof label_id is required.
The name of the label. Assumes the label is a flat, top-level field.
A path can be used instead of a flat string if the label is nested.
(optional) The weight column.
When set means the label is a probability in range [0..1].
When set the label corresponds to counts from a poisson distribution. Eg: Number of googlers contributing to memegen each year.
Defines a regression problem where the labels are counts i.e. integers >=0.
Used in:
(message has no fields)
Defines a regression problem where labels are in [0, 1] and represent a probability (e.g: probability of click).
Used in:
(message has no fields)
Describes a chunk that applies to only one of the two artifacts.
Used in:
Starting line.
Contents.
A path is a more general substitute for the name of a field or feature that can be used for flat examples as well as structured data. For example, if we had data in a protocol buffer: message Person { optional int32 age = 1; optional string gender = 2; repeated Person parent = 3; } Here the path {step:["parent", "age"]} in statistics would refer to the age of a parent, and {step:["parent", "parent", "age"]} would refer to the age of a grandparent. This allows us to distinguish between the statistics of parents' ages and grandparents' ages. In general, repeated messages are to be preferred to linked lists of arbitrary length. For SequenceExample, if we have a feature list "foo", this is represented by {step:["##SEQUENCE##", "foo"]}.
Used in:
Any string is a valid step. However, whenever possible have a step be [A-Za-z0-9_]+.
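The parent/grandparent example above can be mimicked on plain nested data. This is an illustrative sketch of how a list of steps addresses nested, repeated fields (the `resolve` helper and dict-based data model are assumptions, not TFMD code):

```python
def resolve(path, record):
    """Resolve a path (a list of string steps) against nested dict data.
    Repeated fields fan out: each step maps over list elements, so one
    path can address many values."""
    values = [record]
    for step in path:
        next_values = []
        for v in values:
            child = v.get(step)
            if child is None:
                continue  # field absent at this node
            if isinstance(child, list):
                next_values.extend(child)  # repeated field: fan out
            else:
                next_values.append(child)
        values = next_values
    return values

person = {"age": 10,
          "parent": [{"age": 40, "parent": [{"age": 70}]},
                     {"age": 38}]}
```

Here ["parent", "age"] collects all parents' ages, while ["parent", "parent", "age"] collects grandparents' ages, keeping the two statistics distinct.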
Performance metrics measure the quality of a model. They need not be differentiable.
Used in:
poisson(...) DEPRECATED
Used in:
(message has no fields)
Used in:
(message has no fields)
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/PrecisionAtRecall
Used in:
Minimum required recall, in the interval (0.0, 1.0).
The mean of the prediction across the dataset.
(message has no fields)
Statistics about the presence and valency of feature values. Feature values could be nested lists. A feature in tf.Examples or other "flat" datasets has values of nest level 1 -- they are lists of primitives. A nest level N (N > 1) feature value is a list of lists of nest level (N - 1). This proto can be used to describe the presence and valency of values at each level.
Used in:
Note: missing and non-missing counts are conditioned on the upper level being non-missing (i.e. if the upper level is missing/null, all the levels nested below are by definition missing, but not counted). Number of non-missing (non-null) values.
Number of missing (null) values.
Minimum length of the values (note that nulls are not considered).
Maximum length of the values.
Total number of values.
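The per-level presence/valency idea above can be sketched on plain nested Python lists. This is an illustrative computation (using None for null values), not the TFDV implementation:

```python
def presence_and_valency(column, num_levels):
    """Per-nest-level presence/valency stats for one feature column.

    `column` is a list of nested-list values; None marks a missing
    (null) value. Counts at level N are conditioned on level N-1 being
    present, so nulls do not propagate into deeper levels' counts.
    """
    level_values = [column]  # pseudo-column wrapping for level 1
    stats = []
    for _ in range(num_levels):
        present = [v for col in level_values for v in col if v is not None]
        missing = sum(1 for col in level_values for v in col if v is None)
        lengths = [len(v) for v in present]
        stats.append({
            "num_non_missing": len(present),
            "num_missing": missing,
            "min_length": min(lengths) if lengths else 0,
            "max_length": max(lengths) if lengths else 0,
            "tot_num_values": sum(lengths),
        })
        level_values = present  # descend one nest level
    return stats
```

For a nest-level-2 feature (lists of lists of primitives), level 1 counts the outer lists and level 2 counts the inner lists' primitives.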
Description of the problem statement. For example, should describe how the problem statement was arrived at: what experiments were run, what side-by-sides were considered.
The environment of the ProblemStatement (optional). Specifies an environment string in the SchemaProto.
The target used for meta-optimization. This is used to compare multiple solutions for this problem. For example, if two solutions have different candidates, a tuning tool can use meta_optimization_target to decide which candidate performs the best. A repeated meta-optimization target implies the weighted sum of the meta_optimization targets of any non-thresholded metrics.
Tasks for heads of the generated model. This field is repeated because some models are multi-task models. Each task should have a unique name. If you wish to directly optimize this problem statement, you need to specify the objective in the task.
The data used to create a rank histogram of a non-numeric feature of a dataset. The rank of a value in a feature can be used as a measure of how commonly the value is found in the entire dataset. With bucket sizes of one, this becomes a distribution function of all feature values.
Used in:
A list of buckets in the histogram, sorted from lowest-ranked bucket to highest-ranked bucket.
An optional descriptive name of the histogram, to be used for labeling.
Each bucket defines its start and end ranks along with its count.
Used in:
The low rank of the bucket, inclusive.
The high rank of the bucket, exclusive.
The label for the bucket. Can be used to list or summarize the values in this rank bucket.
The number of items in the bucket. Stored as a double to be able to handle weighted histograms.
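Building such a rank histogram from raw values can be sketched as below. This is an illustrative helper (the dict field names mirror the bucket fields above; labeling a bucket by its first value and 1-based inclusive/exclusive ranks follow this page's description and are assumptions about the exact conventions):

```python
from collections import Counter

def rank_histogram(values, bucket_size=1):
    """Bucket values by frequency rank: the most common value has
    rank 1. low_rank is inclusive, high_rank exclusive; sample_count is
    a float so weighted histograms fit the same shape."""
    ordered = Counter(values).most_common()  # sorted most-frequent first
    buckets = []
    for start in range(0, len(ordered), bucket_size):
        chunk = ordered[start:start + bucket_size]
        buckets.append({
            "low_rank": start + 1,
            "high_rank": start + len(chunk) + 1,
            "label": chunk[0][0],  # summarize bucket by its top value
            "sample_count": float(sum(c for _, c in chunk)),
        })
    return buckets
```

With bucket_size=1 this is exactly the distribution function over all feature values mentioned above.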
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/RecallAtPrecision
Used in:
Minimum required precision, in the interval (0.0, 1.0).
Used in:
Message to represent schema information. NextID: 15
Used in:
Features described in this schema.
Sparse features described in this schema.
Weighted features described in this schema.
String domains referenced in the features.
TOP LEVEL FLOAT AND INT DOMAINS ARE UNSUPPORTED IN TFDV. TODO(b/63664182): Support this. Top-level float domains that can be reused by features.
Top-level int domains that can be reused by features.
Default environments for each feature. An environment represents both a type of location (e.g. a server or phone) and a time (e.g. right before model X is run). In the standard scenario, 99% of the features should be in the default environments TRAINING and SERVING, and the label(s) and weight are only available in TRAINING (not at serving). Other possible variations: 1. There may be TRAINING_MOBILE, SERVING_MOBILE, TRAINING_SERVICE, and SERVING_SERVICE. 2. If one is ensembling three models, where the predictions of the first three models are available for the ensemble model, there may be TRAINING, SERVING_INITIAL, SERVING_ENSEMBLE. See FeatureProto::not_in_environment and FeatureProto::in_environment.
Whether to represent variable-length features as RaggedTensors. By default they are represented as ragged, left-aligned SparseTensors. The RaggedTensor representation is more memory efficient, so turning this on will likely improve data-processing performance. Experimental and may be subject to change.
Additional information about the schema as a whole. Features may also be annotated individually.
Dataset-level constraints. This is currently used for specifying information about changes in num_examples.
TensorRepresentation groups. The keys are the names of the groups. Key "" (empty string) denotes the "default" group, which is what should be used when a group name is not provided. See the documentation at TensorRepresentationGroup for more info. Under development.
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/SensitivityAtSpecificity
Used in:
Minimum required specificity, in the interval (0.0, 1.0).
Encodes constraints on sequence lengths.
Used in:
Token values (int and string) that are excluded when calculating sequence length.
Min / max sequence length.
Used in:
An arbitrary string defining a "group" of features that could be modeled as a single joint sequence. For example, consider a dataset that contains three sequential features "purchase_time", "product_id", "purchase_price". These belong to the same sequence of purchases and could be modeled jointly. Specifying joint_group = "purchase" on all three sequences would communicate that the features can be considered part of a single conceptual sequence.
Specifies the maximum sequence length that should be processed. Sequences may exceed this limit but are expected to be truncated by modeling layers.
This enum specifies whether to treat the feature as a sequence which has meaningful element order.
Used in:
Encodes constraints on specific values in sequences.
Used in:
The value for which to express constraints. Can be either an integer or a string.
Min / max number of times the value can occur in a sequence.
Min / max fraction of sequences that must contain the value.
Used in:
SQL expression used to create a derived feature based on the extracted slice keys. It must return a result of STRUCT type.
Value type of the derived feature. The default type is string.
Indicates whether to drop struct name in the generated output.
Set default feature value when slice query fails. If the slice query fails and no default value is provided, the TFDV statistics generation pipeline will fail.
Used in:
Default type is string
A sparse feature represents a sparse tensor that is encoded with a combination of raw features, namely index features and a value feature. Each index feature defines a list of indices in a different dimension.
Used in:
Name for the sparse feature. This should not clash with other features in the same schema.
required
This field is no longer supported. Instead, use: lifecycle_stage: DEPRECATED TODO(b/111450258): remove this.
The lifecycle_stage determines where a feature is expected to be used, and therefore how important issues with it are.
Constraints on the presence of this feature in examples. Deprecated, this is inferred by the referred features.
Shape of the sparse tensor that this SparseFeature represents. Currently not supported. TODO(b/109669962): Consider deriving this from the referred features.
Features that represent indexes. Should be integers >= 0.
at least one
If true then the index values are already sorted lexicographically.
required
Type of value feature. Deprecated, this is inferred by the referred features.
Used in:
Name of the index-feature. This should be a reference to an existing feature in the schema.
Used in:
Name of the value-feature. This should be a reference to an existing feature in the schema.
sparse_top_k_categorical_accuracy(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/sparse_top_k_categorical_accuracy DEPRECATED
Used in:
(message has no fields)
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/SpecificityAtSensitivity
Used in:
Minimum required sensitivity, in the interval (0.0, 1.0).
squared_hinge(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/squared_hinge DEPRECATED
Used in:
(message has no fields)
Encodes information for domains of string values.
Used in:
Id of the domain. Required if the domain is defined at the schema level. If so, then the name must be unique within the schema.
The values appearing in the domain.
Currently unused. This enum allows the user to specify whether to treat the StringDomain as categorical.
Used in:
Statistics for a string feature in a dataset.
Used in:
The number of unique values
A sorted list of the most-frequent values and their frequencies, with the most-frequent being first.
The average length of the values
The rank histogram for the values of the feature. The rank is a measure of how commonly the value is found in the dataset. The most common value would have a rank of 1, with the second-most common value having a rank of 2, and so on.
Weighted statistics for the feature, if the values have weights.
A vocabulary file, used for vocabularies too large to store in the proto itself. Note that the file may be relative to some context-dependent directory. E.g. in TFX the feature statistics will live in a PPP and vocabulary file names will be relative to this PPP.
Counts the number of invalid UTF-8 strings present in leaf arrays for this feature. Validation is only performed for byte- or string-like features (those having type BYTES or STRING).
Used in:
The number of times the value occurs. Stored as a double to be able to handle weighted features.
Domain for a recursive struct. NOTE: If a feature with a StructDomain is deprecated, then all the child features (features and sparse_features of the StructDomain) are also considered to be deprecated. Similarly child features can only be in environments of the parent feature.
Used in:
Used in:
Describes a single task in a model and all its properties. A task corresponds to a single output of the model. Multiple tasks in the same problem statement correspond to different outputs of the model.
Used in:
Specification of the label and weight columns, and the type of the prediction or classification.
The task name. Tasks within the same ProblemStatement should have unique names. This a REQUIRED field in case of multi-task learning problems.
If a Problem is composed of multiple sub-tasks, the weight of each task determines the importance of solving each sub-task. It is used to rank and select the best solution for multi-task problems. Not meaningful for a problem with one task. If the problem has multiple tasks and all task_weight=0 (unset) then all tasks are weighted equally.
This field includes performance metrics of this head that are important to the problem owner and need to be monitored and reported. However, unlike fields such as "meta_optimization_target", these metrics are not automatically used in meta-optimization.
True to indicate the task is an auxiliary task in a multi-task setting. Auxiliary tasks are of minor relevance for the application and they are added only to improve the performance on a primary task (by providing additional regularization or data augmentation), and thus are not considered in the meta optimization process (but may be utilized in the learner optimization).
A TensorRepresentation captures the intent for converting columns in a dataset to TensorFlow Tensors (or more generally, tf.CompositeTensors). Note that one tf.CompositeTensor may consist of data from multiple columns, for example, a N-dimensional tf.SparseTensor may need N + 1 columns to provide the sparse indices and values. Note that the "column name" that a TensorRepresentation needs is a string, not a Path -- it means that the column name identifies a top-level Feature in the schema (i.e. you cannot specify a Feature nested in a STRUCT Feature).
Used in:
Used in:
Note that the data column might be of a shorter integral type. It's the user's responsibility to make sure the default value fits that type.
uint_value should only be used if the default value can't fit in an int64 (`int_value`).
A tf.Tensor
Used in:
Identifies the column in the dataset that provides the values of this Tensor.
The shape of each row of the data (i.e. does not include the batch dimension)
If this column is missing values in a row, the default_value will be used to fill that row.
A tf.RaggedTensor that models nested lists. Currently there is no way for the user to specify the shape of the leaf value (the innermost value tensor of the RaggedTensor). The leaf value will always be a 1-D tensor.
Used in:
Identifies the leaf feature that provides values of the RaggedTensor, possibly nested under struct-type sub-fields. The first step of the path refers to a top-level feature in the data. The remaining steps refer to STRUCT features under the top-level feature, recursively. If the feature has N outer ragged lists, they will become the first N dimensions of the resulting RaggedTensor and the contents will become the flat_values.
required.
The resulting RaggedTensor would be of shape: [B, D_0, D_1, ..., D_N, P_0, P_1, ..., P_M, U_0, U_1, ..., U_P], where the dimensions belong to different categories: * B: batch size dimension. * D_n: dimensions specified by the nested structure of the value path up to the leaf node, n >= 1. * P_m: dimensions specified by partitions that do not define any fixed dimension size, m >= 0. * U_p: dimensions specified by the trailing partitions of type uniform_row_length, which define the fixed inner shape of the tensor. Iterating the partitions from the end to the beginning, these dimensions are defined by all the contiguous uniform_row_length partitions present, p >= 0.
The data type of the ragged tensor's row partitions. This will default to INT64 if it is not specified.
Further partition of the feature values at the leaf level.
Used in:
If the final element(s) of partition are uniform_row_lengths [U0, U1, ...], then the resulting RaggedTensor will have its flat values (a dense tensor) be of shape [U0, U1, ...]. Otherwise, a uniform_row_length simply means a ragged dimension with row_lengths [uniform_row_length]*nrows.
Identifies a leaf feature that shares the same parent as value_feature_path and contains the partition row lengths.
RaggedTensor consists of RowPartitions. This enum allows the user to specify the dtype of those RowPartitions. If it is UNSPECIFIED, then we default to INT64.
Used in:
A tf.SparseTensor whose indices and values come from separate data columns. This will replace Schema.sparse_feature eventually. The index columns must be of INT type, and all the columns must co-occur and have the same valency at the same row.
Used in:
The dense shape of the resulting SparseTensor (does not include the batch dimension).
The columns constitute the coordinates of the values. indices_column[i][j] contains the coordinate of the i-th dimension of the j-th value.
The column that contains the values.
Specify whether the values are already sorted by their index position.
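The coordinate layout above (indices_column[i][j] is the i-th coordinate of the j-th value) can be sketched for the 2-D case. This is an illustrative materialization helper, not library code; the default fill value is an assumption.

```python
def to_dense_2d(row_idx, col_idx, values, dense_shape, default=0):
    """Materialize a dense 2-D matrix from two COO-style index columns
    and one value column: (row_idx[j], col_idx[j]) is the coordinate of
    values[j]. All three columns must have the same length (the columns
    co-occur with equal valency per row, as required above)."""
    rows, cols = dense_shape
    dense = [[default] * cols for _ in range(rows)]
    for r, c, v in zip(row_idx, col_idx, values):
        dense[r][c] = v
    return dense
```

An N-dimensional SparseTensor generalizes this with N index columns, one per coordinate dimension.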
A ragged tf.SparseTensor that models nested lists.
Used in:
Identifies the column in the dataset that should be converted to the VarLenSparseTensor.
A TensorRepresentationGroup is a collection of TensorRepresentations with names. These names may serve as identifiers when converting the dataset to a collection of Tensors or tf.CompositeTensors. For example, given the following group: { key: "dense_tensor" tensor_representation { dense_tensor { column_name: "univalent_feature" shape { dim { size: 1 } } default_value { float_value: 0 } } } } { key: "varlen_sparse_tensor" tensor_representation { varlen_sparse_tensor { column_name: "multivalent_feature" } } } Then the schema is expected to have feature "univalent_feature" and "multivalent_feature", and when a batch of data is converted to Tensors using this TensorRepresentationGroup, the result may be the following dict: { "dense_tensor": tf.Tensor(...), "varlen_sparse_tensor": tf.SparseTensor(...), }
Used in:
Configuration for a text generation task where the model should predict a sequence of natural language text.
Used in:
(optional) The weight column.
Time or date representation.
Used in:
Expected format that contains a combination of regular characters and special format specifiers. Format specifiers are a subset of the strptime standard.
Expected format of integer times.
Used in:
Number of days since 1970-01-01.
Time of day, without a particular date.
Used in:
Expected format that contains a combination of regular characters and special format specifiers. Format specifiers are a subset of the strptime standard.
Expected format of integer times.
Used in:
Time values, containing hour/minute/second/nanos, encoded into 8-byte bit fields following the ZetaSQL convention:
       6         5         4         3         2         1
MSB 3210987654321098765432109876543210987654321098765432109876543210 LSB
                       | H || M || S ||---------- nanos -----------|
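A sketch of this packing in Python follows. The exact field widths here (30-bit nanos in the low bits, 6-bit second and minute above them, then the hour) are an assumption inferred from the diagram; consult the ZetaSQL civil-time encoding before relying on them.

```python
def pack_time_nanos(hour, minute, second, nanos):
    """Pack a time-of-day into one 64-bit field, low to high:
    nanos (30 bits), second (6 bits), minute (6 bits), hour (assumed
    widths, matching the layout sketched above)."""
    return (hour << 42) | (minute << 36) | (second << 30) | nanos

def unpack_time_nanos(packed):
    """Inverse of pack_time_nanos."""
    return (packed >> 42,
            (packed >> 36) & 0x3F,
            (packed >> 30) & 0x3F,
            packed & 0x3FFFFFFF)
```

The two helpers are exact inverses for in-range values, which is the useful property of a bit-field encoding like this.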
top_k_categorical_accuracy(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/top_k_categorical_accuracy
Used in:
(message has no fields)
Configuration for a top-K classification task. In this problem type, there are n_classes possible label values, and the model predicts n_predicted_labels labels. The output is a sequence of n_predicted_labels labels, out of n_classes possible classes. The order of the predicted output labels is determined by the predictions_order field. (*) MultiClassClassification is the same as TopKClassification with n_predicted_labels = 1. (*) TopKClassification does NOT mean multi-class multi-label classification: e.g., the output contains a sequence of labels, all coming from the same label column in the data.
Used in:
The label column.
The name of the label. Assumes the label is a flat, top-level field.
A path can be used instead of a flat string if the label is nested.
(optional) The weight column.
(optional) The number of label classes. If unset, the solution provider is expected to infer the number of classes from the data.
(optional) The number of class labels to predict. If unset, we assume 1.
Used in:
Predictions are ordered from the most likely to least likely.
Predictions are ordered from the least likely to most likely.
The type of a head or meta-objective. Specifies the label, weight, and output type of the head. TODO(martinz): add logistic regression.
Used in:
A URL, see: https://en.wikipedia.org/wiki/URL
Used in:
(message has no fields)
Describes a chunk that is the same in the two artifacts.
Used in:
The starting lines of the chunk in the two artifacts.
The contents of the chunk. These are the same in both artifacts.
Checks that the number of unique values is greater than or equal to the min, and less than or equal to the max.
Used in:
Limits on maximum and minimum number of values in a single example (when the feature is present). Use this when the minimum value count can be different than the maximum value count. Otherwise prefer FixedShape.
Used in:
Used in:
Video data.
Used in:
(message has no fields)
Common weighted statistics for all feature types. Statistics counting number of values (i.e., avg_num_values and tot_num_values) include NaNs. If the weighted column is missing, then this counts as a weight of 1 for that example. For nested features with N nested levels (N > 1), the statistics counting number of values will rely on the innermost level.
Used in:
Weighted number of examples not missing.
Weighted number of examples missing. Note that if the weighted column is zero, this does not count as missing.
Average number of values, weighted by the number of examples: avg_num_values = tot_num_values / num_non_missing.
The total number of values in this feature.
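The weighted counting rules above can be sketched on a toy column. This is an illustrative helper (the (weight, values) pair representation is an assumption for the example), not TFDV code:

```python
def weighted_common_stats(examples):
    """Weighted presence stats for one feature column.

    `examples` is a list of (weight, values) pairs, where values is a
    list of feature values or None when the feature is missing. A
    missing weight column would mean weight 1.0 per example; here the
    caller supplies explicit weights.
    """
    num_non_missing = sum(w for w, v in examples if v is not None)
    num_missing = sum(w for w, v in examples if v is None)
    tot_num_values = sum(w * len(v) for w, v in examples if v is not None)
    avg_num_values = tot_num_values / num_non_missing
    return num_non_missing, num_missing, avg_num_values, tot_num_values
```

Note that avg_num_values = tot_num_values / num_non_missing, matching the relation stated above.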
Represents a weighted feature that is encoded as a combination of raw base features. The `weight_feature` should be a float feature with a shape identical to the `feature`. This is useful for representing weights associated with categorical tokens (e.g. a TFIDF weight associated with each token). TODO(b/142122960): Handle WeightedCategorical end to end in TFX (validation, TFX Unit Testing, etc)
Used in:
Name for the weighted feature. This should not clash with other features in the same schema.
required
Path of a base feature to be weighted. Required.
Path of weight feature to associate with the base feature. Must be same shape as feature. Required.
The lifecycle_stage determines where a feature is expected to be used, and therefore how important issues with it are.
Statistics for a weighted feature with an NL domain.
Used in:
Weighted feature coverage.
Weighted average token length.
Histogram containing the distribution of token lengths.
Histogram containing the distribution of sequence lengths.
Weighted number of sequences that do not match the location constraint.
Per-token weighted statistics.
The rank histogram with the weighted tokens for the feature.
Statistics for a weighted numeric feature in a dataset.
Used in:
The weighted mean of the values
The weighted standard deviation of the values
The weighted median of the values
The histogram(s) of the weighted feature values.
Statistics for a weighted string feature in a dataset.
Used in:
A sorted list of the most-frequent values and their weighted frequencies, with the most-frequent being first.
The rank histogram for the weighted values of the feature.