Area under curve for the ROC-curve. https://www.tensorflow.org/api_docs/python/tf/keras/metrics/AUC
Used in:
(message has no fields)
Area under curve for the precision-recall-curve. https://www.tensorflow.org/api_docs/python/tf/keras/metrics/AUC
Used in:
(message has no fields)
Used in:
If unset, placeholders will be dropped.
Additional information about the schema or about a feature.
Used in:
Tags can be used to mark features. For example, the tag on the user_age feature can be `user_feature`, and the tags on the user_country feature can be `location_feature` and `user_feature`.
Free-text comments. This can be used as a description of the feature, developer notes etc.
Application-specific metadata may be attached here.
Message to represent the anomalies, which describe the mismatches (if any) between the stats and the schema.
The baseline schema that is used.
The format of the keys in anomaly_info. If absent, default is DEFAULT.
Information about feature-level anomalies.
Information about dataset-level anomalies.
True if numExamples == 0.
If drift / skew detection was conducted, this field will hold the comparison results for all the features compared, regardless of whether a related anomaly was reported.
TODO(b/123519907): Remove this. The hook to attach any usage- and tool-specific metadata. Example:

  message SchemaStamp {
    // extension ID is any CL number that has not been used in an extension.
    extend proto2.bridge.MessageSet {
      optional StampedSchemaDiff message_set_extension = 123445554;
    }
    optional string schema_stamp = 1;
  }

Then the following proto msg encodes an Anomalies with an embedded SchemaStamp:

  Anomalies {
    metadata {
      [SchemaStamp]: { schema_stamp: "stamp" }
    }
  }

GOOGLE-LEGACY optional proto2.bridge.MessageSet metadata = 5;
Map from a column to the difference that it represents.
Used in:
At present, this indicates that the keys in anomaly_info refer to the raw field names in the Schema.
The serialized path to a struct.
Message to represent information about an individual anomaly.
Used in:
A path indicating where the anomaly occurred. Dataset-level anomalies do not have a path.
A description of the entire anomaly.
A shorter description, suitable for UI presentation. If there is a single reason for the anomaly, identical to reason[0].short_description. Otherwise, summarizes all the reasons.
The comparison between the existing schema and the fixed schema.
Reason for the anomaly. There may be more than one reason; e.g., the field might sometimes be missing AND a new value might be present.
Used in:
A short description of an anomaly, suitable for UI presentation.
A longer description of an anomaly.
Used in:
Next ID: 89
Used in:
Multiple reasons for anomaly.
Integer larger than 1
BYTES type when expected INT type
BYTES type when expected STRING type
FLOAT type when expected INT type
FLOAT type when expected STRING type
INT type when expected STRING type
Integer smaller than 0
STRING type when expected INT type
Expected a specific string, but a different string was seen.
Boolean had float values other than 0 and 1.
BoolDomain has invalid configuration.
BYTES type when expected STRING type
FLOAT type when expected STRING type
INT type when expected STRING type
Invalid UTF8 string observed
Unexpected string values
The number of values in a given example is too large
The fraction of examples containing a feature is too small
The number of examples containing a feature is too small
The number of values in a given example is too small
No examples contain the value
The feature is present as an empty list
The feature is repeated in an example, but was expected to be a singleton
The feature had too many unique values (string and categorical features only).
The feature had too few unique values (string and categorical features only).
The feature has a constraint on the number of unique values but is not of a type that has the number of unique values counted (i.e., is not string or categorical).
There is a float value that is too high
The type is not FLOAT
There is a float value that is too low
The feature is supposed to be floats encoded as strings, but there is a string that is not a float
The feature is supposed to be floats encoded as strings, but it was some other type (INT, BYTES, FLOAT)
The type is completely unknown
Float feature includes NaN values.
Float feature includes Inf or -Inf values.
There is an unexpectedly large integer
The type was supposed to be INT, but it was not.
The feature is supposed to be ints encoded as strings, but some string was not an int.
The type was supposed to be STRING, but it was not.
There is an unexpectedly small integer
The feature is supposed to be ints encoded as strings, but it was some other type (INT, BYTES, FLOAT)
Unknown type in stats proto
The fraction of examples containing TensorFlow supported images is lower than the threshold set in the Schema.
There are no stats for a column at all
There is a new column that is not in the schema.
Training serving skew issue
Expected STRING type, but it was FLOAT.
Expected STRING type, but it was INT.
Control data is missing (either scoring data or previous day).
Treatment data is missing (either treatment data or current day).
L infinity between treatment and control is high.
Approximate Jensen-Shannon divergence between treatment and control is high.
The normalized absolute difference between treatment and control is high.
No examples in the span.
The value feature of a sparse feature is missing and at least one feature defining the sparse feature is present.
An index feature of a sparse feature is missing and at least one feature defining the sparse feature is present.
The length of the features representing a sparse feature does not match.
Name collision between a sparse feature and raw feature.
Invalid custom semantic domain.
There are not enough examples in the current data as compared to a control dataset.
There are too many examples in the current data as compared to a control dataset.
There are not enough examples in the dataset.
There are too many examples in the dataset.
Name collision between a weighted feature and a raw feature.
The value feature of a weighted feature is missing on examples where the weight feature is present.
The weight feature of a weighted feature is missing on examples where the value feature is present.
The length of the features representing a weighted feature does not match.
The nesting level of the feature values does not match.
The domain specified is not compatible with the physical type.
Feature on schema has no name.
Feature on schema has no type.
Triggered for invalid schema specifications, e.g. min_fraction < 0.
Triggered for invalid domain specifications in schema.
The type of the data is inconsistent with the specified type.
A value did not show up the min number of times within a sequence.
A value showed up more than the max number of times within a sequence.
A value did not show up in at least the min fraction of sequences.
A value showed up in greater than the max fraction of sequences.
Too small a fraction of feature values matched vocab entries.
The average token length was too short.
A sequence violated the location constraint.
A feature was specified as an embedding but was not a fixed dimension.
A feature contains an image that has more bytes than the max byte size.
A feature is supposed to be of a fixed shape but its valency stats do not agree.
Constraints are specified within the schema but cannot be verified because the corresponding stats are not available.
A derived feature had a schema lifecycle other than VALIDATION_DERIVED or DISABLED.
The following are experimental and subject to change.
A derived feature is represented in the schema with an invalid or missing validation_derived_source.
The following type is experimental and subject to change. The statistics did not specify a custom validation condition.
Used in:
Audio data.
Used in:
(message has no fields)
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/binary_accuracy
Used in:
(message has no fields)
Configuration for a binary classification task. The output is one of two possible class labels, encoded as the same type as the label column. BinaryClassification is the same as MultiClassClassification with n_classes = 2.
Used in:
The label column.
The name of the label. Assumes the label is a flat, top-level field.
A path can be used instead of a flat string if the label is nested.
(optional) The weight column.
(optional) specification of the positive and/or negative class value.
Defines which label value is the positive and/or negative class.
Used in:
This value is the positive class.
This value is the negative class.
Specifies a label's value which can be used for positive/negative class specification.
Used in:
Binary cross entropy as a metric is equal to the negative log likelihood (see logistic regression). In addition, when used to solve a binary classification task, binary cross entropy implies that the binary label will maximize binary accuracy. binary_crossentropy(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/binary_crossentropy
Used in:
(message has no fields)
DEPRECATED
Used in:
Encodes information about the domain of a boolean attribute that encodes its TRUE/FALSE values as strings, or 0=false, 1=true. Note that FeatureType could be either INT or BYTES.
Used in:
Id of the domain. Required if the domain is defined at the schema level. If so, then the name must be unique within the schema.
String values for TRUE/FALSE.
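As a sketch of these semantics (plain Python, not the generated proto classes; decode_bool and its parameters are hypothetical), a BoolDomain with custom TRUE/FALSE strings could be interpreted like this:

```python
def decode_bool(raw, true_value="True", false_value="False"):
    """Interpret a raw BYTES/INT value under a BoolDomain.

    Accepts the domain's configured TRUE/FALSE strings, or the
    integer encoding 0=false, 1=true.
    """
    if isinstance(raw, int):
        if raw in (0, 1):
            return bool(raw)
        raise ValueError(f"integer {raw} is not a valid boolean encoding")
    if raw == true_value:
        return True
    if raw == false_value:
        return False
    raise ValueError(f"{raw!r} matches neither the TRUE nor FALSE string")
```

Note that because FeatureType may be INT or BYTES, both branches are needed.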
Statistics for a bytes feature in a dataset.
Used in:
The number of unique values
The average number of bytes in a value
The minimum number of bytes in a value
The maximum number of bytes in a value
The maximum number of bytes in a value, as an int. Float will start having a loss of precision for a large enough integer. This field preserves the precision.
categorical_accuracy(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/categorical_accuracy
Used in:
(message has no fields)
categorical_crossentropy(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/categorical_crossentropy
Used in:
(message has no fields)
Used in:
Describes a chunk that represents changes in both artifacts over the same number of lines.
Used in:
Changed region in the left artifact, in terms of starting line number and contents.
Ditto for the right artifact.
Common statistics for all feature types. Statistics counting number of values (i.e., min_num_values, max_num_values, avg_num_values, and tot_num_values) include NaNs. For nested features with N nested levels (N > 1), the statistics counting number of values will rely on the innermost level.
Used in:
The number of examples that include this feature. Note that this includes examples that contain this feature with an explicitly empty list of values, which may be permitted for variable length features.
The number of examples missing this feature.
The minimum number of values in a single example for this feature.
The maximum number of values in a single example for this feature.
The average number of values in a single example for this feature. avg_num_values = tot_num_values / num_non_missing.
The total number of values in this feature.
The quantiles histogram for the number of values in this feature.
The histogram for the number of features in the feature list (only set if this feature is a non-context feature from a tf.SequenceExample). This is different from num_values_histogram, as num_values_histogram tracks the count of all values for a feature in an example, whereas this tracks the length of the feature list for this feature in an example (where each feature list can contain multiple values).
Contains presence and valency stats for each nest level of the feature. The first item corresponds to the outermost level, and by definition, the stats it contains equal the corresponding stats defined above. May not be populated if the feature is of nest level 1.
If not empty, it's parallel to presence_and_valency_stats.
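The counting statistics above, including the relation avg_num_values = tot_num_values / num_non_missing, can be sketched in plain Python, with None standing in for a missing feature (the dict keys mirror the field comments; this is not the generated proto API):

```python
def common_stats(value_lists):
    """Compute the counting statistics for a feature from a list of
    per-example value lists (None means the feature is missing)."""
    present = [v for v in value_lists if v is not None]
    num_non_missing = len(present)
    tot_num_values = sum(len(v) for v in present)
    return {
        "num_non_missing": num_non_missing,
        "num_missing": len(value_lists) - num_non_missing,
        # An explicitly empty list counts as present, with zero values.
        "min_num_values": min((len(v) for v in present), default=0),
        "max_num_values": max((len(v) for v in present), default=0),
        "tot_num_values": tot_num_values,
        "avg_num_values": (tot_num_values / num_non_missing
                           if num_non_missing else 0.0),
    }
```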
ContentChunk data.
Used in:
(message has no fields)
cosine(...) cosine_proximity(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/cosine_proximity DEPRECATED
Used in:
(message has no fields)
NextID: 8
Used in:
The path of feature x.
The path of feature y.
Number of occurrences of this feature cross in the data. If any of the features in the cross is missing, the example is ignored.
A custom metric. Prefer using or adding an explicit metric message and only use this generic message as a last resort. NEXT_TAG: 4
Used in:
The display name of a metric computed by the model. The name should match ^[a-zA-Z0-9\s]{1,25}$ and must be unique across all performance metrics. Trailing and leading spaces will be truncated before matching.
True if the metric is maximized; false if it is minimized. Must be specified if the CustomMetric is used as an objective.
Specification of the metric in the binary’s metric registry.
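The display-name constraint can be checked with Python's re module; stripping before matching reflects the note that trailing and leading spaces are truncated (the helper name is hypothetical):

```python
import re

# The pattern stated in the name field's documentation.
_METRIC_NAME_RE = re.compile(r"^[a-zA-Z0-9\s]{1,25}$")

def is_valid_metric_name(name: str) -> bool:
    r"""Check a display name against ^[a-zA-Z0-9\s]{1,25}$,
    stripping leading/trailing spaces first."""
    return bool(_METRIC_NAME_RE.match(name.strip()))
```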
RegistrySpec is a full specification of the custom metric and its construction based on the binary’s metric registry. New custom metrics must be linked to the binary and registered in its metric registry to be identifiable via this specification.
Used in:
Identifier of the metric class in the metric registry of the binary.
Generic proto describing the configuration for the metric to be computed. It is up to the implementer of the metric to parse this configuration.
Stores the name and value of any custom statistic. The value can be a string, double, or histogram.
Used in:
Constraints on the entire dataset.
Used in:
Tests differences in number of examples between the current data and the previous span.
Tests comparisons in number of examples between the current data and the previous version of that data.
Minimum number of examples in the dataset.
Maximum number of examples in the dataset.
The feature statistics for a single dataset.
Used in:
The name of the dataset.
The number of examples in the dataset.
Only valid if the weight feature was specified. Treats a missing weighted feature as zero.
The feature statistics for the dataset.
Cross feature statistics for the dataset.
A list of feature statistics for different datasets. If you wish to compare different datasets using this list, then the DatasetFeatureStatistics entries should all contain the same list of features.
Stores configuration for a variety of canned feature derivers. TODO(b/227478330): Consider validating config in merge_util.cc.
Used in:
DerivedFeatureSource tracks information about the source of a derived feature. Derived features are computed from ordinary features for the purposes of statistics collection and validation, but do not exist in the dataset. Experimental and subject to change.
Used in:
The name of the deriver that generated this feature.
An optional description of the transformation.
The constituent features that went into generating this derived feature.
A DerivedFeatureSource that is declaratively configured represents an intent for downstream processing to generate a derived feature (in the schema), or tracks that a feature was generated from such a configuration (in statistics).
Optional configuration for canned derivers.
Describes a region in the comparison between two text artifacts. Note that a region also contains the contents of the two artifacts that correspond to the region.
Used in:
Details for the chunk.
An unchanged region of lines.
A region of lines removed from the left.
A region of lines added to the right.
A region of lines that are different in the two artifacts.
An unchanged region of lines whose contents are just hidden.
Models constraints on the distribution of a feature's values. TODO(martinz): replace min_domain_mass with max_off_domain (but slowly).
Used in:
The minimum fraction (in [0,1]) of values across all examples that should come from the feature's domain, e.g.:
  1.0 => All values must come from the domain.
  0.9 => At least 90% of the values must come from the domain.
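A sketch of how min_domain_mass could be checked (hypothetical helpers, not TFDV's implementation):

```python
def domain_mass(values, domain):
    """Fraction of values, across all examples, that fall in the domain."""
    values = list(values)
    if not values:
        return 1.0  # vacuously satisfied when there are no values
    domain = set(domain)
    return sum(v in domain for v in values) / len(values)

def satisfies_min_domain_mass(values, domain, min_domain_mass):
    """True if at least min_domain_mass of the values are in-domain."""
    return domain_mass(values, domain) >= min_domain_mass
```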
Message to contain the result of the drift/skew measurements for a feature.
Used in:
Identifies the feature;
The drift/skew may be measured in the same invocation of TFDV, in which case both of the following fields are populated. Also the drift/skew may be quantified by different measurements, thus repeated.
Used in:
Type of the measurement.
Value of the measurement.
Threshold used to determine whether the measurement results in an anomaly.
Used in:
Specifies a dynamic multiclass/multi-label problem where the number of label classes is inferred from the data.
Used in:
Optional. If specified, an Out-Of-Vocabulary (OOV) class is created and populated based on frequencies in the training set. If no OOV class is specified, the model's label vocabulary should consist of all labels that appear in the training set.
Note: it is up to a solution provider to implement support for OOV labels. Note: both a frequency_threshold and a top_k may be set. A class is grouped into the OOV class if it fails to meet either of the criteria below.
Used in:
If set, labels are grouped into the "OOV" class if they occur less than frequency_threshold times in the training dataset. If 0, labels that appear in test / validation splits but not in training would be still classified as the "OOV" class.
If set, only the top_k labels in the training set are used and all others are grouped into an "OOV" class.
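A sketch of the grouping rule above, where a label lands in the OOV class if it fails either criterion that is set (the function name is hypothetical):

```python
from collections import Counter

def group_oov(train_labels, frequency_threshold=None, top_k=None):
    """Map each training label to itself or to "OOV".

    A label is grouped into OOV if it fails either the
    frequency_threshold or the top_k criterion (when set).
    """
    counts = Counter(train_labels)
    kept = set(counts)
    if frequency_threshold is not None:
        kept = {lbl for lbl in kept if counts[lbl] >= frequency_threshold}
    if top_k is not None:
        top = {lbl for lbl, _ in counts.most_common(top_k)}
        kept &= top
    return {lbl: (lbl if lbl in kept else "OOV") for lbl in counts}
```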
Threshold to apply to a prediction to determine positive vs negative. Note: if the model is calibrated, the threshold can be thought of as a probability so the threshold has a stable, intuitive semantic. However, not all solutions may be calibrated, and not all computations of the metric may operate on a calibrated score. In AutoTFX, the final model metrics are computed on a calibrated score, but the metrics computed within the model selection process are uncalibrated. Be aware of this possible skew in the metrics between model selection and final model evaluation.
Threshold to apply to a prediction to determine positive vs negative. Note: if the model is calibrated, the threshold can be thought of as a probability so the threshold has a stable, intuitive semantic. However, not all solutions may be calibrated, and not all computations of the metric may operate on a calibrated score. In AutoTFX, the final model metrics are computed on a calibrated score, but the metrics computed within the model selection process are uncalibrated. Be aware of this possible skew in the metrics between model selection and final model evaluation.
Describes schema-level information about a specific feature. NextID: 39
Used in:
The name of the feature.
required
This field is no longer supported. Instead, use: lifecycle_stage: DEPRECATED TODO(b/111450258): remove this.
Constraints on the presence of this feature in the examples.
Only used in the context of a "group" context, e.g., inside a sequence.
The shape of the feature which governs the number of values that appear in each example.
The feature has a fixed shape corresponding to a multi-dimensional tensor.
The feature doesn't have a well-defined shape. All we know are limits on the minimum and maximum number of values.
Captures the same information as value_count but for features with nested values. A ValueCount is provided for each nest level.
Physical type of the feature's values. Note that you can have: type: BYTES int_domain: { min: 0 max: 3 } This would be a field that is syntactically BYTES (i.e. strings), but semantically an int, i.e. it would be "0", "1", "2", or "3".
Domain for the values of the feature.
Reference to a domain defined at the schema level. NOTE THAT TFDV ONLY SUPPORTS STRING DOMAINS AT THE TOP LEVEL. TODO(b/63664182): Support this.
Inline definitions of domains.
Supported semantic domains.
Constraints on the distribution of the feature values. Only supported for StringDomains.
Additional information about the feature for documentation purpose.
Tests comparing the distribution to the associated serving data.
Tests comparing the distribution between two consecutive spans (e.g. days).
List of environments this feature is present in. Should be disjoint from not_in_environment. This feature is in environment "foo" if: ("foo" is in in_environment or default_environment) AND "foo" is not in not_in_environment. See Schema::default_environment.
List of environments this feature is not present in. Should be disjoint from of in_environment. See Schema::default_environment and in_environment.
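A literal reading of the membership rule above, as a plain-Python sketch (the function and parameter names are hypothetical):

```python
def feature_in_environment(env, in_environment, not_in_environment,
                           default_environment):
    """A feature is in environment `env` iff
    (`env` is in in_environment or default_environment)
    AND `env` is not in not_in_environment."""
    return (env in set(in_environment) | set(default_environment)
            and env not in set(not_in_environment))
```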
The lifecycle stage of a feature. It can also apply to its descendants. i.e., if a struct is DEPRECATED, its children are implicitly deprecated.
Constraints on the number of unique values for a given feature. This is supported for string and categorical features only.
If set, indicates that this feature is derived, and stores metadata about its source. If this field is set, this feature should have a disabled stage (PLANNED, ALPHA, DEPRECATED, DISABLED, DEBUG_ONLY), or lifecycle_stage VALIDATION_DERIVED. Experimental and subject to change.
This field specifies if this feature could be treated as a sequence feature which has meaningful element order.
Used in:
Encodes vocabulary coverage constraints.
Used in:
Fraction of feature values that map to a vocab entry (i.e. are not oov).
Average length of tokens. Used for cases such as wordpiece that fallback to character-level tokenization.
String tokens to exclude when calculating min_coverage and min_avg_token_length. Useful for tokens such as [PAD].
Integer tokens to exclude when calculating min_coverage and min_avg_token_length.
String tokens to treat as oov tokens (e.g. [UNK]). These tokens are also excluded when calculating avg token length.
The complete set of statistics for a given feature name for a dataset. NextID: 11
Used in:
One can identify a field either by the name (for simple fields), or by a path (for structured fields). Note that: name: "foo" is equivalent to: path: {step:"foo"} Note: this oneof must be consistently either name or path across all FeatureNameStatistics in one DatasetFeatureStatistics.
The feature name
The path of the feature.
The data type of the feature
The statistics of the values of the feature.
Any custom statistics can be stored in this list.
If set, indicates that this feature is derived for validation, and stores metadata about its source. Experimental and subject to change.
The types supported by the feature statistics. When aggregating tf.Examples, if the bytelist contains a string, it is recommended to encode it here as STRING instead of BYTES in order to calculate string-specific statistical measures.
Used in:
Describes constraints on the presence of the feature in the data.
Used in:
Minimum fraction of examples that have this feature.
Minimum number of examples that have this feature.
Records constraints on the presence of a feature inside a "group" context (e.g., .presence inside a group of features that define a sequence).
Used in:
Describes the physical representation of a feature. It may be different than the logical representation, which is represented as a Domain.
Used in:
Specifies a fixed shape for the feature's values. The immediate implication is that each feature has a fixed number of values. Moreover, these values can be parsed in a multi-dimensional tensor using the specified axis sizes. The FixedShape defines a lexicographical ordering of the data. For instance, if there is a FixedShape { dim {size:3} dim {size:2} }, then:
  tensor[0][0] = field[0]
  tensor[0][1] = field[1]
  tensor[1][0] = field[2]
  tensor[1][1] = field[3]
  tensor[2][0] = field[4]
  tensor[2][1] = field[5]
The FixedShape message is identical to the tensorflow.TensorShape proto message for fully defined shapes. The FixedShape message cannot represent unknown dimensions or an unknown rank.
Used in:
The dimensions that define the shape. The total number of values in each example is the product of sizes of each dimension.
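The row-major (lexicographical) ordering above can be sketched in plain Python (to_tensor is a hypothetical helper, not part of the generated proto API):

```python
def to_tensor(field, dims):
    """Reshape a flat list of values into nested lists following the
    lexicographical ordering a FixedShape defines."""
    expected = 1
    for d in dims:
        expected *= d
    if len(field) != expected:
        raise ValueError(f"expected {expected} values, got {len(field)}")
    def build(values, dims):
        if not dims:
            return values[0]
        step = len(values) // dims[0]
        return [build(values[i * step:(i + 1) * step], dims[1:])
                for i in range(dims[0])]
    return build(list(field), list(dims))
```

For FixedShape { dim {size:3} dim {size:2} }, this reproduces the tensor[i][j] = field[k] mapping shown above.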
An axis in a multi-dimensional feature representation.
Used in:
Optional name of the tensor dimension.
Encodes information for domains of float values. Note that FeatureType could be either INT or BYTES.
Used in:
Id of the domain. Required if the domain is defined at the schema level. If so, then the name must be unique within the schema.
Min and max values of the domain.
If true, feature should not contain NaNs.
If true, feature should not contain Inf or -Inf.
If True, this indicates that the feature is semantically an embedding. This can be useful for distinguishing fixed dimensional numeric features that should be fed to a model unmodified.
If true then the domain encodes categorical values (i.e., ids) rather than continuous values.
This field specifies the embedding dimension and is only applicable if is_embedding is true. It is useful for use cases such as restoring shapes for flattened sequence of embeddings.
Specifies the semantic type of the embedding e.g. sbv4_semantic or pulsar.
A chunk that represents identical lines, whose contents are hidden.
Used in:
Starting lines in the two artifacts.
Size of the region in terms of lines.
Linear Hinge Loss hinge(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/hinge DEPRECATED
Used in:
(message has no fields)
The data used to create a histogram of a numeric feature for a dataset.
Used in:
The number of NaN values in the dataset.
The number of undefined values in the dataset.
A list of buckets in the histogram, sorted from lowest bucket to highest bucket.
The type of the histogram.
An optional descriptive name of the histogram, to be used for labeling.
Each bucket defines its low and high values along with its count. The low and high values must be a real number or positive or negative infinity. They cannot be NaN or undefined. Counts of those special values can be found in the numNaN and numUndefined fields.
Used in:
The low value of the bucket, exclusive except for the first bucket.
The high value of the bucket, inclusive.
The number of items in the bucket. Stored as a double to be able to handle weighted histograms.
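The boundary semantics above (low exclusive except for the first bucket, high inclusive) can be sketched as follows (hypothetical helper; buckets given as (low, high) pairs):

```python
def find_bucket(value, buckets):
    """Return the index of the bucket containing value.

    Each bucket is a (low, high) pair; low is exclusive except for the
    first bucket, and high is inclusive.
    """
    for i, (low, high) in enumerate(buckets):
        above_low = value > low or (i == 0 and value >= low)
        if above_low and value <= high:
            return i
    return None  # value falls outside every bucket
```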
The type of the histogram. A standard histogram has equal-width buckets. The quantiles type is used for when the histogram message is used to store quantile information (by using approximately equal-count buckets with variable widths).
Used in:
Used in:
Type controls the source of the histogram used for numeric drift and skew calculations. Currently the default is STANDARD. Calculations based on QUANTILES are more robust to outliers.
Used in:
Image data.
Used in:
If set, at least this fraction of values should be TensorFlow supported images.
If set, image should have less than this value of undecoded byte size.
Used in:
Checks that the L-infinity norm is below a certain threshold between the two discrete distributions. Since this is applied to a FeatureNameStatistics, it only considers the top k. L_infty(p,q) = max_i |p_i-q_i|
Used in:
The InfinityNorm is in the interval [0.0, 1.0] so sensible bounds should be in the interval [0.0, 1.0).
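The formula L_infty(p,q) = max_i |p_i - q_i| maps directly to code over dict-encoded discrete distributions (hypothetical helper; TFDV applies it to the top-k values only, as noted above):

```python
def l_infinity_distance(p, q):
    """max_i |p_i - q_i| over the union of the two distributions'
    support; a value missing from one distribution counts as 0."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```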
Encodes information for domains of integer values. Note that FeatureType could be either INT or BYTES.
Used in:
Id of the domain. Required if the domain is defined at the schema level. If so, then the name must be unique within the schema.
Min and max values for the domain.
If true then the domain encodes categorical values (i.e., ids) rather than ordinal values.
Checks that the approximate Jensen-Shannon Divergence is below a certain threshold between the two distributions.
Used in:
The JensenShannonDivergence will be in the interval [0.0, 1.0] so sensible bounds should be in the interval [0.0, 1.0).
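A sketch of the divergence for discrete distributions, using log base 2 so the result stays in [0.0, 1.0] as noted above (function name hypothetical; TFDV's actual computation is approximate and histogram-based):

```python
import math

def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5*KL(p||m) + 0.5*KL(q||m) with m = (p+q)/2,
    computed with log base 2. p and q map values to probabilities."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a, b):
        # Terms with a_k == 0 contribute 0; b_k > 0 whenever a_k > 0
        # because b is the midpoint distribution.
        return sum(a[k] * math.log2(a[k] / b[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```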
kld(...) kullback_leibler_divergence(...) KLD(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/kullback_leibler_divergence DEPRECATED
Used in:
(message has no fields)
LifecycleStage. Only UNKNOWN_STAGE, BETA, PRODUCTION, and VALIDATION_DERIVED features are actually validated. PLANNED, ALPHA, DISABLED, and DEBUG are treated as DEPRECATED.
Used in:
Unknown stage.
Planned feature, may not be created yet.
Prototype feature, not used in experiments yet.
Used in user-facing experiments.
Used in a significant fraction of user traffic.
No longer supported: do not use in new models.
Only exists for debugging purposes.
Generic indication that feature is disabled / excluded from models, regardless of specific reason.
Indicates that this feature was derived from ordinary features for the purposes of statistics generation or validation. Consumers should expect that this feature may be present in DatasetFeatureStatistics, but not in input data. Experimental and subject to change.
Container for lift information for a specific y-value.
Used in:
The particular value of path_y corresponding to this LiftSeries. Each element in lift_values corresponds to the lift of a different x_value and this specific y_value.
The number of examples in which y_value appears.
The lifts for each path_x value and this y_value.
A bucket for referring to binned numeric features.
Used in:
The low value of the bucket, inclusive.
The high value of the bucket, exclusive (unless the high_value is positive infinity).
A container for lift information about a specific value of path_x.
Used in:
P(path_y=y|path_x=x) / P(path_y=y) for x_value and the enclosing y_value. In terms of concrete fields, this number represents: (x_and_y_count / x_count) / (y_count / num_examples)
The number of examples in which x_value appears.
The number of examples in which x_value appears and y_value appears.
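The lift formula above maps directly to code (hypothetical helper, mirroring the field names in this message):

```python
def lift(x_and_y_count, x_count, y_count, num_examples):
    """P(path_y=y | path_x=x) / P(path_y=y)
    = (x_and_y_count / x_count) / (y_count / num_examples)."""
    return (x_and_y_count / x_count) / (y_count / num_examples)
```

A lift above 1 means y is more likely when x is present than in the dataset overall.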
Used in:
Lift information for each value of path_y. Lift is defined for each pair of values (x,y) as P(path_y=y|path_x=x)/P(path_y=y).
Weighted lift information for each value of path_y. Weighted lift is defined for each pair of values (x,y) as P(path_y=y|path_x=x)/P(path_y=y) where probabilities are computed over weighted example space.
AKA the negative log likelihood or log loss. Given a label y ∈ {0,1} and a predicted probability p ∈ [0,1]: -y·ln(p) - (1-y)·ln(1-p). TODO(martinz): if this is interpreted the same as binary_cross_entropy, we may need to revisit the semantics. DEPRECATED
Used in:
(message has no fields)
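The log loss formula above as a sketch (the clipping epsilon is a common numerical-stability convention, not something this message specifies):

```python
import math

def log_loss(y, p, eps=1e-15):
    """-y*ln(p) - (1-y)*ln(1-p), with p clipped away from 0 and 1
    so the logarithms stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)
```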
Knowledge graph ID, see: https://www.wikidata.org/wiki/Property:P646
Used in:
(message has no fields)
https://www.tensorflow.org/responsible_ai/model_remediation/api_docs/python/model_remediation/min_diff/losses/MMDLoss
Kernel to apply to the predictions. Currently supported values are 'gaussian' and 'laplace'. Defaults to 'gaussian'.
MAE(...) mae(...) mean_absolute_error(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/mean_absolute_error
Used in:
(message has no fields)
MAPE(...) mape(...) mean_absolute_percentage_error(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/mean_absolute_percentage_error
Used in:
(message has no fields)
Used in:
(message has no fields)
MSE(...) mse(...) mean_squared_error(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/mean_squared_error
Used in:
(message has no fields)
msle(...) MSLE(...) mean_squared_logarithmic_error(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/mean_squared_logarithmic_error
Used in:
(message has no fields)
The high-level objectives described by this problem statement. These objectives provide a basis for ranking models and can be optimized by a meta optimizer (e.g. a grid search over hyperparameters). A solution provider may also directly use the meta optimization targets to heuristically select losses, etc. without any meta-optimization process. If not specified, the high-level meta optimization target is inferred from the task. These objectives do not need to be differentiable, as the solution provider may use a proxy function to optimize model weights. Target definitions include tasks, metrics, and any weighted combination of them.
Used in:
The name of a task in this problem statement producing the prediction or classification for the metric.
The performance metric to be evaluated. The prediction or classification is based upon the task. The label is from the type of the task, or from the override_task.
Describes how to combine with other objectives.
If a model spec has multiple meta optimization targets, the weight of each can be specified. The final objective is then a weighted combination of the multiple objectives. If not specified, value is 1.
Secondary meta optimization targets can be thresholded, meaning that the optimization process prefers solutions above (or below) the threshold, but need not prefer solutions higher (or lower) on the metric if the threshold is met.
Configuration for thresholded meta-optimization targets.
Used in:
If specified, indicates a threshold that the user wishes the metric to stay under (for MINIMIZE type), or above (for MAXIMIZE type). The optimization process need not prefer models that are higher (or lower) on the thresholded metric so long as the threshold is respected. E.g., if `threshold` for a MAXIMIZE type metric X is .9, the optimization process will prefer a solution with X = .92 over a solution with X = .88, but may not prefer a solution with X = .95 over a solution with X = .92. Unless otherwise specified by the PerformanceMetric, threshold is best effort. It does not provide a hard guarantee about the properties of the final model, but rather serves as a "target" to guide the optimization process. The user is responsible for validating that final model metrics are in an acceptable range for the application. A problem statement may, however, be rejected if the specified target is impossible to achieve. Keep this in mind if running the optimization on a recurring basis, as shifts in the data could push a previously achievable target to being unachievable (and thus yield no solution). The units and range for the threshold will be the same as the valid output range of the associated performance_metric.
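The preference rule described above (strict preference below the threshold, no preference once both candidates meet it) can be sketched in Python. This is a hypothetical illustration of the semantics, not part of any TFMD/TFX API.

```python
def prefers(metric_a, metric_b, threshold, maximize=True):
    """Return True if solution A is preferred over solution B under a
    thresholded meta-optimization target.

    For a MAXIMIZE-type metric, higher is strictly better below the
    threshold; once both solutions meet the threshold, this target
    expresses no preference. MINIMIZE is handled by sign flipping.
    """
    if not maximize:
        metric_a, metric_b, threshold = -metric_a, -metric_b, -threshold
    if metric_a >= threshold and metric_b >= threshold:
        return False  # both meet the threshold: no preference from this target
    return metric_a > metric_b
```

With threshold 0.9 on a MAXIMIZE metric, 0.92 is preferred over 0.88, but 0.95 is not preferred over 0.92, matching the example in the description.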
Metric type indicates which direction of a real-valued metric is "better". For most message types, this is invariant. For custom message types, is_maximized == true is like MAXIMIZE, and otherwise MINIMIZE.
Maximize the metric (i.e. a utility).
Minimize the metric (i.e. a loss).
Look for a field is_maximized.
Area under the ROC curve, calculated globally for MultiClassClassification (model predicts a single label) or MultiLabelClassification (model predicts class probabilities). The area is calculated by treating the entire set of data as an aggregate result, and computing a single metric rather than k metrics (one for each target label) that get averaged together. For example, the FPR and TPR at a given point on the AUC curve for k target labels are: FPR = (FP1 + FP2 + ... + FPk) / ((FP1 + FP2 + ... + FPk) + (TN1 + TN2 + ... + TNk)) TPR = (TP1 + TP2 + ... + TPk) / ((TP1 + TP2 + ... + TPk) + (FN1 + FN2 + ... + FNk))
Used in:
(message has no fields)
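The micro-averaged aggregation above can be sketched as follows. This is an illustrative helper (the per-class confusion counts and their dict keys are assumptions for the example), not library code.

```python
def micro_fpr_tpr(per_class_counts):
    """Compute one (FPR, TPR) operating point by summing confusion
    counts (tp, fp, tn, fn) over all k target labels, then computing a
    single metric from the aggregate, as described above."""
    tp = sum(c["tp"] for c in per_class_counts)
    fp = sum(c["fp"] for c in per_class_counts)
    tn = sum(c["tn"] for c in per_class_counts)
    fn = sum(c["fn"] for c in per_class_counts)
    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn)
    return fpr, tpr
```

Note this differs from macro averaging, which would compute k per-class rates and average them.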
Configuration for a multi-class classification task. In this problem type, there are n_classes possible label values, and the model predicts a single label. The output is one of the class labels, out of n_classes possible classes. The output type will correspond to the label column type.
Used in:
The label column. There's only a single label per example. If the label column is a BoolDomain, use the BinaryClassification Type instead.
The name of the label. Assumes the label is a flat, top-level field.
A path can be used instead of a flat string if the label is nested.
The weight column.
The exact number of label classes.
The number of label classes that should be inferred dynamically from the data.
A multi-dimensional regression task. Similar to OneDimensionalRegression, MultiDimensionalRegression predicts continuous real numbers. However, instead of predicting a single scalar value per example, we predict a fixed-dimensional vector of values. By default the range is any float, -inf to inf, but specific sub-types (e.g. probability) define narrower ranges.
The label column.
oneof label_id is required.
The name of the label. Assumes the label is a flat, top-level field.
A path can be used instead of a flat string if the label is nested.
(optional) The weight column.
When set means the label is a probability in range [0..1].
Defines a regression problem where labels are in [0, 1] and represent a probability (e.g: probability of click).
Used in:
By default, MultiDimensionalRegression assumes that each value in the predicted vector is independent. If predictions_sum_to_1 is true, this indicates that the vector of values represent mutually exclusive rather than independent probabilities (for example, the probabilities of classes in a multi-class scenario). When this is set to true, we use softmax instead of sigmoid in the loss function.
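The sigmoid-vs-softmax distinction above can be made concrete with a small sketch. These helpers are illustrative only (names and the use of logits as inputs are assumptions), showing why independent probabilities get per-dimension sigmoid cross entropy while mutually exclusive ones get a single softmax cross entropy.

```python
import math

def independent_loss(labels, logits):
    """Sum of per-dimension sigmoid cross entropies: each predicted
    value is treated as an independent probability in [0, 1]."""
    total = 0.0
    for y, z in zip(labels, logits):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total

def mutually_exclusive_loss(labels, logits):
    """Softmax cross entropy: the predicted vector is one distribution
    that sums to 1 (the predictions_sum_to_1 = true case)."""
    m = max(logits)  # shift for numerical stability
    log_denom = math.log(sum(math.exp(z - m) for z in logits))
    return -sum(y * ((z - m) - log_denom) for y, z in zip(labels, logits))
```

With uniform logits over two classes, the softmax loss for a one-hot label is ln(2), while the independent loss pays ln(2) on every dimension.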
Configuration for a multi-label classification task. In this problem type there are n_classes unique possible label values overall. There can be from zero up to n_classes unique labels per example. The output is a vector of real numbers: the class probability associated with each class. It will have dimension n_classes for each example if n_classes is specified; otherwise, the dimension will be set to the number of unique class labels that are dynamically inferred from the data based on dynamic_class_spec.
Used in:
The label column. There can be one or more labels per example.
The name of the label. Assumes the label is a flat, top-level field.
A path can be used instead of a flat string if the label is nested.
The weight column.
The exact number of unique class labels.
The maximal number of label classes that should be inferred dynamically from the data.
Cross entropy for MultiLabelClassification where each target and prediction is the probability of belonging to that class, independent of other classes.
Used in:
(message has no fields)
Natural language text.
Used in:
Name of the vocabulary associated with the NaturalLanguageDomain. When computing and validating stats using TFDV, tfdv.StatsOptions.vocab_paths should map this name to a vocabulary file.
Statistics for a feature containing a NL domain.
Fraction of feature input tokens considered in-vocab.
Average token length of tokens used by the feature.
Histogram containing the distribution of token lengths.
Min / max sequence lengths.
Histogram containing the distribution of sequence lengths.
Number of sequences which do not match the location constraint.
Reported sequences that are sampled from the input and have small avg_token_length, low feature coverage, or do not match the location regex.
Statistics for specified tokens. TokenStatistics are only reported for tokens specified in SequenceValueConstraints in the schema.
The rank histogram for the tokens of the feature. The rank is a measure of how commonly the token is found in the dataset. The most common token would have a rank of 1, with the second-most common token having a rank of 2, and so on.
Used in:
Token for which the statistics are reported.
The number of times the value occurs. Stored as a double to be able to handle weighted features.
Fraction of sequences containing the token.
Min number of token occurrences within a sequence.
Average number of token occurrences within a sequence.
Maximum number of token occurrences within a sequence.
Token positions within a sequence, normalized by sequence length (e.g. a token at position 0.5 occurs in the middle of a sequence).
Checks that the absolute count difference relative to the total count of both datasets is small. This metric is appropriate for comparing datasets that are expected to have similar absolute counts, and not necessarily just similar distributions. Computed as max_i | x_i - y_i | / sum_i(x_i + y_i) for aligned datasets x and y. Results will be in the interval [0.0, 1.0] so sensible bounds should be in the interval [0.0, 1.0).
Used in:
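The comparison formula above is simple enough to sketch directly. This is an illustrative helper, not TFDV code; it assumes the two count vectors are already aligned index-by-index.

```python
def normalized_abs_count_diff(x, y):
    """max_i |x_i - y_i| / sum_i (x_i + y_i) for aligned count vectors
    x and y. The result lies in [0.0, 1.0]."""
    denom = sum(xi + yi for xi, yi in zip(x, y))
    return max(abs(xi - yi) for xi, yi in zip(x, y)) / denom
```

Because the denominator is the total count of both datasets, two datasets with similar distributions but very different absolute sizes will still score high, which is the point of this metric.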
Used in:
Pearson product-moment correlation coefficient.
Standard covariance. E[(X-E[X])*(Y-E[Y])]
Statistics for a numeric feature in a dataset.
Used in:
The mean of the values
The standard deviation of the values
The number of values that equal 0
The minimum value
The median value
The maximum value
The histogram(s) of the feature values.
Weighted statistics for the feature, if the values have weights.
Checks that the ratio of the current value to the previous value is not below the min_fraction_threshold or above the max_fraction_threshold. That is, previous value * min_fraction_threshold <= current value <= previous value * max_fraction_threshold. To specify that the value cannot change, set both min_fraction_threshold and max_fraction_threshold to 1.0.
Used in:
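The ratio check above amounts to one chained comparison. A minimal sketch (hypothetical helper name, not a TFDV API):

```python
def num_examples_ratio_ok(previous, current, min_fraction, max_fraction):
    """True iff previous * min_fraction <= current <= previous * max_fraction,
    i.e. the current value's ratio to the previous value stays in bounds."""
    return previous * min_fraction <= current <= previous * max_fraction
```

Setting both fractions to 1.0 enforces that the value cannot change at all, as noted above.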
A one-dimensional regression task. The output is a single real number, whose range is dependent upon the objective.
Used in:
The label column.
oneof label_id is required.
The name of the label. Assumes the label is a flat, top-level field.
A path can be used instead of a flat string if the label is nested.
(optional) The weight column.
When set means the label is a probability in range [0..1].
When set the label corresponds to counts from a poisson distribution. Eg: Number of googlers contributing to memegen each year.
Defines a regression problem where the labels are counts i.e. integers >=0.
Used in:
(message has no fields)
Defines a regression problem where labels are in [0, 1] and represent a probability (e.g: probability of click).
Used in:
(message has no fields)
Describes a chunk that applies to only one of the two artifacts.
Used in:
Starting line.
Contents.
A path is a more general substitute for the name of a field or feature that can be used for flat examples as well as structured data. For example, if we had data in a protocol buffer: message Person { optional int32 age = 1; optional string gender = 2; repeated Person parent = 3; } Here the path {step:["parent", "age"]} in statistics would refer to the age of a parent, and {step:["parent", "parent", "age"]} would refer to the age of a grandparent. This allows us to distinguish between the statistics of parents' ages and grandparents' ages. In general, repeated messages are to be preferred to linked lists of arbitrary length. For SequenceExample, if we have a feature list "foo", this is represented by {step:["##SEQUENCE##", "foo"]}.
Used in:
Any string is a valid step. However, whenever possible have a step be [A-Za-z0-9_]+.
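The parent/grandparent example above can be mimicked on plain nested data. This is an illustrative sketch of how a list of steps addresses nested, repeated fields (the `resolve` helper and dict-based data model are assumptions, not TFMD code):

```python
def resolve(path, record):
    """Resolve a path (a list of string steps) against nested dict data.
    Repeated fields fan out: each step maps over list elements, so one
    path can address many values."""
    values = [record]
    for step in path:
        next_values = []
        for v in values:
            child = v.get(step)
            if child is None:
                continue  # field absent at this node
            if isinstance(child, list):
                next_values.extend(child)  # repeated field: fan out
            else:
                next_values.append(child)
        values = next_values
    return values

person = {"age": 10,
          "parent": [{"age": 40, "parent": [{"age": 70}]},
                     {"age": 38}]}
```

Here ["parent", "age"] collects all parents' ages, while ["parent", "parent", "age"] collects grandparents' ages, keeping the two statistics distinct.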
Performance metrics measure the quality of a model. They need not be differentiable.
Used in:
poisson(...) DEPRECATED
Used in:
(message has no fields)
Used in:
(message has no fields)
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/PrecisionAtRecall
Used in:
Minimum required recall, in the interval (0.0, 1.0).
The mean of the prediction across the dataset.
(message has no fields)
Statistics about the presence and valency of feature values. Feature values could be nested lists. A feature in tf.Examples or other "flat" datasets has values of nest level 1 -- they are lists of primitives. A nest level N (N > 1) feature value is a list of lists of nest level (N - 1). This proto can be used to describe the presence and valency of values at each level.
Used in:
Note: missing and non-missing counts are conditioned on the upper level being non-missing (i.e. if the upper level is missing/null, all the levels nested below are by definition missing, but not counted). Number of non-missing (non-null) values.
Number of missing (null) values.
Minimum length of the values (note that nulls are not considered).
Maximum length of the values.
Total number of values.
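The per-level presence/valency idea above can be sketched on plain nested Python lists. This is an illustrative computation (using None for null values), not the TFDV implementation:

```python
def presence_and_valency(column, num_levels):
    """Per-nest-level presence/valency stats for one feature column.

    `column` is a list of nested-list values; None marks a missing
    (null) value. Counts at level N are conditioned on level N-1 being
    present, so nulls do not propagate into deeper levels' counts.
    """
    level_values = [column]  # pseudo-column wrapping for level 1
    stats = []
    for _ in range(num_levels):
        present = [v for col in level_values for v in col if v is not None]
        missing = sum(1 for col in level_values for v in col if v is None)
        lengths = [len(v) for v in present]
        stats.append({
            "num_non_missing": len(present),
            "num_missing": missing,
            "min_length": min(lengths) if lengths else 0,
            "max_length": max(lengths) if lengths else 0,
            "tot_num_values": sum(lengths),
        })
        level_values = present  # descend one nest level
    return stats
```

For a nest-level-2 feature (lists of lists of primitives), level 1 counts the outer lists and level 2 counts the inner lists' primitives.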
Description of the problem statement. For example, should describe how the problem statement was arrived at: what experiments were run, what side-by-sides were considered.
The environment of the ProblemStatement (optional). Specifies an environment string in the SchemaProto.
The target used for meta-optimization. This is used to compare multiple solutions for this problem. For example, if two solutions have different candidates, a tuning tool can use meta_optimization_target to decide which candidate performs the best. A repeated meta-optimization target implies the weighted sum of the meta_optimization targets of any non-thresholded metrics.
Tasks for heads of the generated model. This field is repeated because some models are multi-task models. Each task should have a unique name. If you wish to directly optimize this problem statement, you need to specify the objective in the task.
The data used to create a rank histogram of a non-numeric feature of a dataset. The rank of a value in a feature can be used as a measure of how commonly the value is found in the entire dataset. With bucket sizes of one, this becomes a distribution function of all feature values.
Used in:
A list of buckets in the histogram, sorted from lowest-ranked bucket to highest-ranked bucket.
An optional descriptive name of the histogram, to be used for labeling.
Each bucket defines its start and end ranks along with its count.
Used in:
The low rank of the bucket, inclusive.
The high rank of the bucket, exclusive.
The label for the bucket. Can be used to list or summarize the values in this rank bucket.
The number of items in the bucket. Stored as a double to be able to handle weighted histograms.
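Building such a rank histogram from raw values can be sketched as below. This is an illustrative helper (the dict field names mirror the bucket fields above; labeling a bucket by its first value and 1-based inclusive/exclusive ranks follow this page's description and are assumptions about the exact conventions):

```python
from collections import Counter

def rank_histogram(values, bucket_size=1):
    """Bucket values by frequency rank: the most common value has
    rank 1. low_rank is inclusive, high_rank exclusive; sample_count is
    a float so weighted histograms fit the same shape."""
    ordered = Counter(values).most_common()  # sorted most-frequent first
    buckets = []
    for start in range(0, len(ordered), bucket_size):
        chunk = ordered[start:start + bucket_size]
        buckets.append({
            "low_rank": start + 1,
            "high_rank": start + len(chunk) + 1,
            "label": chunk[0][0],  # summarize bucket by its top value
            "sample_count": float(sum(c for _, c in chunk)),
        })
    return buckets
```

With bucket_size=1 this is exactly the distribution function over all feature values mentioned above.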
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/RecallAtPrecision
Used in:
Minimum required precision, in the interval (0.0, 1.0).
Used in:
Message to represent schema information. NextID: 15
Used in:
Features described in this schema.
Sparse features described in this schema.
Weighted features described in this schema.
String domains referenced in the features.
TOP LEVEL FLOAT AND INT DOMAINS ARE UNSUPPORTED IN TFDV. TODO(b/63664182): Support this. Top-level float domains that can be reused by features.
Top-level int domains that can be reused by features.
Default environments for each feature. An environment represents both a type of location (e.g. a server or phone) and a time (e.g. right before model X is run). In the standard scenario, 99% of the features should be in the default environments TRAINING and SERVING, and the label(s) and weight are only available in TRAINING (not at serving). Other possible variations: 1. There may be TRAINING_MOBILE, SERVING_MOBILE, TRAINING_SERVICE, and SERVING_SERVICE. 2. If one is ensembling three models, where the predictions of the first three models are available for the ensemble model, there may be TRAINING, SERVING_INITIAL, SERVING_ENSEMBLE. See FeatureProto::not_in_environment and FeatureProto::in_environment.
Whether to represent variable-length features as RaggedTensors. By default they are represented as ragged, left-aligned SparseTensors. The RaggedTensor representation is more memory efficient, so turning this on will likely improve data-processing performance. Experimental and may be subject to change.
Additional information about the schema as a whole. Features may also be annotated individually.
Dataset-level constraints. This is currently used for specifying information about changes in num_examples.
TensorRepresentation groups. The keys are the names of the groups. Key "" (empty string) denotes the "default" group, which is what should be used when a group name is not provided. See the documentation at TensorRepresentationGroup for more info. Under development.
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/SensitivityAtSpecificity
Used in:
Minimum required specificity, in the interval (0.0, 1.0).
Encodes constraints on sequence lengths.
Used in:
Token values (int and string) that are excluded when calculating sequence length.
Min / max sequence length.
Used in:
An arbitrary string defining a "group" of features that could be modeled as a single joint sequence. For example, consider a dataset that contains three sequential features "purchase_time", "product_id", "purchase_price". These belong to the same sequence of purchases and could be modeled jointly. Specifying joint_group = "purchase" on all three sequences would communicate that the features can be considered part of a single conceptual sequence.
Specifies the maximum sequence length that should be processed. Sequences may exceed this limit but are expected to be truncated by modeling layers.
This enum specifies whether to treat the feature as a sequence which has meaningful element order.
Used in:
Encodes constraints on specific values in sequences.
Used in:
The value for which to express constraints. Can be either an integer or a string.
Min / max number of times the value can occur in a sequence.
Min / max fraction of sequences that must contain the value.
Used in:
SQL expression used to create a derived feature based on the extracted slice keys. It must return a result of STRUCT type.
Value type of the derived feature. The default type is string.
Indicates whether to drop struct name in the generated output.
Set default feature value when slice query fails. If the slice query fails and no default value is provided, the TFDV statistics generation pipeline will fail.
Used in:
Default type is string
A sparse feature represents a sparse tensor that is encoded with a combination of raw features, namely index features and a value feature. Each index feature defines a list of indices in a different dimension.
Used in:
Name for the sparse feature. This should not clash with other features in the same schema.
required
This field is no longer supported. Instead, use: lifecycle_stage: DEPRECATED TODO(b/111450258): remove this.
The lifecycle_stage determines where a feature is expected to be used, and therefore how important issues with it are.
Constraints on the presence of this feature in examples. Deprecated, this is inferred by the referred features.
Shape of the sparse tensor that this SparseFeature represents. Currently not supported. TODO(b/109669962): Consider deriving this from the referred features.
Features that represent indexes. Should be integers >= 0.
at least one
If true then the index values are already sorted lexicographically.
required
Type of value feature. Deprecated, this is inferred by the referred features.
Used in:
Name of the index-feature. This should be a reference to an existing feature in the schema.
Used in:
Name of the value-feature. This should be a reference to an existing feature in the schema.
sparse_top_k_categorical_accuracy(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/sparse_top_k_categorical_accuracy DEPRECATED
Used in:
(message has no fields)
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/SpecificityAtSensitivity
Used in:
Minimum required sensitivity, in the interval (0.0, 1.0).
squared_hinge(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/squared_hinge DEPRECATED
Used in:
(message has no fields)
Encodes information for domains of string values.
Used in:
Id of the domain. Required if the domain is defined at the schema level. If so, then the name must be unique within the schema.
The values appearing in the domain.
Currently unused. This enum allows the user to specify whether to treat the StringDomain as categorical.
Used in:
Statistics for a string feature in a dataset.
Used in:
The number of unique values
A sorted list of the most-frequent values and their frequencies, with the most-frequent being first.
The average length of the values
The rank histogram for the values of the feature. The rank is a measure of how commonly the value is found in the dataset. The most common value would have a rank of 1, with the second-most common value having a rank of 2, and so on.
Weighted statistics for the feature, if the values have weights.
A vocabulary file, used for vocabularies too large to store in the proto itself. Note that the file may be relative to some context-dependent directory. E.g. in TFX the feature statistics will live in a PPP and vocabulary file names will be relative to this PPP.
Counts the number of invalid UTF-8 strings present in leaf arrays for this feature. Validation is only performed for byte- or string-like features (those having type BYTES or STRING).
Used in:
The number of times the value occurs. Stored as a double to be able to handle weighted features.
Domain for a recursive struct. NOTE: If a feature with a StructDomain is deprecated, then all the child features (features and sparse_features of the StructDomain) are also considered to be deprecated. Similarly child features can only be in environments of the parent feature.
Used in:
Used in:
Describes a single task in a model and all its properties. A task corresponds to a single output of the model. Multiple tasks in the same problem statement correspond to different outputs of the model.
Used in:
Specification of the label and weight columns, and the type of the prediction or classification.
The task name. Tasks within the same ProblemStatement should have unique names. This a REQUIRED field in case of multi-task learning problems.
If a Problem is composed of multiple sub-tasks, the weight of each task determines the importance of solving each sub-task. It is used to rank and select the best solution for multi-task problems. Not meaningful for a problem with one task. If the problem has multiple tasks and all task_weight=0 (unset) then all tasks are weighted equally.
This field includes performance metrics of this head that are important to the problem owner and need to be monitored and reported. However, unlike fields such as "meta_optimization_target", these metrics are not automatically used in meta-optimization.
True to indicate the task is an auxiliary task in a multi-task setting. Auxiliary tasks are of minor relevance for the application and they are added only to improve the performance on a primary task (by providing additional regularization or data augmentation), and thus are not considered in the meta optimization process (but may be utilized in the learner optimization).
A TensorRepresentation captures the intent for converting columns in a dataset to TensorFlow Tensors (or more generally, tf.CompositeTensors). Note that one tf.CompositeTensor may consist of data from multiple columns, for example, a N-dimensional tf.SparseTensor may need N + 1 columns to provide the sparse indices and values. Note that the "column name" that a TensorRepresentation needs is a string, not a Path -- it means that the column name identifies a top-level Feature in the schema (i.e. you cannot specify a Feature nested in a STRUCT Feature).
Used in:
Used in:
Note that the data column might be of a shorter integral type. It's the user's responsibility to make sure the default value fits that type.
uint_value should only be used if the default value can't fit in an int64 (`int_value`).
A tf.Tensor
Used in:
Identifies the column in the dataset that provides the values of this Tensor.
The shape of each row of the data (i.e. does not include the batch dimension)
If this column is missing values in a row, the default_value will be used to fill that row.
A tf.RaggedTensor that models nested lists. Currently there is no way for the user to specify the shape of the leaf value (the innermost value tensor of the RaggedTensor). The leaf value will always be a 1-D tensor.
Used in:
Identifies the leaf feature that provides values of the RaggedTensor, possibly nested under struct-type sub-fields. The first step of the path refers to a top-level feature in the data. The remaining steps refer to STRUCT features under the top-level feature, recursively. If the feature has N outer ragged lists, they will become the first N dimensions of the resulting RaggedTensor and the contents will become the flat_values.
required.
The resulting RaggedTensor would be of shape: [B, D_0, D_1, ..., D_N, P_0, P_1, ..., P_M, U_0, U_1, ..., U_P], where the dimensions belong to different categories: * B: batch size dimension. * D_n: dimensions specified by the nested structure of the value path up to the leaf node, n >= 1. * P_m: dimensions specified by partitions that do not define any fixed dimension size, m >= 0. * U_p: dimensions specified by the trailing partitions of type uniform_row_length, which define the fixed inner shape of the tensor. Iterating the partitions from the end to the beginning, these dimensions are defined by all the contiguous uniform_row_length partitions present, p >= 0.
The data type of the ragged tensor's row partitions. This will default to INT64 if it is not specified.
Further partition of the feature values at the leaf level.
Used in:
If the final element(s) of partition are uniform_row_lengths [U0, U1, ...], then the resulting RaggedTensor will have its flat values (a dense tensor) be of shape [U0, U1, ...]. Otherwise, a uniform_row_length simply means a ragged dimension with row_lengths [uniform_row_length]*nrows.
Identifies a leaf feature that shares the same parent as value_feature_path and contains the partition row lengths.
RaggedTensor consists of RowPartitions. This enum allows the user to specify the dtype of those RowPartitions. If it is UNSPECIFIED, then we default to INT64.
Used in:
A tf.SparseTensor whose indices and values come from separate data columns. This will replace Schema.sparse_feature eventually. The index columns must be of INT type, and all the columns must co-occur and have the same valency at the same row.
Used in:
The dense shape of the resulting SparseTensor (does not include the batch dimension).
The columns constitute the coordinates of the values. indices_column[i][j] contains the coordinate of the i-th dimension of the j-th value.
The column that contains the values.
Specify whether the values are already sorted by their index position.
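The coordinate layout above (indices_column[i][j] is the i-th coordinate of the j-th value) can be sketched for the 2-D case. This is an illustrative materialization helper, not library code; the default fill value is an assumption.

```python
def to_dense_2d(row_idx, col_idx, values, dense_shape, default=0):
    """Materialize a dense 2-D matrix from two COO-style index columns
    and one value column: (row_idx[j], col_idx[j]) is the coordinate of
    values[j]. All three columns must have the same length (the columns
    co-occur with equal valency per row, as required above)."""
    rows, cols = dense_shape
    dense = [[default] * cols for _ in range(rows)]
    for r, c, v in zip(row_idx, col_idx, values):
        dense[r][c] = v
    return dense
```

An N-dimensional SparseTensor generalizes this with N index columns, one per coordinate dimension.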
A ragged tf.SparseTensor that models nested lists.
Used in:
Identifies the column in the dataset that should be converted to the VarLenSparseTensor.
A TensorRepresentationGroup is a collection of TensorRepresentations with names. These names may serve as identifiers when converting the dataset to a collection of Tensors or tf.CompositeTensors. For example, given the following group: { key: "dense_tensor" tensor_representation { dense_tensor { column_name: "univalent_feature" shape { dim { size: 1 } } default_value { float_value: 0 } } } } { key: "varlen_sparse_tensor" tensor_representation { varlen_sparse_tensor { column_name: "multivalent_feature" } } } Then the schema is expected to have feature "univalent_feature" and "multivalent_feature", and when a batch of data is converted to Tensors using this TensorRepresentationGroup, the result may be the following dict: { "dense_tensor": tf.Tensor(...), "varlen_sparse_tensor": tf.SparseTensor(...), }
Used in:
Configuration for a text generation task where the model should predict a sequence of natural language text.
Used in:
(optional) The weight column.
Time or date representation.
Used in:
Expected format that contains a combination of regular characters and special format specifiers. Format specifiers are a subset of the strptime standard.
Expected format of integer times.
Used in:
Number of days since 1970-01-01.
Time of day, without a particular date.
Used in:
Expected format that contains a combination of regular characters and special format specifiers. Format specifiers are a subset of the strptime standard.
Expected format of integer times.
Used in:
Time values, containing hour/minute/second/nanos, encoded into 8-byte bit fields following the ZetaSQL convention:
       6         5         4         3         2         1
MSB 3210987654321098765432109876543210987654321098765432109876543210 LSB
                       | H || M || S ||---------- nanos -----------|
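A sketch of this packing in Python follows. The exact field widths here (30-bit nanos in the low bits, 6-bit second and minute above them, then the hour) are an assumption inferred from the diagram; consult the ZetaSQL civil-time encoding before relying on them.

```python
def pack_time_nanos(hour, minute, second, nanos):
    """Pack a time-of-day into one 64-bit field, low to high:
    nanos (30 bits), second (6 bits), minute (6 bits), hour (assumed
    widths, matching the layout sketched above)."""
    return (hour << 42) | (minute << 36) | (second << 30) | nanos

def unpack_time_nanos(packed):
    """Inverse of pack_time_nanos."""
    return (packed >> 42,
            (packed >> 36) & 0x3F,
            (packed >> 30) & 0x3F,
            packed & 0x3FFFFFFF)
```

The two helpers are exact inverses for in-range values, which is the useful property of a bit-field encoding like this.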
top_k_categorical_accuracy(...) https://www.tensorflow.org/api_docs/python/tf/keras/metrics/top_k_categorical_accuracy
Used in:
(message has no fields)
Configuration for a top-K classification task. In this problem type, there are n_classes possible label values, and the model predicts n_predicted_labels labels. The output is a sequence of n_predicted_labels labels, out of n_classes possible classes. The order of the predicted output labels is determined by the predictions_order field. (*) MultiClassClassification is the same as TopKClassification with n_predicted_labels = 1. (*) TopKClassification does NOT mean multi-class multi-label classification: e.g., the output contains a sequence of labels, all coming from the same label column in the data.
Used in:
The label column.
The name of the label. Assumes the label is a flat, top-level field.
A path can be used instead of a flat string if the label is nested.
(optional) The weight column.
(optional) The number of label classes. If unset, the solution provider is expected to infer the number of classes from the data.
(optional) The number of class labels to predict. If unset, we assume 1.
Used in:
Predictions are ordered from the most likely to least likely.
Predictions are ordered from the least likely to most likely.
The type of a head or meta-objective. Specifies the label, weight, and output type of the head. TODO(martinz): add logistic regression.
Used in:
A URL, see: https://en.wikipedia.org/wiki/URL
Used in:
(message has no fields)
Describes a chunk that is the same in the two artifacts.
Used in:
The starting lines of the chunk in the two artifacts.
The contents of the chunk. These are the same in both artifacts.
Checks that the number of unique values is greater than or equal to the min, and less than or equal to the max.
Used in:
Limits on maximum and minimum number of values in a single example (when the feature is present). Use this when the minimum value count can be different than the maximum value count. Otherwise prefer FixedShape.
Used in:
Used in:
Video data.
Used in:
(message has no fields)
Common weighted statistics for all feature types. Statistics counting number of values (i.e., avg_num_values and tot_num_values) include NaNs. If the weighted column is missing, then this counts as a weight of 1 for that example. For nested features with N nested levels (N > 1), the statistics counting number of values will rely on the innermost level.
Used in:
Weighted number of examples not missing.
Weighted number of examples missing. Note that if the weighted column is zero, this does not count as missing.
Average number of values, weighted by the number of examples: avg_num_values = tot_num_values / num_non_missing.
The total number of values in this feature.
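The weighted counting rules above can be sketched on a toy column. This is an illustrative helper (the (weight, values) pair representation is an assumption for the example), not TFDV code:

```python
def weighted_common_stats(examples):
    """Weighted presence stats for one feature column.

    `examples` is a list of (weight, values) pairs, where values is a
    list of feature values or None when the feature is missing. A
    missing weight column would mean weight 1.0 per example; here the
    caller supplies explicit weights.
    """
    num_non_missing = sum(w for w, v in examples if v is not None)
    num_missing = sum(w for w, v in examples if v is None)
    tot_num_values = sum(w * len(v) for w, v in examples if v is not None)
    avg_num_values = tot_num_values / num_non_missing
    return num_non_missing, num_missing, avg_num_values, tot_num_values
```

Note that avg_num_values = tot_num_values / num_non_missing, matching the relation stated above.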
Represents a weighted feature that is encoded as a combination of raw base features. The `weight_feature` should be a float feature with a shape identical to the `feature`. This is useful for representing weights associated with categorical tokens (e.g. a TFIDF weight associated with each token). TODO(b/142122960): Handle WeightedCategorical end to end in TFX (validation, TFX Unit Testing, etc)
Used in:
Name for the weighted feature. This should not clash with other features in the same schema.
required
Path of a base feature to be weighted. Required.
Path of weight feature to associate with the base feature. Must be same shape as feature. Required.
The lifecycle_stage determines where a feature is expected to be used, and therefore how important issues with it are.
Statistics for a weighted feature with an NL domain.
Used in:
Weighted feature coverage.
Weighted average token length.
Histogram containing the distribution of token lengths.
Histogram containing the distribution of sequence lengths.
Weighted number of sequences that do not match the location constraint.
Per-token weighted statistics.
The rank histogram with the weighted tokens for the feature.
Statistics for a weighted numeric feature in a dataset.
Used in:
The weighted mean of the values
The weighted standard deviation of the values
The weighted median of the values
The histogram(s) of the weighted feature values.
Statistics for a weighted string feature in a dataset.
Used in:
A sorted list of the most-frequent values and their weighted frequencies, with the most-frequent being first.
The rank histogram for the weighted values of the feature.