Specification of a boolean column.
Used in:
Number of true values.
Number of false values.
Used in:
Minimum frequency of a categorical value not to be replaced by the <RARE> special value.
Maximum number of unique categorical values. If more values are present, the less frequent values are considered <OOD>.
If is_already_integerized=false, a dictionary is built for the feature, even if the feature is an integer or a float. If is_already_integerized=true, the value is directly interpreted as an index and should follow the following convention:
- The value should be greater than or equal to -1.
- The value -1 is the "missing value".
- The value 0 is the "out-of-dictionary value".
- Several YDF algorithms assume this is a "dense index", i.e. if the column is an input feature, it is best for the indices to be dense.
If "is_already_integerized=true" and "number_of_already_integerized_values" is set, "number_of_already_integerized_values" is the number of unique values. Such an attribute accepts values in [-1, number_of_already_integerized_values). Values outside of this range are considered "out-of-vocabulary". Note that if the dataset used to infer the dataspec contains an example with a value > number_of_already_integerized_values, the example value will be used instead of "number_of_already_integerized_values".
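The integerized-value convention above can be sketched in Python. This is a minimal illustration, not part of the YDF API; the function name is ours:

```python
def interpret_integerized_value(value: int, num_values: int) -> str:
    """Interprets a pre-integerized categorical value: -1 is the missing
    value, 0 is out-of-dictionary, and values outside
    [-1, num_values) are also treated as out-of-vocabulary."""
    if value < -1:
        raise ValueError("Integerized values must be >= -1")
    if value == -1:
        return "missing"
    if value == 0 or value >= num_values:
        return "out-of-dictionary"
    return f"in-dictionary index {value}"

print(interpret_integerized_value(-1, 10))  # missing
print(interpret_integerized_value(0, 10))   # out-of-dictionary
print(interpret_integerized_value(12, 10))  # out-of-dictionary
print(interpret_integerized_value(3, 10))   # in-dictionary index 3
```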
If set, overrides the most_frequent_item. The most frequent item is used by the global imputation algorithm to handle missing values, i.e. missing values are treated as the most frequent item. Overriding the most frequent item is only allowed on columns that do not contain any missing values.
Used in:
Overriding is only possible for non-integerized columns.
Specification of a categorical column.
Used in:
The most frequent value.
The number of unique values (including the reserved OOD(=0) value). All the values should satisfy 0 <= value < number_of_unique_values. The value "0" is reserved for the out-of-dictionary value. Therefore, in the case of a categorical column with two possible values "X" and "Y", the proto will be:
  number_of_unique_values = 3
  is_already_integerized = false
  items { key: "OOD" value { index: 0 } }
  items { key: "X" value { index: 1 } }
  items { key: "Y" value { index: 2 } }
Missing values are implicit and take index=-1. They don't need to be specified in "items".
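The dictionary convention can be illustrated with a small Python sketch (the dictionary and function name are ours, mirroring the "X"/"Y" example above):

```python
# Dictionary of the example: OOD is reserved at index 0.
items = {"OOD": 0, "X": 1, "Y": 2}  # number_of_unique_values == 3

def encode(value):
    """Maps a raw string value to its categorical index. Missing values
    (None here) map to the implicit index -1; values absent from the
    dictionary fall back to the out-of-dictionary index 0."""
    if value is None:
        return -1  # missing value, implicit in "items"
    return items.get(value, 0)  # unseen value -> OOD

print(encode("X"), encode("Y"), encode("Z"), encode(None))
```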
Minimum frequency of a value not to be replaced by the <OOD> special value. Used when computing value dictionary.
Maximum number of unique categorical values. If more values are present, the less frequent values are considered <OOD>. Used when computing value dictionary. If "max_number_of_unique_values" == -1, the items are not pruned.
If true, values are interpreted directly as integers. If false, values are indexed in the "items" dictionary.
Dictionary of values. Only available if is_already_integerized=false. In this case, items.size() is equal to number_of_unique_values.
If true, integer categorical values provided by the user have been offset by 1. Such pre-processing is done in TensorFlow Decision Forests. See "CATEGORICAL_INTEGER_OFFSET".
This is an alternative to the CategoricalSpec that does not use a map. This message is for internal use only. It is binary compatible with CategoricalSpec. This message may be removed at any point without warning.
Used in:
Possible value of a non-integerized categorical, categorical set, or categorical list attribute.
Used in:
Index of the value.
Frequency of the value.
Definition of a column in a dataset.
Used in:
Type of data.
Column unique name.
If true, the type was set manually by the user (instead of being automatically detected). This field is purely used for debugging purposes and has no impact on the computation. Note that if a column guide matches this column, and this column guide does not contain a type, is_manual_type is set to false (as if there were no column guide match).
Tokenization. For non-integerized list or sets columns (numerical or categorical).
Data for numerical (simple, list or set) attribute types.
Data for categorical (simple, list or set) attribute types.
Number of NA (i.e. not available) values recorded when building the dataspec.
Numerical value stored as an index + a dictionary.
Data for boolean attribute types.
For all the types defined as a collection of multiple values.
For data of type NUMERICAL_VECTOR_SEQUENCE.
Is the feature derived from unstacking a multi-dimensional column?
Storage representation of a column. Internally, a feature's representation is determined by its semantic: for instance, a NUMERICAL feature is always stored as a float32. DTypes are used to record the feature representation fed to YDF, and are then used by APIs without automatic casting.
Used in:
Regular expression on the column name.
Type of the column.
If "tokenizer" is specified, and if the dataset container can represent a list of tokens natively (i.e. a list of strings, e.g. tf.Example), the first string entry (if any) will be tokenized. If the attribute contains more than one entry, an error will be raised.
If true, a column can be matched against multiple different "ColumnGuide"s, with the last ColumnGuide having higher priority. For example, if the "type" is set in two matching column guides, the type defined in the last column guide will be used. If false, an error will be raised if more than one column guide matches a column.
If true, matching columns are ignored and won't be in the dataspec.
Type of dataset columns.
Used in:
A numerical vector sequence value is a sequence (e.g. a list) of numerical vectors. A numerical vector is a sequence of floats. The number of vectors in a sequence can vary from one example to another, and some examples can have empty sequences. The length of the vectors is fixed for all the vectors in a dataset (i.e., not just for the vectors in a sequence). Semantically, the i-th values of all vectors are expected to represent the same type of data. An empty sequence is different from an unknown value sequence (i.e., there is a sequence but we don't know what it is). Numerical vector sequences can be used to represent multivariate time series or sequences of embeddings (such as the ones in transformer architectures).
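The constraints above (variable sequence length, fixed vector length, empty vs. missing sequence) can be sketched in Python; the function and the use of None for a missing sequence are illustrative choices, not YDF API:

```python
def check_vector_sequence(sequence, vector_length):
    """Validates a numerical vector sequence: every vector must have the
    dataset-wide fixed length. An empty sequence is valid and distinct
    from a missing sequence (represented as None in this sketch)."""
    if sequence is None:
        return "missing"
    for vector in sequence:
        if len(vector) != vector_length:
            raise ValueError("all vectors in a dataset must have the same length")
    return f"sequence of {len(sequence)} vectors"

print(check_vector_sequence([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], 3))
print(check_vector_sequence([], 3))    # empty, but not missing
print(check_vector_sequence(None, 3))  # missing
```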
Storage representation of a column.
Used in:
Specification of the columns of a dataset. Lists the available columns, including their name, type, and extra information (e.g. dictionaries).
Used in:
The columns.
The number of rows of the dataset used to create this dataspec (if a dataset was used).
Meta-data about features that were unstacked e.g. with the "unstack_numerical_set_as_numericals" control field.
Structure containing intermediary information for the computation of a DataSpecification.
Used in:
Sum and sum of error for the Kahan summation. Used for numerical columns.
Mapping between float values (represented as uint32) and the number of times each value was seen. Note: proto maps don't allow float keys.
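The "sum and sum of error" fields above correspond to compensated (Kahan) summation. A minimal sketch of the technique (not the YDF implementation):

```python
def kahan_sum(values):
    """Kahan (compensated) summation: keeps a running error term so that
    the low-order bits lost by naive float accumulation are re-injected
    into subsequent additions."""
    total = 0.0
    error = 0.0  # compensation for lost low-order bits
    for value in values:
        y = value - error      # re-inject the previous error
        t = total + y          # low-order bits of y may be lost here
        error = (t - total) - y  # recover what was lost
        total = t
    return total, error

# Naive summation loses the small values entirely; Kahan keeps them.
values = [1e16] + [1.0] * 1000
print(sum(values))           # 1e16
print(kahan_sum(values)[0])  # 1.0000000000001e16
```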
Configuration for the automated "inference" logic of the data specification (see header for the definition of data specification). For example, the DataSpecificationGuide makes it possible to express the following:
- The column called "feature_1" is NUMERICAL.
- The columns matching the regex "num_feature_.*" are NUMERICAL.
- Ignore the column called "feature_1".
- Ignore the columns matching the regex "num_feature_.*".
- Ignore the columns matching none of the set rules.
- The column called "feature_1" is a CATEGORICAL_SET and should be tokenized by commas.
- The column called "feature_1" is a CATEGORICAL, and categorical values seen less than 50 times should be ignored (considered out-of-bag).
- The dictionaries of CATEGORICAL and CATEGORICAL_SET columns should not have more than 1000 items.
- Columns that look BOOLEAN should be interpreted as NUMERICAL.
- Use the first 100'000 records in the dataset to best infer the semantic of the columns.
Used in:
Guide applied to one or a sub-set of columns according to a regular expression match.
Default guide applied to all columns. Also applies to columns matched by "column_guides", but with a lower priority: for example, if a configuration option is set both in "default_column_guide" and "column_guides", the value in "column_guides" will be used.
If true, columns that don't match any "column_guides" regular expression are ignored.
Maximum number of rows to scan to infer the column types. Set the value "-1" to use all rows (i.e. use the entire dataset). Note: The type inference logic is only used if the user does not specify the type manually.
If true, columns initially detected as BOOLEAN (i.e. only containing "0" and "1" values) will be detected as NUMERICAL.
Detects numerical values (i.e. NUMERICAL) as DISCRETIZED_NUMERICAL. DISCRETIZED_NUMERICAL values are discretized at loading time. Some algorithms (e.g. the YDF decision forest algorithms) will handle NUMERICAL and DISCRETIZED_NUMERICAL types differently. Generally, discretized columns are faster to train but can lead to sub-optimal models.
Maximum number of rows to scan to compute column statistics (e.g. dictionary, ratio of missing values, mean value). Set the value "-1" to use all rows (i.e. use the entire dataset).
If true, numerical sets are unstacked into multiple numerical features. This operation is useful to consume multi-dimensional numerical vectors, i.e. lists of numerical values that always have the same size and semantic per dimension.
Remove columns of unknown type, for example columns that have no values (all the values are missing) and whose type is not specified by the user.
Allow automatic inference of the CATEGORICAL_SET type by applying tokenization. If not set, the inference code will still set the type to CATEGORICAL_SET if the (default) column guide asks for it.
Supported dataset formats.
Used in:
Minimum number of examples in a bin.
Specification of a discretized numerical column. A "discretized numerical" value "i" is encoded as an index (integer) between -1 (inclusive) and "n = boundaries.size()" (also inclusive):
- If i==-1, the value is missing.
- If i==0, the original numerical value is strictly lower than "boundaries.front()".
- If i==boundaries.size(), the original value is higher than or equal to "boundaries.back()".
- If i \in [1, boundaries.size()), the original value is in between "boundaries[i-1]" and "boundaries[i]".
Because encoding a numerical value into a discretized numerical value is lossy, the original numerical value cannot be recovered. In this case, the following logic is applied:
- If i==-1, the numerical value is "std::nan" (corresponding to a missing value).
- If i==0, the numerical value is "boundaries.front()-1".
- If i==boundaries.size(), the numerical value is "boundaries.back()+1".
- If i \in [1, boundaries.size()), the numerical value is "(boundaries[i-1]+boundaries[i])/2".
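The encode/decode logic above can be sketched in Python (function names are illustrative; YDF implements this in C++):

```python
import bisect
import math

def discretize(value, boundaries):
    """Encodes a numerical value into a discretized index: -1 for
    missing (NaN), 0 if value < boundaries[0], len(boundaries) if
    value >= boundaries[-1], otherwise i with
    boundaries[i-1] <= value < boundaries[i]."""
    if math.isnan(value):
        return -1
    return bisect.bisect_right(boundaries, value)

def undiscretize(index, boundaries):
    """Lossily decodes a discretized index back to a representative
    numerical value, following the rules above."""
    if index == -1:
        return float("nan")
    if index == 0:
        return boundaries[0] - 1
    if index == len(boundaries):
        return boundaries[-1] + 1
    return (boundaries[index - 1] + boundaries[index]) / 2

b = [0.0, 1.0, 2.0]
print(discretize(-0.5, b), discretize(0.5, b), discretize(5.0, b))
print(undiscretize(0, b), undiscretize(2, b), undiscretize(3, b))
```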
Used in:
Boundaries in between the bins. The number of bins is boundaries.size() + 1.
Number of unique numerical values before the discretization.
Maximum number of bins (at construction time). Defaults to 255 bins, i.e. 254 boundaries.
Minimum number of examples in a bin.
One example (also called an observation, record, or sample).
Used in:
Attribute values indexed by the attribute index defined in the dataspec.
Example index.
Attribute value.
Used in:
Note: This value will be loaded as a "DiscretizedNumericalIndex = uint16".
Value for multi-dimensional categorical attributes.
Used in:
Value for multi-dimensional numerical attributes.
Used in:
An ordered sequence of numerical vectors.
Used in:
Used in:
Internal linked version of the weight definition. The attributes and values are indexed according to the dataspec.
Used in:
Attribute index used to compute the weight.
Weight definition if the controlling attribute is a numerical attribute.
Weight definition if the controlling attribute is a categorical attribute.
Used in:
Index of "categorical_mapping". Maps a weight value to each categorical attribute value. See the dataspec for the mapping from attribute value string to attribute value index.
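The two weighting modes (numerical and categorical) can be sketched in Python; the function names and the example mapping are ours:

```python
def numerical_weight(attribute_value):
    """Numerical weighting: the weight is directly the attribute value."""
    return float(attribute_value)

def categorical_weight(value_index, categorical_mapping):
    """Categorical weighting: look up the weight associated with the
    categorical attribute value index (index 0 is the OOD value)."""
    return categorical_mapping[value_index]

# Example mapping: one weight per categorical value index.
weights = [1.0, 2.0, 0.5]
print(numerical_weight(2.5))           # 2.5
print(categorical_weight(1, weights))  # 2.0
```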
Used in:
(message has no fields)
Specification for types with multiple values.
Used in:
Maximum number of observed items.
Minimum number of observed items.
Note: Depending on the type of the column, observations with more values than "max_observed_size" or fewer values than "min_observed_size" might still be valid.
Used in:
(message has no fields)
Specification of a numerical column.
Used in:
Mean value (excluding NaN values).
Used in:
The length of the vectors.
Number of values (i.e., floats) seen.
Minimum and maximum number of vectors seen.
Options for the synthetic generation of datasets.
Next ID: 21
Number of examples in the dataset.
Name of the label column.
Name of the feature columns, where "{type}" is the short feature type (e.g. "num(erical)", "cat(egorical)") and "{index}" is the feature index (among other features of the same type).
Number of features by semantic. "num_categorical" and "num_categorical_set" are each used twice, for the string and integer representations, e.g. categorical_string, categorical_int.
Dictionary sizes.
Number of dimensions of "multidimensional_numerical" features.
If false, numerical values are represented as floats. If true, they are represented as integers.
If true, the value zero (0) of categorical and categorical set values (both for features and labels) is used to represent an out-of-vocabulary value (and the first real value is 1). If false, zero (0) is a categorical value like the others.
Average number of items in a categorical set feature.
Probability for a feature value to be missing.
How much noise to inject in the label. The problem can be perfectly solved with "label_noise_ratio=0", and cannot be solved better than random for "label_noise_ratio=1" (if there are no other sources of noise).
Seed used to initialize the random generator used to generate the dataset. If set to -1, the random generator is initialized using std::random_device.
Number of accumulators. Accumulators are internal structures used to generate the dataset. Increasing the value will increase the "conditional independence" of the dataset, i.e. having more tuples <FS1, FS2, X> such that "Label ⊥ FS1 | FS2=X", with FS1 and FS2 two sets of features. Decreasing the value will make the dataset more "naively independent", i.e. increase the tendency of "P(Label Fi | Fj) == P(Label Fi) if j!=i". The value should be odd and between 1 and the total number of features (i.e. the sum of "num_{numerical, categorical, ...}"). Even values will be rounded down. The exact use of the accumulators is described in "synthetic_dataset.h".
The task represented by the labels.
Default.
Number of examples to inject in each shard. Requires the dataset paths to be sharded (i.e. to end with @<number of shards>). Set to -1 to disable dataset sharding.
Used in:
Number of label classes. 2 => binary classification.
Is the label stored as a string or an integer?
Used in:
Name of the column containing the group index. In document/query scoring, the group would be the queries.
Number of examples in each group. The last group might have fewer examples if num_examples % group_size != 0.
Used in:
(message has no fields)
Tokenization parameters.
Used in:
How to convert a string into a list/set of symbols.
Separator characters. Used if splitter=SEPARATOR.
Splitting regular expression. Used if splitter=REGEX_MATCH.
Cast strings to lower case before tokenization.
Grouping of the tokens.
Used in:
Possible string tokenization algorithms.
Used in:
Split a string according to the user specified separator.
Split a string by extracting token using the user specified regular expression.
Split a string into individual characters. Does not remove spaces and non-printable characters.
Never split a string. Useful if CATEGORICAL_SET features should be avoided.
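The four splitting modes above can be sketched in Python. This is an illustration, not the YDF tokenizer: the mode names for the character and no-splitting cases are our guesses from the descriptions, and the default separator and regex are arbitrary:

```python
import re

def tokenize(text, splitter, separator=";", regex=r"[a-zA-Z0-9]+",
             to_lower_case=True):
    """Converts a string into a list of tokens, per the splitter mode."""
    if to_lower_case:
        text = text.lower()  # cast to lower case before tokenization
    if splitter == "SEPARATOR":
        return text.split(separator)       # user-specified separator
    if splitter == "REGEX_MATCH":
        return re.findall(regex, text)     # extract tokens by regex
    if splitter == "CHARACTER":
        return list(text)                  # keeps spaces/non-printables
    if splitter == "NO_SPLITTING":
        return [text]                      # whole string as one token
    raise ValueError(f"Unknown splitter: {splitter}")

print(tokenize("A;b;C", "SEPARATOR"))
print(tokenize("Hello, world", "REGEX_MATCH"))
```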
Used in:
Information about unstacked column. An unstacked column is a multi-dimensional column (e.g. an embedding) that has been split into multiple scalar columns.
Used in:
Name of the column that was unstacked.
Index of the first column containing the unstacked feature.
Number of unstacked elements.
Type of the columns.
Used in:
[Required] Name of the attribute that controls the weights of the examples.
The attribute is interpreted as a numerical value.
The attribute is interpreted as a categorical attribute. A weight is defined for each possible value.
Look up the following mapping to get the weight.
Used in:
Pair of categorical value and weight.
Used in:
[Required] A value to map to a corresponding weight.
[Required] The weight.
The weight is directly the numerical value. Note that for Ranking problems, the ranking is per group and all weights of the same group should be identical.
Used in:
(message has no fields)