Metadata relative to the entire cache.
Number of examples in the entire cache.
Index of the bit used to encode a change between successive values in the example indices.
Number of shards, i.e. the data of each feature is divided into "num_shards" files.
Number of shards for the index cache.
Information about the columns.
Index of the label column in the dataspec.
Index of the group column in the dataspec.
Index of the weight column in the dataspec. Not set if the training is non-weighted.
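The delta bit can be illustrated with a small sketch. It assumes one plausible convention: in a list of example indices ordered by feature value, the bit at position "delta_bit_idx" is set on an entry whose value is strictly greater than the value of the previous entry. The function names and this exact convention are hypothetical and are not taken from the library.

```python
def encode_sorted_indices(values, delta_bit_idx):
    """Return example indices ordered by value, flagged where the value changes.

    Assumption (illustrative only): bit `delta_bit_idx` of an entry is set when
    its value is strictly greater than the value of the previous entry.
    """
    if len(values) >= (1 << delta_bit_idx):
        raise ValueError("delta_bit_idx is too small for this many examples")
    delta_mask = 1 << delta_bit_idx
    # sorted() is stable, so ties keep the original dataset order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    encoded = []
    previous = None
    for example_idx in order:
        entry = example_idx
        if previous is not None and values[example_idx] > previous:
            entry |= delta_mask
        encoded.append(entry)
        previous = values[example_idx]
    return encoded


def decode_entry(entry, delta_bit_idx):
    """Split an encoded entry into (example_idx, value_changed)."""
    delta_mask = 1 << delta_bit_idx
    return entry & (delta_mask - 1), bool(entry & delta_mask)
```

With this layout, a reader can scan the index list once and detect every boundary between distinct values without loading the values themselves.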
Is the column available in the cache.
Type-specific column information.
(message has no fields)
Configuration for the creation of a cache.
Indicative size of a file in the index cache, expressed in bytes.
20MB
Optional index of the label column in the dataspec.
Optional index of the group column in the dataspec. Only used for ranking training.
Optional index of the weight column in the dataspec. Not set if the training is non-weighted.
If true, and if "weight_column_idx" is set, the cache does not include examples with weight = 0.
Maximum number of unique values of a numerical feature to allow its pre-discretization. In case of large datasets, discretized numerical features with a small number of unique values are more efficient to learn than classical / non-discretized numerical features. This parameter does not impact the final model. However, it can speed up or slow down the training.
If false, only the numerical columns satisfying "max_unique_values_for_discretized_numerical" will be discretized. If true, all the numerical columns will be discretized. Columns with more than "max_unique_values_for_discretized_numerical" unique values will be approximated with "max_unique_values_for_discretized_numerical" bins. This parameter will impact the model training.
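The interaction between the two discretization options can be sketched as follows. This is a minimal illustration, not the library's algorithm: the function names are hypothetical, the exact case places boundaries midway between consecutive unique values, and the approximate case uses simple quantile boundaries; both choices are assumptions.

```python
import bisect


def discretization_boundaries(values, max_unique_values, force_all):
    """Return sorted bin boundaries, or None if the column is not discretized."""
    unique = sorted(set(values))
    if len(unique) <= max_unique_values:
        # Few unique values: one bin per unique value (lossless).
        return [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    if not force_all:
        # Too many unique values and discretization is not forced.
        return None
    # Forced: approximate the column with at most `max_unique_values` bins.
    ordered = sorted(values)
    n = len(ordered)
    picks = [ordered[k * n // max_unique_values] for k in range(1, max_unique_values)]
    return sorted(set(picks))


def discretize(values, boundaries):
    """Map each value to the index of its bin."""
    return [bisect.bisect_right(boundaries, v) for v in values]
```

In the lossless case the bins preserve every distinct value, which is consistent with the statement that the first parameter does not change the final model. In the forced case several distinct values can share a bin, which is why that option can affect training.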
Indices of the features accessible in reading operations.
If set, loads all the available features. In this case, "features" should be empty.
Number of values to read at each reading call.
Load and read the cache from memory, or read the cache from disk.
Partial metadata from a subset of observations of a given column obtained during the creation of a dataset cache.
Number of examples in the shard.
Same as "number_of_unique_values" in the dataspec. Integer values should be in [-1, number_of_unique_values).
Same as "items" in the dataspec. Dictionary for a categorical string feature.
Partial metadata during the creation of a dataset cache.
Information relative to one shard in the feature cache.
Number of examples in the shard.
Information relative to one pre-sorted column in the index cache.
Request message of the workers.
Each of the following actions is made available by a function of the same name in "column_cache.h".
Separate the columns of a dataset into individual files (per column and per shard).
Sort or discretize a numerical column (depending on the number of unique values). Sorting a numerical column consists of exporting both the unique attribute values (sorted by value) and the example indices (ordered by attribute value). Discretizing a numerical column consists of exporting the unique attribute values (sorted by value, same as before) and the index of each attribute value in this table (sorted by example index, i.e. the original dataset ordering). Currently, this stage is implemented with in-memory sorting, i.e. the worker should be able to load an entire column in memory (4 bytes * number of examples). If the number of unique values is small enough, this stage can compute the optimal discretization (using the sorted values) and then export the discretized values.
Copy and transform the raw data in the partial cache format into the raw data in the (final) cache format.
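The two export modes of the sort-or-discretize action can be sketched as below. The function names are hypothetical and the in-memory lists stand in for the files that a worker would write; the threshold argument is assumed to play the role of "max_unique_values_for_discretized_numerical".

```python
def presort_column(values):
    """Pre-sorted export: unique values (ascending) and the example indices
    ordered by attribute value."""
    unique_values = sorted(set(values))
    example_indices = sorted(range(len(values)), key=lambda i: values[i])
    return unique_values, example_indices


def prediscretize_column(values):
    """Pre-discretized export: unique values (ascending) and, for each example
    in dataset order, the index of its value in that table."""
    unique_values = sorted(set(values))
    position = {v: i for i, v in enumerate(unique_values)}
    return unique_values, [position[v] for v in values]


def export_numerical_column(values, max_unique_values):
    """Pick the export mode from the number of unique values."""
    if len(set(values)) > max_unique_values:
        return ("sorted",) + presort_column(values)
    return ("discretized",) + prediscretize_column(values)


def column_memory_bytes(num_examples, bytes_per_value=4):
    """Memory needed to hold one column in memory for the sort."""
    return bytes_per_value * num_examples
```

For example, a column of 1,000,000 examples needs about 4,000,000 bytes under the 4-bytes-per-example estimate given above.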
Same as "items" in the dataspec. Dictionary for a categorical string feature of the result.
Replace the NaN values.
Path to the dataset (or a subset of the dataset).
Output directory.
Columns to extract. The other columns are ignored.
Corresponding shard index in the feature cache.
If set, index of a numerical column. Examples with a zero value for this column are removed from the cache.
Number of shards for the feature.
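The column separation request can be pictured with a short sketch: keep only the requested columns, and skip examples whose value in the optional zero-filter column is zero. The function and parameter names are hypothetical, and rows are plain Python lists rather than the dataset formats a worker would actually read.

```python
def separate_columns(rows, columns, remove_zero_column_idx=None):
    """Split row-wise examples into one list of values per requested column.

    rows: list of examples, each a list of column values.
    columns: indices of the columns to extract; other columns are ignored.
    remove_zero_column_idx: if set, examples with a zero value in this
        column are dropped (assumed to be the weight column).
    """
    per_column = {col: [] for col in columns}
    for row in rows:
        if remove_zero_column_idx is not None and row[remove_zero_column_idx] == 0:
            continue  # Zero-valued example: not written to the cache.
        for col in columns:
            per_column[col].append(row[col])
    return per_column
```

Each resulting list would then be written to its own file, one per column and per shard.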
Output directory.
Total number of examples in the column.
Delta bit.
Index of the column.
Number of shards in the input data.
Number of examples to write in each output shard.
Depending on the number of unique values, the numerical column will be exported as pre-sorted or pre-discretized. If the number of unique values is > max_unique_values_for_discretized_numerical, the output will contain sorted numerical values. Otherwise, the output will contain in-order discretized numerical values.
Value replacing missing values in the input. This is only used to compute the returned metadata message (as missing values have been filtered in the previous stage).
If true, force the discretization of the column. If the column contains more than "max_unique_values_for_discretized_numerical" unique values, the column is discretized using "max_unique_values_for_discretized_numerical" quantiles.
Number of expected shards in the output.
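The relation between the total number of examples, the number of examples per output shard, and the expected number of output shards can be sketched with simple arithmetic. This assumes that shards are filled in order and that only the last one may be partially filled; the function name is hypothetical.

```python
def shard_ranges(num_examples, num_examples_per_shard):
    """Return the [begin, end) example range covered by each output shard."""
    if num_examples_per_shard <= 0:
        raise ValueError("num_examples_per_shard must be positive")
    return [
        (begin, min(begin + num_examples_per_shard, num_examples))
        for begin in range(0, num_examples, num_examples_per_shard)
    ]
```

Under this assumption, the expected number of output shards is the length of the returned list, i.e. the ceiling of num_examples / num_examples_per_shard.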
Result message of the worker.
(message has no fields)
Welcome message of the worker.
No welcome data yet.
(message has no fields)