package yggdrasil_decision_forests.model.distributed_decision_tree.dataset_cache.proto

Metadata relative to the entire cache.

optional int64 num_examples = 1
Number of examples in the entire cache.
optional int32 delta_bit_idx = 7
Index of the bit used to encode the change between successive values in example indices.
optional int32 num_shards_in_feature_cache = 2
Number of shards, i.e. the data of each feature is divided into "num_shards" files.
optional int32 num_shards_in_index_cache = 3
Number of shards for the index cache.
repeated CacheMetadata.Column columns = 4
Information about the columns.
optional int32 label_column_idx = 5
Index of the label column in the dataspec.
optional int32 group_column_idx = 8
Index of the group column in the dataspec.
optional int32 weight_column_idx = 6
Index of the weight column in the dataspec. Not set if the training is non-weighted.

Used in: Column

optional bool replacement_missing_value = 2

Used in: Column

optional int64 num_values = 1
optional int32 replacement_missing_value = 2

Used in: CacheMetadata

optional bool available = 1
Is the column available in the cache.
oneof type
Type-specific column information.
- NumericalColumn numerical = 2
- CategoricalColumn categorical = 3
- BooleanColumn boolean = 4
- HashColumn hash = 5

Used in: Column

(message has no fields)

Used in: Column, SortedColumnMetadata, WorkerResult.SortNumericalColumn

optional float replacement_missing_value = 1
optional int64 num_unique_values = 2
optional bool discretized = 3
optional int32 discretized_replacement_missing_value = 4
optional int64 num_discretized_shards = 5
optional int32 num_discretized_values = 6

Configuration for the creation of a cache.

Used in: distributed_gradient_boosted_trees.proto.DistributedGradientBoostedTreesTrainingConfig

optional int64 index_cache_file_size_bytes = 1
Indicative size of a file in the index cache, expressed in bytes.
20MB
optional int32 label_column_idx = 2
Optional index of the label column in the dataspec.
optional int32 group_column_idx = 7
Optional index of the group column in the dataspec. Only used for ranking training.
optional int32 weight_column_idx = 3
Optional index of the weight column in the dataspec. Not set if the training is non-weighted.
optional bool remove_zero_weighted_examples = 4
If true, and if "weight_column_idx" is set, the cache does not include examples with weight = 0.
optional int64 max_unique_values_for_discretized_numerical = 5
Maximum number of unique values of a numerical feature to allow its pre-discretization. In case of large datasets, discretized numerical features with a small number of unique values are more efficient to learn than classical / non-discretized numerical features. This parameter does not impact the final model. However, it can speed up or slow down the training.
optional bool force_numerical_discretization = 6
If false, only the numerical columns satisfying "max_unique_values_for_discretized_numerical" will be discretized. If true, all the numerical columns will be discretized. Columns with more than "max_unique_values_for_discretized_numerical" unique values will be approximated with "max_unique_values_for_discretized_numerical" bins. This parameter will impact the model training.
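
The interaction between "max_unique_values_for_discretized_numerical" and "force_numerical_discretization" can be summarized as a small decision rule. The following is a minimal Python sketch, not the library's implementation: the function name plan_numerical_column and its return format are hypothetical, and using one bin per unique value in the exact case is an assumption.

```python
from typing import Tuple


def plan_numerical_column(num_unique_values: int,
                          max_unique_values: int,
                          force_discretization: bool) -> Tuple[str, int]:
    """Returns (storage, num_bins); num_bins is 0 for a pre-sorted column.

    Hypothetical helper illustrating the documented rule only.
    """
    if num_unique_values <= max_unique_values:
        # Few unique values: the column is pre-discretized.
        # Assumption: one bin per unique value.
        return ("discretized", num_unique_values)
    if force_discretization:
        # Too many unique values, but discretization is forced:
        # the column is approximated with max_unique_values bins.
        return ("discretized", max_unique_values)
    # Otherwise the column is kept as a classical pre-sorted column.
    return ("sorted", 0)
```

For instance, a column with 50,000 unique values and a limit of 16,000 would be stored pre-sorted unless discretization is forced, in which case it would be approximated with 16,000 bins.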

Used in: distributed_gradient_boosted_trees.proto.DistributedGradientBoostedTreesTrainingConfig

repeated int32 features = 1
Indices of the features accessible in reading operations.
optional bool load_all_features = 4
If set, loads all the available features. In this case, "features" should be empty.
optional int32 reading_buffer = 2
Number of values to read at each reading call.
optional bool load_cache_in_memory = 3
Load and read the cache from memory, or read the cache from disk.

Partial metadata from a subset of observations of a given column obtained during the creation of a dataset cache.

optional int64 num_examples = 1
Number of examples in the shard.
optional int64 num_missing_examples = 2
oneof type
- PartialColumnShardMetadata.NumericalColumn numerical = 3
- PartialColumnShardMetadata.CategoricalColumn categorical = 4

Used in: PartialColumnShardMetadata

optional int64 number_of_unique_values = 1
Same as "number_of_unique_values" in the dataspec. Integer values should be in [-1, number_of_unique_values).
map<string, dataset.proto.CategoricalSpec.VocabValue> items = 2
Same as "items" in the dataspec. Dictionary for a categorical string feature.

Used in: PartialColumnShardMetadata

optional double mean = 1
optional double min = 2
optional double max = 3

Partial metadata during the creation of a dataset cache.

repeated string column_names = 1
optional int32 num_shards = 2

Information relative to one shard in the feature cache.

optional int64 num_examples = 1
Number of examples in the shard.

Information relative to one pre-sorted column in the index cache.

optional CacheMetadata.NumericalColumn metadata = 1

Request message of the workers.

oneof type
Each of the following actions is made available by a function of the same name in "column_cache.h".
- WorkerRequest.SeparateDatasetColumns separate_dataset_columns = 1
  Separate the columns of a dataset into individual files (per column and per shard).
- WorkerRequest.SortNumericalColumn sort_numerical_column = 2
  Sort or discretize a numerical column (depending on the number of unique values). Sorting a numerical column consists in exporting both the unique attribute values (sorted by value) and the example indices (ordered by attribute value). Discretizing a numerical column consists in exporting the unique attribute values (sorted by value, same as before) and the index of each attribute value in this table (ordered by example index, i.e. the original dataset ordering). Currently, this stage is implemented with in-memory sorting, i.e. a worker should be able to load an entire column in memory (4 bytes * number of examples). If the number of unique values is small enough, this stage can compute the optimal discretization (using the sorted values) and then export the discretized values.
- WorkerRequest.ConvertPartialToFinalRawData convert_partial_to_final_raw_data = 3
  Copy and transform the raw data in the partial cache format into the raw data in the (final) cache format.
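
To make the difference between the two outputs of "sort_numerical_column" concrete, here is a minimal in-memory Python sketch. It is an illustration only: the function names sort_column and discretize_column are hypothetical, missing values are assumed to have been replaced already, and the real worker writes sharded files rather than returning lists.

```python
from typing import List, Tuple


def sort_column(values: List[float]) -> Tuple[List[float], List[int]]:
    """Pre-sorted form: unique values (ascending) and the example
    indices ordered by attribute value."""
    unique_values = sorted(set(values))
    # sorted() is stable, so ties keep the original example order.
    example_indices = sorted(range(len(values)), key=lambda i: values[i])
    return unique_values, example_indices


def discretize_column(values: List[float]) -> Tuple[List[float], List[int]]:
    """Discretized form: unique values (ascending) and, for each example
    in dataset order, the index of its value in that table."""
    unique_values = sorted(set(values))
    position = {value: idx for idx, value in enumerate(unique_values)}
    return unique_values, [position[value] for value in values]


column = [0.5, 2.0, 0.5, 1.0]
print(sort_column(column))        # ([0.5, 1.0, 2.0], [0, 2, 3, 1])
print(discretize_column(column))  # ([0.5, 1.0, 2.0], [0, 2, 0, 1])
```

Both forms share the same table of unique values; they differ in whether the second output follows the value ordering or the original example ordering.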

Used in: WorkerRequest

optional string partial_cache_directory = 1
optional string final_cache_directory = 2
optional int32 column_idx = 3
optional int32 shard_idx = 4
optional int32 num_shards = 5
optional bool delete_source_file = 6
oneof transformation
- ConvertPartialToFinalRawData.Numerical numerical = 7
- ConvertPartialToFinalRawData.CategoricalInt categorical_int = 8
- ConvertPartialToFinalRawData.CategoricalString categorical_string = 9

Used in: ConvertPartialToFinalRawData

optional int64 max_value = 1
optional int32 nan_value_replacement = 2

Used in: ConvertPartialToFinalRawData

map<string, dataset.proto.CategoricalSpec.VocabValue> items = 1
Same as "items" in the dataspec. Dictionary for a categorical string feature of the result.
optional int32 nan_value_replacement = 2

Used in: ConvertPartialToFinalRawData

optional float nan_value_replacement = 1
Replace the NaN values.

Used in: WorkerRequest

optional string dataset_path = 1
Path to dataset (or subset of dataset).
optional string output_directory = 2
Output directory.
repeated int32 columns = 3
Columns to extract. The other columns are ignored.
optional dataset.proto.DataSpecification dataspec = 4
optional int32 shard_idx = 5
Corresponding shard index in the feature cache.
optional int32 column_idx_remove_example_with_zero = 6
If set, index of a numerical column. The examples with a zero value (for this column) are removed from the cache.
optional int32 num_shards = 7
Number of shards for the feature.

Used in: WorkerRequest

optional string output_directory = 1
Output directory.
optional int64 num_examples = 2
Total number of examples in the column.
optional int32 delta_bit_idx = 11
Delta bit.
optional int32 column_idx = 3
Index of the column.
optional int32 num_shards = 4
Number of shards in the input data.
optional int32 num_example_per_output_shards = 6
Number of examples to write in each output shard.
optional int64 max_unique_values_for_discretized_numerical = 7
Depending on the number of unique values, the numerical column will be exported as pre-sorted or pre-discretized. If the number of unique values is > max_unique_values_for_discretized_numerical, the output will contain sorted numerical values. Otherwise, the output will contain in-order discretized numerical values.
optional float replacement_missing_value = 8
Value replacing missing values in the input. This is only used to compute the returned metadata message (as missing values have been filtered in the previous stage).
optional bool force_numerical_discretization = 9
If true, force the discretization of the column. If the column contains more than "max_unique_values_for_discretized_numerical" unique values, the column is discretized using "max_unique_values_for_discretized_numerical" quantiles.
optional int32 num_shards_in_output_shards = 10
Number of expected shards in the output.
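
When discretization is forced on a column with more unique values than allowed, the field descriptions above say the column is discretized using quantiles. The exact procedure is not given here, so the Python sketch below shows one simple equal-frequency scheme under stated assumptions: the helper names quantile_boundaries and discretize_with_boundaries are hypothetical, and boundaries are taken at evenly spaced ranks of the sorted values.

```python
from bisect import bisect_right
from typing import List


def quantile_boundaries(values: List[float], num_bins: int) -> List[float]:
    """Boundaries splitting the values into roughly equal-frequency bins.

    Assumption: boundaries are read at evenly spaced ranks; duplicate
    boundaries are dropped, so fewer than num_bins bins may result.
    """
    ordered = sorted(values)
    boundaries: List[float] = []
    for b in range(1, num_bins):
        candidate = ordered[(b * len(ordered)) // num_bins]
        if not boundaries or candidate > boundaries[-1]:
            boundaries.append(candidate)
    return boundaries


def discretize_with_boundaries(values: List[float],
                               boundaries: List[float]) -> List[int]:
    """Maps each value to the index of its bin, in dataset order."""
    return [bisect_right(boundaries, value) for value in values]


column = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
bounds = quantile_boundaries(column, num_bins=4)
print(bounds)                                      # [3.0, 5.0, 7.0]
print(discretize_with_boundaries(column, bounds))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Because the bins only approximate the original values, this is the case where the configuration affects the trained model, unlike exact discretization of a column with few unique values.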

Result message of the worker.

oneof type
- WorkerResult.SeparateDatasetColumns separate_dataset_columns = 1
- WorkerResult.SortNumericalColumn sort_numerical_column = 2
- WorkerResult.ConvertPartialToFinalRawData convert_partial_to_final_raw_data = 3

Used in: WorkerResult

(message has no fields)

Used in: WorkerResult

optional int32 shard_idx = 2
optional int64 num_examples = 3

Used in: WorkerResult

optional int32 column_idx = 2
optional CacheMetadata.NumericalColumn metadata = 3

Welcome message of the worker.

No welcome data yet.

(message has no fields)

message CacheMetadata

optional int64 num_examples = 1

optional int32 delta_bit_idx = 7

optional int32 num_shards_in_feature_cache = 2

optional int32 num_shards_in_index_cache = 3

repeated CacheMetadata.Column columns = 4

optional int32 label_column_idx = 5

optional int32 group_column_idx = 8

optional int32 weight_column_idx = 6

message CacheMetadata.BooleanColumn

optional bool replacement_missing_value = 2

message CacheMetadata.CategoricalColumn

optional int64 num_values = 1

optional int32 replacement_missing_value = 2

message CacheMetadata.Column

optional bool available = 1

oneof type

NumericalColumn numerical = 2

CategoricalColumn categorical = 3

BooleanColumn boolean = 4

HashColumn hash = 5

message CacheMetadata.HashColumn

message CacheMetadata.NumericalColumn

optional float replacement_missing_value = 1

optional int64 num_unique_values = 2

optional bool discretized = 3

optional int32 discretized_replacement_missing_value = 4

optional int64 num_discretized_shards = 5

optional int32 num_discretized_values = 6

message CreateDatasetCacheConfig

optional int64 index_cache_file_size_bytes = 1

optional int32 label_column_idx = 2

optional int32 group_column_idx = 7

optional int32 weight_column_idx = 3

optional bool remove_zero_weighted_examples = 4

optional int64 max_unique_values_for_discretized_numerical = 5

optional bool force_numerical_discretization = 6

message DatasetCacheReaderOptions

repeated int32 features = 1

optional bool load_all_features = 4

optional int32 reading_buffer = 2

optional bool load_cache_in_memory = 3

message PartialColumnShardMetadata

optional int64 num_examples = 1

optional int64 num_missing_examples = 2

oneof type

PartialColumnShardMetadata.NumericalColumn numerical = 3

PartialColumnShardMetadata.CategoricalColumn categorical = 4

message PartialColumnShardMetadata.CategoricalColumn

optional int64 number_of_unique_values = 1

map<string, dataset.proto.CategoricalSpec.VocabValue> items = 2

message PartialColumnShardMetadata.NumericalColumn

optional double mean = 1

optional double min = 2

optional double max = 3

message PartialDatasetMetadata

repeated string column_names = 1

optional int32 num_shards = 2

message ShardMetadata

optional int64 num_examples = 1

message SortedColumnMetadata

optional CacheMetadata.NumericalColumn metadata = 1

message WorkerRequest

oneof type

WorkerRequest.SeparateDatasetColumns separate_dataset_columns = 1

WorkerRequest.SortNumericalColumn sort_numerical_column = 2

WorkerRequest.ConvertPartialToFinalRawData convert_partial_to_final_raw_data = 3

message WorkerRequest.ConvertPartialToFinalRawData

optional string partial_cache_directory = 1

optional string final_cache_directory = 2

optional int32 column_idx = 3

optional int32 shard_idx = 4

optional int32 num_shards = 5

optional bool delete_source_file = 6

oneof transformation

ConvertPartialToFinalRawData.Numerical numerical = 7

ConvertPartialToFinalRawData.CategoricalInt categorical_int = 8

ConvertPartialToFinalRawData.CategoricalString categorical_string = 9

message WorkerRequest.ConvertPartialToFinalRawData.CategoricalInt