Next Available Field: 4
Config for the BERT/SMITH based dual encoder.
This field must be set to supply the train/eval data.
Config for optimization, this field is required.
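Taken together, the three fields above (with Next Available Field: 4) suggest a top-level message of roughly the following shape. This is an illustrative sketch only; the message and field names below are assumptions, not the actual identifiers from the .proto file:

```proto
// Illustrative sketch: real message/field names may differ.
message DualEncoderConfig {
  // Config for the BERT/SMITH based dual encoder.
  EncoderConfig encoder_config = 1;
  // Train/eval data and settings; must be set.
  TrainEvalConfig train_eval_config = 2;
  // Config for optimization; required.
  OptimizationConfig opt_config = 3;
}
```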
Configuration for BERT-based or SMITH-based encoder. Next Available Field: 18
Used in:
The name of the model.
Which pretrained checkpoint to use. This field is required for fine-tuning.
Which checkpoint to use for the model prediction process.
Path to the BERT config file.
Path to the document-level BERT config file, which is only used in the SMITH model.
Path to the vocab file.
This is only used for the BERT model. The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded. Normally, this should be no larger than the one used in pretraining. This should be matched with the data generation settings.
Maximum number of masked LM predictions per sequence. Note that for the SMITH model, the maximum number of masked LM predictions per document is max_doc_length_by_sentence * max_predictions_per_seq.
This is only used for the SMITH model. The maximum number of tokens in a sentence.
This is only used for the SMITH model. The maximum number of sentences in a document.
This is only used for the SMITH model. The number of looped sentences in a document, used to control TPU memory usage. This number should be smaller than the setting of max_doc_length_by_sentence.
This is only used for the SMITH model. Whether to update the parameters in the sentence-level Transformers of the SMITH model.
This is only used for the SMITH model. The maximum number of sentences to be masked in each document.
This is only used for the SMITH model. If true, add the masked sentence LM loss into the total training loss.
The number of different labels in the classification task.
The type of document representation combining mode. It can be normal, sum_concat, mean_concat, or attention.
The size of the attention vector in the attention layer for combining the sentence level representations to generate the document level representations.
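A minimal SMITH encoder configuration in text format might look like the sketch below. Only max_predictions_per_seq and max_doc_length_by_sentence appear verbatim in the field descriptions above; every other field name and path is an illustrative assumption:

```textproto
# Illustrative sketch; field names other than max_predictions_per_seq and
# max_doc_length_by_sentence are assumptions, and paths are placeholders.
model_name: "smith_dual_encoder"
init_checkpoint: "/path/to/pretrained/model.ckpt"
vocab_file: "/path/to/vocab.txt"
max_predictions_per_seq: 5
max_sent_length_by_word: 32      # max tokens per sentence (SMITH only)
max_doc_length_by_sentence: 64   # max sentences per document (SMITH only)
loop_sent_number_per_doc: 32     # smaller than max_doc_length_by_sentence
num_labels: 2
doc_rep_combine_mode: "normal"   # normal | sum_concat | mean_concat | attention
```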
Configuration for a loss function. Next Available Field: 2
Used in:
Hyperparameters for the loss function. The amplifier used to scale the logits so that sigmoid(logits) is closer to 0 or 1. The default value is 6.0.
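The effect of the amplifier can be seen with a quick calculation: for a raw logit of 1.0, amplifying by the default factor of 6.0 pushes the sigmoid output much closer to 1.

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\sigma(1.0) \approx 0.731, \qquad
\sigma(6.0 \times 1.0) = \sigma(6.0) \approx 0.998
```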
Definition of sections in WikiDoc pages. NextID: 3
Used in:
Proto to specify train/eval datasets and train/eval settings. Next Available Field: 13
Used in:
File patterns for train set, separated by commas if you have multiple files. This field is required.
File patterns for eval set, separated by commas if you have multiple files.
Total batch size for training.
Total batch size for evaluation.
Total batch size for prediction.
Maximum number of eval steps. This should be set according to the size of the eval data. During model pre-training, we can also use part of the training data for evaluation.
How often to save the model checkpoint.
How many steps to make in each estimator call.
This is set to true if we always want to evaluate the model with the eval or test data, even in pre-train mode, so that we know whether the model overfits the training data.
The weight used to compensate for class imbalance when there are more negative examples than positive ones.
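As an illustration, a train/eval configuration in text format could look like the following. All field names and paths here are assumptions for illustration; only the semantics come from the descriptions above:

```textproto
# Illustrative sketch; field names and paths are assumptions.
input_file_for_train: "/data/train-a-*.tfrecord,/data/train-b-*.tfrecord"
input_file_for_eval: "/data/eval-*.tfrecord"
train_batch_size: 32
eval_batch_size: 32
predict_batch_size: 32
max_eval_steps: 1000          # set according to the eval data size
save_checkpoints_steps: 1000  # how often to save the model checkpoint
iterations_per_loop: 100      # steps per estimator call
eval_with_eval_data: true     # evaluate on eval/test data even in pre-train mode
neg_to_pos_example_ratio: 1.0 # weight to compensate for extra negatives
```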
Definition of contents in a WikiDoc object. NextID: 7
Used in:
An id that uniquely identifies this document. The id can be generated based on the url of the document.
The url of the WikiDoc page.
The title of the WikiDoc page.
The description of the WikiDoc page.
The section contents of the WikiDoc page.
A list of image ids of images in the WikiDoc page.
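The six fields above (NextID: 3 for sections, NextID: 7 for the document) can be illustrated with a small text-format example. Field names and values are assumptions inferred from the descriptions:

```textproto
# Illustrative sketch; field names and values are assumptions.
id: 1234567890
url: "https://en.wikipedia.org/wiki/Example_page"
title: "Example page"
description: "A short description of the page."
section_contents: "Text of the first section ..."
section_contents: "Text of the second section ..."
image_ids: 42
image_ids: 43
```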
Definition of a pair of two WikiDoc objects. NextID: 10
An id that uniquely identifies this document pair. The id can be generated based on the urls of the document pair.
The classification label generated by machine. We set this as an int in case we would like to change the number of graded levels of this label.
The classification label generated by human.
The regression label generated by machine.
The regression label generated by human.
Two document objects with similarity labels.
The model predicted similarity score for this pair.
The raw human rating scores.
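A document pair could then be assembled as in the sketch below, which accounts for the nine fields implied by NextID: 10 (an id, four labels, two documents, a model score, and repeated human ratings). All field names are illustrative assumptions:

```textproto
# Illustrative sketch; field names are assumptions.
id: 9876543210
model_label_cat: 2          # machine-generated classification label (int)
human_label_cat: 1          # human-generated classification label
model_label_reg: 0.8        # machine-generated regression label
human_label_reg: 0.75       # human-generated regression label
doc_one { id: 1234567890 title: "Example page" }
doc_two { id: 1234567891 title: "Another example page" }
model_score: 0.83           # model-predicted similarity score for this pair
human_rating_scores: 3.0    # raw human rating scores
human_rating_scores: 4.0
```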