ModelProto stores model parameters. SentencePieceProcessor is supposed to be self-contained. All settings/parameters which may change the behavior must be encoded in ModelProto.
Sentence pieces with scores.
Spec used to generate this model file.
Spec for text normalization.
Stores sample input and its expected segmentation to verify the model.
Spec for text de-normalization.
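The `.model` file written by the trainer is this ModelProto serialized to bytes, so it can be inspected directly. A minimal sketch, assuming a trained model at `m.model` (a placeholder path) and a sentencepiece pip release recent enough to ship the generated `sentencepiece_model_pb2` module (otherwise it can be regenerated from `sentencepiece_model.proto` with `protoc`):

```python
from sentencepiece import sentencepiece_model_pb2 as model_pb2

# The .model file produced by training is a serialized ModelProto.
m = model_pb2.ModelProto()
with open("m.model", "rb") as f:        # placeholder path to a trained model
    m.ParseFromString(f.read())

print(len(m.pieces))                    # sentence pieces with scores
print(m.trainer_spec.model_prefix)      # spec used to generate this model file
print(m.normalizer_spec.name)           # spec for text normalization
print(len(m.self_test_data.samples))    # samples used to verify the model
```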
Used in:
piece must not be empty.
Used in:
normal symbol
unknown symbol. only <unk> for now.
control symbols. </s>, <s>, <2ja> etc.
user defined symbols.
A typical usage of a USER_DEFINED symbol is as a placeholder.
byte symbols. Used when `byte_fallback` is true.
this piece is not used.
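For illustration, the piece types above can be tallied by walking `pieces`; this reuses the assumed `m.model` and `sentencepiece_model_pb2` module from the previous sketch:

```python
from collections import Counter
from sentencepiece import sentencepiece_model_pb2 as model_pb2

m = model_pb2.ModelProto()
with open("m.model", "rb") as f:        # placeholder path to a trained model
    m.ParseFromString(f.read())

Type = model_pb2.ModelProto.SentencePiece.Type
print(Counter(Type.Name(p.type) for p in m.pieces))
# e.g. Counter({'NORMAL': 7997, 'CONTROL': 2, 'UNKNOWN': 1})

# The unknown symbol is typically the single piece '<unk>'.
print([p.piece for p in m.pieces if p.type == Type.UNKNOWN])
```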
NormalizerSpec encodes various parameters for string normalization.
Used in:
name of normalization rule.
Pre-compiled normalization rule created by Builder::GetPrecompiledCharsMap() or Builder::CompileCharsMap() method. Usually this field is set by Builder::GetNormalizerSpec() method.
Adds dummy whitespace at the beginning of text in order to treat "world" in "world" and "hello world" in the same way.
Removes leading, trailing, and duplicate internal whitespace.
Replaces whitespace with the meta symbol. This field must be true to train a sentence piece model.
Custom normalization rule file in TSV format. https://github.com/google/sentencepiece/blob/master/doc/normalization.md This field is only used in SentencePieceTrainer::Train() method, which compiles the rule into the binary rule stored in `precompiled_charsmap`.
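The effect of these normalization flags is easiest to see from the processor side. A small illustrative example, again assuming a trained model at `m.model` with the default NormalizerSpec; the exact pieces depend on the vocabulary:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")   # placeholder model path

# With add_dummy_prefix, remove_extra_whitespaces, and escape_whitespaces
# enabled, "world" and the "world" in "hello world" are segmented the same
# way, and whitespace appears as the meta symbol "▁".
print(sp.encode("world", out_type=str))           # e.g. ['▁world']
print(sp.encode("hello   world ", out_type=str))  # e.g. ['▁hello', '▁world']
```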
Proto to store samples for self-testing.
Used in:
Used in:
SentencePieceText manages a user-facing source sentence, postprocessed target sentence, and internal segmentation with byte offsets.
Used in:
User input or postprocessed text. This should be immutable since the byte range in SentencePiece is pointing to a span over this text. Meta symbols for whitespaces are not included.
A sequence of sentence pieces.
Score (usually log probability) for MultiSentencePieceText.
Used in:
Internal representation for the decoder.
- Decoder can use |piece| as a basic token.
- The piece must be non-empty.
- A whitespace is replaced with a meta symbol.
- Concatenation of pieces is not always the same as the |text|.
Vocabulary id.
External representation for the client.
- It is always guaranteed that text.substr(begin, end - begin) == surface.
- Concatenation of surface is always the same as the |text|.
- |surface| may contain whitespaces.
- |surface| may be empty if the piece encodes a control vocabulary, e.g., <s>, </s>, <unk>.
- When |surface| is empty, always begin == end (zero-length span).
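Recent versions of the Python bindings can return this structure directly (an assumption; older releases only expose a serialized proto). A sketch that checks the surface/offset guarantees above against an assumed `m.model`:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")   # placeholder model path

spt = sp.encode("hello world", out_type="immutable_proto")
for p in spt.pieces:
    # p.piece is the internal token (whitespace as meta symbol),
    # p.surface the client-facing span, [p.begin, p.end) its byte range.
    print(p.id, repr(p.piece), repr(p.surface), p.begin, p.end)
    assert spt.text.encode("utf-8")[p.begin:p.end] == p.surface.encode("utf-8")
```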
TrainerSpec encodes various parameters for SentencePiece training. Next id: 55
Used in:
General parameters.

Input corpus files. Trainer accepts the following two formats:
A) Monolingual: plain text, one sentence per line.
B) Bilingual: TSV, source sentence <tab> target sentence.
When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. The trainer only loads the first `input_sentence_size` sentences specified with this parameter.
Input corpus format:
  "text": one-sentence-per-line text format (default)
  "tsv":  sentence <tab> freq
Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
Vocabulary size. 8k is the default size.
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
Size of self-test samples, which are encoded in the model file.
Whether to use DP version of sentencepiece. Use it with TSV input format (requires precomputed word tab counts to work).
Set these parameters if you need the DP version of sentencepiece. Standard deviation of the noise to add.
Clipping threshold to apply after adding noise. All the words with frequency less than this value are dropped.
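Taken together, the general parameters in this group map directly onto trainer flags. A minimal training sketch; `corpus.txt` is a placeholder corpus, not a file referenced by this spec:

```python
import sentencepiece as spm

# Minimal unigram training run over an assumed one-sentence-per-line corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    input_format="text",     # default; "tsv" expects sentence <tab> freq
    model_prefix="m",        # writes m.model and m.vocab
    vocab_size=8000,         # 8k, the default size
    accept_language="en",    # reference only; the model is language-agnostic
)
```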
Training parameters.

Uses characters which cover the corpus with the ratio of `character_coverage`. This parameter determines the basic alphabet of sentence pieces. 1.0 - `character_coverage` characters are treated as UNK. See also the required_chars field.
Maximum size of sentences the trainer loads from `input` parameter. Trainer simply loads the `input` files in sequence. It is better to shuffle the input corpus randomly.
Maximum size of sentences to make seed sentence pieces. Extended suffix array is constructed to extract frequent sub-strings from the corpus. This uses 20N working space, where N is the size of corpus.
Maximum size of sentences to train sentence pieces.
The size of seed sentencepieces. `seed_sentencepiece_size` must be larger than `vocab_size`.
In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepiece size` pieces with respect to the loss of the sentence piece. This value should be smaller than 1.0.
The maximum sentence length in bytes. Sentences longer than `max_sentence_length` are simply ignored. Longer input tends to bring the following risks:
* Overflow during EM training (unigram language model only)
* Performance drop because of the O(n log n) cost in BPE
Number of threads in the training.
Number of EM sub iterations.
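A hedged sketch that spells out the training parameters in this group; `corpus.txt` is again a placeholder, and the values shown are the library's documented defaults:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",              # placeholder corpus
    model_prefix="m_train",
    vocab_size=8000,
    character_coverage=0.9995,       # characters outside this coverage become UNK
    input_sentence_size=0,           # 0 = load all sentences from `input`
    seed_sentencepiece_size=1000000, # must be larger than vocab_size
    shrinking_factor=0.75,           # fraction of pieces kept per EM sub-iteration
    max_sentence_length=4192,        # in bytes; longer sentences are ignored
    num_threads=16,
    num_sub_iterations=2,
)
```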
SentencePiece parameters which control the shapes of sentence pieces.

Maximum length of a sentencepiece.
Uses Unicode script to split sentence pieces. When `split_by_unicode_script` is true, we do not allow a sentence piece to include multiple Unicode scripts, e.g. "F1" is not a valid piece. Exception: CJ characters (Hiragana/Katakana/Han) are all handled as one script type, since a Japanese word can consist of multiple scripts. This exception is always applied regardless of the accept-language parameter.
When `split_by_number` is true, puts a boundary at number/non-number transitions. If we want to treat "F1" as one token, set this flag to false.
Uses whitespace to split sentence pieces. When `split_by_whitespace` is false, a piece may contain a whitespace in the middle, e.g., "in_the".
Adds whitespace symbol (_) as a suffix instead of prefix. e.g., _hello => hello_. When `treat_whitespace_as_suffix` is true, NormalizerSpec::add_dummy_prefix will add the dummy whitespace to the end of sentence.
Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
Split all digits (0-9) into separate pieces.
Defines the pre-tokenization delimiter. When specified, no piece crossing this delimiter is included in the vocab. The delimiter string is then virtually ignored during training. This field allows constraints on the vocabulary selection. Note that this field is only available in unigram mode.
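The piece-shape controls in this group are also plain trainer flags. An illustrative sketch over the placeholder corpus; the values are chosen only to show the knobs, not as recommendations:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",              # placeholder corpus
    model_prefix="m_shape",
    vocab_size=8000,
    max_sentencepiece_length=16,
    split_by_unicode_script=True,
    split_by_number=True,            # set to False to allow pieces like "F1"
    split_by_whitespace=True,        # set to False to allow pieces like "in_the"
    split_digits=True,               # 0-9 always become separate pieces
    treat_whitespace_as_suffix=False,
    allow_whitespace_only_pieces=False,
    # pretokenization_delimiter="\t",  # unigram mode only; newer releases
)
```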
Vocabulary management.

Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. We can use this field to encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
Defines user defined symbols. These symbols are added with extremely high scores so they are always treated as one unique symbol in any context. A typical usage of user_defined_symbols is as placeholders for named entities.
Defines required characters. Each UTF8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
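A sketch contrasting control symbols with user-defined symbols at training and encoding time; the corpus path and symbol names are placeholders:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",                   # placeholder corpus
    model_prefix="m_vocab",
    vocab_size=8000,
    control_symbols=["<2ja>", "<2en>"],   # meta tokens, invisible to users
    user_defined_symbols=["<mask>"],      # always kept as one piece
    required_chars="0123456789",          # always in the character set
)

sp = spm.SentencePieceProcessor(model_file="m_vocab.model")
# A control symbol appearing in the input is segmented like normal text...
print(sp.encode("<2ja>hello", out_type=str))
# ...while a user-defined symbol stays as a single piece.
print(sp.encode("<mask>hello", out_type=str))
```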
Decomposes unknown pieces into UTF-8 bytes.
When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
`vocab_size` is treated as a hard limit. Crashes if the model cannot produce a vocab of size `vocab_size`. When `hard_vocab_limit` is false, `vocab_size` is treated as a soft limit. Note that when model_type=char, hard_vocab_limit = false is always assumed.
Uses all symbols for vocab extraction. This flag is valid only if the model type is either CHAR or WORD.
Reserved special meta tokens.
* -1 is not used.
* unk_id must not be -1.
Ids must start with 0 and be contiguous.
<unk>
<s>
</s>
<pad> (padding)
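These reserved ids are set at training time. A sketch with illustrative (non-default) ids over the placeholder corpus; they start at 0 and are contiguous, as required above:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",      # placeholder corpus
    model_prefix="m_ids",
    vocab_size=8000,
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,   # unk_id must not be -1
)

sp = spm.SentencePieceProcessor(model_file="m_ids.model")
print(sp.pad_id(), sp.unk_id(), sp.bos_id(), sp.eos_id())   # 0 1 2 3
```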
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful for both users and developers. We can easily figure out that <unk> was emitted.
Increases bit depth to allow unigram model training on large (>10M sentences) corpora. A side effect of enabling this flag is increased memory usage.
Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency per line.
Model type. Only UNIGRAM now.
Used in:
Unigram language model with dynamic algorithm
Byte Pair Encoding
Delimited by whitespace.
Tokenizes into character sequence.
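All four ModelType values are selectable through the model_type training flag. A sketch over the placeholder corpus; "unigram" is the default:

```python
import sentencepiece as spm

# Train one model per ModelType value (assumes corpus.txt is large enough
# to support the requested vocab size).
for mt in ("unigram", "bpe", "word", "char"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"m_{mt}",
        vocab_size=8000,
        model_type=mt,
    )
```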