ModelProto stores model parameters. SentencePieceProcessor is supposed to be self-contained. All settings/parameters which may change the behavior must be encoded in ModelProto.
Sentence pieces with scores.
Spec used to generate this model file.
Spec for text normalization.
Stores sample input and its expected segmentation to verify the model.
Spec for text de-normalization.
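As a rough sketch of how these fields fit together, the snippet below reads a trained model file into a ModelProto using the sentencepiece_model_pb2 module bundled with the Python sentencepiece package (it can also be generated from sentencepiece_model.proto with protoc); the model path "m.model" is a made-up example.

```python
import sentencepiece.sentencepiece_model_pb2 as model_pb2  # bundled with the pip package

# "m.model" is a hypothetical path to a trained SentencePiece model file.
m = model_pb2.ModelProto()
with open("m.model", "rb") as f:
    m.ParseFromString(f.read())

print(len(m.pieces))                    # sentence pieces with scores
print(m.pieces[0].piece, m.pieces[0].score)
print(m.trainer_spec.vocab_size)        # spec used to generate this model file
print(m.normalizer_spec.name)           # spec for text normalization
print(len(m.self_test_data.samples))    # stored self-test samples, if any
print(m.denormalizer_spec.name)         # spec for text de-normalization, if set
```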
Used in:
The piece must not be empty.
Used in:
Normal symbol.
Unknown symbol. Only <unk> for now.
Control symbols. </s>, <s>, <2ja>, etc.
User defined symbols.
Typical usage of a USER_DEFINED symbol is as a placeholder.
Byte symbols. Used when `byte_fallback` is true.
This piece is not used.
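A small sketch of how this Type enum can be inspected from Python, assuming the same bundled sentencepiece_model_pb2 module and a hypothetical model file:

```python
from collections import Counter

import sentencepiece.sentencepiece_model_pb2 as model_pb2

m = model_pb2.ModelProto()
with open("m.model", "rb") as f:        # hypothetical model path
    m.ParseFromString(f.read())

# Map each piece's numeric type to its enum name (NORMAL, UNKNOWN, CONTROL,
# USER_DEFINED, BYTE, UNUSED) and count how many pieces fall into each class.
type_name = model_pb2.ModelProto.SentencePiece.Type.Name
print(Counter(type_name(p.type) for p in m.pieces))
```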
NormalizerSpec encodes the parameters for string normalization.
Used in:
Name of the normalization rule.
Pre-compiled normalization rule created by Builder::GetPrecompiledCharsMap() or Builder::CompileCharsMap() method. Usually this field is set by Builder::GetNormalizerSpec() method.
Adds dummy whitespace at the beginning of text in order to treat "world" in "world" and "hello world" in the same way.
Removes leading, trailing, and duplicate internal whitespace.
Replaces whitespace with the meta symbol. This field must be true to train a sentence piece model.
Custom normalization rule file in TSV format. https://github.com/google/sentencepiece/blob/master/doc/normalization.md This field is only used in SentencePieceTrainer::Train() method, which compiles the rule into the binary rule stored in `precompiled_charsmap`.
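A hedged training sketch showing where a custom normalization rule TSV plugs in; "corpus.txt" and "my_rules.tsv" are hypothetical file names, and spm.SentencePieceTrainer.train mirrors the spm_train flags:

```python
import sentencepiece as spm

# "corpus.txt" and "my_rules.tsv" are hypothetical file names. The TSV rule is
# compiled at training time into the binary rule stored in `precompiled_charsmap`.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="custom_norm",
    vocab_size=8000,
    normalization_rule_tsv="my_rules.tsv",
)
```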
Proto to store samples for self-testing.
Used in:
Used in:
TrainerSpec encodes the parameters for SentencePiece training. Next id: 53
Used in:
General parameters.
Input corpus files. The trainer accepts the following two formats:
A) Monolingual: plain text, one sentence per line.
B) Bilingual: TSV, source sentence <tab> target sentence.
When bilingual data is passed, a shared vocabulary model is built.
Note that the input file must be a raw corpus, not a preprocessed corpus. The trainer only loads the first `input_sentence_size` sentences from the files specified with this parameter.
Input corpus format:
"text": one-sentence-per-line text format (default).
"tsv": sentence <tab> freq.
Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
Vocabulary size. 8k is the default size.
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
Size of self-test samples, which are encoded in the model file.
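The general parameters above map directly onto spm_train flags, which the Python wrapper accepts as keyword arguments. A minimal sketch, with hypothetical file names and illustrative values:

```python
import sentencepiece as spm

# Hypothetical file names; the keyword arguments mirror the TrainerSpec fields above.
spm.SentencePieceTrainer.train(
    input="corpus.txt",              # raw corpus, one sentence per line
    input_format="text",             # or "tsv" for `sentence <tab> freq`
    model_prefix="mymodel",          # writes mymodel.model and mymodel.vocab
    vocab_size=8000,                 # 8k is the default size
    accept_language="ja,en",         # reference only; the model is language-agnostic
    self_test_sample_size=10,        # embed 10 self-test samples into the model file
)
```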
Whether to use the differentially private (DP) version of SentencePiece. Use it with the TSV input format (it requires precomputed word <tab> count pairs to work).
Set these parameters if you need the DP version of SentencePiece. Standard deviation of the noise to add.
Clipping threshold to apply after adding noise. All the words with frequency less than this value are dropped.
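A sketch of a DP training run, assuming the argument names follow the TrainerSpec field names (enable_differential_privacy, differential_privacy_noise_level, differential_privacy_clipping_threshold) and that a TSV corpus of word <tab> count pairs already exists; values are purely illustrative:

```python
import sentencepiece as spm

# "word_counts.tsv" is a hypothetical file of precomputed `word <tab> count` pairs.
spm.SentencePieceTrainer.train(
    input="word_counts.tsv",
    input_format="tsv",
    model_prefix="dp_model",
    vocab_size=8000,
    enable_differential_privacy=True,
    differential_privacy_noise_level=10.0,        # std of the noise to add
    differential_privacy_clipping_threshold=50,   # drop words whose noised count falls below this value
)
```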
Training parameters.
Uses characters which cover the corpus with the ratio of `character_coverage`. This parameter determines the basic alphabet of sentence pieces. The remaining `1.0 - character_coverage` fraction of characters is treated as UNK. See also the required_chars field.
Maximum size of sentences the trainer loads from `input` parameter. Trainer simply loads the `input` files in sequence. It is better to shuffle the input corpus randomly.
Maximum size of sentences to make seed sentence pieces. Extended suffix array is constructed to extract frequent sub-strings from the corpus. This uses 20N working space, where N is the size of corpus.
Maximum size of sentences to train sentence pieces.
The size of seed sentencepieces. `seed_sentencepiece_size` must be larger than `vocab_size`.
In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepiece size` pieces with respect to the loss of the sentence piece. This value should be smaller than 1.0.
The maximum sentence length in bytes. Sentences longer than `max_sentence_length` are simply ignored. Longer input tends to bring the following risks:
* Overflow during EM training (unigram language model only).
* Performance drop because of the O(n log n) cost in BPE.
Number of threads in the training.
Number of EM sub iterations.
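The training parameters above can likewise be passed through the Python trainer. A sketch with a hypothetical corpus and illustrative values (several of them are the documented defaults); shuffle_input_sentence is an extra flag assumed here to match the note above about shuffling the input corpus:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",               # hypothetical corpus name
    model_prefix="unigram8k",
    vocab_size=8000,
    character_coverage=0.9995,        # characters outside this coverage become UNK
    input_sentence_size=2000000,      # cap on sentences loaded from `input`
    shuffle_input_sentence=True,      # shuffle instead of taking the first N in order
    seed_sentencepiece_size=1000000,  # must be larger than vocab_size
    shrinking_factor=0.75,            # keep the top 75% of pieces per EM sub-iteration
    max_sentence_length=4192,         # in bytes; longer sentences are ignored
    num_threads=16,
    num_sub_iterations=2,
)
```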
SentencePiece parameters, which control the shapes of sentence pieces.
Maximum length of a sentencepiece.
Uses Unicode script to split sentence pieces. When `split_by_unicode_script` is true, we do not allow a sentence piece to include multiple Unicode scripts, e.g. "F1" is not a valid piece. Exception: CJ characters (Hiragana/Katakana/Han) are all handled as one script type, since a Japanese word can consist of multiple scripts. This exception is always applied regardless of the accept-language parameter.
When `split_by_number` is true, puts a boundary at number/non-number transitions. If we want to treat "F1" as one token, set this flag to false.
Uses whitespace to split sentence pieces. When `split_by_whitespace` is false, a piece may contain whitespace in the middle, e.g., "in_the".
Adds the whitespace symbol (_) as a suffix instead of a prefix, e.g., _hello => hello_. When `treat_whitespace_as_suffix` is true, NormalizerSpec::add_dummy_prefix will add the dummy whitespace to the end of the sentence.
Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
Split all digits (0-9) into separate pieces.
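A sketch of the shape-controlling flags in one training call; the corpus name is hypothetical and the values simply restate the behaviors described above:

```python
import sentencepiece as spm

# Hypothetical corpus name; each flag mirrors a shape-controlling field above.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="shapes",
    vocab_size=8000,
    max_sentencepiece_length=16,        # maximum length of a piece
    split_by_unicode_script=True,       # no mixed-script pieces such as "F1"
    split_by_number=True,               # boundary at number/non-number transitions
    split_by_whitespace=True,           # set False to allow pieces like "in_the"
    treat_whitespace_as_suffix=False,   # True turns _hello into hello_
    allow_whitespace_only_pieces=False,
    split_digits=False,                 # True splits "2023" into "2" "0" "2" "3"
)
```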
Vocabulary management.
Defines control symbols used as indicators to change the behavior of the decoder. <s> and </s> are pre-defined. We can use this field to encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but are visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but are segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
Defines user defined symbols. These symbols are added with extremely high scores so they are always treated as one unique symbol in any context. Typical usage of user_defined_symbols is as placeholders for named entities.
Defines required characters. Each UTF-8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
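A sketch combining the three vocabulary-management fields in one training call; the corpus and symbol names are hypothetical:

```python
import sentencepiece as spm

# Control symbols are invisible meta tokens for the decoder; user defined symbols
# are always kept as one piece; required_chars forces the listed characters into
# the alphabet regardless of character_coverage.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="vocab_mgmt",
    vocab_size=8000,
    control_symbols=["<2ja>", "<2en>"],
    user_defined_symbols=["<mask>", "<sep>"],
    required_chars="0123456789",
)
```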
Decomposes unknown pieces into UTF-8 bytes.
When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
When `hard_vocab_limit` is true, `vocab_size` is treated as a hard limit, and training crashes if the model cannot produce a vocabulary of exactly that size. When `hard_vocab_limit` is false, `vocab_size` is treated as a soft limit. Note that when model_type=char, hard_vocab_limit is always assumed to be false.
Uses all symbols for vocab extraction. This flag is valid only when the model type is either CHAR or WORD.
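A sketch of these vocabulary flags in a training call, with a hypothetical corpus name:

```python
import sentencepiece as spm

# Hypothetical corpus name; the flags mirror the TrainerSpec fields above.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="bytefallback",
    vocab_size=8000,
    byte_fallback=True,                  # decompose unknown pieces into UTF-8 bytes
    vocabulary_output_piece_score=True,  # write scores into the .vocab file
    hard_vocab_limit=True,               # crash if exactly vocab_size pieces cannot be produced
    # use_all_vocab=True,                # only meaningful for CHAR or WORD model types
)
```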
Reserved special meta tokens.
* -1 means the token is not used.
* unk_id must not be -1.
Ids must start with 0 and be contiguous.
<unk>
<s>
</s>
<pad> (padding)
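A sketch showing how the reserved ids can be assigned at training time and queried afterwards; file names are hypothetical, and the id layout is just one valid contiguous assignment:

```python
import sentencepiece as spm

# Ids must start at 0 and be contiguous; setting an id to -1 disables that token
# (unk_id must not be -1). <pad> is disabled by default, so it is assigned here.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="withpad",
    vocab_size=8000,
    unk_id=0,
    bos_id=1,
    eos_id=2,
    pad_id=3,
)

sp = spm.SentencePieceProcessor(model_file="withpad.model")
print(sp.unk_id(), sp.bos_id(), sp.eos_id(), sp.pad_id())  # expected: 0 1 2 3
```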
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for users and developers: we can easily tell when <unk> has been emitted.
Increases bit depth to allow unigram model training on large (>10M sentences) corpora. A side effect of enabling this flag is increased memory usage.
Model type. UNIGRAM, BPE, WORD, and CHAR are available (see below).
Used in:
Unigram language model with dynamic algorithm
Byte Pair Encoding
Delimited by whitespace.
Tokenizes into character sequences.
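A closing sketch that trains one model per type via the model_type argument; the corpus name is hypothetical and the vocabulary size is illustrative:

```python
import sentencepiece as spm

# model_type selects among the types listed above.
for model_type in ("unigram", "bpe", "word", "char"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"demo_{model_type}",
        vocab_size=8000,
        model_type=model_type,
    )
```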