ModelProto stores model parameters. SentencePieceProcessor is supposed to be self-contained. All settings/parameters which may change the behavior must be encoded in ModelProto.
Sentence pieces with scores.
Spec used to generate this model file.
Spec for text normalization.
Stores sample input and its expected segmentation to verify the model.
Spec for text de-normalization.
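The `.model` file written by the trainer is this ModelProto serialized to bytes, so it can be inspected directly. A minimal sketch, assuming a trained model at `m.model` (a placeholder path) and a sentencepiece pip release recent enough to ship the generated `sentencepiece_model_pb2` module (otherwise it can be regenerated from `sentencepiece_model.proto` with `protoc`):

```python
from sentencepiece import sentencepiece_model_pb2 as model_pb2

# The .model file produced by training is a serialized ModelProto.
m = model_pb2.ModelProto()
with open("m.model", "rb") as f:        # placeholder path to a trained model
    m.ParseFromString(f.read())

print(len(m.pieces))                    # sentence pieces with scores
print(m.trainer_spec.model_prefix)      # spec used to generate this model file
print(m.normalizer_spec.name)           # spec for text normalization
print(len(m.self_test_data.samples))    # samples used to verify the model
```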
Used in:
piece must not be empty.
Used in:
normal symbol
unknown symbol. only <unk> for now.
control symbols. </s>, <s>, <2ja> etc.
user defined symbols.
A typical usage of a USER_DEFINED symbol is as a placeholder.
byte symbols. Used when `byte_fallback` is true.
this piece is not used.
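For illustration, the piece types above can be tallied by walking `pieces`; this reuses the assumed `m.model` and `sentencepiece_model_pb2` module from the previous sketch:

```python
from collections import Counter
from sentencepiece import sentencepiece_model_pb2 as model_pb2

m = model_pb2.ModelProto()
with open("m.model", "rb") as f:        # placeholder path to a trained model
    m.ParseFromString(f.read())

Type = model_pb2.ModelProto.SentencePiece.Type
print(Counter(Type.Name(p.type) for p in m.pieces))
# e.g. Counter({'NORMAL': 7997, 'CONTROL': 2, 'UNKNOWN': 1})

# The unknown symbol is typically the single piece '<unk>'.
print([p.piece for p in m.pieces if p.type == Type.UNKNOWN])
```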
NormalizerSpec encodes various parameters for string normalization.
Used in:
name of normalization rule.
Pre-compiled normalization rule created by Builder::GetPrecompiledCharsMap() or Builder::CompileCharsMap() method. Usually this field is set by Builder::GetNormalizerSpec() method.
Adds dummy whitespace at the beginning of text in order to treat "world" in "world" and "hello world" in the same way.
Removes leading, trailing, and duplicate internal whitespace.
Replaces whitespace with the meta symbol. This field must be true to train a sentence piece model.
Custom normalization rule file in TSV format. https://github.com/google/sentencepiece/blob/master/doc/normalization.md This field is only used in SentencePieceTrainer::Train() method, which compiles the rule into the binary rule stored in `precompiled_charsmap`.
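The effect of these normalization flags is easiest to see from the processor side. A small illustrative example, again assuming a trained model at `m.model` with the default NormalizerSpec; the exact pieces depend on the vocabulary:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")   # placeholder model path

# With add_dummy_prefix, remove_extra_whitespaces, and escape_whitespaces
# enabled, "world" and the "world" in "hello world" are segmented the same
# way, and whitespace appears as the meta symbol "▁".
print(sp.encode("world", out_type=str))           # e.g. ['▁world']
print(sp.encode("hello   world ", out_type=str))  # e.g. ['▁hello', '▁world']
```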
Proto to store samples for self-testing.
Used in:
Used in:
SentencePieceText manages a user-facing source sentence, postprocessed target sentence, and internal segmentation with byte offsets.
Used in:
User input or postprocessed text. This should be immutable since the byte range in SentencePiece is pointing to a span over this text. Meta symbols for whitespaces are not included.
A sequence of sentence pieces.
Score (usually log probability) for MultiSentencePieceText.
Used in:
Internal representation for the decoder.
- Decoder can use |piece| as a basic token.
- The piece must be non-empty.
- A whitespace is replaced with a meta symbol.
- Concatenation of pieces is not always the same as the |text|.
Vocabulary id.
External representation for the client.
- It is always guaranteed that text.substr(begin, end - begin) == surface.
- Concatenation of surface is always the same as the |text|.
- |surface| may contain whitespaces.
- |surface| may be empty if the piece encodes a control vocabulary, e.g., <s>, </s>, <unk>.
- When |surface| is empty, always begin == end (zero-length span).
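Recent versions of the Python bindings can return this structure directly (an assumption; older releases only expose a serialized proto). A sketch that checks the surface/offset guarantees above against an assumed `m.model`:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")   # placeholder model path

spt = sp.encode("hello world", out_type="immutable_proto")
for p in spt.pieces:
    # p.piece is the internal token (whitespace as meta symbol),
    # p.surface the client-facing span, [p.begin, p.end) its byte range.
    print(p.id, repr(p.piece), repr(p.surface), p.begin, p.end)
    assert spt.text.encode("utf-8")[p.begin:p.end] == p.surface.encode("utf-8")
```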
TrainerSpec encodes various parameters for SentencePiece training. Next id: 55
Used in:
General parameters.

Input corpus files. Trainer accepts the following two formats:
A) Monolingual: plain text, one sentence per line.
B) Bilingual: TSV, source sentence <tab> target sentence.
When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. The trainer only loads the first `input_sentence_size` sentences specified with this parameter.
Input corpus format:
  "text": one-sentence-per-line text format (default)
  "tsv":  sentence <tab> freq
Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
Vocabulary size. 8k is the default size.
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
Size of self-test samples, which are encoded in the model file.
Whether to use DP version of sentencepiece. Use it with TSV input format (requires precomputed word tab counts to work).
Set these parameters if you need the DP version of sentencepiece. Standard deviation of the noise to add.
Clipping threshold to apply after adding noise. All the words with frequency less than this value are dropped.
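Taken together, the general parameters in this group map directly onto trainer flags. A minimal training sketch; `corpus.txt` is a placeholder corpus, not a file referenced by this spec:

```python
import sentencepiece as spm

# Minimal unigram training run over an assumed one-sentence-per-line corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    input_format="text",     # default; "tsv" expects sentence <tab> freq
    model_prefix="m",        # writes m.model and m.vocab
    vocab_size=8000,         # 8k, the default size
    accept_language="en",    # reference only; the model is language-agnostic
)
```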
Training parameters.

Uses characters which cover the corpus with the ratio of `character_coverage`. This parameter determines the basic alphabet of sentence pieces. 1.0 - `character_coverage` characters are treated as UNK. See also the required_chars field.
Maximum size of sentences the trainer loads from `input` parameter. Trainer simply loads the `input` files in sequence. It is better to shuffle the input corpus randomly.
Maximum size of sentences to make seed sentence pieces. Extended suffix array is constructed to extract frequent sub-strings from the corpus. This uses 20N working space, where N is the size of corpus.
Maximum size of sentences to train sentence pieces.
The size of seed sentencepieces. `seed_sentencepiece_size` must be larger than `vocab_size`.
In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepiece size` pieces with respect to the loss of the sentence piece. This value should be smaller than 1.0.
The maximum sentence length in bytes. Sentences longer than `max_sentence_length` are simply ignored. Longer input tends to bring the following risks:
* Overflow during EM training (unigram language model only)
* Performance drop because of the O(n log n) cost in BPE
Number of threads in the training.
Number of EM sub iterations.
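A hedged sketch that spells out the training parameters in this group; `corpus.txt` is again a placeholder, and the values shown are the library's documented defaults:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",              # placeholder corpus
    model_prefix="m_train",
    vocab_size=8000,
    character_coverage=0.9995,       # characters outside this coverage become UNK
    input_sentence_size=0,           # 0 = load all sentences from `input`
    seed_sentencepiece_size=1000000, # must be larger than vocab_size
    shrinking_factor=0.75,           # fraction of pieces kept per EM sub-iteration
    max_sentence_length=4192,        # in bytes; longer sentences are ignored
    num_threads=16,
    num_sub_iterations=2,
)
```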
SentencePiece parameters which control the shapes of sentence pieces.

Maximum length of a sentencepiece.
Uses Unicode script to split sentence pieces. When `split_by_unicode_script` is true, we do not allow a sentence piece to include multiple Unicode scripts, e.g. "F1" is not a valid piece. Exception: CJ characters (Hiragana/Katakana/Han) are all handled as one script type, since a Japanese word can consist of multiple scripts. This exception is always applied regardless of the accept-language parameter.
When `split_by_number` is true, puts a boundary at number/non-number transitions. If we want to treat "F1" as one token, set this flag to false.
Uses whitespace to split sentence pieces. When `split_by_whitespace` is false, a piece may contain a whitespace in the middle, e.g., "in_the".
Adds whitespace symbol (_) as a suffix instead of prefix. e.g., _hello => hello_. When `treat_whitespace_as_suffix` is true, NormalizerSpec::add_dummy_prefix will add the dummy whitespace to the end of sentence.
Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
Split all digits (0-9) into separate pieces.
Defines the pre-tokenization delimiter. When specified, no piece crossing this delimiter is included in the vocab. The delimiter string is then virtually ignored during training. This field allows constraints on the vocabulary selection. Note that this field is only available in unigram mode.
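The piece-shape controls in this group are also plain trainer flags. An illustrative sketch over the placeholder corpus; the values are chosen only to show the knobs, not as recommendations:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",              # placeholder corpus
    model_prefix="m_shape",
    vocab_size=8000,
    max_sentencepiece_length=16,
    split_by_unicode_script=True,
    split_by_number=True,            # set to False to allow pieces like "F1"
    split_by_whitespace=True,        # set to False to allow pieces like "in_the"
    split_digits=True,               # 0-9 always become separate pieces
    treat_whitespace_as_suffix=False,
    allow_whitespace_only_pieces=False,
    # pretokenization_delimiter="\t",  # unigram mode only; newer releases
)
```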
Vocabulary management.

Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. We can use this field to encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
Defines user defined symbols. These symbols are added with extremely high scores so they are always treated as one unique symbol in any context. A typical usage of user_defined_symbols is as placeholders for named entities.
Defines required characters. Each UTF8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
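A sketch contrasting control symbols with user-defined symbols at training and encoding time; the corpus path and symbol names are placeholders:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",                   # placeholder corpus
    model_prefix="m_vocab",
    vocab_size=8000,
    control_symbols=["<2ja>", "<2en>"],   # meta tokens, invisible to users
    user_defined_symbols=["<mask>"],      # always kept as one piece
    required_chars="0123456789",          # always in the character set
)

sp = spm.SentencePieceProcessor(model_file="m_vocab.model")
# A control symbol appearing in the input is segmented like normal text...
print(sp.encode("<2ja>hello", out_type=str))
# ...while a user-defined symbol stays as a single piece.
print(sp.encode("<mask>hello", out_type=str))
```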
Decomposes unknown pieces into UTF-8 bytes.
When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
`vocab_size` is treated as a hard limit. Crashes if the model cannot produce a vocab of size `vocab_size`. When `hard_vocab_limit` is false, `vocab_size` is treated as a soft limit. Note that when model_type=char, hard_vocab_limit = false is always assumed.
Uses all symbols for vocab extraction. This flag is valid only if the model type is either CHAR or WORD.
Reserved special meta tokens.
* -1 is not used.
* unk_id must not be -1.
Ids must start with 0 and be contiguous.
<unk>
<s>
</s>
<pad> (padding)
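These reserved ids are set at training time. A sketch with illustrative (non-default) ids over the placeholder corpus; they start at 0 and are contiguous, as required above:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",      # placeholder corpus
    model_prefix="m_ids",
    vocab_size=8000,
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,   # unk_id must not be -1
)

sp = spm.SentencePieceProcessor(model_file="m_ids.model")
print(sp.pad_id(), sp.unk_id(), sp.bos_id(), sp.eos_id())   # 0 1 2 3
```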
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful for both users and developers. We can easily figure out that <unk> was emitted.
Increases bit depth to allow unigram model training on large (>10M sentences) corpora. A side effect of enabling this flag is increased memory usage.
Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency per line.
Model type. Only UNIGRAM now.
Used in:
Unigram language model with dynamic algorithm
Byte Pair Encoding
Delimited by whitespace.
Tokenizes into character sequence.
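All four ModelType values are selectable through the model_type training flag. A sketch over the placeholder corpus; "unigram" is the default:

```python
import sentencepiece as spm

# Train one model per ModelType value (assumes corpus.txt is large enough
# to support the requested vocab size).
for mt in ("unigram", "bpe", "word", "char"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"m_{mt}",
        vocab_size=8000,
        model_type=mt,
    )
```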