Affix table entry, for serialization of the affix tables.
The type of affix table, as a string.
The maximum affix length.
The list of affixes, in order of affix ID.
Nested message for serializing a single affix.
The affix as a string.
The length of the affix (this is non-trivial to compute due to UTF-8).
The ID of the affix that is one character shorter, or -1 if none exists.
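The affix fields above can be illustrated with a small sketch. It assumes prefix-style affixes and uses a hypothetical `AffixEntry` class with made-up field names (`form`, `length`, `shorter_id`); the real message and field names are not given here. It shows why the length is stored separately: the character count differs from the UTF-8 byte count.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AffixEntry:
    # Hypothetical stand-in for the nested affix message; field names are assumed.
    form: str        # the affix as a string
    length: int      # length in Unicode code points, not bytes
    shorter_id: int  # ID of the affix one character shorter, or -1 if none


def build_prefix_table(prefixes: List[str]) -> List[AffixEntry]:
    """Builds entries in affix-ID order and links each to its shorter prefix."""
    ids: Dict[str, int] = {p: i for i, p in enumerate(prefixes)}
    entries = []
    for p in prefixes:
        shorter = p[:-1]  # Python slices by code point, so this drops one character
        shorter_id = ids.get(shorter, -1) if shorter else -1
        entries.append(AffixEntry(form=p, length=len(p), shorter_id=shorter_id))
    return entries


# "\u00fc" is a single character that occupies two bytes in UTF-8.
table = build_prefix_table(["\u00fc", "\u00fcb", "\u00fcbe"])
byte_length = len(table[0].form.encode("utf-8"))
```

Here `table[0].length` is 1 while `byte_length` is 2, and each longer prefix points back to the ID of the prefix one character shorter.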
An alternative analysis of tokens in the document. The repeated fields are indexed relative to the beginning of a sentence. Fields not represented in the alternative analysis are assumed to be unchanged. Currently only alternatives for tags, categories and (labeled) dependency heads are supported. Each repeated field should either have length=0 or length=number of tokens.
Head of this token in the dependency tree: the id of the token which has an arc going to this one. If it is the root token of a sentence, then it is set to -1.
Part-of-speech tag for token.
Coarse-grained word category for token.
Label for dependency relation between this token and its head.
The score of this analysis, where bigger values typically indicate better quality, but there are no guarantees and there is also no pre-defined range.
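A minimal sketch of how an alternative analysis might be overlaid on a sentence's tokens, following the rule that each repeated field is either empty or has one entry per token, and that empty fields leave the tokens unchanged. The `Token` and `Alternative` classes and the `apply_alternative` helper are hypothetical stand-ins, not the actual generated message classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    # Hypothetical stand-in for the document token message.
    word: str
    head: int = -1
    tag: str = ""
    category: str = ""
    label: str = ""


@dataclass
class Alternative:
    # Hypothetical stand-in for the alternative analysis message.
    head: List[int] = field(default_factory=list)
    tag: List[str] = field(default_factory=list)
    category: List[str] = field(default_factory=list)
    label: List[str] = field(default_factory=list)
    score: float = 0.0


FIELDS = ("head", "tag", "category", "label")


def apply_alternative(tokens: List[Token], alt: Alternative) -> List[Token]:
    """Returns copies of the tokens with the alternative's fields overlaid."""
    n = len(tokens)
    for name in FIELDS:
        size = len(getattr(alt, name))
        if size not in (0, n):
            raise ValueError(f"{name} has {size} entries, expected 0 or {n}")
    result = []
    for i, tok in enumerate(tokens):
        new = Token(tok.word, tok.head, tok.tag, tok.category, tok.label)
        for name in FIELDS:
            values = getattr(alt, name)
            if values:  # an empty field means "unchanged"
                setattr(new, name, values[i])
        result.append(new)
    return result


tokens = [Token("dogs", 1, "NNS", "NOUN", "nsubj"),
          Token("bark", -1, "VBP", "VERB", "ROOT")]
retagged = apply_alternative(tokens, Alternative(tag=["NNP", "VBP"], score=0.4))
```

Only the tags change in `retagged`; heads, categories and labels are carried over from the original tokens because those fields are empty in the alternative.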
Descriptor for feature extractor.
Top-level feature function for extractor.
Descriptor for feature function.
Feature function type.
Feature function name.
Default argument for feature function.
Named parameters for feature descriptor.
Nested sub-feature function descriptors.
A list of alternative (k-best) syntax analyses, grouped by sentences.
Alternative analyses for each sentence. Sentences are listed in the order visited by a SentenceIterator.
Alternative analyses for each token.
A list of alternative (k-best) analyses for a sentence spanning from a start token index to an end token index. The alternative analyses are ordered by decreasing model score from best to worst. The first analysis is the 1-best analysis, which is typically also stored in the document tokens.
First token of sentence.
Last token of sentence.
K-best analyses for the tokens in this sentence. All of the analyses in the list have the same "type"; e.g., k-best taggings, k-best {tagging+parse}s, etc. Note also that the type of analysis stored in this list can change depending on where we are in the document processing pipeline; e.g., may initially be taggings, and then switch to parses. The first token_analysis would be the 1-best analysis, which is typically also stored in the document. Note: some post-processors will update the document's syntax trees, but will leave these unchanged.
A list of scored alternative (k-best) analyses for a particular token. These are all distinct from each other and ordered by decreasing model score. The first is the 1-best analysis, which may or may not match the document tokens depending on how the k-best analyses are selected.
All token analyses in this repeated field refer to the same token. Each alternative analysis will contain a single entry for repeated fields such as head, tag, category and label.
A Sentence contains the raw text contents of a sentence, as well as an analysis.
Identifier for document.
Raw text contents of the sentence.
Tokenization of the sentence.
A sparse set of features. If using SparseStringToIdTransformer, description is required and id should be omitted; otherwise, id is required and description is optional. The id, weight, and description fields are all aligned if present (i.e., any of these that are non-empty should have the same number of items). If weight is omitted, 1.0 is used.
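The alignment rule for the sparse feature fields can be sketched as follows. The `iter_sparse` helper is hypothetical and works on plain Python lists rather than on the real message; it checks that all non-empty fields have the same number of items and substitutes the default weight of 1.0.

```python
from typing import Iterator, List, Optional, Tuple

Feature = Tuple[Optional[int], float, Optional[str]]


def iter_sparse(ids: List[int],
                weights: List[float],
                descriptions: List[str]) -> Iterator[Feature]:
    """Yields (id, weight, description) triples from aligned sparse fields."""
    sizes = {len(x) for x in (ids, weights, descriptions) if x}
    if len(sizes) > 1:
        raise ValueError(f"non-empty fields are misaligned: sizes {sorted(sizes)}")
    n = sizes.pop() if sizes else 0
    for i in range(n):
        yield (
            ids[i] if ids else None,
            weights[i] if weights else 1.0,  # omitted weight defaults to 1.0
            descriptions[i] if descriptions else None,
        )


by_id = list(iter_sparse([3, 7], [], []))
by_description = list(iter_sparse([], [0.5, 2.0], ["suffix=ing", "prefix=un"]))
```

The first call models the case where only ids are present; the second models the SparseStringToIdTransformer case, where descriptions are given and ids are omitted.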
Serializable representation of a string=>string mapping.
Key=>value pairs.
Serializable representation of a string=>string pair.
String representing the key.
String representing the value.
Task input descriptor.
Name of input resource.
Name of stage responsible for creating this resource.
File format for resource.
Record format for resource.
Is this resource multi-file?
An input can consist of multiple file sets.
File pattern for file set.
File format for file set.
Record format for file set.
Task output descriptor.
Name of output resource.
File format for output resource.
Record format for output resource.
Number of shards in output. If it is different from zero, this output is sharded. If the number of shards is set to -1, the output is sharded but the number of shards is unknown. The files are then named 'base-*-of-*'.
Base file name for output resource. If this is not set by the task component it is set to a default value by the workflow engine.
Optional extension added to the file name.
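The shard settings for an output can be illustrated with a hypothetical helper that expands a base name into file names. Only the 'base-*-of-*' pattern comes from the description above. The five-digit zero padding and the dot before the extension are assumptions made for this sketch, not documented behaviour of the workflow engine.

```python
from typing import List


def output_file_names(base: str, shards: int = 0, extension: str = "") -> List[str]:
    """Expands an output descriptor into file names or a glob pattern."""
    # Assumption: the optional extension is appended after a dot.
    suffix = f".{extension}" if extension else ""
    if shards == 0:
        return [base + suffix]  # zero shards: a single unsharded file
    if shards == -1:
        return [f"{base}-*-of-*{suffix}"]  # sharded, shard count unknown
    if shards < -1:
        raise ValueError("shards must be -1, 0, or a positive number")
    # Assumption: shard indices are zero-based and zero-padded to five digits.
    return [f"{base}-{i:05d}-of-{shards:05d}{suffix}" for i in range(shards)]


single = output_file_names("tagged")
sharded = output_file_names("tagged", shards=2, extension="rec")
unknown = output_file_names("tagged", shards=-1)
```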
A task specification is used for describing execution parameters.
Name of task.
Workflow task type.
Task inputs.
Task outputs.
Task parameters.
A document token marks a span of bytes in the document text as a token or word.
Token word form.
Start position of token in text.
End position of token in text. Gives index of last byte, not one past the last byte. If token came from lexer, excludes any trailing HTML tags.
Head of this token in the dependency tree: the id of the token which has an arc going to this one. If it is the root token of a sentence, then it is set to -1.
Part-of-speech tag for token.
Coarse-grained word category for token.
Label for dependency relation between this token and its head.
Break level for a token, indicating how it was separated from the previous token in the text.
No separation between tokens.
Tokens separated by space.
Tokens separated by line break.
Tokens separated by sentence break.
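Because a token's start and end positions are byte offsets and the end position is inclusive, recovering the token text takes a little care. The sketch below assumes the document text is encoded as UTF-8; `token_text` is a hypothetical helper, not part of any generated API.

```python
def token_text(document: str, start: int, end: int) -> str:
    """Returns the text of a token given inclusive byte offsets."""
    data = document.encode("utf-8")  # assumption: offsets refer to UTF-8 bytes
    # The end offset is the index of the last byte, so the slice needs end + 1.
    return data[start:end + 1].decode("utf-8")


# "\u00ef" takes two bytes in UTF-8, so byte and character offsets diverge.
text = "na\u00efve idea"
first = token_text(text, 0, 5)
second = token_text(text, 7, 10)
```

Slicing the Python string directly with these offsets would return the wrong span for the second token, since the first word is five characters but six bytes long.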
A light-weight proto to store vectors in binary format.
Can be a word, a phrase, a URL, etc.
If available, raw count of this token in the training corpus.
Stores information about the morphology of a token.
This attribute field is designated to hold a single disambiguated analysis.
Morphology is represented by a set of attribute values.