Abbreviation which will be expanded using morphosyntactic features.
Used in:
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
A number that should be read as a cardinal, e.g. "one". The input is specified by a string, and verbalization is controlled via optional morphosyntactic features for gender, case, etc.
Used in:
A string of digits optionally prefixed with '-'.
Preserve the order given in the text.
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
A token that connects two other tokens, such as the : in 1:1 or the x in 4x3.
Used in:
The type of connector this is, e.g. "ratio", "range", etc.
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
A date, given as day, month, and year. 'style' can be used to control, e.g., whether day=4, month=5, year=1997 is said "fourth of may nineteen ninety seven" or "may the fourth nineteen ninety seven". 'era' is for era indicators such as "CE", "BCE", "AD", etc.
Used in:
Set at least one of month and year
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
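As a rough illustration of how 'style' in the Date message might select between the two readings quoted above, the following Python sketch orders day, month, and year strings that have already been verbalized. The function name `verbalize_date`, the style numbers 1 and 2, and the error handling are assumptions made for this example; the real styles live in the language-specific verbalizer.

```python
from typing import Optional


def verbalize_date(day: Optional[str] = None,
                   month: Optional[str] = None,
                   year: Optional[str] = None,
                   style: int = 1) -> str:
    """Order already-verbalized date parts; the style numbers are hypothetical."""
    # The schema asks for at least one of month and year to be set.
    if month is None and year is None:
        raise ValueError("set at least one of month and year")
    if day is not None and month is not None:
        if style == 2:
            # Month-first reading: "may the fourth".
            head = [f"{month} the {day}"]
        else:
            # Day-first reading: "fourth of may".
            head = [f"{day} of {month}"]
    else:
        # Only one of day/month is present; keep whichever is set.
        head = [part for part in (day, month) if part is not None]
    parts = head + ([year] if year is not None else [])
    return " ".join(parts)
```

Calling `verbalize_date("fourth", "may", "nineteen ninety seven", style=2)` would give the month-first reading under these assumptions.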
A number read out with a decimal point. For example, "34.45" is integer_part: "34", fraction: "45"; "-23.64" is integer_part: "23", fraction: "64", negative: true. There is an optional exponent field to handle cases like -3.24E29. The quantity field is intended for cases where the exponent is not used, to represent quantities expressed as words, e.g. 2.3 billion. This is analogous to what is done in Money, except that it is more general than Money and should be available to other classes that use Decimal.
Used in:
Set to true or leave unset, do not set to false.
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
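To make the Decimal examples above concrete, here is a minimal Python sketch that splits a written decimal into the parts the description names. `DecimalParts` and `split_decimal` are hypothetical names, not the generated proto class, and storing the exponent as a string is an assumption. Note how `negative` is either `True` or left unset, never `False`.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class DecimalParts:
    """Hypothetical stand-in for the Decimal message (not the real proto class)."""
    integer_part: str
    fraction: str
    negative: Optional[bool] = None   # True or unset (None); never False
    exponent: Optional[str] = None    # assumption: kept as a digit string


# Optional sign, integer digits, a point, fraction digits, optional exponent.
_DECIMAL_RE = re.compile(r"^(-?)(\d+)\.(\d+)(?:[eE]([+-]?\d+))?$")


def split_decimal(text: str) -> DecimalParts:
    match = _DECIMAL_RE.match(text)
    if match is None:
        raise ValueError(f"not a decimal: {text!r}")
    sign, integer_part, fraction, exponent = match.groups()
    return DecimalParts(
        integer_part=integer_part,
        fraction=fraction,
        negative=True if sign else None,  # leave unset for positive numbers
        exponent=exponent,
    )
```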
Electronic items such as URLs, email addresses, etc. URLs follow the full schema protocol://username:password@domain:port/path?query_string#fragment_id, and email addresses can effectively be seen as a subset of it. Hence populating just username and domain will read as an email address.
Used in:
Only used if username is set.
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
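The URL schema given for electronic items can be sketched as a small Python rendering function. The `Electronic` dataclass below is a hypothetical stand-in that reuses the component names from the schema string; treating the password as the component that is "only used if username is set" is an assumption, since the extracted page does not name that field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Electronic:
    """Hypothetical stand-in using the component names from the URL schema."""
    protocol: Optional[str] = None
    username: Optional[str] = None
    password: Optional[str] = None
    domain: Optional[str] = None
    port: Optional[str] = None
    path: Optional[str] = None
    query_string: Optional[str] = None
    fragment_id: Optional[str] = None


def render(e: Electronic) -> str:
    """Reassemble protocol://username:password@domain:port/path?query#fragment."""
    out = ""
    if e.protocol:
        out += f"{e.protocol}://"
    if e.username:
        out += e.username
        # Assumption: the password is ignored unless a username is present.
        if e.password:
            out += f":{e.password}"
        out += "@"
    if e.domain:
        out += e.domain
    if e.port:
        out += f":{e.port}"
    if e.path:
        out += f"/{e.path}"
    if e.query_string:
        out += f"?{e.query_string}"
    if e.fragment_id:
        out += f"#{e.fragment_id}"
    return out
```

With only `username` and `domain` populated, the rendered string has the shape of an email address, which matches the remark in the description.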
A number that should be read as a fraction, e.g. "three quarters". The input is specified as a separate numerator and denominator.
Used in:
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
Name for this grammar.
A single utterance's linguistic structure
Used in:
ID uniquely identifying this utterance. In asynchronous mode, the utterance IDs can be used to match multiply emitted utterances generated from a single source. A 64-bit integer is used so that a timestamp can serve as the utterance ID.
The original sentence.
Used in:
The index of this entity; mainly useful for debugging purposes.
The index of the parent entity of the current entity.
The index of the first child of the current entity.
The index of the last child of the current entity.
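The four indices above describe a tree stored in flat lists. A minimal Python sketch of walking from an entity to its children follows; the `Entity` dataclass and `children` helper are hypothetical, and the sketch assumes that children occupy a contiguous index range from the first child to the last child inclusive, which the descriptions suggest but do not state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entity:
    """Hypothetical stand-in holding the four link indices."""
    index: int
    parent: Optional[int] = None
    first_child: Optional[int] = None
    last_child: Optional[int] = None


def children(entities: List[Entity], node: Entity) -> List[Entity]:
    """Return the child entities of `node`, or an empty list for a leaf."""
    if node.first_child is None or node.last_child is None:
        return []
    # Assumption: children are contiguous, last_child inclusive.
    return entities[node.first_child:node.last_child + 1]
```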
A measure, e.g. 6 feet, 9 meters, etc. 'units' holds the units of the measure, e.g. "miles"; definitions of all legal units are in a fixed list in the text norm params. The cardinal field makes it easier to incorporate East Asian counter words as measures. The vast majority of the time these appear after an integer, so this avoids the excess baggage of using decimal or fraction markup. More generally, this could be useful for real measures in other languages for similar reasons. The motivation for treating counter words as measures is that, in languages that have them, more familiar measures are treated as a subset of counter words, in that one never gets a counter word and a measure together.
Used in:
Set one of the following three.
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
An amount of money, e.g. $12.50, £15, etc. 'style' can be used to control how it is read. For the example $12.50: style 1 is "twelve dollars and fifty cents", style 2 is "twelve dollars fifty", and style 3 is "twelve united states dollars and fifty cents".
Used in:
The amount of money
Optional million/billion etc, as an integer
Lowercase ISO 4217 currency code, e.g. "usd". The minimally supported set of currencies is listed in kestrel_grammar/verbalize/universal/money_definition.txt
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
A number that should be read as an ordinal, e.g. "first". The input is specified by a string and optional morphosyntactic features for gender, case etc.
Used in:
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
Used in:
Main normalization rule.
Optional PDT parens.
Optional MPDT assignments.
Optional reduplication rule.
Tokenizer-classifier.
Regular expression for sentence boundary detector. This is a set of possible end-of-sentence markers.
Optional file specifying tokens that end in a possible end-of-sentence marker that should *not* usually induce an end-of-sentence decision e.g. “Mr.”
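The interaction between the end-of-sentence regular expression and the exception list can be sketched in Python as below. The function name, the whitespace tokenization, and passing the exceptions as an in-memory set rather than a file are all assumptions for illustration. The description says exceptions should not *usually* end a sentence; this sketch simplifies that to never.

```python
import re
from typing import List, Set


def split_sentences(text: str, eos_regex: str, exceptions: Set[str]) -> List[str]:
    """Split on tokens ending in an end-of-sentence marker, minus exceptions."""
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        # Does the token end in one of the possible end-of-sentence markers?
        ends_in_marker = re.search(f"(?:{eos_regex})$", token) is not None
        if ends_in_marker and token not in exceptions:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing material without a final marker still forms a sentence.
        sentences.append(" ".join(current))
    return sentences
```

For instance, with `eos_regex=r"[.!?]"` and `exceptions={"Mr."}`, the token "Mr." does not trigger a break even though it ends in a period.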
A telephone number. NB. There should always be at least one number_part.
Used in:
Country code (excluding '+'), e.g. 44 for the UK.
Parts of the phone number
The internal extension.
If 1, the number is read digit by digit. If 2, digits are grouped ("double 2" etc.).
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
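The two telephone reading styles can be illustrated with a short Python sketch. `read_number_part` is a hypothetical helper, and the exact grouping rule for style 2 (pairing repeated digits as "double", with any odd digit left over read on its own) is an assumption; the real behaviour is defined by the verbalizer grammars.

```python
from itertools import groupby

_DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
           "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}


def read_number_part(part: str, style: int = 1) -> str:
    """Verbalize one number_part; style 1 is digit by digit, style 2 groups."""
    if not part or any(d not in _DIGITS for d in part):
        raise ValueError(f"expected a non-empty digit string, got {part!r}")
    if style == 1:
        return " ".join(_DIGITS[d] for d in part)
    words = []
    for digit, run in groupby(part):
        count = len(list(run))
        # Pair up repeats: "22" -> "double two", "222" -> "double two two".
        words.extend([f"double {_DIGITS[digit]}"] * (count // 2))
        if count % 2:
            words.append(_DIGITS[digit])
    return " ".join(words)
```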
A time, given as hours, minutes and seconds. 'style' controls how the time should be spoken. For example "hours=13, minutes=15" could be "one fifteen pm", "thirteen fifteen", "a quarter past one", etc. The styles are defined in the language-specific verbalizer. 'zone' contains an optional time zone, which is verbalized letter by letter, e.g. PST, GMT, etc.
Used in:
Do not set to false.
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
Message containing the contents for a single token as determined by the tokenizer. Roughly speaking, a token corresponds to a single verbalizable entity, such as a single word, or single semiotic object such as "$15.60".
Used in:
Structural relationships. The children are words. TODO(rws): Probably phase out links since we are not using them.
Optional information about where this token came from in the original input. Indices are given in Unicode codepoints (*not* byte indices).
The name of the token, which is generally the original unnormalized text the token was generated from. Voice Building Note: This field appears on ScriptLine protos that serve as input to voice building.
Basic type of the token (see enum comments). Voice Building Note: This field appears on ScriptLine protos that serve as input to voice building.
The wordid of the token, when a single one is known. Set when type == WORD
If the token is a word, this represents the regular lower-cased spelling of that word.
If true, this token is a phrase break.
Indicates a pause of given length, in seconds. Used when pause given from markup. Currently unused.
If set, indicates a general length of pause that should be introduced for synthesis. For example, a full stop would generally generate a longer pause than a comma. Currently unused.
This is used to store spelling with stress mark produced by stress assigner or provided in input text. Currently unused.
If true, don't verbalize this token. Used to skip tokens that are part of a multi-token semiotic class, or bypass homograph resolution when explicit wordids are provided.
Is true if a space follows this token. E.g., after tokenization in Chinese/Japanese. Currently unused.
All the following (fields in the range [14-27]) are used when the token represents a semiotic class. In such a case, one of these is filled by the output from the classifier/parser stage. Alternatively, if part of the input was given as markup, it will be copied from the input to these fields.
Tokens defined by things they connect, for example "-" in "Mon-Fri", ":" in "1:1", etc.
Abbreviations, intended for languages where they may inflect depending on case etc.
Indices of the first and last words.
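Because the token's source offsets are given in Unicode codepoints rather than bytes, the two diverge as soon as the input contains non-ASCII characters. The Python sketch below shows the difference; both helper names are hypothetical, and treating the end offset as exclusive is an assumption, since the field description does not say.

```python
def token_text(text: str, start: int, end: int) -> str:
    """Slice a token out of the input using codepoint offsets.

    Python strings index by codepoint, so a plain slice is enough.
    Assumption: `end` is exclusive.
    """
    return text[start:end]


def codepoint_to_byte_offset(text: str, codepoint_index: int) -> int:
    """Convert a codepoint offset into the matching UTF-8 byte offset."""
    return len(text[:codepoint_index].encode("utf-8"))
```

In the input "café $15.60", the token "$15.60" starts at codepoint 5 but at UTF-8 byte 6, because "é" occupies two bytes.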
General pause duration lengths.
Used in:
No pause.
Brief pause, e.g. for brackets or quotes.
Longer pause, for a comma or similar.
Longest pause, for a full stop or phrase break.
Describes the kind of entity this token represents.
Used in:
A known word which is present in the lexicon.
A semiotic class.
Punctuation which is not expected to be pronounced.
A word, but requires some further verbalization work. For example, Thai words with a trailing repetition character.
An utterance
An arbitrary identifier used to identify the utterance for debugging purposes. The controller assigns this id internally when it creates an utterance, and the id is unique (with high probability) within the process. Currently unused.
If loaded from file, the filename (usually without a path). Mainly intended as a human-readable identifier for debugging purposes. Currently unused.
This field can be mutated by various text pre-processing streams, such as character segmenters and text filters. Currently unused.
Copy of the original sentence that is guaranteed not to be changed by the pipeline. Currently unused.
If segmentation was applied on the original sentences, the following field will contain the results of the segmentation. Each string corresponds to an individual sentence. Currently unused.
Linguistic streams, words, tokens etc.
A single word
Used in:
Structural relationships. The parent items are tokens. TODO(rws): Probably phase out links since we are not using them.
The id of the word, predominantly used as a key into the lexicon.
The conventional spelling of the word. There can be several spellings matching one id in the lexicon (e.g. colour, color correspond to the same wordid) and vice versa (spelling "project" maps to ids "project_nou" and "project_vrb").
If set, indicates the length of pause that should be generated for this word, in seconds. Only applies to the special word "sil". Currently unused.
True when the prosodic_features have specified that there should (value true) or should not (value false) be a pause just after this word, either because contains_pause was specified in an utterance in which this was the penultimate word, or because precedes_pause was specified in an utterance in which this was the last word. Currently unused.
Parent token