package speech.sparrowhawk


Abbreviation which will be expanded using morphosyntactic features.

Used in: Token

required string text = 1
optional string morphosyntactic_features = 2
optional string code_switch = 3
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.

A number that should be read as a cardinal, e.g. "one". The input is given as a string; verbalization can be controlled via optional morphosyntactic features for gender, case, etc.

Used in: Measure, Token

required string integer = 1
A string of digits optionally prefixed with '-'.
optional string morphosyntactic_features = 2
optional bool preserve_order = 3
Preserve order from what was given in the text
optional string code_switch = 4
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
repeated string field_order = 5
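
The constraint on the integer field ("a string of digits optionally prefixed with '-'") can be checked before a message is built. The sketch below is only an illustration: the helper name is hypothetical and not part of Sparrowhawk, and restricting digits to ASCII 0-9 is an assumption, since the description does not say which digit characters are allowed.

```python
import re

# Assumption: "digits" means ASCII 0-9; the schema text does not specify this.
_CARDINAL_INTEGER = re.compile(r"-?[0-9]+")


def is_valid_cardinal_integer(value: str) -> bool:
    """Return True if value is a run of digits optionally prefixed with '-'.

    Hypothetical helper for illustration; not a Sparrowhawk API.
    """
    return _CARDINAL_INTEGER.fullmatch(value) is not None
```

For example, "42" and "-7" pass this check, while "4.2" and the empty string do not.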

A token that connects two other tokens, such as the : in 1:1 or the x in 4x3.

Used in: Token

optional string type = 1
The type of connector this is, eg. "ratio", "range", etc.
optional string morphosyntactic_features = 2
optional string code_switch = 3
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.

A date, given as day, month and year. 'style' could be used to control, e.g., whether day=4, month=5, year=1997 is said "fourth of may nineteen ninety seven" or "may the fourth nineteen ninety seven". 'era' is for era indicators such as "CE", "BCE", "AD", etc.

Used in: Token

optional string weekday = 1
optional string day = 2
optional string month = 3
Set at least one of month and year
optional string year = 4
optional int32 style = 5
optional string text = 6
optional bool short_year = 7
optional string era = 8
optional string morphosyntactic_features = 9
optional bool preserve_order = 10
optional string code_switch = 11
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
repeated string field_order = 12

A number read out with a decimal point. For example, "34.45" is integer_part: "34", fractional_part: "45", and "-23.64" is integer_part: "23", fractional_part: "64", negative: true. The optional exponent field handles cases like -3.24E29. The quantity field is intended to be used instead of exponent to represent quantities expressed as words, e.g. 2.3 billion. This is analogous to what is done in Money, except that it is more general than Money and should be available to other classes that use Decimal.

Used in: Measure, Money, Token

optional bool negative = 1
Set to true or leave unset, do not set to false.
optional string integer_part = 2
optional string fractional_part = 3
optional string quantity = 4
optional string exponent = 5
optional int32 style = 6
optional string morphosyntactic_features = 7
optional bool preserve_order = 8
optional string code_switch = 9
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
repeated string field_order = 10
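
The mapping from a written number to these fields can be sketched as follows. This is a minimal illustration, not Sparrowhawk's own tokenizer: the DecimalFields class and split_decimal function are hypothetical names, and the input is assumed to look like "-3.24E29" with an uppercase "E". Note that negative is either True or left unset (None here), never False, as the field comment requires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DecimalFields:
    """Hypothetical stand-in for the string fields of the Decimal message."""
    integer_part: str
    fractional_part: Optional[str] = None
    negative: Optional[bool] = None  # True or unset (None); never False
    exponent: Optional[str] = None


def split_decimal(text: str) -> DecimalFields:
    """Split text such as "-23.64" or "-3.24E29" into Decimal-style fields."""
    mantissa, separator, exponent = text.partition("E")
    negative = mantissa.startswith("-")
    if negative:
        mantissa = mantissa[1:]
    integer_part, _, fractional_part = mantissa.partition(".")
    return DecimalFields(
        integer_part=integer_part,
        fractional_part=fractional_part or None,
        negative=True if negative else None,
        exponent=exponent if separator else None,
    )
```

With this sketch, split_decimal("-23.64") yields integer_part "23", fractional_part "64" and negative True, matching the example in the description.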

Electronic items such as URLs, email addresses, etc. The full schema for URLs, which email addresses can effectively be seen as a subset of, is:
protocol://username:password@domain:port/path?query_string#fragment_id
Hence populating just username and domain will read as an email address.

Used in: Token

optional string protocol = 1
optional string username = 2
optional string password = 3
Only used if username is set.
optional string domain = 4
optional int32 port = 5
optional string path = 6
optional string query_string = 7
optional string fragment_id = 8
optional string morphosyntactic_features = 9
optional bool preserve_order = 10
optional string code_switch = 11
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
repeated string field_order = 12
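
The correspondence between the URL schema and the Electronic fields can be illustrated with the Python standard library. This is a sketch under assumptions: split_url and split_email are hypothetical helpers, plain dictionaries stand in for the message, and the example URL is invented. urlsplit returns the port as an integer, which happens to match the int32 port field.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Map a URL onto Electronic-style field names, omitting unset parts."""
    parts = urlsplit(url)
    fields = {
        "protocol": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "domain": parts.hostname,
        "port": parts.port,
        "path": parts.path,
        "query_string": parts.query,
        "fragment_id": parts.fragment,
    }
    # Drop components that are absent from the URL.
    return {name: value for name, value in fields.items() if value not in (None, "")}


def split_email(address: str) -> dict:
    """An email address populates only username and domain."""
    username, _, domain = address.rpartition("@")
    return {"username": username, "domain": domain}
```

Calling split_email("user@example.com") fills only username and domain, which is the combination the description says will be read as an email address.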

A number that should be read as a fraction, e.g. "three quarters". The input is specified as a separate numerator and denominator.

Used in: Measure, Token

optional string integer_part = 1
required string numerator = 2
required string denominator = 3
optional int32 style = 4
optional string morphosyntactic_features = 5
optional bool preserve_order = 6
optional string code_switch = 7
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
optional bool negative = 8
repeated string field_order = 9
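
As an illustration of how written text might be mapped onto these fields, the sketch below splits inputs such as "3/4" or "-1 3/4". The accepted input format, the split_fraction name and the use of a dictionary in place of the message are all assumptions made for the example.

```python
import re

# Assumed input format: optional '-', optional whole number plus space, then n/d.
_FRACTION = re.compile(
    r"(?P<sign>-)?(?:(?P<integer>[0-9]+) )?(?P<num>[0-9]+)/(?P<den>[0-9]+)"
)


def split_fraction(text: str) -> dict:
    """Split text like "3/4" or "-1 3/4" into Fraction-style fields."""
    match = _FRACTION.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised fraction: {text!r}")
    # numerator and denominator are the two required fields.
    fields = {"numerator": match["num"], "denominator": match["den"]}
    if match["integer"] is not None:
        fields["integer_part"] = match["integer"]
    if match["sign"]:
        fields["negative"] = True
    return fields
```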

required string grammar_file = 1
required string grammar_name = 2
Name for this grammar.
repeated Rule rules = 3

A single utterance's linguistic structure

Used in: Utterance

optional int64 id = 1
ID uniquely identifying this utterance. If used in asynchronous mode the utterance IDs can be used to match multiply emitted utterances generated from a single source. 64-bit integer is used to accommodate utterance ID as a timestamp.
optional string input = 2
The original sentence.
repeated Token tokens = 3
repeated Word words = 4

Used in: Token, Word

optional int32 own_index = 1
The index of this entity; mainly useful for debugging purposes.
optional int32 parent = 2
The index of the parent entity of the current entity.
optional int32 first_child = 3
The index of the first child of the current entity.
optional int32 last_child = 4
The index of the last child of the current entity.

A measure, e.g. 6 feet, 9 meters, etc. 'units' holds the units of the measure, e.g. "miles"; definitions of all legal units are in a fixed list in the text norm params. The cardinal field makes it easier to incorporate East Asian counter words as measures. The vast majority of the time these appear after an integer, so this avoids the excess baggage of using decimal or fraction markup. More generally, this could be useful for real measures in other languages for similar reasons. The motivation for treating counter words as measures is that, in languages that have them, more familiar measures are treated as a subset of counter words, in that one never gets a counter word AND a measure.

Used in: Token

optional Decimal decimal = 1
Set one of the following three.
optional Fraction fraction = 2
optional Cardinal cardinal = 3
optional string units = 4
optional int32 style = 5
optional string morphosyntactic_features = 6
optional bool preserve_order = 7
optional string code_switch = 8
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
repeated string field_order = 9
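
The instruction to set one of decimal, fraction and cardinal can be checked mechanically. The sketch below reads it as "exactly one", which is an interpretation rather than something the schema enforces; has_single_amount is a hypothetical helper and a dictionary stands in for the Measure message.

```python
from typing import Any, Mapping

# The three amount fields of Measure, as listed above.
AMOUNT_FIELDS = ("decimal", "fraction", "cardinal")


def has_single_amount(measure: Mapping[str, Any]) -> bool:
    """Return True if exactly one of decimal, fraction, cardinal is set."""
    return sum(1 for name in AMOUNT_FIELDS if measure.get(name) is not None) == 1
```

A measure such as {"cardinal": {"integer": "6"}, "units": "feet"} passes the check; one with no amount, or with two amounts, does not.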

An amount of money, e.g. $12.50, £15, etc. 'style' could be used to control how it is read. For the example $12.50:
1: "twelve dollars and fifty cents"
2: "twelve dollars fifty"
3: "twelve united states dollars and fifty cents"

Used in: Token

required Decimal amount = 1
The amount of money
optional int64 quantity = 2
Optional million/billion etc, as an integer
required string currency = 3
Lowercase ISO 4217 currency code, e.g. "usd". The minimally supported set of currencies is listed in kestrel_grammar/verbalize/universal/money_definition.txt
optional int32 style = 4
optional string morphosyntactic_features = 5
optional bool preserve_order = 6
optional string code_switch = 7
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
repeated string field_order = 8
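
The effect of the style field can be sketched for the $12.50 example from the description. In Sparrowhawk the actual wording comes from the verbalizer grammars; the function below is a hypothetical illustration that takes already-verbalized number words and hard-codes the three readings quoted above for "usd" only.

```python
def verbalize_usd(dollars: str, cents: str, style: int) -> str:
    """Combine verbalized dollar and cent words according to a style value.

    Hypothetical helper: the three styles mirror the $12.50 readings in the
    Money description and are not taken from any grammar file.
    """
    if style == 1:
        return f"{dollars} dollars and {cents} cents"
    if style == 2:
        return f"{dollars} dollars {cents}"
    if style == 3:
        return f"{dollars} united states dollars and {cents} cents"
    raise ValueError(f"unsupported style: {style}")
```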

A number that should be read as an ordinal, e.g. "first". The input is specified by a string and optional morphosyntactic features for gender, case etc.

Used in: Token

required string integer = 1
optional string morphosyntactic_features = 2
optional bool preserve_order = 3
optional string code_switch = 4
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
repeated string field_order = 5

Used in: Grammar

required string main = 1
Main normalization rule.
optional string parens = 2
Optional PDT parens.
optional string assignments = 3
Optional MPDT assignments.
optional string redup = 4
Optional reduplication rule.

optional string tokenizer_grammar = 1
Tokenizer-classifier.
optional string verbalizer_grammar = 2
optional string sentence_boundary_regexp = 3
Regular expression for sentence boundary detector. This is a set of possible end-of-sentence markers.
optional string sentence_boundary_exceptions_file = 4
Optional file specifying tokens that end in a possible end-of-sentence marker that should *not* usually induce an end-of-sentence decision e.g. “Mr.”
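
How a boundary regular expression and an exceptions list could interact is sketched below. This is not Sparrowhawk's sentence boundary detector: split_sentences is a hypothetical function, the exceptions are passed as an in-memory set because the file format is not described here, and an exception always suppresses a break, whereas the description only says it "usually" should.

```python
import re
from typing import Iterable, List


def split_sentences(text: str, boundary_regexp: str, exceptions: Iterable[str]) -> List[str]:
    """Split text at boundary markers unless the preceding token is an exception."""
    exception_set = set(exceptions)
    sentences = []
    start = 0
    for match in re.finditer(boundary_regexp, text):
        end = match.end()
        tokens = text[start:end].split()
        # The token that ends in the candidate end-of-sentence marker.
        last_token = tokens[-1] if tokens else ""
        if last_token in exception_set:
            continue  # e.g. "Mr." should not end the sentence
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

For instance, with the boundary expression `[.!?]` and the exception "Mr.", the text "Mr. Smith arrived. He sat down." splits into two sentences rather than three.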

A telephone number. NB. There should always be at least one number_part.

Used in: Token

optional string country_code = 1
Country code (excluding '+'), eg. 44 for UK.
repeated string number_part = 2
Parts of the phone number
optional string extension = 3
The internal extension.
optional int32 style = 4
If 1, the number is read digit by digit. If 2, digits are grouped ("double 2", etc.).
optional string morphosyntactic_features = 5
optional bool preserve_order = 6
optional string code_switch = 7
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
repeated string field_order = 8
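
The two telephone styles can be illustrated for a single number_part. The sketch is an assumption-laden example, not the verbalizer grammar: read_number_part is a hypothetical name, the English digit words are supplied here, and the handling of runs longer than two (pairs become "double", any leftover digit is read alone) is a guess, since the description only mentions "double 2".

```python
from itertools import groupby

_DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def read_number_part(part: str, style: int) -> str:
    """Verbalize one number_part: style 1 digit by digit, style 2 grouped."""
    if style == 1:
        return " ".join(_DIGIT_WORDS[digit] for digit in part)
    if style == 2:
        words = []
        for digit, run in groupby(part):
            count = len(list(run))
            # Assumption: each pair of repeated digits is read as "double X".
            words.extend([f"double {_DIGIT_WORDS[digit]}"] * (count // 2))
            if count % 2:
                words.append(_DIGIT_WORDS[digit])
        return " ".join(words)
    raise ValueError(f"unsupported style: {style}")
```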

A time, given as hours, minutes and seconds. 'style' controls how the time should be spoken. For example "hours=13, minutes=15" could be "one fifteen pm", "thirteen fifteen", "a quarter past one" etc. The styles are defined in the language specific verbalizer. 'zone' contains an optional time zone which is verbalized letter by letter, eg. PST, GMT etc.

Used in: Token

optional int32 hours = 1
optional int32 minutes = 2
optional int32 seconds = 3
optional bool speak_period = 4
Do not set to false.
optional string suffix = 5
optional int32 style = 6
optional string zone = 7
optional string morphosyntactic_features = 9
optional bool preserve_order = 10
optional string code_switch = 11
A string to indicate switching to a separate language, esp. useful for bilingual situations, e.g. when switching between English and Hindi depending on context.
repeated string field_order = 12

Message containing the contents for a single token as determined by the tokenizer. Roughly speaking, a token corresponds to a single verbalizable entity, such as a single word, or single semiotic object such as "$15.60".

Used in: LinguisticStructure

optional Links links = 1
Structural relationships. The children are words. TODO(rws): Probably phase out links since we are not using it.
optional uint32 start_index = 2
Optional information about where this token came from in the original input. Indices are given in Unicode codepoints (*not* byte indices).
optional uint32 end_index = 3
optional string name = 4
The name of the token, which is generally the original unnormalized text the token was generated from. Voice Building Note: This field appears on ScriptLine protos that serve as input to voice building.
optional Token.Type type = 5
Basic type of the token (see enum comments). Voice Building Note: This field appears on ScriptLine protos that serve as input to voice building.
optional string wordid = 6
The wordid of the token, when a single one is known. Set when type == WORD
optional string spelling = 7
If the token is a word, this represents the regular lower-cased spelling of that word.
optional bool phrase_break = 8
If true, this token is a phrase break.
optional float pause_duration = 9
Indicates a pause of given length, in seconds. Used when pause given from markup. Currently unused.
optional Token.PauseLength pause_length = 10
If set, indicates a general length of pause that should be introduced for synthesis. For example, a fullstop would generally generate a longer pause than a comma. Currently unused.
optional string spelling_with_stress = 11
This is used to store spelling with stress mark produced by stress assigner or provided in input text. Currently unused.
optional bool skip = 12
If true, don't verbalize this token. Used to skip tokens that are part of a multi-token semiotic class, or bypass homograph resolution when explicit wordids are provided.
optional bool next_space = 13
Is true if a space follows this token. E.g., after tokenization in Chinese/Japanese. Currently unused.
optional Cardinal cardinal = 14
All the following (fields in the range [14-27]) are used when the token represents a semiotic class. In such a case, one of these is filled by the output from the classifier/parser stage. Alternatively, if part of the input was given as markup, it will be copied from the input to these fields.
optional Ordinal ordinal = 15
optional string digit = 16
optional Decimal decimal = 17
optional Fraction fraction = 18
optional Time time = 19
optional Measure measure = 20
optional Decimal percent = 21
optional Date date = 22
optional Telephone telephone = 23
optional Money money = 24
optional Electronic electronic = 25
optional string verbatim = 26
optional string letters = 27
optional Connector connector = 28
Tokens defined by things they connect, for example "-" in "Mon-Fri", ":" in "1:1", etc.
optional Abbreviation abbreviation = 29
Abbreviations, intended for languages where they may inflect depending on case etc.
optional int32 first_daughter = 30
Indices of the first and last words.
optional int32 last_daughter = 31
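
Because start_index and end_index are given in Unicode codepoints rather than bytes, they differ from byte offsets as soon as the input contains non-ASCII characters. The sketch below converts a codepoint span to a UTF-8 byte span. The function name is hypothetical, and treating end_index as exclusive is an assumption; the field comment does not say whether it is inclusive.

```python
from typing import Tuple


def codepoint_span_to_byte_span(text: str, start_index: int, end_index: int) -> Tuple[int, int]:
    """Convert a codepoint span into the matching UTF-8 byte span.

    Assumes end_index is exclusive; Python str indices are already codepoints.
    """
    byte_start = len(text[:start_index].encode("utf-8"))
    byte_end = len(text[:end_index].encode("utf-8"))
    return byte_start, byte_end
```

In the input "£15 now", the token "now" covers codepoints 4 to 7 but bytes 5 to 8, because "£" occupies two bytes in UTF-8.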

General pause duration lengths.

Used in: Token

PAUSE_NONE = 0
No pause.
PAUSE_SHORT = 1
Brief pause, eg. for brackets or quotes.
PAUSE_MEDIUM = 2
Longer pause, for a comma or similar.
PAUSE_LONG = 3
Longest pause, for a fullstop or phrase break.

Describes the kind of entity this token represents.

Used in: Token

WORD = 1
A known word which is present in the lexicon.
SEMIOTIC_CLASS = 2
A semiotic class.
PUNCT = 3
Punctuation which is not expected to be pronounced.
WORD_NEEDS_VERBALIZATION = 4
A word, but requires some further verbalization work. For example, Thai words with a trailing repetition character.

An utterance

optional uint64 id = 1
An arbitrary identifier used to identify the utterance for debugging purposes. The controller assigns this id internally when it creates an utterance, and the id is unique (with high probability) within the process. Currently unused.
optional string filename = 2
If loaded from file, the filename (usually without a path). Mainly intended as a human-readable identifier for debugging purposes. Currently unused.
optional string sentence = 3
This field can be mutated by various text pre-processing streams, such as character segmenters and text filters. Currently unused.
optional string original_sentence = 4
Copy of the original sentence that is guaranteed not to be changed by the pipeline. Currently unused.
repeated string segmenter_output = 5
If segmentation was applied on the original sentences, the following field will contain the results of the segmentation. Each string corresponds to an individual sentence. Currently unused.
optional LinguisticStructure linguistic = 6
Linguistic streams, words, tokens etc.

A single word

Used in: LinguisticStructure

optional Links links = 1
Structural relationships. The parent items are tokens. TODO(rws): Probably phase out links since we are not using it.
optional string id = 2
The id of the word, predominantly used as a key into the lexicon.
optional string spelling = 3
The conventional spelling of the word. There can be several spellings matching one id in the lexicon (e.g. colour, color correspond to the same wordid) and vice versa (spelling "project" maps to ids "project_nou" and "project_vrb").
optional float pause_length = 4
If set, indicates the length of pause that should be generated for this word, in seconds. Only applies to the special word "sil". Currently unused.
optional bool precedes_pause = 5
True when the prosodic_features have specified that there should (value true) or should not (value false) be a pause just after this word, either because contains_pause was specified in an utterance in which this was the penultimate word, or because precedes_pause was specified in an utterance in which this was the last word. Currently unused.
optional int32 parent = 6
Parent token

package speech.sparrowhawk

message Abbreviation

required string text = 1

optional string morphosyntactic_features = 2

optional string code_switch = 3

message Cardinal

required string integer = 1

optional string morphosyntactic_features = 2

optional bool preserve_order = 3

optional string code_switch = 4

repeated string field_order = 5

message Connector

optional string type = 1

optional string morphosyntactic_features = 2

optional string code_switch = 3

message Date

optional string weekday = 1

optional string day = 2

optional string month = 3

optional string year = 4

optional int32 style = 5

optional string text = 6

optional bool short_year = 7

optional string era = 8

optional string morphosyntactic_features = 9

optional bool preserve_order = 10

optional string code_switch = 11

repeated string field_order = 12

message Decimal

optional bool negative = 1

optional string integer_part = 2

optional string fractional_part = 3

optional string quantity = 4

optional string exponent = 5

optional int32 style = 6

optional string morphosyntactic_features = 7

optional bool preserve_order = 8

optional string code_switch = 9

repeated string field_order = 10

message Electronic

optional string protocol = 1

optional string username = 2

optional string password = 3

optional string domain = 4

optional int32 port = 5

optional string path = 6

optional string query_string = 7

optional string fragment_id = 8

optional string morphosyntactic_features = 9

optional bool preserve_order = 10

optional string code_switch = 11

repeated string field_order = 12

message Fraction

optional string integer_part = 1

required string numerator = 2

required string denominator = 3

optional int32 style = 4

optional string morphosyntactic_features = 5

optional bool preserve_order = 6

optional string code_switch = 7

optional bool negative = 8

repeated string field_order = 9

message Grammar

required string grammar_file = 1

required string grammar_name = 2

repeated Rule rules = 3

message LinguisticStructure

optional int64 id = 1

optional string input = 2

repeated Token tokens = 3

repeated Word words = 4

message Links

optional int32 own_index = 1

optional int32 parent = 2

optional int32 first_child = 3

optional int32 last_child = 4

message Measure

optional Decimal decimal = 1

optional Fraction fraction = 2

optional Cardinal cardinal = 3