package nvidia.riva.asr

Get desktop application:
View/edit binary Protocol Buffers messages

The RivaSpeechRecognition service provides two mechanisms for converting speech to text.

rpc Recognize (RecognizeRequest, RecognizeResponse)
riva_asr.proto:32
Recognize expects a RecognizeRequest and returns a RecognizeResponse. This request will block until the audio is uploaded, processed, and a transcript is returned.
message RecognizeRequest
riva_asr.proto:44
RecognizeRequest is used for batch processing of a single audio recording.
- optional RecognitionConfig config = 1
  Provides information to recognizer that specifies how to process the request.
- bytes audio = 2
  The raw audio data to be processed. The audio bytes must be encoded as specified in `RecognitionConfig`.
message RecognizeResponse
riva_asr.proto:160
The only message returned to the client by the `Recognize` method. It contains the result as zero or more sequential `SpeechRecognitionResult` messages.
- repeated SpeechRecognitionResult results = 1
  Sequential list of transcription results corresponding to sequential portions of audio. Currently only returns one transcript.
rpc StreamingRecognize (stream StreamingRecognizeRequest, stream StreamingRecognizeResponse)
riva_asr.proto:37
StreamingRecognize is a non-blocking API call that allows audio data to be fed to the server in chunks as it becomes available. Depending on the configuration in the StreamingRecognizeRequest, intermediate results can be sent back to the client. Recognition ends when the stream is closed by the client.
message StreamingRecognizeRequest
riva_asr.proto:59
A StreamingRecognizeRequest is used to configure and stream audio content to the Riva ASR Service. The first message sent must include only a StreamingRecognitionConfig. Subsequent messages sent in the stream must contain only raw bytes of the audio to be recognized.
- oneof streaming_request
  The streaming request, which is either a streaming config or audio content.
  - StreamingRecognitionConfig streaming_config = 1
    Provides information to the recognizer that specifies how to process the request. The first `StreamingRecognizeRequest` message must contain a `streaming_config` message.
  - bytes audio_content = 2
    The audio data to be recognized. Sequential chunks of audio data are sent in sequential `StreamingRecognizeRequest` messages. The first `StreamingRecognizeRequest` message must not contain `audio` data and all subsequent `StreamingRecognizeRequest` messages must contain `audio` data. The audio bytes must be encoded as specified in `RecognitionConfig`.
message StreamingRecognizeResponse
riva_asr.proto:237
- repeated StreamingRecognitionResult results = 1
  This repeated list contains the latest transcript(s) corresponding to audio currently being processed. Currently one result is returned, where each result can have multiple alternatives

Provides information to the recognizer that specifies how to process the request

Used in: RecognizeRequest, StreamingRecognitionConfig

AudioEncoding encoding = 1
The encoding of the audio data sent in the request. All encodings support only 1 channel (mono) audio.
int32 sample_rate_hertz = 2
Sample rate in Hertz of the audio data sent in all `RecognizeAudio` messages.
string language_code = 3
Required. The language of the supplied audio as a [BCP-47](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) language tag. Example: "en-US". Currently only en-US is supported
int32 max_alternatives = 4
Maximum number of recognition hypotheses to be returned. Specifically, the maximum number of `SpeechRecognizeAlternative` messages within each `SpeechRecognizeResult`. The server may return fewer than `max_alternatives`. If omitted, will return a maximum of one.
int32 audio_channel_count = 7
The number of channels in the input audio data. ONLY set this for MULTI-CHANNEL recognition. Valid values for LINEAR16 and FLAC are `1`-`8`. Valid values for OGG_OPUS are '1'-'254'. Valid value for MULAW, AMR, AMR_WB and SPEEX_WITH_HEADER_BYTE is only `1`. If `0` or omitted, defaults to one channel (mono). Note: We only recognize the first channel by default. To perform independent recognition on each channel set `enable_separate_recognition_per_channel` to 'true'.
bool enable_word_time_offsets = 8
If `true`, the top result includes a list of words and the start and end time offsets (timestamps) for those words. If `false`, no word-level time offset information is returned. The default is `false`.
bool enable_automatic_punctuation = 11
If 'true', adds punctuation to recognition result hypotheses. The default 'false' value does not add punctuation to result hypotheses.
bool enable_separate_recognition_per_channel = 12
This needs to be set to `true` explicitly and `audio_channel_count` > 1 to get each channel recognized separately. The recognition result will contain a `channel_tag` field to state which channel that result belongs to. If this is not true, we will only recognize the first channel. The request is billed cumulatively for all channels recognized: `audio_channel_count` multiplied by the length of the audio.
string model = 13
Which model to select for the given request. Valid choices: Jasper, Quartznet
bool verbatim_transcripts = 14
The verbatim_transcripts flag enables or disable inverse text normalization. 'true' returns exactly what was said, with no denormalization. 'false' applies inverse text normalization, also this is the default
map<string, string> custom_configuration = 24
Custom fields for passing request-level configuration options to plugins used in the model pipeline.

Alternative hypotheses (a.k.a. n-best list).

Used in: SpeechRecognitionResult, StreamingRecognitionResult

string transcript = 1
Transcript text representing the words that the user spoke.
float confidence = 2
The non-normalized confidence estimate. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for a non-streaming result or, of a streaming result where `is_final=true`. This field is not guaranteed to be accurate and users should not rely on it to be always provided.
repeated WordInfo words = 3
A list of word-specific information for each recognized word. Only populated if is_final=true

A speech recognition result corresponding to the latest transcript

Used in: RecognizeResponse

repeated SpeechRecognitionAlternative alternatives = 1
May contain one or more recognition hypotheses (up to the maximum specified in `max_alternatives`). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.
int32 channel_tag = 2
For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from '1' to 'N'.
float audio_processed = 3
Length of audio processed so far in seconds

Provides information to the recognizer that specifies how to process the request

Used in: StreamingRecognizeRequest

optional RecognitionConfig config = 1
Provides information to the recognizer that specifies how to process the request
bool interim_results = 2
If `true`, interim results (tentative hypotheses) may be returned as they become available (these interim results are indicated with the `is_final=false` flag). If `false` or omitted, only `is_final=true` result(s) are returned.

A streaming speech recognition result corresponding to a portion of the audio that is currently being processed.

Used in: StreamingRecognizeResponse

repeated SpeechRecognitionAlternative alternatives = 1
May contain one or more recognition hypotheses (up to the maximum specified in `max_alternatives`). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.
bool is_final = 2
If `false`, this `StreamingRecognitionResult` represents an interim result that may change. If `true`, this is the final time the speech service will return this particular `StreamingRecognitionResult`, the recognizer will not return any further hypotheses for this portion of the transcript and corresponding audio.
float stability = 3
An estimate of the likelihood that the recognizer will not change its guess about this interim result. Values range from 0.0 (completely unstable) to 1.0 (completely stable). This field is only provided for interim results (`is_final=false`). The default of 0.0 is a sentinel value indicating `stability` was not set.
int32 channel_tag = 5
For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from '1' to 'N'.
float audio_processed = 6
Length of audio processed so far in seconds

Word-specific information for recognized words.

Used in: SpeechRecognitionAlternative

int32 start_time = 1
Time offset relative to the beginning of the audio in ms and corresponding to the start of the spoken word. This field is only set if `enable_word_time_offsets=true` and only in the top hypothesis.
int32 end_time = 2
Time offset relative to the beginning of the audio in ms and corresponding to the end of the spoken word. This field is only set if `enable_word_time_offsets=true` and only in the top hypothesis.
string word = 3
The word corresponding to this set of information.

package nvidia.riva.asr

service RivaSpeechRecognition

rpc Recognize (RecognizeRequest, RecognizeResponse)

message RecognizeRequest

optional RecognitionConfig config = 1

bytes audio = 2

message RecognizeResponse

repeated SpeechRecognitionResult results = 1

rpc StreamingRecognize (stream StreamingRecognizeRequest, stream StreamingRecognizeResponse)

message StreamingRecognizeRequest

oneof streaming_request

StreamingRecognitionConfig streaming_config = 1

bytes audio_content = 2

message StreamingRecognizeResponse

repeated StreamingRecognitionResult results = 1

message RecognitionConfig

AudioEncoding encoding = 1

int32 sample_rate_hertz = 2

string language_code = 3

int32 max_alternatives = 4

int32 audio_channel_count = 7

bool enable_word_time_offsets = 8

bool enable_automatic_punctuation = 11

bool enable_separate_recognition_per_channel = 12

string model = 13

bool verbatim_transcripts = 14

map<string, string> custom_configuration = 24

message SpeechRecognitionAlternative

string transcript = 1

float confidence = 2

repeated WordInfo words = 3

message SpeechRecognitionResult

repeated SpeechRecognitionAlternative alternatives = 1

int32 channel_tag = 2

float audio_processed = 3

message StreamingRecognitionConfig

optional RecognitionConfig config = 1

bool interim_results = 2

message StreamingRecognitionResult

repeated SpeechRecognitionAlternative alternatives = 1

bool is_final = 2

float stability = 3

int32 channel_tag = 5

float audio_processed = 6

message WordInfo

int32 start_time = 1

int32 end_time = 2

string word = 3