Query the LLM to generate text or tokens.
The maximum output length of a sequence. It is used in JetStream to control the output/decode length of a sequence and is not passed to the engine. Always set max_tokens <= (max_target_length - max_prefill_predict_length), where max_target_length is the maximum total length of a sequence and max_prefill_predict_length is the maximum length of the input/prefill of a sequence.
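As a concrete illustration of that length budget, a minimal sketch with made-up numbers is below; max_target_length and max_prefill_predict_length are engine-side configuration values, not fields of this request.

```python
# Illustrative numbers only: these two values come from the engine's
# configuration, not from DecodeRequest.
max_target_length = 2048           # maximum total length of a sequence
max_prefill_predict_length = 1024  # maximum length of the input/prefill

# max_tokens bounds only the decode/output phase, so it must fit in the
# budget left over after the prefill.
max_tokens = 256
assert max_tokens <= max_target_length - max_prefill_predict_length
```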
The client can pass the input either as a string, in which case the server will tokenize it, or as tokens, in which case it is the client's responsibility to tokenize the input strings with the correct tokenizer.
Indicates whether the content has a beginning of sequence (BOS) token.
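A minimal client-side sketch of the two input forms, assuming the generated Python bindings are importable as jetstream_pb2 and that the oneof members are named text_content and token_content with nested TextContent/TokenContent messages; verify the exact names and field numbers against the .proto.

```python
from jetstream.core.proto import jetstream_pb2  # assumed module path

# Option 1: send a raw string; the server tokenizes it.
text_request = jetstream_pb2.DecodeRequest(
    max_tokens=256,
    text_content=jetstream_pb2.DecodeRequest.TextContent(
        text="Why is the sky blue?"
    ),
)

# Option 2: tokenize on the client with the same tokenizer the server uses
# and send the ids directly (the ids below are purely illustrative).
token_request = jetstream_pb2.DecodeRequest(
    max_tokens=256,
    token_content=jetstream_pb2.DecodeRequest.TokenContent(
        token_ids=[1, 1036, 338, 278, 14744, 7254, 29973]
    ),
)
```

When sending pre-tokenized input, also set the BOS flag described above so the server knows whether a beginning-of-sequence token is already included.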
Checks if the model server is live.
(message has no fields)
Denotes whether the model server is live.
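A hedged liveness-check sketch: the module paths, the OrchestratorStub class name, and the is_live field name are assumptions based on this page, not guaranteed by it.

```python
import grpc

from jetstream.core.proto import jetstream_pb2, jetstream_pb2_grpc  # assumed paths

channel = grpc.insecure_channel("localhost:9000")  # address is illustrative
stub = jetstream_pb2_grpc.OrchestratorStub(channel)  # stub name assumed

# HealthCheckRequest carries no fields; the response holds the liveness flag.
response = stub.HealthCheck(jetstream_pb2.HealthCheckRequest())
print("model server live:", response.is_live)  # field name assumed
```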
Used in:
Used in:
Used in:
InitialContent supports returning initial one-off response data from the stream. It is a placeholder for future features such as a history cache.
Used in:
(message has no fields)
Used in:
Supports multiple samples in the StreamContent. The size of the Sample list depends on the text generation strategy the engine uses.
Used in:
The text string decoded from token id(s).
List of token ids, one list per sample. When speculative decoding is disabled, the list size should be 1; when speculative decoding is enabled, the list size should be >= 1.
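Putting the streaming pieces together, a sketch of consuming Decode responses; the stub and the stream_content, samples, text, and token_ids names mirror the comments on this page but should be checked against the .proto.

```python
import grpc

from jetstream.core.proto import jetstream_pb2, jetstream_pb2_grpc  # assumed paths

channel = grpc.insecure_channel("localhost:9000")  # address is illustrative
stub = jetstream_pb2_grpc.OrchestratorStub(channel)  # stub name assumed

request = jetstream_pb2.DecodeRequest(
    max_tokens=256,
    text_content=jetstream_pb2.DecodeRequest.TextContent(text="Why is the sky blue?"),
)

# Decode is server-streaming: each message is one DecodeResponse.
for response in stub.Decode(request):
    # The first message may be an InitialContent placeholder; its
    # stream_content is then empty and the inner loop is skipped.
    for sample in response.stream_content.samples:
        # One Sample per generated candidate. token_ids holds a single id per
        # step unless speculative decoding produced several accepted tokens.
        print(sample.text, list(sample.token_ids))
```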