/// Model Info
/// Empty request
(message has no fields)
/// Service discovery
/// Empty request
(message has no fields)
/// Other shards urls
/// Empties batch cache
/// Optional batch id
/// Empty response
(message has no fields)
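The three exchanges above are small enough to sketch directly. This is a speculative proto3 reconstruction: every message and field name below is an assumption inferred from the comments, and an InfoResponse is omitted because no field comments for it survive.

    message InfoRequest {}              // Empty request

    message ServiceDiscoveryRequest {}  // Empty request

    message ServiceDiscoveryResponse {
      repeated string urls = 1;         // Other shards urls
    }

    message ClearCacheRequest {
      optional uint64 id = 1;           // Optional batch id
    }

    message ClearCacheResponse {}       // Empty response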
/// Remove requests from a cached batch
/// Batch ID
/// Requests to keep
/// Filtered Batch (cached)
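The filter exchange pairs a batch id with the request ids to keep. A sketch under the same caveat (identifiers are guesses; CachedBatch is sketched further below):

    message FilterBatchRequest {
      uint64 batch_id = 1;              // Batch ID
      repeated uint64 request_ids = 2;  // Requests to keep
    }

    message FilterBatchResponse {
      CachedBatch batch = 1;            // Filtered Batch (cached)
    }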
/// Warmup the model and compute max cache size
/// Batch to warmup on
/// Maximum number of tokens supported by the model
/// Maximum input tokens: equal to the request value if one was set,
/// otherwise warmup automatically allocates a value here
/// Maximum total tokens: equal to the request value if one was set,
/// otherwise warmup automatically allocates a value here
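Warmup takes a batch and reports the limits it measured or inherited. The field names here (max_supported_total_tokens and friends) are guesses from the comments:

    message WarmupRequest {
      Batch batch = 1;                        // Batch to warmup on
    }

    message WarmupResponse {
      uint32 max_supported_total_tokens = 1;  // Max tokens supported by the model
      uint32 max_input_tokens = 2;            // Request value if set, else allocated here
      uint32 max_total_tokens = 3;            // Request value if set, else allocated here
    }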
/// Prefill batch and decode first token
/// Batch
/// Optional cached batch
/// Generation
/// Next batch (cached)
/// Forward elapsed time in nanoseconds
/// Decode elapsed time in nanoseconds
/// Total elapsed time in nanoseconds
/// Concatenate elapsed time in nanoseconds
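Prefill consumes a fresh batch, plus an optional cached one, and returns generations with timing counters. All identifiers in this sketch are assumptions:

    message PrefillRequest {
      Batch batch = 1;                        // Batch
      optional CachedBatch cached_batch = 2;  // Optional cached batch
    }

    message PrefillResponse {
      repeated Generation generations = 1;    // Generation
      optional CachedBatch batch = 2;         // Next batch (cached)
      uint64 forward_ns = 3;                  // Forward elapsed time in nanoseconds
      uint64 decode_ns = 4;                   // Decode elapsed time in nanoseconds
      uint64 total_ns = 5;                    // Total elapsed time in nanoseconds
      uint64 concat_ns = 6;                   // Concatenate elapsed time in nanoseconds
    }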
/// Decode token for a list of prefilled batches
/// Cached batches
/// Decodes
/// Next batch (cached)
/// Forward elapsed time in nanoseconds
/// Decode elapsed time in nanoseconds
/// Total elapsed time in nanoseconds
/// Concatenate elapsed time in nanoseconds
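Decode mirrors prefill but advances several already-cached batches at once; this sketch reuses the same assumed names:

    message DecodeRequest {
      repeated CachedBatch batches = 1;     // Cached batches
    }

    message DecodeResponse {
      repeated Generation generations = 1;  // Decodes
      optional CachedBatch batch = 2;       // Next batch (cached)
      uint64 forward_ns = 3;                // Elapsed times in nanoseconds
      uint64 decode_ns = 4;
      uint64 total_ns = 5;
      uint64 concat_ns = 6;
    }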
/// Health check
(message has no fields)
(message has no fields)
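Putting the eight RPCs together, the service plausibly has the following shape; the service name is invented for illustration, and the health pair is empty on both sides:

    message HealthRequest {}
    message HealthResponse {}

    service TextGenerationService {
      rpc Info (InfoRequest) returns (InfoResponse);
      rpc ServiceDiscovery (ServiceDiscoveryRequest) returns (ServiceDiscoveryResponse);
      rpc ClearCache (ClearCacheRequest) returns (ClearCacheResponse);
      rpc FilterBatch (FilterBatchRequest) returns (FilterBatchResponse);
      rpc Warmup (WarmupRequest) returns (WarmupResponse);
      rpc Prefill (PrefillRequest) returns (PrefillResponse);
      rpc Decode (DecodeRequest) returns (DecodeResponse);
      rpc Health (HealthRequest) returns (HealthResponse);
    }

The Prefill/Decode split is the usual continuous-batching pattern: prefill runs the full context once and emits the first token, then decode repeatedly advances every cached batch by one token.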
/// Batch ID
/// Individual requests
/// Batch size (==len(requests))
/// Maximum number of tokens this batch will grow to
/// Maximum number of Paged Attention blocks
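These five comments read as one message. A sketch with assumed names and numbering:

    message Batch {
      uint64 id = 1;                  // Batch ID
      repeated Request requests = 2;  // Individual requests
      uint32 size = 3;                // Batch size (==len(requests))
      uint32 max_tokens = 4;          // Maximum number of tokens this batch will grow to
      uint32 max_blocks = 5;          // Maximum number of Paged Attention blocks
    }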
/// Batch ID
/// Individual requests ids
/// Batch size (==len(requests))
/// Maximum number of tokens this batch will grow to
/// Number of tokens in the next forward
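The cached variant carries request ids instead of full requests, which presumably spares reshipping payloads on every decode step. Identifiers are again assumed:

    message CachedBatch {
      uint64 id = 1;                    // Batch ID
      repeated uint64 request_ids = 2;  // Individual requests ids
      uint32 size = 3;                  // Batch size (==len(requests))
      uint32 max_tokens = 4;            // Maximum number of tokens this batch will grow to
      uint32 current_tokens = 5;        // Number of tokens in the next forward
    }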
/// Output
/// Number of generated tokens
/// Finish reason
/// Seed
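A sketch of the finished-generation record; FinishReason is an assumed enum type whose values are not listed here:

    message GeneratedText {
      string text = 1;                 // Output
      uint32 generated_tokens = 2;     // Number of generated tokens
      FinishReason finish_reason = 3;  // Finish reason
      optional uint64 seed = 4;        // Seed
    }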
/// Request ID
/// Prefill tokens (optional)
/// Complete generated text
/// Top tokens
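The per-request generation result, sketched with one field per comment; every name is an assumption:

    message Generation {
      uint64 request_id = 1;                      // Request ID
      Tokens prefill_tokens = 2;                  // Prefill tokens (optional)
      optional GeneratedText generated_text = 3;  // Complete generated text
      repeated Tokens top_tokens = 4;             // Top tokens
    }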
/// Binary image data.
/// Image MIME type.
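The image message reduces to two fields; names assumed:

    message Image {
      bytes data = 1;       // Binary image data
      string mimetype = 2;  // Image MIME type
    }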
/// Plain text data
/// Image data
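A text/image alternative reads naturally as a oneof; both the oneof and the identifiers are assumptions:

    message InputChunk {
      oneof chunk {
        string text = 1;  // Plain text data
        Image image = 2;  // Image data
      }
    }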
/// exponential scaling of the output probability distribution
/// restricting to the k highest probability elements
/// restricting to the top tokens whose probabilities sum to prob_cut_off (top-p / nucleus sampling)
/// restricting to locally typical tokens whose probabilities sum to prob_cut_off (typical sampling)
/// apply sampling on the logits
/// random seed for sampling
/// repetition penalty
/// frequency penalty
/// token watermarking using "A Watermark for Large Language Models"
/// grammar (applied if not empty)
/// grammar type
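Collected into one message, the sampling parameters might look like this; names, numbering, and the GrammarType enum are all assumptions:

    message NextTokenChooserParameters {
      float temperature = 1;          // exponential scaling of the output distribution
      uint32 top_k = 2;               // keep only the k most probable tokens
      float top_p = 3;                // nucleus (top-p) cut-off
      float typical_p = 4;            // typical sampling cut-off
      bool do_sample = 5;             // sample instead of greedy argmax
      uint64 seed = 6;                // random seed for sampling
      float repetition_penalty = 7;   // repetition penalty
      float frequency_penalty = 8;    // frequency penalty
      bool watermark = 9;             // token watermarking
      string grammar = 10;            // grammar (applied if not empty)
      GrammarType grammar_type = 11;  // grammar type
    }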
/// Request ID
/// The generation context as chunks
/// The generation context, stringified input_chunks
/// Context truncation
/// Next Token Chooser Parameters
/// Stopping Criteria Parameters
/// Return prefill logprobs
/// Return most likely n tokens
/// Paged attention blocks
/// Paged attention slots
/// LoRA adapter index
/// Tokens that can be retrieved from the KV cache.
/// This value is set for the first prefill and never reset
/// Context truncation
/// Chunk of tokens that must be computed for the first prefill.
/// This value is set for the first prefill and never reset
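The request ties everything together. Every identifier below is assumed, and no field is sketched for the second "Context truncation" comment because its meaning can't be inferred from the listing:

    message Request {
      uint64 id = 1;                                       // Request ID
      repeated InputChunk input_chunks = 2;                // Generation context as chunks
      string inputs = 3;                                   // Stringified input_chunks
      uint32 truncate = 4;                                 // Context truncation
      NextTokenChooserParameters parameters = 5;           // Next token chooser parameters
      StoppingCriteriaParameters stopping_parameters = 6;  // Stopping criteria parameters
      bool prefill_logprobs = 7;                           // Return prefill logprobs
      uint32 top_n_tokens = 8;                             // Return most likely n tokens
      repeated uint32 blocks = 9;                          // Paged attention blocks
      repeated uint32 slots = 10;                          // Paged attention slots
      optional string adapter_id = 11;                     // LoRA adapter index
      uint32 cache_len = 12;                               // Tokens retrievable from the KV cache
      uint32 chunk_len = 13;                               // Tokens to compute for the first prefill
    }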
/// Maximum number of generated tokens
/// Optional stopping sequences
/// Ignore end of sequence token (used for benchmarking)
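A compact sketch of the stopping criteria, under the same naming caveat:

    message StoppingCriteriaParameters {
      uint32 max_new_tokens = 1;           // Maximum number of generated tokens
      repeated string stop_sequences = 2;  // Optional stopping sequences
      bool ignore_eos_token = 3;           // Ignore end of sequence token (benchmarking)
    }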
/// Token IDs
/// Logprobs
/// Token texts
/// Special token flags
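The token container is four parallel arrays, one per comment; field names are assumptions:

    message Tokens {
      repeated uint64 ids = 1;       // Token IDs
      repeated float logprobs = 2;   // Logprobs
      repeated string texts = 3;     // Token texts
      repeated bool is_special = 4;  // Special token flags
    }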