Proto commits in huggingface/text-generation-inference

These are the 74 commits in which the Protocol Buffers files changed:

Commit:da3f18e
Author:drbh
Committer:drbh

feat: include proto changes

Commit:68b5409
Author:drbh
Committer:drbh

feat: include proto changes

Commit:4479480
Author:drbh

feat: include proto changes

Commit:5ced960
Author:Miquel Farre
Committer:drbh

adopting video url

Commit:19e1c8d
Author:Miquel Farre
Committer:drbh

working version

Commit:bc5e202
Author:drbh
Committer:drbh

fix: adjust video process, reduce to 1 fps and adjust tensor shape

Commit:83a7f18
Author:David Holtz
Committer:drbh

fix: add protobuf update and mp4parse dep

Commit:0c9b6cd
Author:Nicolas Patry
Committer:GitHub

Choosing input/total tokens automatically based on available VRAM? (#2673)

* Choosing input/total tokens automatically based on available VRAM?
* Update doc.
* Remove generated files.
* Trying to fix non chunking targets.
* Attempt #2
* fix.
* QuantLinear is rocm compatible.
* Much simpler logic after the overhead.
* Updating logic + non flash.
* Revert doc text.
* Simple updates.
* Fix integration mt0 (transformers update).

The documentation is generated from this commit.
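The idea behind this change is to size the token budget from the memory that is actually free instead of requiring the user to pick `--max-total-tokens`. Below is a minimal sketch of that reasoning only, assuming a simple KV-cache footprint model; the function, parameter names, and headroom factor are illustrative and are not the launcher's actual code.

```rust
// Illustrative sketch only (not the actual launcher logic): derive a token
// budget from free accelerator memory and the per-token KV-cache footprint.
fn max_total_tokens(
    free_vram_bytes: u64,
    num_layers: u64,
    num_kv_heads: u64,
    head_dim: u64,
    dtype_bytes: u64,
    headroom: f64, // fraction of VRAM kept free for activations / fragmentation
) -> u64 {
    // Each token stores one key and one value vector per layer.
    let kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes;
    let usable = (free_vram_bytes as f64 * (1.0 - headroom)) as u64;
    usable / kv_bytes_per_token
}

fn main() {
    // Example: 24 GB free, 32 layers, 8 KV heads of dim 128, fp16, 20% headroom.
    let budget = max_total_tokens(24 * 1024 * 1024 * 1024, 32, 8, 128, 2, 0.2);
    println!("token budget ≈ {budget}");
}
```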

Commit:a1aac78
Author:Nicolas Patry
Committer:Nicolas Patry

Choosing input/total tokens automatically based on available VRAM?

The documentation is generated from this commit.

Commit:a6a0c97
Author:OlivierDehaene
Committer:GitHub

feat: prefill chunking (#2600)

* wip
* rollback
* refactor to use prefix/postfix namming + fix all_input_ids_tensor
* maybe patching vlms?
* fix filter and concat
* wip, no filter, no concat
* current
* add prepare_for_prefill
* working
* load tested
* re-create slots
* re-create slots
* fix slot_filtering_indices
* feedback loop
* remove log
* fix benchmarker
* fix vlm and seq2seq
* rename to cache and input lengths
* fix prefill logprobs
* fix launcher
* fix logprobs?
* idk at this point
* max input length
* omfg
* remove debugging lines
* fix tests
* fix mllama
* fix cargo tests
* remove support chunking for paged
* Fixing non blocked attentions
* Fixing dtype + AMD, Ipex targets.
* lint fix.
* rename
* Fix prefix_caching variable, remove defaults in server (confusing a lot of the times).
* Add simple resolution when user specifies ATTENTION=paged.
* Put back non default simple tests.
* Fix env name

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
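Prefill chunking splits a long prompt's prefill into smaller pieces so one huge prompt cannot monopolize a forward pass. The sketch below shows only the splitting step, assuming a fixed chunk size; the function name and shape are illustrative, not the server's actual implementation.

```rust
/// Illustrative only: split a prompt's token ids into prefill chunks of at
/// most `chunk_size` tokens, so the scheduler can interleave other requests
/// between chunks instead of running one very long prefill.
fn chunk_prefill(input_ids: &[u32], chunk_size: usize) -> Vec<&[u32]> {
    input_ids.chunks(chunk_size).collect()
}

fn main() {
    let prompt: Vec<u32> = (0..10).collect();
    for (i, chunk) in chunk_prefill(&prompt, 4).iter().enumerate() {
        println!("chunk {i}: {chunk:?}"); // [0,1,2,3], [4,5,6,7], [8,9]
    }
}
```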

Commit:2f0fde1
Author:Nicolas Patry
Committer:Nicolas Patry

TMP chunking.

Commit:e415b69
Author:Nicolas Patry
Committer:GitHub

Lots of improvements (Still 2 allocators) (#2449)

* Making prefix/flashinfer the default and testing the full release tests.
* Include flashinfer in the docker.
* Using prebuilt.
* Allowing window_left_size (dummy version).
* Disabling flashinfer/prefix caching on odd head_dim
* Disable prefix caching for lora.
* More specific codes.
* Update lock
* Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere.
* Update cargo lock ?
* Upgrade to 1.80 because of bitstream...
* Everywhere 1.80
* Forgot last default place.
* Apply suggestions from code review (Co-authored-by: drbh <david.richard.holtz@gmail.com>)
* Updated flake lock
* Tmp
* Upgrade resolution system for less errors in resolution.
* Remove lambda for cleaner function.
* Handling debugger.
* OVerride the env in server tests.
* Is this enough to make it work ?
* This seems to be working.
* Downgrade some logs.
* Fixing the default for vlm.
* Don't enable prefix caching on VLM just yet.
* Change `add_special_tokens` in order to have the correct tokens for chat input and not (since it's super important with the prefixing now)
* Fixing prefix caching for flashdecoding.
* Update all models.
* Fixed flashinfer version.
* add_special_tokens is internal only
* Fixing seqlen with the new vlms.
* Fixing the issue with `add_special_tokens` not being passed around.
* Fixing the test.
* Removing encoder_decoder (seq2seq).
* Update the chat test.
* Fixing the batching tokenization in flash causal lm.
* Truncating left for radix purposes.
* Oops this doesn't belong here.
* Put back default pure shell.
* Update server tests
  - Default to throughput test in k6
  - Use TGI_WIGGLE_ROOM to adjust wiggle room
* Only n_heads / process_group.size() are necessary.
* Revert the integrationt tests change (seem linked to head_size modification).
* Adding error message when assert is violated.
* Fixing the free algorithm to handle times where the common prefix is smaller.
* Apply suggestions from code review (Co-authored-by: OlivierDehaene <olivier@huggingface.co>)
* Update server/text_generation_server/layers/attention/common.py (Co-authored-by: OlivierDehaene <olivier@huggingface.co>)
* Fix disabling prefix caching - Fix windowing checks.
* Revert the Cohere tokenizer change (for now using a revision instead).
* Fmt.

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

Commit:2cf1f5c
Author:Nicolas Patry
Committer:Nicolas Patry

Fixing the issue with `add_special_tokens` not being passed around.

Commit:8deeaca
Author:Daniël de Kok
Committer:GitHub

Add support for prefix caching to the v3 router (#2392)

This change adds support for prefix caching to the v3 router. It is broken out from the backend support to ease reviewing.

For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`; in this case, the router will switch to `RadixAllocator`. This allocator uses a radix trie to keep track of prefills that were seen prior. If a new prefill is a prefix of a previously-seen prefill, the router will send a request with `prefix_len>0`, which the backend can use to reuse KV blocks from the cache rather than recomputing them.

Even though backend support is not added in this PR, the backend will still work with prefix caching enabled. The prefix lengths are just ignored and not used.
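The mechanism described here is a longest-prefix lookup: the router remembers prefills it has already seen and, for a new request, computes how many leading tokens are already covered. The sketch below illustrates that lookup with a plain map-backed token trie; it is not the router's actual `RadixAllocator`, and all names are illustrative.

```rust
use std::collections::HashMap;

/// Illustrative token-level trie: only the idea of remembering seen prefills
/// and finding the longest cached prefix, not the real RadixAllocator.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u32, TrieNode>,
}

#[derive(Default)]
struct PrefixCache {
    root: TrieNode,
}

impl PrefixCache {
    /// Record a prefill that was just computed.
    fn insert(&mut self, tokens: &[u32]) {
        let mut node = &mut self.root;
        for &t in tokens {
            node = node.children.entry(t).or_default();
        }
    }

    /// How many leading tokens of `tokens` have been seen before,
    /// i.e. the `prefix_len` the router would report to the backend.
    fn prefix_len(&self, tokens: &[u32]) -> usize {
        let mut node = &self.root;
        let mut len = 0;
        for &t in tokens {
            match node.children.get(&t) {
                Some(next) => {
                    node = next;
                    len += 1;
                }
                None => break,
            }
        }
        len
    }
}

fn main() {
    let mut cache = PrefixCache::default();
    cache.insert(&[1, 2, 3, 4]);
    assert_eq!(cache.prefix_len(&[1, 2, 3, 9]), 3); // reuse 3 cached tokens
    assert_eq!(cache.prefix_len(&[7, 8]), 0);       // nothing cached
    println!("ok");
}
```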

Commit:7735b38
Author:Daniël de Kok
Committer:Daniël de Kok

Prefix caching WIP

Commit:10b9405
Author:Nathan Brake
Committer:erikkaum

update docs

Commit:ea915ad
Author:Nathan Brake
Committer:erikkaum

Add support for no_repeat_ngram_size

Commit:ae46fae
Author:Nathan Brake

update docs

Commit:28e6a50
Author:Nathan Brake

Add support for no_repeat_ngram_size

Commit:04e1af9
Author:drbh
Committer:GitHub

Enable multiple LoRa adapters (#2010)

* feat: first draft load multiple lora
* feat: load weights within layer and refactor lora pass
* fix: refactor and reduce lora math
* feat: baseline impl single request multi lora support
* feat: prefer lorax implementation and port loading logic
* fix: prefer adapter_data and refactors
* feat: perfer loraxs custom punica kernels and add mlp loras
* fix: adjust batch for bgmv
* fix: adjust adapter_segments logic when in batch
* fix: refactor and move changes to v3 proto
* fix: pass model_id for all flash causal lms
* fix: pass model_id for all causal and seq2seq lms
* fix: add model_id to model test
* feat: add lora support to mistral and refactors
* feat: prefer model id in request
* fix: include rust code for adapter id
* feat: bump launcher and add new lora docs
* feat: support base model generation and refactors
* fix: rename doc to retry ci build
* feat: support if vlm models
* fix: add adapter_data param and avoid missing layers
* fix: add adapter_data param to phi and neox
* fix: update all models forwards to include adapter_data
* fix: add model_id to IdeficsCausalLM
* Update lora.md (fixed a typo)
* Update lora.md (fixing spam image)
* fix: add lora kernel to dockerfile, support running without kernels and refactors
* fix: avoid dockerfile conflict
* fix: refactors and adjust flash llama lora logic
* fix: skip llama test due to CI issue (temp)
* fix: skip llama test CI (temp) 2
* fix: revert skips and prefer updated ci token for tests
* fix: refactors and helpful comments
* fix: add noop in TensorParallelAdapterRowLinear too
* fix: refactor and move shard_lora_weights logic
* fix: exit early if no adapter_data

Co-authored-by: Derek <datavistics@gmail.com>
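Several of the bullets above refer to the "lora math", i.e. adding a low-rank update on top of the frozen base projection. As a reminder of what a single adapter contributes per layer, here is a minimal sketch of the standard LoRA forward pass with plain matrix arithmetic and illustrative names; it is not the server's batched punica/BGMV kernels.

```rust
/// Illustrative only: y = W·x + (alpha / r) · B·(A·x), the standard LoRA
/// update with rank-r matrices A (r × d_in) and B (d_out × r).
fn matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(w, xi)| w * xi).sum())
        .collect()
}

fn lora_forward(
    w: &[Vec<f32>], // frozen base weight, d_out × d_in
    a: &[Vec<f32>], // adapter A, r × d_in
    b: &[Vec<f32>], // adapter B, d_out × r
    alpha: f32,
    x: &[f32],
) -> Vec<f32> {
    let r = a.len() as f32;
    let base = matvec(w, x);
    let delta = matvec(b, &matvec(a, x));
    base.iter()
        .zip(&delta)
        .map(|(y, d)| y + (alpha / r) * d)
        .collect()
}

fn main() {
    let w = vec![vec![1.0, 0.0], vec![0.0, 1.0]]; // identity base weight
    let a = vec![vec![1.0, 1.0]];                 // rank 1
    let b = vec![vec![0.5], vec![0.5]];
    let y = lora_forward(&w, &a, &b, 1.0, &[2.0, 3.0]);
    println!("{y:?}"); // [4.5, 5.5]: base [2, 3] plus delta [2.5, 2.5]
}
```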

Commit:7ed1044
Author:OlivierDehaene

added padded blocks and logs everywhere

Commit:9ac7b7b
Author:OlivierDehaene

remove slots from grpc

Commit:73c3903
Author:OlivierDehaene
Committer:OlivierDehaene

FlashCausalLM implem

Commit:298bf31
Author:OlivierDehaene
Committer:OlivierDehaene

add terminated_generations

Commit:18e77a5
Author:OlivierDehaene
Committer:OlivierDehaene

wip

Commit:1cc8693
Author:OlivierDehaene
Committer:OlivierDehaene

wip

Commit:8b50f4b
Author:drbh
Committer:drbh

feat: prefer lorax implementation and port loading logic

Commit:db3d8e6
Author:drbh
Committer:drbh

feat: first draft load multiple lora

Commit:73eb2ae
Author:drbh
Committer:drbh

fix: refactor and move changes to v3 proto

Commit:81707bf
Author:drbh
Committer:drbh

fix: include rust code for adapter id

Commit:8aece3b
Author:OlivierDehaene
Committer:GitHub

feat: move allocation logic to rust (#1835) Close #2007

Commit:757223b
Author:OlivierDehaene
Committer:GitHub

feat: add SchedulerV3 (#1996)

- Refactor code to allow supporting multiple versions of the generate.proto at the same time
- Add v3/generate.proto (ISO to generate.proto for now, but allow for future changes without impacting v2 backends)
- Add Schedule trait to abstract queuing and batching mechanisms that will be different in the future
- Add SchedulerV2/V3 impl
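The point of the trait mentioned in the list above is to let the router speak two proto versions behind one abstraction. Below is a minimal sketch of that shape only; the trait, structs, and strings are made-up stand-ins, not the router's real API.

```rust
/// Illustrative sketch of hiding two backend protocol versions behind one
/// trait; every type here is a stand-in, not text-generation-inference code.
struct Request {
    inputs: String,
    max_new_tokens: u32,
}

trait Scheduler {
    /// Queue a request and (eventually) return the generated text.
    fn schedule(&mut self, request: Request) -> String;
}

struct SchedulerV2; // would wrap a v2/generate.proto client
struct SchedulerV3; // would wrap a v3/generate.proto client

impl Scheduler for SchedulerV2 {
    fn schedule(&mut self, request: Request) -> String {
        format!("v2 handled: {}", request.inputs)
    }
}

impl Scheduler for SchedulerV3 {
    fn schedule(&mut self, request: Request) -> String {
        format!("v3 handled: {} (up to {} new tokens)", request.inputs, request.max_new_tokens)
    }
}

fn main() {
    // The rest of the router only sees `dyn Scheduler`, so adding a new
    // proto version does not ripple through the HTTP layer.
    let mut backends: Vec<Box<dyn Scheduler>> = vec![Box::new(SchedulerV2), Box::new(SchedulerV3)];
    for backend in backends.iter_mut() {
        let request = Request { inputs: "hi".into(), max_new_tokens: 8 };
        println!("{}", backend.schedule(request));
    }
}
```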

Commit:df71aaf
Author:Daniël de Kok
Committer:Daniël de Kok

router: send the input as chunks to the backend

Before this change, the generation input was sent to the backend as a single string, encoding images as Base64 and packing them in Markdown-style links. This change adds a new chunked input representation that separates text chunks from image chunks. Image chunks contain binary data (for smaller message sizes) and the image's MIME type.

The stringly-typed inputs are still sent to support backends that do not support chunked inputs yet.
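The chunked representation described here is essentially a tagged union over text and image parts. The sketch below shows that shape with illustrative Rust types; these are not the actual generated proto messages.

```rust
/// Illustrative stand-in for the chunked input representation described
/// above: text stays as UTF-8, images travel as raw bytes plus a MIME type
/// instead of a Base64 blob inside a Markdown-style link.
enum InputChunk {
    Text(String),
    Image { data: Vec<u8>, mimetype: String },
}

fn describe(chunks: &[InputChunk]) {
    for chunk in chunks {
        match chunk {
            InputChunk::Text(t) => println!("text chunk: {t}"),
            InputChunk::Image { data, mimetype } => {
                println!("image chunk: {} bytes, {mimetype}", data.len())
            }
        }
    }
}

fn main() {
    let input = vec![
        InputChunk::Text("What is in this picture?".to_string()),
        InputChunk::Image { data: vec![0u8; 1024], mimetype: "image/png".to_string() },
    ];
    describe(&input);
}
```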

Commit:fc52ba6
Author:Daniël de Kok

router: send the input as chunks to the backend

Before this change, the generation input was sent to the backend as a single string, encoding images as Base64 and packing them in Markdown-style links. This change adds a new chunked input representation that separates text chunks from image chunks. Image chunks contain binary data (for smaller message sizes) and the image's MIME type.

The stringly-typed inputs are still sent to support backends that do not support chunked inputs yet.

Commit:1f7be73
Author:drbh

feat: remove uncompile grammar and improve logit processor logic

Commit:4f7074c
Author:drbh

feat: compile grammar and send over grpc

Commit:b40e833
Author:OlivierDehaene
Committer:GitHub

feat: starcoder2 (#1605)

Commit:cef0553
Author:drbh
Committer:GitHub

Outlines guided generation (#1539)

This WIP PR starts to add grammar support via outlines. Currently this PR supports very simple regex grammars and does not optimize for precompiling or caching grammar FSMs.

todo:
- [X] add simple outlines guidance to `NextTokenChooser`
- [X] update protos for grammar
- [X] update generation params API
- [X] constrain simple grammar
- [ ] support parsing more complex grammar into fsm
- [ ] support all outlines-supported grammar types
- [ ] explore optimizations to avoid recompiling grammars

guided request

```bash
curl -s 'http://localhost:3000/generate' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "inputs": "make an email for david: \n",
    "parameters": {
      "max_new_tokens": 6,
      "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+"
    }
  }' | jq
```

response

```json
{
  "generated_text": "david@example.com"
}
```

unguided request

```bash
curl -s 'http://localhost:3000/generate' \
  --header 'Content-Type: application/json' \
  --data '{
    "inputs": "make an email for david: \n",
    "parameters": {
      "max_new_tokens": 6
    }
  }' | jq
```

response

```json
{
  "generated_text": " email = 'david"
}
```

Commit:09b7c26
Author:OlivierDehaene
Committer:GitHub

feat(server): add frequency penalty (#1541)

Commit:d077150
Author:OlivierDehaene
Committer:GitHub

fix: fix gpt-q with groupsize = -1 (#1358)

Commit:50b495f
Author:OlivierDehaene
Committer:GitHub

feat: add more latency metrics in forward (#1346)

Commit:9ecfa16
Author:Nicolas Patry
Committer:GitHub

Speculative (#1308)

Commit:a478b27
Author:Nicolas Patry
Committer:Nicolas Patry

Modifying the protobuf.

Commit:3b56d76
Author:OlivierDehaene
Committer:GitHub

feat: add mistral model (#1071)

Commit:33958e0
Author:Nicolas Patry

Start.

Commit:211b54a
Author:Nicolas Patry
Committer:GitHub

Rebased #617 (#868)

Co-authored-by: Vincent Brouwers <vincent.brouwers@ing.com>

Commit:fe80f53
Author:OlivierDehaene
Committer:GitHub

feat(server): auto max_batch_total_tokens for flash att models (#630)

Commit:e74bd41
Author:OlivierDehaene
Committer:GitHub

feat(server): add paged attention to flash models (#516) Closes #478

Commit:895c5f1
Author:OlivierDehaene
Committer:GitHub

feat(server): only compute prefill logprobs when asked (#406) Close #288

Commit:218c9ad
Author:OlivierDehaene
Committer:GitHub

feat: decrease IPC proto size (#367) Closes #307 #308

Commit:db2b4e0
Author:Nicolas Patry
Committer:GitHub

feat(router): new healthcheck that skips the queue (#244) Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co>

Commit:ebc74d5
Author:OlivierDehaene
Committer:GitHub

feat(router): use number of tokens in batch as input for dynamic batching (#226) Co-authored-by: Nick Hill <nickhill@us.ibm.com>
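The change this title describes is batching by total token count rather than by number of requests. Below is a minimal sketch of that admission rule, assuming a simple FIFO queue; the struct, function, and budget value are illustrative, not the router's actual queue code.

```rust
/// Illustrative only: admit queued requests into the next batch until the
/// total token count (not the request count) would exceed the budget.
struct QueuedRequest {
    id: u64,
    num_tokens: u32,
}

fn next_batch(queue: &mut Vec<QueuedRequest>, token_budget: u32) -> Vec<QueuedRequest> {
    let mut batch = Vec::new();
    let mut used = 0;
    while let Some(req) = queue.first() {
        let tokens = req.num_tokens;
        if used + tokens > token_budget {
            break; // this request would not fit; leave it queued
        }
        used += tokens;
        batch.push(queue.remove(0));
    }
    batch
}

fn main() {
    let mut queue = vec![
        QueuedRequest { id: 1, num_tokens: 300 },
        QueuedRequest { id: 2, num_tokens: 500 },
        QueuedRequest { id: 3, num_tokens: 400 },
    ];
    let batch = next_batch(&mut queue, 1000);
    // ids 1 and 2 fit in the 1000-token budget; id 3 waits for the next batch.
    println!("batched: {:?}", batch.iter().map(|r| r.id).collect::<Vec<_>>());
}
```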

Commit:343437c
Author:OlivierDehaene
Committer:GitHub

feat(router): add device and dtype info (#215)

Commit:9987960
Author:OlivierDehaene
Committer:GitHub

feat(router): make router input validation optional (#164)

Commit:610bb1f
Author:OlivierDehaene
Committer:GitHub

feat(benchmark): tui based benchmarking tool (#149)

Commit:f000068
Author:OlivierDehaene
Committer:GitHub

feat(server): clear cache on error (#143)

Commit:b49dbf2
Author:OlivierDehaene
Committer:GitHub

fix(server): use server tokenizer as gt (#128)

Commit:1a2d682
Author:OlivierDehaene
Committer:GitHub

feat: support typical sampling (#114) closes #112

Commit:9b8ea6a
Author:OlivierDehaene
Committer:GitHub

feat(server): add logits watermark (#90)

Commit:0ac184c
Author:OlivierDehaene
Committer:GitHub

feat(server): add special token bool (#85)

Commit:20c3c59
Author:OlivierDehaene
Committer:GitHub

feat(router): refactor API and add openAPI schemas (#53)

Commit:313194f
Author:OlivierDehaene
Committer:GitHub

feat(server): support repetition penalty (#47)

Commit:017a2a8
Author:OlivierDehaene
Committer:GitHub

feat: Add token streaming using ServerSideEvents support (#41)

Commit:54fec93
Author:OlivierDehaene
Committer:GitHub

fix(server): fix seeding with multiple shards (#44)

Commit:4f9ac67
Author:OlivierDehaene
Committer:GitHub

Revert "feat: Add token streaming using ServerSideEvents support" (#40) Reverts huggingface/text-generation-inference#36

Commit:7fbfbb0
Author:OlivierDehaene
Committer:GitHub

feat: Add token streaming using ServerSideEvents support (#36)

Add token streaming using ServerSideEvents (SSE). The signature of the SSE events is:

```rust
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

struct StreamResponse {
    token: Token,
    generated_text: Option<String>,
    details: Option<Details>,
}

struct ErrorResponse {
    error: String,
}
```

Commit:cd298bc
Author:OlivierDehaene
Committer:GitHub

feat: Support sampling seeding (#37) Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>

Commit:32a2530
Author:OlivierDehaene
Committer:GitHub

feat: Return logprobs (#8)

Commit:718096f
Author:OlivierDehaene
Committer:GitHub

feat: Support stop sequences (#7)

Commit:427d7cc
Author:OlivierDehaene

feat(server): Support AutoModelForSeq2SeqLM

Commit:c5665f5
Author:OlivierDehaene

feat(server): Support generic AutoModelForCausalLM

Commit:f16f2f5
Author:Olivier Dehaene
Committer:OlivierDehaene

v0.1.0

Commit:4c693e6
Author:Olivier Dehaene

Refactored gRPC interface
Added validation logic

Commit:295831a
Author:Olivier Dehaene

Init