Proto commits in huggingface/text-generation-inference

These are the 74 commits in which the Protocol Buffers files changed:

Commit:da3f18e
Author:drbh
Committer:drbh

feat: include proto changes

Commit:68b5409
Author:drbh
Committer:drbh

feat: include proto changes

Commit:4479480
Author:drbh

feat: include proto changes

Commit:5ced960
Author:Miquel Farre
Committer:drbh

adopting video url

Commit:19e1c8d
Author:Miquel Farre
Committer:drbh

working version

Commit:bc5e202
Author:drbh
Committer:drbh

fix: adjust video process, reduce to 1 fps and adjust tensor shape

Commit:83a7f18
Author:David Holtz
Committer:drbh

fix: add protobuf update and mp4parse dep

Commit:0c9b6cd
Author:Nicolas Patry
Committer:GitHub

Choosing input/total tokens automatically based on available VRAM? (#2673)

* Choosing input/total tokens automatically based on available VRAM?
* Update doc.
* Remove generated files.
* Trying to fix non chunking targets.
* Attempt #2
* fix.
* QuantLinear is rocm compatible.
* Much simpler logic after the overhead.
* Updating logic + non flash.
* Revert doc text.
* Simple updates.
* Fix integration mt0 (transformers update).

The documentation is generated from this commit.
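The idea behind this change is to size the token budget from the memory that is actually free instead of requiring the user to pick `--max-total-tokens`. Below is a minimal sketch of that reasoning only, assuming a simple KV-cache footprint model; the function, parameter names, and headroom factor are illustrative and are not the launcher's actual code.

```rust
// Illustrative sketch only (not the actual launcher logic): derive a token
// budget from free accelerator memory and the per-token KV-cache footprint.
fn max_total_tokens(
    free_vram_bytes: u64,
    num_layers: u64,
    num_kv_heads: u64,
    head_dim: u64,
    dtype_bytes: u64,
    headroom: f64, // fraction of VRAM kept free for activations / fragmentation
) -> u64 {
    // Each token stores one key and one value vector per layer.
    let kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes;
    let usable = (free_vram_bytes as f64 * (1.0 - headroom)) as u64;
    usable / kv_bytes_per_token
}

fn main() {
    // Example: 24 GB free, 32 layers, 8 KV heads of dim 128, fp16, 20% headroom.
    let budget = max_total_tokens(24 * 1024 * 1024 * 1024, 32, 8, 128, 2, 0.2);
    println!("token budget ≈ {budget}");
}
```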

Commit:a1aac78
Author:Nicolas Patry
Committer:Nicolas Patry

Choosing input/total tokens automatically based on available VRAM?

The documentation is generated from this commit.

Commit:a6a0c97
Author:OlivierDehaene
Committer:GitHub

feat: prefill chunking (#2600)

* wip
* rollback
* refactor to use prefix/postfix namming + fix all_input_ids_tensor
* maybe patching vlms?
* fix filter and concat
* wip, no filter, no concat
* current
* add prepare_for_prefill
* working
* load tested
* re-create slots
* re-create slots
* fix slot_filtering_indices
* feedback loop
* remove log
* fix benchmarker
* fix vlm and seq2seq
* rename to cache and input lengths
* fix prefill logprobs
* fix launcher
* fix logprobs?
* idk at this point
* max input length
* omfg
* remove debugging lines
* fix tests
* fix mllama
* fix cargo tests
* remove support chunking for paged
* Fixing non blocked attentions
* Fixing dtype + AMD, Ipex targets.
* lint fix.
* rename
* Fix prefix_caching variable, remove defaults in server (confusing a lot of the times).
* Add simple resolution when user specifies ATTENTION=paged.
* Put back non default simple tests.
* Fix env name

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
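Prefill chunking splits a long prompt's prefill into smaller pieces so one huge prompt cannot monopolize a forward pass. The sketch below shows only the splitting step, assuming a fixed chunk size; the function name and shape are illustrative, not the server's actual implementation.

```rust
/// Illustrative only: split a prompt's token ids into prefill chunks of at
/// most `chunk_size` tokens, so the scheduler can interleave other requests
/// between chunks instead of running one very long prefill.
fn chunk_prefill(input_ids: &[u32], chunk_size: usize) -> Vec<&[u32]> {
    input_ids.chunks(chunk_size).collect()
}

fn main() {
    let prompt: Vec<u32> = (0..10).collect();
    for (i, chunk) in chunk_prefill(&prompt, 4).iter().enumerate() {
        println!("chunk {i}: {chunk:?}"); // [0,1,2,3], [4,5,6,7], [8,9]
    }
}
```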

Commit:2f0fde1
Author:Nicolas Patry
Committer:Nicolas Patry

TMP chunking.

Commit:e415b69
Author:Nicolas Patry
Committer:GitHub

Lots of improvements (Still 2 allocators) (#2449)

* Making prefix/flashinfer the default and testing the full release tests.
* Include flashinfer in the docker.
* Using prebuilt.
* Allowing window_left_size (dummy version).
* Disabling flashinfer/prefix caching on odd head_dim
* Disable prefix caching for lora.
* More specific codes.
* Update lock
* Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere.
* Update cargo lock ?
* Upgrade to 1.80 because of bitstream...
* Everywhere 1.80
* Forgot last default place.
* Apply suggestions from code review (Co-authored-by: drbh <david.richard.holtz@gmail.com>)
* Updated flake lock
* Tmp
* Upgrade resolution system for less errors in resolution.
* Remove lambda for cleaner function.
* Handling debugger.
* OVerride the env in server tests.
* Is this enough to make it work ?
* This seems to be working.
* Downgrade some logs.
* Fixing the default for vlm.
* Don't enable prefix caching on VLM just yet.
* Change `add_special_tokens` in order to have the correct tokens for chat input and not (since it's super important with the prefixing now)
* Fixing prefix caching for flashdecoding.
* Update all models.
* Fixed flashinfer version.
* add_special_tokens is internal only
* Fixing seqlen with the new vlms.
* Fixing the issue with `add_special_tokens` not being passed around.
* Fixing the test.
* Removing encoder_decoder (seq2seq).
* Update the chat test.
* Fixing the batching tokenization in flash causal lm.
* Truncating left for radix purposes.
* Oops this doesn't belong here.
* Put back default pure shell.
* Update server tests
  - Default to throughput test in k6
  - Use TGI_WIGGLE_ROOM to adjust wiggle room
* Only n_heads / process_group.size() are necessary.
* Revert the integrationt tests change (seem linked to head_size modification).
* Adding error message when assert is violated.
* Fixing the free algorithm to handle times where the common prefix is smaller.
* Apply suggestions from code review (Co-authored-by: OlivierDehaene <olivier@huggingface.co>)
* Update server/text_generation_server/layers/attention/common.py (Co-authored-by: OlivierDehaene <olivier@huggingface.co>)
* Fix disabling prefix caching - Fix windowing checks.
* Revert the Cohere tokenizer change (for now using a revision instead).
* Fmt.

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>

Commit:2cf1f5c
Author:Nicolas Patry
Committer:Nicolas Patry

Fixing the issue with `add_special_tokens` not being passed around.

Commit:8deeaca
Author:Daniël de Kok
Committer:GitHub

Add support for prefix caching to the v3 router (#2392)

This change adds support for prefix caching to the v3 router. It is broken out from the backend support to ease reviewing.

For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`; in this case, the router will switch to `RadixAllocator`. This allocator uses a radix trie to keep track of prefills that were seen prior. If a new prefill is a prefix of a previously-seen prefill, the router will send a request with `prefix_len>0`, which the backend can use to reuse KV blocks from the cache rather than recomputing them.

Even though backend support is not added in this PR, the backend will still work with prefix caching enabled. The prefix lengths are just ignored and not used.
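The mechanism described here is a longest-prefix lookup: the router remembers prefills it has already seen and, for a new request, computes how many leading tokens are already covered. The sketch below illustrates that lookup with a plain map-backed token trie; it is not the router's actual `RadixAllocator`, and all names are illustrative.

```rust
use std::collections::HashMap;

/// Illustrative token-level trie: only the idea of remembering seen prefills
/// and finding the longest cached prefix, not the real RadixAllocator.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u32, TrieNode>,
}

#[derive(Default)]
struct PrefixCache {
    root: TrieNode,
}

impl PrefixCache {
    /// Record a prefill that was just computed.
    fn insert(&mut self, tokens: &[u32]) {
        let mut node = &mut self.root;
        for &t in tokens {
            node = node.children.entry(t).or_default();
        }
    }

    /// How many leading tokens of `tokens` have been seen before,
    /// i.e. the `prefix_len` the router would report to the backend.
    fn prefix_len(&self, tokens: &[u32]) -> usize {
        let mut node = &self.root;
        let mut len = 0;
        for &t in tokens {
            match node.children.get(&t) {
                Some(next) => {
                    node = next;
                    len += 1;
                }
                None => break,
            }
        }
        len
    }
}

fn main() {
    let mut cache = PrefixCache::default();
    cache.insert(&[1, 2, 3, 4]);
    assert_eq!(cache.prefix_len(&[1, 2, 3, 9]), 3); // reuse 3 cached tokens
    assert_eq!(cache.prefix_len(&[7, 8]), 0);       // nothing cached
    println!("ok");
}
```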

Commit:7735b38
Author:Daniël de Kok
Committer:Daniël de Kok

Prefix caching WIP

Commit:10b9405
Author:Nathan Brake
Committer:erikkaum

update docs

Commit:ea915ad
Author:Nathan Brake
Committer:erikkaum

Add support for no_repeat_ngram_size

Commit:ae46fae
Author:Nathan Brake

update docs

Commit:28e6a50
Author:Nathan Brake

Add support for no_repeat_ngram_size

Commit:04e1af9
Author:drbh
Committer:GitHub

Enable multiple LoRa adapters (#2010)

* feat: first draft load multiple lora
* feat: load weights within layer and refactor lora pass
* fix: refactor and reduce lora math
* feat: baseline impl single request multi lora support
* feat: prefer lorax implementation and port loading logic
* fix: prefer adapter_data and refactors
* feat: perfer loraxs custom punica kernels and add mlp loras
* fix: adjust batch for bgmv
* fix: adjust adapter_segments logic when in batch
* fix: refactor and move changes to v3 proto
* fix: pass model_id for all flash causal lms
* fix: pass model_id for all causal and seq2seq lms
* fix: add model_id to model test
* feat: add lora support to mistral and refactors
* feat: prefer model id in request
* fix: include rust code for adapter id
* feat: bump launcher and add new lora docs
* feat: support base model generation and refactors
* fix: rename doc to retry ci build
* feat: support if vlm models
* fix: add adapter_data param and avoid missing layers
* fix: add adapter_data param to phi and neox
* fix: update all models forwards to include adapter_data
* fix: add model_id to IdeficsCausalLM
* Update lora.md (fixed a typo)
* Update lora.md (fixing spam image)
* fix: add lora kernel to dockerfile, support running without kernels and refactors
* fix: avoid dockerfile conflict
* fix: refactors and adjust flash llama lora logic
* fix: skip llama test due to CI issue (temp)
* fix: skip llama test CI (temp) 2
* fix: revert skips and prefer updated ci token for tests
* fix: refactors and helpful comments
* fix: add noop in TensorParallelAdapterRowLinear too
* fix: refactor and move shard_lora_weights logic
* fix: exit early if no adapter_data

Co-authored-by: Derek <datavistics@gmail.com>
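Several of the bullets above refer to the "lora math", i.e. adding a low-rank update on top of the frozen base projection. As a reminder of what a single adapter contributes per layer, here is a minimal sketch of the standard LoRA forward pass with plain matrix arithmetic and illustrative names; it is not the server's batched punica/BGMV kernels.

```rust
/// Illustrative only: y = W·x + (alpha / r) · B·(A·x), the standard LoRA
/// update with rank-r matrices A (r × d_in) and B (d_out × r).
fn matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(w, xi)| w * xi).sum())
        .collect()
}

fn lora_forward(
    w: &[Vec<f32>], // frozen base weight, d_out × d_in
    a: &[Vec<f32>], // adapter A, r × d_in
    b: &[Vec<f32>], // adapter B, d_out × r
    alpha: f32,
    x: &[f32],
) -> Vec<f32> {
    let r = a.len() as f32;
    let base = matvec(w, x);
    let delta = matvec(b, &matvec(a, x));
    base.iter()
        .zip(&delta)
        .map(|(y, d)| y + (alpha / r) * d)
        .collect()
}

fn main() {
    let w = vec![vec![1.0, 0.0], vec![0.0, 1.0]]; // identity base weight
    let a = vec![vec![1.0, 1.0]];                 // rank 1
    let b = vec![vec![0.5], vec![0.5]];
    let y = lora_forward(&w, &a, &b, 1.0, &[2.0, 3.0]);
    println!("{y:?}"); // [4.5, 5.5]: base [2, 3] plus delta [2.5, 2.5]
}
```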

Commit:7ed1044
Author:OlivierDehaene

added padded blocks and logs everywhere

Commit:9ac7b7b
Author:OlivierDehaene

remove slots from grpc

Commit:73c3903
Author:OlivierDehaene
Committer:OlivierDehaene

FlashCausalLM implem

Commit:298bf31
Author:OlivierDehaene
Committer:OlivierDehaene

add terminated_generations

Commit:18e77a5
Author:OlivierDehaene
Committer:OlivierDehaene

wip

Commit:1cc8693
Author:OlivierDehaene
Committer:OlivierDehaene

wip

Commit:8b50f4b
Author:drbh
Committer:drbh

feat: prefer lorax implementation and port loading logic

Commit:db3d8e6
Author:drbh
Committer:drbh

feat: first draft load multiple lora

Commit:73eb2ae
Author:drbh
Committer:drbh

fix: refactor and move changes to v3 proto

Commit:81707bf
Author:drbh
Committer:drbh

fix: include rust code for adapter id

Commit:8aece3b
Author:OlivierDehaene
Committer:GitHub

feat: move allocation logic to rust (#1835) Close #2007

Commit:757223b
Author:OlivierDehaene
Committer:GitHub

feat: add SchedulerV3 (#1996)

- Refactor code to allow supporting multiple versions of the generate.proto at the same time
- Add v3/generate.proto (ISO to generate.proto for now, but allow for future changes without impacting v2 backends)
- Add Schedule trait to abstract queuing and batching mechanisms that will be different in the future
- Add SchedulerV2/V3 impl
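The point of the trait mentioned in the list above is to let the router speak two proto versions behind one abstraction. Below is a minimal sketch of that shape only; the trait, structs, and strings are made-up stand-ins, not the router's real API.

```rust
/// Illustrative sketch of hiding two backend protocol versions behind one
/// trait; every type here is a stand-in, not text-generation-inference code.
struct Request {
    inputs: String,
    max_new_tokens: u32,
}

trait Scheduler {
    /// Queue a request and (eventually) return the generated text.
    fn schedule(&mut self, request: Request) -> String;
}

struct SchedulerV2; // would wrap a v2/generate.proto client
struct SchedulerV3; // would wrap a v3/generate.proto client

impl Scheduler for SchedulerV2 {
    fn schedule(&mut self, request: Request) -> String {
        format!("v2 handled: {}", request.inputs)
    }
}

impl Scheduler for SchedulerV3 {
    fn schedule(&mut self, request: Request) -> String {
        format!("v3 handled: {} (up to {} new tokens)", request.inputs, request.max_new_tokens)
    }
}

fn main() {
    // The rest of the router only sees `dyn Scheduler`, so adding a new
    // proto version does not ripple through the HTTP layer.
    let mut backends: Vec<Box<dyn Scheduler>> = vec![Box::new(SchedulerV2), Box::new(SchedulerV3)];
    for backend in backends.iter_mut() {
        let request = Request { inputs: "hi".into(), max_new_tokens: 8 };
        println!("{}", backend.schedule(request));
    }
}
```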

Commit:df71aaf
Author:Daniël de Kok
Committer:Daniël de Kok

router: send the input as chunks to the backend

Before this change, the generation input was sent to the backend as a single string, encoding images as Base64 and packing them in Markdown-style links. This change adds a new chunked input representation that separates text chunks from image chunks. Image chunks contain binary data (for smaller message sizes) and the image's MIME type.

The stringly-typed inputs are still sent to support backends that do not support chunked inputs yet.
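The chunked representation described here is essentially a tagged union over text and image parts. The sketch below shows that shape with illustrative Rust types; these are not the actual generated proto messages.

```rust
/// Illustrative stand-in for the chunked input representation described
/// above: text stays as UTF-8, images travel as raw bytes plus a MIME type
/// instead of a Base64 blob inside a Markdown-style link.
enum InputChunk {
    Text(String),
    Image { data: Vec<u8>, mimetype: String },
}

fn describe(chunks: &[InputChunk]) {
    for chunk in chunks {
        match chunk {
            InputChunk::Text(t) => println!("text chunk: {t}"),
            InputChunk::Image { data, mimetype } => {
                println!("image chunk: {} bytes, {mimetype}", data.len())
            }
        }
    }
}

fn main() {
    let input = vec![
        InputChunk::Text("What is in this picture?".to_string()),
        InputChunk::Image { data: vec![0u8; 1024], mimetype: "image/png".to_string() },
    ];
    describe(&input);
}
```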

Commit:fc52ba6
Author:Daniël de Kok

router: send the input as chunks to the backend

Before this change, the generation input was sent to the backend as a single string, encoding images as Base64 and packing them in Markdown-style links. This change adds a new chunked input representation that separates text chunks from image chunks. Image chunks contain binary data (for smaller message sizes) and the image's MIME type.

The stringly-typed inputs are still sent to support backends that do not support chunked inputs yet.

Commit:1f7be73
Author:drbh

feat: remove uncompile grammar and improve logit processor logic

Commit:4f7074c
Author:drbh

feat: compile grammar and send over grpc

Commit:b40e833
Author:OlivierDehaene
Committer:GitHub

feat: starcoder2 (#1605)

Commit:cef0553
Author:drbh
Committer:GitHub

Outlines guided generation (#1539)

This WIP PR starts to add grammar support via outlines. Currently this PR supports very simple regex grammars and does not optimize for precompiling or caching grammar FSMs.

todo:
- [X] add simple outlines guidance to `NextTokenChooser`
- [X] update protos for grammar
- [X] update generation params API
- [X] constrain simple grammar
- [ ] support parsing more complex grammar into fsm
- [ ] support all outlines-supported grammar types
- [ ] explore optimizations to avoid recompiling grammars

guided request

```bash
curl -s 'http://localhost:3000/generate' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "inputs": "make an email for david: \n",
    "parameters": {
      "max_new_tokens": 6,
      "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+"
    }
  }' | jq
```

response

```json
{
  "generated_text": "david@example.com"
}
```

unguided request

```bash
curl -s 'http://localhost:3000/generate' \
  --header 'Content-Type: application/json' \
  --data '{
    "inputs": "make an email for david: \n",
    "parameters": {
      "max_new_tokens": 6
    }
  }' | jq
```

response

```json
{
  "generated_text": " email = 'david"
}
```

Commit:09b7c26
Author:OlivierDehaene
Committer:GitHub

feat(server): add frequency penalty (#1541)

Commit:d077150
Author:OlivierDehaene
Committer:GitHub

fix: fix gpt-q with groupsize = -1 (#1358)

Commit:50b495f
Author:OlivierDehaene
Committer:GitHub

feat: add more latency metrics in forward (#1346)

Commit:9ecfa16
Author:Nicolas Patry
Committer:GitHub

Speculative (#1308)

Commit:a478b27
Author:Nicolas Patry
Committer:Nicolas Patry

Modifying the protobuf.

Commit:3b56d76
Author:OlivierDehaene
Committer:GitHub

feat: add mistral model (#1071)

Commit:33958e0
Author:Nicolas Patry

Start.

Commit:211b54a
Author:Nicolas Patry
Committer:GitHub

Rebased #617 (#868)

Co-authored-by: Vincent Brouwers <vincent.brouwers@ing.com>

Commit:fe80f53
Author:OlivierDehaene
Committer:GitHub

feat(server): auto max_batch_total_tokens for flash att models (#630)

Commit:e74bd41
Author:OlivierDehaene
Committer:GitHub

feat(server): add paged attention to flash models (#516) Closes #478

Commit:895c5f1
Author:OlivierDehaene
Committer:GitHub

feat(server): only compute prefill logprobs when asked (#406) Close #288

Commit:218c9ad
Author:OlivierDehaene
Committer:GitHub

feat: decrease IPC proto size (#367) Closes #307 #308

Commit:db2b4e0
Author:Nicolas Patry
Committer:GitHub

feat(router): new healthcheck that skips the queue (#244) Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co>

Commit:ebc74d5
Author:OlivierDehaene
Committer:GitHub

feat(router): use number of tokens in batch as input for dynamic batching (#226) Co-authored-by: Nick Hill <nickhill@us.ibm.com>
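The change this title describes is batching by total token count rather than by number of requests. Below is a minimal sketch of that admission rule, assuming a simple FIFO queue; the struct, function, and budget value are illustrative, not the router's actual queue code.

```rust
/// Illustrative only: admit queued requests into the next batch until the
/// total token count (not the request count) would exceed the budget.
struct QueuedRequest {
    id: u64,
    num_tokens: u32,
}

fn next_batch(queue: &mut Vec<QueuedRequest>, token_budget: u32) -> Vec<QueuedRequest> {
    let mut batch = Vec::new();
    let mut used = 0;
    while let Some(req) = queue.first() {
        let tokens = req.num_tokens;
        if used + tokens > token_budget {
            break; // this request would not fit; leave it queued
        }
        used += tokens;
        batch.push(queue.remove(0));
    }
    batch
}

fn main() {
    let mut queue = vec![
        QueuedRequest { id: 1, num_tokens: 300 },
        QueuedRequest { id: 2, num_tokens: 500 },
        QueuedRequest { id: 3, num_tokens: 400 },
    ];
    let batch = next_batch(&mut queue, 1000);
    // ids 1 and 2 fit in the 1000-token budget; id 3 waits for the next batch.
    println!("batched: {:?}", batch.iter().map(|r| r.id).collect::<Vec<_>>());
}
```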

Commit:343437c
Author:OlivierDehaene
Committer:GitHub

feat(router): add device and dtype info (#215)

Commit:9987960
Author:OlivierDehaene
Committer:GitHub

feat(router): make router input validation optional (#164)

Commit:610bb1f
Author:OlivierDehaene
Committer:GitHub

feat(benchmark): tui based benchmarking tool (#149)

Commit:f000068
Author:OlivierDehaene
Committer:GitHub

feat(server): clear cache on error (#143)

Commit:b49dbf2
Author:OlivierDehaene
Committer:GitHub

fix(server): use server tokenizer as gt (#128)

Commit:1a2d682
Author:OlivierDehaene
Committer:GitHub

feat: support typical sampling (#114) closes #112

Commit:9b8ea6a
Author:OlivierDehaene
Committer:GitHub

feat(server): add logits watermark (#90)

Commit:0ac184c
Author:OlivierDehaene
Committer:GitHub

feat(server): add special token bool (#85)

Commit:20c3c59
Author:OlivierDehaene
Committer:GitHub

feat(router): refactor API and add openAPI schemas (#53)

Commit:313194f
Author:OlivierDehaene
Committer:GitHub

feat(server): support repetition penalty (#47)

Commit:017a2a8
Author:OlivierDehaene
Committer:GitHub

feat: Add token streaming using ServerSideEvents support (#41)

Commit:54fec93
Author:OlivierDehaene
Committer:GitHub

fix(server): fix seeding with multiple shards (#44)

Commit:4f9ac67
Author:OlivierDehaene
Committer:GitHub

Revert "feat: Add token streaming using ServerSideEvents support" (#40) Reverts huggingface/text-generation-inference#36

Commit:7fbfbb0
Author:OlivierDehaene
Committer:GitHub

feat: Add token streaming using ServerSideEvents support (#36)

Add token streaming using ServerSideEvents (SSE). The signature of the SSE events is:

```rust
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

struct StreamResponse {
    token: Token,
    generated_text: Option<String>,
    details: Option<Details>,
}

struct ErrorResponse {
    error: String,
}
```

Commit:cd298bc
Author:OlivierDehaene
Committer:GitHub

feat: Support sampling seeding (#37) Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>

Commit:32a2530
Author:OlivierDehaene
Committer:GitHub

feat: Return logprobs (#8)

Commit:718096f
Author:OlivierDehaene
Committer:GitHub

feat: Support stop sequences (#7)

Commit:427d7cc
Author:OlivierDehaene

feat(server): Support AutoModelForSeq2SeqLM

Commit:c5665f5
Author:OlivierDehaene

feat(server): Support generic AutoModelForCausalLM

Commit:f16f2f5
Author:Olivier Dehaene
Committer:OlivierDehaene

v0.1.0

Commit:4c693e6
Author:Olivier Dehaene

Refactored gRPC interface
Added validation logic

Commit:295831a
Author:Olivier Dehaene

Init