The Protocol Buffers files changed in these 74 commits:
Commit: da3f18e
Author: drbh
Committer: drbh
feat: include proto changes
Commit: 68b5409
Author: drbh
Committer: drbh
feat: include proto changes
Commit: 4479480
Author: drbh
feat: include proto changes
Commit: 5ced960
Author: Miquel Farre
Committer: drbh
adopting video url
Commit: 19e1c8d
Author: Miquel Farre
Committer: drbh
working version
Commit: bc5e202
Author: drbh
Committer: drbh
fix: adjust video process, reduce to 1 fps and adjust tensor shape
Commit: 83a7f18
Author: David Holtz
Committer: drbh
fix: add protobuf update and mp4parse dep
Commit: 0c9b6cd
Author: Nicolas Patry
Committer: GitHub
Choosing input/total tokens automatically based on available VRAM? (#2673)
* Choosing input/total tokens automatically based on available VRAM?
* Update doc.
* Remove generated files.
* Trying to fix non chunking targets.
* Attempt #2
* fix.
* QuantLinear is rocm compatible.
* Much simpler logic after the overhead.
* Updating logic + non flash.
* Revert doc text.
* Simple updates.
* Fix integration mt0 (transformers update).
Commit: a1aac78
Author: Nicolas Patry
Committer: Nicolas Patry
Choosing input/total tokens automatically based on available VRAM?
Commit: a6a0c97
Author: OlivierDehaene
Committer: GitHub
feat: prefill chunking (#2600)
* wip
* rollback
* refactor to use prefix/postfix naming + fix all_input_ids_tensor
* maybe patching vlms?
* fix filter and concat
* wip, no filter, no concat
* current
* add prepare_for_prefill
* working
* load tested
* re-create slots
* re-create slots
* fix slot_filtering_indices
* feedback loop
* remove log
* fix benchmarker
* fix vlm and seq2seq
* rename to cache and input lengths
* fix prefill logprobs
* fix launcher
* fix logprobs?
* idk at this point
* max input length
* omfg
* remove debugging lines
* fix tests
* fix mllama
* fix cargo tests
* remove support chunking for paged
* Fixing non blocked attentions
* Fixing dtype + AMD, Ipex targets.
* lint fix.
* rename
* Fix prefix_caching variable, remove defaults in server (confusing a lot of the times).
* Add simple resolution when user specifies ATTENTION=paged.
* Put back non default simple tests.
* Fix env name
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Commit: 2f0fde1
Author: Nicolas Patry
Committer: Nicolas Patry
TMP chunking.
Commit: e415b69
Author: Nicolas Patry
Committer: GitHub
Lots of improvements (Still 2 allocators) (#2449)
* Making prefix/flashinfer the default and testing the full release tests.
* Include flashinfer in the docker.
* Using prebuilt.
* Allowing window_left_size (dummy version).
* Disabling flashinfer/prefix caching on odd head_dim
* Disable prefix caching for lora.
* More specific codes.
* Update lock
* Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere.
* Update cargo lock ?
* Upgrade to 1.80 because of bitstream...
* Everywhere 1.80
* Forgot last default place.
* Apply suggestions from code review (Co-authored-by: drbh <david.richard.holtz@gmail.com>)
* Updated flake lock
* Tmp
* Upgrade resolution system for less errors in resolution.
* Remove lambda for cleaner function.
* Handling debugger.
* Override the env in server tests.
* Is this enough to make it work?
* This seems to be working.
* Downgrade some logs.
* Fixing the default for vlm.
* Don't enable prefix caching on VLM just yet.
* Change `add_special_tokens` in order to have the correct tokens for chat input and not (since it's super important with the prefixing now)
* Fixing prefix caching for flashdecoding.
* Update all models.
* Fixed flashinfer version.
* add_special_tokens is internal only
* Fixing seqlen with the new vlms.
* Fixing the issue with `add_special_tokens` not being passed around.
* Fixing the test.
* Removing encoder_decoder (seq2seq).
* Update the chat test.
* Fixing the batching tokenization in flash causal lm.
* Truncating left for radix purposes.
* Oops this doesn't belong here.
* Put back default pure shell.
* Update server tests - Default to throughput test in k6 - Use TGI_WIGGLE_ROOM to adjust wiggle room
* Only n_heads / process_group.size() are necessary.
* Revert the integration tests change (seem linked to head_size modification).
* Adding error message when assert is violated.
* Fixing the free algorithm to handle times where the common prefix is smaller.
* Apply suggestions from code review (Co-authored-by: OlivierDehaene <olivier@huggingface.co>)
* Update server/text_generation_server/layers/attention/common.py (Co-authored-by: OlivierDehaene <olivier@huggingface.co>)
* Fix disabling prefix caching - Fix windowing checks.
* Revert the Cohere tokenizer change (for now using a revision instead).
* Fmt.
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Commit: 2cf1f5c
Author: Nicolas Patry
Committer: Nicolas Patry
Fixing the issue with `add_special_tokens` not being passed around.
Commit: 8deeaca
Author: Daniël de Kok
Committer: GitHub
Add support for prefix caching to the v3 router (#2392)
This change adds support for prefix caching to the v3 router. This is broken up from the backend support to ease reviewing. For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`; in this case, the router will switch to `RadixAllocator`. This allocator uses a radix trie to keep track of prefills that were seen previously. If a new prefill is a prefix of a previously seen prefill, the router will send a request with `prefix_len>0`, which can be used by the backend to reuse KV blocks from the cache rather than recomputing them. Even though backend support is not added in this PR, the backend will still work with prefix caching enabled; the prefix lengths are simply ignored and not used.
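To illustrate the router-side idea in this commit, here is a minimal, hypothetical sketch of a prefix lookup over previously seen prefills. It is not the actual `RadixAllocator` (which tracks KV blocks, block sizes, and eviction); all type and function names are illustrative.

```rust
use std::collections::HashMap;

// Hypothetical sketch: a trie over token ids standing in for the radix trie
// the commit message describes. The real RadixAllocator manages KV blocks.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u32, TrieNode>,
}

#[derive(Default)]
struct PrefixIndex {
    root: TrieNode,
}

impl PrefixIndex {
    /// Record the token ids of a prefill that was scheduled earlier.
    fn insert(&mut self, tokens: &[u32]) {
        let mut node = &mut self.root;
        for &t in tokens {
            node = node.children.entry(t).or_default();
        }
    }

    /// Length of the longest previously seen prefix of `tokens`. The router
    /// would report this as `prefix_len` so the backend can reuse the
    /// matching KV blocks instead of recomputing them.
    fn prefix_len(&self, tokens: &[u32]) -> usize {
        let mut node = &self.root;
        let mut len = 0;
        for &t in tokens {
            match node.children.get(&t) {
                Some(next) => {
                    node = next;
                    len += 1;
                }
                None => break,
            }
        }
        len
    }
}

fn main() {
    let mut index = PrefixIndex::default();
    index.insert(&[1, 2, 3, 4, 5]);
    // A new prefill sharing the first three tokens yields prefix_len = 3.
    assert_eq!(index.prefix_len(&[1, 2, 3, 9, 9]), 3);
}
```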
Commit: 7735b38
Author: Daniël de Kok
Committer: Daniël de Kok
Prefix caching WIP
Commit: 10b9405
Author: Nathan Brake
Committer: erikkaum
update docs
Commit: ea915ad
Author: Nathan Brake
Committer: erikkaum
Add support for no_repeat_ngram_size
Commit: ae46fae
Author: Nathan Brake
update docs
Commit: 28e6a50
Author: Nathan Brake
Add support for no_repeat_ngram_size
Commit: 04e1af9
Author: drbh
Committer: GitHub
Enable multiple LoRa adapters (#2010)
* feat: first draft load multiple lora
* feat: load weights within layer and refactor lora pass
* fix: refactor and reduce lora math
* feat: baseline impl single request multi lora support
* feat: prefer lorax implementation and port loading logic
* fix: prefer adapter_data and refactors
* feat: prefer lorax's custom punica kernels and add mlp loras
* fix: adjust batch for bgmv
* fix: adjust adapter_segments logic when in batch
* fix: refactor and move changes to v3 proto
* fix: pass model_id for all flash causal lms
* fix: pass model_id for all causal and seq2seq lms
* fix: add model_id to model test
* feat: add lora support to mistral and refactors
* feat: prefer model id in request
* fix: include rust code for adapter id
* feat: bump launcher and add new lora docs
* feat: support base model generation and refactors
* fix: rename doc to retry ci build
* feat: support if vlm models
* fix: add adapter_data param and avoid missing layers
* fix: add adapter_data param to phi and neox
* fix: update all models forwards to include adapter_data
* fix: add model_id to IdeficsCausalLM
* Update lora.md (fixed a typo)
* Update lora.md (fixing spam image)
* fix: add lora kernel to dockerfile, support running without kernels and refactors
* fix: avoid dockerfile conflict
* fix: refactors and adjust flash llama lora logic
* fix: skip llama test due to CI issue (temp)
* fix: skip llama test CI (temp) 2
* fix: revert skips and prefer updated ci token for tests
* fix: refactors and helpful comments
* fix: add noop in TensorParallelAdapterRowLinear too
* fix: refactor and move shard_lora_weights logic
* fix: exit early if no adapter_data
Co-authored-by: Derek <datavistics@gmail.com>
Commit: 7ed1044
Author: OlivierDehaene
added padded blocks and logs everywhere
Commit: 9ac7b7b
Author: OlivierDehaene
remove slots from grpc
Commit: 73c3903
Author: OlivierDehaene
Committer: OlivierDehaene
FlashCausalLM implem
Commit: 298bf31
Author: OlivierDehaene
Committer: OlivierDehaene
add terminated_generations
Commit: 18e77a5
Author: OlivierDehaene
Committer: OlivierDehaene
wip
Commit: 1cc8693
Author: OlivierDehaene
Committer: OlivierDehaene
wip
Commit: 8b50f4b
Author: drbh
Committer: drbh
feat: prefer lorax implementation and port loading logic
Commit: db3d8e6
Author: drbh
Committer: drbh
feat: first draft load multiple lora
Commit: 73eb2ae
Author: drbh
Committer: drbh
fix: refactor and move changes to v3 proto
Commit: 81707bf
Author: drbh
Committer: drbh
fix: include rust code for adapter id
Commit: 8aece3b
Author: OlivierDehaene
Committer: GitHub
feat: move allocation logic to rust (#1835) Close #2007
Commit: 757223b
Author: OlivierDehaene
Committer: GitHub
feat: add SchedulerV3 (#1996)
- Refactor code to allow supporting multiple versions of the generate.proto at the same time
- Add v3/generate.proto (ISO to generate.proto for now but allow for future changes without impacting v2 backends)
- Add Schedule trait to abstract queuing and batching mechanisms that will be different in the future
- Add SchedulerV2/V3 impl
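As a rough sketch of what such an abstraction might look like, the queuing/batching boundary could be expressed as a trait implemented once per protocol version. The trait, struct, and field names below are hypothetical stand-ins, not the actual router code, and the real scheduler is asynchronous and accounts for prefill tokens as well.

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for the generated v2/v3 protobuf request type.
struct GenerateRequest {
    inputs: String,
    max_new_tokens: u32,
}

/// Hypothetical trait abstracting queuing and batching so the router can
/// serve generate.proto (v2) and v3/generate.proto behind one interface.
trait Schedule {
    fn append(&mut self, request: GenerateRequest);
    fn next_batch(&mut self, token_budget: u32) -> Vec<GenerateRequest>;
}

struct SchedulerV3 {
    queue: VecDeque<GenerateRequest>,
}

impl Schedule for SchedulerV3 {
    fn append(&mut self, request: GenerateRequest) {
        self.queue.push_back(request);
    }

    fn next_batch(&mut self, token_budget: u32) -> Vec<GenerateRequest> {
        // Greedily drain queued requests while the token budget allows.
        let mut batch = Vec::new();
        let mut budget = token_budget;
        while self
            .queue
            .front()
            .map_or(false, |r| r.max_new_tokens <= budget)
        {
            let request = self.queue.pop_front().unwrap();
            budget -= request.max_new_tokens;
            batch.push(request);
        }
        batch
    }
}
```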
Commit: df71aaf
Author: Daniël de Kok
Committer: Daniël de Kok
router: send the input as chunks to the backend
Before this change, the generation input was sent to the backend as a single string, encoding images as Base64 and packing them in Markdown-style links. This change adds a new chunked input representation that separates text chunks from image chunks. Image chunks contain binary data (for smaller message sizes) and the image's MIME type. The stringly-typed inputs are still sent to support backends that do not support chunked inputs yet.
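Conceptually, the chunked representation keeps text and images as separate, typed parts of the request. The hypothetical Rust rendering below only illustrates that shape; the actual change is to the protobuf messages, and these type names are illustrative.

```rust
// Hypothetical illustration of the chunked input: text stays as UTF-8
// strings, while images travel as raw bytes plus a MIME type instead of
// being Base64-encoded into a Markdown-style link.
enum InputChunk {
    Text(String),
    Image { data: Vec<u8>, mimetype: String },
}

fn describe(chunks: &[InputChunk]) {
    for chunk in chunks {
        match chunk {
            InputChunk::Text(text) => println!("text chunk: {} bytes", text.len()),
            InputChunk::Image { data, mimetype } => {
                println!("image chunk: {} bytes ({})", data.len(), mimetype)
            }
        }
    }
}

fn main() {
    let request = vec![
        InputChunk::Text("What is in this picture?".to_string()),
        InputChunk::Image {
            data: vec![0u8; 1024],
            mimetype: "image/png".to_string(),
        },
    ];
    describe(&request);
}
```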
Commit: fc52ba6
Author: Daniël de Kok
router: send the input as chunks to the backend
Before this change, the generation input was sent to the backend as a single string, encoding images as Base64 and packing them in Markdown-style links. This change adds a new chunked input representation that separates text chunks from image chunks. Image chunks contain binary data (for smaller message sizes) and the image's MIME type. The stringly-typed inputs are still sent to support backends that do not support chunked inputs yet.
Commit: 1f7be73
Author: drbh
feat: remove uncompiled grammar and improve logit processor logic
Commit: 4f7074c
Author: drbh
feat: compile grammar and send over grpc
Commit: b40e833
Author: OlivierDehaene
Committer: GitHub
feat: starcoder2 (#1605)
Commit: cef0553
Author: drbh
Committer: GitHub
Outlines guided generation (#1539)
This WIP PR starts to add grammar support via outlines. Currently this PR supports very simple regex grammars and does not optimize for precompiling or caching grammar FSMs.
todo:
- [X] add simple outlines guidance to `NextTokenChooser`
- [X] update protos for grammar
- [X] update generation params API
- [X] constrain simple grammar
- [ ] support parsing more complex grammar into fsm
- [ ] support all outlines-supported grammar types
- [ ] explore optimizations to avoid recompiling grammars

guided request
```bash
curl -s 'http://localhost:3000/generate' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "inputs": "make an email for david: \n",
        "parameters": {
            "max_new_tokens": 6,
            "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+"
        }
    }' | jq
```
response
```json
{
  "generated_text": "david@example.com"
}
```
unguided request
```bash
curl -s 'http://localhost:3000/generate' \
    --header 'Content-Type: application/json' \
    --data '{
        "inputs": "make an email for david: \n",
        "parameters": {
            "max_new_tokens": 6
        }
    }' | jq
```
response
```json
{
  "generated_text": " email = 'david"
}
```
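The underlying idea is that the regex grammar is compiled into a finite-state machine and, at each decoding step, tokens that would leave the FSM's language are masked out of the logits. The sketch below is only a hypothetical illustration of that masking, not the outlines implementation; all names and the state representation are assumptions.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical view of a compiled grammar FSM: for each state, the token ids
// that keep the output inside the grammar, plus the state reached afterwards.
struct GrammarFsm {
    allowed: HashMap<u32, HashSet<u32>>,   // state -> allowed token ids
    transitions: HashMap<(u32, u32), u32>, // (state, token) -> next state
}

impl GrammarFsm {
    /// Mask logits so that only grammar-legal tokens can be sampled.
    fn constrain(&self, state: u32, logits: &mut [f32]) {
        let allowed = self.allowed.get(&state);
        for (token_id, logit) in logits.iter_mut().enumerate() {
            let legal = allowed.map_or(false, |set| set.contains(&(token_id as u32)));
            if !legal {
                *logit = f32::NEG_INFINITY;
            }
        }
    }

    /// Advance the FSM after a token has been sampled.
    fn next_state(&self, state: u32, token: u32) -> Option<u32> {
        self.transitions.get(&(state, token)).copied()
    }
}

fn main() {
    let mut allowed = HashMap::new();
    allowed.insert(0, HashSet::from([3, 7]));
    let mut transitions = HashMap::new();
    transitions.insert((0, 3), 1);
    transitions.insert((0, 7), 1);
    let fsm = GrammarFsm { allowed, transitions };

    let mut logits = vec![0.5_f32; 10];
    fsm.constrain(0, &mut logits);
    // Only token ids 3 and 7 keep a finite logit in state 0.
    assert!(logits[3].is_finite() && logits[7].is_finite());
    assert!(logits[0] == f32::NEG_INFINITY);
    let _ = fsm.next_state(0, 3);
}
```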
Commit: 09b7c26
Author: OlivierDehaene
Committer: GitHub
feat(server): add frequency penalty (#1541)
Commit: d077150
Author: OlivierDehaene
Committer: GitHub
fix: fix gpt-q with groupsize = -1 (#1358)
Commit: 50b495f
Author: OlivierDehaene
Committer: GitHub
feat: add more latency metrics in forward (#1346)
Commit: 9ecfa16
Author: Nicolas Patry
Committer: GitHub
Speculative (#1308)
Commit: a478b27
Author: Nicolas Patry
Committer: Nicolas Patry
Modifying the protobuf.
Commit: 3b56d76
Author: OlivierDehaene
Committer: GitHub
feat: add mistral model (#1071)
Commit: 33958e0
Author: Nicolas Patry
Start.
Commit: 211b54a
Author: Nicolas Patry
Committer: GitHub
Rebased #617 (#868)
Co-authored-by: Vincent Brouwers <vincent.brouwers@ing.com>
Commit: fe80f53
Author: OlivierDehaene
Committer: GitHub
feat(server): auto max_batch_total_tokens for flash att models (#630)
Commit: e74bd41
Author: OlivierDehaene
Committer: GitHub
feat(server): add paged attention to flash models (#516) Closes #478
Commit: 895c5f1
Author: OlivierDehaene
Committer: GitHub
feat(server): only compute prefill logprobs when asked (#406) Close #288
Commit: 218c9ad
Author: OlivierDehaene
Committer: GitHub
feat: decrease IPC proto size (#367) Closes #307 #308
Commit: db2b4e0
Author: Nicolas Patry
Committer: GitHub
feat(router): new healthcheck that skips the queue (#244)
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Commit: ebc74d5
Author: OlivierDehaene
Committer: GitHub
feat(router): use number of tokens in batch as input for dynamic batching (#226)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Commit: 343437c
Author: OlivierDehaene
Committer: GitHub
feat(router): add device and dtype info (#215)
Commit: 9987960
Author: OlivierDehaene
Committer: GitHub
feat(router): make router input validation optional (#164)
Commit: 610bb1f
Author: OlivierDehaene
Committer: GitHub
feat(benchmark): tui based benchmarking tool (#149)
Commit: f000068
Author: OlivierDehaene
Committer: GitHub
feat(server): clear cache on error (#143)
Commit: b49dbf2
Author: OlivierDehaene
Committer: GitHub
fix(server): use server tokenizer as gt (#128)
Commit: 1a2d682
Author: OlivierDehaene
Committer: GitHub
feat: support typical sampling (#114) closes #112
Commit: 9b8ea6a
Author: OlivierDehaene
Committer: GitHub
feat(server): add logits watermark (#90)
Commit: 0ac184c
Author: OlivierDehaene
Committer: GitHub
feat(server): add special token bool (#85)
Commit: 20c3c59
Author: OlivierDehaene
Committer: GitHub
feat(router): refactor API and add openAPI schemas (#53)
Commit: 313194f
Author: OlivierDehaene
Committer: GitHub
feat(server): support repetition penalty (#47)
Commit: 017a2a8
Author: OlivierDehaene
Committer: GitHub
feat: Add token streaming using ServerSideEvents support (#41)
Commit: 54fec93
Author: OlivierDehaene
Committer: GitHub
fix(server): fix seeding with multiple shards (#44)
Commit: 4f9ac67
Author: OlivierDehaene
Committer: GitHub
Revert "feat: Add token streaming using ServerSideEvents support" (#40) Reverts huggingface/text-generation-inference#36
Commit: 7fbfbb0
Author: OlivierDehaene
Committer: GitHub
feat: Add token streaming using ServerSideEvents support (#36)
Add token streaming using ServerSideEvents (SSE). The signature of the SSE events is:
```rust
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

struct StreamResponse {
    token: Token,
    generated_text: Option<String>,
    details: Option<Details>,
}

struct ErrorResponse {
    error: String,
}
```
Commit: cd298bc
Author: OlivierDehaene
Committer: GitHub
feat: Support sampling seeding (#37)
Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>
Commit: 32a2530
Author: OlivierDehaene
Committer: GitHub
feat: Return logprobs (#8)
Commit: 718096f
Author: OlivierDehaene
Committer: GitHub
feat: Support stop sequences (#7)
Commit: 427d7cc
Author: OlivierDehaene
feat(server): Support AutoModelForSeq2SeqLM
Commit: c5665f5
Author: OlivierDehaene
feat(server): Support generic AutoModelForCausalLM
Commit: f16f2f5
Author: Olivier Dehaene
Committer: OlivierDehaene
v0.1.0
Commit: 4c693e6
Author: Olivier Dehaene
Refactored gRPC interface. Added validation logic.
Commit: 295831a
Author: Olivier Dehaene
Init