Proto commits in AI-Hypercomputer/JetStream

These are the 34 commits in which the Protocol Buffers (.proto) files changed:

Commit:508a947
Author:Lihao Ran
Committer:Lihao Ran

Allow users to specify whether or not to add bos token

The documentation is generated from this commit.

Commit:572ff6d
Author:Lihao Ran
Committer:Lihao Ran

Allow users to specify whether or not to add bos token

Commit:082c0ac
Author:Aman Gupta
Committer:GitHub

Supporting Multi-LoRA inferencing via JetStream server (#221)

Supporting Multi-LoRA inferencing via the JetStream server, following the [LLM Inference gateway API protocols](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol#inference-api-protocol).
- Implemented an adapter_tensorstore to load, store, manage and unload the adapter weights
- Added and exposed the [required metrics](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol#metrics-reporting) at the Prometheus endpoint
- Added a multi_lora_decoding service with corresponding APIs as per the [requirement](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol#inference-api-protocol)
- Implemented single-LoRA functionality support
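The single-LoRA support mentioned in this commit can be illustrated with a minimal sketch of how a low-rank adapter modifies a base weight. The tensor names `lora_a` and `lora_b` follow the naming used elsewhere in this log; the function itself is an assumption for illustration, not JetStream's actual code.

```python
import numpy as np

def apply_lora(base_w, lora_a, lora_b, x, scale=1.0):
    """Compute (W + scale * B @ A) @ x without materializing the merged weight.

    Shapes: base_w (out, in), lora_a (r, in), lora_b (out, r), x (in,).
    """
    return base_w @ x + scale * (lora_b @ (lora_a @ x))

# Toy example: a rank-1 adapter on a 3x3 weight matrix.
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 3))
a = rng.normal(size=(1, 3))   # lora_a
b = rng.normal(size=(3, 1))   # lora_b
x = rng.normal(size=3)

merged = (w + b @ a) @ x        # explicit weight merge
unmerged = apply_lora(w, a, b, x)  # low-rank path, same result
assert np.allclose(merged, unmerged)
```

Keeping `lora_a` and `lora_b` separate (as a later commit in this log describes) trades a little extra compute per step for the ability to swap adapters without rewriting the base weights.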

Commit:9d19631
Author:gpolovets1
Committer:GitHub

Added new HuggingFaceTokenizer to token_utils and updated TokenizerParameters to include tokenizer_type and access_token as additional metadata to store. (#229)
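The `tokenizer_type` / `access_token` metadata described above suggests a dispatch step when constructing the tokenizer. The sketch below is hypothetical: the dataclass shape, the `build_tokenizer` function, and the returned stand-in tuples are assumptions, not the real proto or token_utils API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenizerParameters:
    # Fields mirrored loosely from the commit message; the real proto
    # differs in shape and naming.
    path: str
    tokenizer_type: str = "sentencepiece"
    access_token: Optional[str] = None

def build_tokenizer(params: TokenizerParameters):
    """Dispatch on tokenizer_type; tuples stand in for real tokenizer objects."""
    if params.tokenizer_type == "huggingface":
        # A real implementation would construct a Hugging Face tokenizer here,
        # passing params.access_token for gated models.
        return ("hf", params.path, params.access_token)
    if params.tokenizer_type == "sentencepiece":
        return ("sp", params.path, None)
    raise ValueError(f"unknown tokenizer_type: {params.tokenizer_type}")

kind, path, token = build_tokenizer(
    TokenizerParameters(path="gemma", tokenizer_type="huggingface", access_token="hf_token")
)
assert kind == "hf"
```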

Commit:55b6604
Author:George
Committer:George

Added new HuggingFaceTokenizer to token_utils and updated TokenizerParameters to include tokenizer_type and access_token as additional metadata to store.

Commit:5f679a9
Author:Aman Gupta

- Created a separate adapter_tensorstore for each engine.
- Implemented unapplying LoRA from base_params.
- Fixed some comments from the PR.

Commit:26b1f37
Author:Aman Gupta

Merging main to amangu-lora.

Commit:eb74d86
Author:Aman Gupta

Refactoring part-2.

Commit:e4d875a
Author:Aman Gupta

Refactoring and cleaning of the JetStream server code.

Commit:1d6b456
Author:Yijia
Committer:GitHub

Revert accidental change - back to #216
This reverts commit 00dc5a61fcad66846cb45e449d5078c6424c970e, reversing changes made to 951b3ef8d329e419289af1df8cdada6983073605.
Co-authored-by: Yijia J <yijiaj@google.com>

Commit:80bfefc
Author:Lumosis
Committer:GitHub

Add multi-sampling functionality (#215)

Commit:0d464da
Author:Lihao Ran
Committer:Lihao Ran

test multi-sampling

Commit:3c6fcbd
Author:aman2930

1) Implemented a new Service API proto to align with the OpenAI completion API (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/docs/proposals/003-model-server-protocol/README.md#inference-api-protocol).
2) Added a flag to explicitly run the JetStream server with these APIs; otherwise only the older Decode() & HealthCheck() APIs of the JetStream server are exposed.
3) Fixed a bug in the adapter_tensorstore while converting between jnp_array and np_array.
4) Added a client which makes requests to the new APIs (v1/load_lora_adapter, v1/unload_lora_adapter, v1/models, v1/completions).
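The endpoint names above come from the commit message; the request shape can be sketched against the public OpenAI completions schema. Whether JetStream accepts exactly these fields is an assumption here, not something the commit states.

```python
import json

def completion_request(model: str, prompt: str, max_tokens: int = 64) -> str:
    """Build a JSON body in the shape of the OpenAI completions API.

    Field names follow the public OpenAI schema; the server-side handling
    is assumed, not taken from JetStream's code.
    """
    return json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})

body = completion_request("base-model-lora1", "Hello", max_tokens=16)
parsed = json.loads(body)
assert parsed["max_tokens"] == 16
# Such a body would be POSTed to the server's v1/completions endpoint;
# v1/models would list the base model plus any loaded adapters.
```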

Commit:fb88eca
Author:aman2930

1) Implemented the adapter_tensorstore module to store and manage adapters. Its functionality includes loading and unloading adapters between CPU RAM and HBM, and it follows an LRU policy to evict an adapter when a new load_adapter request comes in. Currently it stores each adapter as separate tensors (lora_a and lora_b); the lora_b x lora_a product is computed in prefill() and generate() during a decode request. The adapter_tensorstore can be configured with a max limit on HBM and RAM.
2) Added functionality to load from a catalog file at server start. If no file is given, only the base params are loaded. Loading from the catalog file is done in CPU RAM; afterwards, based on incoming requests, those params are moved to or evicted from HBM.
3) Some proto updates so that only a single path is given for each adapter; that path is expected to contain an adapter_config.json and Orbax-format weights in the 0/items folder.
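The LRU behavior described in point 1) can be sketched with an `OrderedDict` standing in for the bounded HBM tier. Everything here (class name aside, which echoes the commit) is a toy assumption: real weights would be JAX arrays and the limits would be byte budgets, not slot counts.

```python
from collections import OrderedDict

class AdapterTensorStore:
    """Toy sketch: adapters live in a CPU-RAM dict and are promoted into a
    bounded 'HBM' dict, evicting the least recently used entry when full."""

    def __init__(self, hbm_slots: int = 2):
        self.cpu = {}             # adapter_id -> (lora_a, lora_b), always resident
        self.hbm = OrderedDict()  # LRU order: oldest entry first
        self.hbm_slots = hbm_slots

    def register(self, adapter_id, lora_a, lora_b):
        self.cpu[adapter_id] = (lora_a, lora_b)

    def get(self, adapter_id):
        """Fetch from HBM, promoting from CPU RAM (and evicting) as needed."""
        if adapter_id in self.hbm:
            self.hbm.move_to_end(adapter_id)   # mark as most recently used
        else:
            if len(self.hbm) >= self.hbm_slots:
                self.hbm.popitem(last=False)   # evict the LRU adapter
            self.hbm[adapter_id] = self.cpu[adapter_id]
        return self.hbm[adapter_id]

store = AdapterTensorStore(hbm_slots=2)
for name in ("a", "b", "c"):
    store.register(name, lora_a=f"A_{name}", lora_b=f"B_{name}")
store.get("a"); store.get("b"); store.get("c")  # third fetch evicts "a"
assert list(store.hbm) == ["b", "c"]
```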

Commit:ef073f8
Author:jetstream authors
Committer:Vipan Nalla

fixing decode.
PiperOrigin-RevId: 720397779

Commit:cfb987b
Author:wyzhang
Committer:Vipan Nalla

Revert past 2 commits which accidentally deleted the code due to a copybara issue (#167)
* Revert "Reverts 6a3579056f307fed3428102df5823a7ff7cebdc6". This reverts commit b459cc1f297a8564e9c6f14346ad5ef41e2d68c6.
* Revert "fixing decode." This reverts commit 6a3579056f307fed3428102df5823a7ff7cebdc6.

Commit:91ab2e1
Author:jetstream authors
Committer:Vipan Nalla

internal change
PiperOrigin-RevId: 720730187

Commit:405a3d5
Author:Yijia
Committer:Vipan Nalla

Revert "internal change" (#169)
This reverts commit 4c7838ac69db15a17f540406787c6b4dbc692b03.

Commit:a49c0a4
Author:Yijia
Committer:GitHub

Revert "internal change" (#169)
This reverts commit 4c7838ac69db15a17f540406787c6b4dbc692b03.

Commit:4c7838a
Author:jetstream authors
Committer:jetstream authors

internal change
PiperOrigin-RevId: 720730187

Commit:e8439b7
Author:wyzhang
Committer:GitHub

Revert past 2 commits which accidentally deleted the code due to a copybara issue (#167)
* Revert "Reverts 6a3579056f307fed3428102df5823a7ff7cebdc6". This reverts commit b459cc1f297a8564e9c6f14346ad5ef41e2d68c6.
* Revert "fixing decode." This reverts commit 6a3579056f307fed3428102df5823a7ff7cebdc6.

Commit:6a35790
Author:jetstream authors
Committer:jetstream authors

fixing decode.
PiperOrigin-RevId: 720397779

Commit:7426ea7
Author:aman2930

1) Added the MultiAdapterManager service proto along with the methods ListAdapters, LoadAdapter and UnloadAdapter.
2) The Driver, which holds the base parameters, now also stores the LoRA-updated parameters for each loaded adapter. Implemented methods for loading, unloading and listing LoRA adapters on the Driver object. The original base model params remain intact, saved in the params dictionary under a dedicated key.
3) Created a proxy-client to make MultiAdapterManager service requests to the JetStream server.
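The driver-side bookkeeping in point 2) can be sketched as a dict that reserves one key for the base params and adds or removes adapter entries around it. The class shape, the key name, and the method names mirroring ListAdapters/LoadAdapter/UnloadAdapter are assumptions for illustration, not JetStream's actual code.

```python
class Driver:
    """Sketch: base params stay untouched under a reserved key while
    per-adapter params are loaded and unloaded around them."""

    BASE_KEY = "base"  # hypothetical reserved key for the base model params

    def __init__(self, base_params):
        self.params = {self.BASE_KEY: base_params}

    def load_adapter(self, adapter_id, adapter_params):
        if adapter_id == self.BASE_KEY:
            raise ValueError("adapter id collides with the base params key")
        self.params[adapter_id] = adapter_params

    def unload_adapter(self, adapter_id):
        # Removing an adapter never touches the base params entry.
        self.params.pop(adapter_id, None)

    def list_adapters(self):
        return [k for k in self.params if k != self.BASE_KEY]

drv = Driver(base_params={"w": [1.0]})
drv.load_adapter("lora1", {"lora_a": "A", "lora_b": "B"})
assert drv.list_adapters() == ["lora1"]
drv.unload_adapter("lora1")
assert drv.list_adapters() == []
assert drv.params[Driver.BASE_KEY] == {"w": [1.0]}  # base params intact
```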

Commit:973647d
Author:Yijia
Committer:GitHub

Revert "Internal refactor" (#156)
This reverts commit 8e18e7fd1db4ee271fa677eec86b0b90a3822c95.
Co-authored-by: Yijia J <yijiaj@google.com>

Commit:8e18e7f
Author:jetstream authors
Committer:Yijia J

Internal refactor
PiperOrigin-RevId: 706772024

Commit:d681995
Author:Brendan Slabe
Committer:GitHub

Various request time metrics (#121)
* first commit
* nit
* fmt
* description tweak
* added more metrics
* nit
* nit
* default metadata values
* move `new_request.metadata.transfer_start_time = time.perf_counter()`
* avoid NoneType
* NoneType
* set transfer_end_time and fmt
* camel case -> snake case
* description update
* change descriptions
* fmt
* logs
* better logs
* changed timings
* observing queue duration metric
* buckets in sorted order
* buckets not in sorted order
* corrected times
* number of output tokens
* move prefill_start_time, enable debug, maybe correct len for num tokens in detokenize
* fmt
* correct lengths of output tokens based on debug
* debug transfer queue time
* remove log
* removed logs, almost final
* nits
* readd log
* change logs
* remove log
* condense
* improve test coverage
* revert _abort_or_raise deletion
* start_time mandatory
* undo
* nit
* updated buckets
* added 'jetstream_time_per_request'
* nit
* add 'jetstream_wait_time_per_request'
* nit
* missing .metadata
* lint
* change order of params
* changed metric description
* Add metadata field to proto
* update proto
* tweak generated file
* tweak generated file
* update proto
* pylint
* generate protos
* change start time assignment
* .value
* CopyFrom
* change definition of queue duration metric
* Increase test coverage
* fixed assertions
* fmt
* incorrect prefill time
* Add license statements
* Protobuf Python Version
* fmt
* pylint
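The bullets above mention stamping `time.perf_counter()` values into per-request metadata and observing a queue-duration metric. A minimal sketch of that pattern, with field names echoing the bullets but an otherwise assumed structure (a real setup would pass a prometheus_client Histogram's observe method as the callback):

```python
import time
from dataclasses import dataclass

@dataclass
class RequestMetadata:
    # Timestamp fields named after the commit's bullets; 0.0 marks "unset".
    start_time: float = 0.0
    transfer_start_time: float = 0.0
    transfer_end_time: float = 0.0
    prefill_start_time: float = 0.0

def observe_transfer_queue(metadata: RequestMetadata, histogram_observe):
    """Compute the transfer-queue duration and feed it to a histogram's
    observe() callback."""
    duration = metadata.transfer_end_time - metadata.transfer_start_time
    histogram_observe(duration)
    return duration

md = RequestMetadata(start_time=time.perf_counter())
md.transfer_start_time = time.perf_counter()
time.sleep(0.01)  # stand-in for time spent queued / in transfer
md.transfer_end_time = time.perf_counter()
assert observe_transfer_queue(md, histogram_observe=lambda d: None) > 0
```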

Commit:3946afa
Author:Brendan Slabe
Committer:GitHub

Makefile (#125)
* first commit
* changed unit_tests.yaml
* generate-protos
* better generate-protos logic
* append -> prepend
* more make targets

Commit:46c152f
Author:Zijun Zhou
Committer:GitHub

Cleanup orchestrator proto (#112)
* Cleanup orchestrator proto
* Update JetStream based on proto cleanup

Commit:0c56aac
Author:vivianrwu
Committer:GitHub

Add healthcheck support for JetStream (#90)
* Add healthcheck support for JetStream
* fix indentation
* fix pylint unit test
* use pyink to reformat generated protos

Commit:01c5a03
Author:Zijun Zhou
Committer:GitHub

Update JetStream grpc proto to support I/O with text and token ids (#78)
* Update JetStream grpc proto to support I/O with text and token ids
* Update orchestrator and token utils to support text and token I/O
* Add and update unit tests
* Fix prometheus duplicate metrics issue
* add shortuuid dep
* Update docstring
* Add client tokenization mode
* Update client side I/O handling
* latest pylint fix
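Accepting either text or token ids implies a normalization step on the request path. The helper below is a hypothetical sketch of that idea with a toy per-character tokenizer; the function name and signature are assumptions, not JetStream's API.

```python
from typing import List, Union

def to_token_ids(content: Union[str, List[int]], encode) -> List[int]:
    """Normalize request content: pass token ids through, encode text.

    `encode` stands in for a real tokenizer's encode() method.
    """
    if isinstance(content, str):
        return encode(content)
    return list(content)

toy_encode = lambda s: [ord(c) for c in s]  # toy per-character "tokenizer"
assert to_token_ids("ab", toy_encode) == [97, 98]   # text input is encoded
assert to_token_ids([7, 8, 9], toy_encode) == [7, 8, 9]  # ids pass through
```

The symmetric choice on output (text vs. token ids) is what the "client tokenization mode" bullet refers to: the client can opt to receive raw ids and detokenize locally.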

Commit:0dbb2a5
Author:Zijun Zhou
Committer:Junwei Yang

Align Tokenizer in JetStream (#40)
* Align Tokenizer in JetStream
* Update requirements with pytest dep
* Remove mix_decode unit test

Commit:a0df320
Author:Zijun Zhou
Committer:GitHub

Align Tokenizer in JetStream (#40)
* Align Tokenizer in JetStream
* Update requirements with pytest dep
* Remove mix_decode unit test

Commit:90b2a9d
Author:Zijun Zhou
Committer:GitHub

Support JetStream MaxText user guide (#28)

Commit:6f55565
Author:Zijun Zhou
Committer:Zijun Zhou

JetStream init version
Co-authored-by: Sholto Douglas <sholto@google.com>
Co-authored-by: Zijun Zhou <zijunzhou@google.com>