Proto commits in AI-Hypercomputer/JetStream

These are the 34 commits in which the Protocol Buffers (.proto) files changed:

Commit:508a947
Author:Lihao Ran
Committer:Lihao Ran

Allow users to specify whether or not to add bos token

The documentation is generated from this commit.

Commit:572ff6d
Author:Lihao Ran
Committer:Lihao Ran

Allow users to specify whether or not to add bos token

Commit:082c0ac
Author:Aman Gupta
Committer:GitHub

Supporting Multi-LoRA inferencing via JetStream server (#221)

Supporting Multi-LoRA inferencing via the JetStream server, following the [LLM Inference gateway API protocols](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol#inference-api-protocol).
- Implemented an adapter_tensorstore to load, store, manage and unload the adapter weights
- Added and exposed the [required metrics](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol#metrics-reporting) at the Prometheus endpoint
- Added a multi_lora_decoding service with corresponding APIs as per the [requirement](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol#inference-api-protocol)
- Implemented single-LoRA functionality support
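The single-LoRA support mentioned in this commit can be illustrated with a minimal sketch of how a low-rank adapter modifies a base weight. The tensor names `lora_a` and `lora_b` follow the naming used elsewhere in this log; the function itself is an assumption for illustration, not JetStream's actual code.

```python
import numpy as np

def apply_lora(base_w, lora_a, lora_b, x, scale=1.0):
    """Compute (W + scale * B @ A) @ x without materializing the merged weight.

    Shapes: base_w (out, in), lora_a (r, in), lora_b (out, r), x (in,).
    """
    return base_w @ x + scale * (lora_b @ (lora_a @ x))

# Toy example: a rank-1 adapter on a 3x3 weight matrix.
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 3))
a = rng.normal(size=(1, 3))   # lora_a
b = rng.normal(size=(3, 1))   # lora_b
x = rng.normal(size=3)

merged = (w + b @ a) @ x        # explicit weight merge
unmerged = apply_lora(w, a, b, x)  # low-rank path, same result
assert np.allclose(merged, unmerged)
```

Keeping `lora_a` and `lora_b` separate (as a later commit in this log describes) trades a little extra compute per step for the ability to swap adapters without rewriting the base weights.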

Commit:9d19631
Author:gpolovets1
Committer:GitHub

Added new HuggingFaceTokenizer to token_utils and updated TokenizerParameters to include tokenizer_type and access_token as additional metadata to store. (#229)
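The `tokenizer_type` / `access_token` metadata described above suggests a dispatch step when constructing the tokenizer. The sketch below is hypothetical: the dataclass shape, the `build_tokenizer` function, and the returned stand-in tuples are assumptions, not the real proto or token_utils API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenizerParameters:
    # Fields mirrored loosely from the commit message; the real proto
    # differs in shape and naming.
    path: str
    tokenizer_type: str = "sentencepiece"
    access_token: Optional[str] = None

def build_tokenizer(params: TokenizerParameters):
    """Dispatch on tokenizer_type; tuples stand in for real tokenizer objects."""
    if params.tokenizer_type == "huggingface":
        # A real implementation would construct a Hugging Face tokenizer here,
        # passing params.access_token for gated models.
        return ("hf", params.path, params.access_token)
    if params.tokenizer_type == "sentencepiece":
        return ("sp", params.path, None)
    raise ValueError(f"unknown tokenizer_type: {params.tokenizer_type}")

kind, path, token = build_tokenizer(
    TokenizerParameters(path="gemma", tokenizer_type="huggingface", access_token="hf_token")
)
assert kind == "hf"
```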

Commit:55b6604
Author:George
Committer:George

Added new HuggingFaceTokenizer to token_utils and updated TokenizerParameters to include tokenizer_type and access_token as additional metadata to store.

Commit:5f679a9
Author:Aman Gupta

- Created a separate adapter_tensorstore for each engine.
- Implemented unapplying LoRA from base_params.
- Fixed some comments from the PR.

Commit:26b1f37
Author:Aman Gupta

Merging main to amangu-lora.

Commit:eb74d86
Author:Aman Gupta

Refactoring part-2.

Commit:e4d875a
Author:Aman Gupta

Refactoring and cleaning of the JetStream server code.

Commit:1d6b456
Author:Yijia
Committer:GitHub

Revert accidental change - back to #216
This reverts commit 00dc5a61fcad66846cb45e449d5078c6424c970e, reversing changes made to 951b3ef8d329e419289af1df8cdada6983073605.
Co-authored-by: Yijia J <yijiaj@google.com>

Commit:80bfefc
Author:Lumosis
Committer:GitHub

Add multi-sampling functionality (#215)

Commit:0d464da
Author:Lihao Ran
Committer:Lihao Ran

test multi-sampling

Commit:3c6fcbd
Author:aman2930

1) Implemented a new Service API proto to align with the OpenAI completion API (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/docs/proposals/003-model-server-protocol/README.md#inference-api-protocol).
2) Added a flag to explicitly run the JetStream server with these APIs; otherwise only the older Decode() & HealthCheck() APIs of the JetStream server are exposed.
3) Fixed a bug in the adapter_tensorstore while converting between jnp_array and np_array.
4) Added a client which makes requests to the new APIs (v1/load_lora_adapter, v1/unload_lora_adapter, v1/models, v1/completions).
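The endpoint names above come from the commit message; the request shape can be sketched against the public OpenAI completions schema. Whether JetStream accepts exactly these fields is an assumption here, not something the commit states.

```python
import json

def completion_request(model: str, prompt: str, max_tokens: int = 64) -> str:
    """Build a JSON body in the shape of the OpenAI completions API.

    Field names follow the public OpenAI schema; the server-side handling
    is assumed, not taken from JetStream's code.
    """
    return json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})

body = completion_request("base-model-lora1", "Hello", max_tokens=16)
parsed = json.loads(body)
assert parsed["max_tokens"] == 16
# Such a body would be POSTed to the server's v1/completions endpoint;
# v1/models would list the base model plus any loaded adapters.
```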

Commit:fb88eca
Author:aman2930

1) Implemented the adapter_tensorstore module to store and manage adapters. Its functionality includes loading and unloading adapters between CPU RAM and HBM, and it follows an LRU policy to evict an adapter when a new load_adapter request comes in. Currently it stores each adapter as separate tensors (lora_a and lora_b); the lora_b x lora_a product is computed in prefill() and generate() during a decode request. The adapter_tensorstore can be configured with a max limit on HBM and RAM.
2) Added functionality to load from a catalog file at server start. If no file is given, only the base params are loaded. Loading from the catalog file is done in CPU RAM; afterwards, based on incoming requests, those params are moved to or evicted from HBM.
3) Some proto updates so that only a single path is given for each adapter; that path is expected to contain an adapter_config.json and Orbax-format weights in the 0/items folder.
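The LRU behavior described in point 1) can be sketched with an `OrderedDict` standing in for the bounded HBM tier. Everything here (class name aside, which echoes the commit) is a toy assumption: real weights would be JAX arrays and the limits would be byte budgets, not slot counts.

```python
from collections import OrderedDict

class AdapterTensorStore:
    """Toy sketch: adapters live in a CPU-RAM dict and are promoted into a
    bounded 'HBM' dict, evicting the least recently used entry when full."""

    def __init__(self, hbm_slots: int = 2):
        self.cpu = {}             # adapter_id -> (lora_a, lora_b), always resident
        self.hbm = OrderedDict()  # LRU order: oldest entry first
        self.hbm_slots = hbm_slots

    def register(self, adapter_id, lora_a, lora_b):
        self.cpu[adapter_id] = (lora_a, lora_b)

    def get(self, adapter_id):
        """Fetch from HBM, promoting from CPU RAM (and evicting) as needed."""
        if adapter_id in self.hbm:
            self.hbm.move_to_end(adapter_id)   # mark as most recently used
        else:
            if len(self.hbm) >= self.hbm_slots:
                self.hbm.popitem(last=False)   # evict the LRU adapter
            self.hbm[adapter_id] = self.cpu[adapter_id]
        return self.hbm[adapter_id]

store = AdapterTensorStore(hbm_slots=2)
for name in ("a", "b", "c"):
    store.register(name, lora_a=f"A_{name}", lora_b=f"B_{name}")
store.get("a"); store.get("b"); store.get("c")  # third fetch evicts "a"
assert list(store.hbm) == ["b", "c"]
```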

Commit:ef073f8
Author:jetstream authors
Committer:Vipan Nalla

fixing decode.
PiperOrigin-RevId: 720397779

Commit:cfb987b
Author:wyzhang
Committer:Vipan Nalla

Revert past 2 commits which accidentally deleted the code due to a copybara issue (#167)
* Revert "Reverts 6a3579056f307fed3428102df5823a7ff7cebdc6". This reverts commit b459cc1f297a8564e9c6f14346ad5ef41e2d68c6.
* Revert "fixing decode." This reverts commit 6a3579056f307fed3428102df5823a7ff7cebdc6.

Commit:91ab2e1
Author:jetstream authors
Committer:Vipan Nalla

internal change
PiperOrigin-RevId: 720730187

Commit:405a3d5
Author:Yijia
Committer:Vipan Nalla

Revert "internal change" (#169)
This reverts commit 4c7838ac69db15a17f540406787c6b4dbc692b03.

Commit:a49c0a4
Author:Yijia
Committer:GitHub

Revert "internal change" (#169)
This reverts commit 4c7838ac69db15a17f540406787c6b4dbc692b03.

Commit:4c7838a
Author:jetstream authors
Committer:jetstream authors

internal change
PiperOrigin-RevId: 720730187

Commit:e8439b7
Author:wyzhang
Committer:GitHub

Revert past 2 commits which accidentally deleted the code due to a copybara issue (#167)
* Revert "Reverts 6a3579056f307fed3428102df5823a7ff7cebdc6". This reverts commit b459cc1f297a8564e9c6f14346ad5ef41e2d68c6.
* Revert "fixing decode." This reverts commit 6a3579056f307fed3428102df5823a7ff7cebdc6.

Commit:6a35790
Author:jetstream authors
Committer:jetstream authors

fixing decode.
PiperOrigin-RevId: 720397779

Commit:7426ea7
Author:aman2930

1) Added the MultiAdapterManager service proto along with the methods ListAdapters, LoadAdapter and UnloadAdapter.
2) The Driver, which holds the base parameters, now also stores the LoRA-updated parameters for each loaded adapter. Implemented methods for loading, unloading and listing LoRA adapters on the Driver object. The original base model params remain intact, saved in the params dictionary under a dedicated key.
3) Created a proxy-client to make MultiAdapterManager service requests to the JetStream server.
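The driver-side bookkeeping in point 2) can be sketched as a dict that reserves one key for the base params and adds or removes adapter entries around it. The class shape, the key name, and the method names mirroring ListAdapters/LoadAdapter/UnloadAdapter are assumptions for illustration, not JetStream's actual code.

```python
class Driver:
    """Sketch: base params stay untouched under a reserved key while
    per-adapter params are loaded and unloaded around them."""

    BASE_KEY = "base"  # hypothetical reserved key for the base model params

    def __init__(self, base_params):
        self.params = {self.BASE_KEY: base_params}

    def load_adapter(self, adapter_id, adapter_params):
        if adapter_id == self.BASE_KEY:
            raise ValueError("adapter id collides with the base params key")
        self.params[adapter_id] = adapter_params

    def unload_adapter(self, adapter_id):
        # Removing an adapter never touches the base params entry.
        self.params.pop(adapter_id, None)

    def list_adapters(self):
        return [k for k in self.params if k != self.BASE_KEY]

drv = Driver(base_params={"w": [1.0]})
drv.load_adapter("lora1", {"lora_a": "A", "lora_b": "B"})
assert drv.list_adapters() == ["lora1"]
drv.unload_adapter("lora1")
assert drv.list_adapters() == []
assert drv.params[Driver.BASE_KEY] == {"w": [1.0]}  # base params intact
```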

Commit:973647d
Author:Yijia
Committer:GitHub

Revert "Internal refactor" (#156)
This reverts commit 8e18e7fd1db4ee271fa677eec86b0b90a3822c95.
Co-authored-by: Yijia J <yijiaj@google.com>

Commit:8e18e7f
Author:jetstream authors
Committer:Yijia J

Internal refactor
PiperOrigin-RevId: 706772024

Commit:d681995
Author:Brendan Slabe
Committer:GitHub

Various request time metrics (#121)
* first commit
* nit
* fmt
* description tweak
* added more metrics
* nit
* nit
* default metadata values
* move `new_request.metadata.transfer_start_time = time.perf_counter()`
* avoid NoneType
* NoneType
* set transfer_end_time and fmt
* camel case -> snake case
* description update
* change descriptions
* fmt
* logs
* better logs
* changed timings
* observing queue duration metric
* buckets in sorted order
* buckets not in sorted order
* corrected times
* number of output tokens
* move prefill_start_time, enable debug, maybe correct len for num tokens in detokenize
* fmt
* correct lengths of output tokens based on debug
* debug transfer queue time
* remove log
* removed logs, almost final
* nits
* readd log
* change logs
* remove log
* condense
* improve test coverage
* revert _abort_or_raise deletion
* start_time mandatory
* undo
* nit
* updated buckets
* added 'jetstream_time_per_request'
* nit
* add 'jetstream_wait_time_per_request'
* nit
* missing .metadata
* lint
* change order of params
* changed metric description
* Add metadata field to proto
* update proto
* tweak generated file
* tweak generated file
* update proto
* pylint
* generate protos
* change start time assignment
* .value
* CopyFrom
* change definition of queue duration metric
* Increase test coverage
* fixed assertions
* fmt
* incorrect prefill time
* Add license statements
* Protobuf Python Version
* fmt
* pylint
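The bullets above mention stamping `time.perf_counter()` values into per-request metadata and observing a queue-duration metric. A minimal sketch of that pattern, with field names echoing the bullets but an otherwise assumed structure (a real setup would pass a prometheus_client Histogram's observe method as the callback):

```python
import time
from dataclasses import dataclass

@dataclass
class RequestMetadata:
    # Timestamp fields named after the commit's bullets; 0.0 marks "unset".
    start_time: float = 0.0
    transfer_start_time: float = 0.0
    transfer_end_time: float = 0.0
    prefill_start_time: float = 0.0

def observe_transfer_queue(metadata: RequestMetadata, histogram_observe):
    """Compute the transfer-queue duration and feed it to a histogram's
    observe() callback."""
    duration = metadata.transfer_end_time - metadata.transfer_start_time
    histogram_observe(duration)
    return duration

md = RequestMetadata(start_time=time.perf_counter())
md.transfer_start_time = time.perf_counter()
time.sleep(0.01)  # stand-in for time spent queued / in transfer
md.transfer_end_time = time.perf_counter()
assert observe_transfer_queue(md, histogram_observe=lambda d: None) > 0
```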

Commit:3946afa
Author:Brendan Slabe
Committer:GitHub

Makefile (#125)
* first commit
* changed unit_tests.yaml
* generate-protos
* better generate-protos logic
* append -> prepend
* more make targets

Commit:46c152f
Author:Zijun Zhou
Committer:GitHub

Cleanup orchestrator proto (#112)
* Cleanup orchestrator proto
* Update JetStream based on proto cleanup

Commit:0c56aac
Author:vivianrwu
Committer:GitHub

Add healthcheck support for JetStream (#90)
* Add healthcheck support for JetStream
* fix indentation
* fix pylint unit test
* use pyink to reformat generated protos

Commit:01c5a03
Author:Zijun Zhou
Committer:GitHub

Update JetStream grpc proto to support I/O with text and token ids (#78)
* Update JetStream grpc proto to support I/O with text and token ids
* Update orchestrator and token utils to support text and token I/O
* Add and update unit tests
* Fix prometheus duplicate metrics issue
* add shortuuid dep
* Update docstring
* Add client tokenization mode
* Update client side I/O handling
* latest pylint fix
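Accepting either text or token ids implies a normalization step on the request path. The helper below is a hypothetical sketch of that idea with a toy per-character tokenizer; the function name and signature are assumptions, not JetStream's API.

```python
from typing import List, Union

def to_token_ids(content: Union[str, List[int]], encode) -> List[int]:
    """Normalize request content: pass token ids through, encode text.

    `encode` stands in for a real tokenizer's encode() method.
    """
    if isinstance(content, str):
        return encode(content)
    return list(content)

toy_encode = lambda s: [ord(c) for c in s]  # toy per-character "tokenizer"
assert to_token_ids("ab", toy_encode) == [97, 98]   # text input is encoded
assert to_token_ids([7, 8, 9], toy_encode) == [7, 8, 9]  # ids pass through
```

The symmetric choice on output (text vs. token ids) is what the "client tokenization mode" bullet refers to: the client can opt to receive raw ids and detokenize locally.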

Commit:0dbb2a5
Author:Zijun Zhou
Committer:Junwei Yang

Align Tokenizer in JetStream (#40)
* Align Tokenizer in JetStream
* Update requirements with pytest dep
* Remove mix_decode unit test

Commit:a0df320
Author:Zijun Zhou
Committer:GitHub

Align Tokenizer in JetStream (#40)
* Align Tokenizer in JetStream
* Update requirements with pytest dep
* Remove mix_decode unit test

Commit:90b2a9d
Author:Zijun Zhou
Committer:GitHub

Support JetStream MaxText user guide (#28)

Commit:6f55565
Author:Zijun Zhou
Committer:Zijun Zhou

JetStream init version
Co-authored-by: Sholto Douglas <sholto@google.com>
Co-authored-by: Zijun Zhou <zijunzhou@google.com>