Proto commits in intelligent-machine-learning/dlrover

The Protocol Buffers files changed in these 65 commits:

2024-12-17

Commit:	7601f76
Author:	cos120	2024-12-17 19:45:36 +0800
Committer:	GitHub	2024-12-17 19:45:36 +0800

feat(xpu-timer): init commit xpu-timer (#1391)
* feat(xpu-timer): init commit xpu-timer
* fix lint
* Revert "fix lint"
  This reverts commit b5d346c03db4ad0b96a2e5f0a6673876c071762e.
* fix doc, ignore lint in xpu-timer
* add copyright
---------
Co-authored-by: lizhi <zhangji.zhang@antgroup.com>
Co-authored-by: Qinlong Wang <WangQL1201@outlook.com>

The documentation is generated from this commit.

Commit:	c8afcf1
Author:	lizhi	2024-12-17 13:09:57 +0800
Committer:	lizhi	2024-12-17 13:34:31 +0800

add copyright

The documentation is generated from this commit.

Commit:	1340e1a
Author:	lizhi	2024-12-16 13:54:23 +0800
Committer:	lizhi	2024-12-17 13:34:31 +0800

Revert "fix lint"
This reverts commit b5d346c03db4ad0b96a2e5f0a6673876c071762e.

Commit:	2d8ef34
Author:	lizhi	2024-12-13 17:09:20 +0800
Committer:	lizhi	2024-12-17 13:34:31 +0800

fix lint

Commit:	6b60fd3
Author:	lizhi	2024-12-13 14:45:42 +0800
Committer:	lizhi	2024-12-17 13:34:30 +0800

feat(xpu-timer): init commit xpu-timer

2024-11-28

Commit:	ffbeae0
Author:	mingcheng	2024-11-27 11:01:38 +0800
Committer:	mingcheng	2024-11-28 10:45:12 +0800

migrated tfplus and atorch into independent repository

2024-05-13

Commit:	79d1ffc
Author:	Qinlong Wang	2024-05-13 13:20:25 +0800
Committer:	Qinlong Wang	2024-05-13 13:21:20 +0800

Remove empty.proto in the elastic_training.proto (#1116)
* Remove empty.proto in the elastic_training.proto
* Set the rdzv backend and timeout.

Commit:	87e5872
Author:	Qinlong Wang	2024-05-13 13:20:25 +0800
Committer:	GitHub	2024-05-13 13:20:25 +0800

Remove empty.proto in the elastic_training.proto (#1116)
* Remove empty.proto in the elastic_training.proto
* Set the rdzv backend and timeout.

2023-10-11

Commit:	c283841
Author:	Wlong692	2023-10-11 17:28:42 +0800
Committer:	GitHub	2023-10-11 17:28:42 +0800

feat(TFPlus): TFPlus 0.1.0 (#740)
* tfplus opensource 0.1.0 init
* fix github action
* fix: resolved feedback from PR 740 review
* fix: check precommit ci problem
* fix: precommit ci git problem
* fix resolved feedback of dev/script from PR 740 review
* Update test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* update setup.py and Readme
---------
Co-authored-by: Wlong962 <ycxovo_op@outlook.com>

2023-10-10

Commit:	ad8d898
Author:	ZhongYingMatrix	2023-09-27 17:11:32 +0800
Committer:	ZhongYingMatrix	2023-10-10 09:43:12 +0800

atorch update changes

2023-09-03

Commit:	e92fc20
Author:	Qinlong Wang	2023-09-03 09:40:31 +0800
Committer:	GitHub	2023-09-03 09:40:31 +0800

Refactor: Simplify the proto message definition. (#651)
* Feat: Implement the message of distributed strategy.
* Refactor: rename model_metric to model_info
* Refactor: refator the proto message definition.
* Move the position of DataLoderConfig
* Retry to send grpc message.
* Add test cases.
* Fix test cases.

2023-08-18

Commit:	9815acc
Author:	Tingfeng Lan	2023-08-18 16:19:42 +0800
Committer:	GitHub	2023-08-18 16:19:42 +0800

2/3 feat(elastic_agent) add GPU statistic report to resource monitor. (#611)
* Add proto prototype for grpc information.
* Add information reporting for gpu stats in the master_client section.
* Add unit test for gpu resource reporting to agent_monitor.
* Rewrite gpu_list and unit tests using dataclass.
* Formatting code.
* Adjust log levels.
* mock init_gpu_monitor to avoid pynvml error.
* mock pynvml.nvmlinit to avoid pynvml error.
* Remove redundant printouts.
* Use the gpu metric class to update the gpu report.
* Update test_report_used_resource with gpu_stats.
* Formatting code.

2023-08-02

Commit:	1873d95
Author:	Qinlong Wang	2023-08-02 18:33:59 +0800
Committer:	GitHub	2023-08-02 18:33:59 +0800

Classify the training error message from the agent. (#570)
* Report the message to the master when the nodes scale down.
* Classify the training error message from the agent.
* Fix test cases

2023-07-24

Commit:	2d86add
Author:	Qinlong Wang	2023-07-24 10:16:35 +0800
Committer:	GitHub	2023-07-24 10:16:35 +0800

Remove the round to test allgather with all nodes (#512)
* Remove the round to test allgather with all nodes
* Fix test cases
* Replace a magic number with a variable
* Fix docstring by comments
* Format codes

2023-07-12

Commit:	2b3e886
Author:	Qinlong Wang	2023-07-12 11:12:11 +0800
Committer:	GitHub	2023-07-12 11:12:11 +0800

Scale down nodes with the number unit if not enough nodes. (#490)
* Scale down nodes with an number unit if there are not enough nodes
* Rename worker_num_unit to node_unit
* Fix test cases
* The manager notifies workers to restart processes only when the number of nodes is a multiple of node unit
* Build rendzvous only when the number of new nodes is bigger than node unit

2023-07-03

Commit:	cb10597
Author:	Qinlong Wang	2023-07-03 17:15:03 +0800
Committer:	GitHub	2023-07-03 17:15:03 +0800

A monitor to log the error of training process. (#475)
* A monitor to log the error of training process
* Format codes

2023-07-01

Commit:	6435865
Author:	Qinlong Wang	2023-07-01 10:35:00 +0800
Committer:	GitHub	2023-07-01 10:35:00 +0800

Join rendezvous with the rank-id in the env. (#468)
* Use rank-id not node-id to build rdzv
* Use rank id to join rendezvous
* Format codes
* Fix test cases

2023-06-30

Commit:	b86bf50
Author:	Qinlong Wang	2023-06-30 15:12:28 +0800
Committer:	GitHub	2023-06-30 15:12:28 +0800

Agent to check network before starting training processes. (#466)
* Agent to check network before starting training processes
* Fix test cases
* install torch 2.0.1
* Update the imaget for test
* Refactor the grpc method
* Use py38 as base to build ci image
* Fix test cases
* Format codes

Commit:	d481dd3
Author:	Qinlong Wang	2023-06-30 11:24:56 +0800
Committer:	GitHub	2023-06-30 11:24:56 +0800

Rendezvous manager to help the node to check network. (#465)
* Implement rdzv manager for network check
* Add docstring
* Remove unused codes
* Remove annotation codes
* Merge test cases
* Merge codes
* Format codes

2023-06-28

Commit:	704cb09
Author:	Qinlong Wang	2023-06-28 13:15:56 +0800
Committer:	GitHub	2023-06-28 13:15:56 +0800

Rendezvous manager to assign the rank to nodes. (#458)
* Rendezvous manager to assign the rank to nodes
* Format codes
* Add test cases
* Format codes
* Format codes
* Remove debug codes
* Set timeout to 120s
* Get rendezvous round index from master
* lock to resturn rdzv_nodes
* Fix the bug to remove exited nodes
* Add docstring
* Use dlrover-run in the elasticjob of torch
* Add docstring to next_rendezvous
* Format codes

Commit:	3cb888f
Author:	Qinlong Wang	2023-06-28 10:27:24 +0800
Committer:	GitHub	2023-06-28 10:27:24 +0800

The agent reports process error message to DLRover master. (#462)
* Report node failures to master
* Agent reports node failures to master when the process fails
* Format codes
* Return response
* Lock to log error data

2023-06-27

Commit:	697d935
Author:	Zhang Haitao	2023-06-27 14:20:04 +0800
Committer:	GitHub	2023-06-27 14:20:04 +0800

Add acc engine (#457)
* add acc engine
* format and test
* fix
* fix tests
* exclude servicer.py to avoid mypy error
* fix ut

2023-06-14

Commit:	54b7ce7
Author:	Qinlong Wang	2023-06-14 17:11:46 +0800
Committer:	GitHub	2023-06-14 17:11:46 +0800

Master log the rank of Pod because the rank will change for FT. (#445)
* Worker report its rank to master for log
* Cast rank to int
* Format codes
* Fix test cases
* Implement report node status in LocalMasterClient
* Fix test cases

2023-04-11

Commit:	ef6f40f
Author:	Qinlong Wang	2023-04-11 16:08:48 +0800
Committer:	GitHub	2023-04-11 16:08:48 +0800

Support the scale down PS if some PS nodes fails (#388)

2023-04-07

Commit:	9fe1fad
Author:	Qinlong Wang	2023-04-07 14:18:01 +0800
Committer:	GitHub	2023-04-07 14:18:01 +0800

Checkpoint dataset shards and restore those shards when restarting workers. (#382)
* Checkpoint model and dataset for elasticity and fault-tolerance
* Set the default value of checkpoint path to an empty string
* Reformat codes
* Fix test cases
* Fix test cases

2023-04-05

Commit:	a0f401c
Author:	Qinlong Wang	2023-04-05 09:32:51 +0800
Committer:	GitHub	2023-04-05 09:32:51 +0800

Adjust the number of worker base 2. (#377)
* Adjust the number of worker base 2
* Fix rendezvous service to support torch elastic
* Fix to update rendezvous states
* Implement a resource checker to get available number of workers
* Format codes
* Fix the image of mnist
* Fix test cases
* Sleep 30s to start auto-scaling
* Clear KV store if the worker group changes
* Remove unused imports

2023-03-25

Commit:	5f0a49c
Author:	Qinlong Wang	2023-03-25 09:58:10 +0800

Rank 0 worker reports its IP as the master addr

2023-03-24

Commit:	0cba49e
Author:	Qinlong Wang	2023-03-24 14:21:28 +0800

Fix to get key-value from master store

2023-03-22

Commit:	f8f546e
Author:	Qinlong Wang	2023-03-22 18:31:48 +0800

Format codes

2023-03-13

Commit:	0ae5aba
Author:	Qinlong Wang	2023-03-13 20:39:13 +0800

Implement rendezvous service for torch elastic

2023-03-06

Commit:	7edadf0
Author:	hanxudong	2023-03-06 16:35:58 +0800

fix

Commit:	f5eb65e
Author:	hanxudong	2023-03-06 14:14:02 +0800

reformat

2023-03-02

Commit:	23b61c2
Author:	hxdtest	2023-03-02 09:56:33 +0800
Committer:	GitHub	2023-03-02 09:56:33 +0800

Merge pull request #260 from workingloong/sync-service
Implement a sychronized service

2023-02-28

Commit:	914e495
Author:	Qinlong Wang	2023-02-28 15:19:55 +0800

Implement a sychronized service

2023-02-27

Commit:	1f6f005
Author:	cailun.cl	2023-02-27 17:14:40 +0800

add priority field to PodResource

2023-02-09

Commit:	3a2d901
Author:	Qinlong Wang	2023-02-09 17:05:22 +0800
Committer:	GitHub	2023-02-09 17:05:22 +0800

Merge pull request #197 from intelligent-machine-learning/add_ray_platform
Adaption for running on ray.

Commit:	4362676
Author:	hanxudong	2023-02-09 11:10:26 +0800

reformat

2023-02-08

Commit:	ce33560
Author:	jian.sha	2023-02-09 01:13:03 +0800

update legarcy comments.

Commit:	7d5f62f
Author:	hanxudong	2023-02-08 16:13:12 +0800

run a master and actor experiment

2023-02-07

Commit:	9ff78e7
Author:	hanxudong	2023-02-07 16:34:56 +0800

add info

Commit:	8a0f098
Author:	bsang	2023-02-07 16:20:43 +0800

remove vendor

Commit:	0a09490
Author:	hanxudong	2023-02-07 14:19:33 +0800
Committer:	hanxudong	2023-02-07 14:38:50 +0800

add test case

Commit:	8dc90b9
Author:	hanxudong	2023-02-07 12:30:21 +0800
Committer:	hanxudong	2023-02-07 14:37:49 +0800

add event

Commit:	7526233
Author:	hanxudong	2023-02-07 11:43:01 +0800
Committer:	hanxudong	2023-02-07 14:37:49 +0800

add node event and type

2023-01-12

Commit:	b8f931f
Author:	bsang	2023-01-12 11:29:59 +0800

update

2022-12-22

Commit:	accff1a
Author:	b.sang	2022-12-21 22:16:45 -0800

add brain dockerfile

Commit:	7e55a7e
Author:	b.sang	2022-12-21 22:15:54 -0800

add brain dockerfile

2022-12-15

Commit:	39f1d19
Author:	Qinlong Wang	2022-12-15 18:38:27 +0800

Rename proto package

2022-12-12

Commit:	d449655
Author:	Qinlong Wang	2022-12-12 11:11:22 +0800

Merge branch 'master' into fix-dispatch-task

Commit:	f389160
Author:	Qinlong Wang	2022-12-12 10:49:34 +0800

Get tasks with the type of a node

2022-12-10

Commit:	2356b95
Author:	Qinlong Wang	2022-12-10 20:25:16 +0800

Fix dynamic sharding

2022-12-02

Commit:	6ccabd5
Author:	Qinlong Wang	2022-12-02 17:30:02 +0800

Fix trainer-test

Commit:	6869e8e
Author:	b.sang	2022-12-01 16:48:27 -0800

k8s watcher impl

Commit:	c526d17
Author:	b.sang	2022-12-01 16:47:52 -0800

k8s watcher impl

2022-11-23

Commit:	97d4ce4
Author:	Qinlong Wang	2022-11-23 19:16:29 +0800

Format codes

2022-11-18

Commit:	f9f94ae
Author:	b.sang	2022-11-17 19:25:03 -0800

add ut

Commit:	1469e63
Author:	b.sang	2022-11-17 19:23:59 -0800

add ut

2022-11-17

Commit:	4cf329b
Author:	b.sang	2022-11-17 15:00:47 -0800

add brain server

Commit:	dd78acf
Author:	b.sang	2022-11-17 10:58:00 -0800

add config manager

Commit:	3438617
Author:	b.sang	2022-11-17 10:21:30 -0800

update proto

Commit:	fc8112a
Author:	b.sang	2022-11-16 18:17:56 -0800

add base datastore

Commit:	f6a14c0
Author:	b.sang	2022-11-16 18:17:12 -0800

add base datastore

2022-11-16

Commit:	e7aa32b
Author:	b.sang	2022-11-15 17:01:19 -0800

brain proto

2022-11-15

Commit:	43d1163
Author:	Qinlong Wang	2022-11-15 20:20:07 +0800

Format proto

Commit:	d7a01d6
Author:	Qinlong Wang	2022-11-15 19:58:03 +0800

Implement TaskManager to manage tasks of dataset