Proto commits in intelligent-machine-learning/dlrover

The Protocol Buffers files changed in these 65 commits:

Commit:7601f76
Author:cos120
Committer:GitHub

feat(xpu-timer): init commit xpu-timer (#1391)
* feat(xpu-timer): init commit xpu-timer
* fix lint
* Revert "fix lint"
  This reverts commit b5d346c03db4ad0b96a2e5f0a6673876c071762e.
* fix doc, ignore lint in xpu-timer
* add copyright
---------
Co-authored-by: lizhi <zhangji.zhang@antgroup.com>
Co-authored-by: Qinlong Wang <WangQL1201@outlook.com>

The documentation is generated from this commit.

Commit:c8afcf1
Author:lizhi
Committer:lizhi

add copyright

The documentation is generated from this commit.

Commit:1340e1a
Author:lizhi
Committer:lizhi

Revert "fix lint"
This reverts commit b5d346c03db4ad0b96a2e5f0a6673876c071762e.

Commit:2d8ef34
Author:lizhi
Committer:lizhi

fix lint

Commit:6b60fd3
Author:lizhi
Committer:lizhi

feat(xpu-timer): init commit xpu-timer

Commit:ffbeae0
Author:mingcheng
Committer:mingcheng

migrated tfplus and atorch into independent repository

Commit:79d1ffc
Author:Qinlong Wang
Committer:Qinlong Wang

Remove empty.proto in the elastic_training.proto (#1116)
* Remove empty.proto in the elastic_training.proto
* Set the rdzv backend and timeout.

Commit:87e5872
Author:Qinlong Wang
Committer:GitHub

Remove empty.proto in the elastic_training.proto (#1116)
* Remove empty.proto in the elastic_training.proto
* Set the rdzv backend and timeout.

Commit:c283841
Author:Wlong692
Committer:GitHub

feat(TFPlus): TFPlus 0.1.0 (#740)
* tfplus opensource 0.1.0 init
* fix github action
* fix: resolved feedback from PR 740 review
* fix: check precommit ci problem
* fix: precommit ci git problem
* fix resolved feedback of dev/script from PR 740 review
* Update test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* update setup.py and Readme
---------
Co-authored-by: Wlong962 <ycxovo_op@outlook.com>

Commit:ad8d898
Author:ZhongYingMatrix
Committer:ZhongYingMatrix

atorch update changes

Commit:e92fc20
Author:Qinlong Wang
Committer:GitHub

Refactor: Simplify the proto message definition. (#651)
* Feat: Implement the message of distributed strategy.
* Refactor: rename model_metric to model_info
* Refactor: refator the proto message definition.
* Move the position of DataLoderConfig
* Retry to send grpc message.
* Add test cases.
* Fix test cases.

Commit:9815acc
Author:Tingfeng Lan
Committer:GitHub

2/3 feat(elastic_agent) add GPU statistic report to resource monitor. (#611)
* Add proto prototype for grpc information.
* Add information reporting for gpu stats in the master_client section.
* Add unit test for gpu resource reporting to agent_monitor.
* Rewrite gpu_list and unit tests using dataclass.
* Formatting code.
* Adjust log levels.
* mock init_gpu_monitor to avoid pynvml error.
* mock pynvml.nvmlinit to avoid pynvml error.
* Remove redundant printouts.
* Use the gpu metric class to update the gpu report.
* Update test_report_used_resource with gpu_stats.
* Formatting code.

Commit:1873d95
Author:Qinlong Wang
Committer:GitHub

Classify the training error message from the agent. (#570)
* Report the message to the master when the nodes scale down.
* Classify the training error message from the agent.
* Fix test cases

Commit:2d86add
Author:Qinlong Wang
Committer:GitHub

Remove the round to test allgather with all nodes (#512)
* Remove the round to test allgather with all nodes
* Fix test cases
* Replace a magic number with a variable
* Fix docstring by comments
* Format codes

Commit:2b3e886
Author:Qinlong Wang
Committer:GitHub

Scale down nodes with the number unit if not enough nodes. (#490)
* Scale down nodes with an number unit if there are not enough nodes
* Rename worker_num_unit to node_unit
* Fix test cases
* The manager notifies workers to restart processes only when the number of nodes is a multiple of node unit
* Build rendzvous only when the number of new nodes is bigger than node unit

Commit:cb10597
Author:Qinlong Wang
Committer:GitHub

A monitor to log the error of training process. (#475)
* A monitor to log the error of training process
* Format codes

Commit:6435865
Author:Qinlong Wang
Committer:GitHub

Join rendezvous with the rank-id in the env. (#468)
* Use rank-id not node-id to build rdzv
* Use rank id to join rendezvous
* Format codes
* Fix test cases

Commit:b86bf50
Author:Qinlong Wang
Committer:GitHub

Agent to check network before starting training processes. (#466)
* Agent to check network before starting training processes
* Fix test cases
* install torch 2.0.1
* Update the imaget for test
* Refactor the grpc method
* Use py38 as base to build ci image
* Fix test cases
* Format codes

Commit:d481dd3
Author:Qinlong Wang
Committer:GitHub

Rendezvous manager to help the node to check network. (#465)
* Implement rdzv manager for network check
* Add docstring
* Remove unused codes
* Remove annotation codes
* Merge test cases
* Merge codes
* Format codes

Commit:704cb09
Author:Qinlong Wang
Committer:GitHub

Rendezvous manager to assign the rank to nodes. (#458)
* Rendezvous manager to assign the rank to nodes
* Format codes
* Add test cases
* Format codes
* Format codes
* Remove debug codes
* Set timeout to 120s
* Get rendezvous round index from master
* lock to resturn rdzv_nodes
* Fix the bug to remove exited nodes
* Add docstring
* Use dlrover-run in the elasticjob of torch
* Add docstring to next_rendezvous
* Format codes

Commit:3cb888f
Author:Qinlong Wang
Committer:GitHub

The agent reports process error message to DLRover master. (#462)
* Report node failures to master
* Agent reports node failures to master when the process fails
* Format codes
* Return response
* Lock to log error data

Commit:697d935
Author:Zhang Haitao
Committer:GitHub

Add acc engine (#457)
* add acc engine
* format and test
* fix
* fix tests
* exclude servicer.py to avoid mypy error
* fix ut

Commit:54b7ce7
Author:Qinlong Wang
Committer:GitHub

Master log the rank of Pod because the rank will change for FT. (#445)
* Worker report its rank to master for log
* Cast rank to int
* Format codes
* Fix test cases
* Implement report node status in LocalMasterClient
* Fix test cases

Commit:ef6f40f
Author:Qinlong Wang
Committer:GitHub

Support the scale down PS if some PS nodes fails (#388)

Commit:9fe1fad
Author:Qinlong Wang
Committer:GitHub

Checkpoint dataset shards and restore those shards when restarting workers. (#382)
* Checkpoint model and dataset for elasticity and fault-tolerance
* Set the default value of checkpoint path to an empty string
* Reformat codes
* Fix test cases
* Fix test cases

Commit:a0f401c
Author:Qinlong Wang
Committer:GitHub

Adjust the number of worker base 2. (#377)
* Adjust the number of worker base 2
* Fix rendezvous service to support torch elastic
* Fix to update rendezvous states
* Implement a resource checker to get available number of workers
* Format codes
* Fix the image of mnist
* Fix test cases
* Sleep 30s to start auto-scaling
* Clear KV store if the worker group changes
* Remove unused imports

Commit:5f0a49c
Author:Qinlong Wang

Rank 0 worker reports its IP as the master addr

Commit:0cba49e
Author:Qinlong Wang

Fix to get key-value from master store

Commit:f8f546e
Author:Qinlong Wang

Format codes

Commit:0ae5aba
Author:Qinlong Wang

Implement rendezvous service for torch elastic

Commit:7edadf0
Author:hanxudong

fix

Commit:f5eb65e
Author:hanxudong

reformat

Commit:23b61c2
Author:hxdtest
Committer:GitHub

Merge pull request #260 from workingloong/sync-service
Implement a sychronized service

Commit:914e495
Author:Qinlong Wang

Implement a sychronized service

Commit:1f6f005
Author:cailun.cl

add priority field to PodResource

Commit:3a2d901
Author:Qinlong Wang
Committer:GitHub

Merge pull request #197 from intelligent-machine-learning/add_ray_platform
Adaption for running on ray.

Commit:4362676
Author:hanxudong

reformat

Commit:ce33560
Author:jian.sha

update legarcy comments.

Commit:7d5f62f
Author:hanxudong

run a master and actor experiment

Commit:9ff78e7
Author:hanxudong

add info

Commit:8a0f098
Author:bsang

remove vendor

Commit:0a09490
Author:hanxudong
Committer:hanxudong

add test case

Commit:8dc90b9
Author:hanxudong
Committer:hanxudong

add event

Commit:7526233
Author:hanxudong
Committer:hanxudong

add node event and type

Commit:b8f931f
Author:bsang

update

Commit:accff1a
Author:b.sang

add brain dockerfile

Commit:7e55a7e
Author:b.sang

add brain dockerfile

Commit:39f1d19
Author:Qinlong Wang

Rename proto package

Commit:d449655
Author:Qinlong Wang

Merge branch 'master' into fix-dispatch-task

Commit:f389160
Author:Qinlong Wang

Get tasks with the type of a node

Commit:2356b95
Author:Qinlong Wang

Fix dynamic sharding

Commit:6ccabd5
Author:Qinlong Wang

Fix trainer-test

Commit:6869e8e
Author:b.sang

k8s watcher impl

Commit:c526d17
Author:b.sang

k8s watcher impl

Commit:97d4ce4
Author:Qinlong Wang

Format codes

Commit:f9f94ae
Author:b.sang

add ut

Commit:1469e63
Author:b.sang

add ut

Commit:4cf329b
Author:b.sang

add brain server

Commit:dd78acf
Author:b.sang

add config manager

Commit:3438617
Author:b.sang

update proto

Commit:fc8112a
Author:b.sang

add base datastore

Commit:f6a14c0
Author:b.sang

add base datastore

Commit:e7aa32b
Author:b.sang

brain proto

Commit:43d1163
Author:Qinlong Wang

Format proto

Commit:d7a01d6
Author:Qinlong Wang

Implement TaskManager to manage tasks of dataset