The Protocol Buffers files changed in the following 65 commits:
| Commit: | 7601f76 | |
|---|---|---|
| Author: | cos120 | |
| Committer: | GitHub | |
feat(xpu-timer): init commit xpu-timer (#1391)

* feat(xpu-timer): init commit xpu-timer
* fix lint
* Revert "fix lint" This reverts commit b5d346c03db4ad0b96a2e5f0a6673876c071762e.
* fix doc, ignore lint in xpu-timer
* add copyright

Co-authored-by: lizhi <zhangji.zhang@antgroup.com>
Co-authored-by: Qinlong Wang <WangQL1201@outlook.com>

The documentation is generated from this commit.
| Commit: | c8afcf1 | |
|---|---|---|
| Author: | lizhi | |
| Committer: | lizhi | |
add copyright
The documentation is generated from this commit.
| Commit: | 1340e1a | |
|---|---|---|
| Author: | lizhi | |
| Committer: | lizhi | |
Revert "fix lint" This reverts commit b5d346c03db4ad0b96a2e5f0a6673876c071762e.
| Commit: | 2d8ef34 | |
|---|---|---|
| Author: | lizhi | |
| Committer: | lizhi | |
fix lint
| Commit: | 6b60fd3 | |
|---|---|---|
| Author: | lizhi | |
| Committer: | lizhi | |
feat(xpu-timer): init commit xpu-timer
| Commit: | ffbeae0 | |
|---|---|---|
| Author: | mingcheng | |
| Committer: | mingcheng | |
migrated tfplus and atorch into independent repository
| Commit: | 79d1ffc | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | Qinlong Wang | |
Remove empty.proto in the elastic_training.proto (#1116)

* Remove empty.proto in the elastic_training.proto
* Set the rdzv backend and timeout.

| Commit: | 87e5872 | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
Remove empty.proto in the elastic_training.proto (#1116)

* Remove empty.proto in the elastic_training.proto
* Set the rdzv backend and timeout.

| Commit: | c283841 | |
|---|---|---|
| Author: | Wlong692 | |
| Committer: | GitHub | |
feat(TFPlus): TFPlus 0.1.0 (#740)

* tfplus opensource 0.1.0 init
* fix github action
* fix: resolved feedback from PR 740 review
* fix: check precommit ci problem
* fix: precommit ci git problem
* fix resolved feedback of dev/script from PR 740 review
* Update test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* update setup.py and Readme

Co-authored-by: Wlong962 <ycxovo_op@outlook.com>

| Commit: | ad8d898 | |
|---|---|---|
| Author: | ZhongYingMatrix | |
| Committer: | ZhongYingMatrix | |
atorch update changes
| Commit: | e92fc20 | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
Refactor: Simplify the proto message definition. (#651)

* Feat: Implement the message of distributed strategy.
* Refactor: rename model_metric to model_info
* Refactor: refator the proto message definition.
* Move the position of DataLoderConfig
* Retry to send grpc message.
* Add test cases.
* Fix test cases.

| Commit: | 9815acc | |
|---|---|---|
| Author: | Tingfeng Lan | |
| Committer: | GitHub | |
2/3 feat(elastic_agent) add GPU statistic report to resource monitor. (#611)

* Add proto prototype for grpc information.
* Add information reporting for gpu stats in the master_client section.
* Add unit test for gpu resource reporting to agent_monitor.
* Rewrite gpu_list and unit tests using dataclass.
* Formatting code.
* Adjust log levels.
* mock init_gpu_monitor to avoid pynvml error.
* mock pynvml.nvmlinit to avoid pynvml error.
* Remove redundant printouts.
* Use the gpu metric class to update the gpu report.
* Update test_report_used_resource with gpu_stats.
* Formatting code.

| Commit: | 1873d95 | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
Classify the training error message from the agent. (#570)

* Report the message to the master when the nodes scale down.
* Classify the training error message from the agent.
* Fix test cases

| Commit: | 2d86add | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
Remove the round to test allgather with all nodes (#512)

* Remove the round to test allgather with all nodes
* Fix test cases
* Replace a magic number with a variable
* Fix docstring by comments
* Format codes

| Commit: | 2b3e886 | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
Scale down nodes with the number unit if not enough nodes. (#490)

* Scale down nodes with an number unit if there are not enough nodes
* Rename worker_num_unit to node_unit
* Fix test cases
* The manager notifies workers to restart processes only when the number of nodes is a multiple of node unit
* Build rendzvous only when the number of new nodes is bigger than node unit

| Commit: | cb10597 | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
A monitor to log the error of training process. (#475)

* A monitor to log the error of training process
* Format codes

| Commit: | 6435865 | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
Join rendezvous with the rank-id in the env. (#468)

* Use rank-id not node-id to build rdzv
* Use rank id to join rendezvous
* Format codes
* Fix test cases

| Commit: | b86bf50 | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
Agent to check network before starting training processes. (#466)

* Agent to check network before starting training processes
* Fix test cases
* install torch 2.0.1
* Update the imaget for test
* Refactor the grpc method
* Use py38 as base to build ci image
* Fix test cases
* Format codes

| Commit: | d481dd3 | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
Rendezvous manager to help the node to check network. (#465)

* Implement rdzv manager for network check
* Add docstring
* Remove unused codes
* Remove annotation codes
* Merge test cases
* Merge codes
* Format codes

| Commit: | 704cb09 | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
Rendezvous manager to assign the rank to nodes. (#458)

* Rendezvous manager to assign the rank to nodes
* Format codes
* Add test cases
* Format codes
* Format codes
* Remove debug codes
* Set timeout to 120s
* Get rendezvous round index from master
* lock to resturn rdzv_nodes
* Fix the bug to remove exited nodes
* Add docstring
* Use dlrover-run in the elasticjob of torch
* Add docstring to next_rendezvous
* Format codes

| Commit: | 3cb888f | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
The agent reports process error message to DLRover master. (#462)

* Report node failures to master
* Agent reports node failures to master when the process fails
* Format codes
* Return response
* Lock to log error data

| Commit: | 697d935 | |
|---|---|---|
| Author: | Zhang Haitao | |
| Committer: | GitHub | |
Add acc engine (#457)

* add acc engine
* format and test
* fix
* fix tests
* exclude servicer.py to avoid mypy error
* fix ut

| Commit: | 54b7ce7 | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
Master log the rank of Pod because the rank will change for FT. (#445)

* Worker report its rank to master for log
* Cast rank to int
* Format codes
* Fix test cases
* Implement report node status in LocalMasterClient
* Fix test cases

| Commit: | ef6f40f | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
Support the scale down PS if some PS nodes fails (#388)
| Commit: | 9fe1fad | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
Checkpoint dataset shards and restore those shards when restarting workers. (#382)

* Checkpoint model and dataset for elasticity and fault-tolerance
* Set the default value of checkpoint path to an empty string
* Reformat codes
* Fix test cases
* Fix test cases

| Commit: | a0f401c | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
Adjust the number of worker base 2. (#377)

* Adjust the number of worker base 2
* Fix rendezvous service to support torch elastic
* Fix to update rendezvous states
* Implement a resource checker to get available number of workers
* Format codes
* Fix the image of mnist
* Fix test cases
* Sleep 30s to start auto-scaling
* Clear KV store if the worker group changes
* Remove unused imports

| Commit: | 5f0a49c | |
|---|---|---|
| Author: | Qinlong Wang | |
Rank 0 worker reports its IP as the master addr
| Commit: | 0cba49e | |
|---|---|---|
| Author: | Qinlong Wang | |
Fix to get key-value from master store
| Commit: | f8f546e | |
|---|---|---|
| Author: | Qinlong Wang | |
Format codes
| Commit: | 0ae5aba | |
|---|---|---|
| Author: | Qinlong Wang | |
Implement rendezvous service for torch elastic
| Commit: | 7edadf0 | |
|---|---|---|
| Author: | hanxudong | |
fix
| Commit: | f5eb65e | |
|---|---|---|
| Author: | hanxudong | |
reformat
| Commit: | 23b61c2 | |
|---|---|---|
| Author: | hxdtest | |
| Committer: | GitHub | |
Merge pull request #260 from workingloong/sync-service Implement a sychronized service
| Commit: | 914e495 | |
|---|---|---|
| Author: | Qinlong Wang | |
Implement a sychronized service
| Commit: | 1f6f005 | |
|---|---|---|
| Author: | cailun.cl | |
add priority field to PodResource
| Commit: | 3a2d901 | |
|---|---|---|
| Author: | Qinlong Wang | |
| Committer: | GitHub | |
Merge pull request #197 from intelligent-machine-learning/add_ray_platform Adaption for running on ray.
| Commit: | 4362676 | |
|---|---|---|
| Author: | hanxudong | |
reformat
| Commit: | ce33560 | |
|---|---|---|
| Author: | jian.sha | |
update legarcy comments.
| Commit: | 7d5f62f | |
|---|---|---|
| Author: | hanxudong | |
run a master and actor experiment
| Commit: | 9ff78e7 | |
|---|---|---|
| Author: | hanxudong | |
add info
| Commit: | 8a0f098 | |
|---|---|---|
| Author: | bsang | |
remove vendor
| Commit: | 0a09490 | |
|---|---|---|
| Author: | hanxudong | |
| Committer: | hanxudong | |
add test case
| Commit: | 8dc90b9 | |
|---|---|---|
| Author: | hanxudong | |
| Committer: | hanxudong | |
add event
| Commit: | 7526233 | |
|---|---|---|
| Author: | hanxudong | |
| Committer: | hanxudong | |
add node event and type
| Commit: | b8f931f | |
|---|---|---|
| Author: | bsang | |
update
| Commit: | accff1a | |
|---|---|---|
| Author: | b.sang | |
add brain dockerfile
| Commit: | 7e55a7e | |
|---|---|---|
| Author: | b.sang | |
add brain dockerfile
| Commit: | 39f1d19 | |
|---|---|---|
| Author: | Qinlong Wang | |
Rename proto package
| Commit: | d449655 | |
|---|---|---|
| Author: | Qinlong Wang | |
Merge branch 'master' into fix-dispatch-task
| Commit: | f389160 | |
|---|---|---|
| Author: | Qinlong Wang | |
Get tasks with the type of a node
| Commit: | 2356b95 | |
|---|---|---|
| Author: | Qinlong Wang | |
Fix dynamic sharding
| Commit: | 6ccabd5 | |
|---|---|---|
| Author: | Qinlong Wang | |
Fix trainer-test
| Commit: | 6869e8e | |
|---|---|---|
| Author: | b.sang | |
k8s watcher impl
| Commit: | c526d17 | |
|---|---|---|
| Author: | b.sang | |
k8s watcher impl
| Commit: | 97d4ce4 | |
|---|---|---|
| Author: | Qinlong Wang | |
Format codes
| Commit: | f9f94ae | |
|---|---|---|
| Author: | b.sang | |
add ut
| Commit: | 1469e63 | |
|---|---|---|
| Author: | b.sang | |
add ut
| Commit: | 4cf329b | |
|---|---|---|
| Author: | b.sang | |
add brain server
| Commit: | dd78acf | |
|---|---|---|
| Author: | b.sang | |
add config manager
| Commit: | 3438617 | |
|---|---|---|
| Author: | b.sang | |
update proto
| Commit: | fc8112a | |
|---|---|---|
| Author: | b.sang | |
add base datastore
| Commit: | f6a14c0 | |
|---|---|---|
| Author: | b.sang | |
add base datastore
| Commit: | e7aa32b | |
|---|---|---|
| Author: | b.sang | |
brain proto
| Commit: | 43d1163 | |
|---|---|---|
| Author: | Qinlong Wang | |
Format proto
| Commit: | d7a01d6 | |
|---|---|---|
| Author: | Qinlong Wang | |
Implement TaskManager to manage tasks of dataset