The Protocol Buffers files changed in the following 65 commits:
Commit: | 7601f76 | |
---|---|---|
Author: | cos120 | |
Committer: | GitHub |
feat(xpu-timer): init commit xpu-timer (#1391)
* feat(xpu-timer): init commit xpu-timer
* fix lint
* Revert "fix lint" This reverts commit b5d346c03db4ad0b96a2e5f0a6673876c071762e.
* fix doc, ignore lint in xpu-timer
* add copyright

Co-authored-by: lizhi <zhangji.zhang@antgroup.com>
Co-authored-by: Qinlong Wang <WangQL1201@outlook.com>
The documentation is generated from this commit.
Commit: | c8afcf1 | |
---|---|---|
Author: | lizhi | |
Committer: | lizhi |
add copyright
The documentation is generated from this commit.
Commit: | 1340e1a | |
---|---|---|
Author: | lizhi | |
Committer: | lizhi |
Revert "fix lint" This reverts commit b5d346c03db4ad0b96a2e5f0a6673876c071762e.
Commit: | 2d8ef34 | |
---|---|---|
Author: | lizhi | |
Committer: | lizhi |
fix lint
Commit: | 6b60fd3 | |
---|---|---|
Author: | lizhi | |
Committer: | lizhi |
feat(xpu-timer): init commit xpu-timer
Commit: | ffbeae0 | |
---|---|---|
Author: | mingcheng | |
Committer: | mingcheng |
migrated tfplus and atorch into independent repository
Commit: | 79d1ffc | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | Qinlong Wang |
Remove empty.proto in the elastic_training.proto (#1116)
* Remove empty.proto in the elastic_training.proto
* Set the rdzv backend and timeout.
Commit: | 87e5872 | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
Remove empty.proto in the elastic_training.proto (#1116)
* Remove empty.proto in the elastic_training.proto
* Set the rdzv backend and timeout.
Commit: | c283841 | |
---|---|---|
Author: | Wlong692 | |
Committer: | GitHub |
feat(TFPlus): TFPlus 0.1.0 (#740)
* tfplus opensource 0.1.0 init
* fix github action
* fix: resolved feedback from PR 740 review
* fix: check precommit ci problem
* fix: precommit ci git problem
* fix resolved feedback of dev/script from PR 740 review
* Update test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* Fix test trigger logic in CI workflow
* update setup.py and Readme

Co-authored-by: Wlong962 <ycxovo_op@outlook.com>
Commit: | ad8d898 | |
---|---|---|
Author: | ZhongYingMatrix | |
Committer: | ZhongYingMatrix |
atorch update changes
Commit: | e92fc20 | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
Refactor: Simplify the proto message definition. (#651)
* Feat: Implement the message of distributed strategy.
* Refactor: rename model_metric to model_info
* Refactor: refator the proto message definition.
* Move the position of DataLoderConfig
* Retry to send grpc message.
* Add test cases.
* Fix test cases.
Commit: | 9815acc | |
---|---|---|
Author: | Tingfeng Lan | |
Committer: | GitHub |
2/3 feat(elastic_agent) add GPU statistic report to resource monitor. (#611)
* Add proto prototype for grpc information.
* Add information reporting for gpu stats in the master_client section.
* Add unit test for gpu resource reporting to agent_monitor.
* Rewrite gpu_list and unit tests using dataclass.
* Formatting code.
* Adjust log levels.
* mock init_gpu_monitor to avoid pynvml error.
* mock pynvml.nvmlinit to avoid pynvml error.
* Remove redundant printouts.
* Use the gpu metric class to update the gpu report.
* Update test_report_used_resource with gpu_stats.
* Formatting code.
Commit: | 1873d95 | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
Classify the training error message from the agent. (#570)
* Report the message to the master when the nodes scale down.
* Classify the training error message from the agent.
* Fix test cases
Commit: | 2d86add | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
Remove the round to test allgather with all nodes (#512)
* Remove the round to test allgather with all nodes
* Fix test cases
* Replace a magic number with a variable
* Fix docstring by comments
* Format codes
Commit: | 2b3e886 | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
Scale down nodes with the number unit if not enough nodes. (#490)
* Scale down nodes with an number unit if there are not enough nodes
* Rename worker_num_unit to node_unit
* Fix test cases
* The manager notifies workers to restart processes only when the number of nodes is a multiple of node unit
* Build rendzvous only when the number of new nodes is bigger than node unit
Commit: | cb10597 | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
A monitor to log the error of training process. (#475)
* A monitor to log the error of training process
* Format codes
Commit: | 6435865 | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
Join rendezvous with the rank-id in the env. (#468)
* Use rank-id not node-id to build rdzv
* Use rank id to join rendezvous
* Format codes
* Fix test cases
Commit: | b86bf50 | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
Agent to check network before starting training processes. (#466)
* Agent to check network before starting training processes
* Fix test cases
* install torch 2.0.1
* Update the imaget for test
* Refactor the grpc method
* Use py38 as base to build ci image
* Fix test cases
* Format codes
Commit: | d481dd3 | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
Rendezvous manager to help the node to check network. (#465)
* Implement rdzv manager for network check
* Add docstring
* Remove unused codes
* Remove annotation codes
* Merge test cases
* Merge codes
* Format codes
Commit: | 704cb09 | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
Rendezvous manager to assign the rank to nodes. (#458)
* Rendezvous manager to assign the rank to nodes
* Format codes
* Add test cases
* Format codes
* Format codes
* Remove debug codes
* Set timeout to 120s
* Get rendezvous round index from master
* lock to resturn rdzv_nodes
* Fix the bug to remove exited nodes
* Add docstring
* Use dlrover-run in the elasticjob of torch
* Add docstring to next_rendezvous
* Format codes
Commit: | 3cb888f | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
The agent reports process error message to DLRover master. (#462)
* Report node failures to master
* Agent reports node failures to master when the process fails
* Format codes
* Return response
* Lock to log error data
Commit: | 697d935 | |
---|---|---|
Author: | Zhang Haitao | |
Committer: | GitHub |
Add acc engine (#457)
* add acc engine
* format and test
* fix
* fix tests
* exclude servicer.py to avoid mypy error
* fix ut
Commit: | 54b7ce7 | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
Master log the rank of Pod because the rank will change for FT. (#445)
* Worker report its rank to master for log
* Cast rank to int
* Format codes
* Fix test cases
* Implement report node status in LocalMasterClient
* Fix test cases
Commit: | ef6f40f | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
Support the scale down PS if some PS nodes fails (#388)
Commit: | 9fe1fad | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
Checkpoint dataset shards and restore those shards when restarting workers. (#382)
* Checkpoint model and dataset for elasticity and fault-tolerance
* Set the default value of checkpoint path to an empty string
* Reformat codes
* Fix test cases
* Fix test cases
Commit: | a0f401c | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
Adjust the number of worker base 2. (#377)
* Adjust the number of worker base 2
* Fix rendezvous service to support torch elastic
* Fix to update rendezvous states
* Implement a resource checker to get available number of workers
* Format codes
* Fix the image of mnist
* Fix test cases
* Sleep 30s to start auto-scaling
* Clear KV store if the worker group changes
* Remove unused imports
Commit: | 5f0a49c | |
---|---|---|
Author: | Qinlong Wang |
Rank 0 worker reports its IP as the master addr
Commit: | 0cba49e | |
---|---|---|
Author: | Qinlong Wang |
Fix to get key-value from master store
Commit: | f8f546e | |
---|---|---|
Author: | Qinlong Wang |
Format codes
Commit: | 0ae5aba | |
---|---|---|
Author: | Qinlong Wang |
Implement rendezvous service for torch elastic
Commit: | 7edadf0 | |
---|---|---|
Author: | hanxudong |
fix
Commit: | f5eb65e | |
---|---|---|
Author: | hanxudong |
reformat
Commit: | 23b61c2 | |
---|---|---|
Author: | hxdtest | |
Committer: | GitHub |
Merge pull request #260 from workingloong/sync-service Implement a sychronized service
Commit: | 914e495 | |
---|---|---|
Author: | Qinlong Wang |
Implement a sychronized service
Commit: | 1f6f005 | |
---|---|---|
Author: | cailun.cl |
add priority field to PodResource
Commit: | 3a2d901 | |
---|---|---|
Author: | Qinlong Wang | |
Committer: | GitHub |
Merge pull request #197 from intelligent-machine-learning/add_ray_platform Adaption for running on ray.
Commit: | 4362676 | |
---|---|---|
Author: | hanxudong |
reformat
Commit: | ce33560 | |
---|---|---|
Author: | jian.sha |
update legarcy comments.
Commit: | 7d5f62f | |
---|---|---|
Author: | hanxudong |
run a master and actor experiment
Commit: | 9ff78e7 | |
---|---|---|
Author: | hanxudong |
add info
Commit: | 8a0f098 | |
---|---|---|
Author: | bsang |
remove vendor
Commit: | 0a09490 | |
---|---|---|
Author: | hanxudong | |
Committer: | hanxudong |
add test case
Commit: | 8dc90b9 | |
---|---|---|
Author: | hanxudong | |
Committer: | hanxudong |
add event
Commit: | 7526233 | |
---|---|---|
Author: | hanxudong | |
Committer: | hanxudong |
add node event and type
Commit: | b8f931f | |
---|---|---|
Author: | bsang |
update
Commit: | accff1a | |
---|---|---|
Author: | b.sang |
add brain dockerfile
Commit: | 7e55a7e | |
---|---|---|
Author: | b.sang |
add brain dockerfile
Commit: | 39f1d19 | |
---|---|---|
Author: | Qinlong Wang |
Rename proto package
Commit: | d449655 | |
---|---|---|
Author: | Qinlong Wang |
Merge branch 'master' into fix-dispatch-task
Commit: | f389160 | |
---|---|---|
Author: | Qinlong Wang |
Get tasks with the type of a node
Commit: | 2356b95 | |
---|---|---|
Author: | Qinlong Wang |
Fix dynamic sharding
Commit: | 6ccabd5 | |
---|---|---|
Author: | Qinlong Wang |
Fix trainer-test
Commit: | 6869e8e | |
---|---|---|
Author: | b.sang |
k8s watcher impl
Commit: | c526d17 | |
---|---|---|
Author: | b.sang |
k8s watcher impl
Commit: | 97d4ce4 | |
---|---|---|
Author: | Qinlong Wang |
Format codes
Commit: | f9f94ae | |
---|---|---|
Author: | b.sang |
add ut
Commit: | 1469e63 | |
---|---|---|
Author: | b.sang |
add ut
Commit: | 4cf329b | |
---|---|---|
Author: | b.sang |
add brain server
Commit: | dd78acf | |
---|---|---|
Author: | b.sang |
add config manager
Commit: | 3438617 | |
---|---|---|
Author: | b.sang |
update proto
Commit: | fc8112a | |
---|---|---|
Author: | b.sang |
add base datastore
Commit: | f6a14c0 | |
---|---|---|
Author: | b.sang |
add base datastore
Commit: | e7aa32b | |
---|---|---|
Author: | b.sang |
brain proto
Commit: | 43d1163 | |
---|---|---|
Author: | Qinlong Wang |
Format proto
Commit: | d7a01d6 | |
---|---|---|
Author: | Qinlong Wang |
Implement TaskManager to manage tasks of dataset