Motivation

Computing resources on clouds such as Amazon AWS and Baidu Cloud are multi-tenant. Training and serving deep learning models on elastic resources will be common on the cloud. We propose Elastic Deep Learning (EDL), which makes training and inference of deep learning models on the cloud easier and more efficient.

EDL is now an incubation-stage project of the LF AI Foundation.

Installation

The EDL package supports Python 2.7, 3.6, and 3.7. You can install it with pip install paddle_edl, but we highly recommend using it in our Docker image:

docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7
nvidia-docker run -it --name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7 /bin/bash

Latest Release (0.3.1)

Quick start Demo

# Install Paddle Serving
pip install paddle-serving-server-gpu
cd example/distill/resnet

# Download and unpack the teacher model
wget --no-check-certificate https://paddle-edl.bj.bcebos.com/distill_teacher_model/ResNeXt101_32x16d_wsl_model.tar.gz
tar -zxf ResNeXt101_32x16d_wsl_model.tar.gz

# Start the teacher service on GPU 1 (runs in the foreground; use its own terminal)
python -m paddle_serving_server_gpu.serve \
  --model ResNeXt101_32x16d_wsl_model \
  --mem_optim \
  --port 9898 \
  --gpu_ids 1

# In another terminal, from the same directory, start the student training on GPU 0
python -m paddle.distributed.launch --selected_gpus 0 \
  ./train_with_fleet.py \
  --model=ResNet50_vd \
  --data_dir=./ImageNet \
  --use_distill_service=True \
  --distill_teachers=127.0.0.1:9898

| mode | teacher resource | student resource | total batch size | acc1 (%) | acc5 (%) | speed (img/s) |
| --- | --- | --- | --- | --- | --- | --- |
| pure train | None | 8 * v100 | 256 | 77.1 | 93.5 | 1828 |
| teacher and student on the same gpus | 8 * v100 | 8 * v100 | 256 | 79.0 | 94.3 | 656 |
| EDL service distill | 40 * P4 | 8 * v100 | 256 | 79.0 | 94.5 | 1514 |
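
To put the speed column in perspective, the short calculation below uses only the throughput figures from the table above. The variable names are illustrative, and the snippet assumes Python 3.

```python
# Throughput (img/s) copied from the table above.
speeds = {
    "pure train": 1828,
    "teacher and student on the same gpus": 656,
    "EDL service distill": 1514,
}

baseline = speeds["pure train"]

# Fraction of pure-training throughput that each mode retains.
relative = {mode: speed / baseline for mode, speed in speeds.items()}

# Speedup of the EDL distillation service over same-GPU distillation.
speedup = speeds["EDL service distill"] / speeds["teacher and student on the same gpus"]

for mode, ratio in relative.items():
    print("{}: {:.1%} of pure-train throughput".format(mode, ratio))
print("EDL service vs. same-GPU distillation: {:.2f}x".format(speedup))
```

With these figures, the service-based setup retains about 83% of pure-training throughput, compared with about 36% when teacher and student share the same GPUs, a difference of roughly 2.3x.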

About Knowledge Distillation in EDL
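
As background, a generic soft-label distillation loss can be sketched as follows. This is not taken from EDL's code: the function names, the temperature parameter, and the plain-Python implementation are assumptions made for illustration. In the demo above, the teacher predictions would presumably come from the serving endpoint passed via --distill_teachers.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(student_logits, teacher_probs, temperature=1.0):
    """Cross-entropy between teacher soft labels and student predictions."""
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Hypothetical 3-class example: the teacher is fairly sure about class 0.
teacher_probs = softmax([4.0, 1.0, 0.5])
close = distill_loss([3.5, 1.2, 0.4], teacher_probs)  # student agrees with teacher
far = distill_loss([0.1, 3.0, 2.0], teacher_probs)    # student disagrees
print(close < far)
```

The loss is lower when the student's predicted distribution is closer to the teacher's, which is the signal the student trains on in addition to (or instead of) the hard labels.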

Release 0.2.0

Checkpoint-based elastic training on multiple GPUs

ResNet50 experiments on a single machine in Docker

cd example/demo/collective

# Start the demo job server on this node
node_ips="127.0.0.1"
python -u -m paddle_edl.demo.collective.job_server_demo \
    --node_ips ${node_ips} \
    --pod_num_of_node 8 \
    --time_interval_to_change 900 \
    --gpu_num_of_node 8

# set the ImageNet data path
export PADDLE_EDL_IMAGENET_PATH=<your path>
# set the checkpoint path
export PADDLE_EDL_FLEET_CHECKPOINT_PATH=<your path>

mkdir -p resnet50_pod
unset http_proxy https_proxy

# running under edl
export PADDLE_RUNING_ENV=PADDLE_EDL
export PADDLE_JOB_ID="test_job_id_1234"
export PADDLE_POD_ID="not set"

# Start the demo job client
python -u -m paddle_edl.demo.collective.job_client_demo \
    --log_level 20 \
    --package_sh ./resnet50/package.sh \
    --pod_path ./resnet50_pod \
    ./train_pretrain.sh

| model | dataset | gpu cards | total batch size | acc1 (%) | acc5 (%) |
| --- | --- | --- | --- | --- | --- |
| ResNet50 | ImageNet | 16 * v100 | 1024 | 75.5 | 92.8 |

The whole example is here
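
The general idea behind checkpoint-based recovery can be sketched in a few lines: progress is saved after each epoch, and a restarted process picks up from the latest checkpoint instead of starting over. This is a minimal illustration only. EDL's real checkpoint format and API are not shown in this README, so the JSON file, the function names, and the stop_after parameter are all hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "weights": 0.0}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so an interrupted write keeps the old file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def train(path, total_epochs, stop_after=None):
    """Run (or resume) training; stop_after simulates the process being stopped."""
    state = load_checkpoint(path)
    done_this_run = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and done_this_run >= stop_after:
            break
        state["weights"] += 1.0  # placeholder for one epoch of real training
        state["epoch"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
    return state


# Hypothetical checkpoint location in a fresh temporary directory.
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
first = train(ckpt, total_epochs=5, stop_after=2)  # interrupted after 2 epochs
second = train(ckpt, total_epochs=5)               # restarted: resumes at epoch 2
print(first["epoch"], second["epoch"], second["weights"])
```

The second call does only the three remaining epochs, because it reads the epoch counter back from the checkpoint written by the first call.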

Community

FAQ

License

Contribution