
Commit 90dbd6e

Fine tune readme. (#111)
* change readme test=develop
* change readme test=develop
* change readme test=develop
* add test=develop
* add test=develop
1 parent 00c717a commit 90dbd6e


README.md

Lines changed: 45 additions & 44 deletions
@@ -2,43 +2,64 @@
<img src="https://github.com/elasticdeeplearning/artwork/blob/master/horizontal/color/edl-horizontal-color.png" width="500" style="display:inline;vertical-align:middle;padding:2%">

-EDL is an Elastic Deep Learning framework designed to help deep learning cloud service providers build cluster cloud services with the deep learning framework PaddlePaddle.
+# Motivation
+Elastic Deep Learning (EDL) is a framework that can dynamically adjust the parallelism (the number of training workers) of deep neural network training. It supports multi-tenant cluster management, balancing job completion time against job waiting time, maximizing the use of idle resources, and so on.

-EDL includes two parts:
+This project contains the EDL framework and its applications, such as distillation and NAS.

-1. A Kubernetes controller for the elastic scheduling of distributed deep learning jobs, plus tools for adjusting them manually.
+Now EDL is an incubation-stage project of the [LF AI Foundation](https://lfai.foundation).

-1. Making PaddlePaddle a fault-tolerant deep learning framework with a usable API for job management.
+<img src="https://github.com/lfai/artwork/blob/master/lfai-project-badge/incubation/color/lfai-projectlogos_incubation-color.png" width="200" style="display:inline;vertical-align:middle;padding:2%">

-EDL is an incubation-stage project of the [LF AI Foundation](https://lfai.foundation).
+# Installation
+You can install EDL with `pip install paddle_edl`, but we highly **recommend** using it in our docker image:

-<img src="https://github.com/lfai/artwork/blob/master/lfai-project-badge/incubation/color/lfai-projectlogos_incubation-color.png" width="200" style="display:inline;vertical-align:middle;padding:2%">
+```
+docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda10.0-cudnn7
+nvidia-docker run -it --name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda10.0-cudnn7 /bin/bash
+```
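
Once inside the container, you can sanity-check that the framework sees the GPUs. This is a minimal sketch, assuming the image ships PaddlePaddle 1.x, whose `fluid.core` exposes the CUDA device count:

```
# Minimal GPU visibility check (an assumption: the image ships
# PaddlePaddle 1.x with the fluid API).
import paddle.fluid as fluid
print(fluid.core.get_cuda_device_count())
```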
+# EDL Applications:

-While many hardware and software manufacturers are working on improving the running time of deep learning jobs, EDL optimizes
+<p align="center">
+    <img src="doc/distill.gif" width="700">
+</p>

-1. the global utilization of the cluster, and
-1. the waiting time of job submitters.
+## Quick Start
+- [Run EDL distillation training demo on Kubernetes or a single node](./example/distill/README.md)

-## Key Features:
-- Efficiency: provides parallelism strategies that minimize adjustment overheads.
-- Consistency: accuracy verified on multiple models against runs without scaling.
-- Flexibility: any component can be killed, or can join, at any time.
-- Easy to use: only a few lines of code need to be added to support EDL.
+# EDL Framework
+## How to change a normal train program into an EDL train program
+The main change is that you should call `load_checkpoint` at the beginning of training and `save_checkpoint` at the end of every epoch, and the checkpoint should be on a distributed file system such as HDFS so that all trainers can download from it. A complete example is [here](https://github.com/elasticdeeplearning/edl/tree/develop/example/collective/resnet50).

-## Quick start demo: EDL Resnet50 experiments on a single machine:
-We highly **recommend** you run it in our docker:
+```
+# Connect to the distributed file system that stores the checkpoints.
+fs = HDFSClient(args.hdfs_name, args.hdfs_ugi, 20 * 60 * 1000, 3 * 1000)
+
+# Resume from the latest checkpoint if one exists.
+train_status = TrainStatus()
+tmp_s = fleet.load_checkpoint(exe, args.checkpoint, fs=fs, trainer_id=trainer_id)
+if tmp_s is not None:
+    train_status = tmp_s
+
+for pass_id in range(train_status.next(), params["num_epochs"]):
+    train()
+
+    # Only trainer 0 saves the checkpoint at the end of each epoch.
+    if trainer_id == 0:
+        saved_status = TrainStatus(pass_id)
+        fleet.save_checkpoint(exe, train_status=saved_status,
+                              path=args.checkpoint, fs=fs)
+```
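
The snippet above relies on a small contract from `TrainStatus`: it records the last completed epoch, and `next()` returns the epoch to resume from. A hypothetical sketch of that contract follows; the real `TrainStatus` ships with PaddlePaddle's fleet API and may differ in detail:

```
# Hypothetical sketch of the TrainStatus contract assumed by the loop
# above; the real class comes from PaddlePaddle's fleet API.
class TrainStatus:
    def __init__(self, pass_id=-1):
        # The last epoch that finished and was checkpointed;
        # -1 means no checkpoint has been written yet.
        self.pass_id = pass_id

    def next(self):
        # The epoch to resume from: one past the last completed epoch,
        # so a fresh run starts at epoch 0.
        return self.pass_id + 1
```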

-1. Start a Jobserver on one node.
+## Quickstart
+### EDL Resnet50 experiments on a single machine in docker:
+
+1. Start a JobServer on one node; it generates the scripts that change the number of workers.

```
-docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda10.0-cudnn7
-cd example/demo/collective
+cd example/demo/collective
./start_job_server.sh
```

-2. Start a Jobclient which controls the worker process.
+1. Start a Jobclient, which controls the worker process.

```
# Set the ImageNet data path
@@ -50,32 +71,12 @@ mkdir -p resnet50_pod
./start_job_client.sh
```

-3. Experiments result
+1. Experiment results

| total batch size | acc1 | acc5 |
| :-----: | ----: | ----: |
-| 1024 | 76.0 | 75.8 |
-
-## Design Docs
-- A scheduler on Kubernetes:
-  - [Scheduler](./doc/edl_design_doc.md)
-- EDL framework on PaddlePaddle:
-  - [Fault-Tolerant Training in PaddlePaddle](./doc/fault_tolerance.md)
-  - [EDL framework](./doc/edl_collective_design_doc.md)
-
-## Applications:
-
-<p align="center">
-    <img src="doc/distill.gif" width="700">
-</p>
+| 1024 | 75.5 | 92.8 |

-- EDL Distillation:
-  - [EDL Distillation design](./doc/edl_distill_design_doc.md)
-  - [Run EDL distillation training demo on Kubernetes or a single node](./example/distill/README.md)
-  - [EDL Distillation performance: Resnet50](./doc/experiment/distill_resnet50.md)
-- EDL CTR
-  - [EDL CTR training and deployment on Baidu Cloud](./example/ctr/deploy_ctr_on_baidu_cloud_cn.rst)

## FAQ
