EDL is an Elastic Deep Learning framework designed to help deep learning cloud service providers build cluster cloud services using the deep learning framework PaddlePaddle.
+# Motivation
+Elastic Deep Learning (EDL) is a framework that can dynamically adjust the parallelism (the number of training workers) of deep neural network training. It supports multi-tenant cluster management, balancing job completion time against job waiting time and making use of otherwise idle resources.

-EDL includes two parts:
+This project contains the EDL framework and its applications, such as distillation and NAS.

-1. A Kubernetes controller for the elastic scheduling of distributed
-deep learning jobs and tools for adjusting manually.
+EDL is now an incubation-stage project of the [LF AI Foundation](https://lfai.foundation).

-1. Making PaddlePaddle a fault-tolerable deep learning framework with usability API for job management.
nvidia-docker run --name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda10.0-cudnn7 /bin/bash
+```
+
+# EDL Applications:

-While many hardware and software manufacturers are working on
-improving the running time of deep learning jobs, EDL optimizes
+<p align="center">
+<img src="doc/distill.gif" width="700">
+</p>

-1. the global utilization of the cluster, and
-1. the waiting time of job submitters.
+## Quick Start
+- [Run EDL distillation training demo on Kubernetes or a single node](./example/distill/README.md)

-## Key Features:
-- Efficiency: Provides parallelism strategies to minimize adjustment overheads.
-- Consistency: Accuracy verification on multiple models compared those without scaling.
-- Flexibility: Any components can be killed or joined at any time.
-- Easy to use: Few lines of code need to be added to support EDL.
+# EDL Framework
+## How to convert a normal training program into an EDL training program
+The main change is that the program should `load_checkpoint` at the beginning of training and `save_checkpoint` at the end of every epoch. The checkpoint should live on a distributed file system such as HDFS so that all trainers can download it. A complete example is [here](https://github.com/elasticdeeplearning/edl/tree/develop/example/collective/resnet50).
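
The load-then-save pattern described above can be sketched in a few lines. This is a minimal illustration, not EDL's or PaddlePaddle's API: the helpers `load_checkpoint`, `save_checkpoint`, and `train` are hypothetical stand-ins named after the README's wording, the "model" is a single list of numbers, and a local directory stands in for a distributed file system such as HDFS.

```python
import json
import os
import tempfile

# Hypothetical helpers for illustration only; they are NOT the EDL API.
CKPT_NAME = "checkpoint.json"


def save_checkpoint(ckpt_dir, state):
    """Write the training state atomically so readers never see a partial file."""
    os.makedirs(ckpt_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace swaps the finished file into place in one step.
    os.replace(tmp_path, os.path.join(ckpt_dir, CKPT_NAME))


def load_checkpoint(ckpt_dir):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    path = os.path.join(ckpt_dir, CKPT_NAME)
    if not os.path.exists(path):
        return {"epoch": -1, "weights": [0.0]}
    with open(path) as f:
        return json.load(f)


def train(ckpt_dir, num_epochs):
    # Load once at the beginning, then resume from the epoch after the saved one.
    state = load_checkpoint(ckpt_dir)
    for epoch in range(state["epoch"] + 1, num_epochs):
        # Placeholder for a real training epoch.
        state["weights"] = [w + 1.0 for w in state["weights"]]
        state["epoch"] = epoch
        # Save at the end of every epoch so a restarted trainer can pick up here.
        save_checkpoint(ckpt_dir, state)
    return state
```

Calling `train` a second time with the same directory resumes from the saved epoch rather than starting over, which is the property an elastic job relies on when trainers are killed or added. In a real job the directory would be a path on shared storage that every trainer can reach.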

-## Quick start demo: EDL Resnet50 experiments on a single machine: