Spotnik: Designing Distributed Machine Learning for Transient Cloud Resources
Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch
Large-Scale Data & Systems (LSDS) Group, Imperial College London
USENIX HotCloud '20 (slide transcript)
Distributed ML
Train a machine learning model across several workers (worker 0, worker 1, worker 2). Each worker computes gradient updates (Δ), and the workers exchange these updates to learn a shared model.
• Data parallelism: each worker trains a replica of the model on a partition of the data
• Ring all-reduce: workers exchange and aggregate their gradient updates along a ring topology
[Figure: three workers exchanging gradient updates Δ]
Challenges of distributed ML
• Distributed ML is resource-hungry
• Accelerated resources are expensive
Example: Megatron-LM³
• Training of a BERT-like model
• 512 V100 GPUs
• One epoch (68,507 iterations) takes 2.1 days
• Cost on Azure: $92,613
³ Shoeybi et al. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2019
Transient cloud resources
• Examples: AWS Spot Instances, Azure Spot VMs
• Prices follow the law of a free market
• Revocations
• Notifications
• Economic incentive
• Offers a cost reduction of up to 90%¹
A Megatron-LM epoch would drop from $92,613 to $15,152
¹ https://azure.microsoft.com/en-us/pricing/spot/
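As a quick sanity check of the figures above (plain arithmetic on the quoted prices):

```python
# Cost of one Megatron-LM epoch on dedicated vs. spot capacity,
# using the dollar figures from the slide.
dedicated = 92_613
spot = 15_152
saving = 1 - spot / dedicated
print(f"{saving:.0%}")   # about 84%, within the "up to 90%" spot discount
```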
Implications of transient resources
• New workers become available or old workers get revoked
➙ System must cope with disappearing resources
• Changes can happen at any time
➙ System must ensure consistency of updates
• Cluster sizes are unknown beforehand
➙ System must adapt to different conditions
[Figure: cluster size changing over time as workers 0–4 join and leave]
[Figure: synchronisation strategies (HogWild!, AD-PSGD, EA-SGD, SMA, S-SGD) arranged on a trade-off between network efficiency and model accuracy]
Current approach: Checkpoint & recovery
• TensorFlow and PyTorch
• Changes to the cluster are not considered
• Recovery takes about 20 seconds with ResNet50 and ImageNet
[Figure: timeline of training interrupted by checkpointing, recovery, and renewed training]
Current approaches: Hybrid
• Mix dedicated resources with transient resources
• Proteus²: place the parameter server on dedicated resources so that the training state is safe
[Figure: parameter server on dedicated resources; workers 0–2 on transient resources]
² Harlap et al. Proteus: agile ML elasticity through tiered reliability in dynamic resource markets. EuroSys, 2017
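The hybrid pattern above can be illustrated with a toy parameter server. All names here are illustrative, not Proteus's API: the point is that the server owns the model state on a reliable node, so revoking any transient worker loses no state.

```python
class ParameterServer:
    """Toy parameter server holding the authoritative model state.

    It runs on a dedicated (non-revocable) node, while workers run
    on transient VMs and only push/pull state.
    """
    def __init__(self, params):
        self.params = list(params)

    def push(self, deltas):
        # A transient worker pushes its gradient update.
        self.params = [p + d for p, d in zip(self.params, deltas)]

    def pull(self):
        # A worker pulls the latest model before its next step.
        return list(self.params)

ps = ParameterServer([0.0, 0.0])           # lives on a dedicated VM
for delta in ([1.0, 2.0], [3.0, 4.0]):     # pushes from two transient workers
    ps.push(delta)
print(ps.pull())   # [4.0, 6.0] -- the state survives any worker revocation
```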
Spotnik
Challenges ➙ Solutions
• Workers become available or get revoked ➙ Reuse communication channels for synchronisation to repair the cluster
• Changes can happen at any time ➙ Ensure atomic model updates by waiting for all synchronisations to finish first
• Cluster sizes are unknown beforehand ➙ Change the synchronisation strategy based on the number of workers
Revocation recovery algorithm
• Handles revocations within a ring topology
• The total number of messages is bounded by O(N ⋅ K)
  • K is the number of simultaneous revocations
  • N is the number of workers
➙ Scales to many transient resources
• No reliance on revocation notifications
Revocation recovery algorithm: repairing a broken all-reduce ring
Example with six workers, W0–W5, arranged in a ring. Each worker keeps a membership list; initially every worker holds [0, 1, 2, 3, 4, 5].
1. Revocation: W1 is revoked and disappears from the ring.
2. Bypass: W0 and W2 detect the broken link, bypass W1 and update their membership to [0, 2, 3, 4, 5].
3. Reconcile: the new membership propagates along the ring to W3.
4. Revocation: W4 is revoked as well.
5. Bypass: W3 and W5 bypass W4 and update their membership to [0, 2, 3, 5].
6. Reconcile: the membership [0, 2, 3, 5] propagates further until W0 and W2 hold it too.
7. Accept: all surviving workers agree on the membership [0, 2, 3, 5] and accept it.
8. Restart: the all-reduce ring restarts with workers W0, W2, W3 and W5.
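The repair sequence above can be condensed into a small simulation. `bypass` and `reconcile` are illustrative names for this sketch, not Spotnik's actual API:

```python
# Toy simulation of the ring-repair walkthrough: each worker holds a
# membership list; when a neighbour is revoked, the detecting workers
# bypass it and the new membership is reconciled around the ring.

def bypass(membership, revoked):
    """Detecting workers drop the revoked worker from their view."""
    return [w for w in membership if w != revoked]

def reconcile(views):
    """Propagate memberships until all live workers agree on the smallest view."""
    agreed = min(views.values(), key=len)
    return {w: list(agreed) for w in views}

# Initial ring of six workers; everyone sees everyone.
views = {w: [0, 1, 2, 3, 4, 5] for w in range(6)}

for revoked in (1, 4):           # W1 and then W4 get revoked
    del views[revoked]           # the revoked worker disappears
    for w in views:              # its neighbours bypass it ...
        views[w] = bypass(views[w], revoked)
    views = reconcile(views)     # ... and the views are reconciled

print(sorted(views))   # surviving workers: [0, 2, 3, 5]
print(views[0])        # agreed membership: [0, 2, 3, 5]
```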
Atomic worker state update
• Pipelined synchronisation: parameter synchronisations overlap with model updates
• Revocations can happen meanwhile
➙ A partial update leads to inconsistency
• Atomicity: wait for all synchronisation communications to finish at a barrier
• Discard updates if communication fails
[Figure: pipeline of parameter synchronisations and updates, with a barrier before the updates are applied]
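The barrier-then-apply idea can be sketched as follows; `synchronise` and `atomic_update` are hypothetical names standing in for the real communication layer:

```python
# Sketch of atomic state updates: apply the model update only after
# every parameter synchronisation has finished; if any communication
# fails (e.g. a worker is revoked mid-round), discard the whole round.
import concurrent.futures

def synchronise(partition, fail=False):
    """Stand-in for synchronising one parameter partition."""
    if fail:
        raise ConnectionError("worker revoked mid-synchronisation")
    return [p + 1 for p in partition]    # pretend-averaged parameters

def atomic_update(model, failures=()):
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(synchronise, part, name in failures)
                   for name, part in model.items()}
        try:
            # Barrier: gather every result before touching the model.
            synced = {name: f.result() for name, f in futures.items()}
        except ConnectionError:
            return model                 # discard the partial round
    model.update(synced)                 # all succeeded: apply atomically
    return model

model = {"layer1": [0, 0], "layer2": [1, 1]}
atomic_update(model)                       # full round: applied
atomic_update(model, failures={"layer2"})  # failed round: discarded
print(model)  # {'layer1': [1, 1], 'layer2': [2, 2]}
```

Because the model is only touched after the barrier, a revocation can never leave some parameters updated and others stale.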
Adaptive synchronisation strategies
• Support a range of synchronisation primitives
  • Collective and point-to-point synchronisation
• Monitor a metric
  • Number of workers
• Define a policy in the beginning
  • When to choose which sync strategy
[Figure: a 3-worker cluster and a 6-worker cluster using different strategies (AD-PSGD vs. S-SGD)]
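Such a policy can be as simple as a function of the monitored metric. The threshold and the direction of the switch here are assumptions for the sketch (the evaluation below switches from S-SGD to AD-PSGD as the cluster grows), not Spotnik's exact policy:

```python
def choose_strategy(num_workers, threshold=16):
    """Pick a synchronisation strategy from the current worker count.

    The threshold of 16 is an assumed value for illustration.
    """
    if num_workers <= threshold:
        return "S-SGD"     # synchronous all-reduce for small clusters
    return "AD-PSGD"       # asynchronous decentralised SGD for large ones

print(choose_strategy(8), choose_strategy(32))   # S-SGD AD-PSGD
```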
Evaluation: How does the recovery latency change with an increasing number of revocations?
Setup: cluster of 16 workers; Azure NC6 VMs with Nvidia K80 GPUs; KungFu 0.2.1 and TensorFlow 1.15; ResNet50 on ImageNet
➙ No significant increase in recovery latency as the number of revocations increases
Evaluation: How much does training slow down if we use atomic worker state updates?
Setup: cluster of 32 workers; Azure NC6 VMs with Nvidia K80 GPUs; KungFu 0.2.1 and TensorFlow 1.15
Second setup*: cluster of 16 workers; Huawei ModelArts with Nvidia V100 GPUs and InfiniBand; KungFu 0.2.1 and TensorFlow 1.12
➙ The throughput decrease is small
Evaluation: How does the throughput change if the cluster changes?
Setup: up to 32 workers; Azure NC6 VMs with Nvidia K80 GPUs; KungFu 0.2.1 and TensorFlow 1.15; ResNet50 on ImageNet
[Figure: throughput across cluster sizes from 5 to 30 workers, switching from S-SGD to AD-PSGD]
➙ Changing clusters need adaptation
Conclusion
• Transient cloud resources offer the potential to save money for ML training
• So far, no system runs exclusively on transient resources with low overhead
Spotnik
• Repairs the cluster with low overhead
• Ensures consistent model updates
• Adapts to changes of the cluster
KungFu: github.com/lsds/KungFu
Website: lsds.doc.ic.ac.uk | E-mail: [email protected] | Twitter: @marwage