hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo...

40
Marcel W agenl änder, Luo Mai, Guo Li, Peter Pietzuch L arge-Scale Data & Systems (LSDS) Group Imperial College London Spotnik Designing Distributed Machine Le arning for Tr ansient Cloud Resources

Transcript of hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo...

Page 1: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group

Imperial College London

SpotnikDesigning Distributed Machine Learning for Transient Cloud

Resources

Page 2: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Distributed ML

2

Train a machine learning model

worker 0 worker 1

worker 2

Δ Δ

Δ

Page 3: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Distributed ML

3

Train a machine learning model

worker 0 worker 1

worker 2

Δ Δ

ΔLearn a model

Page 4: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Distributed ML

4

Train a machine learning model

worker 0 worker 1

worker 2

Δ Δ

Δ

Data parallelism

Page 5: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Distributed ML

5

Train a machine learning model

worker 0 worker 1

worker 2

Δ Δ

Δ

Ring all-reduce

Page 6: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Challenges of distributed ML

• Distributed ML is resource-hungry

• Accelerated resources are expensive

Example Megatron-LM3

• Training of BERT-like model

• 512 V100 GPUs

• One epoch (68,507 iterations) takes 2.1 days

• Cost on Azure: $92,613

63Shoeybi, Mohammad, et al. Megatron-LM: Training multi-billion parameter language models using gpu model parallelism, 2017

Page 7: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Transient cloud resources

1https://azure.microsoft.com/en-us/pricing/spot/7

• Examples: AWS Spot instances, Azure Spot VMs

• Follows the law of a free market

• Revocations

• Notifications

• Economic incentive

• Offers a cost reduction of up to 90%1

A Megatron-LM epoch would drop from $92,613 to $15,152

worker 2

90%

Page 8: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Implications of transient resources

• New workers become available or old workers get revoked

➙ System must cope with disappearing resources

• Changes can happen at any time

➙ System must ensure consistency of updates

8

worker 0

worker 1

worker 2

worker 3

worker 4

Cluster

Size

Page 9: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Implications of transient resources

• New workers become available or old workers get revoked

➙ System must cope with disappearing resources

• Changes can happen at any time

➙ System must ensure consistency of updates

• Cluster sizes are unknown beforehand

➙ System must adapt to different conditions

9

HogWild!

AD-PSGD

EA-SGD

SMAS-SGD

Network efficiency

Model accuracy

Page 10: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Current approach: Checkpoint & recovery

• Tensorflow and Pytorch

• Changes to the cluster are not considered

• Recovery takes about 20 seconds with ResNet50 and ImageNet

Recovery Training

10

CheckpointTraining

Page 11: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

• Mix dedicated resources with transient resources

• Proteus2: Placement of parameter server on dedicated resources so that the training state is save

Parameter server

Worker 0 Worker 1 Worker 2

Dedicated resources

Transient resources

2Harlap et al. Proteus: agile ML elasticity through tiered reliability in dynamic re- source markets. EuroSys, 201711

Current approaches: Hybrid

Page 12: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Spotnik

12

Challenges Solutions

Workers become available or get revoked

Reuse communication channels for synchronisation to repair the cluster

Changes can happen at any time

Ensure atomic model updates by waiting for all synchronisations to finish first

Cluster sizes are unknown beforehand

Change the synchronisation strategy based on the number of workers

Page 13: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Spotnik

13

Challenges Solutions

Workers become available or get revoked

Reuse communication channels for synchronisation to repair the cluster

Changes can happen at any time

Ensure atomic model updates by waiting for all synchronisations to finish first

Cluster sizes are unknown beforehand

Change the synchronisation strategy based on the number of workers

Page 14: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Revocation recovery algorithm

• Handle revocations within a ring topology

• Number of total messages is bounded by messages

• K is the number of simultaneous revocations

• N is the number of workers

➙ Scale to many transient resources

• No reliance on revocation notifications

O(N ⋅ K)

14

Page 15: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Revocation recovery algorithmRepairing a broken all-reduce ring

W0

W1

W2

W3

W4

W5

Worker Membership

0 [0, 1, 2, 3, 4, 5]

1 [0, 1, 2, 3, 4, 5]

2 [0, 1, 2, 3, 4, 5]

3 [0, 1, 2, 3, 4, 5]

4 [0, 1, 2, 3, 4, 5]

5 [0, 1, 2, 3, 4, 5]

15

Page 16: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Repairing a broken all-reduce ring

W0

W1

W2

W3

W4

W5

Worker Membership

0 [0, 1, 2, 3, 4, 5]

2 [0, 1, 2, 3, 4, 5]

3 [0, 1, 2, 3, 4, 5]

4 [0, 1, 2, 3, 4, 5]

5 [0, 1, 2, 3, 4, 5]

16

ReconcileRevocation

Revocation recovery algorithm

Page 17: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Repairing a broken all-reduce ring

W0

W1

W2

W3

W4

W5

Worker Membership

0 [0, 2, 3, 4, 5]

2 [0, 2, 3, 4, 5]

3 [0, 1, 2, 3, 4, 5]

4 [0, 1, 2, 3, 4, 5]

5 [0, 1, 2, 3, 4, 5]

Reconcile

Bypass

Revocation recovery algorithm

17

Page 18: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Repairing a broken all-reduce ring

W0

W1

W2

W3

W4

W5

Worker Membership

0 [0, 2, 3, 4, 5]

2 [0, 2, 3, 4, 5]

3 [0, 2, 3, 4, 5]

4 [0, 1, 2, 3, 4, 5]

5 [0, 1, 2, 3, 4, 5]

18

Reconcile

Revocation recovery algorithm

Page 19: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Repairing a broken all-reduce ring

W0

W1

W2

W3

W4

W5

Worker Membership

0 [0, 2, 3, 4, 5]

2 [0, 2, 3, 4, 5]

3 [0, 2, 3, 4, 5]

4

5 [0, 1, 2, 3, 4, 5]

19

Reconcile

Revocation

Revocation recovery algorithm

Page 20: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Repairing a broken all-reduce ring

W0

W1

W2

W3

W4

W5

Worker Membership

0 [0, 2, 3, 4, 5]

2 [0, 2, 3, 4, 5]

3 [0, 2, 3, 5]

4

5 [0, 2, 3, 5]

20

Reconcile

Bypass

Revocation recovery algorithm

Page 21: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Repairing a broken all-reduce ring

W0

W1

W2

W3

W4

W5

Worker Membership

0 [0, 2, 3, 5]

2 [0, 2, 3, 4, 5]

3 [0, 2, 3, 5]

4

5 [0, 2, 3, 5]

21

Reconcile

Revocation recovery algorithm

Page 22: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Repairing a broken all-reduce ring

W0

W1

W2

W3

W4

W5

Worker Membership

0 [0, 2, 3, 5]

2 [0, 2, 3, 5]

3 [0, 2, 3, 5]

4

5 [0, 2, 3, 5]

22

Reconcile

Revocation recovery algorithm

Page 23: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Repairing a broken all-reduce ring

W0

W1

W2

W3

W4

W5

Worker Membership

0 [0, 2, 3, 5]

2 [0, 2, 3, 5]

3 [0, 2, 3, 5]

4

5 [0, 2, 3, 5]

23

Accept

Revocation recovery algorithm

Page 24: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Repairing a broken all-reduce ring

W0

W2

W3

W5

Worker Membership

0 [0, 2, 3, 5]

2 [0, 2, 3, 5]

3 [0, 2, 3, 5]

5 [0, 2, 3, 5]

24

Restart

Revocation recovery algorithm

Page 25: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Spotnik

25

Challenges Solutions

Workers become available or get revoked

Reuse communication channels for synchronisation to repair the cluster

Changes can happen at any time

Ensure atomic model updates by waiting for all synchronisations to finish first

Cluster sizes are unknown beforehand

Change the synchronisation strategy based on the number of workers

Page 26: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Atomic worker state update

• Pipelined synchronisation

2626

Parameter Parameter

Sync. Sync.

Parameter

Sync.

Update Update Update

Page 27: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Atomic worker state update

• Pipelined synchronisation

• Revocations can happen meanwhile

➙ Partial update leads to inconsistency

2727

Parameter Parameter

Sync. Sync.

Parameter

Sync.

Update Update Update

Page 28: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Atomic worker state update

• Atomicity: Wait for all synchronisation communications to finish

• Discard updates if communication fails

2828

Parameter Parameter

Sync. Sync.

Parameter

Sync.

Update Update Update

Barrier

Page 29: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Spotnik

29

Challenges Solutions

Workers become available or get revoked

Reuse communication channels for synchronisation to repair the cluster

Changes can happen at any time

Ensure atomic model updates by waiting for all synchronisations to finish first

Cluster sizes are unknown beforehand

Change the synchronisation strategy based on the number of workers

Page 30: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Adaptive synchronisation strategies

• Support a range of synchronisation primitives

• collective and point-to-point synchronisation

• Monitor a metric

• Number of workers

30

Page 31: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Adaptive synchronisation strategies• Support a range of synchronisation primitives

• collective and point-to-point synchronisation

• Monitor a metric

• Number of workers

• Define a policy in the beginning

• When to choose which sync strategy

31

W0

W1W2

3 workers

W0

W3

W2

6 workers

W1

W4

W5

AD-PSGD S-SGD

Page 32: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

EvaluationHow does the recovery latency change with increasing number of revocations?

32

Cluster • 16 workers Hardware • Azure NC6 VMs

• Nvidia K80 Software • KungFu 0.2.1 • Tensorflow 1.15 ML • ResNet50 • ImageNet

No significant increase of recovery latency if the number of revocation increases

Page 33: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

EvaluationHow much does the training slow down if we use atomic worker state updates?

33

Cluster • 32 workers Hardware • Azure NC6 VMs

• Nvidia K80 Software • KungFu 0.2.1 • Tensorflow 1.15

*different Setup Cluster • 16 workers Hardware • Huawei ModelArts

• Nvidia V100 • InfiniBand

Software • KungFu 0.2.1 • Tensorflow 1.12

Throughput decrease is small

Page 34: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

EvaluationHow does the throughput change, if the cluster changes?

34

Cluster • up to 32 workers Hardware • Azure NC6 VMs

• Nvidia K80 Software • KungFu 0.2.1 • Tensorflow 1.15 ML • ResNet50 • ImageNet

Cluster size5 10 15 20 25 30

Page 35: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

EvaluationHow does the throughput change, if the cluster changes?

35

Cluster • up to 32 workers Hardware • Azure NC6 VMs

• Nvidia K80 Software • KungFu 0.2.1 • Tensorflow 1.15 ML • ResNet50 • ImageNet

Cluster size5 10 15 20 25 30

Page 36: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

EvaluationHow does the throughput change, if the cluster changes?

Switching from S-SGDto AD-PSGD

36

Cluster • up to 32 workers Hardware • Azure NC6 VMs

• Nvidia K80 Software • KungFu 0.2.1 • Tensorflow 1.15 ML • ResNet50 • ImageNet

Changing clusters need adaptation

Cluster size5 10 15 20 25 30

Page 37: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Conclusion

37

• Transient cloud resources offer potential to save money for ML training

• No system that runs exclusively on transient resources and has low overhead

Page 38: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Conclusion

38

• Transient cloud resources offer potential to save money for ML training

• No system that runs exclusively on transient resources and has low overhead

Spotnik• Repair cluster with low overhead

• Ensure consistent model updates

• Adapt to changes of the cluster

Page 39: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Conclusion

39

• Transient cloud resources offer potential to save money for ML training

• No system that runs exclusively on transient resources and has low overhead

Spotnik• Repair cluster with low overhead

• Ensure consistent model updates

• Adapt to changes of the cluster

KungFu github.com/lsds/KungFu

Page 40: hotcloud20 slides wagenlaender - USENIX · hotcloud20 slides wagenlaender. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group. Imperial College

Conclusion

40

• Transient cloud resources offer potential to save money for ML training

• No system that runs exclusively on transient resources and has low overhead

• Repair cluster with low overhead

• Ensure consistent model updates

• Adapt to changes of the cluster

Website lsds.doc.ic.ac.uk | E-Mail [email protected] Twitter @marwage

SpotnikKungFu github.com/lsds/KungFu