
Spotnik: Designing Distributed Machine Learning for Transient Cloud Resources

Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch
Large-Scale Data & Systems (LSDS) Group
Imperial College London

Distributed ML

Train a machine learning model across several workers: each worker computes gradient updates (Δ) on its share of the data, and the workers exchange these updates to learn a single shared model.

[Figure: three workers (worker 0, worker 1, worker 2) exchanging gradient updates Δ — data parallelism synchronised with ring all-reduce]
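To make the ring all-reduce step concrete, below is a minimal single-process NumPy simulation. It is a sketch of the general technique only: the function name ring_allreduce, the chunking scheme and the sequential loops are our illustration, not KungFu's implementation.

import numpy as np

def ring_allreduce(worker_grads):
    """Every worker ends up with the element-wise average of all gradients,
    having exchanged only 2*(N-1) chunks instead of N-1 full gradients."""
    n = len(worker_grads)
    # chunks[i][c] is chunk c held by worker i.
    chunks = [np.array_split(g.astype(np.float64), n) for g in worker_grads]

    # Phase 1: reduce-scatter. After N-1 steps, worker (c-1) % n holds the
    # full sum of chunk c.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n            # chunk worker i forwards this step
            dst = (i + 1) % n
            chunks[dst][c] = chunks[dst][c] + chunks[i][c]

    # Phase 2: all-gather. The fully reduced chunks travel once more around
    # the ring so that every worker receives every reduced chunk.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            dst = (i + 1) % n
            chunks[dst][c] = chunks[i][c]

    return [np.concatenate(chunks[i]) / n for i in range(n)]

# Example: three workers, each with its own local gradient.
grads = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0]), np.array([7.0, 8.0, 9.0])]
print(ring_allreduce(grads))              # each worker now holds [4.0, 5.0, 6.0]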

Challenges of distributed ML

• Distributed ML is resource-hungry

• Accelerated resources are expensive

Example: Megatron-LM [3]

• Training of a BERT-like model

• 512 V100 GPUs

• One epoch (68,507 iterations) takes 2.1 days

• Cost on Azure: $92,613

[3] Shoeybi, Mohammad, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019

Transient cloud resources

• Examples: AWS Spot Instances, Azure Spot VMs

• Prices follow a free market: spare capacity is sold cheaply and reclaimed on demand

• Revocations

• Notifications

• Economic incentive

• Offers a cost reduction of up to 90% [1]

A Megatron-LM epoch would drop from $92,613 to $15,152

[1] https://azure.microsoft.com/en-us/pricing/spot/
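As a quick check, the two quoted prices imply a discount of roughly 84%, consistent with the "up to 90%" bound:

# Discount implied by the slide's figures for one Megatron-LM epoch on Azure.
on_demand, spot = 92_613, 15_152
print(f"savings: {1 - spot / on_demand:.1%}")   # -> savings: 83.6%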

Implications of transient resources

• New workers become available or old workers get revoked

➙ System must cope with disappearing resources

• Changes can happen at any time

➙ System must ensure consistency of updates

• Cluster sizes are unknown beforehand

➙ System must adapt to different conditions

[Figure: cluster size varies over time as workers (worker 0 to worker 4) join and get revoked]

[Figure: synchronisation strategies (HogWild!, AD-PSGD, EA-SGD, SMA, S-SGD) occupy different points in the trade-off between network efficiency and model accuracy]

Current approach: Checkpoint & recovery

• TensorFlow and PyTorch

• Changes to the cluster are not considered

• Recovery takes about 20 seconds with ResNet50 and ImageNet

[Figure: training timeline — training is periodically checkpointed; after a revocation the job recovers from the last checkpoint and resumes training]
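For illustration, a minimal sketch of checkpoint & recovery with TensorFlow's tf.train.Checkpoint API is shown below; the checkpoint directory, save frequency and ResNet50 model choice are assumptions for the sketch, not details taken from the slides.

import tensorflow as tf

# Periodic checkpointing of the full training state (model + optimizer).
model = tf.keras.applications.ResNet50(weights=None)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory="/tmp/spotnik-ckpt", max_to_keep=3)

# During training: save at regular intervals.
manager.save()

# After a revocation: restart the job and reload the last checkpoint.
# With ResNet50/ImageNet this recovery takes about 20 seconds (per the slide).
ckpt.restore(manager.latest_checkpoint)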

Current approaches: Hybrid

• Mix dedicated resources with transient resources

• Proteus [2]: placement of the parameter server on dedicated resources so that the training state is safe

[Figure: parameter server on dedicated resources; Worker 0, Worker 1 and Worker 2 on transient resources]

[2] Harlap et al. Proteus: agile ML elasticity through tiered reliability in dynamic resource markets. EuroSys, 2017

Spotnik

Challenges ➙ Solutions

• Workers become available or get revoked ➙ Reuse communication channels for synchronisation to repair the cluster

• Changes can happen at any time ➙ Ensure atomic model updates by waiting for all synchronisations to finish first

• Cluster sizes are unknown beforehand ➙ Change the synchronisation strategy based on the number of workers

Revocation recovery algorithm

• Handle revocations within a ring topology

• The total number of messages is bounded by O(N ⋅ K)

• K is the number of simultaneous revocations

• N is the number of workers

➙ Scales to many transient resources

• No reliance on revocation notifications

Revocation recovery algorithm: repairing a broken all-reduce ring

Example: six workers W0–W5 form an all-reduce ring. Each worker keeps its own membership list; the table on each slide shows how these lists evolve during the repair.

1. Initially, every worker holds the full membership [0, 1, 2, 3, 4, 5].

2. Revocation: W1 is revoked.

3. Reconcile & bypass: W1's neighbours W0 and W2 drop it from their lists ([0, 2, 3, 4, 5]) and bypass it by connecting directly to each other.

4. Reconcile: the updated membership propagates along the ring; W3 adopts [0, 2, 3, 4, 5].

5. Revocation: while the repair is still in progress, W4 is also revoked.

6. Reconcile & bypass: W3 and W5 drop W4 ([0, 2, 3, 5]) and bypass it.

7. Reconcile: the shrunken membership continues to propagate; W0 and then W2 adopt [0, 2, 3, 5], so all surviving workers agree.

8. Accept: the surviving workers W0, W2, W3 and W5 accept the new membership [0, 2, 3, 5].

9. Restart: training restarts on the repaired four-worker ring.
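A minimal single-process Python sketch of this repair loop is shown below. The function repair_ring, the views dict and the iterative reconciliation loop are our own illustration of the membership reconciliation; the real algorithm runs peer-to-peer over the existing all-reduce channels and stays within the O(N ⋅ K) message bound.

def repair_ring(workers, revoked):
    """Reconcile per-worker membership views after revocations and return the
    membership of the repaired ring once all surviving views agree."""
    views = {w: list(workers) for w in workers if w not in revoked}
    survivors = sorted(views)

    # Each pass simulates one round of pairwise reconciliation around the ring:
    # a worker merges its view with its successor's, dropping any worker known
    # to be revoked (the "bypass" step), until all views agree.
    agreed = False
    while not agreed:
        agreed = True
        for i, w in enumerate(survivors):
            nxt = survivors[(i + 1) % len(survivors)]
            merged = [m for m in views[w] if m in views[nxt] and m not in revoked]
            if merged != views[w] or merged != views[nxt]:
                views[w] = merged
                views[nxt] = list(merged)
                agreed = False

    # Accept + restart: all survivors share the same list and rebuild the ring on it.
    return views[survivors[0]]

print(repair_ring([0, 1, 2, 3, 4, 5], revoked={1, 4}))   # -> [0, 2, 3, 5]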


Atomic worker state update

• Pipelined synchronisation: the model parameters are synchronised as a pipeline of per-parameter Sync. and Update stages

• Revocations can happen meanwhile

➙ A partial update leads to inconsistency

• Atomicity: wait for all synchronisation communications to finish (barrier)

• Discard the updates if a communication fails

[Figure: pipeline of Parameter → Sync. → Update stages, with a barrier inserted before the updates are applied]
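Below is a minimal Python sketch of this barrier-and-discard rule, assuming hypothetical helpers (all_reduce, RevokedError, apply_fn) rather than KungFu's real API: the per-parameter synchronisations are launched as a pipeline, a barrier waits for all of them, and the whole round is discarded if any synchronisation fails.

from concurrent.futures import ThreadPoolExecutor

class RevokedError(Exception):
    """Raised by a synchronisation when a peer disappears mid-round."""

def all_reduce(grad):
    # Stand-in for the real collective operation; it may raise RevokedError.
    return grad

def atomic_update(model_vars, grads, apply_fn):
    """Apply a round of gradient updates only if every synchronisation succeeds."""
    with ThreadPoolExecutor() as pool:
        # Pipelined synchronisation: one collective per parameter, in flight together.
        futures = [pool.submit(all_reduce, g) for g in grads]
        try:
            # Barrier: wait for all synchronisations to finish before updating.
            synced = [f.result() for f in futures]
        except RevokedError:
            return False          # discard the partial round; the model stays consistent
    for var, grad in zip(model_vars, synced):
        apply_fn(var, grad)       # apply only complete rounds
    return True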


Adaptive synchronisation strategies

• Support a range of synchronisation primitives

• Collective and point-to-point synchronisation

• Monitor a metric

• Number of workers

• Define a policy in the beginning

• When to choose which sync strategy

[Figure: a 3-worker cluster (W0–W2) and a 6-worker cluster (W0–W5) using different synchronisation strategies, AD-PSGD and S-SGD]
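A small sketch of such a policy follows. The threshold, the direction of the switch and the function names are assumptions for illustration; in Spotnik the user defines the actual policy up front and the monitored metric is the current worker count.

# Illustrative adaptation policy: choose the synchronisation strategy from the
# monitored worker count. The threshold and the S-SGD/AD-PSGD mapping are
# assumptions for this sketch; the right choice is workload-dependent.
SWITCH_THRESHOLD = 8

def choose_strategy(num_workers):
    """Collective S-SGD for small clusters, point-to-point AD-PSGD otherwise."""
    return "S-SGD" if num_workers <= SWITCH_THRESHOLD else "AD-PSGD"

def on_membership_change(workers, current_strategy):
    """Called after every cluster repair; switches the strategy if the policy says so."""
    desired = choose_strategy(len(workers))
    if desired != current_strategy:
        print(f"cluster now has {len(workers)} workers: {current_strategy} -> {desired}")
    return desired

strategy = on_membership_change([f"W{i}" for i in range(16)], "S-SGD")   # -> "AD-PSGD"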

Evaluation: How does the recovery latency change with an increasing number of revocations?

Cluster: 16 workers
Hardware: Azure NC6 VMs, NVIDIA K80
Software: KungFu 0.2.1, TensorFlow 1.15
ML: ResNet50, ImageNet

No significant increase in recovery latency as the number of revocations increases

Evaluation: How much does the training slow down if we use atomic worker state updates?

Cluster: 32 workers
Hardware: Azure NC6 VMs, NVIDIA K80
Software: KungFu 0.2.1, TensorFlow 1.15

*Different setup:
Cluster: 16 workers
Hardware: Huawei ModelArts, NVIDIA V100, InfiniBand
Software: KungFu 0.2.1, TensorFlow 1.12

Throughput decrease is small

Evaluation: How does the throughput change if the cluster changes?

Cluster: up to 32 workers
Hardware: Azure NC6 VMs, NVIDIA K80
Software: KungFu 0.2.1, TensorFlow 1.15
ML: ResNet50, ImageNet

[Figure: training throughput vs. cluster size (5 to 30 workers), annotated with the point where the synchronisation strategy switches from S-SGD to AD-PSGD]

Changing clusters need adaptation

Conclusion

• Transient cloud resources offer the potential to save money for ML training

• No existing system runs exclusively on transient resources with low overhead

Spotnik:

• Repair the cluster with low overhead

• Ensure consistent model updates

• Adapt to changes of the cluster

KungFu: github.com/lsds/KungFu

Website: lsds.doc.ic.ac.uk | E-mail: marcel.wagenlander19@imperial.ac.uk | Twitter: @marwage