Spotnik: Designing Distributed Machine Learning for Transient Cloud Resources
Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch
Large-Scale Data & Systems (LSDS) Group, Imperial College London
USENIX HotCloud '20 (slide transcript)
Distributed ML
Train a machine learning model across several workers (worker 0, worker 1, worker 2). Each worker computes gradient updates (Δ), and the workers exchange these updates to learn a shared model.
• Data parallelism: each worker trains a replica of the model on a partition of the data
• Ring all-reduce: workers exchange and aggregate their gradient updates along a ring topology
[Figure: three workers exchanging gradient updates Δ]
Challenges of distributed ML
• Distributed ML is resource-hungry
• Accelerated resources are expensive
Example: Megatron-LM³
• Training of a BERT-like model
• 512 V100 GPUs
• One epoch (68,507 iterations) takes 2.1 days
• Cost on Azure: $92,613
³ Shoeybi et al. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2019
Transient cloud resources
• Examples: AWS Spot Instances, Azure Spot VMs
• Prices follow the law of a free market
• Revocations
• Notifications
• Economic incentive
• Offers a cost reduction of up to 90%¹
A Megatron-LM epoch would drop from $92,613 to $15,152
¹ https://azure.microsoft.com/en-us/pricing/spot/
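As a quick sanity check of the figures above (plain arithmetic on the quoted prices):

```python
# Cost of one Megatron-LM epoch on dedicated vs. spot capacity,
# using the dollar figures from the slide.
dedicated = 92_613
spot = 15_152
saving = 1 - spot / dedicated
print(f"{saving:.0%}")   # about 84%, within the "up to 90%" spot discount
```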
Implications of transient resources
• New workers become available or old workers get revoked
➙ System must cope with disappearing resources
• Changes can happen at any time
➙ System must ensure consistency of updates
• Cluster sizes are unknown beforehand
➙ System must adapt to different conditions
[Figure: cluster size changing over time as workers 0–4 join and leave]
[Figure: synchronisation strategies (HogWild!, AD-PSGD, EA-SGD, SMA, S-SGD) arranged on a trade-off between network efficiency and model accuracy]
Current approach: Checkpoint & recovery
• TensorFlow and PyTorch
• Changes to the cluster are not considered
• Recovery takes about 20 seconds with ResNet50 and ImageNet
[Figure: timeline of training interrupted by checkpointing, recovery, and renewed training]
Current approaches: Hybrid
• Mix dedicated resources with transient resources
• Proteus²: place the parameter server on dedicated resources so that the training state is safe
[Figure: parameter server on dedicated resources; workers 0–2 on transient resources]
² Harlap et al. Proteus: agile ML elasticity through tiered reliability in dynamic resource markets. EuroSys, 2017
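The hybrid pattern above can be illustrated with a toy parameter server. All names here are illustrative, not Proteus's API: the point is that the server owns the model state on a reliable node, so revoking any transient worker loses no state.

```python
class ParameterServer:
    """Toy parameter server holding the authoritative model state.

    It runs on a dedicated (non-revocable) node, while workers run
    on transient VMs and only push/pull state.
    """
    def __init__(self, params):
        self.params = list(params)

    def push(self, deltas):
        # A transient worker pushes its gradient update.
        self.params = [p + d for p, d in zip(self.params, deltas)]

    def pull(self):
        # A worker pulls the latest model before its next step.
        return list(self.params)

ps = ParameterServer([0.0, 0.0])           # lives on a dedicated VM
for delta in ([1.0, 2.0], [3.0, 4.0]):     # pushes from two transient workers
    ps.push(delta)
print(ps.pull())   # [4.0, 6.0] -- the state survives any worker revocation
```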
Spotnik
Challenges ➙ Solutions
• Workers become available or get revoked ➙ Reuse communication channels for synchronisation to repair the cluster
• Changes can happen at any time ➙ Ensure atomic model updates by waiting for all synchronisations to finish first
• Cluster sizes are unknown beforehand ➙ Change the synchronisation strategy based on the number of workers
Revocation recovery algorithm
• Handles revocations within a ring topology
• The total number of messages is bounded by O(N ⋅ K)
  • K is the number of simultaneous revocations
  • N is the number of workers
➙ Scales to many transient resources
• No reliance on revocation notifications
Revocation recovery algorithm: repairing a broken all-reduce ring
Example with six workers, W0–W5, arranged in a ring. Each worker keeps a membership list; initially every worker holds [0, 1, 2, 3, 4, 5].
1. Revocation: W1 is revoked and disappears from the ring.
2. Bypass: W0 and W2 detect the broken link, bypass W1 and update their membership to [0, 2, 3, 4, 5].
3. Reconcile: the new membership propagates along the ring to W3.
4. Revocation: W4 is revoked as well.
5. Bypass: W3 and W5 bypass W4 and update their membership to [0, 2, 3, 5].
6. Reconcile: the membership [0, 2, 3, 5] propagates further until W0 and W2 hold it too.
7. Accept: all surviving workers agree on the membership [0, 2, 3, 5] and accept it.
8. Restart: the all-reduce ring restarts with workers W0, W2, W3 and W5.
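The repair sequence above can be condensed into a small simulation. `bypass` and `reconcile` are illustrative names for this sketch, not Spotnik's actual API:

```python
# Toy simulation of the ring-repair walkthrough: each worker holds a
# membership list; when a neighbour is revoked, the detecting workers
# bypass it and the new membership is reconciled around the ring.

def bypass(membership, revoked):
    """Detecting workers drop the revoked worker from their view."""
    return [w for w in membership if w != revoked]

def reconcile(views):
    """Propagate memberships until all live workers agree on the smallest view."""
    agreed = min(views.values(), key=len)
    return {w: list(agreed) for w in views}

# Initial ring of six workers; everyone sees everyone.
views = {w: [0, 1, 2, 3, 4, 5] for w in range(6)}

for revoked in (1, 4):           # W1 and then W4 get revoked
    del views[revoked]           # the revoked worker disappears
    for w in views:              # its neighbours bypass it ...
        views[w] = bypass(views[w], revoked)
    views = reconcile(views)     # ... and the views are reconciled

print(sorted(views))   # surviving workers: [0, 2, 3, 5]
print(views[0])        # agreed membership: [0, 2, 3, 5]
```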
Atomic worker state update
• Pipelined synchronisation: parameter synchronisations overlap with model updates
• Revocations can happen meanwhile
➙ A partial update leads to inconsistency
• Atomicity: wait for all synchronisation communications to finish at a barrier
• Discard updates if communication fails
[Figure: pipeline of parameter synchronisations and updates, with a barrier before the updates are applied]
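The barrier-then-apply idea can be sketched as follows; `synchronise` and `atomic_update` are hypothetical names standing in for the real communication layer:

```python
# Sketch of atomic state updates: apply the model update only after
# every parameter synchronisation has finished; if any communication
# fails (e.g. a worker is revoked mid-round), discard the whole round.
import concurrent.futures

def synchronise(partition, fail=False):
    """Stand-in for synchronising one parameter partition."""
    if fail:
        raise ConnectionError("worker revoked mid-synchronisation")
    return [p + 1 for p in partition]    # pretend-averaged parameters

def atomic_update(model, failures=()):
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(synchronise, part, name in failures)
                   for name, part in model.items()}
        try:
            # Barrier: gather every result before touching the model.
            synced = {name: f.result() for name, f in futures.items()}
        except ConnectionError:
            return model                 # discard the partial round
    model.update(synced)                 # all succeeded: apply atomically
    return model

model = {"layer1": [0, 0], "layer2": [1, 1]}
atomic_update(model)                       # full round: applied
atomic_update(model, failures={"layer2"})  # failed round: discarded
print(model)  # {'layer1': [1, 1], 'layer2': [2, 2]}
```

Because the model is only touched after the barrier, a revocation can never leave some parameters updated and others stale.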
Adaptive synchronisation strategies
• Support a range of synchronisation primitives
  • Collective and point-to-point synchronisation
• Monitor a metric
  • Number of workers
• Define a policy in the beginning
  • When to choose which sync strategy
[Figure: a 3-worker cluster and a 6-worker cluster using different strategies (AD-PSGD vs. S-SGD)]
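Such a policy can be as simple as a function of the monitored metric. The threshold and the direction of the switch here are assumptions for the sketch (the evaluation below switches from S-SGD to AD-PSGD as the cluster grows), not Spotnik's exact policy:

```python
def choose_strategy(num_workers, threshold=16):
    """Pick a synchronisation strategy from the current worker count.

    The threshold of 16 is an assumed value for illustration.
    """
    if num_workers <= threshold:
        return "S-SGD"     # synchronous all-reduce for small clusters
    return "AD-PSGD"       # asynchronous decentralised SGD for large ones

print(choose_strategy(8), choose_strategy(32))   # S-SGD AD-PSGD
```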
Evaluation: How does the recovery latency change with an increasing number of revocations?
Setup: cluster of 16 workers; Azure NC6 VMs with Nvidia K80 GPUs; KungFu 0.2.1 and TensorFlow 1.15; ResNet50 on ImageNet
➙ No significant increase in recovery latency as the number of revocations increases
Evaluation: How much does training slow down if we use atomic worker state updates?
Setup: cluster of 32 workers; Azure NC6 VMs with Nvidia K80 GPUs; KungFu 0.2.1 and TensorFlow 1.15
Second setup*: cluster of 16 workers; Huawei ModelArts with Nvidia V100 GPUs and InfiniBand; KungFu 0.2.1 and TensorFlow 1.12
➙ The throughput decrease is small
Evaluation: How does the throughput change if the cluster changes?
Setup: up to 32 workers; Azure NC6 VMs with Nvidia K80 GPUs; KungFu 0.2.1 and TensorFlow 1.15; ResNet50 on ImageNet
[Figure: throughput across cluster sizes from 5 to 30 workers, switching from S-SGD to AD-PSGD]
➙ Changing clusters need adaptation
Conclusion
• Transient cloud resources offer the potential to save money for ML training
• So far, no system runs exclusively on transient resources with low overhead
Spotnik
• Repairs the cluster with low overhead
• Ensures consistent model updates
• Adapts to changes of the cluster
KungFu: github.com/lsds/KungFu
Website: lsds.doc.ic.ac.uk | E-mail: [email protected] | Twitter: @marwage