Take a cluster orchestrator that supports both elastic, resource-adaptive jobs and heterogeneous clusters, add a throughput estimator into the mix, and you've got the overachiever that is Sia.
There are two problems in scheduling DL workloads: heterogeneity of clusters (different GPU types in the cluster) and rigidity vs. elasticity of jobs (job scaling as opposed to a static configuration of batch size and #GPUs). Prior art exists in both spaces, heterogeneity-aware schedulers and adaptivity-aware schedulers respectively, but none in the intersection of both.
Architecture of Sia
At a high level, they have a round-based scheduler that optimizes the assignment of jobs to GPUs every round. They assume pre-emption of DL training jobs via checkpoint-restore.
The other significant contribution is a throughput estimator. Ideally, they would need a throughput model for every possible (job, GPU type, batch size, #GPUs) configuration, which is costly to build. So, they have a more intelligent modelling framework.
Apart from that they have:
- Scaling policy:
- Start each new job with 1 GPU
- Scale up by 2x each round if possible
- Scheduling:
- Allotment: Decide number and type of resources
- Placement: decide which physical resources to actually use
- Avoid fragmentation of nodes
- Using the throughput model, solve an optimization problem for the best goodput across the entire cluster (see the sketch after this list)
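To make the per-round optimization concrete, here is a minimal sketch of the kind of ILP a Sia-style scheduler solves each round: binary variables pick at most one (GPU type, #GPUs) configuration per job, capped at 2x the job's previous allocation, subject to per-type capacity, maximizing total estimated goodput. This is my illustration, not the authors' code; the job names, goodput numbers, and capacities are made up, and Sia's real formulation also handles placement and restart costs.

```python
# Sketch of a Sia-style per-round allocation ILP (illustrative only).
# Assumes `pip install pulp`; all data below is made up for the example.
import pulp

gpu_capacity = {"A100": 4, "V100": 8}          # GPUs available per type this round
prev_alloc   = {"jobA": 1, "jobB": 2}          # GPUs held last round (scale cap = 2x)

# Estimated goodput for candidate (job, gpu_type, num_gpus) configurations,
# as produced by the throughput model x statistical-efficiency estimate.
goodput = {
    ("jobA", "A100", 1): 1.0, ("jobA", "A100", 2): 1.8, ("jobA", "V100", 1): 0.6,
    ("jobB", "V100", 2): 1.2, ("jobB", "V100", 4): 2.1, ("jobB", "A100", 2): 1.9,
}

prob = pulp.LpProblem("sia_round", pulp.LpMaximize)
x = {cfg: pulp.LpVariable(f"x_{i}", cat="Binary") for i, cfg in enumerate(goodput)}

# Objective: maximize total estimated goodput across the cluster.
prob += pulp.lpSum(goodput[cfg] * x[cfg] for cfg in goodput)

# Each job gets at most one configuration (it may also be left unscheduled).
for job in prev_alloc:
    prob += pulp.lpSum(x[cfg] for cfg in goodput if cfg[0] == job) <= 1

# Respect per-GPU-type capacity.
for gtype, cap in gpu_capacity.items():
    prob += pulp.lpSum(n * x[(j, g, n)] for (j, g, n) in goodput if g == gtype) <= cap

# Scaling throttle: a job can grow by at most 2x per round.
for (j, g, n), var in x.items():
    if n > 2 * prev_alloc[j]:
        prob += var == 0

prob.solve(pulp.PULP_CBC_CMD(msg=0))
for cfg, var in x.items():
    if var.value() == 1:
        print("schedule", cfg)
```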
The throughput model is more critical, so I'll dive deeper into it.
Throughput model
Scaling essentially changes the batch size of the underlying job, but the ideal batch size has to be decided jointly with the statistical efficiency factor. They model this similarly to Pollux.
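My rough mental model of the Pollux-style goodput (hedged, from my reading of Pollux): goodput is raw throughput scaled by a statistical efficiency term that falls off as the batch size grows past the gradient noise scale. The function and parameter names below are mine and the numbers are illustrative.

```python
# Rough sketch of Pollux-style goodput (symbols and values are illustrative).
def statistical_efficiency(batch_size, base_batch_size, grad_noise_scale):
    # Progress per example drops off once the batch size exceeds the
    # (preconditioned) gradient noise scale; roughly Pollux's EFFICIENCY(M).
    return (grad_noise_scale + base_batch_size) / (grad_noise_scale + batch_size)

def goodput(throughput_examples_per_sec, batch_size, base_batch_size, grad_noise_scale):
    # Goodput = raw throughput scaled by how statistically useful each example is.
    return throughput_examples_per_sec * statistical_efficiency(
        batch_size, base_batch_size, grad_noise_scale)

# A bigger batch gives more raw throughput but less useful work per example.
print(goodput(1000.0, batch_size=512, base_batch_size=128, grad_noise_scale=256))
```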
At a high level, they bootstrap a throughput model (one model per job per GPU type) with a few profiling runs, then update it over time from runtime information. So basically: initialize with 1-GPU profiles across the different GPU types; assume a linear correlation between #GPUs and throughput (which is not accurate, but serves as a rule of thumb) to take the initial scaling decisions; monitor continuously and update the throughput-vs-#GPUs model with real throughput numbers as they come in; and share the learnings across GPU types.
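Here is a minimal sketch of how I picture this bootstrap-then-refine loop. All class and parameter names are mine, not the paper's, and Sia's real estimator is richer; this only shows the idea of linear-scaling defaults, online updates, and sharing across GPU types.

```python
# Illustrative sketch (not Sia's actual estimator): profile 1 GPU per type,
# assume linear scaling until real measurements arrive, then blend in observed
# throughput and reuse the learned scaling factor across GPU types.
class ThroughputModel:
    def __init__(self, one_gpu_profile):
        # one_gpu_profile: samples/sec per GPU type from short profiling runs
        self.base = dict(one_gpu_profile)
        self.scaling = {}   # num_gpus -> observed scaling factor, shared across GPU types

    def predict(self, gpu_type, num_gpus):
        # Fall back to the optimistic linear rule of thumb if nothing observed yet.
        factor = self.scaling.get(num_gpus, float(num_gpus))
        return self.base[gpu_type] * factor

    def update(self, gpu_type, num_gpus, observed_throughput, alpha=0.5):
        # Blend the observed scaling factor into the shared model (simple EMA).
        observed_factor = observed_throughput / self.base[gpu_type]
        old = self.scaling.get(num_gpus, float(num_gpus))
        self.scaling[num_gpus] = (1 - alpha) * old + alpha * observed_factor

model = ThroughputModel({"A100": 1000.0, "V100": 450.0})
print(model.predict("V100", 4))                       # linear guess: 1800 samples/sec
model.update("A100", 4, observed_throughput=3400.0)   # real run scaled ~3.4x, not 4x
print(model.predict("V100", 4))                       # now reflects the A100 learning
```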
Implementation and execution:
They implement Sia on the AdaptDL framework (Pollux) and run it on a Kubernetes cluster.
They evaluate on a wide range of jobs, from small ResNet models to LLMs, and have a wide variety of experiments and microbenchmarks.
They only have one experiment (Figure 4) on the real system; the others all use the simulator on traces.
- Table 4: how are GPU hours lower? Time wasted on checkpoint-restore should also count.
Key Takeaways
- Their intro clearly explains, motivates, and presents a lot of prior art in the two spaces of adaptive scheduling and heterogeneous cluster management. This was good to read and understand, and while there is a lot of prior art here, I think Sia does carve a niche for itself by handling both adaptivity and heterogeneous clusters.
- They depend heavily on Pollux, using both its simulator and its batch-size algorithm. They could have open-sourced their changes on top of Pollux though, seeing that the authors are from the same group.
- I really liked their evaluation section. It has a great main results table (Table 3) that compares across different axes, like prior art and different workloads, and reports a lot of performance metrics. Apart from the expected ones (cluster utilization, makespan, JCT), they had some unique ones like contention and # restarts. They also had a lot of microbenchmarks over a variety of scenarios.