Exploring ParSync: Partitioned Synchronization
A summary of this interesting algorithm.
Link to the original paper: Scaling Large Production Clusters with Partitioned Synchronization
Note: This post is not a paper review.
Computer systems are ever-growing. Today, a cluster may have thousands of machines and may execute billions of tasks each day. To incorporate this rapid growth, schedulers need to find an optimal way to improve resource utilization. In this post, we’re going to explore a staleness-aware state sharing design, developed at Alibaba, called ParSync (partitioned synchronization).
ParSync is a distributed scheduler architecture that aims to handle the scale of Alibaba’s cluster size (100k machines) and task submission rate (40k/s), while at the same time achieving low-latency and high-quality scheduling.
Low latency means a task is quickly assigned to a machine and starts executing. High quality means a task is assigned to a suitable machine (machines with larger memory or faster CPUs, etc.).
Before we have a look at ParSync, here’s a quick glance at the different scheduling architectures:
- Monolithic: Single-instance architecture, same scheduling logic is applied for all incoming jobs.
- Statically Partitioned: Schedulers are deployed onto dedicated, statically-partitioned clusters of machines.
- Two-level: A central coordinator assigns resources to sub-schedulers. For example, in Mesos, resources are distributed to the frameworks in the form of offers, which contain only available (unused) resources. The allocator avoids conflicts by only offering a given resource to one framework at a time. Since it effectively holds a lock on that resource for the duration of a scheduling decision, concurrency control is pessimistic.
- Shared-state: One or more schedulers read shared cluster metadata about resources and then use that metadata to make scheduling decisions. To schedule tasks, independent schedulers try to modify the shared state. Conflicts might occur. This approach eliminates two of the issues of the two-level scheduler approach – limited parallelism due to pessimistic concurrency control, and restricted visibility of resources in a scheduler framework. However, it comes at the potential cost of redoing work when the optimistic concurrency assumptions are incorrect.
Omega is a shared-state scheduler architecture. In Omega, a master maintains the cluster state, which indicates the availability of resources in each machine in the cluster. There are multiple schedulers, each of which maintains a local copy of the cluster state by synchronizing with the master copy periodically.
Each scheduler schedules tasks optimistically based on the local state (could possibly be stale) and sends resource requests to the master. As multiple schedulers may request the same piece of resource, this results in scheduling conflicts.
In the paper, the authors state that the contention on high-quality resources and the staleness of local states are the two main contributors to high scheduling latency, as they both increase scheduling conflicts.
In an ideal situtation, we would want to avoid the limitations of pessimistic scheduling and reduce conflicts in optimistic scheduling, while enjoying their benefits. This is where ParSync comes into picture.
The authors found that the scheduling delay increases disproportionately within the period G (synchronization gap). When the cluster state is just synchronized, it is fresher and scheduling has fewer conflicts. But when the state becomes more stale towards the end of G, scheduling decisions result in more conflicts.
ParSync aims to reduce the staleness of the local states and to find a good balance between resource quality and scheduling efficiency.
ParSync logically partitions the master state into P parts. Assume that we have N schedulers, where N ≤ P. Each scheduler keeps a local cluster state, but different schedulers synchronize different partitions of the state each time, in contrast to synchronizing the entire state (StateSync).
Instead of synchronizing all schedulers’ state every G seconds, the system refreshes each scheduler’s view of different partitions in each synchronization round. Therefore a synchronization round is shorter, it takes only G/N seconds. The partitions are reassigned round-robin, so after N rounds all schedulers have synced all partitions and the cycle repeats.
Let’s visualize this:
Let N = 3, P = 6 partitions.
How ParSync reduces staleness:
Without partitions, each scheduler’s view of the slots’ state is refreshed every G seconds. All schedulers’ staleness ranges from 0 to G, and the average is G/2.
But with two partitions, each synchronization round takes G/2 seconds and refreshes half the partitions. At the beginning, its view of Partition 1 has 0 staleness and its view of Partition 2 has G / 2 staleness.
At the end of this round, the Partition 1 staleness has grown to G/2, and the Partition 2 staleness to G. Then the scheduler refreshes Partition 2 and Partition 2 has 0 staleness, but Partition 1’s staleness grows to G, and it goes on.
Average total staleness is always G/2. But the range of staleness is smaller with two partitions. Instead of ranging from 0 to G, it ranges from G/4 to 3G/4. This range decreases with increase in partitions.
After each round, each scheduler has a fresh view of some partitions and a stale view of others. A scheduler can minimize conflicts by committing slots from the most fresh partitions, or it can maximize quality by committing the “quality” slots first.
The authors describe three scheduling strategies:
- Latency-first: Scheduler schedules a task by first picking slots from its “freshest” partitions.
- Quality-first: Scheduler schedules a task by first choosing the partition with the highest slot score.
- Adaptive: The adaptive strategy uses quality-first strategy as the default, but changes it to latency-first when the EMA (exponential moving average) of the scheduling delay exceeds a threshold
The authors simulate an experiment with a cluster having 200K slots, where each task takes 1 slot and runs for 5s. Two groups of schedulers are used, and the timeline is divided into three phases:
- 0–30 seconds: Scheduler groups A and B are both operating at 2⁄3 of their full capacity
- 30–60 seconds: Scheduler group A is operating at full capacity
- 60–90 seconds: Both scheduler groups (A and B) are operating at full capacity
Here are the following charts:
As you can observe, the quality-first scheduler tries to pick slots with more “quality”, then as “high-quality” slots are used up over time, the slot score drops. The latency graph shows that the scheduler group A starts to get overwhelmed and the latency grows linearly to 100% by Phase 3 (60–90 seconds). Thus, Quality-first is not effective when the scheduling load is high.
In the case of the latency-first strategy, the latency increase is comparatively better as compared to Quality-First; but the “quality” drops after Phase 3.
In Adaptive strategy, the latency is bounded to the threshold, but the “quality” drops after Phase 3 (when strategy is switched from Quality-First to Latency-First).
- Quality-first works well when the scheduling load is low
- When the load is heavy, latency-first is preferred as it achieves high efficiency.
- Adaptive keeps the latency bounded, while it can also try to get “quality” slots when the load is not heavy.
ParSync vs StateSync
The authors compared ParSync with StateSync, which simulates Omega.
Both ParSync and StateSync used 20 schedulers. Tasks were submitted during a 300-second period, which was divided into five 60-second phases with different submission rates: 50%, 80%, 95%, 80%, 50% of 40k/s.
As you can see from the graph, ParSync performs better than StateSync in terms of latency. The scheduling quality of StateSync is better, however, its high scheduling delay makes it infeasible for production use.
ParSync is a brilliant algorithm; the paper discusses a number of interesting scheduler architectures (which I would love to explore more in the future).
I hope we get a blog from Alibaba’s team on how ParSync is being used in Fuxi 2.0 (the distributed cluster scheduler used at Alibaba). 😄