Exploring ParSync: Partitioned Synchronization

A summary of this interesting algorithm.

Link to the original paper: Scaling Large Production Clusters with Partitioned Synchronization

Note: This post is not a paper review.

Introduction

Computer systems are ever-growing. Today, a cluster may have thousands of machines and may execute billions of tasks each day. To keep up with this rapid growth, schedulers need to assign tasks efficiently and keep resource utilization high. In this post, we’re going to explore a staleness-aware state sharing design, developed at Alibaba, called ParSync (partitioned synchronization).

ParSync is a distributed scheduler architecture that aims to handle the scale of Alibaba’s cluster size (100k machines) and task submission rate (40k/s), while at the same time achieving low-latency and high-quality scheduling.

Low latency means a task is quickly assigned to a machine and starts executing. High quality means a task is assigned to a suitable machine (one with more memory or a faster CPU, for example).

Scheduling Architectures

Before we have a look at ParSync, here’s a quick glance at the different scheduling architectures:

[Figure: Different Scheduling Architectures]

Extending Omega

Omega is a shared-state scheduler architecture. In Omega, a master maintains the cluster state, which indicates the availability of resources in each machine in the cluster. There are multiple schedulers, each of which maintains a local copy of the cluster state by synchronizing with the master copy periodically.

Each scheduler schedules tasks optimistically based on its local state (which may be stale) and sends resource requests to the master. Since multiple schedulers may request the same resources, this results in scheduling conflicts.
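To make the conflict problem concrete, here is a minimal Python sketch of optimistic, shared-state scheduling (my own illustration, not the paper’s implementation): each scheduler picks slots from a possibly stale local copy, and the master rejects commits for slots that are already taken.

```python
import copy

class Master:
    """Holds the authoritative cluster state: slot id -> "free" or "taken"."""
    def __init__(self, num_slots):
        self.slots = {i: "free" for i in range(num_slots)}

    def commit(self, slot_ids):
        """Grant the requested slots that are still free; reject the rest."""
        granted, conflicts = [], []
        for s in slot_ids:
            if self.slots[s] == "free":
                self.slots[s] = "taken"
                granted.append(s)
            else:
                conflicts.append(s)  # another scheduler got this slot first
        return granted, conflicts

class Scheduler:
    """Schedules optimistically against a local (possibly stale) copy of the state."""
    def __init__(self, master):
        self.master = master
        self.local = {}

    def sync(self):
        self.local = copy.deepcopy(self.master.slots)

    def schedule(self, num_tasks):
        # Pick slots that *look* free locally; the master has the final say.
        wanted = [s for s, state in self.local.items() if state == "free"][:num_tasks]
        return self.master.commit(wanted)

master = Master(num_slots=10)
a, b = Scheduler(master), Scheduler(master)
a.sync(); b.sync()            # both schedulers see all 10 slots as free
print(a.schedule(6))          # ([0, 1, 2, 3, 4, 5], [])
print(b.schedule(6))          # ([], [0, 1, 2, 3, 4, 5])  <- conflicts due to staleness
```

The staler the local copy, the more of those rejected commits (and retries) each scheduler has to absorb, which is exactly the latency problem the paper targets.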

In the paper, the authors state that the contention on high-quality resources and the staleness of local states are the two main contributors to high scheduling latency, as they both increase scheduling conflicts.

Partitioned Synchronization

In an ideal situation, we would want to avoid the limitations of pessimistic scheduling and reduce the conflicts of optimistic scheduling, while enjoying the benefits of both. This is where ParSync comes into the picture.

The authors found that the scheduling delay increases disproportionately within the synchronization gap G. Right after the cluster state is synchronized, the local copy is fresh and scheduling causes few conflicts; as the state grows staler towards the end of G, scheduling decisions result in more conflicts.

ParSync aims to reduce the staleness of the local states and to find a good balance between resource quality and scheduling efficiency.

ParSync logically partitions the master state into P parts. Assume that we have N schedulers, where N ≤ P. Each scheduler keeps a local cluster state, but different schedulers synchronize different partitions of the state each time, in contrast to synchronizing the entire state (StateSync).

Instead of synchronizing every scheduler’s entire state every G seconds, the system refreshes each scheduler’s view of a different set of partitions in each synchronization round. A synchronization round is therefore shorter: it takes only G/N seconds. The partitions are reassigned round-robin, so after N rounds every scheduler has synced all partitions and the cycle repeats.

Let’s visualize this:

Let N = 3 schedulers and P = 6 partitions.

[Figure: Partition Synchronization]
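A minimal sketch of this rotation for the N = 3, P = 6 example (the partition numbering and assignment order are my own illustration, not taken from the paper):

```python
N, P = 3, 6                  # schedulers, state partitions
PER_ROUND = P // N           # partitions each scheduler refreshes per round

def partitions_synced(scheduler, rnd):
    """Partitions refreshed by `scheduler` in synchronization round `rnd`."""
    start = ((scheduler + rnd) * PER_ROUND) % P
    return [(start + i) % P for i in range(PER_ROUND)]

# After N rounds (each lasting G/N seconds) every scheduler has refreshed all P partitions.
for rnd in range(N):
    for s in range(N):
        print(f"round {rnd}, scheduler {s}: syncs partitions {partitions_synced(s, rnd)}")
```

In each round, different schedulers refresh different partitions, and over N rounds each scheduler covers the whole state, so any given partition in a scheduler’s view is still refreshed once every G seconds, just staggered across partitions.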

How ParSync reduces staleness:

Without partitions, each scheduler’s view of the slots is refreshed every G seconds, so its staleness ranges from 0 to G, with an average of G/2.

But with two partitions, each synchronization round takes G/2 seconds and refreshes half of the state. At the beginning of a round, a scheduler’s view of Partition 1 has 0 staleness and its view of Partition 2 has G/2 staleness.

By the end of the round, Partition 1’s staleness has grown to G/2, and Partition 2’s to G. The scheduler then refreshes Partition 2, bringing its staleness back to 0, while Partition 1’s staleness keeps growing until it reaches G at the end of the next round, when it is refreshed in turn, and so on.

[Figure: Staleness for different values of G]

The average staleness across partitions is still G/2, but its range is smaller with two partitions: instead of ranging from 0 to G, it ranges from G/4 to 3G/4. This range shrinks further as the number of partitions increases.
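To see where the G/4 to 3G/4 range comes from, here is a small back-of-the-envelope sketch (my own model, not code from the paper) that computes the average staleness across partitions at each instant, assuming partitions are refreshed round-robin and staggered evenly over G:

```python
G = 10.0                                    # synchronization gap, in seconds

def avg_staleness(t, num_partitions):
    """Average staleness across partitions at time t, assuming partition i is
    refreshed at times i*G/num_partitions, i*G/num_partitions + G, and so on."""
    return sum((t - i * G / num_partitions) % G
               for i in range(num_partitions)) / num_partitions

for p in (1, 2, 4):
    samples = [avg_staleness(t * G / 1000, p) for t in range(1000)]
    print(f"{p} partition(s): min={min(samples):.2f}s  "
          f"max={max(samples):.2f}s  mean={sum(samples) / len(samples):.2f}s")
```

Running it shows the mean staying at G/2 while the range between min and max narrows as the number of partitions grows.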

After each round, each scheduler has a fresh view of some partitions and a staler view of others. A scheduler can minimize conflicts by committing slots from its freshest partitions, or it can maximize quality by committing the highest-quality slots first, as sketched below.
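A minimal sketch of these two extremes (the Slot fields and the partition_age mapping are hypothetical names of my own, not the paper’s API):

```python
from dataclasses import dataclass

@dataclass
class Slot:
    slot_id: int
    partition: int    # which state partition this slot belongs to
    quality: float    # e.g. derived from CPU speed or memory size

def pick_latency_first(free_slots, num_needed, partition_age):
    """Prefer slots from the freshest partitions to minimize commit conflicts."""
    return sorted(free_slots, key=lambda s: partition_age[s.partition])[:num_needed]

def pick_quality_first(free_slots, num_needed):
    """Prefer the highest-quality slots, however stale their partition may be."""
    return sorted(free_slots, key=lambda s: s.quality, reverse=True)[:num_needed]

slots = [Slot(0, 0, 0.9), Slot(1, 1, 0.5), Slot(2, 1, 0.7)]
age = {0: 3.0, 1: 0.5}                    # partition 1 was refreshed most recently
print(pick_latency_first(slots, 1, age))  # -> slot 1, from the freshest partition
print(pick_quality_first(slots, 1))       # -> slot 0, with the highest quality score
```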

An Experiment

The authors describe three scheduling strategies: Quality-first, where a scheduler prefers the highest-quality slots it can see; Latency-first, where a scheduler prefers slots from its freshest partitions to minimize conflicts; and Adaptive, which uses Quality-first while the scheduling latency stays below a threshold and switches to Latency-first once it exceeds it.

The authors simulate an experiment with a cluster of 200K slots, where each task takes 1 slot and runs for 5 s. Two groups of schedulers are used, and the timeline is divided into three phases.

Here are the resulting charts:

[Figure: Quality-First Scheduling]

As you can observe, the Quality-first scheduler picks the slots with the highest “quality” first; as the high-quality slots are used up over time, the slot score drops. The latency graph shows that scheduler group A starts to get overwhelmed and its latency grows linearly to 100% by Phase 3 (60–90 seconds). Thus, Quality-first is not effective when the scheduling load is high.

[Figure: Latency-First Scheduling]

In the case of the Latency-first strategy, the latency is noticeably better than with Quality-first, but the “quality” drops after Phase 3.

[Figure: Adaptive Scheduling]

With the Adaptive strategy, the latency is bounded by the threshold, but the “quality” drops after Phase 3 (when the strategy switches from Quality-first to Latency-first).
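A rough sketch of such a switch (the threshold name and value are placeholders of my own; the paper’s actual policy may differ in its details):

```python
LATENCY_THRESHOLD = 5.0   # hypothetical bound on scheduling delay, in seconds

def choose_strategy(recent_scheduling_latency):
    """Use Quality-first while the observed delay is comfortably low;
    fall back to Latency-first once it approaches the bound."""
    if recent_scheduling_latency < LATENCY_THRESHOLD:
        return "quality-first"
    return "latency-first"

print(choose_strategy(1.2))   # quality-first: load is light, go for better slots
print(choose_strategy(6.8))   # latency-first: shed the backlog, accept lower quality
```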

Hence, Quality-first works well only when the scheduling load is low, Latency-first keeps latency under control at the cost of quality, and Adaptive strikes a balance between the two.

ParSync vs StateSync

The authors compared ParSync with StateSync, which simulates Omega.

Both ParSync and StateSync used 20 schedulers. Tasks were submitted during a 300-second period, which was divided into five 60-second phases with different submission rates: 50%, 80%, 95%, 80%, 50% of 40k/s.
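For reference, the submission-rate schedule described above can be written out as follows (a plain restatement of the numbers in the text):

```python
PEAK_RATE = 40_000                        # tasks per second
PHASES = [0.50, 0.80, 0.95, 0.80, 0.50]   # fraction of the peak rate in each 60 s phase

for i, frac in enumerate(PHASES, start=1):
    start, end = (i - 1) * 60, i * 60
    print(f"Phase {i} ({start:3d}-{end:3d} s): {int(frac * PEAK_RATE):,} tasks/s")
```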

[Figure: ParSync vs StateSync]

As you can see from the graph, ParSync performs better than StateSync in terms of latency. The scheduling quality of StateSync is better; however, its high scheduling delay makes it infeasible for production use.

Conclusion

ParSync is a brilliant algorithm; the paper discusses a number of interesting scheduler architectures (which I would love to explore more in the future).

I hope we get a blog from Alibaba’s team on how ParSync is being used in Fuxi 2.0 (the distributed cluster scheduler used at Alibaba). 😄