Shortly after the DS8888F model was introduced toward the end of 2016, IBM published SPC-1 v3 benchmark numbers of 1.5M, more than 3x the previous DS8000 generation models, and at the time the highest SPC-1 v3 number. The DS8888F have two servers with up to 48 POWER8 cores and 1TB of memory per server. The previous DS8870 can have up to 16 POWER7 cores per server. As we have seen in the past decade, there has not been a significant increase in CPU clock speeds from one generation to the next. In fact, the CPU clock speed have become slower recently and to get better performance, the hardware must do more operations in parallel by increasing CPU pipelines, allow higher simultaneous multithreading (SMT) and pack more cores per chip. This blog highlights some extremely important innovations in the DS8888F, although not highly publicized, are key to properly use the modern CPU hardware to increase performance.
Synchronization with Locks
Locks are synchronization primitives that are crucial for synchronizing update to shared data structures in a multiprocessor environment. The use of locks is very common in all multithreaded applications and code running in a storage server is no exception. The most frequently implemented type of locks in all operating systems is the spinlock. In a spinlock, the first thread trying for the lock will get it while other threads will repeatedly check (i.e., spin) for the lock to be freed. Implementation of the lock typically require hardware support in the form of atomic instructions such as test-and-set or compare-and-swap. Below is pseudo-C code of how a spinlock can be implemented using test-and-set, which writes 1 to a location and returns the old value in the location.
while (test_and_set(lockP) != 0);
*lockP = 0;
The implementation of the spinlock above is very efficient in term of code path length. However, there is a huge disadvantage to this implementation: the lack of fairness. If there are many threads trying to get the same lock, there is no guaranteed order of when a thread gets the lock. In fact, in modern CPU architectures, there is no guaranteed that a thread will even get the lock if there are multiple other threads constantly vying for the same lock. In a storage system, this type of locking behavior will lead to erratic response time or even deadlock.
A variation of the test-and-set implementation of a spinlock above that accounts for fairness is a ticket lock. The concept of a ticket lock is just like the serving system at a deli station in a grocery store. To be served, you first take a ticket with a number. When your number matches the current serving number, you get your turn. When youâ€™re done, the current serving number increases by 1 and person with next ticket number gets the turn. The pseudo-C code below illustrates how ticket locks can be implemented using a fetch-and-increment atomic instruction which increments the value in a location and return the old value.
lockP->servingNumber = 0;
lockP->ticketNumber = 0;
myTicket = fetch_and_inc(&(lockP->ticketNumber));
while (lockP->servingNumber != myTicket);
While the ticket lock is fair, it has a major problem in that as the number of threads spinning for a lock increases, the contention for spinning and updating the lock grows exponentially, severely slowing down the obtaining of the lock.
Another type of spinlock called queued spinlocks are designed to scale with the thread increases while still maintaining fairness. When a thread wants to take a lock, it puts its entry, with a flag in the entry set, into the queue and fetch the last entry in the queue atomically. If the queue is empty, then it gets lock. Otherwise, it will spin on a flag in the obtained entry until that flag is clear. When it frees the lock, if the last entry in the queue is its entry, it removes that entry from the queue. Otherwise, it clears the flag in its entry so the next thread waiting for the lock can proceed. Because each thread is spinning on a different location, queued spinlocks scale very nicely with the increased in the number of running threads. However, the memory requirement also increases on the order of the number of threads. There can be several hundred thousand locks running on the DS8888F so the memory increase would prevent wide-spread usage of queued spinlocks. The code path length for queued spinlocks is also longer than that of ticket locks which is already slightly longer than test-and-set spinlocks.
The DS8888F exploits the advantages of all three types of spinlocks described. For most of the locks, it uses test-and-set spinlocks for its code efficiency if the lock is not being contested. As soon as there is a waiter for the lock, the lock autonomously transforms into a ticket lock to ensure fairness. A lock measurement tool is developed and used to identify extremely highly contested locks while running various workloads. For these locks, either code is rewritten to reduce their usage or converted to using queued locks. By properly using the correct types of spinlocks, the execution of serialized sections of code can be efficient as possible resulting in an increased overall performance.
Tasks dispatching and pooling
Another core function in all operating systems is tasks dispatching services. All DS8000 already makes heavy use of proprietary lightweight threads to efficiently run tasks without the large penalty of context switching. There can be upward of 32 thousand tasks running in core code providing host access and advanced functions. Many of these tasks could complete its work in just a few microseconds. Normal task dispatch overhead of several microseconds would incur almost 50% overhead. The proprietary lightweight threads scheme used in the DS8000 eliminate over 90% of this overhead.
In addition, the DS8888F incorporates several innovative tasks dispatching concepts to improve performance. In a typical modern multiprocessor system, several levels of cache memory are used to store recently accessed data from main memory. Cache is much faster memory placed closer to the CPUs than main memory. While the accessing time can be 5x-10x faster, cache size is extremely small compared to main memory. For example, each processor complex in the DS8888F can have 1TB of main memory while the L3 cache memory is only 8MB per core. It is extremely important to make good use of the limited fast cache memory so the tasks can run faster. This means we want to have as much reuse of data resided in cache as possible. To accomplish this, the lightweight task dispatcher in the DS8888F uses a sophisticated and yet efficient IBM-patented algorithm to assign tasks to processor cores. The algorithm remembers and dispatches a task on a thread in the same CPU core that last executes that task. It auto-balances the dispatching to idle CPU cores based on thresholding of categorized tasks. Developers can group certain tasks to move to an idle CPU only when there are so many tasks waiting to be dispatched on a CPU core. Certain category of tasks may be restricted to stay in a subset of CPU cores avoid lock contention or too much cross-chips traffic that could degrade overall performance. Certain tasks may also be marked to be dispatched in a round-robin manner such as producer tasks that fill up each local thread or CPU core queues with consumer data.
In current modern CPU architectures with many chips and multiple hierarchy of memory access latency, it is crucial to limit the amount of traffic between chips or modules of chips to reduce memory access latency. The tasks pooling concept was employed in the DS8888F to increase cache locality as well as reducing cross-chips traffic. Tasks are grouped into pools. Tasks in each pool will only be dispatched on a certain distinct group of CPU cores, that belongs to the same chip or module, using the auto-balancing algorithm mentioned above. Work in the storage system are separated into pools to ensure that end-to-end operations done by various tasks will be executed by tasks in the same pool if possible. For example, an IO request from the host coming into a certain host adapter results in work that will be picked up and passed along by several tasks which typically will be in the same tasks pool. Tasks pooling can yield many benefits that contributes to the goal of achieving optimal performance. It increases cache locality by dividing the amount of data being accessed across distinct groups of CPUs leading to faster code execution. It allows better reuse of data even when not in cache and reduces the amount of traffic between CPU chips. Since tasks in a pool tend to obtain the same locks due to the separation of work, this also reduces cross-chips traffic and lock contention which eliminates bottlenecks as well as allowing faster code execution.
The DS8888F provides many ground-breaking features in an Enterprise storage system with unmatched reliability. Highly innovative features such as IBM Easy TierÂ® along with a wide-range of advanced disaster recovery solutions including multi-site mirroring are not only well-known but highly publicized. What is not well-known is core services in the DS8888F also come with many inventions and innovations. These core services provide a solid foundation for enterprise-critical features in the DS8888F running smoothly and optimally.