Article
by Gurudutt Kumar V J | Published May 20, 2013
Systems
Archived date: 2019-08-09
Computing hardware is evolving fast. Transistor density is still increasing while clock speeds are leveling off, so processor manufacturers are looking to increase multiprocessing capability by putting more cores and hardware threads on each chip. For example, the IBM POWER7® symmetric multiprocessor architecture achieves massive parallelism by supporting up to 4 threads per core, 8 cores per chip, and 32 sockets per server, for a total of 1024 simultaneous hardware threads. In comparison, the IBM POWER6® architecture supported only 2 threads per core, 2 cores per chip, and 32 sockets per server, a total of 128 parallel hardware threads.
While developing software, designers are now required to consider the multiprocessor, multi-core architectures that the software might be deployed on, because how well an application performs and scales on such systems depends on how well its design exploits those parallel resources.
This article outlines some of the key considerations for designing software for multi-core, multiprocessor environments.
Applications are expected to scale and perform better in multi-core, multiprocessor environments. However, an inefficiently designed application might perform poorly in such an environment rather than scaling well and using the available computing resources. The key impediments to this scalability, discussed in the sections that follow, include memory contention, false sharing, lock contention, and heap contention.
Low processor utilization obviously can mean suboptimal resource utilization. To understand performance issues, you need to evaluate whether an application has too few or too many threads, locking or synchronization issues, network or I/O latencies, or problems with memory thrashing or other memory management. High processor utilization is normally good, as long as the cycles are spent in application threads doing meaningful work.
Before discussing the design considerations for chip multithreaded, multi-core, multiprocessor environments, let's take a brief look at such a system. The system depicted in the figure "A typical chip multithreaded, multi-core, multiprocessor system" has two processors, each with two cores, and each core has two hardware threads. There is one L1 cache and one L2 cache per core; alternatively, each core could have its own L2 cache, or the cores on the same processor could share an L2 cache. Hardware threads on the same core share the L1 and L2 caches.
All the cores and processors share the system bus and access the main memory or RAM through the system bus. For applications and the operating system, this system looks like eight logical processors.
The following key concepts will help us understand the challenges in designing applications for such a chip multithreaded, multi-core, multiprocessor environment.
Cache coherency is a state where the value of a data item in a processor’s cache is the same as that in system memory. This state is transparent to the software. However, the operations performed by the system to maintain cache coherency can affect the software’s performance.
Consider the following example. Assume that Thread 1 is running on processor 0 and Thread 2 on processor 1 of the system depicted in the figure "A typical chip multithreaded, multi-core, multiprocessor system". If both threads are reading and writing the same data item, the system has to perform extra operations to ensure that the threads see the same data value as each read and write occurs.
When Thread 1 writes to the data item that it shares with Thread 2, the item is updated in its processor's cache and in system memory, but not immediately in the cache of Thread 2's processor, because Thread 2 might no longer require access to it. If Thread 2 then accesses the data item, the cache subsystem on its processor must first obtain the new data value from system memory. So, the write by Thread 1 forces Thread 2 to wait for a read from system memory the next time it accesses the data. This sequence only takes place when the data is modified by one of the threads. If each thread performs a series of writes, the performance of the system can seriously degrade because of all the time spent waiting to update the data value from system memory. This situation is referred to as "ping-ponging", and avoiding it is an important software design consideration on multiprocessor and multi-core systems.
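The back-and-forth described above can be sketched in a small pthreads program. This is an illustrative sketch, not code from the article: the names (run_pingpong_demo, pp_worker) and iteration counts are assumptions, and a mutex keeps the final count deterministic even though the lock itself adds contention. Every increment of pp_shared dirties the same cache line, so when the two threads land on different cores the line bounces between their caches on each write.

```c
#include <pthread.h>

#define PP_THREADS 2
#define PP_ITERS   100000

static long pp_shared;                    /* one variable written by both threads */
static pthread_mutex_t pp_lock = PTHREAD_MUTEX_INITIALIZER;

static void *pp_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < PP_ITERS; i++) {
        pthread_mutex_lock(&pp_lock);
        pp_shared++;                      /* each write invalidates the other core's cached copy */
        pthread_mutex_unlock(&pp_lock);
    }
    return NULL;
}

/* Run both threads to completion and return the final count. */
long run_pingpong_demo(void)
{
    pthread_t t[PP_THREADS];
    pp_shared = 0;
    for (int i = 0; i < PP_THREADS; i++)
        pthread_create(&t[i], NULL, pp_worker, NULL);
    for (int i = 0; i < PP_THREADS; i++)
        pthread_join(t[i], NULL);
    return pp_shared;
}
```

Timing this against a variant in which each thread increments its own separate counter makes the cost of the ping-ponging (plus the lock traffic) directly visible.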
The cache subsystem keeps track of the state of each of its cache lines. It uses a technique called "bus snooping" (also called "bus sniffing") to monitor all transactions on the system bus and detect when a read or write occurs at an address that is in its cache.
When the cache subsystem detects a read on the system bus to a memory region loaded in its cache, it changes the state of the cache line to "shared". If it detects a write to that address, it changes the state of the cache line to "invalid".
Because it is snooping the system bus, the cache subsystem knows whether it has the only copy of the data in its cache. If such data is updated by its own CPU, the cache subsystem changes the state of the cache line from "exclusive" to "modified". If it then detects a read of that address by another processor, it can stall that access, update the data in system memory, and then allow the other processor's access to continue. It also marks the state of the cache line as "shared".
Refer to the article on “Software Design Issues for multi-core multi-processor systems” in the Related topics section for more information on these concepts.
When designing software to run on a multi-core or multiprocessor system, the main consideration is how to allocate the work that will be done on the available processors. The most common way to allocate this work is by using a threading model where the work can be broken down to separate execution units that can run on different processors in parallel. If the threads are completely independent of one another, their design does not have to consider how they will interact. For example, two programs running on a system as separate processes each on its own core do not have any awareness of each other. Performance of the programs is not affected unless they contend for a shared resource such as system memory or the same I/O device.
The main focus of the discussion that follows is the way that the cores and processors interface with main memory, and how this impacts software design decisions.
See the following key design considerations.
Different cores share a common data region in memory and cache, which needs to be kept synchronized among them. Memory contention occurs when different cores concurrently access the same data region. Synchronizing data among different cores carries a large performance penalty because of bus traffic, locking costs, and cache misses.
If an application has multiple threads and all the threads are updating or modifying the same memory address, then, as discussed in the previous section, there could be significant ping-ponging in order to maintain cache coherency. This will lead to degraded performance.
For more details, refer to the “Memory Contention” section in the article “Memory issues on multi-core platforms” in the Related topics section. The article includes a simple program that demonstrates the ill effects of memory contention. The example demonstrates that even if only one variable is shared among multiple threads, the performance penalty could be very significant even when atomic instructions are used for updates.
Don’t share writable state among cores:
If two or more processors are writing data to different portions of the same cache line, a lot of cache and bus traffic results from invalidating or updating every cached copy of the old line on the other processors. This is called "false sharing", or sometimes "CPU cache line interference". Unlike true sharing, where two or more threads share the same data (thereby needing programmatic synchronization mechanisms to ensure ordered access), false sharing occurs when two or more threads access unrelated data that happens to reside on the same cache line.
Consider the following code snippet to understand false sharing better.
double sumLocal[N_THREADS];
. . .
void ThreadFunc(void *data)
{
    . . .
    int id = p->threadId;
    sumLocal[id] = 0.0;
    . . .
    for (i = 0; i < N; i++)
        sumLocal[id] += p[i];
    . . .
}
In the above code example, the array sumLocal has one element per thread. The array can cause false sharing, because multiple threads write to it and the elements they modify can lie on the same cache line. The figure "False sharing" (Ref: "Avoiding and identifying false sharing among threads" in the Related topics section) illustrates the false sharing between thread 0 and thread 1 as they modify two consecutive elements in sumLocal. The two threads are modifying different but adjacent elements of the array; because the elements are adjacent in memory, they fall on the same cache line. As illustrated, the cache line is loaded into the caches of CPU0 and CPU1 (grey arrows). Even though the threads are modifying different areas of memory (red and blue arrows), in order to maintain cache coherency the cache line is invalidated on all the processors that have loaded it, forcing an update.
False sharing can severely degrade the performance of an application, and it is hard to detect. Refer to the article "Memory issues on Multi-core platform – CS Liu" in the Related topics section for a simple program that demonstrates the ill effects of false sharing.
If you need to assume a cache line size for enforcing alignment, use 32 bytes; actual line sizes vary between processors.
The objective should ideally be to eliminate sharing altogether, not just false sharing. Software design should try to eliminate the need for locks and synchronization mechanisms, and sharing in general. See the article by Dmitriy Vyukov in the Related topics section for an important perspective.
False sharing is hard to detect, but there are a few tools, such as OProfile and the Data Race Detection (DRD) module of Valgrind, that can help.
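One common remedy for false sharing is to give each thread's data its own cache line. The sketch below is illustrative (the type names are assumptions, not from the article); it uses C11 _Alignas with a 64-byte line, a common size on current hardware, though as noted the size to assume varies by processor:

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size; adjust to the target processor */

/* Counters packed together: adjacent array elements land on the same
 * cache line, so threads writing neighboring elements falsely share it. */
typedef struct { double value; } packed_counter_t;

/* Each counter aligned to its own cache line: the struct size is padded
 * up to CACHE_LINE, so writers on different cores never touch the same line. */
typedef struct { _Alignas(CACHE_LINE) double value; } padded_counter_t;
```

Declaring sumLocal as an array of padded_counter_t instead of double would remove the false sharing in the earlier snippet, at the cost of some extra memory per thread.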
This design consideration is an extension of the previous two, avoiding memory contention and false sharing. As discussed above, the primary goal of the software designer should be to eliminate sharing so that there is no contention for resources between threads or processes. Some of the techniques discussed in the previous sections, such as the use of thread-local variables instead of global shared regions, can prevent both memory contention and false sharing. However, this technique is not applicable in all cases.
For example, if a data structure maintains the state of a resource, it might not be possible to keep a copy of it in each thread; the structure might have to be read and modified by all threads in the application. Hence, synchronization techniques are necessary to maintain the coherency and integrity of such shared data structures. Whenever locks or synchronization constructs protect a shared resource, multiple threads or processes might contend for those locks, potentially degrading performance.
On a multi-core, multiprocessor system, there might be room to run a significantly larger number of threads or processes concurrently. However, if these threads constantly contend with each other to access or modify shared resources or data structures, the overall throughput of the system drops, and the application cannot scale to utilize the available computing resources efficiently. In the worst cases of performance degradation due to lock contention, the performance of an application can actually drop as the number of cores or processors increases.
One way to reduce lock contention is to use lockless (lock-free) algorithms and data structures, which coordinate threads with atomic operations instead of locks.
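As a minimal sketch of the lockless idea, a shared counter can be updated with a compare-and-swap (CAS) retry loop instead of a lock. The names here are assumptions for illustration; C11 <stdatomic.h> is assumed available:

```c
#include <stdatomic.h>

static atomic_long lf_counter;   /* zero-initialized at program start */

/* Add delta without taking a lock: read the current value, try to swap
 * in the new one, and retry if another thread updated it in between. */
long lockfree_add(long delta)
{
    long old = atomic_load(&lf_counter);
    while (!atomic_compare_exchange_weak(&lf_counter, &old, old + delta)) {
        /* on failure, 'old' is reloaded with the latest value; retry */
    }
    return old + delta;
}
```

Threads still contend for the cache line holding the counter, but they never block: a preempted thread cannot stall the others, which is the property that distinguishes lock-free designs from lock-based ones.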
Detecting and then eliminating or reducing lock contention is important for improving the scalability of applications in multi-core, multiprocessor environments. Operating systems provide utilities to detect and measure performance bottlenecks due to lock contention. For example, Solaris provides the lockstat utility to measure lock contention in kernel modules. Similarly, the Linux kernel provides the lock statistics (lock_stat) and lockdep frameworks for detecting and measuring lock contention and performance bottlenecks. The Windows Performance Toolkit (Xperf) provides similar capabilities on Windows. See the Related topics section for more details.
C/C++ standard memory management routines are implemented using platform-specific memory management APIs, usually based on the concept of a heap. These library routines (whether the single-threaded or multithreaded version) allocate and free memory on a single heap. It is a global resource that is shared, and contended for, among the threads within a process. This heap contention is one of the bottlenecks of memory-intensive multithreaded applications.
The article "Memory issues on Multi-core platform – CS Liu" in the Related topics section provides an example of heap contention. In that reference, the author demonstrates that using a private heap per thread yields around a 3x performance gain compared to using the global heap.
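One way to apply the private-heap idea is a small per-thread arena (bump-pointer) allocator, so that frequent small allocations never touch the shared heap lock. This is a minimal sketch under assumed names and sizes, not a production allocator; in particular, it never frees individual blocks:

```c
#include <stddef.h>

#define ARENA_SIZE 4096          /* assumed per-thread arena size */

typedef struct {
    unsigned char buf[ARENA_SIZE];
    size_t        used;          /* bump pointer: bytes handed out so far */
} arena_t;

/* Allocate n bytes from this thread's private arena.  No locking is
 * needed because the arena is never shared between threads. */
void *arena_alloc(arena_t *a, size_t n)
{
    n = (n + 7) & ~(size_t)7;    /* round up to keep 8-byte alignment */
    if (a->used + n > ARENA_SIZE)
        return NULL;             /* arena exhausted */
    void *p = a->buf + a->used;
    a->used += n;
    return p;
}
```

Declaring one arena_t per thread (for example, with thread-local storage) keeps allocation lock-free; the whole arena is released at once when the thread finishes, which suits request- or task-scoped allocation patterns.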
Processor affinity is a thread or process attribute that tells the operating system which cores, or logical processors, a process can run on. This is more applicable in embedded software design.
The figure "System with multiple cores sharing an L2 cache" shows a configuration with two processors, each with two cores that share an L2 cache. In this configuration, the behavior of the cache subsystem differs between two cores on the same processor and two cores on different processors. If two related processes, or two threads of a process, are assigned to the two cores on the same processor, they can take better advantage of the shared L2 cache and lower the overhead of maintaining cache coherency.
Linux example:

/* Get a process's CPU affinity mask */
extern int sched_getaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *cpuset);

/* Set a process's CPU affinity mask */
extern int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *cpuset);

Windows example:

/* Set processor affinity */
BOOL WINAPI SetProcessAffinityMask(HANDLE hProcess, DWORD_PTR dwProcessAffinityMask);

/* Set thread affinity */
DWORD_PTR WINAPI SetThreadAffinityMask(HANDLE hThread, DWORD_PTR dwThreadAffinityMask);
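As a usage sketch of the Linux calls above, the hypothetical helper below pins the calling process to the first CPU it is currently allowed to run on. Error handling is minimal, and _GNU_SOURCE must be defined before any system header to expose the CPU_* macros:

```c
#define _GNU_SOURCE              /* for CPU_* macros and sched_*affinity */
#include <sched.h>

/* Hypothetical helper: restrict the calling process to the first CPU
 * currently in its allowed set.  Returns 0 on success, -1 on failure. */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t cur, one;
    if (sched_getaffinity(0, sizeof(cur), &cur) != 0)   /* pid 0 = caller */
        return -1;
    for (int c = 0; c < CPU_SETSIZE; c++) {
        if (CPU_ISSET(c, &cur)) {
            CPU_ZERO(&one);
            CPU_SET(c, &one);
            return sched_setaffinity(0, sizeof(one), &one);
        }
    }
    return -1;
}
```

Pinning related threads to cores on the same processor, as described above, lets them share the L2 cache; pinning unrelated, contention-heavy threads to different processors can reduce interference instead.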
A software designer can consider two different programming models when assigning work to threads in an application. They are:
Functional decomposition
Domain decomposition (also called data decomposition)
The types of operations and the characteristics of the data that the software operates on influence the choice of model, but it is important to understand how each model performs in multiprocessor or multi-core environments.
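As a brief sketch of domain (data) decomposition, the example below splits an array sum across threads: every thread runs the same function but owns a distinct slice of the data and accumulates into a local variable, so the threads share nothing while they work. The names and thread count are illustrative assumptions:

```c
#include <pthread.h>

#define DD_THREADS 4
#define DD_N       1000          /* assumed to divide evenly by DD_THREADS */

static double dd_data[DD_N];
static double dd_partial[DD_THREADS];   /* one result slot per thread */

static void *dd_sum_slice(void *arg)
{
    long id = (long)arg;
    int lo = id * (DD_N / DD_THREADS);
    int hi = lo + (DD_N / DD_THREADS);
    double s = 0.0;              /* local accumulator: no sharing in the loop */
    for (int i = lo; i < hi; i++)
        s += dd_data[i];
    dd_partial[id] = s;          /* single write per thread at the end */
    return NULL;
}

/* Fill the array with ones, sum it across DD_THREADS threads, and
 * combine the per-thread partial sums. */
double parallel_sum(void)
{
    pthread_t t[DD_THREADS];
    for (int i = 0; i < DD_N; i++)
        dd_data[i] = 1.0;
    for (long i = 0; i < DD_THREADS; i++)
        pthread_create(&t[i], NULL, dd_sum_slice, (void *)i);
    double total = 0.0;
    for (int i = 0; i < DD_THREADS; i++) {
        pthread_join(t[i], NULL);
        total += dd_partial[i];
    }
    return total;
}
```

In a functional decomposition of the same program, each thread would instead perform a different operation (for example, one summing while another searches), which tends to share data structures more and therefore contend more.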
The reference "Software Design Issues for Multi-core Multi-processor Systems" in the Related topics section has a detailed description of these programming models, the challenges in applying them to software design for multi-core, multiprocessor architectures, and their advantages.
This article briefly explained the characteristics of multi-core, multithreaded environments, touched on some of the key issues that can cause performance degradation or impede the scalability of applications on multi-core, multiprocessor systems, and discussed key considerations and techniques for designing software for such environments. Software designed with these considerations in mind can utilize the available computing resources efficiently and avoid performance and scalability problems.