Software engineering tends to be an evolutionary process. Problems are addressed with solutions that can create new problems and subsequently new solutions. Ideally, the new problems that are created justify the original solution. The technology discussed here is one of the secondary solutions to a problem created by server virtualization. Before delving into KSM, however, let’s take a quick look at the original solution and how KSM is applied here and elsewhere.
Virtualization as a technology has been around since the 1960s, popularized by the IBM® System/360® mainframe. Five decades later, virtualization has exploded, making it possible to share a single server with multiple operating systems and applications. This particular use, called server virtualization, is transforming data centers, as a single physical machine can be used to host typically 10 or more virtual machines (VMs), as shown in Figure 1. This virtualization enables the infrastructure to be much more dynamic, power efficient, and (therefore) cost-efficient. See Related topics for more information on server virtualization and its benefits.
Figure 1. Server consolidation through virtualization
With only 10-15% of a typical server in use at any given time, consolidating virtual servers onto a single physical server makes perfect sense. But what about resources? The major resources available include CPU, memory, and networking bandwidth. CPU and networking are typically underutilized, so the real issue becomes available memory. Each operating system makes its own unique demands on available memory resources. But the question that arises is, how unique are those demands?
It turns out that if you virtualize many of the same operating system and application sets, many memory pages are identical. This makes sense, given that operating system and application code and constant data are identical among VMs. And when pages are unique, they can be consolidated, which frees memory for use in other applications. Figure 2 illustrates memory sharing and shows the benefit of more free memory available when pages are shared among VMs whose contents are the same.
Figure 2. Memory sharing across VMs
What you’ll soon discover is that although memory sharing in Linux is advantageous in virtualized environments (KSM was originally designed for use with the Kernel-based Virtual Machine [KVM]), it’s also useful in non-virtualized environments. In fact, KSM was found to be beneficial even in embedded Linux systems, indicating the flexibility of the approach. Now, let’s explore the Linux approach to memory sharing and how you can use it to increase the memory density of a server and therefore increase its ability to host additional applications or VMs.
Parallels to other technologies
A recent advancement in storage technologies called de-duplication is a precursor to memory sharing in Linux and other hypervisors. De-duplication is a technology in which stored data is reduced by removing redundant data (either on a block basis or on larger segments of data such as files). Common segments are merged (in a copy-on-write [CoW] fashion), freeing space for other uses. Storage de-duplication is an important technology moving forward, as it optimizes the capacity of storage by removing duplicate data. In this way, storage is cheaper, because less is ultimately required. Given the rate of data growth, this functionality is very relevant.
KSM exists as a daemon in the kernel (called
ksmd) that periodically performs page scans to identify duplicate pages and merges duplicates to free pages for other uses. It does this in a way that’s transparent to the user. For example, duplicate pages are merged (and subsequently marked read only), but if one of the users of the page changes it for any reason, that user will receive his or her own copy (in a CoW fashion). You can find the full implementation of the KSM kernel module within the kernel source at ./mm/ksm.c.
KSM relies on a higher-level application to provide instruction on which memory regions are candidates for merging. Although it would be possible for KSM to simply scan all anonymous pages in the system, it would be wasteful of CPU and memory (given the space necessary to manage the page-merging process). Therefore, applications can register the virtual areas that are likely to contain duplicate pages.
The KSM application programming interface (API) is implemented through the
madvise system call (see Listing 1) and a new advice parameter called
MADV_MERGEABLE (which indicates that the defined region is mergeable). A region can be removed from the mergeable state through the
MADV_UNMERGEABLE parameter (which immediately un-merges any merged pages from that region). Note that removing a region of pages through
madvise can result in an
EAGAIN error, as the operation could run out of memory during the un-merge process, potentially leading to even greater trouble (out-of-memory conditions).
Listing 1. The madvise system call
#include <sys/mman.h> int madvise( void *start, size_t length, int advice );
Once a region is defined as mergeable, KSM adds this region to its list of working memory. When KSM is enabled, it searches for identical pages, keeping one page in a write-protected CoW fashion and freeing one for other uses.
KSM uses a approach that’s different from the approach used in storage de-duplication. In traditional de-duplication, objects are hashed, and the hash value is used as the initial check for similarity. When hashes are identical, the next step is a comparison of the actual objects (in this case, a memory comparison) to formally determine whether the objects are identical. KSM used this approach in its first implementation, but a more straightforward approach was developed to simplify it.
In the current KSM, pages are managed by two red-black trees, one of which is ephemeral. The first tree, call the unstable tree, is used to store new pages that are not yet understood to be stable. In other words, pages that are candidates for merging (unchanged for some period of time) are stored in the unstable tree. Pages in the unstable tree are not write protected. The second tree, called the stable tree, stores those pages that have been found to be stable and merged by KSM. To identify whether a page is volatile or non-volatile, KSM uses a simple 32-bit checksum. When a page is scanned, its checksum is calculated and stored with the page. On a subsequent scan, if a newly computed checksum is different from the previously generated checksum, the page is changing and is therefore not a good candidate for merging.
The first step in handling a single page using KSM’s process is to see whether the page can be found in the stable tree. The process of searching the tree is interesting in that each page is treated as a very large number (the contents of the page). A
memcmp (memory compare) operation is performed on the page and node’s page of consideration. If the
memcmp returns 0, the pages are equal and a match is found. Otherwise,
memcmp can return -1 (which means that the candidate page is less than the current node’s page) or 1 (the candidate page is greater than the current node’s page). In each case, the operation leads the search down the red-black tree (to the left if less than and to the right if greater than). Although comparing 4KB pages seems fairly heavyweight, in most cases, the
memcmp will end prematurely once a difference is found. Therefore, the process itself is fast and efficient. See Figure 3 for a visualization of this process.
Figure 3. Search process for a page in the tree
If the candidate page is found in the stable tree, then it is merged and the candidate page freed. This code can be found in ksm.c /
stable_tree_search() (which is called by ksm.c /
cmp_and_merge_page()). Otherwise, if the candidate page is not found, you move on to the unstable tree (see ksm.c /
The first step in the unstable tree search is to recalculate the checksum over the page. If it’s found to be different than the original checksum, the page is discarded from further search in this scan (because if it changes, it’s not worthwhile to track). If the checksum has not changed, the unstable tree is searched for the candidate page. The unstable tree processing is a bit different than the stable tree. First, if the search does not find the page in the unstable tree, you add a new node in the unstable tree for this page. But if the page is found in the unstable tree, you merge the pages, and then migrate the node to the stable tree.
When the scan is complete (performed through ksm.c /
ksm_do_scan()), the stable tree is kept, but the unstable tree is removed and rebuilt at the time of the next scan. This process simplifies things quite a bit, as the organization of the unstable tree can change based upon changes to the pages (recall that pages in the unstable tree are not write protected). Because all pages in the stable tree are write protected, a page fault is generated when a page tries to be written, allowing the CoW process to un-merge the page for the writer (see ksm.c /
break_cow()). Orphaned pages in the stable tree are subsequently removed (unless two or more users of the page exist, indicating that the page is still shared).
As discussed, KSM uses red-black trees for managing pages for fast lookup. Linux actually includes red-black trees as a reusable data structure, and they can be used generically. Red-black trees are also used by the Completely Fair Scheduler (CFS) for storage of tasks in time order. You can find the implementation of red-black trees in the kernel at ./lib/rbtree.c.
KSM configuring and monitoring
Managing and monitoring KSM occurs through sysfs (at the root /sys/kernel/mm/ksm). In this sysfs subdirectory, you’ll find a collection of files, some used for control and others used for monitoring.
The first file, run, is used to enable or disable paging merging with KSM. By default, KSM is disabled (
0) but can be enabled by writing a
1 to the file to enable the KSM daemon (for example,
echo 1 > sys/kernel/mm/ksm/run). From the run state, you can disable the daemon (but keep the current set of merged pages) by writing a
0. Also from the run (
1) state, you can stop KSM and request that all merged pages become un-merged by writing a
While KSM runs, you can control it with three parameters (files within sysfs). The sleep_millisecs file defines how many milliseconds
ksmd should sleep before performing another page scan. The max_kernel_pages file defines the maximum number of pages that
ksmd can use (the default is 25% of the available memory, but you can write a
0 to specify no limit). Finally, the pages_to_scan file defines how many pages can be scanned in a given scan. Any user can view these files, but the user must have root privileges to change them.
There are also five monitorable files exported through sysfs that indicate
ksmd‘s operation and effectiveness (these are read only). The full_scans file indicates the number of full area scans that have been performed. The remaining four files indicate the page-level statistics of KSM:
- pages_shared: The number of unswappable kernel pages that KSM is using
- pages_sharing: An indication of memory savings
- pages_unshared: The number of unique pages repeatedly checked for merging
- pages_volatile: The number of pages that are changing too often
The KSM authors define that a high ratio of pages_sharing to pages_shared indicates efficient sharing of pages (whereas the reverse is indicative of wasted effort).
Linux is not alone in using page sharing to improve memory efficiency, but it is unique in its implementation as an operating system feature. VMware’s ESX server hypervisor provides this feature under the name Transparent Page Sharing (TPS), while XEN calls it Memory CoW. But whatever the name or implementation, the feature provides better memory utilization, allowing the operating system (or hypervisor, in the case of KVM) to over-commit memory to support greater numbers of applications or VMs. You can find KSM — and many other interesting features — in the latest 2.6.32 Linux kernel.