The purpose of this guide is to provide performance tuning procedures for IBM Spectrum Scale Shared Nothing architecture clusters, including the File Placement Optimizer (FPO) clusters. This guide does not include an overall description of the IBM Spectrum Scale product or instructions on the deployment of IBM Spectrum Scale. Before reading this guide, see the IBM Spectrum Scale 4.2 Advanced Administration Guide at IBM Spectrum Scale knowledge center and IBM Spectrum Scale Hadoop WIKI for more information about IBM Spectrum Scale and the data analytics solution using IBM Spectrum Scale-FPO.

Some tuning options cannot be applied after a file system is created. Therefore, it’s recommended to read through this guide entirely before creating any file system.

This guide might be updated periodically. Therefore, see GPFS developerWorks Wiki Analytics Reference sites for the latest version of this guide.

Operating system configuration and tuning
Perform the following steps to configure and tune a Linux system:

Step 1: deadline disk scheduler
Change all the disks defined to IBM Spectrum Scale to use the ‘deadline’ queue scheduler (cfq is the default for some distros, such as RHEL 6).
For each block device defined to IBM Spectrum Scale, run the following command to enable the deadline scheduler:
echo "deadline" > /sys/block/<device>/queue/scheduler

Changes made in this manner (echo’ing changes to sysfs) do not persist over reboots. To make these changes permanent, either enable them in a script that runs on every boot or, generally preferred, create a udev rule.

The following sample script sets the deadline scheduler for all disks in the cluster that are defined to IBM Spectrum Scale (this example must be run on a node with passwordless access to all the other nodes):
#!/bin/bash
/usr/lpp/mmfs/bin/mmlsnsd -X | /bin/awk ' { print $3 " " $5 } ' | /bin/grep dev |
while read device node ; do
device=$(echo $device | /bin/sed 's/\/dev\///' )
/usr/lpp/mmfs/bin/mmdsh -N $node "echo deadline > /sys/block/$device/queue/scheduler"
done

As previously stated, changes made by echo’ing to sysfs files (as in this example script) take effect immediately but do not persist over reboots. One approach to making such changes permanent is to enable a udev rule, such as the following example rule that forces all block devices to use the deadline scheduler after rebooting. To enable this rule, create the following file as ‘/etc/udev/rules.d/99-hdd.rules’:
ACTION=="add|change", SUBSYSTEM=="block", ATTR{device/model}=="*", ATTR{queue/scheduler}="deadline"
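After applying the rule (or echo’ing directly), the active scheduler can be verified by reading the same sysfs file; the kernel marks the active scheduler in brackets, e.g. `noop [deadline] cfq`. The following sketch shows a small helper (the function name is illustrative, not an IBM-provided tool) that extracts the active scheduler from that output:

```shell
# Extract the bracketed (active) scheduler from a line such as
# "noop [deadline] cfq", the format of /sys/block/<device>/queue/scheduler
active_scheduler() {
  echo "$1" | sed 's/.*\[\(.*\)\].*/\1/'
}

# Usage against a real device:
# active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
```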

Step 2 below includes an example of how to create udev rules that apply only to the devices used by IBM Spectrum Scale.

Step 2: disk IO parameter change
To further tune the block devices used by IBM Spectrum Scale, run the following commands from the console on each node:
echo 16384 > /sys/block/<device>/queue/max_sectors_kb
echo 256 > /sys/block/<device>/queue/nr_requests
echo 32 > /sys/block/<device>/device/queue_depth

These block device tuning settings are large enough for most SAS/SATA disks. The value chosen for /sys/block/<device>/queue/max_sectors_kb must be less than or equal to /sys/block/<device>/queue/max_hw_sectors_kb. Many SAS/SATA devices allow setting max_sectors_kb to 16384, but not all devices accept this value. If your device does not accept the recommended block device tuning values, try smaller values, cutting the recommendation in half until the tuning succeeds. For example, if setting max_sectors_kb to 16384 results in a “write error”:
echo 16384 > /sys/block/sdd/queue/max_sectors_kb
-bash: echo: write error: Invalid argument

Try setting max_sectors_kb to 8192:
echo 8192 > /sys/block/sdd/queue/max_sectors_kb

If your disk is not SAS/SATA, check the disk specification from the disk vendor for tuning recommendations.
Note: If the max_sectors_kb of your disks is small (e.g. 256 or 512) and you cannot tune it upward (i.e., you get an “invalid argument” error as in the example above), disk performance might be impacted, because IBM Spectrum Scale IO requests might be split into several smaller requests according to the max_sectors_kb limit at the block device level.
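The halving approach described above can be sketched as a small shell helper (hypothetical, not an IBM-provided tool; the SYSFS_ROOT variable is overridable only so the logic can be exercised outside a real /sys):

```shell
# Try the recommended max_sectors_kb value and halve it until the
# device accepts the write, as recommended above.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

set_max_sectors_kb() {
  local dev=$1 kb=$2
  while [ "$kb" -ge 256 ]; do
    # stop at the first value the device accepts
    if echo "$kb" > "$SYSFS_ROOT/block/$dev/queue/max_sectors_kb" 2>/dev/null; then
      echo "$kb"
      return 0
    fi
    kb=$((kb / 2))   # halve and retry
  done
  return 1
}

# Usage: set_max_sectors_kb sdd 16384
```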

As discussed in the “Step 1: deadline disk scheduler” tuning recommendations, any tuning done by echo’ing to sysfs files will be lost when a node reboots. To make such a tuning permanent, either create appropriate udev rules or place these commands in a boot file that is run on each reboot.
As udev rules are the preferred way of accomplishing this kind of block device tuning, the following is an example of a generic udev rule that enables the block device tuning recommended in steps 1 and 2 for all block devices. This rule can be enabled by creating it as the file “/etc/udev/rules.d/100-hdd.rules”:
ACTION=="add|change", SUBSYSTEM=="block", ATTR{device/model}=="*", ATTR{queue/nr_requests}="256", ATTR{device/queue_depth}="32", ATTR{queue/max_sectors_kb}="16384"

If it is not desirable to tune all block devices with the same settings, multiple rules can be created with specific tuning for the appropriate devices. To create such device-specific rules, you can use the ‘KERNEL’ match key to limit which devices a udev rule applies to (e.g., KERNEL=="sdb"). The following example script can be used to create udev rules that tune only the block devices used by IBM Spectrum Scale:

#clean up any existing /etc/udev/rules.d/100-hdd.rules files
/usr/lpp/mmfs/bin/mmdsh -N All "rm -f /etc/udev/rules.d/100-hdd.rules"

#collect all disks in use by GPFS and create udev rules one disk at a time
/usr/lpp/mmfs/bin/mmlsnsd -X | /bin/awk ' { print $3 " " $5 } ' | /bin/grep dev |
while read device node ; do
device=$(echo $device | /bin/sed 's/\/dev\///' )
echo $device $node
echo "ACTION==\"add|change\", SUBSYSTEM==\"block\", KERNEL==\"$device\", ATTR{device/model}==\"*\", ATTR{queue/nr_requests}=\"256\", ATTR{device/queue_depth}=\"32\", ATTR{queue/max_sectors_kb}=\"16384\" "> /tmp/100-hdd.rules
/usr/bin/scp /tmp/100-hdd.rules $node:/tmp/100-hdd.rules
/usr/lpp/mmfs/bin/mmdsh -N $node "cat /tmp/100-hdd.rules >>/etc/udev/rules.d/100-hdd.rules"
done

Note that the previous example script must be run from a node that has ssh access to all nodes in the cluster. It creates udev rules that will apply the recommended block device tuning on future reboots. To put the recommended tuning values from steps 1 and 2 into effect immediately, the following example script can be used:
/usr/lpp/mmfs/bin/mmlsnsd -X | /bin/awk ' { print $3 " " $5 } ' | /bin/grep dev |
while read device node ; do
device=$(echo $device | /bin/sed 's/\/dev\///' )
/usr/lpp/mmfs/bin/mmdsh -N $node "echo deadline > /sys/block/$device/queue/scheduler"
/usr/lpp/mmfs/bin/mmdsh -N $node "echo 16384 > /sys/block/$device/queue/max_sectors_kb"
/usr/lpp/mmfs/bin/mmdsh -N $node "echo 256 > /sys/block/$device/queue/nr_requests"
/usr/lpp/mmfs/bin/mmdsh -N $node "echo 32 > /sys/block/$device/device/queue_depth"
done

Step 3: disk cache checking
On clusters that run Hadoop/Spark workloads, disks used by IBM Spectrum Scale must have physical disk write caching disabled, regardless of whether RAID adapters are used for these disks.
When running other (non-Hadoop/Spark) workloads, write caching on the RAID adapter(s) can be enabled if the local RAID adapter cache is battery protected, but the write cache on the physical disks must still not be enabled.
Check the specification for your RAID adapter to figure out how to turn on/off the RAID adapter write cache, as well as the physical disk write cache.
For common SAS/SATA disks without RAID adapter, run the following command to check whether the disk in question is enabled with physical disk write cache:
sdparm --long /dev/<device> | grep WCE

If WCE is 1, it means the disk write cache is on.
The following commands can be used to turn on/off physical disk write caching:
# turn on physical disk cache
sdparm -S -s WCE=1 /dev/<device>

# turn off physical disk cache
sdparm -S -s WCE=0 /dev/<device>

Note: The physical disk read cache must be enabled no matter what kind of disk is used. For SAS/SATA disks without RAID adapters, run the following command to check whether the disk read cache is enabled or not:
sdparm --long /dev/<device> | grep RCD
If the value of RCD (Read Cache Disable) is 0, the physical disk read cache is enabled. On Linux, usually the physical disk read cache is enabled by default.
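For scripted checks across many disks, the WCE line can be parsed directly. The sketch below assumes the value is the second whitespace-delimited field of the sdparm WCE line (as in typical `sdparm --long` output, e.g. "WCE  1  [cha: y, def: 0, sav: 0]  Write cache enable"); the function name is illustrative:

```shell
# Report the write-cache state from a WCE line of `sdparm --long` output.
wce_state() {
  echo "$1" | awk '/WCE/ { if ($2 == 1) print "on"; else print "off" }'
}

# Usage: wce_state "$(sdparm --long /dev/sdb | grep WCE)"
```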

Step 4: Tune vm.min_free_kbytes to avoid potential memory exhaustion problems.
When vm.min_free_kbytes is set to its default value, some configurations can encounter memory exhaustion symptoms when free memory should be available. It is recommended that vm.min_free_kbytes be set to between 5 and 6 percent of the total amount of physical memory, but no more than 2GB should be allocated for this reserved memory.
To tune this value, add the following into /etc/sysctl.conf and then run ‘sysctl -p’ on Red Hat or SuSE:
vm.min_free_kbytes = <your-min-free-kbytes>
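The sizing rule above can be computed as follows (a sketch with an illustrative function name; it uses 5% of memory and the 2GB cap, i.e. 2097152 KB):

```shell
# Compute a vm.min_free_kbytes value as 5% of physical memory (in KB),
# capped at 2GB (2097152 KB), per the recommendation above.
calc_min_free_kbytes() {
  local total_kb=$1
  local v=$(( total_kb * 5 / 100 ))
  local cap=2097152
  [ "$v" -gt "$cap" ] && v=$cap
  echo "$v"
}

# Usage, reading MemTotal (in KB) from /proc/meminfo:
# calc_min_free_kbytes "$(awk '/MemTotal/ {print $2}' /proc/meminfo)"
```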

Step 5: OS network tuning
If your network adapter is a 10Gb Ethernet adapter, you can put the following into /etc/sysctl.conf and then run “/sbin/sysctl -p /etc/sysctl.conf” on each node:
sunrpc.udp_slot_table_entries = 128
sunrpc.tcp_slot_table_entries = 128
net.core.netdev_max_backlog = 300000
net.core.somaxconn = 10000
net.ipv4.tcp_rmem = 4096 4224000 16777216
net.ipv4.tcp_wmem = 4096 4224000 16777216

If your cluster is based on InfiniBand adapters, see the guide from your InfiniBand adapter vendor.
If you bond two adapters and configure xmit_hash_policy=layer3+4 with bond mode 4 (802.3ad, the recommended bond mode), IBM Spectrum Scale on one node has only one TCP/IP connection to any other node in the cluster for data transfer. This might confine the traffic between two nodes to a single physical link if the network traffic is not heavy.
If your cluster size is not large (e.g. one physical switch is enough for your cluster nodes), you could try bonding mode 6 (balance-alb, which needs no special support from the switch). This might give better network bandwidth compared with bonding mode 4 (802.3ad, which requires switch support). See the link about the advantages and disadvantages of Linux bonding 802.3ad versus balance-alb mode.

IBM Spectrum Scale configuration and tuning
Perform the following steps to tune the IBM Spectrum Scale cluster and file systems:

Step 1: Data replica and metadata replica:
While creating IBM Spectrum Scale filesystems, ensure that the replication settings meet the data protection needs of the cluster.
For a production cluster over internal disks, it is recommended to use a replication factor of 3 for both data and metadata. If you have battery-protected local RAID5 or RAID6 adapters, you can use a replication factor of 2 for the data.
When a filesystem is created, the default numbers of copies of data and metadata are respectively defined by the ‘-r’ (DefaultDataReplicas) and ‘-m’ (DefaultMetadataReplicas) options to the ‘mmcrfs’ command. The values of -R (MaxDataReplicas) and -M (MaxMetadataReplicas) cannot be changed after the file system is created. Therefore, it is recommended to set -R and -M to 3 for the flexibility to change the replication factor in the future.
The first instance (copy) of the data is referred to as the first replica. For example, setting DefaultDataReplicas=1 (via the ‘-r 1’ option to mmcrfs) results in only a single copy of each piece of data, which is typically not desirable for a shared-nothing environment.

Query the number of replicas kept for any given filesystem by running the command:
/usr/lpp/mmfs/bin/mmlsfs <fsname> | egrep " -r| -m"

Change the level of data and metadata replication for any file system by running mmchfs by using the same ‘-r’ (DefaultDataReplicas) and ‘-m’ (DefaultMetadataReplicas) flags to change the default replication options and then mmrestripefs (with the ‘-R’ flag) to restripe the filesystem to match the new default replication options.
For example:
/usr/lpp/mmfs/bin/mmchfs <fsname> -r <DefaultDataReplicas> -m <DefaultMetadataReplicas>
/usr/lpp/mmfs/bin/mmrestripefs <fsname> -R
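As a concrete illustration (the filesystem name “gpfs1” is hypothetical), raising both data and metadata replicas to 3 and restriping existing files to match would look like:

```shell
# Raise default data and metadata replication to 3 on filesystem gpfs1,
# then restripe so existing files match the new defaults
/usr/lpp/mmfs/bin/mmchfs gpfs1 -r 3 -m 3
/usr/lpp/mmfs/bin/mmrestripefs gpfs1 -R
```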

Step 2: Additional considerations for file system:

While creating the file system, consider tuning /usr/lpp/mmfs/bin/mmcrfs parameters based on your applications:
-L: By default, the file system log file is 4MB. It is recommended that any filesystem be created with at least a 16MB log (-L 16M); if your application is metadata-operation sensitive, you could increase this to 32MB (-L 32M).
-E: By default, exact mtime is enabled ("yes"). If your application does not depend on exact mtime, you could change this to "-E no" for better performance.
-S: Suppress atime updates (it is "no" by default). If your application does not depend on exact atime, "-S yes" will perform better.
--inode-limit MaxNumInodes: Usually the default MaxNumInodes is adequate for a small number of files. If you don’t know how many files your file system will hold, it is recommended to specify this value large enough to avoid “No space” errors caused by all inodes being used up. For Spectrum Scale 4.1+, the default inode size is 4K. You can calculate the maximum number of inodes as: (meta-data-disk-size * meta-data-disk-number)/(inode-size * DefaultMetadataReplicas)
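A worked example of the formula above, with hypothetical numbers (4 metadata disks of 1 TB = 1073741824 KB each, 4 KB inodes, DefaultMetadataReplicas=3):

```shell
# Maximum inode count = (disk size * disk count) / (inode size * replicas),
# all sizes in the same unit (KB here)
max_inodes() {
  local disk_size_kb=$1 disk_count=$2 inode_size_kb=$3 replicas=$4
  echo $(( disk_size_kb * disk_count / (inode_size_kb * replicas) ))
}

# Usage: max_inodes 1073741824 4 4 3
```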

Step 3: Define the data and the metadata distribution across the NSD server nodes in the cluster:
Ensure that clusters larger than 4 nodes are not defined with a single (dataAndMetadata) system storage pool.
For performance and RAS reasons, it is recommended that data and metadata be separated in some configurations (which means that not all the storage is defined to use a single dataAndMetadata system pool).
These guidelines focus on the RAS considerations related to the implications of losing metadata servers from the cluster. In IBM Spectrum Scale Shared Nothing configurations (which recommend setting the ‘unmountOnDiskFail=meta’ option), a given filesystem is unmounted when the number of nodes experiencing metadata disk failures is equal to or greater than the value of the DefaultMetadataReplicas option defined for the filesystem (the ‘-m’ option to the ‘mmcrfs’ command as per above). So, for a filesystem with the typically configured value DefaultMetadataReplicas=3, the filesystem will unmount if metadata disks in three separate locality group IDs fail (when a node fails, all the internal disks in that node will be marked down). Note that all the disks in the same filesystem on a given node must have the same locality group ID. The Locality ID refers to all three elements of the extended failure group topology vector (e.g. the vector 2,1,3 could represent rack 2, rack position 1, node 3 in this portion of the rack). To avoid filesystem unmounts associated with losing too many nodes serving metadata, it is recommended that the number of metadata servers be limited when possible. Also metadata servers must be distributed evenly across the cluster, to avoid the case of a single hardware failure (such as the loss of a frame/rack or network switch) leading to multiple metadata node failures.

Some suggestions for separation of data and metadata based on cluster size:

1. If you are not considering the IOPS requirements of metadata IO operations, usually 5% of the total disk size in the file system must be reserved for metadata. If you can predict how many files the file system will hold and the average file size, the required metadata space can be estimated roughly.
2. In a shared nothing framework, it is recommended that all nodes have similar disks in both number and capacity. Otherwise, disks on nodes with fewer or smaller disks can become hot disks as they run out of space.
3. As for the number of nodes that are considered as one virtual rack, it is recommended to keep the number of nodes in each virtual rack roughly equal.
4. It is always recommended to configure SSDs or other fast disks as metadataOnly disks. This speeds up some maintenance operations, such as mmrestripefs, mmdeldisk, and mmchdisk.
5. If you are not sure about the failure group definition, contact

Step 4: When running a shared nothing cluster, choose a failure group mapping scheme suited to Spectrum Scale.
Defining more than 32 failure group IDs for a given filesystem will slow down concurrent disk space allocation operations, such as the restripe operation ‘mmrestripefs -b’.
On FPO-enabled clusters, defining more than 32 locality groups per failure group ID will slow down restripe operations, such as ‘mmrestripefs -r’.
To define an IBM Spectrum Scale FPO-enabled cluster containing a storage pool, set the option allowWriteAffinity to yes. This option can be checked by running the ‘mmlspool all -L’ command. In FPO-enabled clusters, currently all disks on the same node must be assigned to the same locality group ID (a three-integer vector x,y,z), which also defines a failure group ID. It is recommended that failure group IDs refer to sets of common resources, with nodes sharing a failure group ID having a common point of failure, such as a shared rack or a network switch.
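As an illustration of the extended failure group vector, the following hypothetical NSD stanzas (node names and devices are made up) give each node its own topology vector while keeping both nodes in the same rack; note that both disks on node1 share the same vector:

```
%nsd: device=/dev/sdb nsd=node1_sdb servers=node1 usage=dataAndMetadata failureGroup=1,0,1 pool=system
%nsd: device=/dev/sdc nsd=node1_sdc servers=node1 usage=dataAndMetadata failureGroup=1,0,1 pool=system
%nsd: device=/dev/sdb nsd=node2_sdb servers=node2 usage=dataAndMetadata failureGroup=1,0,2 pool=system
```

Here 1,0,2 could represent rack 1, rack position 0, node 2.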

Step 5: Do not configure allowWriteAffinity=yes for a metadataOnly system pool.
For a ‘metadataOnly’ storage pool (not a dataAndMetadata pool), set allowWriteAffinity to no. Setting allowWriteAffinity to yes for a metadataOnly storage pool slows down inode allocation for the pool.

Step 6: Any FPO-enabled storage pool (any pool with allowWriteAffinity=yes defined) must define blockGroupFactor to be larger than 1 (regardless of the value of writeAffinityDepth).
When allowWriteAffinity is enabled, more RPC (Remote Procedure Call) activity might occur compared to the case of setting allowWriteAffinity=no.
To reduce some of the RPC overhead associated with setting allowWriteAffinity=yes, it is recommended that blockGroupFactor be set greater than 1 for pools with allowWriteAffinity enabled. Starting point recommendations are blockGroupFactor=2 (for general workloads), blockGroupFactor=10 (for database workloads), and blockGroupFactor=128 (for Hadoop workloads).

Step 7: Tune the block size for storage pools defined to Spectrum Scale.
For storage pools containing both data and metadata (pools defined as dataAndMetadata), a block size of 1M is recommended.
For storage pools containing only data (pools defined as dataOnly), a block size of 2M is recommended.
For storage pools containing only metadata (pools defined as metadataOnly), a block size of 256K is recommended.

The following sample pool stanzas (used when creating NSDs via the ‘mmcrnsd’ command) are based on the tuning suggestions from steps 4-7:
#for a metadata only system pool:
%pool: pool=system blockSize=256K layoutMap=cluster allowWriteAffinity=no
#for a data and metadata system pool:
%pool: pool=system blockSize=1M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=2
#for a data only pool:
%pool: pool=datapool blockSize=2M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=10

Step 8: Tune the size of the IBM Spectrum Scale pagepool by setting the pool size of each node to be between 10% and 25% of the real memory installed.
Note that the Linux buffer pool cache is not used for IBM Spectrum Scale filesystems. The recommended size of the pagepool depends on the workload and the expectations for improvements due to caching. A good starting point recommendation is somewhere between 10% and 25% of real memory. If machines with different amounts of memory are installed, use the ‘-N’ option to mmchconfig to set different values according to the memory installed on the machines in the cluster. Though these are good starting points for performance recommendations, some customers use relatively small page pools, such as between 2-3% of real memory installed, particularly for machines with more than 256GB installed.
The following example shows how to set a page pool size equal to 10% of the memory (this assumes all the nodes have the same amount of memory installed):
TOTAL_MEM=$(cat /proc/meminfo | grep MemTotal | tr -d "[:alpha:]" | tr -d "[:punct:]" | tr -d "[:blank:]")
PAGE_POOL=$((TOTAL_MEM / 10 / 1024))   # 10% of total memory (KB), converted to MB
mmchconfig pagepool=${PAGE_POOL}M -i

Step 9: Change the following IBM Spectrum Scale configuration options and then restart IBM Spectrum Scale.
Note: For IBM Spectrum Scale 4.2.0 PTF3, 4.2.1, and later, the restart of IBM Spectrum Scale can be delayed until the next step, because tuning workerThreads will require a restart.
Set each configuration option individually:
mmchconfig readReplicaPolicy=local
mmchconfig unmountOnDiskFail=meta
mmchconfig maxStatCache=512 or mmchconfig maxStatCache=0
mmchconfig restripeOnDiskFailure=yes
mmchconfig nsdThreadsPerQueue=10
mmchconfig nsdMinWorkerThreads=48
mmchconfig prefetchaggressivenesswrite=0
mmchconfig prefetchaggressivenessread=2

Set all the configuration option at once by using the mmchconfig command:
mmchconfig readReplicaPolicy=local,unmountOnDiskFail=meta,maxStatCache=512,restripeOnDiskFailure=yes,nsdThreadsPerQueue=10,nsdMinWorkerThreads=48,prefetchaggressivenesswrite=0,prefetchaggressivenessread=2

The maxMBpS tuning option must be set as per the network bandwidth available to IBM Spectrum Scale. If you are using one 10 Gbps link for the IBM Spectrum Scale network traffic, the default value of 2048 is appropriate. Otherwise scale the value of maxMBpS to be about twice the value of the network bandwidth available on a per node basis.

For example, for two bonded 10 Gbps links an appropriate setting for maxMBpS is:
mmchconfig maxMBpS=4000 # this example assumes a network bandwidth of about 2GB/s (or 2 bonded 10 Gbps links) available to Spectrum Scale
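The sizing rule above can be sketched as simple arithmetic (illustrative helper name; it approximates one 10 Gbps link as roughly 1000 MB/s, as the example does):

```shell
# maxMBpS ~= 2 x per-node network bandwidth in MB/s,
# approximating one 10 Gbps link as ~1000 MB/s
calc_maxmbps() {
  local links_10gbps=$1   # number of bonded 10 Gbps links per node
  echo $(( 2 * links_10gbps * 1000 ))
}

# Usage: mmchconfig maxMBpS=$(calc_maxmbps 2)   # two bonded 10 Gbps links
```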
Note: In special user scenarios (e.g. an active-active disaster recovery deployment), restripeOnDiskFailure must be configured as “no” for an internal disk cluster.

Some of these configuration options do not take effect until IBM Spectrum Scale is restarted.

Step 10: Depending on the level of code installed, follow the tuning recommendation for Case A or Case B:
A) If running Spectrum Scale 4.2.0 PTF3, 4.2.1, or any higher level, either set workerThreads to 512 or to 8*(cores per node); both require a restart of IBM Spectrum Scale to take effect. For lower code levels, set worker1Threads to 72 (with the ‘-i’, immediate, option to mmchconfig, which does not require restarting IBM Spectrum Scale).
mmchconfig workerThreads=512 # for Spectrum Scale 4.2.0 PTF3, 4.2.1, or any higher levels
mmchconfig workerThreads=8*CORES_PER_NODE # for Spectrum Scale 4.2.0 PTF3, 4.2.1, or any higher levels
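The 8*cores rule can be computed per node as follows (the function name is illustrative):

```shell
# Derive the workerThreads value from the core count (8 x cores per node),
# as suggested for Spectrum Scale 4.2.0 PTF3 / 4.2.1 and later
worker_threads_for() {
  local cores=$1
  echo $(( 8 * cores ))
}

# Usage on the local node:
# mmchconfig workerThreads=$(worker_threads_for "$(getconf _NPROCESSORS_ONLN)")
```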

Change workerThreads to 512 (the default is 128) to enable additional thread tuning. This change requires that IBM Spectrum Scale be restarted to take effect.
Note: For IBM Spectrum Scale 4.2.0 PTF3, 4.2.1, or later, it is recommended that the following configuration parameters not be changed (setting workerThreads to 512, or 8*cores per node, will auto-tune these values): parallelWorkerThreads, logWrapThreads, logBufferCount, maxBackgroundDeletionThreads, maxBufferCleaners, maxFileCleaners, syncBackgroundThreads, syncWorkerThreads, sync1WorkerThreads, sync2WorkerThreads, maxInodeDeallocPrefetch, flushedDataTarget, flushedInodeTarget, maxAllocRegionsPerNode, maxGeneralThreads, worker3Threads, and prefetchThreads.

After you enable auto-tuning by tuning the value of workerThreads, if you previously changed any of these settings (parallelWorkerThreads, logWrapThreads, etc.), you must restore them to their default values by running “mmchconfig <parameter>=Default”.

B) For IBM Spectrum Scale 4.1.0.x and 4.1.1.x, the default values will work for most scenarios. Generally only worker1Threads tuning is required:
mmchconfig worker1Threads=72 -i # for Spectrum Scale 4.1.0.x and 4.1.1.x
For IBM Spectrum Scale 4.1.0.x and 4.1.1.x, worker1Threads=72 is a good starting point (the default is 48), though larger values have been used in database environments and other configurations with many disks present.

Step 11: Customers running IBM Spectrum Scale 4.1.0, 4.1.1, and 4.2.0 must change the default configuration of trace to run in overwrite mode instead of blocking mode.

To avoid potential performance problems, customers running IBM Spectrum Scale 4.1.0, 4.1.1, and 4.2.0 must change the default IBM Spectrum Scale tracing mode from blocking mode to overwrite mode as follows:
/usr/lpp/mmfs/bin/mmtracectl --set --trace=def --tracedev-write-mode=overwrite --tracedev-overwrite-buffer-size=500M # only for Spectrum Scale 4.1.0, 4.1.1, and 4.2.0
(This assumes that 500MB can be made available on each node for Spectrum Scale trace buffers; if 500MB is not available, set a smaller, appropriately sized trace buffer.)

Step 12: Consider whether pipeline writing must be enabled
By default (enableRepWriteStream=0), pipeline writing is disabled and the data ingestion node writes the 2 or 3 replicas of the data to the target nodes over the network in parallel. This takes additional network bandwidth. If pipeline writing is enabled, the data ingestion node writes only one replica over the network, and the target node writes the additional replica. Enabling pipeline writing (mmchconfig enableRepWriteStream=1 and restarting the IBM Spectrum Scale daemon on all nodes) can increase IO write performance in the following two scenarios:
1st scenario: Data is ingested from the IBM Spectrum Scale client and the network bandwidth from the data-ingesting client is limited.
2nd scenario: Data is written through a rack-to-rack switch with limited bandwidth (e.g. with 30 nodes per rack and a 40Gb rack-to-rack switch, when all the nodes are writing data over the switch, each node gets only 40Gb/30, approximately 1.33Gb of average network bandwidth).
For other scenarios, enableRepWriteStream must be kept as 0.
