NOTE: This content was originally published on the IBM developerWorks site. Since that site is being taken offline, the content has been copied here so it can continue to be accessed.

Common Network Recommendations:

Each entry below lists the sysctl parameter, the recommended value, any comments, and a description.

kernel.sysrq
  Recommended value: 1
  Description: Enables kernel debugging via the sysrq interface.

kernel.shmmax
  Recommended value: 137438953472
  Comments: Total amount of memory installed.
  Description: The maximum size of a shared memory segment. Optimal PE MPI support requires a large value, and other applications require large shared memory segments as well.

net.core.netdev_max_backlog
  Recommended value: 250000
  Description: Max packets to be queued at the interface layer.

net.core.optmem_max
  Recommended value: 16777216
  Description: Max socket buffer size.

net.core.rmem_default
  Recommended value: 16777216
  Description: Default socket buffer read size.

net.core.wmem_default
  Recommended value: 16777216
  Description: Default socket buffer write size.

net.core.rmem_max
  Recommended value: 16777216
  Description: Max socket buffer read size (overrides the tcp_rmem max value).

net.core.wmem_max
  Recommended value: 16777216
  Description: Max socket buffer write size (overrides the tcp_wmem max value).

net.ipv4.conf.all.arp_filter
  Recommended value: 1
  Description: Set to 1 means only accept ARP replies for addresses on the same subnet as the address being requested via ARP.

net.ipv4.conf.all.arp_ignore
  Recommended value: 1
  Description: Set to 1 means a device only answers an ARP request if the address matches its own.

net.ipv4.neigh.ib0.mcast_solicit
  (if there are other ibX interfaces, e.g. ib1, set them as well; set the same IPV6 tuning value if using IPV6)
  Recommended value: 9
  Comments: 9 is preferred, though some large systems set 18.
  Description: Max attempts to resolve an ib0 IPV4 address via mcast/bcast before marking the entry as unreachable.

net.ipv4.neigh.ib0.ucast_solicit
  (if there are other ibX interfaces, e.g. ib1, set them as well; set the same IPV6 tuning value if using IPV6)
  Recommended value: 9
  Description: When resolving an ib0 IPV4 address, the max unicast probes sent before a new ARP broadcast.

net.ipv4.neigh.default.gc_thresh1
  (set the same IPV6 tuning value if using IPV6)
  Recommended value: 30000
  Comments: 30000 is sufficient for the current largest System x clusters. The minimum value is the total number of interfaces in the cluster that may require ARP entries, generally num_nodes*interfaces (e.g. 4000 nodes with 2 IB interfaces each would need at least 8000 entries).
  Description: Min IPV4 entries to keep in the ARP cache; garbage collection never runs if this many or fewer entries are in the cache.

net.ipv4.neigh.default.gc_thresh2
  (set the same IPV6 tuning value if using IPV6)
  Recommended value: 32000
  Comments: 32000 is sufficient for the current largest System x clusters. The minimum value is an extra buffer (2000) plus the total number of interfaces in the cluster that may require ARP entries, generally num_nodes*interfaces.
  Description: IPV4 entries allowed in the ARP cache before garbage collection will be scheduled in 5 seconds.

net.ipv4.neigh.default.gc_thresh3
  (set the same IPV6 tuning value if using IPV6)
  Recommended value: 32768
  Comments: 32768 is sufficient for the current largest System x clusters. The minimum value is a larger extra buffer (2768) plus the total number of interfaces in the cluster that may require ARP entries, generally num_nodes*interfaces.
  Description: Maximum IPV4 entries allowed in the ARP cache; garbage collection runs when this many entries is reached.

net.ipv4.neigh.ib0.gc_stale_time
  (if there are other ibX interfaces, e.g. ib1, set them as well; set the same IPV6 tuning value if using IPV6)
  Recommended value: 2000000
  Description: Defines how long a stale ib0 IPV4 ARP entry must be inactive (not used) in the cache before it is a candidate for deletion on the next garbage collection run.

net.ipv4.neigh.default.gc_interval
  (set the same IPV6 tuning value if using IPV6)
  Recommended value: 2000000
  Description: Defines how often IPV4 ARP garbage collection runs.

net.ipv4.tcp_adv_win_scale
  Recommended value: 2
  Description: Defines how much socket buffer space is used for the TCP window size vs how much to reserve for an application buffer; 2 = 1/4 of the space is application buffer.

net.ipv4.tcp_low_latency
  Recommended value: 1
  Description: Intended to give preference to low latency over higher throughput; setting this to 1 disables IPV4 TCP prequeue processing, which Mellanox has recommended for large clusters.

net.ipv4.tcp_mem
  Recommended value: 16777216 16777216 16777216
  Description: IPV4 TCP memory usage values: min, pressure, max (in pages). min: no constraints below this value; pressure: threshold for moderating memory consumption; max: hard max.

net.ipv4.tcp_reordering
  Recommended value: 3
  Description: The maximum times an IPV4 packet can be reordered in a TCP packet stream without TCP assuming packet loss and going into slow start.

net.ipv4.tcp_rmem
  Recommended value: 4096 87380 16777216
  Description: IPV4 TCP receive socket buffer memory: min, default, max. min: minimal size of the TCP receive buffer; default: initial size of the TCP receive buffer (overrides the net.core.rmem_default value used for other protocols); max: max size of the receive buffer allowed (limited by net.core.rmem_max).

net.ipv4.tcp_wmem
  Recommended value: 4096 87380 16777216
  Description: IPV4 TCP send socket buffer memory (see tcp_rmem).

net.ipv4.tcp_sack
  Recommended value: 0
  Description: Setting to 1 enables selective acknowledgment for IPV4, which requires enabling tcp_timestamps and adds some packet overhead; only advised in cases of packet loss on the network.

net.ipv4.tcp_timestamps
  Recommended value: 0
  Description: Setting to 1 enables IPV4 timestamp generation, which can have a performance overhead and is only advised in cases where SACK is needed (see tcp_sack).

net.ipv4.tcp_window_scaling
  Recommended value: 1
  Description: RFC 1323 support for IPV4 TCP window sizes larger than 64K; generally needed on high bandwidth networks.


All ‘ib0’ tuning recommendations should be made for ALL interfaces for which IP performance and reliability are important. Here ib0 is just an example: every interface on which critical subsystems (e.g. GPFS, LSF) depend should be tuned as per the ib0 examples, except in cases where the tuning recommendations under (2) Less Reliable/Lower Bandwidth Networks are being followed and those recommendations conflict.
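For reference, a minimal sketch of how a subset of the recommendations above might be placed in /etc/sysctl.conf follows; the neighbor entries assume ib0 is the interface of interest and should be repeated for any additional IB interfaces, with values adapted to the cluster.

# /etc/sysctl.conf fragment (subset of the recommendations above)
kernel.sysrq = 1
net.core.netdev_max_backlog = 250000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.neigh.default.gc_thresh1 = 30000
net.ipv4.neigh.default.gc_thresh2 = 32000
net.ipv4.neigh.default.gc_thresh3 = 32768
net.ipv4.neigh.ib0.mcast_solicit = 9
net.ipv4.neigh.ib0.ucast_solicit = 9

The values can then be loaded with ‘sysctl -p’ (or at the next boot).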

IPoIB Mode Recommendations:

As per the sysctl recommendations, the ‘ib0’ tuning recommendations apply to ALL interfaces for which IP performance and reliability are important. Here ib0 is just the example interface that most clusters are concerned with.

Unless there is a strong need for optimal IPoIB performance, we are currently recommending using datagram mode on clusters. Mellanox has agreed with this recommendation and points out that datagram mode in OFED 2 should be very close to the performance of connected mode.

We recommend:

/sys/class/net/ib0/mode = datagram

(again if there’s an ib1 interface, /sys/class/net/ib1/mode should be set to datagram, etc.)

which is typically achieved by one of the following two approaches, which must be applied to every IB interface (e.g. ib0, ib1, etc.):

(1)  For QLogic/Intel adapters, in the appropriate ifcfg-ibX file (e.g. ifcfg-ib0 for ib0), set:
CONNECTED_MODE=no

(2)  For Mellanox adapters, in  /etc/infiniband/openib.conf:
SET_IPOIB_CM=no
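Once the IPoIB driver has been reloaded with either setting, the active mode can be confirmed from sysfs (the same /sys/class/net/ibX/mode path referenced above), for example:

cat /sys/class/net/ib0/mode    # should return: datagram
cat /sys/class/net/ib1/mode    # repeat for any additional IB interfaces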

IP Interface Tuning:

Mellanox Adapter Interface Tuning

We should verify that all IB IP interfaces match the recommended tuning. The output of ‘ip -s link’ returns the following (the example is for ib0, but ALL interfaces, e.g. ib1, ib2, etc., need to be verified):

‘ip -s link’ example output for an example (ib0) interface:
"ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4092 qdisc pfifo_fast state UP qlen 16384"

Things to verify:
The IPoIB MTU should be 4092 (set in /etc/modprobe.d/mlx4k.conf: options mlx4_core set_4k_mtu=1)

Also for cases with large GPFS page pools, /etc/modprobe.d/mlx4k.conf should also set:
options mlx4_core log_num_mtt=20 log_mtts_per_seg=3
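Assuming the mlx4_core module exposes these options under /sys/module in the usual way, the values actually loaded by the driver can be checked with, for example:

cat /sys/module/mlx4_core/parameters/set_4k_mtu
cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg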

The IPoIB flags should be: BROADCAST,MULTICAST,UP,LOWER_UP

and the state of any working interface of course should be "UP"

The IPoIB interface QDISC has been tested on large clusters with the pfifo_fast setting; however, the mq (multi-queue) setting may be advisable on later adapters.

The IPoIB QLEN should be: 16384

This has been tested on a cluster with more than 4000 nodes (smaller clusters may work well with smaller QLEN values, but the overhead of increasing QLEN is believed not to be significant). The QLEN reported by ifconfig will be twice the queue sizes defined in the /etc/modprobe.d/ib_ipoib.conf file, e.g.:

options ib_ipoib lro=1 send_queue_size=8192 recv_queue_size=8192

The txqlen and rxqlen values reported by ifconfig will be twice the values loaded by the driver.  The actual values that have been configured to the IB module can be determined by running:

cat /sys/module/ib_ipoib/parameters/send_queue_size
cat /sys/module/ib_ipoib/parameters/recv_queue_size

Intel Adapter Interface Tuning

Define the IP over IB receive queue length (on some IBM systems, receive queue tuning has been set in the /etc/modprobe.d/ib_ipoib.conf file):

options ib_ipoib recv_queue_size=1024 send_queue_size=512




It is also recommended that, for ethernet adapters with performance or reliability requirements, the length of the ethernet IP transmit and receive queues be increased to 2048:

in /etc/rc.local:

ifconfig eth0 txqueuelen 2048   # will set the transmit queue to 2048 (if the adapter supports this length)

ethtool -G eth0 rx 2048               # will set the receive queue to 2048 (if the adapter supports this length)

(repeat for other ethernet devices, e.g. eth1, that support higher transmit and receive queue lengths)

The state of any working interface of course should be "UP"
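The resulting queue lengths can be confirmed with standard tools, for example:

ip link show eth0     # the reported qlen should be 2048
ethtool -g eth0       # the current hardware RX ring setting should be 2048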

 

Settings to avoid Linux Out of Memory Issues:

Each entry below lists the /proc parameter, the recommended value, any comments, and a description.

/proc/sys/vm/oom_kill_allocating_task
  Recommended value: 0
  Description: 0 = when the OOM killer is invoked, it employs heuristics to select a process making intensive memory allocations; 1 = when the OOM killer is invoked, it kills the last process to allocate memory.

/proc/sys/vm/overcommit_memory
  Recommended value: 2
  Comments: Some IBM clusters set this value to 0, which may be appropriate for workloads that malloc() much more memory than they touch, as long as a cgroups solution is in place to protect against failures that may occur when over-committing real memory.
  Description: 0 = heuristic memory over-commit allowed; 1 = allocations always succeed; 2 = allocations succeed up to swap + (RAM * overcommit_ratio / 100).

/proc/sys/vm/overcommit_ratio
  Recommended value: 99
  Comments: Maximum: 110. The extent of memory over-commit depends on the discrepancy between memory malloc’ed and touched; when running sparse matrix applications, higher overcommit_ratio values may be more appropriate.
  Description: This value is only relevant when /proc/sys/vm/overcommit_memory=2 (see the description above).
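These /proc values can also be applied persistently through sysctl, using the equivalent vm.* sysctl names; a minimal /etc/sysctl.conf sketch would be:

# /etc/sysctl.conf fragment
vm.oom_kill_allocating_task = 0
vm.overcommit_memory = 2
vm.overcommit_ratio = 99

followed by ‘sysctl -p’ (or a reboot) to load the values.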

 

 

Ulimits Tuning:

The following limits are recommended as default user limits on large clusters. Note that ulimits are not a reliable method of enforcing memory limitations, so it is recommended that ulimits be defined to effectively set unlimited memory limits and that cgroups definitions be used to enforce memory limits.

Set these values in /etc/security/limits.conf:

  *    soft    memlock       -1
  *    hard    memlock       -1
  *    soft    rss           -1
  *    hard    rss           -1
  *    soft    core          -1
  *    hard    core          -1
  *    soft    maxlogins     8192
  *    hard    maxlogins     8192
  *    soft    stack         -1
  *    hard    stack         -1
  *    soft    nproc         2067554
  *    hard    nproc         2067554
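Once the limits are in place, they can be checked from a fresh user login, for example:

ulimit -a    # shows all limits for the current shell, including memlock, core, stack, and nproc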

 

 

 
