In the Linux kernel community there is a saying: unused memory is wasted memory. Well, when using CPLEX on multiprocessor (not just multi-core) Linux systems, this is not quite true…
While benchmarking on our 2-CPU, 12-core machines, we noticed significant performance variability for some problem instances when running them multiple times. Even though the execution path was the same, wall-clock runtime differed by up to 50%, even when running just a plain single-threaded dual simplex algorithm! It took us quite a while to figure out the cause of the issue, and this blog entry describes our findings.
We started by looking at various cron jobs to see if they could be the cause, but found nothing. Then we looked at the NFS filesystem to see if that could be blamed somehow, but that was a blind alley, too. Eventually we noticed that we always got faster runtimes when that single thread was running on a core on cpu1, and slower runtimes when the core was on cpu0. Then we started to investigate what was going on.
First we examined the state of the machine using numactl:
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 12099 MB
node 0 free: 427 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 12288 MB
node 1 free: 11240 MB
node distances:
node   0   1
  0:  10  11
  1:  11  10
What sticks out in this output is that almost all memory is free on node1 and there is almost no free memory on node0. That is, no matter on which node CPLEX was running, all allocated memory came from the memory bank of node1. Searching around the internet suggested that this might be caused by the kernel keeping too much dentry (directory entry) and inode information in its caches. And indeed, issuing
$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
(that is, dropping the page cache as well as the dentry and inode caches) resolved the issue! Well, almost… Not everyone has sudo powers on our machines, so this is not an ideal solution. Also, node0 runs were still a bit slower than node1 runs, though now by only about 2%-3%.
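To see how much of a node's memory is held by caches rather than by applications, the per-node meminfo files in sysfs can be summarized. The following is a minimal sketch, not what we actually ran: the function name node_cache_summary is made up for illustration, and it assumes the kernel exposes FilePages and SReclaimable lines per node in the usual "Node N Field: value kB" layout.

```shell
#!/bin/sh
# Summarize total, free, page-cache and reclaimable-slab memory per NUMA node.
# Input: one or more per-node meminfo files (or the same text on stdin), e.g.
#   /sys/devices/system/node/node0/meminfo
# Assumed line layout: "Node 0 MemFree:   437248 kB".
node_cache_summary() {
    awk '
        $3 == "MemTotal:"     { total[$2] = $4 }
        $3 == "MemFree:"      { free[$2]  = $4 }
        $3 == "FilePages:"    { file[$2]  = $4 }
        $3 == "SReclaimable:" { slab[$2]  = $4 }
        END {
            # Values are in kB; print them in MB (truncated).
            for (n in total)
                printf "node%s total=%dMB free=%dMB pagecache=%dMB reclaimable_slab=%dMB\n",
                       n, total[n]/1024, free[n]/1024, file[n]/1024, slab[n]/1024
        }' "$@" | sort
}

# Typical use on a NUMA machine (not executed here):
#   node_cache_summary /sys/devices/system/node/node*/meminfo
```

Running this before and after dropping the caches should show whether the pagecache and reclaimable_slab figures on the crowded node actually shrink.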
So next we looked at why node1 runs were still a bit faster. It turned out that the problem instance we used for testing needed ~11GB of memory to solve, which just fits into the free memory on node1. The file we read in was ~1GB. This means that after the instance was read from file, ~1GB of memory from node0’s bank was used by caches. So if CPLEX was running on node0 then it got 10GB of memory from node0’s bank (the Linux memory management tries to allocate memory from the bank of the same node where the thread is running) and 1GB from node1’s bank. When CPLEX was running on node1 then all the needed memory was allocated from the same node’s memory. Could this explain the speed difference?
According to numactl’s node distance output, accessing memory from the other node’s bank takes only 1.1 times as long as accessing the local bank. Therefore, if roughly 10% of the memory comes from “far” then we should see less than a 1% difference. After all, CPLEX does some computing, too, not just memory accesses :-). However, this 1.1 multiplier applies only when memory is streamed, and CPLEX accesses memory in a very random fashion (accessing entries in a sparse matrix is not cache friendly and jumps around in memory a lot). So we ran the following experiment:
# run on node $node, prefer memory from node $mem
for node in 0 1; do
  for mem in 0 1; do
    numactl --cpunodebind=$node --preferred=$mem \
      cplex -c "r foo.mps.gz" "set thread 1" "tranopt" > $node$mem.log
  done
done
That is, instruct the kernel to run CPLEX on node$node while trying to allocate memory from the bank of node$mem. The result was rather interesting (running everything 10 times and taking the average):
|where the code runs|memory allocated on node0|memory allocated on node1|
|---|---|---|
|node0|40 sec|65 sec|
|node1|63 sec|39 sec|
And these results are exactly what we expected, apart from the magnitude of the difference (66% in the worst case!). It is fastest when CPLEX runs on node1 and gets memory from node1. The same setup on node0 is a bit slower because some memory allocation is forced onto node1 due to the cached file. Slowest is running on node0 while getting memory from node1. Finally, running on node1 but getting memory from node0 is a bit faster than that, again because some of the memory will be served from node1. When we ran a smaller instance that fit into the remaining memory on node0 despite the cached file, there were only two different timings, depending on whether we got the memory from the same node or not.
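The "run 10 times and take the average" step can be scripted along these lines. This is only a sketch with made-up helper names (mean, time_runs), not our actual benchmarking harness, and it measures wall-clock time at whole-second resolution, which is coarse but adequate for runs in the 40-65 second range.

```shell
#!/bin/sh
# Print the arithmetic mean of the numbers read from stdin (one per line).
mean() {
    awk '{ sum += $1; n++ } END { if (n > 0) printf "%.1f\n", sum / n }'
}

# Run a command N times and print each wall-clock time in whole seconds.
# Usage: time_runs N command [args...]
time_runs() {
    runs=$1; shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s)
        "$@" > /dev/null 2>&1
        end=$(date +%s)
        echo $((end - start))
        i=$((i + 1))
    done
}

# Hypothetical use for one cell of the table (not executed here):
#   time_runs 10 numactl --cpunodebind=0 --preferred=1 \
#       cplex -c "r foo.mps.gz" "set thread 1" "tranopt" | mean
```

Keeping the individual timings (rather than only the mean) also makes it easy to spot the kind of run-to-run variance that started this investigation.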
Therefore, given how CPLEX accesses memory, there is a very significant penalty if memory is allocated from the memory bank of a node other than the one where CPLEX is running. By keeping files cached, and thus potentially serving memory from the “wrong” memory bank, the Linux kernel’s “waste no memory” policy is harmful for us.
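To put numbers on the gap between the naive estimate and the measurements, the arithmetic can be spelled out as below. The function names are ours; the inputs are the node distances (10 and 11), the roughly 10% remote share, and the averaged timings from the table above.

```shell
#!/bin/sh
# Naive upper bound on the slowdown (in percent) if a fraction of the memory
# is remote and every access paid the full distance ratio:
#   100 * fraction * (remote_distance / local_distance - 1)
naive_slowdown_pct() {   # args: remote_fraction local_distance remote_distance
    awk -v f="$1" -v l="$2" -v r="$3" 'BEGIN { printf "%.1f\n", 100 * f * (r / l - 1) }'
}

# Measured slowdown (in percent) of a slow run relative to a fast run.
measured_slowdown_pct() {   # args: fast_seconds slow_seconds
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * (b / a - 1) }'
}

naive_slowdown_pct 0.10 10 11     # prints 1.0  (10% of memory at distance 11 vs 10)
measured_slowdown_pct 40 65       # prints 62.5 (running on node0: local vs remote)
measured_slowdown_pct 39 63       # prints 61.5 (running on node1: local vs remote)
measured_slowdown_pct 39 65       # prints 66.7 (best vs worst cell)
```

So the distance table would suggest a penalty of about 10% for fully remote memory, while the measured penalty is roughly six times that.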
Then we turned our attention to the sudo issue. That is, how could we make sure, without sudo powers, that the cache is as empty as possible when running CPLEX? It turns out there are a number of kernel parameters that can be set to control the kernel’s behavior. Bradley Dean’s wiki gives a good short summary and has excellent links to the details of Linux’s kernel memory management.
For us, fiddling with “/proc/sys/vm/vfs_cache_pressure” proved to be the most useful. Increasing it to 1000 from the default 100 resulted in the kernel evicting cached filesystem data fast enough to make the cache issue disappear. And this parameter needs to be set (by root) only once. Note that changing this parameter this way probably has adverse performance effects for any application that regularly accesses the filesystem. However, on our benchmarking machines, all we do is read in the instance. Then we stay CPU-bound, so we do not need anything cached. So for this use case the cache is useless, and actually even worse, as it introduces variance in the runtime.
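Since the setting has to be made by root, it can be handy for ordinary users to verify it before starting a benchmark. Below is a small sketch of such a check; the function name check_cache_pressure is hypothetical, and the target of 1000 is simply the value that worked for us.

```shell
#!/bin/sh
# Report whether vfs_cache_pressure has reached a target value.
# Arg 1: file to read the value from (defaults to the real kernel knob).
# Arg 2: target value (defaults to 1000).
# Returns 0 if the value is at or above the target, 1 if below, 2 if unreadable.
check_cache_pressure() {
    knob=${1:-/proc/sys/vm/vfs_cache_pressure}
    target=${2:-1000}
    current=$(cat "$knob") || return 2
    if [ "$current" -ge "$target" ]; then
        echo "ok: vfs_cache_pressure=$current"
    else
        echo "low: vfs_cache_pressure=$current (target $target)"
        return 1
    fi
}

# Changing the value requires root; either standard form works:
#   sysctl -w vm.vfs_cache_pressure=1000
#   echo 1000 > /proc/sys/vm/vfs_cache_pressure
# To keep it across reboots, a line such as
#   vm.vfs_cache_pressure = 1000
# can go into /etc/sysctl.conf (or a file under /etc/sysctl.d/ where that is used).
```

Reading the knob needs no special privileges, so a benchmark script can call check_cache_pressure first and refuse to run if the machine has been reset to the default.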
- On a multi-CPU Linux machine, fiddling with kernel parameters may make CPLEX run faster and with more reproducible timings, although this must be done with care if there are other applications running on the same machine.
- When not using all cores on a multi-CPU machine running Linux, it is essential to pay attention to which node CPLEX is running on and where it gets its memory from. This is important both for getting reliable timings for benchmarking purposes and for getting more performance.
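One way to act on the second point is to pick the node with the most free memory and bind both the CPUs and the memory of the run to it. The sketch below is an illustration under assumptions, not something we benchmarked: the helper names are made up, and it relies on the "node N free: X MB" lines of numactl --hardware as shown earlier. Note that --membind is strict, unlike the --preferred option used in the experiment, so it only makes sense when the instance fits into that node's free memory.

```shell
#!/bin/sh
# Print the number of the NUMA node with the most free memory,
# given the output of `numactl --hardware` on stdin.
node_with_most_free() {
    awk '$1 == "node" && $3 == "free:" {
             if (best == "" || $4 + 0 > max) { max = $4 + 0; best = $2 }
         }
         END { if (best != "") print best }'
}

# Hypothetical wrapper: run a command with its CPUs and its memory both
# bound to the node that currently has the most free memory.
run_on_freest_node() {
    node=$(numactl --hardware | node_with_most_free)
    numactl --cpunodebind="$node" --membind="$node" "$@"
}

# Example (not executed here):
#   run_on_freest_node cplex -c "r foo.mps.gz" "set thread 1" "tranopt"
```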