This post links to a presentation on how to optimize Streams applications. The slides were designed to be presented, but they include enough context to be understandable on their own. There have been several requests to make them generally available.
The slides present and explain the following performance lessons:
- Compile with -a (optimized code generation).
- Fuse operators into the same PE to reduce communication costs.
- Insert threaded ports into PEs to increase throughput through pipeline parallelism.
- Prefer threaded ports over PEs to obtain pipeline parallelism.
- Use multiple PEs in an application to take advantage of multiple hosts.
- Use one PE per host.
- If there are two PEs on the same host, they should probably be fused into one PE. Insert threaded ports to regain parallelism.
- Improve the performance of bottlenecks to improve the throughput of an application.
- Trying to improve the performance of an application without knowing where the bottleneck is wastes time.
- When a parallel region is no longer the bottleneck, further parallelism will not help.
- Know your hardware. Distribute PEs across hosts so that no resource (cores, memory, disk, etc.) on any host is over-subscribed.
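
The bottleneck lessons above can be illustrated with a small back-of-the-envelope model. This is only a sketch, not Streams code: it assumes a linear pipeline whose steady-state throughput equals the rate of its slowest stage, and it assumes that parallelizing a stage scales its rate perfectly linearly. The stage names and tuple rates are hypothetical numbers chosen for illustration.

```python
def pipeline_throughput(stage_rates):
    """Throughput of a linear pipeline: limited by its slowest stage."""
    return min(stage_rates.values())


def bottleneck(stage_rates):
    """Name of the slowest stage."""
    return min(stage_rates, key=stage_rates.get)


def parallelize(stage_rates, stage, width):
    """Copy of the rates with `stage` replicated `width` times.

    Assumes ideal linear scaling, which real applications rarely achieve.
    """
    scaled = dict(stage_rates)
    scaled[stage] = stage_rates[stage] * width
    return scaled


# Hypothetical per-stage rates in tuples per second.
rates = {"source": 50_000, "parse": 8_000, "aggregate": 20_000, "sink": 40_000}

print("baseline:", bottleneck(rates), pipeline_throughput(rates))

# Widen the parallel region around the bottleneck stage step by step.
for width in (2, 4, 8):
    scaled = parallelize(rates, "parse", width)
    print(f"parse x{width}:", bottleneck(scaled), pipeline_throughput(scaled))

# Speeding up a stage that is not the bottleneck changes nothing.
print("sink x4:", pipeline_throughput(parallelize(rates, "sink", 4)))
```

In this model, widening `parse` helps only until `aggregate` becomes the slowest stage. After that point, more parallelism in `parse` leaves throughput unchanged, and speeding up `sink` never helps at all. Measuring which stage is the bottleneck comes first.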