Co-author: Sreepurna Jasti (email@example.com)
This is the story of an experiment which we conducted within node.js’s event management infrastructure (libuv) to investigate improvement in network throughput by exploiting the operating system tunables in the TCP stack and leveraging them at the data transport sites.
Introduction to libuv
TCP streams in libuv / node
TCP streams are sophisticated abstractions around full-duplex sockets that provide asynchronous APIs for reading and writing, routines for managing the data flow through the socket channel such as pause, resume, pipe etc. Once the connection is established, an I/O – a read for example is initiated through uv_read_start call. This allows the caller to customize the read chunk size apart from returning as a callback each time when a chunk full of data is read.
However, given the workload patterns in the use cases of libuv (such as node) the data chunk size can never be determined optimally. For example, a web server may predominantly be used to write large amount of data to its client (a download site) in which case the write chunk can be enlarged to a great extend while keeping the read chunk to be of low / moderate size. Conversely, read chunk can be enlarged for a web server which implements an upload function. The fact that a full-duplex socket is capable of reading as well as writing and can be used for both purposes interchangeably, it is difficult to foresee ‘direction’ and ‘volume’ of data transport through the streams. The end result of this un-certainty is that the default read chunk size is 64KB and is un-altered.
For the most common workload scenarios – single page web servers, micro-services, proxies and gateways – the amount of data transferred between end-points per unit transaction may be well contained in this default, so this approach works fine and scales well. However, streaming use case offers a challenge. When a multimedia resource is being transported from one system to another, the sender is willing to ‘expend’ all its resources to write the data as fast as it can, and expecting the receiver to perform the same for reading, but the existing chunking approach can stand against this ‘willingness’.
The current design of node.js / libuv for TCP streams at I/O site are:
- write as much as possible
- place the subject fd into the event loop
- wait for the written data to drain and fd becomes writable again.
- repeat the steps
- fire the callback, if any
- attempt to read 64 KB of data
- if less data is obtained, fire the callback if any (read complete)
- if 64KB data is obtained, fire the callback if any (read incomplete), and repeat from 1.
We believed that there is a little bottleneck around this area and can be enhanced – while write logic is aggressive and optimal for any workloads, read logic may result in different behavior based on the data volume, network latency, rate of writing at the remote end-point etc.
The core of the enhancement we are talking about is to:
- Expose an API uv_tcp_enable_largesocket() on a specified stream
- Query the OS tunables to see what is the maximum possible data chunk size for a socket [ refer to rfc1323 which specifies TCP extensions to improve performance and reliability of data transmissions over very high-speed paths, and defines Window Scale option that supports larger receive windows than 64 KB]
- Apply that to the stream’s underlying socket: for read and write buffers
- Record this value for allocating read buffer chunks size at future read operations
We worked in the libuv community and took and approach of iterative refinement through discussions and ratifications.
We first simply increased the read chunk size to 128KB, 256KB, 1MB etc. and observed how standard benchmarks respond: there wasn’t anything promising.
This led to developing a custom streaming scenario: 128 concurrent clients accessing 5MB of binary data from a server few times showed the real difference:
Platform: Linux x64
Network: Short RTT
Tests with original node
Tests with modified node
Summary of the above¬†data is that we saw ~8% improvement in the total run time for the modified case while variations in utilization of other system resources stayed flat.
We tested in LRTT (Long round-trip-time) networks as well and the results were similar.
Next step was to make sure the benchmark has some meaning – the data it transports has a reasonable bearing on one or more real world workload patterns. So the modified test had around 100 clients running in parallel, and then sequential by each clients a 100 times, download a 100MB data each time: roughly 1.0E+12 bytes of data being flown across.
The summary of improvements (percentage) in few platforms is as follows:
strace¬†was used to figure out time split between system calls – calibrate the read calls to see what impact the enlarged buffers have on it.
We soon switched to perf based on community suggestion – as strace is heavy weight and incurs its own burden. Perf not only provided time statistics, but showed the lowest level ripples as well – such as variations in context switches on small and large reads. I cannot write further without mentioning the¬† wonderful documentation by Brendan Gregg on perf.
The single thread’s availability in the uv loop is central to concurrency, and we used¬†appmetrics tool to measure the loop latency: this is defined as the time elapsed between an requesting a task to be run on the loop and its execution commencement – practically a measure of concurrency – with lower the value, the better.
Average loop latency measured in milliseconds:
Summary of this data is that we see drastic reduction in terms of loop latency.
- With enlargement in the buffer, the amount of time spent in read has increased.
- With enlargement in the buffer, the overall amount of time spent in kernel has increased.
- The performance gain is a function of concurrency in the system. The more number of transactions in the system, the gain due to this enlargement becomes more apparent.
- The performance gain is also a function of the actions in the read callback. If an empty callback is employed, the performance gain is not visible.
- Because we are reading more and looping less, the availability in the uv loop has increased to a considerable extent.
1. The net effect of socket enlargement in big data workload is:
- Large reads implies less callbacks
- Less garbage to collect
- Better cache efficiency brought through improved spacial locality
- Reduced Object creation efforts
- Reduced run time linking and dispatch of disparate subroutines: C -> C++ -> JS and reverse.
- JIT factors that advantage the callback code which now works on larger volume of data.
2. While proximity (measured in terms of round-trip-time) of the end-points which take part in the data transport critically influences the drain latency, it does not alter the improvement ratio brought through this experiment.
3. Distributed computing deployments which is highly concurrent (uv loop is sufficiently exercised) with large data flow between end-points can potentially benefit from this enhancement. Any overhead incurred due to large reads are masked by the gain obtained through better utilization of network resources in the system, reduced callbacks and reduced starvation in the uv loop and thereby improved concurrency.
The¬†Pull Request is being actively pursued in the community. Due to the fact that major changes are planned for the upcoming version (v2) in the code area where this PR applies to, there seems to be challenges in terms of pursuing this in the current streams (v1.x) as the code would need refactoring and separate handling for both the streams. Meanwhile, we are looking for some industrial-strength streaming workloads to see if their systems can exercise our changes and bring-in insights.
Feel free to collaborate with us if this interests you – in terms of doing joint experiments, testing your workload, refining the approaches – in expectation of bringing together collective intelligence with a goal of enhancing transport throughput for big data in node.js deployments.
Special thanks to those who provided vital insights and profound contributions to the experiment:
Ben Noordhuis – firstname.lastname@example.org – https://github.com/bnoordhuis
Sam Roberts – email@example.com – https://github.com/sam-github
Sa√ļl Ibarra Corretg√© – firstname.lastname@example.org – https://github.com/saghul
Santiago Gimeno – email@example.com – https://github.com/santigimeno
artityod – https://github.com/artityod
Aaron Bieber – firstname.lastname@example.org – https://github.com/qbit
Jamie Davis – email@example.com – https://github.com/davisjam