Co-author: Sreepurna Jasti (jsreepur@in.ibm.com)

This is the story of an experiment we conducted within node.js's event management infrastructure (libuv) to investigate improving network throughput by exploiting operating system tunables in the TCP stack and leveraging them at the data transport sites.

Introduction to libuv
libuv is a software module that implements an event loop to support asynchronous I/O in a platform-independent manner. It is the central entity that lets node demonstrate event-driven, asynchronous behavior, and for the finest level of performance it is implemented in C. While many time-incurring and time-bound operations churn through libuv's event loop, TCP streams exemplify its model of unbounded data flow: they are built on non-blocking sockets and perform evented I/O (read upon arrival, write when possible). Apart from node.js (which uses JavaScript bindings), libuv is used in a number of other projects, with bindings to Python, Lua, C#, Perl, PHP, Ruby, etc., and even C++.

TCP streams in libuv / node
TCP streams are sophisticated abstractions around full-duplex sockets. They provide asynchronous APIs for reading and writing, and routines for managing the data flow through the socket channel such as pause, resume, and pipe. Once a connection is established, an I/O operation – a read, for example – is initiated through the uv_read_start call. This allows the caller to customize the read chunk size, and a callback is fired each time a chunk of data has been read.
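To make this concrete, here is a minimal sketch of that read pattern using libuv's public API; the fixed 64 KB chunk, the handler names, and the omitted error handling are simplifications for illustration, not the exact code node uses.

/* Minimal sketch: reading from a TCP stream with libuv. The alloc callback
 * is where the caller chooses the read chunk size (64 KB here, mirroring
 * the default discussed above). */
#include <stdlib.h>
#include <uv.h>

#define READ_CHUNK (64 * 1024)

static void on_alloc(uv_handle_t* handle, size_t suggested, uv_buf_t* buf) {
  /* 'suggested' is only a hint from libuv; the caller may override it. */
  *buf = uv_buf_init(malloc(READ_CHUNK), READ_CHUNK);
}

static void on_read(uv_stream_t* stream, ssize_t nread, const uv_buf_t* buf) {
  if (nread > 0) {
    /* consume buf->base[0 .. nread) */
  } else if (nread < 0) {
    uv_read_stop(stream);  /* EOF or error */
  }
  free(buf->base);
}

void start_reading(uv_tcp_t* client) {
  /* Fires on_read each time a chunk of data arrives on the socket. */
  uv_read_start((uv_stream_t*) client, on_alloc, on_read);
}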

However, given the workload patterns in libuv's use cases (such as node), the data chunk size can never be determined optimally up front. For example, a web server may predominantly write large amounts of data to its clients (a download site), in which case the write chunk can be enlarged to a great extent while the read chunk remains small or moderate. Conversely, the read chunk can be enlarged for a web server that implements an upload function. Because a full-duplex socket is capable of reading as well as writing and can be used for both purposes interchangeably, it is difficult to foresee the ‘direction’ and ‘volume’ of data transported through the streams. The end result of this uncertainty is that the default read chunk size is 64KB and is left unaltered.

Problem statement
For the most common workload scenarios – single-page web servers, micro-services, proxies and gateways – the amount of data transferred between end-points per transaction is well contained within this default, so the approach works fine and scales well. However, the streaming use case poses a challenge. When a multimedia resource is being transported from one system to another, the sender is willing to ‘expend’ all its resources to write the data as fast as it can, expecting the receiver to do the same for reading, but the existing chunking approach can stand in the way of this ‘willingness’.

The current design of node.js / libuv for TCP streams at the I/O sites is as follows (a sketch of the write-side pattern appears after the lists):

-> writing:

  • write as much as possible
  • place the subject fd into the event loop
  • wait for the written data to drain and the fd to become writable again
  • repeat the steps
  • fire the callback, if any

-> reading:

  • attempt to read 64 KB of data
  • if less data is obtained, fire the callback if any (read complete)
  • if 64KB of data is obtained, fire the callback, if any (read incomplete), and repeat from the first step
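As a rough illustration of the write-side steps above, a non-blocking socket write loop looks something like the sketch below. This is a simplified stand-in written against the plain POSIX socket API, not libuv's actual implementation.

/* Simplified sketch of the "write as much as possible" step: push bytes
 * until the kernel's send buffer is full, then hand the fd back to the
 * event loop and resume once it becomes writable again. */
#include <errno.h>
#include <unistd.h>

/* Returns the number of bytes written so far; a short count means the
 * caller should re-arm the fd in the event loop and retry the rest later. */
ssize_t write_as_much_as_possible(int fd, const char* data, size_t len) {
  size_t off = 0;
  while (off < len) {
    ssize_t n = write(fd, data + off, len - off);
    if (n > 0) {
      off += (size_t) n;
    } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
      break;            /* kernel buffer full: wait for writability */
    } else if (n < 0 && errno == EINTR) {
      continue;         /* interrupted: just retry */
    } else {
      return -1;        /* hard error */
    }
  }
  return (ssize_t) off;
}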

We believed there was a small bottleneck in this area that could be addressed – while the write logic is aggressive and optimal for any workload, the read logic may behave differently depending on the data volume, network latency, rate of writing at the remote end-point, etc.

Experiments

The core of the proposed enhancement is to (a sketch of the idea follows the list):

  • Expose an API uv_tcp_enable_largesocket() on a specified stream
  • Query the OS tunables to determine the maximum possible data chunk size for a socket [refer to RFC 1323, which specifies TCP extensions to improve performance and reliability of data transmission over very high-speed paths, and defines the Window Scale option that supports receive windows larger than 64 KB]
  • Apply that to the stream’s underlying socket: for read and write buffers
  • Record this value for allocating read buffer chunks in future read operations
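Below is a minimal, Linux-specific sketch of the idea behind uv_tcp_enable_largesocket(). The helper names and the /proc paths are assumptions made for illustration; this is not the code from the actual pull request.

/* Illustrative only: enlarge a socket's receive and send buffers to the OS
 * maximum and report the value the kernel actually applied, which the caller
 * can record as the allocation size for future read chunks. */
#include <stdio.h>
#include <sys/socket.h>

static int read_sysctl(const char* path) {
  FILE* f = fopen(path, "r");
  int value = -1;
  if (f != NULL) {
    if (fscanf(f, "%d", &value) != 1)
      value = -1;
    fclose(f);
  }
  return value;
}

int enable_large_socket(int fd) {
  int rmax = read_sysctl("/proc/sys/net/core/rmem_max");
  int wmax = read_sysctl("/proc/sys/net/core/wmem_max");
  int actual = 0;
  socklen_t len = sizeof(actual);

  if (rmax > 0)
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rmax, sizeof(rmax));
  if (wmax > 0)
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &wmax, sizeof(wmax));

  /* The kernel may clamp (and on Linux doubles) the requested value, so
   * query what was actually applied. */
  if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) != 0)
    return -1;
  return actual;
}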

We worked with the libuv community and took an approach of iterative refinement through discussion and review.
We first simply increased the read chunk size to 128KB, 256KB, 1MB, etc., and observed how the standard benchmarks responded: there wasn't anything promising.
This led to developing a custom streaming scenario: 128 concurrent clients each downloading 5MB of binary data from a server a few times showed the real difference:

Platform: Linux x64
Network: Short RTT
Node: v7.x

Tests with original node


Tests with modified node


The summary of the above data is that we saw ~8% improvement in total run time for the modified case, while utilization of other system resources stayed essentially flat.

We tested in LRTT (Long round-trip-time) networks as well and the results were similar.

The next step was to make sure the benchmark had some meaning – that the data it transports has a reasonable bearing on one or more real-world workload patterns. So the modified test ran around 100 clients in parallel, each sequentially downloading 100MB of data 100 times: roughly 1.0E+12 bytes of data flowing across the network.


The summary of improvements (percentage) on a few platforms is as follows:


Tools

strace was used to figure out the time split between system calls – to calibrate the read calls and see what impact the enlarged buffers have on them.

We soon switched to perf based on community suggestion – as strace is heavyweight and incurs its own burden. perf not only provided time statistics, but showed the lowest-level ripples as well – such as variations in context switches on small and large reads. I cannot write further without mentioning the wonderful documentation by Brendan Gregg on perf.

The single thread’s availability in the uv loop is central to concurrency, and we used the appmetrics tool to measure the loop latency: this is defined as the time elapsed between requesting a task to be run on the loop and the commencement of its execution – practically a measure of concurrency – the lower the value, the better.
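To make the metric concrete, here is a hedged sketch in plain libuv terms of what measuring loop latency amounts to (appmetrics does the node.js equivalent internally); the timer-based approach and names here are illustrative assumptions, not appmetrics' implementation.

/* Schedule a task on the loop "now" and measure how long it takes before
 * the loop actually starts running it. On an idle loop this prints close
 * to zero; on a loop busy with long reads and callbacks it grows. */
#include <stdint.h>
#include <stdio.h>
#include <uv.h>

static uint64_t scheduled_at;

static void on_timer(uv_timer_t* handle) {
  uint64_t latency_ns = uv_hrtime() - scheduled_at;
  printf("loop latency: %.3f ms\n", latency_ns / 1e6);
}

int main(void) {
  uv_loop_t* loop = uv_default_loop();
  uv_timer_t timer;
  uv_timer_init(loop, &timer);

  scheduled_at = uv_hrtime();              /* request a task now... */
  uv_timer_start(&timer, on_timer, 0, 0);  /* ...to run as soon as possible */

  uv_run(loop, UV_RUN_DEFAULT);
  return 0;
}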

Average loop latency measured in milliseconds:


The summary of this data is that we see a drastic reduction in loop latency.

Observations

  • With enlargement in the buffer, the amount of time spent in read has increased.
  • With enlargement in the buffer, the overall amount of time spent in kernel has increased.
  • The performance gain is a function of concurrency in the system. The more transactions in the system, the more apparent the gain due to this enlargement becomes.
  • The performance gain is also a function of the actions in the read callback. If an empty callback is employed, the performance gain is not visible.
  • Because we are reading more and looping less, the availability of the uv loop has increased to a considerable extent.

Inferences

1. The net effect of socket enlargement in big data workloads is:

  • Large reads imply fewer callbacks
  • Less garbage to collect
  • Better cache efficiency brought through improved spatial locality
  • Reduced Object creation efforts
  • Reduced run time linking and dispatch of disparate subroutines: C -> C++ -> JS and reverse.
  • JIT factors that favor the callback code, which now works on a larger volume of data.

2. While the proximity (measured in terms of round-trip time) of the end-points that take part in the data transport critically influences the drain latency, it does not alter the improvement ratio brought about by this experiment.

3. Distributed computing deployments that are highly concurrent (the uv loop is sufficiently exercised), with large data flows between end-points, can potentially benefit from this enhancement. Any overhead incurred due to large reads is masked by the gain obtained through better utilization of network resources in the system, reduced callbacks, and reduced starvation in the uv loop, and thereby improved concurrency.

Current status
The Pull Request is being actively pursued in the community. Because major changes are planned for the upcoming version (v2) in the code area this PR applies to, there are challenges in pursuing it in the current stream (v1.x), as the code would need refactoring and separate handling for both streams. Meanwhile, we are looking for some industrial-strength streaming workloads to see whether their systems can exercise our changes and bring in insights.

Feel free to collaborate with us if this interests you – by doing joint experiments, testing your workload, or refining the approaches – so that we can bring together collective intelligence with the goal of enhancing transport throughput for big data in node.js deployments.

Acknowledgements

Special thanks to those who provided vital insights and profound contributions to the experiment:

Ben Noordhuis – info@bnoordhuis.nl – https://github.com/bnoordhuis
Sam Roberts – vieuxtech@gmail.com – https://github.com/sam-github
Saúl Ibarra Corretgé – saghul@gmail.com – https://github.com/saghul
Santiago Gimeno – santiago.gimeno@gmail.com – https://github.com/santigimeno
artityod – https://github.com/artityod
Aaron Bieber – aaron@bolddaemon.com – https://github.com/qbit
Jamie Davis – davisjam@vt.edu – https://github.com/davisjam
