
Conducting a post-mortem analysis of a Node.js application

Understanding what causes an application to fail is critical to creating a healthy cloud infrastructure. In my current role, I work directly with customers to help them navigate their cloud adoption journey. As part of this role, I collaborate with them on best practices for creating, building, and maintaining their cloud infrastructure.

As organizations adopt cloud-centric technologies such as containers, they find holes in their operating model. For the most part, these gaps always existed, but the acceleration of the software development lifecycle exposes them sooner. One key aspect of an organization’s cloud operating model is observability, which is the ability to understand the internal state of a system from the available outputs. Most organizations focus their observability efforts on logging and monitoring, but other debugging techniques include tracing and post-mortem analysis.

Post-mortem analysis (PMA) is the ability to take a snapshot of a process just as it’s failing so that you can examine the snapshot away from the production environment. While this has long been possible in traditional VM-based environments, how to use PMA in a container environment hasn’t been straightforward.

In this tutorial, I walk you through a sample scenario that demonstrates how to manage and debug a crashing Node.js process to help you understand the benefits of post-mortem analysis and how to use it effectively. I show you how to install a core dump handler and analyzer and how to use them with a crashing Node.js process.

Although this tutorial focuses on Node.js, you can apply this flow to any runtime, including Java, Go, and Rust.

A brief introduction to post-mortem analysis

Post-mortem analysis refers to the technique of collecting a core dump and ancillary information when a process crashes and using that information to diagnose what caused the process to fail. Core dumps are one of the oldest debugging techniques. Check out an in-depth piece about core dumps and how they’re generated on my personal blog.
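
For comparison, this is roughly how a core has traditionally been captured on a plain Linux host; the final filename and location depend on your system’s /proc/sys/kernel/core_pattern setting, and app.js stands in for any Node.js program:

$ ulimit -c unlimited    # raise the core file size limit for this shell
$ node app.js &          # start the process to be dumped
$ kill -ABRT $!          # SIGABRT terminates the process and dumps core
$ ls core*               # the core file, subject to core_pattern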

Until the introduction of Kubernetes, post-mortem analysis in Linux containers was a thorny, unresolved problem. As this issue highlights, the interaction between the host and the container runtime for system facilities was never resolved satisfactorily. Kubernetes addressed this by offering pod policies as a way to safely access host resources in a manageable way, providing a route to bring a rich investigation technique to the container ecosystem.

Prerequisites

To complete the steps in this tutorial, you need:

  • A Kubernetes cluster in IBM Cloud Kubernetes Service or Red Hat OpenShift Kubernetes Service
  • The kubectl CLI
  • Helm
  • The IBM Cloud CLI (used to verify the dumps in Cloud Object Storage)

If you are deploying into OpenShift, the oc CLI is also required.

Component overview

In this tutorial, I show you how to install the following components, crash a process, and then debug the crashed process.

  • Core dump handler: Responsible for deploying the core processor on the host, which moves the core of a crashed process from the container to IBM Cloud Object Storage
  • Core dump analyzer: A preconfigured container that accesses Cloud Object Storage and acts as the debug environment
  • Crashing Node.js application: The application that we are going to debug

Architecture Diagram

Steps

  1. Install a core dump handler
  2. Install a crashing service
  3. Install the debugger environment
  4. Crash the service
  5. Verify the crash dump
  6. Debug the crash dump

Step 1. Install a core dump handler

First, you need to install the IBM Core Dump Handler into either IBM Cloud Kubernetes Service or Red Hat OpenShift Kubernetes Service. The internal architecture of the handler is detailed in the README.

Take note of the --set pvc.bucketName=A_UNIQUE_NAME option that you supply, because the debugger needs it to complete its install.
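
The README is the authoritative reference for the chart options, but the install follows the same shape as the debugger install later in this tutorial. The release name and chart path here are illustrative:

$ helm install coredump-handler . --namespace ibm-observe --set pvc.bucketName=A_UNIQUE_NAME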

Step 2. Install a crashing service

  1. To follow the steps in this tutorial, install this pre-configured Node.js service; we use it to demonstrate debugging.

    The container for the service is defined in this Dockerfile and published to Docker Hub.

    The service is configured to abort on an uncaught exception. While this approach is valid, you may want to set the option through the NODE_OPTIONS environment variable (available since Node.js 8.0) instead, because some teams use npm scripts to start their applications; NODE_OPTIONS applies to every node process started in the environment, so it is a more consistent way to ensure core dumps are enabled (see the sketch below).
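
    As a sketch, assuming a node:12 base image and an app started through an npm script (both assumptions), the Dockerfile might set the option like this:

     # Illustrative Dockerfile: base image and layout are assumptions
     FROM node:12
     # Applies to every node process in the container, including those
     # launched indirectly by npm scripts
     ENV NODE_OPTIONS="--abort-on-uncaught-exception"
     WORKDIR /home/node/app
     COPY . .
     CMD ["npm", "start"]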

    The actual error that causes the crash is from the original Node.js core dump post.

     var http = require('http');
     var host = process.env.HOST;
     var port = process.env.PORT;
     server = http.createServer(function myRequestListener(req, res) {
       res.writeHead(200, {'Content-Type': 'text/plain'});
       res.end('Hello World\n');
       res.not_a_function(); // not a real method; throws a TypeError
     }).listen(port, host);
     console.log(`Server process ${process.pid} running at http://${host}:${port}/`);
    

    Specifically, res.not_a_function() doesn’t exist, so calling it throws a TypeError; because the exception is uncaught, the process aborts.
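
    If you want to see the failure locally before deploying, run the app with the abort flag and hit the endpoint; the response is sent before the crash, then the uncaught TypeError aborts the process (the HOST and PORT values are illustrative):

     $ HOST=localhost PORT=3000 node --abort-on-uncaught-exception index.js &
     $ curl http://localhost:3000/
     Hello World
     # The TypeError thrown by res.not_a_function() now aborts the node
     # process, and a core is dumped if the system permits it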

  2. Simply install the service into the default namespace with:

     $ kubectl create deployment crashing-app --image=number9/example-crashing-nodejs-app
    
  3. Ensure the service is running by executing:

     $ kubectl get po
      NAME                            READY   STATUS    RESTARTS   AGE
      crashing-app-699c49b4ff-86wrh   1/1     Running   21         20d
    

Step 3. Install the debugger environment

  1. Clone the debugger project and change directory into the Node.js folder.

     $ git clone git@github.com:IBM/core-dump-containers.git
     $ cd core-dump-containers/nodejs/deploy
    
  2. Run the Helm chart with the following parameters, where the namespace is the same namespace into which you installed the core dump handler and the bucket name is the name you provided to the handler.

     $ helm install nodecoretools . --namespace ibm-observe --set pvc.bucketName=A_UNIQUE_NAME
    

    If you prefer to install the debugger into a different namespace, you have to copy the cos-write-access secret from the namespace where you installed the handler (for example, ibm-observe) into the namespace where you want to install the core dump analyzer, as shown below.
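
    One way to copy the secret is to round-trip it through kubectl, stripping the server-generated metadata and rewriting the namespace (my-debug-ns is illustrative, and jq is assumed to be available):

     $ kubectl get secret cos-write-access -n ibm-observe -o json \
         | jq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp) | .metadata.namespace = "my-debug-ns"' \
         | kubectl apply -f -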

  3. Validate the install:

     $ kubectl get pods -n ibm-observe
      NAME                                              READY   STATUS    RESTARTS   AGE
      coredump-daemonset-gdg82                          1/1     Running   0          6d4h
      nodecoretools-nodejs-core-dump-7cc56b4cfc-zzcbc   1/1     Running   0          40s
    

Step 4. Crash the service

Now that the installation is complete, the first step is to generate a core dump by crashing the service. Rather than exposing the pod to the network, you can exec into it with kubectl and curl the endpoint on localhost.

  1. Exec into the pod using the pod name. Your command should look something like:

     $ kubectl exec -it crashing-app-699c49b4ff-86wrh -- /bin/bash
    

    Before I show you how to crash the service, let me explain how the cores are captured.

  2. Run the following command in the pod:

     $ cat /proc/sys/kernel/core_pattern
     |/node/kcdt.sh -c=%c -e=%e -p=%p -s=%s -t=%t -d=/node/core -h=%h -E=%E
    

    The pattern returned is set up by the ibm-core-dump-handler on the host node, and every pod started on the node has this configuration. The command pipes the core dump to a bash script hosted on the node, passing the listed arguments.

    Details about the arguments can be found in the Linux core(5) man page; a brief description follows.

    • /node/kcdt.sh – The bash script that processes the core
    • c – Core file size soft resource limit of the crashing process (since Linux 2.6.24)
    • e – Executable filename (without path prefix)
    • p – PID of the dumped process, as seen in the PID namespace in which the process resides
    • d – The directory on the host that is mapped to the Object Store bucket (set by the handler rather than by a kernel specifier)
    • s – Number of the signal causing the dump
    • t – Time of dump, expressed as seconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC)
    • h – Hostname (same as the nodename returned by uname(2))
    • E – Pathname of the executable, with slashes (‘/’) replaced by exclamation marks (‘!’) (since Linux 3.0)
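
    To make the mechanics concrete, here is a minimal sketch of what a core_pattern pipe handler can look like. This is not the real kcdt.sh (that lives in the handler repository); it only illustrates that the kernel streams the core image to the script’s stdin and passes the specifiers as arguments:

     #!/bin/bash
     # Illustrative pipe handler, not the real kcdt.sh.
     # Parse the -x=value arguments supplied via core_pattern.
     for arg in "$@"; do
       case $arg in
         -e=*) EXE="${arg#*=}" ;;
         -p=*) PID="${arg#*=}" ;;
         -t=*) TIME="${arg#*=}" ;;
         -d=*) DIR="${arg#*=}" ;;
       esac
     done
     # The kernel streams the core image on stdin; compress it straight
     # into the directory that is mapped to the object store bucket.
     gzip > "${DIR}/${TIME}-${EXE}-${PID}.core.gz"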
  3. Now call the endpoint that causes the crash.

    The command should return Hello World. After a few seconds, the pod will terminate and you will be returned to your prompt.

     $ curl http://localhost:3000/
     Hello World
     node@crashing-app-699c49b4ff-86wrh:~/app$ command terminated with exit code 137
     $
    

Step 5. Verify the crash dump

You should be able to verify that the crash dump has been captured by listing the contents of the bucket. You can do this from the IBM Cloud CLI as follows:

$ ibmcloud cos list-objects -bucket A_UNIQUE_NAME
2fd506c1-4030-4b91-a37e-0242941273d7-dump-1588462466-crashing-app-699c49b4ff-86wrh-node-10-4-dump-info.json May 02, 2020 at 23:34:28 300 B 
2fd506c1-4030-4b91-a37e-0242941273d7-dump-1588462466-crashing-app-699c49b4ff-86wrh-node-10-4-image-info.json May 02, 2020 at 23:34:33 116 B 
2fd506c1-4030-4b91-a37e-0242941273d7-dump-1588462466-crashing-app-699c49b4ff-86wrh-node-10-4-runtime-info.json May 02, 2020 at 23:34:32 8.64 KiB 
2fd506c1-4030-4b91-a37e-0242941273d7-dump-1588462466-crashing-app-699c49b4ff-86wrh-node-10-4.core.gz May 02, 2020 at 23:34:32 1.22 MiB

You can also log into the IBM Cloud UI and browse to the bucket to see the files.

Dump file descriptions

Along with the core dump stored in the compressed core.gz file, there are three associated JSON files to help with the creation of downstream tools and processing.

  • x.dump-info.json – Contains the arguments that were passed to the collector by the operating system’s core_pattern
  • x.image-info.json – Contains the image that the running container was based on, which is useful for obtaining a copy of the original executable
  • x.runtime-info.json – The contents vary by platform, but key aspects of the environment, such as namespace and labels, are available
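
Because these files are plain JSON, they are straightforward to script against. As an example, you could pull the namespace and labels out of the runtime info with jq; the field names here are assumptions based on the description above, and may vary by platform:

$ jq '{namespace, labels}' *-runtime-info.json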

Step 6. Debug the crash dump

Now that you have the artifacts, I walk you through how to debug the core dump.

  1. Find the name of your debug pod and log into it.

     $ kubectl get po -n ibm-observe
      NAME                                       READY   STATUS    RESTARTS   AGE
      coredump-daemonset-gdg82                   1/1     Running   0          6d5h
      nodecoretools-nodejs-core-dump-xxxx-xxxx   1/1     Running   0          67m
    
  2. Use the following command to log in to the pod:

     $ kubectl exec -it nodecoretools-nodejs-core-dump-xxxx-xxxx -- /bin/bash
    
  3. All the crash dumps available are in the /cores folder. Take a look at what’s there.

     $ ls /cores
     2fd506c1-4030-4b91-a37e-0242941273d7-dump-1588462466-crashing-app-699c49b4ff-86wrh-node-10-4-dump-info.json
     2fd506c1-4030-4b91-a37e-0242941273d7-dump-1588462466-crashing-app-699c49b4ff-86wrh-node-10-4-image-info.json
     2fd506c1-4030-4b91-a37e-0242941273d7-dump-1588462466-crashing-app-699c49b4ff-86wrh-node-10-4-runtime-info.json
     2fd506c1-4030-4b91-a37e-0242941273d7-dump-1588462466-crashing-app-699c49b4ff-86wrh-node-10-4.core.gz
    
  4. As you are logged into the pod as the node user, the files in /cores are read-only, so you need to create a working folder to work with the files:

     $ cd ~
     $ mkdir work
     $ cd work
     $ cp /cores/* .
    
  5. Inspect the contents of the JSON files.

     $ cat 2fd506c1-4030-4b91-a37e-0242941273d7-dump-1588462466-crashing-app-699c49b4ff-86wrh-node-10-4-dump-info.json
    
     { "uuid":"2fd506c1-4030-4b91-a37e-0242941273d7", "dump_file":"2fd506c1-4030-4b91-a37e-0242941273d7-dump-1588462466-     crashing-app-699c49b4ff-86wrh-node-10-4.core.gz", "ext": ".gz", "timestamp": "1588462466", "hostname": "crashing-     app-699c49b4ff-86wrh", "exe": "node", "real_pid": "10", "signal": "4" }
    
  6. Now extract the core file from the compressed archive (you already copied it into your working folder in a previous step):

     $ gunzip 2fd506c1-4030-4b91-a37e-0242941273d7-dump-1588462466-crashing-app-699c49b4ff-86wrh-node-10-4.core.gz
    
  7. Run the debugger:

     $ llnode /usr/local/bin/node -c 2fd506c1-4030-4b91-a37e-0242941273d7-dump-1588462466-crashing-app-699c49b4ff-86wrh-node-10-4.core
     (lldb) target create "/usr/local/bin/node" --core "2fd506c1-4030-4b91-a37e-0242941273d7-dump-1588462466-crashing-app-699c49b4ff-86wrh-node-10-4.core"
     Core file '/home/node/work/2fd506c1-4030-4b91-a37e-0242941273d7-dump-1588462466-crashing-app-699c49b4ff-86wrh-node-10-4.core' (x86_64) was loaded.
     (lldb) plugin load '/usr/local/lib/node_modules/llnode/llnode.so'
     (lldb) settings set prompt '(llnode) '
    
  8. Now you can print the code that was executing when the application exited:

     (llnode) v8 bt
     * thread #1: tid = 10, 0x00000000019b94e9 node`v8::base::OS::Abort() + 9, name = 'node', stop reason = signal SIGILL
     * frame #0: 0x00000000019b94e9 node`v8::base::OS::Abort() + 9
     frame #1: 0x0000000000cb4b19 node`v8::internal::Isolate::Throw(v8::internal::Object, v8::internal::MessageLocation*) + 457
     frame #2: 0x0000000001024cb7 node`v8::internal::Runtime_ThrowCalledNonCallable(int, unsigned long*,  v8::internal::Isolate*) + 71
     frame #3: 0x00000000013a71b9 <exit>
     frame #4: 0x0000000001326765 <internal>
     frame #5: 0x000000000132cb64 myRequestListener(this=0x26fd6ebe0501:<Object: Server>, 0x3710416c4219:<Object: IncomingMessage>, 0x3710416c57b1:<Object: ServerResponse>) at /home/node/app/index.js:5:54 fn=0x000026fd6ebe04a1
     frame #6: 0x000000000132cb64 emit(this=0x26fd6ebe0501:<Object: Server>, 0x1d0deae2a979:<String: "request">) at events.js:1:10 fn=0x000021ff29e62749
     frame #7: 0x00000000013261bc <adaptor>
     frame #8: 0x000000000132cb64 parserOnIncoming(this=0x30b6948004b1:<undefined>, 0x26fd6ebe0501:<Object: Server>, 0x3710416c03c9:<Object: Socket>, 0x3710416c3051:<Object: Object>, 0x3710416c4219:<Object: IncomingMessage>, 0x30b694800639:<true>) at _http_server.js:1:10 fn=0x000026fd6ebde2e1
     frame #9: 0x000000000132cb64 parserOnHeadersComplete(this=0x3710416c2149:<Object: HTTPParser>, <Smi: 1>, <Smi: 1>, 0x3710416c41f9:<Array: length=6>, <Smi: 1>, 0x21ff29e70d31:<String: "/">, 0x30b6948004b1:<undefined>, 0x30b6948004b1:<undefined>, 0x30b6948006e9:<false>, 0x30b694800639:<true>) at _http_common.js:1:10 fn=0x000026fd6ebd8389
     frame #10: 0x000000000132a0dd <internal>
     frame #11: 0x0000000001329eb8 <entry>
     frame #12: 0x0000000000c9f890 node`v8::internal::(anonymous namespace)::Invoke(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) + 432
     frame #13: 0x0000000000c9fd48 node`v8::internal::Execution::Call(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, int, v8::internal::Handle<v8::internal::Object>*) + 88
     frame #14: 0x0000000000b6f77b node`v8::Function::Call(v8::Local<v8::Context>, v8::Local<v8::Value>, int, v8::Local<v8::Value>*) + 363
     frame #15: 0x0000000000a04ee1 node`node::(anonymous namespace)::Parser::Proxy<int (node::(anonymous namespace)::Parser::*)(), &(node::(anonymous namespace)::Parser::on_headers_complete())>::Raw(llhttp__internal_s*) + 1105
     frame #16: 0x000000000181cb8e node`llhttp__internal_execute + 990
     frame #17: 0x0000000000a075df node`node::(anonymous namespace)::Parser::OnStreamRead(long, uv_buf_t const&) + 239
     frame #18: 0x0000000000ab0686 node`node::LibuvStreamWrap::OnUvRead(long, uv_buf_t const*) + 166
     frame #19: 0x0000000001319a19 node`uv__read(stream=0x0000000003206ba8) at stream.c:1238
     frame #20: 0x000000000131a040 node`uv__stream_io(loop=<unavailable>, w=0x0000000003206c30, events=1) at stream.c:1305
     frame #21: 0x0000000001320188 node`uv__io_poll(loop=0x0000000002a73820, timeout=899) at linux-core.c:421
     frame #22: 0x000000000130e26b node`uv_run(loop=0x0000000002a73820, mode=UV_RUN_DEFAULT) at core.c:375
     frame #23: 0x0000000000a31ec3 node`node::NodeMainInstance::Run() + 611
     frame #24: 0x00000000009c1cc8 node`node::Start(int, char**) + 664
     frame #25: 0x00007f5294be809b libc.so.6`__libc_start_main + 235
     frame #26: 0x000000000095c085 node`_start + 41
    
  9. Now you can follow the remaining steps in the original lldb tutorial to finish the debugging exercise.

    The callback is named myRequestListener, so you can see that being run at frame #5. You can select that frame in lldb and use the llnode extensions to dump the source out:

     (llnode) frame select 5
     frame #5: 0x000000000132cb64 node`Builtins_InterpreterEntryTrampoline at base.tq:357
    
  10. View the source:

     (llnode) v8 source list
     5 server = http.createServer(function myRequestListener(req, res) {
     6 res.writeHead(200, {'Content-Type': 'text/plain'});
     7 res.end('Hello World\n');
     8 res.not_a_function()
    
  11. Finally, you can inspect variables (with v8 inspect). To review the arguments to the listener, let’s look at the request object and then at the URL field inside it:

     (llnode) v8 inspect 0x3710416c4219
     0x3710416c4219:<Object: IncomingMessage properties {
     ._readableState=0x3710416c43a1:<Object: ReadableState>,
     .readable=0x30b694800639:<true>,
     ._events=0x3710416c46c1:<Object: Object>,
     ._eventsCount=<Smi: 1>,
     ._maxListeners=0x30b6948004b1:<undefined>,
     .Symbol('kCapture')=0x30b6948006e9:<false>,
     .socket=0x3710416c03c9:<Object: Socket>,
     .connection=0x3710416c03c9:<Object: Socket>,
     .httpVersionMajor=<Smi: 1>,
     .httpVersionMinor=<Smi: 1>,
     .httpVersion=0x3710416c5599:<String: "1.1">,
     .complete=0x30b6948006e9:<false>,
     .headers=0x3710416c4cc1:<Object: Object>,
     .rawHeaders=0x3710416c41f9:<Array: length=6>,
     .trailers=0x3710416c4e99:<Object: Object>,
     .rawTrailers=0x3710416c4ed1:<Array: length=0>,
     .aborted=0x30b6948006e9:<false>,
     .upgrade=0x30b6948006e9:<false>,
     .url=0x21ff29e70d31:<String: "/">,
     .method=0x26fd6ebd8839:<String: "GET">,
     .statusCode=0x30b6948001b9:<null>,
     .statusMessage=0x30b6948001b9:<null>,
     .client=0x3710416c03c9:<Object: Socket>,
     ._consuming=0x30b6948006e9:<false>,
     ._dumped=0x30b6948006e9:<false>}>
    

You specify the object to inspect using its address in memory, not its name. Notice that llnode lets you inspect variables and drill down into the fields inside them. As well as being incredibly useful, the v8 inspect and v8 source list commands also highlight that everything is in the core dump: your users’ data and even your code.

There’s much more you can do with llnode, including ways to investigate problems at a higher level than examining individual objects. Brendan Gregg has written an excellent article on using llnode for Node.js memory analysis. If you have questions about anything in this article, or about core dumps in general, feel free to open an issue in the core-dump repository.
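
For example, much of that memory analysis is built on llnode’s object scanning commands. Assuming the same llnode build used above, you might try:

(llnode) v8 findjsobjects                     # count JS objects by constructor
(llnode) v8 findjsinstances IncomingMessage   # list addresses of IncomingMessage instances
(llnode) v8 inspect <address>                 # drill into one of the listed instances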

Summary

Hopefully that was worth the journey, and you now understand the benefits PMA can bring to isolating why an application has died.

I strongly encourage you to create a core dump management process on any system, not just Kubernetes. Applications crash more often than anyone wants, and simply letting your orchestration system restart failed processes is not a good strategy.

Feel free to use the components shared here as a starting point for building out your own implementation. If you would like to collaborate on the open source components in this tutorial, I would be delighted to work with you.

  • Core Dump Handler project
  • Core Dump Containers project

Future work

Currently, the core dump handler uses an S3-backed file system. While this makes configuring the host and sharing files trivial, the configuration is platform-specific. It may be worth including an S3 client in the core dump handler and the debugger to make installation on any Kubernetes environment simpler.

The management of the files in the debugger pod (that is, creating a working folder and copying the files) would be smoother with a simple CLI in the debugger pod to perform those actions.

This tutorial works because the crashing application and the debugger both use Node.js version 12. If the crashing app used a different version of Node.js, the matching node binary would need to be extracted from the container image. The information required to locate the correct version is in the x.image-info.json file associated with the core.

Support for more runtime types – Currently, https://github.com/No9/core-dump-containers only has a definition for Node.js. Other runtimes and languages should be simple to add using that layout.

RBAC and/or namespace-partitioned access to the dumps – Core dumps carry sensitive data, and a debug user currently has read-only access to all the core dumps taken on the cluster. Organizations will probably want to restrict that and integrate access with their standard authentication systems.

Take arbitrary core dumps – Unfortunately, apps don’t always crash when they are misbehaving. They can become totally unresponsive or, possibly even worse, just respond slowly. Core dumps can help in those scenarios too, so it would be nice to use a tool like gcore from a privileged pod to take a snapshot of a running application.
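
As a sketch of what that could look like, assuming a privileged debug pod that shares the host PID namespace and has gdb installed (both assumptions):

$ gcore -o /cores/live-snapshot 12345   # 12345 is an illustrative PID
# Writes /cores/live-snapshot.12345, which can then be analyzed with
# llnode just like a crash-generated core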

Thanks

Gireesh Punathil, Beth Griggs, Joe Sepi, Michael Dawson — Thanks for suggesting that this may be of interest to the wider world, helping me navigate the review process and providing feedback.