- Linux is a primary server-grade operating system, powering everything from banking servers to the cloud.
- Many enterprise Linux distributions exist, each with customized configuration and different setup.
- Setting up a Linux server for a service can be a challenge given the breadth of options and steps to get the setup right.
- For Linux to be an enterprise-grade platform, it is important to have an easy way for customers to provide all the data needed for diagnosing a problem, what IBM calls first-failure data capture (FFDC).
This blog explains how to set up a Linux system for capturing the right FFDC for service.
Linux is a primary server-grade operating system (OS) running all kinds of applications. Whether it is banking, order management, inventory, cloud, or databases, Linux is everywhere. Any issue that hinders these servers can affect a large set of customers and hurt service level agreements (SLAs), customer satisfaction (CSAT), and other metrics that feed the bottom line of the companies offering these services. It is therefore imperative to have infrastructure in place for diagnosing and resolving issues with a very short turnaround time.
Setting up a system to collect relevant logs and other debug data might seem simple. In the Linux context, however, different distributions (distros) ship different tools for the same function. Many of these tools are also not installed by default, which leaves the user scrambling for the right tools to gather the information needed to diagnose the problem. For instance, SUSE provides the supportconfig tool with SUSE Linux Enterprise Server (SLES), while Red Hat provides the sosreport tool with Red Hat Enterprise Linux (RHEL) for standard Linux system FFDC.
For problems that cause system downtime (such as a kernel crash), configuring and collecting FFDC can be complex. Different architectures may have different frameworks and knobs that need to be set up correctly so that the dump is captured on the first occurrence of the problem.
Thus far, it has been incumbent on the sysadmin to ensure that all relevant log collection tools are installed, that all relevant daemons are installed and configured to run at the right time, and that the kernel crash dump is set up correctly: the right amount of memory reservation, initramfs and bootloader updates, and so on. An error in any of the intermediate steps can prevent the dump from being captured.
The problem is compounded because the memory reservation that the crash kernel needs in order to boot depends on the system configuration, such as the size of RAM, the number of cores, the number and type of I/O adapters, and so on. Linux distributions default the memory reservation to some arbitrary (small) value that can be much less than what the system needs. If this is not corrected, the system ends up crashing twice: once due to the actual kernel bug, and a second time because the dump kernel runs out of memory (OOM). A few internal test teams and customers have encountered this issue.
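To make the reservation problem concrete, here is a minimal Python sketch that reads the `crashkernel=` parameter from a kernel command line and compares it with a required size. The function names and the required size are assumptions made for illustration; this is not how ServiceReport computes the value, and only the simple `crashkernel=SIZE[@OFFSET]` form is handled.

```python
import re

# Multipliers for the size suffixes accepted by the simple crashkernel= form.
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a size such as '512M' or '2G' to bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def crashkernel_reservation(cmdline):
    """Return the fixed crashkernel= reservation in bytes, or None.

    Only 'crashkernel=SIZE[@OFFSET]' is understood. Range-based syntax,
    'auto', and suffixes such as ',high' yield None in this sketch.
    """
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            value = token.split("=", 1)[1].split("@", 1)[0]
            try:
                return parse_size(value)
            except ValueError:
                return None
    return None


def reservation_is_sufficient(cmdline, required_bytes):
    """True if a fixed reservation exists and meets the required size.

    required_bytes is supplied by the caller; how to derive it from RAM,
    cores, and adapters is outside the scope of this sketch.
    """
    reserved = crashkernel_reservation(cmdline)
    return reserved is not None and reserved >= required_bytes
```

On a running system, the command line text can be read from `/proc/cmdline` and passed to `reservation_is_sufficient` together with whatever size is considered adequate for that machine.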
Solving the FFDC configuration problem
Expecting every tester/sysadmin to run a set of commands is both tedious and untenable. The next best solution is to automate the FFDC configuration.
ServiceReport is a tool to validate and repair the FFDC configuration on a system. Written in Python using a plug-in model, ServiceReport determines the system type and uses that information to work out which packages must be installed, which daemons must be running, how the kernel dump must be set up, and so on.
ServiceReport runs in two phases: validate and repair.
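The idea of a plug-in model keyed on platform type can be sketched in a few lines of Python. Everything below (the `Plugin` base class, the `register` decorator, the plug-in names, and the platform strings) is hypothetical and chosen for illustration; it is not the actual ServiceReport interface.

```python
from abc import ABC, abstractmethod


class Plugin(ABC):
    """Hypothetical plug-in base class (not the real ServiceReport API)."""

    name = "unnamed"
    platforms = frozenset()  # platform types this plug-in applies to

    @abstractmethod
    def validate(self):
        """Return a list of human-readable problems (empty if none)."""

    def repair(self):
        """Try to fix the problems; return True on success."""
        return False


REGISTRY = []


def register(cls):
    """Class decorator that records a plug-in class in the registry."""
    REGISTRY.append(cls)
    return cls


@register
class KdumpPlugin(Plugin):
    name = "kdump"
    platforms = frozenset({"powervm", "powernv"})  # illustrative values

    def validate(self):
        return []  # a real check would inspect the dump configuration


@register
class PackagePlugin(Plugin):
    name = "packages"
    platforms = frozenset({"powervm", "powernv", "x86"})  # illustrative

    def validate(self):
        return []  # a real check would query the package manager


def plugins_for(platform):
    """Instantiate every registered plug-in relevant to the platform."""
    return [cls() for cls in REGISTRY if platform in cls.platforms]
```

With this layout, supporting a new platform or a new check means adding one class; the code that selects and runs plug-ins does not change.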
In the validate phase, ServiceReport can:
- Determine the platform type.
- Determine the validation plug-ins relevant for that platform.
- Run the validation plug-ins and determine gaps in the configuration, such as missing packages, incorrect settings, and an invalid dump setup.
- Flag the issues on both the console and syslog, with a recommendation on how to resolve each one.
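The validate steps above can be sketched as a loop that runs each check, collects findings, and logs every finding with a recommendation. The `Finding` record, the check functions, and the package and daemon names used with them are assumptions for illustration. The sketch logs through Python's standard `logging` module; a syslog handler could be attached to the same logger.

```python
import logging
from dataclasses import dataclass


@dataclass
class Finding:
    check: str           # which check produced the finding
    problem: str         # what is wrong
    recommendation: str  # how to resolve it


def check_packages(installed, required):
    """Report every required package that is not installed."""
    missing = sorted(set(required) - set(installed))
    return [
        Finding("packages",
                f"package {name} is not installed",
                f"install {name} from the distribution repository")
        for name in missing
    ]


def check_daemons(active, required):
    """Report every required daemon that is not running."""
    stopped = sorted(set(required) - set(active))
    return [
        Finding("daemons",
                f"daemon {name} is not running",
                f"enable and start {name}")
        for name in stopped
    ]


def run_validation(checks, logger):
    """Run each zero-argument check and log all findings as warnings."""
    findings = []
    for check in checks:
        findings.extend(check())
    for item in findings:
        logger.warning("%s: %s (recommendation: %s)",
                       item.check, item.problem, item.recommendation)
    return findings
```

Each check is passed as a zero-argument callable, so the sets of installed packages and active daemons can come from whatever source the platform provides.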
ServiceReport has a systemd trigger, which can be configured to run the validate phase on every boot.
The repair phase needs to be run manually. If the system is configured to contact the appropriate distribution package repositories, ServiceReport will:
- Install and configure the missing packages.
- Trigger daemons to start appropriately.
- Configure dump parameters correctly.
It also indicates if it failed to correct any specific errors, which then need to be resolved manually.
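A repair pass of this kind can be approximated as a list of commands that are run in turn, with the failures collected for manual follow-up. The sketch below is an assumption about the general shape, not ServiceReport's implementation. The distro keys, the package name, and the unit name in the usage note are illustrative, and the command runner is injectable so that the logic can be exercised without changing the system.

```python
import subprocess


def install_command(distro, package):
    """Build the package installation command for a distribution."""
    if distro == "sles":
        return ["zypper", "--non-interactive", "install", package]
    if distro == "rhel":
        return ["yum", "install", "-y", package]
    raise ValueError(f"unsupported distribution: {distro!r}")


def start_daemon_command(unit):
    """Build the command that enables a systemd unit and starts it now."""
    return ["systemctl", "enable", "--now", unit]


def run_repair(commands, runner=subprocess.run):
    """Run each command; return those that failed and need manual action."""
    failed = []
    for cmd in commands:
        try:
            ok = runner(cmd, check=False).returncode == 0
        except OSError:
            ok = False  # for example, the executable is not installed
        if not ok:
            failed.append(cmd)
    return failed
```

For example, `run_repair([install_command("rhel", "kexec-tools"), start_daemon_command("kdump.service")])` would return an empty list if both steps succeed, and otherwise the commands that still need attention.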
Figure 1. ServiceReport in action
One of the primary use cases for the ServiceReport tool is to validate and fix the kernel crash dump configuration. It can detect whether the kernel memory reservation is sufficient, whether the crash kernel is configured to be loaded correctly, and whether the early boot components (for example, initramfs and dracut) have the appropriate modules installed. It can also fix any of these parameters in the repair phase.
An additional utility of the ServiceReport tool is its dump test trigger. Using this, you can trigger a dummy kernel crash dump to verify that it is working before starting the workload.
ServiceReport currently caters to the Linux on Power platform, but extending it to any other platform (say x86, Arm, or s390) is just a matter of writing the relevant plug-ins. The tool is available for download at https://github.com/linux-ras/ServiceReport.
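Two of these conditions can be observed directly on a running Linux kernel through the sysfs files `/sys/kernel/kexec_crash_loaded` and `/sys/kernel/kexec_crash_size`. The following is a minimal sketch of such a check; the function names are made up for illustration, and the sketch does not claim to reproduce the checks that ServiceReport actually performs.

```python
from pathlib import Path


def read_sysfs_int(path):
    """Read an integer from a sysfs-style file; None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def kdump_problems(crash_loaded, crash_size):
    """Turn the two raw values into a list of human-readable problems."""
    problems = []
    if not crash_size:
        problems.append("no memory is reserved for the crash kernel")
    if crash_loaded != 1:
        problems.append("crash kernel is not loaded")
    return problems


def check_kdump(sysfs_kernel="/sys/kernel"):
    """Inspect the running kernel's kexec crash state via sysfs."""
    base = Path(sysfs_kernel)
    return kdump_problems(
        read_sysfs_int(base / "kexec_crash_loaded"),
        read_sysfs_int(base / "kexec_crash_size"),
    )
```

An empty list from `check_kdump()` only means that memory is reserved and a crash kernel is loaded. Whether the reservation is large enough, and whether the initramfs contains the right modules, needs separate checks.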
Internal test teams at IBM have been using the ServiceReport tool for a few months now and the results are very encouraging.
ServiceReport currently caters to automatically configuring the reliability, availability, and serviceability (RAS) setup of a system. However, because it uses a Python plug-in model, it can easily be extended to configure workloads automatically. Anything scriptable can be a plug-in.
If your workload needs a set of RPMs to be installed, and network or I/O configurations to be set up in a specific manner, you can script those steps as a plug-in and let ServiceReport perform them automatically.
Setting up a server so that the right set of data is captured and made available to support personnel on the first occurrence of a problem is a key requirement for any enterprise-grade offering. The ServiceReport tool provides the ability to validate and repair issues with the FFDC setup of a Linux server, thus enabling a better service experience for customers. The plug-in model of ServiceReport lends itself to other uses, including automatically configuring the system to run a particular workload with custom plug-ins. With this facility now freely available to customers, the expectation is that the turnaround times for resolving Linux issues can come down considerably, leading to a satisfying customer experience.