In today’s hybrid cloud world, ensuring minimal service downtime is a challenging task. One of the key roles of a practitioner or site reliability engineer (SRE) is to support mission-critical applications and keep them running. How well an SRE understands an issue or incident depends on the SRE’s ability to:
- Correctly understand the symptoms
- Diagnose the problems
- Take the immediate next best action to resolve the issue quickly
Data from IT service support, such as agent resolutions, conversation channels, and technical documents, is voluminous and unstructured with no pre-defined form or schema. Extracting insights from this rich but completely unstructured data is key for SREs.
Among the plethora of diagnosis steps that can be taken, relevant insights from historical contextual data help to focus an investigation and quickly resolve a problem. However, sifting through large amounts of data manually while the service restoration clock is ticking is an impossible task. AI can assist in this process by accurately recognizing the main problem components in historical data and the actions that were performed on them.
Curation of these components and the actions from historical data helps in the following ways:
- Compilation of a rich knowledge base of IT operations’ issues and the actions that were taken
- Creation of runbooks that contain a condensed summary of actions for past similar issues
- Suggestion of proactive recommendations of the next best action given the current issue context
Figure 1 shows a sample resolution note that was written by a practitioner for an issue. It contains many components and action words (both domain-specific and generic). Linking action words and components based only on the proximity of the nouns and verbs can lead to ambiguous or incorrect links. For example, “DNS issues” or “cluster” could be linked to “cordon off”, “remove”, and “resulted”.
Figure 1: An example of a resolution text
To address this problem, IBM Cloud Pak for Watson AIOps uses shallow semantic parsers to analyze technical support documents and then uses unsupervised learning methods to extract key domain-specific component-action links.
The framework for the component-action model in Cloud Pak for Watson AIOps, which is shown in Figure 2, consists of three basic steps:
- Key component phrase extraction
- Action word extraction
- Component-action linking
Figure 2: The framework of a component-action model in Cloud Pak for Watson AIOps.
Component phrase extraction
First, the key component phrases are extracted from the input resolution text using various linguistic and non-linguistic features. The approach begins with a generic extraction of candidate components. Then, various noise filtering techniques are used to discard key terms that are not relevant to the domain of technical support services. These techniques include:
- Document-level relevance metrics
- IT product domain-specific glossaries and word embeddings
- IT-support-specific annotations, referring to concepts such as symptom, problem, and resolution
Action word extraction
In Cloud Pak for Watson AIOps, an action is defined as a change operation that a practitioner performs to fix an issue described in a customer ticket. Only action words that result in a state change of a component are considered, for example “restart” or “increase.” In the phrase “referred a config file”, the word “referred” is not an action because it does not result in a state change. An action word can also sometimes act as a component, as shown in Figure 3. In the first sentence, the word “update” has the role of a component, whereas in the second sentence it is a valid action word.
Figure 3: Examples depicting the role of action words
Semantic parsing is used to get action segments, referred to as intents, from resolution texts. Next, only those sentences from the documents that are marked as intents are selected. The frequency distribution of the action words present in those sentences is computed. If the ratio of a word’s frequency of occurrence as a noun to its frequency of occurrence as a verb is greater than a particular threshold, that word is not included in the action dictionary.
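The noun-to-verb ratio filter described above can be illustrated with a minimal Python sketch. It is not the product's implementation: the function name `build_action_dictionary`, the threshold value of 1.0, the `NOUN`/`VERB` tag labels, and the sample tokens are all assumptions made for illustration, and the part-of-speech tags are assumed to come from an upstream tagger.

```python
from collections import Counter


def build_action_dictionary(tagged_tokens, candidates, max_noun_verb_ratio=1.0):
    """Keep candidate words whose noun/verb frequency ratio is within the threshold.

    tagged_tokens: iterable of (word, pos) pairs taken from intent sentences.
    candidates: set of lowercase candidate action words.
    max_noun_verb_ratio: hypothetical threshold; the article does not give a value.
    """
    noun_counts = Counter()
    verb_counts = Counter()
    for word, pos in tagged_tokens:
        word = word.lower()
        if word not in candidates:
            continue
        if pos == "NOUN":
            noun_counts[word] += 1
        elif pos == "VERB":
            verb_counts[word] += 1

    actions = set()
    for word in candidates:
        verbs = verb_counts[word]
        if verbs == 0:
            # Never observed as a verb: treat the ratio as unbounded and exclude.
            continue
        if noun_counts[word] / verbs > max_noun_verb_ratio:
            continue
        actions.add(word)
    return actions


# Hypothetical tagged tokens from sentences marked as intents.
sample_tokens = (
    [("update", "NOUN")] * 3
    + [("update", "VERB")]
    + [("restart", "VERB")] * 4
    + [("restart", "NOUN")]
)
action_dictionary = build_action_dictionary(sample_tokens, {"update", "restart"})
```

In this made-up sample, “update” appears three times as a noun and once as a verb, so its ratio of 3.0 exceeds the threshold and it is left out, while “restart” (ratio 0.25) is kept.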
Cloud Pak for Watson AIOps uses SystemT, an industrial-strength declarative rule-based information extraction system based on an algebraic framework. Cloud Pak for Watson AIOps uses the Action and Role views to capture the link between mentioned components and actions.
- The Role view contains information about the thematic roles (or thematic relations) of the action, such as the component that deliberately performs the action (agent), the action’s undergoer (theme), or the component benefiting from the action (beneficiary). In Cloud Pak for Watson AIOps, only those components that have the role of theme and agent are considered.
- The Action view is centered on the actions in each sentence.
Figure 4 shows the output of the Action and Role views for an input resolution text.
Figure 4: The output of the Action and Role views for an input resolution text
The token “team” has the role of agent. The tokens “capacity” and “unused pods” are themes. The words “is”, “increase”, and “delete” are the action words.
In this case, the semantic analysis returns the following links based on the aid value in the Role view and the id in the Action view:
- The action word “increase” is linked with components “capacity” and “team” (aid = id = 2)
- The action word “delete” is linked with components “unused pods” and “team” (aid = id = 6)
Here, the links (delete, unused pods) and (increase, capacity) are valid because the action words “delete” and “increase” result in a state change of the components “unused pods” and “capacity”, respectively. The links (increase, team) and (delete, team) are not valid because “team” is not a valid component.
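The linking and filtering logic for this example can be sketched in Python as a join on the identifier, followed by a role check and a component check. The `Action` and `Role` record shapes, the function `link_components`, and the `valid_components` set are hypothetical; the actual SystemT view schema is not shown in this article. The records mirror the values quoted above (the “is” action is omitted because its id is not given), and the set of valid components stands in for the output of the component phrase extraction step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    id: int      # identifier of the action within the sentence
    verb: str    # the action word


@dataclass(frozen=True)
class Role:
    aid: int     # id of the action that this role belongs to
    role: str    # thematic role, for example "agent", "theme", "beneficiary"
    text: str    # the component phrase filling the role


def link_components(actions, roles, valid_components, allowed_roles=("agent", "theme")):
    """Join Role rows to Action rows on aid == id, then keep only valid components."""
    verbs_by_id = {action.id: action.verb for action in actions}
    links = []
    for role in roles:
        verb = verbs_by_id.get(role.aid)
        if verb is None or role.role not in allowed_roles:
            continue
        if role.text not in valid_components:
            continue  # for example, "team" is not a component
        links.append((verb, role.text))
    return links


# Hypothetical records mirroring the example values in the text.
actions = [Action(2, "increase"), Action(6, "delete")]
roles = [
    Role(2, "agent", "team"),
    Role(2, "theme", "capacity"),
    Role(6, "agent", "team"),
    Role(6, "theme", "unused pods"),
]
valid_links = link_components(actions, roles, {"capacity", "unused pods"})
```

With these inputs, only (increase, capacity) and (delete, unused pods) survive, because “team” is absent from the assumed set of valid components.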
In Cloud Pak for Watson AIOps, the most common resolution steps for a given issue are recommended using the SystemT-based component-action linking method. Figure 5 shows an example where six issues have been identified as similar to a given runtime issue. It shows the resolution steps taken for those six issues.
Figure 5: IT Operations Example
The action-component links extracted from each of these resolutions can further be clustered to obtain the frequency distribution of each such link. As shown, the action of rebooting front-end pods is the most common resolution from the similar issues.
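As a rough illustration of how such a frequency distribution could be obtained, the following Python sketch counts identical links across similar issues. It substitutes exact-match counting for the clustering mentioned above, which is a simplification, and the function `rank_links` and the six sample link lists are invented for illustration; they are not the data shown in Figure 5.

```python
from collections import Counter


def rank_links(links_per_issue):
    """Count how many issues contain each (action, component) link, most common first."""
    counts = Counter()
    for links in links_per_issue:
        # Use a set so that a link repeated within one resolution counts once per issue.
        counts.update(set(links))
    return counts.most_common()


# Hypothetical links extracted from the resolutions of six similar issues.
similar_issue_links = [
    [("reboot", "front-end pods")],
    [("reboot", "front-end pods"), ("increase", "capacity")],
    [("delete", "unused pods")],
    [("reboot", "front-end pods")],
    [("increase", "capacity")],
    [("restart", "database")],
]
ranked_links = rank_links(similar_issue_links)
top_link, top_count = ranked_links[0]
```

Exact matching treats near-duplicates such as “front-end pod” and “front-end pods” as different links, which is one reason a real system would likely normalize or cluster the links before counting them.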
This type of analysis can help SREs understand what common resolution steps were taken in the past for similar issues, and then help them prioritize their tasks.
The analytical methods highlighted in this article can significantly reduce the time and expense of issue resolution in production IT environments. These capabilities are available today as an integral component of IBM Cloud Pak for Watson AIOps. Check it out and discover how large parts of your incident management process can be automated, reducing costs and improving the availability of your IT environment.