Overview

Skill Level: Any Skill Level

This recipe covers the following:

1. Use case: notification messages are not received
2. Incident details related to the use case
3. How an SRE analyzes and fixes the incident

Ingredients

IBM Cloud Pak for Multicloud Management 1.3.0 (MCM Hub)
Red Hat OpenShift Container Platform 4.3 (Managed Clusters)

Step-by-step

  1. Introduction

    This document explains how an SRE analyzes and resolves an incident using MCM monitoring.

    20-notification-1

  2. Use case

    20-notification-2

    Users of the Wealthcare application are not receiving notification messages for the operations they perform in Wealthcare.

    An incident about Wealthcare Notification Alerts is created in Multicloud Management Monitoring.

    The SRE now analyzes and resolves the incident using the events generated by:

    • ICAM Agent for MQ
    • Golden Signals sent by the Application Runtime
  3. Note

    This use case leverages the following objects for application monitoring and incident management:

    • Thresholds
    • Runbooks
    • Event Policies
    • Incident Policies

    How to create and configure them is covered in a separate Git repository:

    https://github.com/GandhiCloudLab/mcm-monitoring-usecase-notification-configuration

     

  4. Abstract of the Incident and resolution steps

    Here is a summary of the incident and the resolution steps.

     

    20-notification-3

     

    The ICAM Agent for MQ creates an event.

    The threshold configuration also creates an event.

    Both events are correlated and an incident is created.

    The SRE looks at the events in the incident.

    The SRE opens the Resource Dashboard for MQ and observes that some messages have been left in the queue for a long time.

    The SRE then opens the Golden Signals page of the Notification service.

    The SRE identifies that the problem is caused by memory saturation in the Notification service.

    The SRE opens the runbook associated with the event and learns that the service requires 750 MB of memory, then increases the memory limit.

    The incident is resolved.
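
    The correlation step above can be pictured with a small sketch. This is not how MCM implements event or incident policies; it is a simplified illustration that assumes events are grouped into one incident when they share an application label. The `Event` and `Incident` classes, their fields, and the `correlate` function are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    source: str        # e.g. "ICAM Agent for MQ" or "Threshold" (illustrative labels)
    resource: str      # resource the event refers to
    summary: str       # short description of the problem
    application: str   # assumed grouping key for correlation


@dataclass
class Incident:
    application: str
    events: List[Event] = field(default_factory=list)
    state: str = "Unassigned"


def correlate(events: List[Event]) -> List[Incident]:
    """Group events that share an application label into one incident."""
    incidents: Dict[str, Incident] = {}
    for ev in events:
        if ev.application not in incidents:
            incidents[ev.application] = Incident(application=ev.application)
        incidents[ev.application].events.append(ev)
    return list(incidents.values())


events = [
    Event("ICAM Agent for MQ", "PROD.QUEUE.1", "Message queue is not being read", "wealthcare"),
    Event("Threshold", "notification service", "Saturation is high", "wealthcare"),
]
incidents = correlate(events)
```

    With the two events from this use case, the sketch produces a single incident that holds both events, which mirrors what the SRE sees in the incident list.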

  5. Incident list

    As an SRE, I log in to the MCM Monitoring console and look at the incidents for my group.

    There is an incident about Wealthcare Notification Alerts.

     

    11-notification-1

    Two events are associated with this incident.

    It is in Assigned state.

    It is assigned to the wealthcare group.

    Let's click Investigate to start the analysis.

    This opens the incident details page.

  6. Incident Detail

    11-notification-2

    Click the Events tab. It shows the event details page, where two events are listed.

    The first event indicates that the Wealthcare message queue is not being read.

     

    11-notification-3

     

    Let's go to the resource dashboard screen for MQ to understand the event in detail.

  7. MQ Dashboard

    12-notification-1

    12-notification-2

    This event occurred at 5 PM.

    You can see that there are 2 messages in PROD.QUEUE.1 that have not been read for some time.

    We need to identify why they are not being read.
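
    The condition seen on the dashboard (messages present, nothing reading them) can be expressed as a simple check. The sketch below is only an illustration and does not use any MQ or ICAM API: it assumes you already have queue samples as `(minute, depth, dequeued_since_last_sample)` tuples, and the function name `queue_not_being_read` and the 10-minute window are hypothetical choices.

```python
from typing import List, Tuple

# (minute, queue depth, messages dequeued since the previous sample)
Sample = Tuple[int, int, int]


def queue_not_being_read(samples: List[Sample], window_minutes: int) -> bool:
    """Return True if the queue held messages and had no reads for the whole window."""
    if not samples:
        return False
    latest = samples[-1][0]
    recent = [s for s in samples if latest - s[0] <= window_minutes]
    # Require the samples to actually span the window before raising a flag.
    if latest - recent[0][0] < window_minutes:
        return False
    has_backlog = all(depth > 0 for _, depth, _ in recent)
    no_reads = all(dequeued == 0 for _, _, dequeued in recent)
    return has_backlog and no_reads


# Two messages sit in the queue and nothing is dequeued for 10 minutes.
stuck = queue_not_being_read([(0, 2, 0), (5, 2, 0), (10, 2, 0)], window_minutes=10)
# Same backlog, but a consumer read a message in between.
healthy = queue_not_being_read([(0, 2, 0), (5, 2, 1), (10, 1, 0)], window_minutes=10)
```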

    Let's go back to the incident details to analyze the other event associated with this incident.

  8. Notification service saturation

    11-notification-4

     

    The event indicates that saturation of the Notification service is high.

    Let's go to the resource dashboard screen to understand the event in detail.

    It opens the Golden Signals page of the Notification service.

  9. Golden Signal of Notification service

    12-notification-3

     

    The Golden Signals page shows:

    • Latency
    • Errors
    • Traffic
    • Saturation

    The graph shows that the saturation level is going up.
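
    To make the saturation signal concrete, here is a minimal sketch that assumes saturation is measured as memory used divided by the memory limit. That definition, the function names, and the sample numbers are assumptions made for illustration; the source does not say how MCM computes the value shown in the graph.

```python
from typing import List


def saturation_percent(used_mb: float, limit_mb: float) -> float:
    """Memory saturation as a percentage of the configured limit."""
    return 100.0 * used_mb / limit_mb


def is_rising(values: List[float], min_increase: float) -> bool:
    """True if the series never decreases and grows by at least min_increase overall."""
    if len(values) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(values, values[1:]))
    return never_drops and (values[-1] - values[0]) >= min_increase


# Illustrative readings only: memory used (MB) against a hypothetical 512 MB limit.
readings = [saturation_percent(used, 512) for used in (256, 320, 384)]
trend_up = is_rising(readings, min_increase=20.0)
```

    A steadily rising series like this one is the pattern the SRE reads off the saturation graph before looking for a runbook.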

    Now I need to work out how to resolve this problem.

    Let me check whether there is a runbook associated with this incident.

  10. Resolve using Runbook

    I go to the incident page again and look for the runbook.

     

    13-notification-1

     

    A runbook contains the sequence of steps to solve a problem. It can be either manual or automated.

    There is a runbook associated with this event. Let me choose the runbook.

    I need to assign the incident to myself.

     

    13-notification-2

     

    The first step in the runbook is not relevant to this issue.

     

    13-notification-3

    Let me go to the second step.

    13-notification-4

     

    As per step 2, the Notification service requires 750 MB of memory. Looking at the definition of its Kubernetes resource, I found that less memory was requested.

    I will increase the memory request and limit in the Kubernetes resource.
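
    One way to make this change is to patch the workload's container resources. The sketch below only builds the patch document and does not talk to a cluster. It assumes the Notification service runs as a Deployment with a container named `notification`, and it expresses the runbook's 750 MB as `750Mi`; the container name, the unit choice, and the function name `build_memory_patch` are all assumptions for illustration.

```python
import json


def build_memory_patch(container_name: str, memory: str) -> dict:
    """Build a patch that sets the memory request and limit for one container."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Containers are matched by name when the patch is merged.
                            "name": container_name,
                            "resources": {
                                "requests": {"memory": memory},
                                "limits": {"memory": memory},
                            },
                        }
                    ]
                }
            }
        }
    }


# "notification" is a hypothetical container name; 750Mi approximates the runbook's 750 MB.
patch = build_memory_patch("notification", "750Mi")
patch_json = json.dumps(patch)
print(patch_json)
```

    The printed JSON could then be applied with something like `kubectl patch deployment <deployment-name> -n <namespace> -p '<json>'` (or the equivalent `oc patch` on OpenShift), substituting the real deployment name and namespace. Editing the resource definition directly achieves the same result.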

    After the memory correction, the Notification service resumed without any issues.

    The Notification service was then able to read the messages from the message queue. Both events are now cleared.

    Now I resolve this incident.

     

    13-notification-5

     

     

    13-notification-6
