Overview

Skill Level: Any Skill Level

This recipe covers the following:

1. Use case: the web UI is becoming slow
2. Incident details related to the use case
3. How an SRE analyzes and fixes the incident

Ingredients

IBM Cloud Pak for Multicloud Management 1.3.0 (MCM Hub)
Red Hat OpenShift Container Platform 4.3 (Managed Clusters)

Step-by-step

  1. Introduction

    This document explains how an SRE (Site Reliability Engineer) analyzes and resolves an incident using MCM monitoring.

     

    30-response-1

  2. Use case

    The Web user interface of the Wealthcare application is becoming very slow.

    An incident about the Wealthcare UI responding slowly is created in Multicloud Management Monitoring.

    The SRE analyzes and resolves the incident by using the events generated from the golden signals of the application runtime.

    Here is the use case.

    30-response-2

  3. Note

    This use case leverages the following objects for application monitoring and incident management:

    Thresholds
    Synthetic Tests
    Runbooks
    Event Policies
    Incident Policies

    How to create and configure them is discussed in a separate Git repository:

    https://github.com/GandhiCloudLab/mcm-monitoring-usecase-responsetime-configuration

     

  4. Abstract of the Incident and resolution steps

    Here is the abstract of the Incident and resolution steps.

    30-response-3

     

    A synthetic test creates an event.

    A threshold configuration also creates an event.

    Both events are correlated and an incident is created.

    The SRE looks at the events in the incident.

    The SRE opens the golden signals page of the web UI service.

    With the help of transaction tracing, the SRE figures out that the issue is with the financial plan service.

    The SRE then opens the golden signals page of the financial plan service and identifies that the problem is caused by an increase in traffic.

    The SRE opens the runbook, learns that the replica count of the pod must be increased, and does so.

    The incident is resolved.
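
    The correlation of the two events into one incident can be pictured with a minimal sketch. It is not how MCM implements correlation. The Event and Incident classes, the "same application within a time window" rule, the 10-minute window, and the timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class Event:
    source: str        # hypothetical label, e.g. "synthetic-test" or "threshold"
    application: str   # application the event refers to
    summary: str
    timestamp: datetime


@dataclass
class Incident:
    application: str
    events: List[Event] = field(default_factory=list)


def correlate(events: List[Event], window: timedelta) -> List[Incident]:
    """Group events for the same application that occur within `window`
    of the first event of an existing incident (assumed rule)."""
    incidents: List[Incident] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        for incident in incidents:
            same_app = incident.application == event.application
            in_window = event.timestamp - incident.events[0].timestamp <= window
            if same_app and in_window:
                incident.events.append(event)
                break
        else:
            # No matching incident was found, so open a new one.
            incidents.append(Incident(event.application, [event]))
    return incidents


# Illustrative data only; the timestamps are made up.
start = datetime(2020, 1, 1, 12, 0, 0)
events = [
    Event("synthetic-test", "wealthcare", "Synthetic test response time exceeded", start),
    Event("threshold", "wealthcare", "Wealthcare response time is high",
          start + timedelta(minutes=2)),
]
incidents = correlate(events, window=timedelta(minutes=10))
print(len(incidents), len(incidents[0].events))
```

    With these two sample events, both land in a single incident, which mirrors the incident with two associated events described in this recipe.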

  5. Incident list

    As an SRE, I log in to the MCM Monitoring console and look at the incidents for my group.

    There is an incident about the Wealthcare UI responding slowly.

    01-responsetime-1

     

    Two events are associated with this incident.

    It is in the Assigned state.

    It is assigned to the wealthcare group.

    Let's click Investigate to start the analysis.

    This opens the incident details.

  6. Incident Detail

    01-responsetime-2

     

    Two events are listed here.

    01-responsetime-3

     

    Here are the event details of the synthetic test.

    01-responsetime-5

     

    Here are the details of the event "Wealthcare response time is high".

    01-responsetime-5

    Let's go to the resource dashboard screen to understand the event in detail.

     

  7. Golden Signals of Web UI

    02-responsetime-1

     

    Here is the golden signals page of the Web UI service.

    The golden signals show:

    Latency
    Errors
    Traffic
    Saturation

     

    We can observe the following from the graph.

    Latency is high
    Traffic is high
    No Errors
    Saturation is normal

     

    Now we need to check whether this latency is caused by the dependent services.
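
    The reasoning behind this step can be written down as a small decision sketch. The GoldenSignals class, the next_step function, and the rule order are hypothetical and only illustrate the thought process. They are not part of MCM.

```python
from dataclasses import dataclass


@dataclass
class GoldenSignals:
    latency_high: bool
    traffic_high: bool      # recorded for completeness; not used by the rules below
    errors_present: bool
    saturation_high: bool


def next_step(signals: GoldenSignals) -> str:
    """Suggest where to look next. The rules are illustrative assumptions."""
    if signals.errors_present:
        return "inspect errors"
    if signals.saturation_high:
        return "inspect resource usage"
    if signals.latency_high:
        # Slow, but no errors and no saturation: the time may be spent
        # in a service that this one depends on.
        return "trace dependent services"
    return "no action"


# Observations for the web UI service as described in this step.
web_ui = GoldenSignals(latency_high=True, traffic_high=True,
                       errors_present=False, saturation_high=False)
print(next_step(web_ui))
```

    For the web UI observations listed in this step, the sketch suggests tracing the dependent services, which is what the next step does.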

  8. Transaction tracing in Web UI

    Click on the Transaction tracing icon of the API call.

     

    02-responsetime-2

     

    It goes to the tracing page.

     

    02-responsetime-3

     

    Choose any one of the transactions.

    02-responsetime-4

    The trace of the selected transaction is shown.

    You can see that the UI service calls the financial plan service.

    You can observe that the delay comes from the financial plan service.
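
    Reading a trace this way amounts to asking which service spends the most time on its own work. Below is a minimal sketch of that idea. The Span class, the service names, and the durations are made up for illustration, and the calculation assumes that child calls run one after another, not in parallel.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    service: str
    duration_ms: float                       # total time, including child calls
    children: List["Span"] = field(default_factory=list)


def self_time_by_service(root: Span) -> Dict[str, float]:
    """Sum, per service, the time not explained by child spans."""
    totals: Dict[str, float] = {}
    stack = [root]
    while stack:
        span = stack.pop()
        child_time = sum(child.duration_ms for child in span.children)
        own_time = max(span.duration_ms - child_time, 0.0)
        totals[span.service] = totals.get(span.service, 0.0) + own_time
        stack.extend(span.children)
    return totals


# Hypothetical trace: the web UI calls the financial plan service once.
trace = Span("web-ui", 2100.0, [Span("financial-plan", 1900.0)])
totals = self_time_by_service(trace)
slowest = max(totals, key=totals.get)
print(slowest, totals[slowest])
```

    In this made-up trace most of the time belongs to the financial plan service, which is the same conclusion the tracing page leads to.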

    So let's go to the financial plan service by clicking it in the service dependencies.

     

    02-responsetime-5

  9. Golden Signals of Financial plan

    This page shows the golden signals of the financial plan service.

    03-responsetime-1

    We can observe the following from the graph.

    Latency is high
    Traffic is high
    No Errors
    Saturation is normal

    It looks like traffic has increased, so the financial plan service is struggling to respond in time.
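
    A rough way to think about why more traffic calls for more pods is a capacity estimate. The sketch below is an assumption-laden illustration: the replicas_needed function, the request rate, and the per-pod capacity are hypothetical values, not measurements from the Wealthcare application.

```python
import math


def replicas_needed(request_rate: float, per_pod_capacity: float) -> int:
    """Estimate how many pods are needed to serve `request_rate` requests
    per second if each pod handles `per_pod_capacity` requests per second."""
    if per_pod_capacity <= 0:
        raise ValueError("per_pod_capacity must be positive")
    return max(1, math.ceil(request_rate / per_pod_capacity))


# Hypothetical numbers: 15 requests/s of traffic, 10 requests/s per pod.
print(replicas_needed(request_rate=15.0, per_pod_capacity=10.0))
```

    In practice the right replica count comes from the runbook or from load testing; this calculation only illustrates the relationship between traffic and pod count.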

    Now I need to find out how to resolve this problem.

    Let me check whether a runbook is associated with this incident.

     

     

  10. Resolve using Runbook

    I go to the incident page again and look for the runbook.

     

    04-responsetime-1

     

    A runbook contains the sequence of steps to solve the problem. It can be either manual or automated.

    There is a runbook associated with this event. Let me choose the runbook.

    I need to assign the incident to myself.

     

    04-responsetime-2

     

     

    Step 1 of the runbook asks me to go to the search screen in the MCM console.

     

    04-responsetime-3

     

    Step 2 asks me to enter the given search text.

    04-responsetime-4

     

    Step 3 asks me to choose the deployable object for the financial plan service.

    04-responsetime-5

     

    Step 4 asks me to edit the YAML and set the replica count to 2.

    04-responsetime-6
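
    The edit in step 4 boils down to changing one field. The sketch below shows that change on a plain Kubernetes Deployment-style manifest held as a Python dictionary. The manifest content and the name financial-plan are hypothetical, and the actual deployable object in MCM may nest the deployment differently.

```python
import copy
import json


def with_replicas(manifest: dict, replicas: int) -> dict:
    """Return a copy of a Deployment-style manifest with spec.replicas set."""
    if replicas < 0:
        raise ValueError("replicas must not be negative")
    updated = copy.deepcopy(manifest)   # leave the original untouched
    updated.setdefault("spec", {})["replicas"] = replicas
    return updated


# Hypothetical manifest; the real object in the console will have more fields.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "financial-plan"},
    "spec": {"replicas": 1},
}
scaled = with_replicas(deployment, 2)
print(json.dumps(scaled["spec"]))
```

    Only spec.replicas changes; the rest of the manifest stays as it was, which is the same effect as the manual YAML edit in the console.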

     

    After the pod count is increased, the performance of the financial plan service improves and users no longer experience the slowness.

    Now I resolve this incident.

     

    04-responsetime-7

     

    04-responsetime-8

     

    04-responsetime-9
