This article is for anyone tasked with cloud service operations who is looking for an approach or platform for central monitoring, event management, alerting, and automated actions. It explains a best practice of using Sysdig as the monitoring solution and IBM Cloud Event Management as a centralized operations platform.

Prerequisites

To get the most out of this article, you should have a general understanding of public cloud architecture, Cloud Foundry, Kubernetes, and containers. It’s also beneficial to have a free IBM Cloud account and IBM Cloud Event Management trial account set up.

Estimated time

It takes 15 minutes to read this article and walk through the example.

A brief introduction to cloud service management and operations

With sophisticated DevOps and continuous integration/continuous delivery (CI/CD) practices, enterprises are on a quest to deliver quality software faster, aiming for development velocity, agility, and direct feedback from users.

These organizations are moving toward DevOps, unifying development and operations with tools and best practices that ensure continuous integration and delivery. This shift requires them to redefine processes, roles and responsibilities, and tools. New concepts such as Site Reliability Engineering (SRE) take a fresh approach to service management, allowing operations to sustain the increased volume coming in from development while protecting the performance and reliability of the solution.

Cloud service operations refers to all the activities that an organization performs to operate the cloud services that it offers to customers. Applications and services are monitored to ensure availability and performance according to service level agreements (SLAs).

Some best practices for central monitoring and operations on IBM Cloud public

On the IBM Cloud public environment, Kubernetes and Cloud Foundry are used as the primary application development platforms. IBM Cloud Foundry Enterprise Environment (CFEE) allows you to create and manage isolated environments for hosting applications exclusively for your enterprise. A CFEE instance is provisioned into a container cluster from the IBM Cloud Kubernetes Service. Multiple Kubernetes and CFEE instances, together with customer applications, can be deployed in an isolated network and grouped by identity and access management (IAM) resource groups.

From an end-to-end operations perspective, the three steps below are best practices for operations on IBM Cloud public (these align with the complete IBM methodology of service management for IT and cloud services):

  1. Monitor: Various monitoring solutions from infrastructure to Kubernetes are chosen to monitor virtual machines, Kubernetes and CFEE instances, and applications.
  2. Event Management/Notifications/Incident Management: Manage events that are generated by different monitoring tools, send alerts and notifications for events that disrupt services and applications, and escalate through incidents.
  3. Actions: Platform for SRE to take corrective actions for problem mitigation.
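
The flow through these three steps can be pictured with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the Event and Incident classes, the rule that events about the same resource and type belong to one incident, and the runbook names. It is not how any particular IBM Cloud service is implemented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    source: str      # monitoring tool that raised the event (hypothetical label)
    resource: str    # affected resource, for example a worker node
    event_type: str  # for example "Node Not Ready"
    severity: str


@dataclass
class Incident:
    resource: str
    event_type: str
    events: list = field(default_factory=list)


def correlate(events):
    """Step 2: group events about the same resource and type into one incident."""
    incidents = {}
    for event in events:
        key = (event.resource, event.event_type)
        incident = incidents.setdefault(key, Incident(event.resource, event.event_type))
        incident.events.append(event)
    return list(incidents.values())


# Step 3: hypothetical runbook names keyed by event type.
RUNBOOKS = {"Node Not Ready": "restart-worker-node"}


def choose_action(incident):
    """Pick a runbook for the incident, or escalate to a person if none matches."""
    return RUNBOOKS.get(incident.event_type, "escalate-to-on-call")


# Step 1: events as they might arrive from two monitoring sources.
events = [
    Event("sysdig", "worker-1", "Node Not Ready", "critical"),
    Event("sysdig", "worker-1", "Node Not Ready", "critical"),
    Event("other-monitor", "app-a", "High Latency", "warning"),
]

incidents = correlate(events)
actions = [(i.resource, len(i.events), choose_action(i)) for i in incidents]
```

In this sketch, the two duplicate node events collapse into a single incident, which is the kind of noise reduction a central event management layer is meant to provide before anyone is notified.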

Here is the proposed architecture for building an end-to-end operations solution on IBM Cloud public:

Operations diagram

For this article, I’ll focus on a best practice of using Sysdig as the monitoring solution for the Kubernetes and CFEE instances, and leveraging IBM Cloud Event Management as a centralized operations platform for event management, incident management, notifications, and runbook automation. IBM Cloud Monitoring with Sysdig is powered by Sysdig in partnership with IBM. Cloud Event Management sets up real-time incident management for your services, applications, and infrastructure, and receives events from various monitoring sources, either on premises or in the cloud.

In this practice, Sysdig monitors both Kubernetes and CFEE clusters, and posts alerts to Cloud Event Management through a generic webhook, since, at this point, Sysdig is not a default monitoring source for Cloud Event Management. You can create a generic webhook integration and define the alert normalization.
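
To make the webhook idea concrete, here is a minimal Python sketch of what posting an alert to a generic webhook amounts to: an HTTP POST with a JSON body. The URL is a placeholder (Cloud Event Management generates the real one when you create the integration), and the alert payload is a trimmed-down, hypothetical example rather than a real Sysdig export. The request is built but not sent.

```python
import json
import urllib.request

# Placeholder only: Cloud Event Management generates the real URL when you
# create the generic webhook integration.
WEBHOOK_URL = "https://example.invalid/webhook/placeholder"


def build_webhook_request(url, alert):
    """Build (but do not send) a JSON POST carrying the alert payload."""
    body = json.dumps(alert).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical, trimmed-down alert; a real Sysdig payload has more fields.
sample_alert = {"alert": {"name": "Node Not Ready", "severity": 2}}

request = build_webhook_request(WEBHOOK_URL, sample_alert)
# To actually send it: urllib.request.urlopen(request, timeout=10)
```

In practice Sysdig sends this request for you once the webhook is configured as a notification channel. A hand-built request like this one is mainly useful for checking the integration before a real alert fires.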

The part that requires the most configuration is the Cloud Event Management Alert field mapping, which makes it possible to send Sysdig alerts. Here are my steps:

  1. Export an alert as JavaScript Object Notation (JSON) from Sysdig, like this sample:

    Alert

  2. Create the mandatory field mappings in Cloud Event Management with the following event attributes: for Severity, enter alert.severity; for Summary, enter alert.description; for Resource name, enter entities.entity; and for Event type, enter alert.name.

  3. Create some optional field mappings: in the Details field, enter {"payload":$string($)}. This way, you’ll get the actual alert in the Details field, which helps you refine the mapping. Sometimes the exported alert is not fully consistent with the actual alert.
  4. Create URL field mappings that point to the Sysdig event and alert. In the URL1 field, enter event.url, and for URL2, enter alert.editUrl.

    Mapping2

  5. Trigger an alert to test the integration. In my case, I restarted one of the worker nodes, which triggered a “Node Not Ready” alert, as you can see in the screen capture below:

    Incident2
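
The effect of the field mappings in steps 2 through 4 can be approximated offline with a short Python sketch, which is handy for checking the paths against an exported alert before typing them into the mapping form. Cloud Event Management evaluates the mapping itself, so this is only an illustration. The lookup helper, the output key names, and the sample payload are all assumptions. The payload is shaped to fit the paths used above and is not a real Sysdig export.

```python
import json


def lookup(payload, path):
    """Follow a dotted path such as "alert.severity" through nested dicts.

    When a step lands on a list, the rest of the path is applied to each
    item, loosely imitating how the mapping expressions treat arrays.
    """
    current = payload
    for key in path.split("."):
        if isinstance(current, list):
            current = [item.get(key) for item in current if isinstance(item, dict)]
        elif isinstance(current, dict):
            current = current.get(key)
        else:
            return None
    return current


# Event field (hypothetical key names) -> path in the Sysdig payload,
# taken from steps 2 and 4.
FIELD_MAPPING = {
    "severity": "alert.severity",
    "summary": "alert.description",
    "resource_name": "entities.entity",
    "event_type": "alert.name",
    "url1": "event.url",
    "url2": "alert.editUrl",
}


def normalize(payload):
    """Apply the field mapping and keep the raw payload in details (step 3)."""
    event = {name: lookup(payload, path) for name, path in FIELD_MAPPING.items()}
    event["details"] = {"payload": json.dumps(payload)}
    return event


# Hypothetical payload shaped to match the paths above; not a real export.
sample = {
    "alert": {
        "name": "Node Not Ready",
        "description": "A worker node is not ready",
        "severity": 2,
        "editUrl": "https://example.invalid/alerts/1",
    },
    "event": {"url": "https://example.invalid/events/1"},
    "entities": [{"entity": "worker-1"}],
}

normalized = normalize(sample)
```

Any field that comes back as None points to a path that does not exist in the payload. That is the situation step 3 guards against, because the serialized payload in details shows what the alert actually contained.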

Summary

Now that you’ve seen how Sysdig and Cloud Event Management integrate, I encourage you to learn about integrations with other monitoring tools, such as Zabbix, Prometheus, and IBM Cloud Application Performance Management.

Acknowledgements

Many thanks to IBM senior technical staff members Fred Tucci, Paul French, and Isabell Sippli, and senior software engineer John Lee for their technical contributions and reviews.