
Monitor your cloud

Observability is a key aspect of providing a service. In Mikey Dickerson's hierarchy of service reliability, monitoring is the base on which all other needs of a service rely. To provide a reliable service, you must set up and define monitoring for it: specifically, well-thought-out observability that produces meaningful alerts (pages, Slack, email, phone) from meaningful data (measures, metrics).

The challenge is doing this for an ever-evolving deployment of services and infrastructure, which calls for multiple views of metrics and alerts that sometimes overlap and sometimes do not. The first part is deploying agents that collect all kinds of data; the next part is turning that data into something meaningful. This is the classic problem of separating signal from noise.

IBM Cloud Monitoring collects data and surfaces what is meaningful for cloud service observability. Its agents gather data as the deployed infrastructure and applications evolve, so your view of your services stays accurate and up to date. Its visualizations turn this data into meaningful observability through predefined, focused metrics dashboards for typical needs such as infrastructure, resource usage, and application overviews. You can also use IBM Cloud Monitoring to build application-specific observability by defining metrics and scopes tailored to your services.

A key part of observability is alerting people when a service needs attention. In practice, this is an alert, sent to a notification channel or service, that a value fell outside the bounds of a check. You can create alerts from predefined checks and verifications of a service to warn proactively about issues that affect service reliability. IBM Cloud Monitoring can also alert on anomalies, based on deviations from the service's observed patterns of operation.

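Prerequisites

  • An IBM Cloud account
  • A Kubernetes cluster in IBM Cloud Kubernetes Service
  • The IBM Cloud CLI with the Kubernetes Service plug-in, and kubectl, installed on your workstation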

Estimated time

Completing this tutorial should take you less than 15 minutes.

Steps

Set up observability with IBM Cloud Monitoring

1. Create an IBM Cloud Monitoring instance

Open the IBM Cloud Monitoring service within the IBM Cloud catalog.

Select the geographic region closest to you from the Select a location list.

Within the Configure your resource section, click Enable to receive platform metrics.

Click Create to provision your monitoring instance. The Observability dashboard opens and shows details for your new monitoring instance similar to the following screen capture image:

Screen capture of the Monitoring dashboard

2. Add a monitoring agent to your Kubernetes cluster

On the dashboard, go to your new monitoring instance and click Add sources.

The Monitoring Sources page opens with guidance for adding agents to your monitoring instance. For this tutorial, select the Kubernetes tab, copy the Public Endpoint command, and save it for use in step 3, after you log in to your Kubernetes cluster.

Screen capture of the Monitoring Sources page

Now, go to your Kubernetes clusters list and select your Kubernetes cluster.

Screen capture of the Kubernetes clusters list

On your cluster Overview page, click Actions and select Connect via CLI from the list.

Follow the instructions and custom commands on the Connect via CLI page to connect to your IBM Cloud cluster from your CLI (terminal). The commands and results will look similar to the following example:

$ ibmcloud login -a cloud.ibm.com -r us-south -g default --sso

API endpoint: https://cloud.ibm.com

Get a one-time code from https://identity-2.us-south.iam.cloud.ibm.com/identity/passcode to proceed.
Open the URL in the default browser? [Y/n] > y
One-time code >
Authenticating...
OK

Select an account:
1. MARC VELASCO's Account (-------)
Enter a number> 1
Targeted account MARC VELASCO's Account (----)

Targeted resource group default

Targeted region us-south

API endpoint:      https://cloud.ibm.com   
Region:            us-south   
User:              -------   
Account:           MARC VELASCO's Account (---)   
Resource group:    default   
CF API endpoint:      
Org:                  
Space:      

$ ibmcloud ks cluster config --cluster c2dg0cvd0sstp4gnc5sg

OK
The configuration for c2dg0cvd0sstp4gnc5sg was downloaded successfully.

Added context for c2dg0cvd0sstp4gnc5sg to the current kubeconfig file.
You can now execute 'kubectl' commands against your cluster. For example, run 'kubectl get nodes'.
If you are accessing the cluster for the first time, 'kubectl' commands might fail for a few seconds while RBAC synchronizes.

3. Install the monitoring agent

Deploy the monitoring agent by using the public endpoint command that you copied from the Monitoring Sources page within step 2. The command and results will look similar to the following example:

$ curl -sL https://ibm.biz/install-sysdig-k8s-agent | bash -s -- -a e85574b8-b784-472e-810d-e58510bb4580 -c ingest.us-south.monitoring.cloud.ibm.com -ac 'sysdig_capture_enabled: false'

* Detecting operating system
* Downloading Sysdig cluster role yaml
* Downloading Sysdig config map yaml
* Downloading Sysdig daemonset v2 yaml
* Downloading Sysdig kmod-thin-agent-slim daemonset
* Creating namespace: ibm-observe
* Creating sysdig-agent serviceaccount in namespace: ibm-observe
* Creating sysdig-agent clusterrole and binding
clusterrole.rbac.authorization.k8s.io/sysdig-agent created
* Creating sysdig-agent secret using the ACCESS_KEY provided
* Retreiving the IKS Cluster ID and Cluster Name
* Setting cluster name as sysdigcluster/c2dg0cvd0sstp4gnc5sg
* Setting ibm.containers-kubernetes.cluster.id c2dg0cvd0sstp4gnc5sg
* Updating agent configmap and applying to cluster
* Setting tags
* Setting collector endpoint
* Adding additional configuration to dragent.yaml
* Enabling Prometheus
Slim agent selected
Processing all-icr-io as all-icr-io
secret/all-icr-io created
configmap/sysdig-agent created
* Deploying the sysdig agent
daemonset.apps/sysdig-agent created

The list of agent pods deployed in the namespace "ibm-observe" are:
sysdig-agent-swmp6   0/1     Pending   0          0s

Make sure the above pods all turn to "Running" state before continuing
Should any pod not reach the "Running" state, further info can be obtained from logs as follows
'kubectl logs <agent-pod-name> -n ibm-observe'

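After a few minutes, check whether your agent is running in the cluster with the following command:

kubectl get pods -n ibm-observe

You should see results similar to the following. In this sample output, the agent pod is still Pending; before you continue, wait until its STATUS shows Running and READY shows 1/1.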

NAME                 READY   STATUS    RESTARTS   AGE
sysdig-agent-swmp6   0/1     Pending   0          28m
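Instead of re-reading the table by eye, you can script the readiness check. The following is a minimal Python sketch, not part of any IBM tool: it assumes kubectl is on your PATH and already targets your cluster (from step 2), and it relies only on the standard Kubernetes pod fields `status.phase` and `status.containerStatuses[].ready`. The function names are hypothetical helpers.

```python
import json
import subprocess
from typing import Any


def fetch_pod_list(namespace: str = "ibm-observe") -> dict[str, Any]:
    """Run `kubectl get pods -o json` for the namespace and return the parsed PodList."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,           # raise if kubectl exits with an error
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,
    )
    return json.loads(result.stdout)


def pods_not_ready(pod_list: dict[str, Any]) -> list[str]:
    """Return the names of pods that are not Running with every container ready.

    Hypothetical helper: a pod counts as ready only when its phase is "Running"
    and it reports at least one container status, all with ready == True.
    """
    not_ready = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        statuses = status.get("containerStatuses", [])
        all_ready = bool(statuses) and all(c.get("ready", False) for c in statuses)
        if status.get("phase") != "Running" or not all_ready:
            not_ready.append(name)
    return not_ready
```

For example, `pods_not_ready(fetch_pod_list())` returns the names of agent pods that still need attention, and an empty list once every agent pod is Running and ready. Parsing `-o json` rather than the printed table keeps the check independent of column widths.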

Work with IBM Cloud Monitoring observed data

IBM Cloud Monitoring includes a number of default dashboards. After your agent is configured and reporting data, these dashboards begin to show it.

Default dashboards

To view the default dashboards, return to the Observability dashboard within IBM Cloud.

From your list of IBM Cloud Monitoring instances, click Open dashboard for the new instance that you created in step 1, as indicated within the following screen capture:

Screen capture image of a list of active IBM Cloud Monitoring instances

The default Explore view shows your entire infrastructure and all the metrics available to you.

Screen capture image of the default Explore view with Hosts & Containers metrics selected

Click the Dashboards icon to find a number of prebuilt dashboards. For example, the Container Resource Usage dashboard shows resource utilization for every container that IBM Cloud Monitoring has data for, as in the following screen capture:

Screen capture image of the Container Resource Usage dashboard

Adjusting the widgets on the header can help you focus on a specific scope or range of containers.

Create new and custom dashboards

If you want to create a new dashboard from scratch, select the Dashboards tab and click the + Add Dashboard icon.

If you want to create a custom dashboard by using one of the pre-existing dashboards as a template, click the Create Custom Dashboard button located on that pre-existing dashboard. Then you can customize the panels and layout as needed within your new dashboard.

Screen capture image of a custom dashboard example titled "Container Resource Usage – Marc"

When you first create a dashboard, you have access to many metrics captured in your environment. These range from lower-level metrics, such as resource consumption on host hardware, to application-specific metrics.

  • Examples of host hardware resource-consumption metrics:

    • cpu.used.percent
    • cpu.idle.percent
    • cpu.cores.used
    • memory.bytes.used
    • memory.bytes.available
    • system.uptime
    • net.connection.count.in
    • net.connection.count.out
  • Examples of application-specific metrics:

    • elasticsearch.active_shards
    • couchdb.can_connect
    • mongodb.can_connect
    • nginx.net.connections

Click the Explore tab to view the metrics and groups that are available for your deployed infrastructure.

Define alerts and anomaly detection

IBM Cloud Monitoring can generate alerts from your data in several ways. You can alert when a specific metric is over, under, or equal to a defined threshold, such as CPU utilization at 100% for a prolonged period. You can alert when an event occurs a certain number of times, such as pod restarts or readiness probe failures. You can also create an anomaly alert for when the observed behavior of a service changes, such as when CPU utilization suddenly rises from 50% to 100%.

Metric alerts

Several metrics can signal that a service has failed or is about to fail. You can define alerts in IBM Cloud Monitoring at varying levels of scope, from all of your deployed services and infrastructure down to something more granular. For example, you can specify an alert that triggers when CPU utilization is greater than 50% anywhere across your observed stack of infrastructure and services. You can tune this type of alert as finely as you need: trigger it only when CPU utilization stays above 50% for a specific number of minutes, hours, or days, or any time it happens at all, and evaluate the condition against either the average of the metric or any single measured value.
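To make those tuning options concrete, here is a minimal Python sketch of how a "metric above threshold over a time window" condition can be evaluated. It is an illustration only, not IBM Cloud Monitoring's implementation or API: it assumes evenly spaced samples (for example, one CPU percentage per minute), so a window of N samples stands in for "N minutes", and the function and mode names are hypothetical.

```python
from statistics import mean
from typing import Literal, Sequence


def metric_alert_triggered(
    samples: Sequence[float],
    threshold: float,
    window: int,
    mode: Literal["average", "all", "any"] = "average",
) -> bool:
    """Evaluate a hypothetical 'metric > threshold' alert over the last `window` samples.

    mode="average": trigger when the mean of the window exceeds the threshold.
    mode="all":     trigger only when every sample in the window exceeds it (sustained).
    mode="any":     trigger when any single sample in the window exceeds it.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(samples) < window:
        return False  # not enough data yet to cover the full window
    recent = samples[-window:]
    if mode == "average":
        return mean(recent) > threshold
    if mode == "all":
        return all(value > threshold for value in recent)
    if mode == "any":
        return any(value > threshold for value in recent)
    raise ValueError(f"unknown mode: {mode!r}")
```

With per-minute CPU samples `[45.0, 70.0, 30.0, 40.0]` and a 50% threshold over four minutes, only `mode="any"` triggers; the average (46.25%) stays below the threshold. The choice of mode is the trade-off between catching brief spikes and avoiding noisy alerts.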

Screen capture image of a sample metric alert

When triggered, the alert can be sent to any integrated notification channel, such as PagerDuty or Slack.

You can also enable Sysdig Capture, which saves system state and forensic data when the alert triggers. You can then use that data in postmortems and follow-up reviews outside of production.

Event alerts

Underneath deployed services, all kinds of events happen that the services are unaware of, such as pods being re-created or moved, or readiness probes that do not respond. Metrics tell the story of what is happening in one small, quantifiable area. Events tell the story of all the pieces that move as a result of external and internal forces, such as resource contention, network congestion, or failure. You can configure event alerts to look for events with certain tags or sources, which helps you pinpoint common events such as failing resources or probes, or pods that move to other hosts. Because these events can be evidence of a current or future problem, you can create event alerts whose priority matches how critical each event is. As with metric alerts, you can turn on the appropriate notification channels and enable capture for when this type of alert triggers.
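The core of an event alert is a count: how many events matching a source and a set of tags occurred within a time window. The following minimal Python sketch shows that condition under stated assumptions; it is not the product's API. The `Event` record, the tag format (such as `"reason=Unhealthy"`), and `event_alert_triggered` are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    """A hypothetical, simplified event record."""
    timestamp: datetime
    source: str                                    # for example, "kubernetes"
    tags: frozenset[str] = field(default_factory=frozenset)


def event_alert_triggered(
    events: Iterable[Event],
    now: datetime,
    window: timedelta,
    min_count: int,
    source: Optional[str] = None,
    required_tags: frozenset[str] = frozenset(),
) -> bool:
    """Return True when at least `min_count` matching events occurred in the last `window`.

    An event matches when its timestamp falls inside [now - window, now], it comes
    from `source` (if one is given), and it carries every tag in `required_tags`.
    """
    start = now - window
    matching = sum(
        1
        for e in events
        if start <= e.timestamp <= now
        and (source is None or e.source == source)
        and required_tags <= e.tags  # subset test: all required tags present
    )
    return matching >= min_count
```

For example, with readiness-probe failures tagged `"reason=Unhealthy"` one, three, and eight minutes ago, a "two or more in five minutes" alert triggers, while "three or more in five minutes" does not, because the oldest failure falls outside the window.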

Screen capture image of a sample event alert

Anomaly detection

Anomaly detection is a powerful tool for alerting you to deviations from observed norms. Algorithms establish the expected state of a service and its infrastructure, and you receive an alert of configurable priority when an observation falls outside the established normal range, which you configure when you set up the alert. As with other alerts, you can turn on the appropriate notification channels and enable capture for this type of alert.
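The algorithms IBM Cloud Monitoring uses are not described here, so the following is only a minimal Python sketch of one common and much simpler technique: a z-score test against a rolling baseline. Each new value is compared with the mean and standard deviation of the values just before it, and `tolerance` (a hypothetical parameter, in standard deviations) plays the role of the configurable normal range. The function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], observed: float, tolerance: float = 3.0) -> bool:
    """Flag `observed` if it lies more than `tolerance` standard deviations from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples to estimate a normal range")
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return observed != center
    return abs(observed - center) / spread > tolerance


def anomalous_indices(series: Sequence[float], window: int, tolerance: float = 3.0) -> list[int]:
    """Compare each point with the `window` points before it; return the indices flagged as anomalous."""
    return [
        i
        for i in range(window, len(series))
        if is_anomalous(series[i - window:i], series[i], tolerance)
    ]
```

Applied to per-minute CPU utilization `[50.0, 49.0, 51.0, 50.0, 52.0, 48.0, 50.0, 100.0]` with a five-sample window, only the final jump from about 50% to 100% is flagged (index 7); the small fluctuations before it stay within three standard deviations. Production anomaly detectors typically also account for trends and seasonality, which this sketch ignores.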

Screen capture image of a sample anomaly detection alert

Conclusion

Observability is the critical foundation on which a reliable service is built. In this tutorial, you set up a monitoring agent to observe sample activity in a Kubernetes cluster running in a public cloud, reviewed the data through dashboards, and learned about several types of useful monitoring alerts.

IBM Cloud Monitoring provides powerful tools and visualizations for building observability into your services and infrastructure, from initial deployment to scaled production workloads, with built-in alerting on metrics, events, and anomalies. To learn more about its features, follow the Getting started tutorial and read the recent blog post about integrated Sysdig Secure features.