Short description of Track 3
Recommended for SREs, ITOps, and security SMEs, this track provides a view on how AI can help automate IT operations management to deliver a more secure, reliable hybrid cloud environment. Topics include incident detection and resolution, DevSecOps, and application observability.
Day 1: April 20, 2021
The move to cloud native applications is a radical change. Embracing cloud technologies brings agility, scalability and reliability to an organization's applications. The ephemeral nature of cloud services requires a different approach to management. However, the business critical applications that run in the cloud still require monitoring or observing to ensure that transactions are processed in a prompt and error-free manner. Learn more about the challenges of keeping observability in synchronization with rapidly changing application landscapes and how to make sense of the prodigious volume of metric and event data produced.
Applications run the business, but application downtime or performance degradation hurts customer experience and disrupts developer and IT operations priorities. As organizations modernize applications using microservices architectures and innovative technologies, it has never been more important to drive application-centric IT operations, deliver targeted actions for performance improvements, and leverage AI across development and application teams to automate and scale their efforts. This session provides a deep technical dive into IBM’s OEM Application Resource Management (ARM - Powered by Turbonomic) to highlight how AIOps gives highly distributed and dynamic applications the resources they need to perform. AIOps sits at the intersection of observability and actionability, helping customers automate so they can focus on innovation and deliver the best customer experiences. Learn how ARM brings higher levels of automation to APM solutions (for example, Instana) through AIOps. The combination of APM and ARM brings application-centric prioritization to resource management, based on application SLOs.
These days, it's common for developers to take over more of the operational aspects of an application, commonly known as DevOps. In this session, learn about common operational challenges—and their impacts on developers and IT operations/SREs—and some solution approaches to address them using IBM Cloud Pak for AIOps.
Isabell Sippli, Neil Boyette
IBM Watson AIOps enables enterprises to derive insights from multiple sources of data, such as logs, metrics, and events. It uses AI technologies to detect hidden anomalies that are hard to catch with rules and, in some cases, detects incident-causing anomalies several hours before the incident occurs. This demo walkthrough shows these and other IBM Watson AIOps features by detecting anomalies and localizing faults.
Padma Malladi, Tamir Arnesty
To streamline IT operations, Site Reliability Engineering (SRE) teams have invested in monitoring tools and log aggregators, but few SREs enjoy digging into logs with arcane searches, hoping to correlate signals that lead to resolution. Learn about a brand-new log anomaly detection service in Watson AIOps, designed to understand unstructured and semi-structured data without assuming the format of the data it consumes.
When something breaks, it's often because of a change. Production systems are constantly changing. Evaluating, approving, and deploying these changes is difficult and time-consuming, but supporting this change management process is an important role of operations support. History can guide us in estimating the riskiness of changes, with change risk informing both the proactive approval process, and the responsive incident management process. See how AIOps and Shift-Left testing can help SREs estimate risk and make reactive and proactive plans for success.
Gene Brown, Michael Nidd, Pritam S Gundecha, Raghav Batta
Site Reliability Engineers (SREs) want to get ahead of application and IT outages and resolve incidents before they impact users, but many teams today are overwhelmed by noise as they try to detect, isolate, diagnose, and resolve incidents quickly. The acceleration of idea to production demands more time and attention from people. Fortunately, there's a new way of incorporating AIOps into DevSecOps to improve efficiency and effectiveness in finding problems, fixing them, and deploying fixes before major incidents occur. AIOps can free up precious time by surfacing critical information before incidents occur and by positively influencing development decisions.
As cloud-native becomes the standard for application deployment, cloud teams must shift security practices to reduce risk. Applications built on containers and Kubernetes tend to be complex and are vulnerable to a variety of security issues from bugs to excess privileges to misconfigurations. Modern, cloud-native defenses and intelligence are needed to protect against threats. In this session, we’ll discuss and demonstrate key considerations for security and compliance across your DevOps lifecycle from development through production. Hear how to secure build pipelines, detect runtime threats, and validate compliance to confidently run containers in production. Learn how deep visibility into your containers together with intelligent AIOps can help you break down information silos, reduce the number of IT incidents, and resolve issues quickly.
Eric Carter, Mateo Burillo
Have you ever been blocked by waiting for a testing or development environment to be provisioned and made available to you? Have you ever run into issues caused by differences in prod and test environments? Wouldn’t it be easier if there was a single source of truth for infrastructure and application configuration that allowed for some sort of consistency? Well, in this session we'll explore the concepts of Everything as Code and how we can automate environment configuration from a single source of truth contained in a git repository via GitOps—then automate provisioning using Ansible.
In cloud native applications, a large fraction of operational failures, or outages, result from violations of Service Level Objectives (SLOs) defined on either service errors or service latency, commonly referred to as two of the "golden signals." A lightweight fault localization system can greatly reduce the human effort and domain knowledge needed to localize such golden signal-based operational failures. Our technique establishes causal relationships between the golden signal service errors and the error logs emitted by the constituent microservices (all modeled as time series data).
Ajay Gupta, Pooja Aggarwal, Seema Nagar
While ideal from an operations perspective, few production applications take advantage of complete and detailed logging and monitoring capabilities, due to the perceived impact on application performance. That's where the fault injection framework comes into play. Learn how this framework can help collect operational data to enhance monitoring, validate existing models, and more. Designing and deploying an application for efficient operations is critical in the hybrid cloud world. IT operations revolve around deployment, monitoring, automation, incident response, and security audits. Data collected by robust monitoring, logging, and tracing is particularly critical for managing hybrid applications, but very few production applications take advantage of these capabilities due to the perceived impact of extensive monitoring on application performance. In the absence of 'monitor everything' in production environments, how can the DevOps team ensure that all the necessary operational data is being collected to capture and alert on failures that might occur? In this session, learn how the fault injection framework we've built to collect operational data addresses this question and how it can be used to generate fault-related operational data, even for new features in a planned application release, to prepare for localizing the root cause of a problem in a microservice application.
Frank Bagehorn, Jesus Rios, Laura Shwartz
Any AIOps adoption effort that proceeds without the mainframe is incomplete. Why? Today the mainframe (IBM Z) is used by 71 percent of Fortune 500 companies, handles 90 percent of all credit card transactions, and runs mission-critical applications because of the platform's inherent benefits of scale, speed, security, and resiliency. To take advantage of the promises of AIOps, you need to take a holistic approach that includes IBM Z. This involves the aggregation and correlation of anomalies, subsystem situations, and end-to-end transaction tracing, from mobile to mainframe. Join us as we discuss how SREs can improve enterprise application observability by integrating IBM Z Monitoring and Analytics tooling into a hybrid AIOps solution.
Meet the speakers
Senior Research Engineer with IBM India Research Lab
Data Science Manager and Master Inventor
Director, Product Marketing at Sysdig
Director Technical Marketing, Technology Evangelist at Turbonomic
Senior IT Architect, Hybrid Cloud Data Research, IBM Research
Distinguished Engineer, IBM Hybrid Multicloud Delivery Guild - GTS, Delivery & Integrated Operation
Senior Technical Staff Member for AIOps at IBM
IBM Research Staff Member, Cognitive Computing, IBM Research
IBM Distinguished Engineer, Research, AI Ops for IT, IBM Research
Product Manager (EMEA) at Sysdig
Systems Management, IBM Research
Senior Technical Staff Member for AIOps at IBM
IBM Developer Advocate
Architect for Watson AIOps
Advisory Research Scientist at IBM Research, Bangalore, India
Pritam S Gundecha
Senior Data Scientist at Cloud and Cognitive Software group at IBM
Senior Developer, Hybrid Cloud Services, IBM Research
Solutions Architect at GitLab serving U.S. Public Sector agencies
Advisory Research Engineer at IBM Research India
Technical Marketing Manager at Instana
Backend Software Developer (Intern), Cloud Pak for Watson AIOps
IBM Offering Manager in the IBM AIOps and Service Management