Can we crawl the cloud like we crawl the Web? And can we query and mine the cloud like we do the Web? These were the questions behind the journey that led us to the Agentless System Crawler project. The reason we all love (well, at least use) the Web is that information from all around the world is at our fingertips with pretty much zero effort on our end. It is organized, curated, categorized and indexed for us, so that we can find the information we need easily and quickly. Web crawlers do not ask us questions like: “Can I install myself in your website so I can crawl it?” or “Can you please send me your credentials so I can see what you are up to on your website?” They just do all this seamlessly. Contrast this with how we manage IT systems today. That world is far smaller than the Web, yet how we work with it is surprisingly archaic. When we want to get information about systems, we either use hooks to access them or we inject small pieces of “proby” software into them, which are appropriately called “agents”. This is intrusive, unreliable, and hard to scale and maintain.

There is an obvious reason for this difference in techniques. On the Web, we deal with documents that we can passively read. When managing IT environments, we are dealing with systems and software, which we need to actively poke at, run commands against and ship information from. In the cloud, however, this assumption breaks down. What prevents us from treating a cloud system as data, as a document? A virtual machine (VM) or a container is simply another form of data encapsulated in a virtual disk, memory or namespace. As long as we can interpret this data, why can’t we passively, non-intrusively and seamlessly crawl a system’s data just like a web document?

The agentless system crawler was born from this line of thinking, where we crawl systems just like web documents. The key enabling technologies for the crawlers are well known in the community. A VM’s state can be observed from outside the VM by interpreting its persistent (disk) state and its volatile (memory) state, via a technique known as VM introspection (VMI). A container’s state can be observed from the underlying platform via namespace mapping techniques. Our goal with the crawler is to provide that web-crawler-like experience for systems running in the cloud, where the system state is captured and indexed seamlessly for the end user. No configuration, installation or access is needed. Moreover, we aim to provide a unified experience across different cloud runtimes and instance modalities with a single data collection framework. Whether it is a VM, container, bare-metal server or any other emerging form factor, the overall data collection experience should be one and the same: seamless, non-intrusive, and with the same level of richness in observed features and sampling granularity.
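To make the "system as a document" idea concrete, here is a minimal sketch in Python. It is not the Agentless System Crawler's actual code: the function names `crawl_files` and `build_demo_root` and the record fields are hypothetical, chosen only for illustration. The sketch assumes the system's filesystem is already readable from the outside as a directory tree, for example a VM disk image mounted read-only, or a container's root filesystem as seen from the host (on Linux this is typically reachable under `/proc/<pid>/root`). It then walks that tree and emits metadata records without running anything inside the crawled system.

```python
import os
import stat
import tempfile


def crawl_files(root):
    """Yield one metadata record per regular file under `root`.

    `root` stands in for a system's filesystem viewed from outside
    (an assumption of this sketch). The walk only reads metadata;
    it never executes or modifies anything in the crawled tree.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            info = os.lstat(full)  # lstat: do not follow symlinks out of the tree
            if not stat.S_ISREG(info.st_mode):
                continue
            yield {
                # Path as the crawled system itself would see it.
                "path": "/" + os.path.relpath(full, root).replace(os.sep, "/"),
                "size": info.st_size,
                "mode": stat.filemode(info.st_mode),
            }


def build_demo_root():
    """Create a tiny throwaway tree that plays the role of a guest filesystem."""
    root = tempfile.mkdtemp(prefix="crawl-demo-")
    os.makedirs(os.path.join(root, "etc"))
    with open(os.path.join(root, "etc", "hostname"), "wb") as f:
        f.write(b"demo-host\n")
    return root


# Usage: crawl the demo tree and print each record.
for record in crawl_files(build_demo_root()):
    print(record)
```

The records could then be shipped to an index and queried, much as a web crawler feeds a search engine. A real crawler would also need to interpret memory state and richer features such as packages, processes and configuration, which this sketch leaves out.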

We initially started working on crawlers to inspect VM images and to leverage image data for higher-level analytics. This technology was included in IBM Virtual Image Library. Afterwards, we grew the project to inspect live systems, different runtime abstractions and operation modes. Since its early days, Agentless System Crawler has been an open, collaborative project developed jointly with the external community. With Carnegie Mellon University, we developed streaming disk introspection techniques and analytics applications based on them, which were presented at IC2E 2014. With the University of Toronto, we expanded crawler capabilities for efficient memory introspection, and explored their use for near-field monitoring and operational analytics. We shared our implementation and results at Sigmetrics 2014 and VEE 2015.

More recently, we have extended the crawler to introspect container metrics and logs, as well as to crawl container images, and demonstrated these capabilities at HotCloud 2015. We employ them in IBM Bluemix Containers for agentless container monitoring and logging, and for vulnerability and compliance scanning of container images. There is a lot more in store for the crawler technology: enabling crawlers for new execution models, adding flexibility and pluggability to what we crawl, providing application-level insight and visibility, and crawling beyond compute instances to other encapsulations of IT systems. We look forward to working with the community on these next steps and to seeing how the crawlers evolve with open collaboration and shared creativity.
