The history of IBM’s contributions to Cloud Foundry, part 2
The Cloud Foundry Foundation and components of the Cloud Foundry project
In part 1 of this three-part series, we retraced the early days of Cloud Foundry as a project that started at VMware and grew to become a leading platform-as-a-service (PaaS), which gave rise to an agile cloud transformation of many enterprises. Along the way, we introduced the key players, described IBM’s involvement and contributions, and highlighted some early friction points.
In part 2, we recall the creation of the Cloud Foundry Foundation, an independent not-for-profit organization under the umbrella of the Linux Foundation that owns all intellectual property and code for Cloud Foundry. Along with solving legal issues, the foundation helped transition Cloud Foundry from an open source software project under the tutelage of one company to a project that is open to the world.
In addition to understanding the evolution of Cloud Foundry from the broad brush strokes of the foundation and its processes and contributors, it’s also interesting to see the evolution of its major components. After all, Cloud Foundry is a huge technical project with hundreds of GitHub repositories and millions of lines of code. How can we make sense of this massive code base? What are key components? How have they evolved? Read on for more.
Cloud Foundry Foundation 2016-present
One of the steps that IBM took in almost every open source project it became seriously involved in was to ensure a sound governance model that works for all. We had done so for Linux, Java, Node.js, and many other hot open source software projects. With Cloud Foundry under the guidance of one company, independent governance was an issue that needed to be addressed from the start.
Sam Ramji, the first CEO of the Cloud Foundry Foundation, at the Cloud Foundry Summit 2016 in Frankfurt, Germany
So while IBM engineers gradually participated in the dojo and started contributing more and more to various aspects of the Cloud Foundry code base, Chris Ferris, along with Todd Moore, worked with their counterparts at Pivotal, SAP, and a few other early Cloud Foundry adopters to charter a governing body with accompanying rules and processes. The end result was the creation of the Cloud Foundry Foundation.
Naturally, a new foundation brought some work and headaches. For one, the Cloud Foundry Foundation needed to be staffed with a leader and associated lieutenants. That is not an easy process, as anyone knows who has tried to hire competent, experienced technical professionals in the Bay Area or wherever tech talent is in short supply. Luckily, the founding board made a great decision to hire Sam Ramji (now VP Cloud Platform at Autodesk) to be the CEO.
Cloud Foundry Foundation CTO Chip Childers and Director Swarna Podila
In a confident yet humble and inclusive manner, Sam led the Cloud Foundry community in its early years. Taking center stage at the Cloud Foundry Summit, Sam was the leading cheerleader and voice of reason who helped Cloud Foundry bridge the gap from an interesting OSS project to a well-adopted operating system for cloud applications. He also hired a solid team, including Chip Childers, Abby Kearns, and others, to complete the growing foundation, which continued to flourish even after Sam left to take an executive position at Google Cloud in late 2017.
As described in part 1, in creating Cloud Foundry, the ex-Googlers at VMware aimed to build the next version of Google’s internal Borg project. Because Borg was rumored to be an all-encompassing scheduler and manager for Google’s workloads and infrastructure, the Cloud Foundry designers decided to divide the concerns and separate the management of the infrastructure from the bits that managed apps. BOSH was created to provide infrastructure management, while the core of Cloud Foundry would manage containers and the applications running in them.
From the start of the Cloud Foundry open source software project, BOSH was somewhat of a problem stepchild. BOSH was designed to be independent of Cloud Foundry, but at the same time, Cloud Foundry was and always will be the largest workload that BOSH manages. The overall design is simple, but the devil is in the details. Taking an agent approach to managing infrastructure, BOSH can be seen as divided into a central director that is controlled by the human operators and various agents that execute semi-autonomously on the managed VMs.
Dr. Nic Williams of Stark & Wayne, an early BOSH enthusiast and author of The Ultimate Guide to BOSH
When operators issue commands to the director for a particular deployed workload, the director has to gather all knowledge that it has about the cluster for the workload and the state of its currently allocated resources. It must draw a plan to achieve the new desired state, and then it must issue commands to the various agents to get to that new state. When the commands require creating infrastructure resources or deleting existing ones, commands are issued to a cloud interface to provision, set up, and clean up the resources.
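To make the director’s planning step more concrete, here is a minimal sketch of comparing the current state of a deployment with the desired state and producing an ordered list of steps. It is only an illustration of the idea described above, not BOSH’s actual director code: the `Instance` type, the `plan` function, and the action names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical record of one managed VM (not a real BOSH type)."""
    name: str
    vm_type: str


def plan(current, desired):
    """Return the steps needed to move from `current` to `desired`.

    Each step is an (action, instance_name) tuple. The action names are
    illustrative only; a real director would translate them into calls
    to agents and to the cloud interface.
    """
    cur = {inst.name: inst for inst in current}
    des = {inst.name: inst for inst in desired}
    steps = []
    # Instances that are no longer wanted get cleaned up first.
    for name in sorted(cur.keys() - des.keys()):
        steps.append(("delete", name))
    # Instances that do not exist yet must be provisioned.
    for name in sorted(des.keys() - cur.keys()):
        steps.append(("create", name))
    # Instances that exist but differ from the desired spec are recreated.
    for name in sorted(cur.keys() & des.keys()):
        if cur[name] != des[name]:
            steps.append(("recreate", name))
    return steps


current = [Instance("router/0", "small"), Instance("api/0", "small"),
           Instance("cell/0", "small")]
desired = [Instance("router/0", "small"), Instance("api/0", "large"),
           Instance("cell/1", "small")]
steps = plan(current, desired)
```

In this toy example, `router/0` is left alone, `cell/0` is deleted, `cell/1` is created, and `api/0` is recreated with its new VM type. A real system also has to handle ordering constraints, retries, and partial failures, which is where much of the complexity discussed next comes from.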
While this design is conceptually simple, in practice it is not. The complexity comes not only from having to deal with errors that can be stochastic in nature, and from two more components that I have not yet mentioned, but also from the need to support many infrastructures. Let’s briefly explore how IBM helped Pivotal evolve BOSH into the best implementation of that architecture.
The BOSH system is highly distributed in its execution but centralized in its control. Complexity comes from the fact that the abstractions created could work well (through testing and adjustments) for a small number of infrastructures, but they became unwieldy with the two other components that also varied: releases and cloud provider interfaces (CPIs).
Dmitriy Kalinin of Pivotal, product manager of BOSH during the CPI extraction and boom
First, BOSH is only useful if it can easily deploy and manage the lifecycle of cloud software (for example, Cloud Foundry, MariaDB, Prometheus, and other complex interacting distributed middleware). Capturing the interactions of the deployed software and the configuration of those software components is the job of a BOSH release: a structured packaging and description of a piece of cloud software that is managed by BOSH.
Employing its own language and terms, BOSH strictly specifies how “releases” (as they are called) are managed and packaged. And because all running software requires a base set of OS primitives to abstract the underlying resources, all BOSH VMs run a common streamlined Linux-based OS. This is called the BOSH stemcell.
IBM’s main contributions to BOSH were not only to help grow the project and deal with day-to-day development efforts, but also to stretch and test its design at scale. First, by creating one of the first non-Intel architecture stemcells (for the Power architecture), we were able to help test, verify, and mature the stemcell aspects of BOSH.
IBM China Development Lab BOSH team at lunch in 2017
Second, and perhaps most importantly, the BOSH CPI had been designed to allow extensions but in reality was baked into the director’s code. Working closely with Pivotal, we helped extract and streamline the CPI, resulting in a boom of BOSH-supported infrastructures (up to 20 at the latest count).
Finally, while BOSH was tested for deploying and managing Cloud Foundry environments comprised of dozens to hundreds of VMs, the success of IBM Bluemix (the previous name of IBM Cloud) led to multiple environments spread across the world, each comprising hundreds to thousands of VMs. Naturally, the law of large numbers simply mandates that failures and issues in such large-scale systems become daily issues.
By testing BOSH at this scale, we encountered various anomalies and scaling-specific issues, which we worked with Pivotal to solve. One specific example was the need for BOSH-native DNS, which was a feature needed by environments stretching the limits of PowerDNS, typically used in early Cloud Foundry deployments.
Every platform as a service (PaaS) contains a component that makes decisions on what jobs to run, when to run them, and what resources to allocate to each job. This component is similar to operating system kernels. Such schedulers or job allocation functions have been studied in operating system literature. Depending on the jobs, the available resources, the expected execution time of jobs, and other parameters, you can likely write an optimal algorithm.
Of course, in practice you don’t have an oracle view of the world. Jobs arrive and get canceled at random times. So it’s fruitless to optimize for all cases. Instead, developers typically implement a variant of a heuristic solution to the bin-packing problem, which gives a near-optimal result and allows some flexibility to deal with variations.
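As an illustration of the kind of bin-packing heuristic mentioned above, here is a minimal sketch of the classic first-fit decreasing strategy, which places the largest jobs first and puts each job on the first machine that still has room. This is a textbook heuristic, not the algorithm used by any Cloud Foundry scheduler; the function name, job names, and memory sizes are made up for the example.

```python
def first_fit_decreasing(jobs, capacity):
    """Pack jobs onto as few equal-sized machines as the heuristic finds.

    `jobs` maps a job name to its memory requirement, and `capacity` is
    the memory available on each machine. Returns one list of job names
    per machine that was used.
    """
    machines = []  # each entry: {"free": remaining memory, "jobs": [names]}
    # Largest jobs first; ties are broken by name so results are stable.
    ordered = sorted(jobs.items(), key=lambda item: (-item[1], item[0]))
    for name, size in ordered:
        if size > capacity:
            raise ValueError(f"job {name!r} does not fit on any machine")
        for machine in machines:
            if machine["free"] >= size:
                machine["free"] -= size
                machine["jobs"].append(name)
                break
        else:
            # No existing machine has room, so start a new one.
            machines.append({"free": capacity - size, "jobs": [name]})
    return [machine["jobs"] for machine in machines]


placement = first_fit_decreasing(
    {"a": 512, "b": 256, "c": 256, "d": 768, "e": 128}, capacity=1024
)
```

Here the five jobs end up on two machines. The heuristic does not guarantee the optimal packing, but it is fast and simple to re-run as jobs come and go, which is the trade-off described above.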
Matt Sykes and Michael Fraenkel, early Cloud Foundry and Diego engineers
Within its core, the Cloud Foundry runtime code included an implementation of this conceptual problem/solution space, known as the Droplet Execution Agent (DEA). However, that implementation allowed no flexibility. Furthermore, it was not clearly separated from the APIs that were exposed to other parts of the system.
And to make matters worse, like all parts of Cloud Foundry, the runtime core was written in the Ruby programming language. While Ruby is a great language for prototyping and for some system components (for example, APIs), it is sub-optimal and inefficient for other system components. At the same time, the Cloud Foundry engineering team at Pivotal had experience rewriting some other components (for example, the CLI and health monitoring) in Golang, with positive results.
The early success of the health monitoring rewrite (HM9000) and the increasing participation and growth of the Cloud Foundry open source software team (which now included engineers from IBM, SAP, and others) pushed the runtime team to contemplate rewriting the runtime code in Golang. Under the leadership of Onsi Fakhouri of Pivotal, with notable assistance from the initial team (Michael Fraenkel, Matt Sykes, Alex Suraci, Eric Malm, and others), a new runtime team, code-named Diego (or DEA-in-go), was created.
Julian Friedman, leader of the Garden and Eirini projects
While rewriting the core of the Cloud Foundry platform had many objectives, one of the major goals was to allow the platform to scale and to better meet engineering support and growth needs. The runtime had accumulated technical debt to the point where a rewrite made sense. Of course, as in any non-trivial software project, the Diego team worked hard to deliver, but it took a while for the resulting platform to scale to the levels that the previous runtime system had achieved over the years.
Adding to the complexity of trying to change the runtime code to a “modern scheduler” that used a novel auction-based approach, the team also needed to address the external world of containers, which had progressed. In particular, the popularity of Docker and Kubernetes gave rise to alternative emerging open source container runtime platforms. One positive outcome was the creation of the Open Container Initiative under the auspices of the Linux Foundation.
To align with the Open Container Initiative, a new rewrite of the container management layer (called Garden) was spun off and moved to London under the leadership of IBM’s Julian Friedman and Pivotal colleagues. Separating the container management and making it compatible with an emerging standard helped clarify and simplify the task that the Diego team faced. It also allowed Cloud Foundry to be better aligned with the rest of the industry. Soon enough, it became possible to run Docker images directly in Cloud Foundry.
As previously mentioned, while the Cloud Foundry runtime team was busy rewriting the core, the rest of the industry was not standing still. Projects like Docker, and in particular Kubernetes, gathered steam and became increasingly viable alternatives to Cloud Foundry’s own scheduler runtime code. Of course, speculation was rampant, and there were even efforts to tie the two together. For example, the IBM duo of Michael Fraenkel and Matt Sykes quickly created a prototype to show and validate the feasibility of using Kubernetes as the scheduler for Cloud Foundry.
However, the Diego team was just putting the finishing touches on their shiny new runtime code, so naturally, any change in direction this late in the game would be met with pushback. It was agreed not to pursue alternative runtime code but instead to get the Diego runtime system to scale to the levels where it would be on par with or better than the old Ruby runtime system. With effort and persistence, the Diego team was able to achieve these goals and even add more features, such as storage persistence, SSH support, and container networking.
Simon Moser (IBM Germany) and Julian Friedman (IBM UK), initiators of Project Eirini
Overall, by the end of 2016 and early 2017, the Diego project was hitting its stride, and various customers across the Cloud Foundry Foundation were deploying it and using it. It reached its peak excitement in 2017. By the end of that year, it was not a question of whether or not to switch to Diego, but when. However, the world had also evolved. Kubernetes was seen more as a raw container orchestration layer. It had matured, and it became a viable runtime system for microservices, under the leadership of Google, Microsoft, IBM, and others.
As vendors and users started adopting Kubernetes for their microservices needs, it became increasingly clear that a rapprochement, earlier noticed and prototyped by the IBM engineers, was going to be needed. With the release of the Fissile project, another member of the foundation, SUSE, had independently shown that the Cloud Foundry release could be containerized and run on top of Kubernetes. The question was now when the Cloud Foundry Foundation should adopt approaches to officially enable this rapprochement.
Jeff Hobbs and Vlad Iovanov of SUSE
Various efforts started, including attempts to treat Kubernetes as an IaaS and thus use the BOSH cloud provider interface (CPI) mechanism. While one of these efforts was successful, it was not native to either Kubernetes tooling or the way that the Kubernetes community manages its workloads. So it was time to revive the ability to substitute the runtime core. Under the leadership of IBM (Julian Friedman, with help from Simon Moser), as well as SAP (Bernd Krannich) and SUSE (Jeff Hobbs and Vlad Iovanov), another effort to create a Kubernetes runtime system for Cloud Foundry was started.
After the initial demos, proof points, and various Cloud Foundry Foundation-approved extensions to the platform through the CF-Extensions Project Management Committee, it became clear that these experimental projects showed promise and would become viable extensions and additions. So the Eirini project was born. Today the Eirini project is going strong, with targets to release a technology preview by the middle of 2019 and perhaps a full customer rollout by the end of the year.
Like many large successful open source projects, Cloud Foundry’s early days were rife with process and “political” issues. As Pivotal leaders took stewardship of Cloud Foundry in the early days, the strong agile process they used was a huge benefit but also a nuisance to some, which hindered the project’s adoption and growth. The Cloud Foundry Foundation was founded to alleviate some of the early issues and to enable more openness and flexibility for contributors.
Additionally, large code bases are by definition hard to comprehend, and they evolve in ways that make them even harder to grasp. As Cloud Foundry grew to what it is today, a simple organizational structure naturally emerged (after multiple attempts). That structure remains to this day with the following groupings: Runtime, BOSH, and Extensions. With each grouping of projects came clarity of purpose and governance.
Abby Kearns at Cloud Foundry Summit Boston in 2018
While most of the issues and evolution discussed here are easily recognizable in other open source projects, the dynamics and solutions are perhaps different. So what I want to do for the final part of this series is to summarize the current structure managing Cloud Foundry (runtime, BOSH, and extensions), to list some other miscellaneous components that IBM has heavily contributed to, and to draw attention to some lessons learned that both this community and other communities can use going forward.