Overview: History and context of our DevOps journey
"It is not the strongest of the species that survive, nor the most intelligent, but the one most responsive to change." (Charles Darwin)
When we talk about DevOps, what we are really talking about is how to enable change: how to be faster, cut loose ends, and deliver higher quality software. Change is everywhere, and we need to apply change not only to our products, but also to the way the products are built and delivered. This document doesn't aim to present new ways to change the world or introduce new exciting features. The scope of this document is to show how we do DevOps today in our organization, the product development of IBM Social Program Management.
This blog presents the results of a long journey that started some years ago, during which we learned from other companies and from books, experimented, and progressed incrementally. This document describes our journey and also gives an overview of our current pipeline. The different phases of the pipeline, together with the tools we use, are described in more detail in the following sections.
Our context: developing IBM Social Program Management
Before describing our journey, it's important to take a step back and understand the context in which this work has been done. IBM Social Program Management is a business and technology solution that delivers prebuilt social program components, business processes, toolsets, and interfaces on top of a dynamically configurable architecture. It helps health and social program organizations provide optimal outcomes for citizens in the face of increasing demand and the need to reduce costs.
IBM Social Program Management is a complex product developed over a span of more than 15 years. The following are just some numbers to show examples of its complexity:
- 1,400,000 legislative rules
- 1,900 database tables
- 18,000 screens
- Using more than 70 technologies
- Providing more than 20 non-functional capabilities
- Available in 13 languages
The product is used by over 50 customers in more than 18 countries. These deployments are not managed by IBM but directly by our customers or their contractors. This means that our DevOps infrastructure focuses on managing and resolving the problems typical of a software factory rather than those of a service provider:
- We need to support multiple versions of the product at the same time. At any point in time, we support up to four parallel release lines, and customers may be on subsequent Mod, Refresh Pack, Fix Pack, or iFix releases on those lines. However, we may choose to deliver fixes to those customers only at certain places on those lines, as defined by our release strategy. This means that our code base is heavily branched and fixes can be promoted to all the supported lines.
- Because IBM Social Program Management supports multiple third-party middleware products, we need to test each release against all the supported middleware combinations.
- IBM Social Program Management offers multiple installers for different application modules and components which can be combined to produce multiple configurations. These configurations, which are different offerings of the product, need to be tested for each release.
- Since we're not supporting any live customers, there is no real production environment. This means that our build farm is used purely for development and testing, and there is no need for a formal "Ops" function.
In this context, what's the purpose of our DevOps? We provide an automated pipeline from development to runtime testing to increase quality while decreasing development effort. As a DevOps team, our customers are our developers, teams, and release management. From this perspective, our production environment consists of all the automated systems used daily by our teams to develop, test, and deliver the best product possible.
Where we were
IBM Social Program Management is a complex product developed over a span of more than 15 years, and it has changed a lot in this period, both from a feature and an architectural perspective.
The journey we're describing started around 5-6 years ago. At that time, things were quite different from what they are today, as described in figure 1.1.
From an organizational point of view, we were organized in silos, with each silo managing one or more components of the product. These components were organized like an onion, with the internal components, such as our server and client infrastructure, in the center and other components, such as product features, in the outer layers. In all, 18 components make up IBM Social Program Management, so there were many layers. Each team worked concurrently on its own components. The versioning system used was IBM Rational ClearCase, while defect tracking was provided by IBM Rational ClearQuest, with the two tools integrated to link changes to single commit points. Once the changes were committed, they were used to run smoke tests and integration tests.
When component development reached a stable point, a team member would run a personal build to create a release zip, which was a pre-compiled version of the code saved as an archive. These builds were run on virtual machines owned and managed separately by each team as per their requirements. The zips created were then stored in a central shared storage location (a NAS).
The release zips were then used by the other teams whose components depended on them. This operation, called take-on, consisted of modifying some files in the "taker" (or dependent) component to specify which version of the dependency zip was needed, following a custom dependency management process. This entire process, from the innermost component to the final released product, could take up to 3 days (due to the build time and the changes needed by the take-on). Eventually, once all the components had taken on their dependent release zip files, all the release zips would have been used, allowing the product installers to be generated. This process, even though automated through scripts, was manually kicked off (orchestrated) and followed by a few cross-team people (at that time there was no DevOps team, and the DevOps operations were shared between multiple roles). Three installer types were created: Full, Delta, and Runtime installers. Installer creation took about one day. Once the installers were ready, they needed to be installed. Each team owned its own machines, used for testing, so they would need to create the virtual machines, install the middleware, configure it, and deploy the product. Alternatively, existing machines could be reused. This process could take up to two days.
At this point, with the machines up and running, the manual tests on runtime environments were started. These tests could take a number of days, depending on the complexity of the features to be tested.
Pain points and risks
As described above, this process was highly manual, time-consuming, and contained a number of pain points that impacted the organization.
Loss of Traceability While the code was put under version control, the artifacts created (such as the release zips and the installers) were simply stored in a folder structure on a filesystem, shared and backed up, but still a filesystem. Because creating these artifacts was a manual process, the relationship between the code baseline and the artifacts was lost. The artifacts were marked with an identifier, but, again because this was a manual process, there were no guarantees that this identifier matched a code baseline.
Take-on waiting time and late testing As shown in figure 1.2, the time for a commit to be applied and ready to test was considerable, up to a week for the innermost or lower-level components. This delay caused two main problems: higher-level components had to wait to actually get their dependencies, and testing was postponed, impacting release schedules. These delays could be exacerbated by defects found during testing, not only through the time needed to fix them and run more take-ons, but by encouraging creative, manual workarounds to the existing processes. These short-cuts promised speed, but being manual and ad hoc, they could add even more risk and scheduling pressure.
Lack of reliability The use of individually managed virtual machines within the teams to build the product and create the artifacts led to the typical situation captured by the infamous phrase "it works on my machine". With machines set up without a central, automated system, errors were difficult to investigate.
Time-consuming and error-prone activities All the manual steps involved in this process were long, repetitive, and time-consuming, and they required skilled specialists to waste time on low-value operations. More than this, the specialists themselves were a potential source of problems, as human errors unfortunately do happen.
Lack of consistency Even when everything was done correctly, there were no guarantees that the steps followed by the different teams were the same. Depending on the skills within each team, the configurations, especially those related to middleware or operating systems, could be substantially different, leading to misunderstandings and slightly different behaviors.
All these pain points eventually resulted in a series of risks that our organization had to identify and resolve:
- The possibility of missing deadlines due to delays in the process, leading to long stabilization phases and eventually fewer features being included.
- Last-minute bug discoveries, with the consequent need to release iFixes shortly after the main release.
- Inability to reproduce a problem due to the lack of consistency in our systems, or the time lost trying to reproduce errors that were found.
- Spending a lot of time on middleware setup and deployment instead of the actual development of new features.
- High levels of stress which eventually caused staff burnout.
As it's easy to imagine, we couldn't go on like this; something had to be done.
To work out what needed to change, we had to investigate what the industry was doing, by reading articles and books. We tried to understand the best practices that worked elsewhere. We identified the principles shown in figure 1.3.
There are five principles that we can define as pillars, and two additional principles that cut across all of the others. In the following sections we summarize these principles and how we apply them.
Traceability Everything should be linked by a single thread, so we can follow it and identify relationships. For us, this single thread is the installer build identifier. When we create a set of installers, the set is marked with a unique build ID (usually containing the version of the product and the build number). This identifier is used to mark everything:
- All the component baselines that were used by the installer build, so that for each build we can identify what has changed, down to the single commit.
- All the artifacts (release zips) used by the different installers.
- All the deployment executions and eventually the runtime environments resulting from these deployments.
- The test results produced from the runtime environments mentioned in the previous point.
All these elements permit us to answer the question "what changed between two different builds of the product?" The answer to this question is essential to help developers and testers investigate defects, have a clearer picture of the evolution of the product, and know exactly what needs to be tested. All of this, eventually, moves our processes toward a proper incremental delivery.
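To make the idea concrete, the sketch below shows how a build identifier could be attached as a searchable property to artifacts in an Artifactory repository. The repository name, artifact paths, build ID format, and credentials are all hypothetical; this is an illustration of the principle, not our actual build scripts.

```python
# Illustrative sketch only: attach the installer build identifier to release
# artifacts in Artifactory so each artifact can be traced back to its build.
# Repository name, artifact paths, build ID format and credentials are made up.
import requests

ARTIFACTORY = "https://artifactory.example.com/artifactory"
REPO = "spm-release-zips"           # hypothetical repository name
BUILD_ID = "7.0.2.0-B2041"          # product version plus build number (example format)

def tag_artifact(session: requests.Session, path: str, build_id: str) -> None:
    """Set a 'build.id' property on one artifact via Artifactory's properties API."""
    url = f"{ARTIFACTORY}/api/storage/{REPO}/{path}"
    response = session.put(url, params={"properties": f"build.id={build_id}"})
    response.raise_for_status()

if __name__ == "__main__":
    with requests.Session() as session:
        session.auth = ("build-user", "api-key")    # placeholder credentials
        for artifact in ("server/release.zip", "client/release.zip"):
            tag_artifact(session, artifact, BUILD_ID)
```

When every artifact, deployment, and test result carries the same identifier, answering "what changed between two builds?" becomes a simple query rather than an investigation.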
Repeatability If something is repeatable, it can be recreated whenever something goes wrong or in case of doubt. We want our entire pipeline to be recreatable if needed: installers, servers, tests, deployments. Everything should be able to be recreated as it was before. To achieve this, we needed to think about the pipeline in a new way: it's no longer just a tool used to set up the product, but a product itself. And, as a product, it needs to be versioned, automated, and tested. Build scripts, pipelines, and configuration files are managed as software and organized in reusable libraries.
Responsibility People can be a bottleneck just as much as technology, and we think that bottlenecks are something to be avoided as they increase the risk of problems. With this in mind, we changed our organization and our mindset, moving from a siloed organization to multifunctional teams with shared ownership of the product components. Of course, this change introduced problems caused by having multiple projects working on the same code. However, we managed these issues with some good habits: peer code review for each feature change, done before any integration; automated tests; and the concept of "fail fast and fix it", which means that if something goes wrong and the pipeline breaks, it must be fixed before doing anything else. Everyone is responsible for their changes until they reach the runtime environment and pass the tests. We don't rely on heroes, but we do appreciate them.
Consistency We hate erratic errors. They don't actually tell us anything, even when the error is real. To avoid them, we need to be consistent, so we can be sure whether a result is really green or red. To achieve this consistency, we rely heavily on automation and avoid any manual activities, especially when it comes to our build, deploy, and environment setup.
We want to have a single process for every environment (local or remote), and we want these environments to be production-like. Therefore we keep them closed and automatically configured, with no manual intervention whatsoever. Logs are exposed via HTTP or Jenkins™ so no one has to actually log in to the machines to access them.
On Demand As a DevOps team, we want to spend our time on valuable activities and avoid repetition. We want to focus on creating new automation or providing new features for our customers, the development teams. So we created self-service tools that developers use autonomously to get the things they need done, be they information or activities. Also, when it comes to environments, we want them to be destroyed as soon as their usage is concluded, so we don't have to maintain them and we make better use of our resources.
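As a minimal sketch of what such a self-service tool can look like, the snippet below triggers a hypothetical parameterized Jenkins job that provisions a short-lived test environment for a given build. The job name, parameter names, and credentials are invented for illustration.

```python
# Minimal sketch of a self-service trigger: ask Jenkins to provision a
# short-lived test environment for a given installer build. The job name,
# parameter names and credentials are hypothetical.
import requests

JENKINS = "https://jenkins.example.com"
JOB = "provision-test-environment"      # hypothetical parameterized Jenkins job

def request_environment(build_id: str, middleware: str) -> None:
    """Queue a build of the provisioning job with the requested parameters."""
    url = f"{JENKINS}/job/{JOB}/buildWithParameters"
    response = requests.post(
        url,
        auth=("developer", "api-token"),            # placeholder credentials
        params={"BUILD_ID": build_id, "MIDDLEWARE": middleware},
    )
    response.raise_for_status()

if __name__ == "__main__":
    # A developer asks for an environment built from last night's installers.
    request_environment("7.0.2.0-B2041", "was-db2")
```

Because the environment is created on demand and torn down when the work is done, nobody has to maintain a pool of long-lived machines.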
Continuous Improvement Nothing is perfect and everything is perfectible. This is true for us too. We had some problems and we solved them, but there is always something new to implement, new technologies to try, or old problems to overcome. To do this, we need to stay detached from our systems. In our DevOps pipeline, tools are not irreplaceable, and we're not afraid to introduce big changes if we see value in them. We embrace the "try-inspect-adapt" cycle, and we continually do spikes and demos on new tools and technologies. After all, that's why we freed up our time with automation. And just to avoid becoming complacent, we try our best to implement "Refactoring Friday": a good way to keep our code as clean and maintainable as possible.
Agile As mentioned before, this journey wouldn't have been possible without being part of a larger picture. Our entire organization is transitioning to agile principles. Similar to Spotify, we are changing our organization to use the "Squad/Tribe/Guild" structure, and we're using Scrum, Kanban, and the Design Thinking framework. It's mainly a mindset change, which requires effort and is not always easy, but it is showing great value.
Where we are
The application of the principles described in the previous section has brought us to our current situation which is described by figure 1.4.
Our versioning and tracking tool is now Rational Team Concert, which manages the source code and the development workflows, and triggers both smoke tests and integration builds. While the build scripts are under source control as part of the product, their orchestration is managed by Jenkins (a commonly used continuous integration tool), which runs the scripts in a controlled environment composed of Linux™ and Windows™ virtual machines provisioned by Chef™. The release zips are now stored in a remote Artifactory™ repository. During the build, static code analysis is executed on all code and the results are stored in SonarQube™, where developers can access them.
Installer creation runs nightly, so the entire build farm is refreshed before the development team arrives in the office the following morning. Because of the improvements introduced during our DevOps journey, we have been able to expand our deliverables: we now also create globalization installers, source code archives (which are then analyzed by AppScan™ for security issues), and Docker images. All these artifacts are stored in Artifactory and marked with a single build identifier.
After installer creation is completed, a deployment is kicked off on a subset of the machines that compose the compatibility matrix we support. These machines and their deployments act as our "canary in a coal mine", since they give us quick feedback on the state of the latest commits. The machines are provisioned through Chef and contain the different middleware combinations that we support. If the deployment on these machines succeeds and a small set of automated tests passes, the deployment to the rest of the farm is kicked off. Eventually, when all the deployments are complete, the full automated test suites are run to catch regressions. These machines can then be used by teams for additional testing.
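The gating logic can be summarized by the following simplified sketch; the helper functions are placeholders standing in for the Jenkins jobs that actually perform the deployments and run the automated tests, and the machine names are illustrative.

```python
# Simplified sketch of the canary-style rollout described above: deploy to a
# small subset of the farm first, run a short automated test suite, and only
# refresh the rest of the farm if everything passes. The two helper functions
# are placeholders for the real Jenkins-driven deploy and test jobs.
from typing import Iterable

def deploy(machine: str, build_id: str) -> bool:
    """Deploy the given installer build to one machine (placeholder)."""
    print(f"Deploying {build_id} to {machine}")
    return True

def smoke_tests_pass(machine: str) -> bool:
    """Run a small automated test suite against one machine (placeholder)."""
    print(f"Running smoke tests on {machine}")
    return True

def rollout(canaries: Iterable[str], rest_of_farm: Iterable[str], build_id: str) -> bool:
    # Stage 1: the "canary" machines give quick feedback on the latest commits.
    for machine in canaries:
        if not (deploy(machine, build_id) and smoke_tests_pass(machine)):
            return False    # fail fast: the pipeline must be fixed before anything else
    # Stage 2: only now is the rest of the farm refreshed and fully tested.
    for machine in rest_of_farm:
        deploy(machine, build_id)
    return True

if __name__ == "__main__":
    rollout(["canary-was-db2"], ["farm-01", "farm-02"], "7.0.2.0-B2041")
```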
On all the machines in the farm, a Contrast Security agent is enabled to track runtime security issues identified while testing.
The Docker™ images created and stored in Artifactory are used by teams to perform functional tests and reproduce bugs. This greatly speeds up testing, since none of the manual, error-prone third-party software installations need to be performed, and thanks to the common build identifier every test can be clearly associated with its source.
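For illustration, reproducing a bug against a specific nightly build can then be as simple as pulling and starting the corresponding image. The registry path, image name, tag format, and port mapping below are hypothetical, and the Docker SDK for Python is assumed to be installed with access to the registry.

```python
# Illustration only: start the product image for one specific nightly build so
# a bug can be reproduced against exactly that code baseline. Registry path,
# image name, tag format and port mapping are hypothetical.
import docker

client = docker.from_env()
build_id = "7.0.2.0-B2041"                              # the common build identifier
repository = "artifactory.example.com/spm-docker/spm"   # hypothetical registry path

client.images.pull(repository, tag=build_id)
container = client.containers.run(
    f"{repository}:{build_id}",
    detach=True,
    ports={"9443/tcp": 9443},                           # example port mapping
)
print(f"Build {build_id} is running in container {container.short_id}")
```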
All the automation described in this process is versioned on GitHub and managed by our DevOps team, which is composed of about six people.
If we look back on the journey we took, we can see how much we improved. While some of these improvements are intangible, others are quite measurable. Some of them are the following:
- The time frame from the commit of a code change to when that change is testable in a real, consistent environment has decreased from up to 6 days to less than one day.
- The time for installing the middleware in a new VM has decreased from up to a day of manual work to less than 1.5 hours (all automated and triggered by a single click).
- Deployment time has decreased from up to a day to less than 3 hours for a brand new VM (a 96% decrease in human effort and 87% in waiting time), or 15 minutes when using Docker images (a 98% decrease in human effort and 96% in waiting time).
- Investigation time to determine whether a change was included in a deployment, which previously used to take hours, is now minimized through reports accessible via an internal web app called Fluxy.
Some of the intangible improvements include:
- The usage of short-lived VMs (created and destroyed just after usage) and containers permits a virtually unlimited number of environments, where previously the number was limited by resource and maintenance costs.
- Middleware installation and deployment processes are now fully automated, leaving developers to focus on actual development.
- Installers, Docker images, and other artifacts are now available for each nightly build, whereas previously only the last two days' builds were kept. This means that at any time it is possible to go back and check a specific build.
- Due to these improvements, late working hours are now real exceptions, as are working weekends, decreasing the risk of burnout.
Clearly, our work is not finished, and there are always improvements that we are working on or planning to adopt in the future. But the journey so far has clearly proven its benefits, and we believe that sticking to our principles and keeping up with new technologies is the best way for us to continue on this path.
 Fixes on previous lines are promoted to the latest one, while the opposite does not apply.
Integration tests, in this context, always relate to a single component, not the entire product. They are meant to test the integration of the parallel development.
 Depending on the content, each component could create one or multiple release zips.